Skip to content

Commit

Permalink
Merge branch 'main' into fix/make-convert-token-to-string-pickleable
Browse files Browse the repository at this point in the history
  • Loading branch information
saattrupdan authored Nov 7, 2024
2 parents 75ef6b8 + 5f39ded commit 93b13a5
Show file tree
Hide file tree
Showing 3 changed files with 279 additions and 0 deletions.
278 changes: 278 additions & 0 deletions docs/cookbook/earnings-reports.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
# Extracting financial data from earnings reports

A common task in finance is to extract financial data from earnings reports. Earnings reports are infamously poorly formatted, as the SEC does not have requirements for producing machine-readable documents.

Earnings reports are often provided as HTML documents, which can be difficult to parse. Investors often use complicated parsing systems or manual review to extract data. Entire companies are built around automating this task.

This cookbook is a proof of concept about how we can use LLMs to extract financial data directly into CSV. Comma-separated values are well-structured and can be defined by a regular expression, which Outlines can use to guide the LLM's output.

The example is a smaller subset of a full demo found [here](https://github.com/dottxt-ai/demos/tree/main/earnings-reports). The demo contains the full set of pre-processing steps needed to convert raw HTML into a structured CSV file, and tests the results across three company's 10k reports.

## Setup

Install outlines and required dependencies:

```bash
# Later versions of torch can have difficulty with certain CUDA drivers.
# We recommend using 2.4.0 for now, but you may wish to experiment with
# other versions.
pip install outlines pandas transformers torch==2.4.0 accelerate
```

## Load the model

Choose your language model. We'll use Phi-3 mini, which is small enough to run on reasonably small machines.

```python
import outlines
import torch

model_name = 'microsoft/Phi-3-mini-4k-instruct'
model = outlines.models.transformers(
model_name,
device='auto',
model_kwargs={
# To reduce memory usage, we'll use bfloat16
"torch_dtype": torch.bfloat16,
},
)
```

## Set up the data

For brevity, we've attached the markdown version of Nvidia's 10k report. The [full demonstration](https://github.com/dottxt-ai/demos/tree/main/earnings-reports) processes the raw HTML version of the report to these markdown tables. Pages are filtered by whether they seem to contain income statements, and then compacted into the string you see below.

```python
income_statement = """
Table of ContentsNVIDIA Corporation and SubsidiariesConsolidated Statements of Income(In millions, except per share data)
| | | | | | | | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | Year Ended | | | | | | | | | | | | | | |
| | | | Jan 28, 2024 | | | | | | Jan 29, 2023 | | | | | | Jan 30, 2022 | | |
| Revenue | | | $ | 60,922 | | | | | $ | 26,974 | | | | | $ | 26,914 | |
| Cost of revenue | | | 16,621 | | | | | | 11,618 | | | | | | 9,439 | | |
| Gross profit | | | 44,301 | | | | | | 15,356 | | | | | | 17,475 | | |
| Operating expenses | | | | | | | | | | | | | | | | | |
| Research and development | | | 8,675 | | | | | | 7,339 | | | | | | 5,268 | | |
| Sales, general and administrative | | | 2,654 | | | | | | 2,440 | | | | | | 2,166 | | |
| Acquisition termination cost | | | — | | | | | | 1,353 | | | | | | — | | |
| Total operating expenses | | | 11,329 | | | | | | 11,132 | | | | | | 7,434 | | |
| Operating income | | | 32,972 | | | | | | 4,224 | | | | | | 10,041 | | |
| Interest income | | | 866 | | | | | | 267 | | | | | | 29 | | |
| Interest expense | | | (257) | | | | | | (262) | | | | | | (236) | | |
| Other, net | | | 237 | | | | | | (48) | | | | | | 107 | | |
| Other income (expense), net | | | 846 | | | | | | (43) | | | | | | (100) | | |
| Income before income tax | | | 33,818 | | | | | | 4,181 | | | | | | 9,941 | | |
| Income tax expense (benefit) | | | 4,058 | | | | | | (187) | | | | | | 189 | | |
| Net income | | | $ | 29,760 | | | | | $ | 4,368 | | | | | $ | 9,752 | |
| | | | | | | | | | | | | | | | | | |
| Net income per share: | | | | | | | | | | | | | | | | | |
| Basic | | | $ | 12\.05 | | | | | $ | 1\.76 | | | | | $ | 3\.91 | |
| Diluted | | | $ | 11\.93 | | | | | $ | 1\.74 | | | | | $ | 3\.85 | |
| | | | | | | | | | | | | | | | | | |
| Weighted average shares used in per share computation: | | | | | | | | | | | | | | | | | |
| Basic | | | 2,469 | | | | | | 2,487 | | | | | | 2,496 | | |
| Diluted | | | 2,494 | | | | | | 2,507 | | | | | | 2,535 | | |
"""
```

The markdown tables extracted from the earnings reports can vary widely in row names, column counts, data types, etc. The advantage of LLMs here is that we can define the data we want in terms of the data types, and the LLM will output the data in the desired format.

For comparison, here is how the income statement looks in the original HTML:

![Nvidia income statement](./images/nvidia-income.png)

## Define the data we want

Outlines is often used for JSON output, but it can also be used for CSV. We know the columns we want to extract, and we know the data types of the columns. Year for example is always a four-digit number, revenue is a number with commas, and so on.

We can define a regex pattern for each column type:

```python
# Define the column type regex patterns
column_types = {
# Year is always a four-digit number
"year": r"\d{4}",

# Revenue, operating income, and net income are always numbers with commas.
# This regex permits integers that may begin with a minus sign, and may have
# commas separating the thousands, millions, etc.
"integer_comma": r"((-?\d+),?\d+|(-?\d+))",
# Number is currently not used, but it represents a number with up to two decimal places.
"number": r"(-?\d+(?:\.\d{1,2})?)",
}
```

Next, let's choose the columns we want to extract. We want

- Year, always a four-digit number
- Revenue, a number with commas
- Operating income, a number with commas
- Net income, a number with commas

```python
# Define the columns to extract, and their data types.
columns_to_extract = {
"year": "year",
"revenue": "integer_comma",
"operating_income": "integer_comma",
"net_income": "integer_comma",
}
```

You can modify `column_type_regex` to match the data types of the columns you want to extract. Adding a new financial metric to extract is as simple as adding a new key/value pair to `columns_to_extract`:

```python
columns_to_extract["diluted_earnings_per_share"] = "number"
```

Additional columns are not well tested for accuracy, so use with caution.

## Create the regex describing the data we want


```python
# Create the header line. This is the requested column names
# separated by commas, i.e. "year,revenue,..."
header = ",".join(columns_to_extract.keys())

# Create the data capture patterns. These are the regex patterns
# that will be used to capture the data in each column
data_patterns = [column_types[dtype] for dtype in columns_to_extract.values()]
data_line = ",".join(data_patterns)

# Our final regex pattern.
max_rows = 3 # We expect 3 rows of data, firms usually report 3 years of income statements
csv_regex = f"{header}(\n{data_line}){{,{max_rows}}}\n\n"

print(csv_regex)
```

which gives us

```
year,revenue,operating_income,net_income,basic_earnings_per_share(
\d{4},((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+)),(-?\d+(?:\.\d{1,2})?)){,3}
```

Pretty hairy, right? Thankfully, we have a simple function to construct this regex for you. The regex defines a header line, followed by a data line that repeats for each row of data we want to extract. Passing the regex to `outlines.generate.regex` will produce a function that will __always__ produce a CSV string that is consistent with the regex.

## Prompting the model

Outlines does not add system or instruction tokens by default, so we need to use `transformers.AutoTokenizer` to add them for whatever model we're using.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def add_instruction(prompt):
return tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)

print(add_instruction("Howdy"))
```
```
<|user|>
Howdy<|end|>
<|assistant|>
```

Our prompt roughly describes the task we want the model to perform, and a few pieces of information it may need to know about income statements.

```python
def extract_financial_data_prompt(columns_to_extract, income_statement):
user_prompt = f"""
Extract annual financial data from this set of pages. Pages
are from a 10k filing and were chosen because they may contain
a comprehensive income statement. Note that selected pages may
be incorrectly extracted, so you should verify that you are extracting
from the comprehensive income statement and not some other financial
statement.
Create a row for each year available in the income statement with the
following columns: {', '.join(columns_to_extract.keys())}. Firms typically report the
most recent 3 years of data, but this can vary.
Each column has types: {', '.join(columns_to_extract.values())}.
# Relevant pages:
{income_statement}
# Key instructions:
1. Look ONLY at the "Consolidated Statements of Income" table
2. For operating income, look for "Income from operations" or "Operating income"
3. For net income, use the TOTAL net income figure, not amounts allocated to specific share classes
4. Use NULL for missing values
5. Operating income must be less than revenue
6. Net income must be less than operating income
7. Ignore segment breakdowns, quarterly data, or per-share amounts
# Output format:
- CSV format with headers: {','.join(columns_to_extract.keys())}
- Use NULL for missing values
- If no data are found, do not create a row.
- Enter two newline characters to terminate the CSV when no more data are found.
# Definitions:
- Revenue: Total sales of goods and services. Usually this is at the top of the
income statement.
- Operating income: Revenue minus operating expenses for the entire company. This is revenue
minus costs. Operating income is also called operating profit, EBIT, or income from
operations.
- Net income: Operating income minus taxes. This is the bottom line of the
income statement.
"""

return add_instruction(user_prompt)
```

## Running the model

Now that we have our prompt and regular expression, we can run the model.

Construct our regex extractor function. We'll use a greedy sampler, which samples the most likely next token at each step. It's a simple sampler that is more reproducible than multinomial sampling.

```python
csv_extractor = outlines.generate.regex(
model, csv_regex, sampler=outlines.samplers.greedy()
)
```

Provide the prompt to the model and run it:

```python
csv_data = csv_extractor(
extract_financial_data_prompt(columns_to_extract, income_statement),
max_tokens=1024,
)

print(csv_data)
```
```
year,revenue,operating_income,net_income
2024,60922,32972,29760
2023,26974,4224,4368
2022,26914,10041,9752
```

Voila! We've extracted the financial data from the income statement, and it's correct upon inspection.

You can even load this into a `pandas` DataFrame for further analysis:

```python
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(csv_data))
print(df)
```
```
year revenue operating_income net_income
0 2024 60922 32972 29760
1 2023 26974 4224 4368
2 2022 26914 10041 9752
```
Binary file added docs/cookbook/images/nvidia-income.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@ nav:
- Chain of Thought (CoT): cookbook/chain_of_thought.md
- ReAct Agent: cookbook/react_agent.md
- Vision-Language Models: cookbook/atomic_caption.md
- Earnings reports to CSV: cookbook/earnings-reports.md
- Run on the cloud:
- BentoML: cookbook/deploy-using-bentoml.md
- Cerebrium: cookbook/deploy-using-cerebrium.md
Expand Down

0 comments on commit 93b13a5

Please sign in to comment.