This project focuses on extracting, analyzing, and visualizing key financial insights from 10-K filings of public firms using a large language model (LLM). The insights are presented through a web interface built with Streamlit, which provides a dynamic way to explore financial trends and data-driven insights.
-
Data Extraction:
- Python library SEC-Edgar-Download is used to extract SEC filings.
- The filings of any firms can be downloaded using the script located in
src/data/raw_data/download_sec_filings.py
Downloader.log
containes information about the missing as well as the downloaded files.
-
Data Pre-processing
Pre-processing steps takes in 3 parts :
- First the raw html,css contained is removed from downloaded filings and stored in
src\data\pre_processed_data\cleaned-sec-edgar-filings
- Then lemmetization and Stemming is performed on the cleaned data and only 5 words before and after the numerical feautres were selected and stored in
src\data\pre_processed_data\processed-numeric-contexts
- Finally using all feautres that are considered finacially insightful(mentioned below) are extracted using regex expression and finally stored in
src\data\pre_processed_data\feature
- First the raw html,css contained is removed from downloaded filings and stored in
-
Text Analysis:
- Each of these features files are sent to LLM (Mixtral-7b-Instruct) via OpenRouterAPI and saved in
src/output/output-responses
- Using a python script these annual files for each firm is combined in text format which is stored in
src/output/pre-analysis_combined
. - This combined text is again sent to LLM (Mixtral-7b-Instruct) via OpenRouterAPI to obtain two types of files : text_insights, csv_insights.
(These files are used to display content on web server) located in
src\analysis
- Each of these features files are sent to LLM (Mixtral-7b-Instruct) via OpenRouterAPI and saved in
-
Data Visualization: The csv file format obtained from LLM is in txt format due to API limitation , a simple python script
(src/analysis/csv/txt_to_csv.py)
converts the files into csv files
Web Interface: A Streamlit application has been created and deployed. src\app
contains all the elemnts to run streamlit app.
For hosting purposes, a separate repo has been created for streamlit but it contains the same files as located in src/app
.
- Revenue: Represents the total income generated by a business.
- Expenses: Refers to the costs incurred by a business in its operations.
- Net Income: Calculated as revenue minus expenses, indicating the profitability of a company.
- Assets: Include all resources owned by a company that have economic value.
- Liabilities: Represent the company's debts or obligations.
- Equity: Reflects the ownership interest in a company's assets after deducting liabilities.
- Cash Flow: Shows the movement of cash in and out of a business.
- Operating Margin: Indicates the profitability of a company's core business activities.
- Gross Margin: Represents the percentage of revenue that exceeds the cost of goods sold.
- EBITDA: Stands for Earnings Before Interest, Taxes, Depreciation, and Amortization.
- Financial Ratios: Include metrics like debt-to-equity ratio and return on equity used to assess a company's financial health.
- Earnings Per Share: Calculated as net income divided by the number of outstanding shares.
- Tax Rate: Refers to the percentage of income that a company pays in taxes.
- Debt: Represents borrowed funds that a company must repay.
- Investment Gains/Losses: Reflect the profits or losses from investment activities.
- Hedging Activities: Strategies used to reduce risks associated with price fluctuations.
- Derivative Instruments: Financial contracts whose value is derived from an underlying asset.
- Common Stock: Represents ownership in a company and typically carries voting rights.
- Subsequent Events: Events occurring after the end of a reporting period that may impact financial statements.
- Fair Value Measurements: Refers to the estimated value of an asset or liability based on market conditions.
- Geographic Concentration Risk: Risk associated with a company's heavy reliance on a particular geographic region.
FinTech-Lab-Summer-2024/ │ ├── .github/workflows # CI/CD pipelines for automated testing and deployment. │ └── python-ci.yml │ ├── docs # Documentation related to the project. │ ├── src # Source code for the project. │ ├── analysis # Scripts for data analysis. │ │ ├── csv # CSV files with analyzed financial data. │ │ ├── text-summaries # Textual summaries extracted from 10-K filings. │ │ └── txt_to_csv.py # Script to convert text data to CSV. │ │ │ ├── app # Streamlit application. │ │ └── streamlit_app.py # Main application script. │ │ │ ├── scripts # Utility scripts for data processing. │ │ ├── data_extraction.py # Script for downloading SEC filings. │ │ ├── feature_extraction.py # Script for feature extraction from text. │ │ └── lemmitization.py # Script for text normalization. │ │ │ └── data # Data used or generated by the scripts. │ ├── pre-processed_data # Preprocessed datasets. │ └── pre-processing_scripts # Scripts that preprocess data. │ ├── tests # Automated tests for the project. │ └── test_analysis.py # Test cases for data analysis scripts. │ ├── requirements.txt # Project dependencies. └── README.md # Project overview and setup instructions.
-Python has been used throughout the project. -Streamlit has been used to host the project. -Requirements.txt mentions the used libraries.
Since the project realibility heavily depends on LLM inference model,there is always scope for hallucinations and wrong inferences, thus it is recommended to always verify data from secondary sources. Note: The project is still under development and currently obtained graphs and contents are unreliable.
- Upgrading to a better model might help with inference.
Contributions are welcome! Please fork the repository and open a pull request with your features or fixes.
This project is licensed under the terms of the MIT license.