1 parent cc5cd70, commit 89edfe1
Showing 1 changed file with 109 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,23 +1,118 @@ | ||
# Data Insights from SEC-10K filings | ||
# FinTech-Lab-Summer-2024 | ||
|
||
### Insights from SEC-10K filings using LLM | ||
## Project Overview | ||
This project focuses on extracting, analyzing, and visualizing key financial insights from 10-K filings of public firms using a large language model (LLM). The insights are presented through a web interface built with Streamlit, which provides a dynamic way to explore financial trends and data-driven insights. | ||
|
||
## Features | ||
- **Data Extraction**: | ||
1. The Python library `sec-edgar-downloader` is used to download SEC filings. | ||
2. Filings for any firm can be downloaded using the script located at | ||
`src/data/raw_data/download_sec_filings.py` | ||
3. `Downloader.log` contains information about both the downloaded files and the missing ones. | ||
|
||
- **Data Pre-processing** | ||
|
||
Pre-processing takes place in three parts: | ||
1. First, the raw HTML and CSS are removed from the downloaded filings, and the cleaned text is stored in `src\data\pre_processed_data\cleaned-sec-edgar-filings` | ||
2. Next, lemmatization and stemming are performed on the cleaned data; only the 5 words before and after each numerical feature are kept and stored in | ||
`src\data\pre_processed_data\processed-numeric-contexts` | ||
3. Finally, the features considered financially insightful (*listed below*) are extracted using regular expressions and stored in | ||
`src\data\pre_processed_data\feature` | ||
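The three pre-processing parts above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual scripts: the function names (`strip_markup`, `numeric_contexts`, `select_feature`) and the number pattern are hypothetical, and the lemmatization and stemming step is omitted because it would need an NLP library.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_markup(raw_html: str) -> str:
    """Part 1 (hypothetical helper): drop HTML tags and CSS, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Assumed notion of a "numerical feature": digits with optional $, commas,
# decimals, and a trailing percent sign.
NUMBER = re.compile(r"^\$?\d[\d,]*(\.\d+)?%?$")


def numeric_contexts(text: str, window: int = 5) -> list[str]:
    """Part 2 (hypothetical helper): keep `window` words around each number."""
    tokens = text.split()
    contexts = []
    for i, tok in enumerate(tokens):
        if NUMBER.match(tok.strip("().,;:")):
            start = max(0, i - window)
            contexts.append(" ".join(tokens[start:i + window + 1]))
    return contexts


def select_feature(contexts: list[str], term: str) -> list[str]:
    """Part 3 (hypothetical helper): keep contexts that mention a given term."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [c for c in contexts if pattern.search(c)]
```

For instance, `numeric_contexts(strip_markup(html), window=5)` would yield one short snippet per number found, and `select_feature(snippets, "revenue")` would narrow those to snippets that mention revenue.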
|
||
## Installation | ||
```bash | ||
git clone https://github.com/siddharth7113/Fintech-Lab-Summer-2024 | ||
cd Fintech-Lab-Summer-2024 | ||
pip install -r requirements.txt | ||
- **Text Analysis**: | ||
|
||
1. Each of these feature files is sent to an LLM (**Mixtral-7b-Instruct**) via the OpenRouter API, and the responses are saved in `src/output/output-responses` | ||
2. A Python script combines the annual files for each firm into a single text file, stored in `src/output/pre-analysis_combined`. | ||
3. This combined text is sent to the LLM (**Mixtral-7b-Instruct**) again via the OpenRouter API to obtain two types of files: `text_insights` and `csv_insights`. | ||
(These files are used to ) | ||
|
||
``` | ||
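The combining step described under Text Analysis (merging each firm's annual response files into one text file) might look something like the sketch below. It assumes a layout of one sub-directory per firm containing one `.txt` file per year; that layout, the function name `combine_firm_responses`, and the `_combined.txt` suffix are illustrative assumptions, not taken from the repository.

```python
from pathlib import Path


def combine_firm_responses(responses_dir, combined_dir):
    """Merge each firm's per-year text files into a single combined file.

    Assumed (hypothetical) layout: responses_dir/<FIRM>/<YEAR>.txt
    Returns the list of combined files that were written.
    """
    responses_dir = Path(responses_dir)
    combined_dir = Path(combined_dir)
    combined_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for firm_dir in sorted(p for p in responses_dir.iterdir() if p.is_dir()):
        parts = []
        # Sorting by file name keeps the years in chronological order
        # when files are named by year.
        for year_file in sorted(firm_dir.glob("*.txt")):
            body = year_file.read_text(encoding="utf-8").strip()
            parts.append(f"===== {year_file.stem} =====\n{body}\n")
        if parts:
            out_path = combined_dir / f"{firm_dir.name}_combined.txt"
            out_path.write_text("\n".join(parts), encoding="utf-8")
            written.append(out_path)
    return written
```

Labelling each section with its year keeps the combined text traceable to individual filings when it is passed back to the LLM.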
## Usage | ||
``` | ||
python main.py | ||
``` | ||
## Contributing | ||
|
||
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. | ||
- **Data Visualization**: Interactive charts and graphs to display financial metrics and trends. | ||
- **Web Interface**: A Streamlit application that allows users to select different firms and view corresponding financial insights and visualizations. | ||
|
||
--- | ||
|
||
### Important terms and their relevance | ||
|
||
### Financial Statements and Performance Metrics | ||
- **Revenue**: Represents the total income generated by a business. | ||
- **Expenses**: Refers to the costs incurred by a business in its operations. | ||
- **Net Income**: Calculated as revenue minus expenses, indicating the profitability of a company. | ||
- **Assets**: Include all resources owned by a company that have economic value. | ||
- **Liabilities**: Represent the company's debts or obligations. | ||
- **Equity**: Reflects the ownership interest in a company's assets after deducting liabilities. | ||
- **Cash Flow**: Shows the movement of cash in and out of a business. | ||
- **Operating Margin**: Operating income as a percentage of revenue, indicating the profitability of a company's core business activities. | ||
- **Gross Margin**: Represents the percentage of revenue that exceeds the cost of goods sold. | ||
- **EBITDA**: Stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. | ||
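As a quick illustration of how several of the metrics above relate to one another, here is a small sketch with made-up figures. The function names are hypothetical and the formulas are the common textbook definitions; actual 10-K line items may be labelled or grouped differently.

```python
def net_income(revenue: float, expenses: float) -> float:
    """Revenue minus total expenses."""
    return revenue - expenses


def gross_margin(revenue: float, cost_of_goods_sold: float) -> float:
    """Share of revenue left after the cost of goods sold."""
    return (revenue - cost_of_goods_sold) / revenue


def operating_margin(operating_income: float, revenue: float) -> float:
    """Operating income as a share of revenue."""
    return operating_income / revenue


def ebitda(net_income_value: float, interest: float, taxes: float,
           depreciation: float, amortization: float) -> float:
    """Net income with interest, taxes, depreciation and amortization added back."""
    return net_income_value + interest + taxes + depreciation + amortization


# Illustrative numbers only (in millions), not taken from any filing.
revenue = 1000.0
cogs = 600.0
operating_income = 150.0
interest, taxes = 20.0, 30.0
profit = net_income(revenue, expenses=900.0)

print(f"Net income:       {profit:.1f}")
print(f"Gross margin:     {gross_margin(revenue, cogs):.1%}")
print(f"Operating margin: {operating_margin(operating_income, revenue):.1%}")
print(f"EBITDA:           {ebitda(profit, interest, taxes, 40.0, 10.0):.1f}")
```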
|
||
### Financial Analysis and Reporting | ||
- **Financial Ratios**: Include metrics like debt-to-equity ratio and return on equity used to assess a company's financial health. | ||
- **Earnings Per Share**: Calculated as net income divided by the number of outstanding shares. | ||
- **Tax Rate**: Refers to the percentage of income that a company pays in taxes. | ||
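The ratios in this group can be computed from a handful of statement values. The sketch below uses one common convention for each (for example, debt-to-equity as total liabilities over shareholders' equity; some analysts use total debt instead), with hypothetical function names and made-up inputs.

```python
def debt_to_equity(total_liabilities: float, shareholders_equity: float) -> float:
    """One common convention: total liabilities / shareholders' equity."""
    return total_liabilities / shareholders_equity


def return_on_equity(net_income: float, shareholders_equity: float) -> float:
    """Net income as a share of shareholders' equity."""
    return net_income / shareholders_equity


def earnings_per_share(net_income: float, shares_outstanding: float) -> float:
    """Basic EPS, ignoring preferred dividends and share-count weighting."""
    return net_income / shares_outstanding


def effective_tax_rate(income_tax_expense: float, pretax_income: float) -> float:
    """Income tax expense as a share of pre-tax income."""
    return income_tax_expense / pretax_income


# Illustrative numbers only (in millions, shares in millions).
print(f"Debt-to-equity: {debt_to_equity(600.0, 400.0):.2f}")
print(f"ROE:            {return_on_equity(100.0, 400.0):.1%}")
print(f"EPS:            {earnings_per_share(100.0, 50.0):.2f}")
print(f"Tax rate:       {effective_tax_rate(30.0, 130.0):.1%}")
```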
|
||
### Investment and Risk Management | ||
- **Debt**: Represents borrowed funds that a company must repay. | ||
- **Investment Gains/Losses**: Reflect the profits or losses from investment activities. | ||
- **Hedging Activities**: Strategies used to reduce risks associated with price fluctuations. | ||
- **Derivative Instruments**: Financial contracts whose value is derived from an underlying asset. | ||
|
||
### Other Financial Terms | ||
- **Common Stock**: Represents ownership in a company and typically carries voting rights. | ||
- **Subsequent Events**: Events occurring after the end of a reporting period that may impact financial statements. | ||
- **Fair Value Measurements**: Refers to the estimated value of an asset or liability based on market conditions. | ||
- **Geographic Concentration Risk**: Risk associated with a company's heavy reliance on a particular geographic region. | ||
|
||
--- | ||
|
||
## Directory Structure | ||
|
||
|
||
|
||
FinTech-Lab-Summer-2024/ | ||
│ | ||
├── .github/workflows # CI/CD pipelines for automated testing and deployment. | ||
│ └── python-ci.yml | ||
│ | ||
├── docs # Documentation related to the project. | ||
│ | ||
├── src # Source code for the project. | ||
│ ├── analysis # Scripts for data analysis. | ||
│ │ ├── csv # CSV files with analyzed financial data. | ||
│ │ ├── text-summaries # Textual summaries extracted from 10-K filings. | ||
│ │ └── txt_to_csv.py # Script to convert text data to CSV. | ||
│ │ | ||
│ ├── app # Streamlit application. | ||
│ │ └── streamlit_app.py # Main application script. | ||
│ │ | ||
│ ├── scripts # Utility scripts for data processing. | ||
│ │ ├── data_extraction.py # Script for downloading SEC filings. | ||
│ │ ├── feature_extraction.py # Script for feature extraction from text. | ||
│ │ └── lemmitization.py # Script for text normalization. | ||
│ │ | ||
│ └── data # Data used or generated by the scripts. | ||
│ ├── pre-processed_data # Preprocessed datasets. | ||
│ └── pre-processing_scripts # Scripts that preprocess data. | ||
│ | ||
├── tests # Automated tests for the project. | ||
│ └── test_analysis.py # Test cases for data analysis scripts. | ||
│ | ||
├── requirements.txt # Project dependencies. | ||
└── README.md # Project overview and setup instructions. | ||
|
||
--- | ||
|
||
## **Streamlit Link** | ||
|
||
[url-link](www.google.com) | ||
|
||
## Tech-Stack | ||
|
||
|
||
## Contributing | ||
|
||
Contributions are welcome! Please fork the repository and open a pull request with your features or fixes. | ||
|
||
## License | ||
This project is licensed under the terms of the MIT license. | ||
|