This project is an attempt to understand the shifting ESG messaging in 10-K filings using a Retrieval-Augmented Generation (RAG) pipeline.
ESG metrics encompass three fundamental pillars:
- The "E" dimension evaluates a company's management of natural resources and its environmental impact, encompassing both direct operations and supply chain activities.
- The "S" aspect pertains to social factors, assessing a company's effectiveness in navigating social trends, labor practices, and political dynamics.
- The "G" component focuses on governance factors, examining decision-making processes from governmental policy-making to the allocation of rights and responsibilities within corporations. This includes scrutiny of governance structures involving the board of directors, managers, shareholders, and stakeholders.
ESG ratings play a role in guiding trillions of dollars in investments worldwide. For example, the investment strategy of the Maine Public Employees Retirement System Pension Fund is guided by ESG criteria, and they highlight the significance of integrating sustainability factors for long-term investment success.
In recent years, ESG-related shareholder proposals—an official vehicle through which shareholders can interface with the board of directors—have become more prominent.
However, ESG ratings have drawn several criticisms:
- Lack of transparency among the ESG raters on how the scores are assigned.
- Lack of standards on how a particular concept is measured.
- Questionable tradeoffs: high scores in one domain may offset very low scores in another area.
- Absence of an overall score combining performance along the Environmental, Social, and Governance (ESG) axes, and of agreed weights for the "E," "S," and "G" components.
- Lack of acknowledgement of stakeholder expectations.
This analysis specifically examines the 966 US equities in which the Norwegian Sovereign Wealth Fund has invested. The $1.4 trillion fund, managed by the Norwegian government, originated from oil and gas resources discovered in the late 1960s on the Norwegian continental shelf. Serving as a strategic financial reserve, the fund holds stakes in about 9,000 companies worldwide, owning approximately 1.4 percent of every listed company globally.
I have chosen to focus on this subset of publicly traded companies to benchmark US firms against European long-term sustainable investing strategies, specifically examining the Norwegian Sovereign Wealth Fund's voting guidelines. This approach aims to compare the governance and sustainability practices of American corporations with those promoted by one of Europe's most significant investors, known for its adherence to ESG principles.
The goal of this project is to equip large language models (LLMs) with domain-specific data derived from the 10-K disclosure filings of 966 publicly traded firms, as well as the Norwegian Wealth Fund's voting patterns on shareholder proposals. This enables the LLMs to tailor their outputs, drawing context from authoritative sources concerning environmental, social, and governance (ESG) messaging and corporate governance.
This project is broken down into a few steps.
A link to the API can be found here: Norges Bank Investment Management API
The code used to collect data can be found in _Step_1_DataCollection.ipynb_
Link to sec-api.io : https://sec-api.io/docs/sec-filings-item-extraction-api
The code used to collect data can be found in _Step_2_SEC_EDGAR.ipynb_
Output from Step 1 can be found in the Step_2 folder as "cleaned_company_list.csv"
Link to the open-source NLP preprocessor spaCy: https://spacy.io/api/sentencizer
The code used to preprocess data can be found in Step_3_DataPreprocessing.ipynb
Output of Step 3: utility.pdf
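For reference, the sentence segmentation in Step 3 can be done with spaCy's rule-based sentencizer. A minimal sketch (`filing_text` is an illustrative stand-in for one cleaned 10-K document from Step 2):

import spacy

# Lightweight pipeline containing only the rule-based sentencizer
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(filing_text)
sentences = [sent.text.strip() for sent in doc.sents]
print(f"{len(sentences)} sentences extracted")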
Link to hugging face: https://huggingface.co/sentence-transformers/all-mpnet-base-v2
Link to LLM: https://huggingface.co/google/gemma-7b-it
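A minimal sketch of embedding the preprocessed sentences with the all-mpnet-base-v2 model linked above (the batch size is an arbitrary choice):

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# `sentences` is the list produced by the spaCy preprocessing step
embeddings = embedding_model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (num_sentences, 768)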
A link to the API can be found here: Norges Bank Investment Management API
The code used to collect data can be found in Step_1_DataCollection.ipynb
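The holdings pull can be sketched as a simple REST request. Note that the endpoint path, query parameters, and response shape below are assumptions for illustration, not the documented NBIM API schema:

import requests
import pandas as pd

# Hypothetical endpoint and parameters -- consult the NBIM API docs for the real schema
BASE_URL = "https://vd.a.nbim.no/v1/query/holdings"

response = requests.get(BASE_URL, params={"country": "US"})
response.raise_for_status()

# Assume the response is a JSON list of holdings with ticker/name fields
df_companies = pd.DataFrame(response.json())
df_companies.to_csv("cleaned_company_list.csv", index=False)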
See link: SEC-API
!pip install sec-api
See query:
import pickle
import re

from sec_api import ExtractorApi
from tqdm import tqdm

# Initialize the sec-api Extractor API client (replace with your own key)
extractorApi = ExtractorApi("YOUR_API_KEY")

# `df_documents_info_cleaned` is the DataFrame containing the filing URLs and metadata
documents_info = []

for index, row in tqdm(df_documents_info_cleaned.iterrows(), total=df_documents_info_cleaned.shape[0], desc="Fetching and Cleaning Documents"):
    # Extract the metadata for this filing
    ticker = row['ticker']
    filedAt = row['filedAt']
    sector = row['sector']
    filing_url = row['linkToFilingDetails']

    # Fetch the text of Item 1 (Business) and Item 1A (Risk Factors)
    section_1_text = extractorApi.get_section(filing_url, "1", "text")
    section_1A_text = extractorApi.get_section(filing_url, "1A", "text")

    # Combine both sections' texts, keeping a space between them
    combined_text = section_1_text + " " + section_1A_text

    # Strip newlines and HTML numeric character references
    cleaned_combined_section = re.sub(r"\n|&#[0-9]+;", "", combined_text)

    # Append a dictionary for each document with its metadata and cleaned text
    documents_info.append({
        'ticker': ticker,
        'filedAt': filedAt,
        'sector': sector,
        'text': cleaned_combined_section
    })

# Serialize the list of dictionaries to a file using pickle
with open('Cleaned_US_Item1_1A.pkl', 'wb') as f:
    pickle.dump(documents_info, f)
Convert to a Pandas DataFrame:
import pandas as pd

# Load the serialized data from the pickle file
with open('Cleaned_US_Item1_1A.pkl', 'rb') as f:
    documents_info = pickle.load(f)

# Create a DataFrame from the documents_info list
Cleaned_US_Item1_1A = pd.DataFrame(documents_info)
Example: pull a 10-K document, get Section 1A (Risk Factors), and clean the text.
For this example, filter the DataFrame to consider only 10-K disclosures filed in 2023 in the Utilities sector:
# Keep only filings from 2023
df_2023 = Cleaned_US_Item1_1A[Cleaned_US_Item1_1A['filedAt'].str.startswith('2023')]

# Keep only companies in the Utilities sector
unique_utility_df = df_2023[df_2023['sector'] == 'Utilities']
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def draw_metadata(c, metadata, width, height):
    """Draws metadata at the top of each page."""
    c.setFont("Helvetica", 12)
    c.drawString(72, height - 50, f"Ticker: {metadata['ticker']}, Sector: {metadata['sector']}, Filed At: {metadata['filedAt']}")

def add_text_to_page(c, text, metadata, width, height):
    """Adds text to a page, ensuring metadata is drawn first and proper spacing is maintained."""
    draw_metadata(c, metadata, width, height)
    text_object = c.beginText(72, height - 100)  # Leave space below the metadata
    text_object.setFont("Helvetica", 10)
    for word in text.split():  # text.split() yields words, so wrapping is handled word by word
        word = preprocess_text(word)  # `preprocess_text` is defined earlier in the notebook
        if text_object.getX() + c.stringWidth(word) > width - 72:
            text_object.textLine()  # Wrap to the next line if the text exceeds the page width
        if text_object.getY() < 100:  # Check if we're near the bottom of the page
            c.drawText(text_object)  # Draw the text collected so far
            c.showPage()  # Start a new page
            draw_metadata(c, metadata, width, height)  # Redraw metadata at top of the new page
            text_object = c.beginText(72, height - 100)
        text_object.textOut(word + " ")  # Add a space between words
    c.drawText(text_object)  # Draw any remaining text
    c.showPage()  # Start a new page after finishing each company's text

def create_pdf(df, filename):
    """Creates a PDF file with each entry separated onto a new page with metadata at the top."""
    c = canvas.Canvas(filename, pagesize=letter)
    width, height = letter
    for index, row in df.iterrows():
        metadata = {'ticker': row['ticker'], 'sector': row['sector'], 'filedAt': row['filedAt']}
        add_text_to_page(c, row['text'], metadata, width, height)
    c.save()
    print(f"PDF saved as {filename}")

create_pdf(df=unique_utility_df, filename='utility.pdf')
- Access company list from the Norwegian Wealth Fund API
- Build company database
- File list of 10-K companies by industry:
The following table provides an overview of the number of companies across various sectors:
| Sector | Number of Companies |
|---|---|
| Basic Materials | 63 |
| Communication Services | 35 |
| Consumer Cyclical | 96 |
| Consumer Defensive | 38 |
| Energy | 58 |
| Financial Services | 153 |
| Healthcare | 108 |
| Industrials | 157 |
| Real Estate | 87 |
| Technology | 134 |
| Utilities | 37 |
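These sector counts (966 companies in total) can be reproduced from the cleaned DataFrame; a short sketch, assuming one row per company:

# Count unique companies per sector
print(Cleaned_US_Item1_1A.drop_duplicates('ticker')['sector'].value_counts())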
- Pull 10-K filings using sec-api.io
- Validate data using yfinance
- Clean text
- Prepare for embedding
- Create standardized notebooks to pull data
- Run TF-IDF (see the sketch after this list)
- Create dataframe for embedding across 966 companies
- Generate a UMAP projection for a single time slice
- Apply ESG-BERT (see: https://www.sciencedirect.com/science/article/pii/S1544612324000096?via%3Dihub#da1)
Download the model from Hugging Face: https://huggingface.co/ESGBERT
Sentiment model: https://huggingface.co/climatebert/distilroberta-base-climate-sentiment (see the sentiment sketch after this list)
- Summarize text using GPT. See: https://medium.com/@jan_5421/summarize-sec-filings-with-openai-gtp3-a363282d8d8
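A minimal sketch of the TF-IDF + UMAP step referenced above (the vectorizer parameters and the 2-D projection are illustrative choices):

from sklearn.feature_extraction.text import TfidfVectorizer
import umap  # pip install umap-learn

# TF-IDF over the cleaned filing texts
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(Cleaned_US_Item1_1A['text'])

# Project the sparse TF-IDF vectors down to 2-D for a single time slice
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(tfidf_matrix)

And a sketch of sentence-level sentiment with the ClimateBERT model linked above (the example sentence is made up):

from transformers import pipeline

classifier = pipeline("text-classification", model="climatebert/distilroberta-base-climate-sentiment")
print(classifier("We expect rising compliance costs related to new emissions regulation."))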
Going to implement a Retrieval-Augmented Generation (RAG) pipeline for the 10-K disclosure text.
RAG allows the LLM to reference an authoritative knowledge base (the 10-K text) outside its training data before generating a response.
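At query time, the retrieval step embeds the question with the same model used for the documents, scores it against the stored 10-K sentence embeddings, and hands the top-scoring passages to the LLM as context. A minimal sketch (`embeddings` and `sentences` carry over from the embedding step; the query is illustrative):

from sentence_transformers import SentenceTransformer, util
import torch

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Embed the user query with the same model used for the 10-K sentences
query = "How does the company discuss climate-related risk?"
query_embedding = embedding_model.encode(query)

# Cosine similarity between the query and every stored sentence embedding
scores = util.cos_sim(query_embedding, embeddings)[0]

# Keep the top 5 passages to feed the LLM as context
top_k = torch.topk(scores, k=5)
for score, idx in zip(top_k.values, top_k.indices):
    print(f"{score:.3f} | {sentences[idx][:120]}")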
See flowchart:
Using Google Colab Pro to access an A100 GPU for embeddings, plus Gemma 7B as the LLM.
GPU memory: 40GB | Recommended model: Gemma 7B in 4-bit or float16 precision.
Note: the following is Gemma-focused; however, more and more LLMs of the 2B and 7B size are appearing for local use.
import torch

# Determine available GPU memory in GB
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))

if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:
    print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")
Reference: https://github.com/mrdbourke/simple-local-rag/tree/main
- "Measuring Disclosure Using 8-K Filings"
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3354252
- VDisc_Ct is the number of voluntary 8-K items and associated exhibits
- VDisc_WC is the number of words within the voluntary 8-K items and associated exhibits
- "ESG In Corporate Filings: An AI Perspective"
Ties corporate actions on the environment to investor expectations.
https://arxiv.org/pdf/2212.00018.pdf
Used:
- Keywords corresponding to the SASB categories of ESG terms
Among the key findings are:
- Lack of transparency among the ESG raters on how the scores are assigned;
- Lack of standards on how a particular concept is measured;
- Questionable tradeoffs: high scores in one domain may offset very low scores in another;
- Absence of an overall score combining performance scores along environmental, social, and governance axes;
- Lack of acknowledgement of stakeholder expectations, leading to lower acceptance rates.
- Capturing Firm Economic Events
- Event-driven 8-K items (all 8-K items except Items 2.02, 7.01, 8.01, and 9.01)
- Disclosure-driven 8-K items (Items 2.02, 7.01, and 8.01, which have a voluntary disclosure component)
- Data Extraction and Preprocessing
Extract Relevant Sections: Identify and extract ESG-relevant sections from 8-K filings
Text Cleaning: Standardize formatting and remove non-essential elements.
Reference: https://scholar.harvard.edu/jbenchimol/files/text-mining-methodologies.pdf
- Labeling Data for Sentiment Analysis
Develop a Labeling Guide: Define positive, negative, and neutral ESG sentiments.
Manual Labeling: Label a subset of filings for training the sentiment analysis model.
- Hartzmark, Samuel M., and Kelly Shue. "Counterproductive Sustainable Investing: The Impact Elasticity of Brown and Green Firms." 1 Nov. 2022, SSRN, https://ssrn.com/abstract=4359282 or http://dx.doi.org/10.2139/ssrn.4359282.
- Dubner, Stephen J. "Are E.S.G. Investors Actually Helping the Environment?" Freakonomics Radio, no. 546, Freakonomics, LLC, 14 June 2023, https://freakonomics.com/podcast/are-e-s-g-investors-actually-helping-the-environment/.