Purpose. The scripts in this repository provide an easy way to scrape the State of the Union speeches from https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union.
Instructions.
- Clone the repository and install the required Python modules.
git clone https://github.com/stressosaurus/raw-data-state-of-the-union.git
cd raw-data-state-of-the-union/
pip install -r requirements.txt
- Scrape the speeches from the website by running the command below.
python wrangleSotu.py
The above command creates an 'html_files' folder containing the HTML files of the speeches, and a separate 'sotu.pkl' file containing the processed data for easy access. The data is stored as a pandas DataFrame with the columns 'year', 'month', 'day', 'president', 'title', and 'text'.
- You can open the "sotu.pkl" file by using the pandas module in Python.
import pandas as pd
sotu_df = pd.read_pickle('sotu.pkl')
print(sotu_df)
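- Once the DataFrame is loaded, you may want to check it and pull out a subset. The sketch below is only an illustration, not part of the repository: the helper names (check_columns, speeches_by_president, add_word_count) and the 'word_count' column are hypothetical, and the example name 'Lincoln' assumes the 'president' column holds names as plain text. It relies only on the column names listed above.

```python
import os
import pandas as pd

# Column names as listed in this README.
EXPECTED_COLUMNS = ['year', 'month', 'day', 'president', 'title', 'text']

def check_columns(df):
    """Return the expected columns that are missing from df (empty list if none)."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]

def speeches_by_president(df, name):
    """Return rows whose 'president' value contains name (case-insensitive).

    Hypothetical helper; assumes 'president' stores names as text.
    """
    mask = df['president'].astype(str).str.contains(name, case=False, regex=False)
    return df[mask]

def add_word_count(df):
    """Return a copy of df with a 'word_count' column (whitespace-separated tokens)."""
    out = df.copy()
    out['word_count'] = out['text'].astype(str).str.split().str.len()
    return out

if __name__ == '__main__':
    path = 'sotu.pkl'
    if os.path.exists(path):
        sotu_df = pd.read_pickle(path)
        print('missing columns:', check_columns(sotu_df))
        subset = add_word_count(speeches_by_president(sotu_df, 'Lincoln'))
        print(subset[['year', 'president', 'title', 'word_count']])
    else:
        print(path, 'not found; run wrangleSotu.py first')
```

The helpers return new DataFrames rather than modifying 'sotu_df' in place, so the loaded data stays unchanged while you explore it.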