The data extraction pipeline is responsible for (a) downloading the health bulletins from the respective sources, (b) setting up the database tables for all the states, and (c) extracting the information from the health bulletins and inserting these records into the database.
The data extraction pipeline is structured as follows:
run.py
- Main script used to either create the database from scratch or update an existing database

bulletin_download
- Downloads the health bulletins from the respective state sources and saves them locally
- main.py
  - Main file that orchestrates the bulletin download routine for all states
- states
  - This folder contains bulletin download scripts for each individual state. See the scripts for Delhi, West Bengal, and Telangana for reference.

db
- Interface to the database, defining the tables and data insertion queries.

local_extractor
- Defines the data extraction procedure from the health bulletins of each state.
- main.py
  - Main interface to the different state data extraction procedures
- states
  - Folder where all state data extraction scripts reside. Each script is initialized with the date and the report file path, and needs to define an extract function which returns a dictionary of the extracted data.
- utils
  - Stores commonly used utilities shared across the data extraction scripts, such as extracting tables from textual PDFs and standardizing date formats.
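Based on the contract described above, a state extraction script might look like the following minimal sketch. The class name, attribute names, and returned keys are hypothetical, used only for illustration; the actual scripts under local_extractor/states define the real interface and schema.

```python
# Hypothetical sketch of a script under local_extractor/states/.
# Class and dictionary key names are illustrative, not the repo's actual ones.
class ExampleStateExtractor:
    def __init__(self, date, report_fpath):
        # Each script is initialized with the bulletin date and the report file path
        self.date = date
        self.report_fpath = report_fpath

    def extract(self):
        # Parse the bulletin at self.report_fpath and return a dictionary of
        # the extracted data (placeholder values shown here instead of real parsing).
        return {
            "date": self.date,
            "confirmed": 0,
            "recovered": 0,
            "deceased": 0,
        }
```

The pipeline can then call extract() uniformly for every state and insert the returned dictionary into the corresponding database table.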
The run.py file, the main script that orchestrates the entire process, accepts the following command line arguments:

- --datadir <DIRPATH>: local folder path to store the bulletins, metadata files, and the database.
- --run_only <STATE_CODES>: accepts a comma-separated string of state codes. The procedure then runs only for the states in the given list.
- --force_run_states <STATE_CODES>: accepts a comma-separated string of state codes. The procedure re-runs the mentioned states, ignoring any dates it might have already parsed.
- --skip_bulletin_downloader: skips the bulletin download step.
- --skip_db_setup: skips the database setup step.
- --skip_bulletin_parser: skips the bulletin parsing and data extraction step.
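The flags above could be wired up with argparse roughly as follows. This is a sketch of how the documented interface might be parsed, not the actual contents of run.py; the helper parse_state_codes is a hypothetical name introduced here for illustration.

```python
import argparse

def build_parser():
    # Mirrors the documented command line interface of run.py (sketch only)
    parser = argparse.ArgumentParser(
        description="COVID-19 health bulletin data extraction pipeline"
    )
    parser.add_argument("--datadir", required=True,
                        help="folder for bulletins, metadata files, and the database")
    parser.add_argument("--run_only", default=None,
                        help="comma-separated state codes; run only these states")
    parser.add_argument("--force_run_states", default=None,
                        help="comma-separated state codes to re-run from scratch")
    parser.add_argument("--skip_bulletin_downloader", action="store_true")
    parser.add_argument("--skip_db_setup", action="store_true")
    parser.add_argument("--skip_bulletin_parser", action="store_true")
    return parser

def parse_state_codes(arg):
    # Hypothetical helper: "DL, KA" -> ["DL", "KA"], tolerating spaces after commas
    return [code.strip().upper() for code in arg.split(",")] if arg else []
```

Splitting on commas and stripping whitespace matches the quoted "DL, KA" style used in the examples below.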
- Run data extraction for all states:
python run.py --datadir <DIRPATH>
- Run data extraction for Delhi (DL) and Karnataka (KA):
python run.py --datadir <DIRPATH> --run_only "DL, KA"
- Run data extraction for all states, but re-run procedure for Delhi (DL):
python run.py --datadir <DIRPATH> --force_run_states "DL"
- Re-run procedure for only Delhi (DL) and Karnataka (KA):
python run.py --datadir <DIRPATH> --run_only "DL, KA" --force_run_states "DL, KA"
- Run only the bulletin download procedure for all states:
python run.py --datadir <DIRPATH> --skip_db_setup --skip_bulletin_parser
While you can conveniently download the up-to-date database from the following link https://www.dropbox.com/s/hbe04q6vtzapdam/covid-india.db?dl=1, you can also run the entire routine locally to recreate the database from scratch.
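If you only need the prebuilt database, a short script like the one below can fetch it and inspect its contents. The table names inside the database are not documented here, so the sketch simply lists them rather than assuming any particular schema; the function names are illustrative.

```python
import sqlite3
import urllib.request

# URL of the prebuilt database, as given in the text above
DB_URL = "https://www.dropbox.com/s/hbe04q6vtzapdam/covid-india.db?dl=1"

def download_db(path="covid-india.db"):
    # Fetch the prebuilt database to a local file
    urllib.request.urlretrieve(DB_URL, path)
    return path

def list_tables(db_path):
    # List the tables without assuming anything about the schema
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [name for (name,) in rows]
```

Listing sqlite_master first is a safe way to discover the per-state tables before writing queries against them.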
To recreate the database locally, follow these steps:
- Ensure you have the dependencies defined in requirements.txt installed. If not, run pip3 install -r requirements.txt to install them.
- Change directory into the data_extractor directory.
- Execute the run.py script as python3 run.py --datadir DATADIR. The --datadir argument defines the local folder path to store the bulletins, metadata files, and the database.
Refer to this Wiki to integrate a new state into the pipeline.