These scripts and data files scrape DC Department of Health restaurant inspection report data from https://dc.healthinspections.us/?a=Inspections.
Here is the workflow that produces the files in the output directory:
- Run `01_scrape_inspection_links.py` to generate or update the `scraped_inspection_links.csv` file. Because this downloads the complete set of active links from the page above, the script can take a while to run (a sketch of this step follows the list).
- Run `02_extract_inspection_data.py` to process those links in the `scraped_inspection_links.csv` file that have not already had their data extracted. This generates a local cache of HTML files for each link, and either creates or appends the data to the `inspection_summary_data.csv` and `violation_details_data.csv` files (a sketch of the cache-and-append pattern also follows the list).
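
The link-scraping step might look like the following minimal sketch. The anchor filtering, use of requests/BeautifulSoup, and output column name are assumptions about the script's internals rather than a transcript of them; the real page may require different selectors and pagination handling.

```python
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://dc.healthinspections.us/"

def scrape_inspection_links(page_url: str) -> list[str]:
    """Fetch one listing page and return the report links found on it."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Hypothetical filter: keep only hrefs that look like report pages.
        if "inspection" in anchor["href"].lower():
            links.append(urljoin(BASE_URL, anchor["href"]))
    return links

def write_links(links: list[str],
                path: str = "scraped_inspection_links.csv") -> None:
    """Write the collected links out as a one-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["inspection_link"])
        writer.writerows([link] for link in links)
```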
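
The second step's cache-then-append behavior can be sketched as follows. The cache directory name, the hash-based cache filenames, and the column handling are assumptions; the real script may key its cache differently.

```python
import csv
import hashlib
import os

import requests

CACHE_DIR = "html_cache"  # assumed cache location

def fetch_with_cache(link: str) -> str:
    """Return the report HTML, downloading only on a cache miss."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name = hashlib.sha1(link.encode("utf-8")).hexdigest() + ".html"
    cache_path = os.path.join(CACHE_DIR, name)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    response = requests.get(link, timeout=30)
    response.raise_for_status()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

def append_summary_row(row: dict,
                       path: str = "inspection_summary_data.csv") -> None:
    """Create the CSV with a header on first write, append thereafter."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```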
Experimental alternative/additional steps:
- Run `02alt_cache_potential_inspections.py` to sequentially scrape the range of known possible values of `inspection_id` and generate a local cache of possible inspection reports. This generates or updates the `potential_inspection_ids.csv` file. Note that some of these may not be valid reports (there are known broken duplicates on the server, for example). A sketch of this probing loop follows the list.
- Run `03alt_extract_potential_inspection_data.py` to process all such potential inspection reports (including those cached by the previous step) as in step 2 of the main workflow. This produces the `potential_inspection_summary_data.csv` and `potential_violation_details_data.csv` files. The first of these has an additional column indicating whether the given id is known to be valid, i.e. has been linked to by the dc.healthinspections.us site either in this scraping effort or in previous efforts (a sketch of deriving that flag also follows the list).
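
The sequential probing loop might look like this sketch. The report URL pattern and the recorded columns are assumptions; because some ids return broken or duplicate reports, the loop records the HTTP status rather than treating every response as a valid report.

```python
import csv
import os

import requests

# Assumed report URL pattern; the real query parameters may differ.
REPORT_URL = "https://dc.healthinspections.us/?a=Inspections&inspection_id={}"
CACHE_DIR = "potential_html_cache"

def cache_potential_inspections(start_id: int, end_id: int,
                                index_path: str = "potential_inspection_ids.csv") -> None:
    """Probe a range of ids, caching any response and recording its status."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(index_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if f.tell() == 0:  # brand-new file: write the header first
            writer.writerow(["inspection_id", "http_status"])
        for inspection_id in range(start_id, end_id + 1):
            cache_path = os.path.join(CACHE_DIR, f"{inspection_id}.html")
            if os.path.exists(cache_path):
                continue  # already cached on a previous run
            response = requests.get(REPORT_URL.format(inspection_id), timeout=30)
            if response.ok:
                with open(cache_path, "w", encoding="utf-8") as out:
                    out.write(response.text)
            writer.writerow([inspection_id, response.status_code])
```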
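
The known-valid flag could be derived roughly as below: an id counts as valid if it appears among the links ever scraped from the site. The column names and the assumption that the id is the final query-string value are both hypothetical.

```python
import csv

def load_known_ids(links_path: str = "scraped_inspection_links.csv") -> set[str]:
    """Collect the ids the site has actually linked to at some point."""
    with open(links_path, newline="", encoding="utf-8") as f:
        # Assumes the link column is named "inspection_link" and that the id
        # is the final query-string value, e.g. ...&inspection_id=12345.
        return {row["inspection_link"].rsplit("=", 1)[-1]
                for row in csv.DictReader(f)}

def flag_validity(summary_row: dict, known_ids: set[str]) -> dict:
    """Attach the known-valid marker to one summary row."""
    summary_row["known_valid"] = summary_row["inspection_id"] in known_ids
    return summary_row
```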
Future versions of these scripts and data will resolve issues relating to duplicates and other invalid inspection reports.