Takes a spreadsheet of URLs and checks whether each unique URL exists in a CDX index. Outputs a new spreadsheet listing the URLs found in the index, along with the three most recent occurrences of each.
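As a rough illustration of the "three latest occurrences" step, here is a minimal sketch, not the script's actual code. It assumes the index returns plain-text CDX lines with a 14-digit timestamp as the second field, and that the server accepts a `url` query parameter. The function names `latest_captures` and `fetch_cdx_lines` are hypothetical.

```python
from datetime import datetime
from typing import List


def latest_captures(cdx_text: str, count: int = 3) -> List[datetime]:
    """Return the `count` most recent capture times found in CDX lines."""
    timestamps = []
    for line in cdx_text.splitlines():
        fields = line.split()
        # Skip blank lines and header lines such as " CDX N b a m s k r M S V g".
        if len(fields) < 2 or len(fields[1]) != 14 or not fields[1].isdigit():
            continue
        timestamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
    return sorted(timestamps, reverse=True)[:count]


def fetch_cdx_lines(cdx_index_url: str, url: str) -> str:
    """Query the CDX index for one URL (the `url` parameter is an assumption)."""
    import requests  # third-party; imported here so the parser works without it

    response = requests.get(cdx_index_url, params={"url": url}, timeout=30)
    response.raise_for_status()
    return response.text
```

Under these assumptions, a URL would count as present in the index whenever `latest_captures(fetch_cdx_lines(index, url))` returns a non-empty list.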
Currently assumes URLs will be in column E of the input spreadsheet.
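One way to collect the unique URLs from column E with openpyxl is sketched below. This is an assumption about the approach, not the script's actual code: the helper names `unique_urls` and `read_column_e` are hypothetical, the active worksheet is used, and no header row is skipped.

```python
from typing import Iterable, List, Optional


def unique_urls(cell_values: Iterable[Optional[object]]) -> List[str]:
    """Return non-empty cell values as stripped strings, first occurrence only."""
    seen = set()
    urls = []
    for value in cell_values:
        if value is None:
            continue  # empty cell
        url = str(value).strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def read_column_e(path: str) -> List[str]:
    """Read column E (column index 5) of the active sheet and deduplicate it."""
    from openpyxl import load_workbook  # third-party

    sheet = load_workbook(path).active
    rows = sheet.iter_rows(min_col=5, max_col=5, values_only=True)
    return unique_urls(row[0] for row in rows)
```

Keeping the deduplication separate from the spreadsheet access makes it easy to change the column later without touching the rest of the logic.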
Use the package manager pip to install the necessary packages (datetime is part of the Python standard library and does not need to be installed).
pip install openpyxl requests
Takes three arguments:
- Path to the input spreadsheet
- Path to the destination folder for the output spreadsheet
- URL of the CDX index
python3 find_urls_in_cdx_index.py <path to input spreadsheet> <path to destination folder> <url of cdx index>
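For reference, the three positional arguments could be parsed roughly as follows. This is a sketch using `argparse` from the standard library; the script itself may read its arguments differently, and the names `build_parser`, `input_spreadsheet`, `destination_folder`, and `cdx_index_url` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three positional arguments, in usage order."""
    parser = argparse.ArgumentParser(
        prog="find_urls_in_cdx_index.py",
        description="Check which spreadsheet URLs exist in a CDX index.",
    )
    parser.add_argument("input_spreadsheet", help="path to the input spreadsheet")
    parser.add_argument("destination_folder", help="folder for the output spreadsheet")
    parser.add_argument("cdx_index_url", help="URL of the CDX index")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input_spreadsheet, args.destination_folder, args.cdx_index_url)
```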