NLNZDigitalPreservation/find_urls_in_cdx_index

Find URLs in a CDX index

Takes a spreadsheet of URLs and checks whether each unique URL exists in a CDX index. Outputs a new spreadsheet listing the URLs found in the index, along with the three most recent occurrences of each.
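
As a rough illustration of the "three latest occurrences" step, the sketch below picks the most recent capture times out of CDX text. It assumes the index returns space-separated lines with a 14-digit timestamp in the second field, as many CDX servers do; the function name `latest_captures` and the sample lines are made up for this example, and the actual script may work differently.

```python
from datetime import datetime


def latest_captures(cdx_text, n=3):
    """Return the n most recent capture times found in CDX text output."""
    timestamps = []
    for line in cdx_text.splitlines():
        fields = line.split()
        # Assumed layout: urlkey, 14-digit timestamp, original URL, ...
        if len(fields) < 2:
            continue
        stamp = fields[1]
        if len(stamp) != 14 or not stamp.isdigit():
            continue
        timestamps.append(datetime.strptime(stamp, "%Y%m%d%H%M%S"))
    # Newest first, then keep only the first n entries.
    return sorted(timestamps, reverse=True)[:n]


# Made-up sample lines, for illustration only.
sample = """\
org,example)/ 20180102030405 http://example.org/ text/html 200 AAAA 100
org,example)/ 20210607080910 http://example.org/ text/html 200 BBBB 100
org,example)/ 20190304050607 http://example.org/ text/html 200 CCCC 100
org,example)/ 20200506070809 http://example.org/ text/html 200 DDDD 100
"""

for captured in latest_captures(sample):
    print(captured.isoformat())
```

A URL with no matching lines simply yields an empty list, which is one way to decide whether it belongs in the output spreadsheet.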

The script currently assumes that the URLs are in column E of the input spreadsheet.
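
The column E assumption can be sketched as follows. This is a minimal example, not the script's own code: the function name `unique_urls` is hypothetical, and it works on plain row tuples such as those openpyxl yields from `worksheet.iter_rows(values_only=True)`. Column E corresponds to index 4 when counting from zero. The sketch does not skip a header row, so a header cell in column E would be returned as well.

```python
def unique_urls(rows, column_index=4):
    """Collect unique, non-empty strings from one column, in first-seen order."""
    seen = {}
    for row in rows:
        # Skip rows that are too short to have the requested column.
        if len(row) <= column_index:
            continue
        value = row[column_index]
        if isinstance(value, str) and value.strip():
            # dict keys keep insertion order, so duplicates are dropped in place.
            seen.setdefault(value.strip(), None)
    return list(seen)


# Made-up rows standing in for spreadsheet data (columns A to E).
rows = [
    (1, "a", "b", "c", "http://example.org/"),
    (2, "a", "b", "c", "http://example.org/page"),
    (3, "a", "b", "c", "http://example.org/"),
    (4, "a", "b", "c", None),
    (5, "a"),
]

print(unique_urls(rows))
```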

Installation

Use the package manager pip to install necessary packages.

pip install openpyxl requests

The script also uses datetime, which ships with Python and needs no separate install.

Usage

Takes three arguments:

  1. Path to the input spreadsheet
  2. Path to the destination folder for the output spreadsheet
  3. URL of the CDX index

python3 find_urls_in_cdx_index.py <path to input spreadsheet> <path to destination folder> <url of cdx index>
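
One way to handle the three positional arguments above is with `argparse`. This is only a sketch: the argument names, the file name `urls.xlsx`, the folder `output`, and the index URL are all hypothetical, and the actual script may read its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments described in the usage section."""
    parser = argparse.ArgumentParser(description="Find URLs in a CDX index")
    parser.add_argument("input_spreadsheet", help="path to the input spreadsheet")
    parser.add_argument("destination_folder", help="folder for the output spreadsheet")
    parser.add_argument("cdx_index_url", help="URL of the CDX index")
    return parser.parse_args(argv)


# Example values only; none of these come from the repository.
args = parse_args(["urls.xlsx", "output", "http://localhost:8080/cdx"])
print(args.input_spreadsheet, args.destination_folder, args.cdx_index_url)
```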
