Code that generates some of the "seed lists" for the End of Term Web Archive 2024 crawl.
pip install -r requirements.txt
Downloads two CSV files from get.gov that list all of the federal and non-federal domains registered in the .gov TLD.
Downloads web graph summaries from CCF as tab-separated values (TSV).
Given the web graph domain and host ranks, extracts the .mil and .gov entries they contain. The output keeps the web graph table TSV format.
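The filtering step can be sketched roughly as follows. This is an illustrative sketch, not the script in this repository: the function name `filter_gov_mil`, the default column index, and the assumption that names are stored in reversed notation (for example `gov.nasa.www`) are all assumptions about the rank file layout and should be checked against the downloaded files.

```python
def filter_gov_mil(lines, name_col=4, reversed_names=True):
    """Yield rank-table rows whose name falls under .gov or .mil.

    Rows are passed through unchanged, so the output keeps the
    input's TSV layout. `name_col` and `reversed_names` are
    assumptions about the file format, not confirmed values.
    """
    for line in lines:
        if line.startswith("#"):
            # Keep the header row so the output is still a valid table.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue  # skip malformed rows
        labels = fields[name_col].lower().split(".")
        # In reversed notation the TLD is the first label, otherwise the last.
        tld = labels[0] if reversed_names else labels[-1]
        if tld in ("gov", "mil"):
            yield line
```

Because it works on any iterable of lines, the same function could be applied to an open file handle for either the domain ranks or the host ranks.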
Takes current-federal.csv and the host-level web graph, and outputs every .gov host whose domain is listed in current-federal.csv. For .mil, all hosts are output without filtering. This output is what is checked into eot2024/seed-lists.
- ccf-gov-federal-web-graph-2024-jun-jul-aug.txt
- ccf-mil-web-graph-2024-jun-jul-aug.txt
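The federal filtering step described above could look something like the sketch below. It is a hedged illustration rather than the repository's actual code: the helper names, the `Domain name` CSV column, the host column index, and the reversed host notation are assumptions, and the real seed lists may be written in a different layout.

```python
import csv


def load_federal_domains(csv_lines, column="Domain name"):
    """Return the set of lowercased domains from a get.gov style CSV.

    The column name is an assumption; adjust it to match the file.
    """
    reader = csv.DictReader(csv_lines)
    return {row[column].strip().lower() for row in reader if row.get(column)}


def unreverse(host_rev):
    """Turn 'gov.nasa.www' into 'www.nasa.gov'."""
    return ".".join(reversed(host_rev.split(".")))


def is_federal_host(host, federal_domains):
    """True if the host or any parent domain of it is in the federal set."""
    labels = host.lower().split(".")
    # Try every suffix that has at least two labels, e.g. 'nasa.gov'.
    return any(".".join(labels[i:]) in federal_domains
               for i in range(len(labels) - 1))


def select_hosts(host_rows, federal_domains, host_col=4):
    """Split host rank rows into (federal .gov hosts, all .mil hosts)."""
    gov, mil = [], []
    for line in host_rows:
        if line.startswith("#"):
            continue  # skip the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= host_col:
            continue  # skip malformed rows
        host = unreverse(fields[host_col].lower())
        if host.endswith(".mil"):
            mil.append(host)  # .mil hosts are kept unconditionally
        elif host.endswith(".gov") and is_federal_host(host, federal_domains):
            gov.append(host)
    return gov, mil
```

Matching on every parent suffix, rather than only the last two labels, keeps the check correct for registered .gov names that have more than two labels.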