feat: add --crawl (#39)
wumpus authored Sep 9, 2024
1 parent 83d1f31 commit e5d122a
Showing 10 changed files with 276 additions and 78 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -14,7 +14,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
#max-parallel: 1
max-parallel: 1 # avoids ever triggering a rate limit
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest]
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,10 @@
- 0.9.37
+ --crawl for CCF

- 0.9.36
+ ratelimit code; both IA and CCF are rate limiting their cdx endpoints
+ cache collinfo.json in ~/.cache/cdx_toolkit/
+ py3.11 and py3.12 pass testing
+ py3.11 and py3.12 pass testing; windows and macos pass testing

- 0.9.35
+ exponential backoff retries now that IA is sending 429
7 changes: 3 additions & 4 deletions Makefile
@@ -33,14 +33,13 @@ distcheck: distclean
twine check dist/*

dist: distclean
echo " Finishe CHANGELOG and commit it.
echo " Finishe CHANGELOG.md and commit it."
echo " git tag --list"
echo " git tag v0.x.x"
echo " git tag 0.x.x # no v"
echo " git push --tags"
python ./setup.py sdist
twine check dist/*
twine upload dist/* -r pypi

install:
python ./setup.py install

pip install .
97 changes: 76 additions & 21 deletions README.md
@@ -3,61 +3,118 @@
[![build](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml/badge.svg)](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml) [![coverage](https://codecov.io/gh/cocrawler/cdx_toolkit/graph/badge.svg?token=M1YJB998LE)](https://codecov.io/gh/cocrawler/cdx_toolkit) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cdx_toolkit.svg)](LICENSE)

cdx_toolkit is a set of tools for working with CDX indices of web
crawls and archives, including those at CommonCrawl and the Internet
Archive's Wayback Machine.
crawls and archives, including those at the Common Crawl Foundation
(CCF) and those at the Internet Archive's Wayback Machine.

CommonCrawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API server. cdx_toolkit
hides these differences as best it can. cdx_toolkit also knits
together the monthly Common Crawl CDX indices into a single, virtual
index.
Common Crawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API server.
cdx_toolkit hides these differences as best it can. cdx_toolkit also
knits together the monthly Common Crawl CDX indices into a single,
virtual index.

Finally, cdx_toolkit allows extracting archived pages from CC and IA
into WARC files. If you're looking to create subsets of CC or IA data
and then process them into WET or WAT files, this is a feature you'll
find useful.
into WARC files. If you're looking to create subsets of CC or IA data
and then further process them, this is a feature you'll find useful.

## Installing

cdx toolkit requires Python 3.

```
$ pip install cdx_toolkit
```

or clone this repo and use `python ./setup.py install`.
or clone this repo and use `pip install .`

## Command-line tools

```
$ cdxt --cc size 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*' # returns the most recent year
$ cdxt --crawl 3 --limit 10 iter 'commoncrawl.org/*' # returns the most recent 3 crawls
$ cdxt --cc --limit 10 --filter '=status:200' iter 'commoncrawl.org/*'
$ cdxt --ia --limit 10 iter 'commoncrawl.org/*'
$ cdxt --ia --limit 10 iter 'commoncrawl.org/*' # will show the beginning of IA's crawl
$ cdxt --ia --limit 10 warc 'commoncrawl.org/*'
```

cdxt takes a large number of command line switches, controlling
the time period and all other CDX query options. cdxt can generate
WARC, jsonl, and csv outputs.
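
For example, here's a hedged sketch of asking for those alternate output
formats (the exact flag names are an assumption; check `cdxt iter --help`
for what your version supports):

```
$ cdxt --cc --limit 10 iter --csv 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter --jsonl 'commoncrawl.org/*'
```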

** Note that by default, cdxt --cc will iterate over the previous
year of captures. **
If you don't specify much about the crawls or dates or number of
records you're interested in, some default limits will kick in to
prevent overly-large queries. These default limits include a maximum
of 1000 records (`--limit 1000`) and a limit of 1 year of CC indexes.
To exceed these limits, use `--limit` and `--crawl` or `--from` and
`--to`.
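
For example, these queries deliberately exceed the defaults (the URL
pattern and the numbers are just placeholders):

```
$ cdxt --crawl 12 --limit 100000 iter 'commoncrawl.org/*'
$ cdxt --cc --from 2020 --to 202212312359 --limit 100000 iter 'commoncrawl.org/*'
```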

If it seems like nothing is happening, add `-v` or `-vv` at the start:

```
$ cdxt -vv --cc size 'commoncrawl.org/*'
```

## Selecting particular CCF crawls

Common Crawl's data is divided into "crawls", which were yearly at the
start, and are currently done monthly. There are over 100 of them.
[You can find details about these crawls here.](https://data.commoncrawl.org/crawl-data/index.html)

Unlike some web archives, CCF doesn't have a single CDX index that
covers all of these crawls -- we have 1 index per crawl. The way
you ask for a particular crawl is:

```
$ cdxt --crawl CC-MAIN-2024-33 iter 'commoncrawl.org/*'
```

- `--crawl CC-MAIN-2024-33` is a single crawl.
- `--crawl 3` is the latest 3 crawls.
- `--crawl CC-MAIN-2018` will match all of the crawls from 2018.
- `--crawl CC-MAIN-2018,CC-MAIN-2019` will match all of the crawls from 2018 and 2019.
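
Any of these forms can be combined with the other switches, for example
(a hedged sketch; the domain is a placeholder):

```
$ cdxt --crawl CC-MAIN-2018,CC-MAIN-2019 --limit 100 iter 'commoncrawl.org/*'
```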

CCF also has a hive-sharded parquet index (called the columnar index)
that covers all of our crawls. Querying broad time ranges is much
faster with the columnar index. You can find more information about
this index at [the blog post about it](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format).
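
cdx_toolkit itself only talks to the CDX APIs, but as a rough, hedged
sketch of what a columnar-index query looks like, here is a DuckDB example.
The S3 path, column names, and access requirements below are assumptions
drawn from CCF's public documentation, not something this repo ships:

```
# Hedged sketch: query CCF's columnar (parquet) index with DuckDB.
# The S3 layout and column names are assumptions; reading s3://commoncrawl/
# may require AWS credentials to be configured in your environment.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")

rows = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-33/subset=warc/*.parquet')
    WHERE url_host_registered_domain = 'commoncrawl.org'
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```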

The Internet Archive CDX index is organized as a single crawl that runs
from the very beginning until now, which is why there is no `--crawl` for
`--ia`. Note that CDX queries to `--ia` will default to one year of captures
and a limit of 1000 entries if you do not specify `--from`, `--to`, and `--limit`.

## Selecting by time

In most cases you'll probably use --crawl to select the time range for
Common Crawl queries, but for the Internet Archive you'll need to specify
a time range like this:

```
$ cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*'
```

In this example the time range starts at the beginning of 2008 and
ends on June 30, 2009 at 23:59. All times are in UTC. If you do not
specify a time range (and also don't use `--crawl`), you'll get the
most recent year.

See
## The full syntax for command-line tools

```
$ cdxt --help
$ cdxt iter --help
$ cdxt warc --help
$ cdxt size --help
```

for full details. Note that argument order really matters; each switch
is valid only either before or after the {iter,warc,size} command.
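
For example (a hedged illustration; exactly which switches belong to the
main command versus the subcommand can vary by version):

```
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'    # --limit before iter: accepted
$ cdxt --cc iter --limit 10 'commoncrawl.org/*'    # --limit after iter: rejected
```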

Add -v (or -vv) to see what's going on under the hood.

## Programming example
## Python programming example

Everything that you can do on the command line, and much more, can
be done by writing a Python program.

```
import cdx_toolkit
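
# (The README's original example is truncated in this diff view. The lines
# below are a hedged sketch using the crawl= parameter this commit adds to
# CDXFetcher; see cdx_toolkit/__init__.py further down. The record fields
# printed here are an assumption.)

cdx = cdx_toolkit.CDXFetcher(source='cc', crawl='CC-MAIN-2024-33')

for obj in cdx.iter('commoncrawl.org/*', limit=10):
    print(obj['url'], obj['status'], obj['timestamp'])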
@@ -231,5 +288,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


19 changes: 13 additions & 6 deletions cdx_toolkit/__init__.py
@@ -197,12 +197,14 @@ def __next__(self):
LOGGER.debug('getting more in __next__')
self.get_more()
if len(self.captures) <= 0:
# XXX print out a warning if this hits the default limit of 1000
raise StopIteration


class CDXFetcher:
def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
def __init__(self, source='cc', crawl=None, wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
self.source = source
self.crawl = crawl
self.cc_sort = cc_sort
self.source = source
if wb is not None and warc_download_prefix is not None:
@@ -211,12 +213,11 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
self.warc_download_prefix = warc_download_prefix

if source == 'cc':
self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
self.raw_index_list = get_cc_endpoints(self.cc_mirror)
if wb is not None:
raise ValueError('cannot specify wb= for source=cc')
self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
self.raw_index_list = get_cc_endpoints(self.cc_mirror)
self.warc_download_prefix = warc_download_prefix or 'https://data.commoncrawl.org'
#https://commoncrawl.s3.amazonaws.com
elif source == 'ia':
self.index_list = ('https://web.archive.org/cdx/search/cdx',)
if self.warc_download_prefix is None and self.wb is None:
@@ -230,8 +231,10 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
LOGGER.setLevel(level=loglevel)

def customize_index_list(self, params):
if self.source == 'cc' and ('from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
if self.source == 'cc' and (self.crawl or 'crawl' in params or 'from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
LOGGER.info('making a custom cc index list')
if self.crawl and 'crawl' not in params:
params['crawl'] = self.crawl
return filter_cc_endpoints(self.raw_index_list, self.cc_sort, params=params)
else:
return self.index_list
@@ -243,6 +246,8 @@ def get(self, url, **kwargs):
validate_timestamps(params)
params['url'] = url
params['output'] = 'json'
if 'crawl' not in params:
params['crawl'] = self.crawl
if 'filter' in params:
if isinstance(params['filter'], str):
params['filter'] = (params['filter'],)
@@ -272,13 +277,15 @@ def iter(self, url, **kwargs):
validate_timestamps(params)
params['url'] = url
params['output'] = 'json'
if 'crawl' not in params:
params['crawl'] = self.crawl
if 'filter' in params:
if isinstance(params['filter'], str):
params['filter'] = (params['filter'],)
params['filter'] = munge_filter(params['filter'], self.source)

if self.source == 'cc':
apply_cc_defaults(params)
apply_cc_defaults(params, crawl_present=bool(self.crawl))

index_list = self.customize_index_list(params)
return CDXFetcherIter(self, params=params, index_list=index_list)
8 changes: 6 additions & 2 deletions cdx_toolkit/cli.py
@@ -6,6 +6,7 @@
import os

import cdx_toolkit
from cdx_toolkit.commoncrawl import normalize_crawl

LOGGER = logging.getLogger(__name__)

@@ -17,13 +18,14 @@ def main(args=None):
parser.add_argument('--verbose', '-v', action='count', help='set logging level to INFO (-v) or DEBUG (-vv)')

parser.add_argument('--cc', action='store_const', const='cc', help='direct the query to the Common Crawl CDX/WARCs')
parser.add_argument('--crawl', action='store', help='crawl names (comma separated) or an integer for the most recent N crawls. Implies --cc')
parser.add_argument('--ia', action='store_const', const='ia', help='direct the query to the Internet Archive CDX/wayback')
parser.add_argument('--source', action='store', help='direct the query to this CDX server')
parser.add_argument('--wb', action='store', help='direct replays for content to this wayback')
parser.add_argument('--limit', type=int, action='store')
parser.add_argument('--cc-mirror', action='store', help='use this Common Crawl index mirror')
parser.add_argument('--cc-sort', action='store', help='default mixed, alternatively: ascending')
parser.add_argument('--from', action='store') # XXX default for cc
parser.add_argument('--from', action='store')
parser.add_argument('--to', action='store')
parser.add_argument('--filter', action='append', help='see CDX API documentation for usage')
parser.add_argument('--get', action='store_true', help='use a single get instead of a paged iteration. default limit=1000')
@@ -93,13 +95,15 @@ def get_version():

def setup(cmd):
kwargs = {}
kwargs['source'] = cmd.cc or cmd.ia or cmd.source or None
kwargs['source'] = 'cc' if cmd.crawl else cmd.cc or cmd.ia or cmd.source or None
if kwargs['source'] is None:
raise ValueError('must specify --cc, --ia, or a --source')
if cmd.wb:
kwargs['wb'] = cmd.wb
if cmd.cc_mirror:
kwargs['cc_mirror'] = cmd.cc_mirror
if cmd.crawl:
kwargs['crawl'] = normalize_crawl([cmd.crawl]) # currently a string, not a list
if getattr(cmd, 'warc_download_prefix', None) is not None:
kwargs['warc_download_prefix'] = cmd.warc_download_prefix
