feat: add --crawl (#39)
wumpus authored Sep 9, 2024
1 parent 83d1f31 commit e5d122a
Showing 10 changed files with 276 additions and 78 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -14,7 +14,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
#max-parallel: 1
max-parallel: 1 # avoids ever triggering a rate limit
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12']
os: [ubuntu-latest]
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,10 @@
- 0.9.37
+ --crawl for CCF

- 0.9.36
+ ratelimit code; both IA and CCF are rate limiting their cdx endpoints
+ cache collinfo.json in ~/.cache/cdx_toolkit/
+ py3.11 and py3.12 pass testing
+ py3.11 and py3.12 pass testing; windows and macos pass testing

- 0.9.35
+ exponential backoff retries now that IA is sending 429
7 changes: 3 additions & 4 deletions Makefile
@@ -33,14 +33,13 @@ distcheck: distclean
twine check dist/*

dist: distclean
echo " Finishe CHANGELOG and commit it.
echo " Finishe CHANGELOG.md and commit it."
echo " git tag --list"
echo " git tag v0.x.x"
echo " git tag 0.x.x # no v"
echo " git push --tags"
python ./setup.py sdist
twine check dist/*
twine upload dist/* -r pypi

install:
python ./setup.py install

pip install .
97 changes: 76 additions & 21 deletions README.md
@@ -3,61 +3,118 @@
[![build](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml/badge.svg)](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml) [![coverage](https://codecov.io/gh/cocrawler/cdx_toolkit/graph/badge.svg?token=M1YJB998LE)](https://codecov.io/gh/cocrawler/cdx_toolkit) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cdx_toolkit.svg)](LICENSE)

cdx_toolkit is a set of tools for working with CDX indices of web
crawls and archives, including those at CommonCrawl and the Internet
Archive's Wayback Machine.
crawls and archives, including those at the Common Crawl Foundation
(CCF) and those at the Internet Archive's Wayback Machine.

CommonCrawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API server. cdx_toolkit
hides these differences as best it can. cdx_toolkit also knits
together the monthly Common Crawl CDX indices into a single, virtual
index.
Common Crawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API server.
cdx_toolkit hides these differences as best it can. cdx_toolkit also
knits together the monthly Common Crawl CDX indices into a single,
virtual index.

Finally, cdx_toolkit allows extracting archived pages from CC and IA
into WARC files. If you're looking to create subsets of CC or IA data
and then process them into WET or WAT files, this is a feature you'll
find useful.
into WARC files. If you're looking to create subsets of CC or IA data
and then further process them, this is a feature you'll find useful.

## Installing

cdx toolkit requires Python 3.

```
$ pip install cdx_toolkit
```

or clone this repo and use `python ./setup.py install`.
or clone this repo and use `pip install .`

## Command-line tools

```
$ cdxt --cc size 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*' # returns the most recent year
$ cdxt --crawl 3 --limit 10 iter 'commoncrawl.org/*' # returns the most recent 3 crawls
$ cdxt --cc --limit 10 --filter '=status:200' iter 'commoncrawl.org/*'
$ cdxt --ia --limit 10 iter 'commoncrawl.org/*'
$ cdxt --ia --limit 10 iter 'commoncrawl.org/*' # will show the beginning of IA's crawl
$ cdxt --ia --limit 10 warc 'commoncrawl.org/*'
```

cdxt takes a large number of command line switches, controlling
the time period and all other CDX query options. cdxt can generate
WARC, jsonl, and csv outputs.
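
For example, here's a hedged sketch of asking for those alternate output
formats (the exact flag names are an assumption; check `cdxt iter --help`
for what your version supports):

```
$ cdxt --cc --limit 10 iter --csv 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter --jsonl 'commoncrawl.org/*'
```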

** Note that by default, cdxt --cc will iterate over the previous
year of captures. **
If you don't specify much about the crawls or dates or number of
records you're interested in, some default limits will kick in to
prevent overly-large queries. These default limits include a maximum
of 1000 records (`--limit 1000`) and a limit of 1 year of CC indexes.
To exceed these limits, use `--limit` and `--crawl` or `--from` and
`--to`.
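
For example, these queries deliberately exceed the defaults (the URL
pattern and the numbers are just placeholders):

```
$ cdxt --crawl 12 --limit 100000 iter 'commoncrawl.org/*'
$ cdxt --cc --from 2020 --to 202212312359 --limit 100000 iter 'commoncrawl.org/*'
```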

If it seems like nothing is happening, add `-v` or `-vv` at the start:

```
$ cdxt -vv --cc size 'commoncrawl.org/*'
```

## Selecting particular CCF crawls

Common Crawl's data is divided into "crawls", which were yearly at the
start, and are currently done monthly. There are over 100 of them.
[You can find details about these crawls here.](https://data.commoncrawl.org/crawl-data/index.html)

Unlike some web archives, CCF doesn't have a single CDX index that
covers all of these crawls -- we have 1 index per crawl. The way
you ask for a particular crawl is:

```
$ cdxt --crawl CC-MAIN-2024-33 iter 'commoncrawl.org/*'
```

- `--crawl CC-MAIN-2024-33` is a single crawl.
- `--crawl 3` is the latest 3 crawls.
- `--crawl CC-MAIN-2018` will match all of the crawls from 2018.
- `--crawl CC-MAIN-2018,CC-MAIN-2019` will match all of the crawls from 2018 and 2019.
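
Any of these forms can be combined with the other switches, for example
(a hedged sketch; the domain is a placeholder):

```
$ cdxt --crawl CC-MAIN-2018,CC-MAIN-2019 --limit 100 iter 'commoncrawl.org/*'
```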

CCF also has a hive-sharded parquet index (called the columnar index)
that covers all of our crawls. Querying broad time ranges is much
faster with the columnar index. You can find more information about
this index at [the blog post about it](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format).
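
cdx_toolkit itself only talks to the CDX APIs, but as a rough, hedged
sketch of what a columnar-index query looks like, here is a DuckDB example.
The S3 path, column names, and access requirements below are assumptions
drawn from CCF's public documentation, not something this repo ships:

```
# Hedged sketch: query CCF's columnar (parquet) index with DuckDB.
# The S3 layout and column names are assumptions; reading s3://commoncrawl/
# may require AWS credentials to be configured in your environment.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='us-east-1'")

rows = con.execute("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-33/subset=warc/*.parquet')
    WHERE url_host_registered_domain = 'commoncrawl.org'
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```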

The Internet Archive CDX index is organized as a single crawl that runs
from the very beginning until now, which is why there is no `--crawl` for
`--ia`. Note that CDX queries to `--ia` will default to one year of captures
and a limit of 1000 entries if you do not specify `--from`, `--to`, and `--limit`.

## Selecting by time

In most cases you'll probably use --crawl to select the time range for
Common Crawl queries, but for the Internet Archive you'll need to specify
a time range like this:

```
$ cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*'
```

In this example the time range starts at the beginning of 2008 and
ends on June 30, 2009 at 23:59. All times are in UTC. If you do not
specify a time range (and also don't use `--crawl`), you'll get the
most recent year.

See
## The full syntax for command-line tools

```
$ cdxt --help
$ cdxt iter --help
$ cdxt warc --help
$ cdxt size --help
```

for full details. Note that argument order really matters; each switch
is valid only either before or after the {iter,warc,size} command.
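
For example (a hedged illustration; exactly which switches belong to the
main command versus the subcommand can vary by version):

```
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'    # --limit before iter: accepted
$ cdxt --cc iter --limit 10 'commoncrawl.org/*'    # --limit after iter: rejected
```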

Add -v (or -vv) to see what's going on under the hood.

## Programming example
## Python programming example

Everything that you can do on the command line, and much more, can
be done by writing a Python program.

```
import cdx_toolkit
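
# (The README's original example is truncated in this diff view. The lines
# below are a hedged sketch using the crawl= parameter this commit adds to
# CDXFetcher; see cdx_toolkit/__init__.py further down. The record fields
# printed here are an assumption.)

cdx = cdx_toolkit.CDXFetcher(source='cc', crawl='CC-MAIN-2024-33')

for obj in cdx.iter('commoncrawl.org/*', limit=10):
    print(obj['url'], obj['status'], obj['timestamp'])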
@@ -231,5 +288,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


19 changes: 13 additions & 6 deletions cdx_toolkit/__init__.py
@@ -197,12 +197,14 @@ def __next__(self):
LOGGER.debug('getting more in __next__')
self.get_more()
if len(self.captures) <= 0:
# XXX print out a warning if this hits the default limit of 1000
raise StopIteration


class CDXFetcher:
def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
def __init__(self, source='cc', crawl=None, wb=None, warc_download_prefix=None, cc_mirror=None, cc_sort='mixed', loglevel=None):
self.source = source
self.crawl = crawl
self.cc_sort = cc_sort
self.source = source
if wb is not None and warc_download_prefix is not None:
@@ -211,12 +213,11 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
self.warc_download_prefix = warc_download_prefix

if source == 'cc':
self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
self.raw_index_list = get_cc_endpoints(self.cc_mirror)
if wb is not None:
raise ValueError('cannot specify wb= for source=cc')
self.cc_mirror = cc_mirror or 'https://index.commoncrawl.org/'
self.raw_index_list = get_cc_endpoints(self.cc_mirror)
self.warc_download_prefix = warc_download_prefix or 'https://data.commoncrawl.org'
#https://commoncrawl.s3.amazonaws.com
elif source == 'ia':
self.index_list = ('https://web.archive.org/cdx/search/cdx',)
if self.warc_download_prefix is None and self.wb is None:
@@ -230,8 +231,10 @@ def __init__(self, source='cc', wb=None, warc_download_prefix=None, cc_mirror=No
LOGGER.setLevel(level=loglevel)

def customize_index_list(self, params):
if self.source == 'cc' and ('from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
if self.source == 'cc' and (self.crawl or 'crawl' in params or 'from' in params or 'from_ts' in params or 'to' in params or 'closest' in params):
LOGGER.info('making a custom cc index list')
if self.crawl and 'crawl' not in params:
params['crawl'] = self.crawl
return filter_cc_endpoints(self.raw_index_list, self.cc_sort, params=params)
else:
return self.index_list
@@ -243,6 +246,8 @@ def get(self, url, **kwargs):
validate_timestamps(params)
params['url'] = url
params['output'] = 'json'
if 'crawl' not in params:
params['crawl'] = self.crawl
if 'filter' in params:
if isinstance(params['filter'], str):
params['filter'] = (params['filter'],)
@@ -272,13 +277,15 @@ def iter(self, url, **kwargs):
validate_timestamps(params)
params['url'] = url
params['output'] = 'json'
if 'crawl' not in params:
params['crawl'] = self.crawl
if 'filter' in params:
if isinstance(params['filter'], str):
params['filter'] = (params['filter'],)
params['filter'] = munge_filter(params['filter'], self.source)

if self.source == 'cc':
apply_cc_defaults(params)
apply_cc_defaults(params, crawl_present=bool(self.crawl))

index_list = self.customize_index_list(params)
return CDXFetcherIter(self, params=params, index_list=index_list)
8 changes: 6 additions & 2 deletions cdx_toolkit/cli.py
@@ -6,6 +6,7 @@
import os

import cdx_toolkit
from cdx_toolkit.commoncrawl import normalize_crawl

LOGGER = logging.getLogger(__name__)

@@ -17,13 +18,14 @@ def main(args=None):
parser.add_argument('--verbose', '-v', action='count', help='set logging level to INFO (-v) or DEBUG (-vv)')

parser.add_argument('--cc', action='store_const', const='cc', help='direct the query to the Common Crawl CDX/WARCs')
parser.add_argument('--crawl', action='store', help='crawl names (comma separated) or an integer for the most recent N crawls. Implies --cc')
parser.add_argument('--ia', action='store_const', const='ia', help='direct the query to the Internet Archive CDX/wayback')
parser.add_argument('--source', action='store', help='direct the query to this CDX server')
parser.add_argument('--wb', action='store', help='direct replays for content to this wayback')
parser.add_argument('--limit', type=int, action='store')
parser.add_argument('--cc-mirror', action='store', help='use this Common Crawl index mirror')
parser.add_argument('--cc-sort', action='store', help='default mixed, alternatively: ascending')
parser.add_argument('--from', action='store') # XXX default for cc
parser.add_argument('--from', action='store')
parser.add_argument('--to', action='store')
parser.add_argument('--filter', action='append', help='see CDX API documentation for usage')
parser.add_argument('--get', action='store_true', help='use a single get instead of a paged iteration. default limit=1000')
@@ -93,13 +95,15 @@ def get_version():

def setup(cmd):
kwargs = {}
kwargs['source'] = cmd.cc or cmd.ia or cmd.source or None
kwargs['source'] = 'cc' if cmd.crawl else cmd.cc or cmd.ia or cmd.source or None
if kwargs['source'] is None:
raise ValueError('must specify --cc, --ia, or a --source')
if cmd.wb:
kwargs['wb'] = cmd.wb
if cmd.cc_mirror:
kwargs['cc_mirror'] = cmd.cc_mirror
if cmd.crawl:
kwargs['crawl'] = normalize_crawl([cmd.crawl]) # currently a string, not a list
if getattr(cmd, 'warc_download_prefix', None) is not None:
kwargs['warc_download_prefix'] = cmd.warc_download_prefix
