
Commit

Update README and test database
This change updates the repository README due to the recent rename
from "crawsqueal" to "website-indexer".

It also documents the new wpull-based crawler added in PR 81.

Additionally, it updates the test database with some test data that
should have come along with that PR.
chosak committed Nov 2, 2023
1 parent bcd66f0 commit ebc94e6
Showing 2 changed files with 18 additions and 60 deletions.
78 changes: 18 additions & 60 deletions README.md
@@ -1,17 +1,8 @@
-# crawsqueal = "crawl" + "SQL" 🦜
+# website-indexer 🪱

-Explore a website archive in your browser.
+This repository crawls a website and stores its content in a SQLite database file.

-First, you'll need a
-[Website ARChive (WARC) file](https://archive-it.org/blog/post/the-stack-warc-file/)
-generated by crawling your website of interest. This repository contains
-[one method to run a crawler](#generating-a-crawl-database-from-a-warc-file),
-although numerous other popular tools exist for this purpose. Alternatively,
-you can use an existing WARC from another source, for example the
-[Internet Archive](https://archive.org/search.php?query=mediatype%3A%28web%29).

-Next, use this repository to convert your WARC file into a SQLite database file
-for easier querying. Use the SQLite command-line interface to
+Use the SQLite command-line interface to
[make basic queries](#searching-the-crawl-database)
about website content including:

@@ -24,44 +15,26 @@ about website content including:
- Crawler errors (404s and more)
- Redirects
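
For example, crawler errors can be queried straight from the database with the sqlite3 command-line tool. This is a minimal sketch: the `crawler_error` table and column names are illustrative assumptions, not the repository's documented schema.

```sh
# Table/column names below are assumed; inspect the real schema first:
sqlite3 crawl.sqlite3 ".schema"
# Then query, e.g. a sample of crawler errors (hypothetical table name):
sqlite3 crawl.sqlite3 "SELECT url, status_code FROM crawler_error LIMIT 10;"
```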

-Finally,
-[run the viewer application](#running-the-viewer-application)
-in this repository to explore website content in your browser.
+This repository also contains a Django-based
+[web application](#running-the-viewer-application)
+to explore crawled website content in your browser.
Make queries through an easy-to-use web form, review page details,
and export results as CSV or JSON reports.

-## Generating a crawl database from a WARC file
-
-A [WARC](https://archive-it.org/blog/post/the-stack-warc-file/)
-(Web ARChive) is a file standard for storing web content in its original context,
-maintained by the International Internet Preservation Consortium (IIPC).
-
-Many tools exist to generate WARCs.
-The Internet Archive maintains the
-[Heritrix](https://github.com/internetarchive/heritrix3) web crawler that can generate WARCs;
-a longer list of additional tools for this purpose can be found
-[here](http://dhamaniasad.github.io/WARCTools/).
+## Crawling a website

-The common command-line tool
-[wget](https://wiki.archiveteam.org/index.php/Wget_with_WARC_output)
-can also be used to generate WARCs. A sample script to do so can be found in this repository,
-and can be invoked like this:
+Create a Python virtual environment and install required packages:

```sh
-./wget_crawl.sh https://www.consumerfinance.gov/
-```
-
-This will generate a WARC archive file named `crawl.warc.gz`.
-This file can then be converted to a SQLite database using a command like:
-
-```sh
-./manage.py warc_to_db crawl.warc.gz crawl.sqlite3
+python3.6 -m venv venv
+source venv/bin/activate
+pip install -r requirements/base.txt
```

-Alternatively, to dump a WARC archive file to a set of CSVs:
+Crawl a website:

```sh
-./manage.py warc_to_csv crawl.warc.gz
+./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```
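
The sample-data instructions later in this README also show a `--recreate` flag that overwrites an existing database with a fresh crawl; applied to the example above, it would look like this:

```sh
# --recreate replaces any existing crawl.sqlite3 rather than reusing it
./manage.py crawl https://www.consumerfinance.gov --recreate crawl.sqlite3
```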

## Searching the crawl database
@@ -174,7 +147,7 @@ pip install -r requirements/base.txt
Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:

```
-export CRAWL_DATABASE=cfgov.sqlite3
+export CRAWL_DATABASE=crawl.sqlite3
```

Finally, run the Django webserver:
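
The command itself is collapsed in this diff; for a Django project such as this one, the standard development server invocation would be expected to work (an assumption, not confirmed by the visible lines):

```sh
# Standard Django development server; listens on port 8000 by default
./manage.py runserver
```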
@@ -237,13 +210,12 @@ yarn fix

### Sample test data

-This repository includes sample web archive and database files for testing
-purposes at `/sample/crawl.warc.gz` and `/sample/sample.sqlite3`.
+This repository includes a sample database file for testing purposes at `/sample/sample.sqlite3`.

The sample database file is used by the viewer application when no other crawl
database file has been specified.

-The source website content used to generate these files is included in this repository
+The source website content used to generate this file is included in this repository
under the `/sample/src` subdirectory.
To regenerate these files, first serve the sample website locally:

@@ -256,24 +228,10 @@ This starts the sample website running at http://localhost:8000.
Then, in another terminal, start a crawl against the locally running site:

```
-./wget_crawl.sh http://localhost:8000
-```
-
-This will create a WARC archive named `crawl.warc.gz` in your working directory.
-
-Next, convert this to a test database file:
-
-```
-./manage.py warc_to_db crawl.warc.gz sample.sqlite3
+./manage.py crawl http://localhost:8000/ --recreate ./sample/src/sample.sqlite3
```

-This will create a SQLite database named `sample.sqlite3` in your working directory.
-
-Finally, use these newly created files to replace the existing ones in the `/sample` subdirectory:
-
-```
-mv crawl.warc.gz sample.sqlite3 ./sample
-```
+This will overwrite the test database with a fresh crawl.
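
A generic way to sanity-check the regenerated file is sqlite3's built-in `.tables` command; this relies only on standard sqlite3 CLI features, with the database path taken from the command above:

```sh
# List the tables present in the regenerated sample database
sqlite3 ./sample/src/sample.sqlite3 ".tables"
```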

## Deployment

Binary file modified sample/sample.sqlite3

