Update README and test database #82

Merged
2 commits merged on Nov 2, 2023
78 changes: 18 additions & 60 deletions README.md
@@ -1,17 +1,8 @@
# crawsqueal = "crawl" + "SQL" 🦜
# website-indexer 🪱

Explore a website archive in your browser.
This repository crawls a website and stores its content in a SQLite database file.

First, you'll need a
[Website ARChive (WARC) file](https://archive-it.org/blog/post/the-stack-warc-file/)
generated by crawling your website of interest. This repository contains
[one method to run a crawler](#generating-a-crawl-database-from-a-warc-file),
although numerous other popular tools exist for this purpose. Alternatively,
you can use an existing WARC from another source, for example the
[Internet Archive](https://archive.org/search.php?query=mediatype%3A%28web%29).

Next, use this repository to convert your WARC file into a SQLite database file
for easier querying. Use the SQLite command-line interface to
Use the SQLite command-line interface to
[make basic queries](#searching-the-crawl-database)
about website content including:

@@ -24,44 +15,26 @@ about website content including:
- Crawler errors (404s and more)
- Redirects
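
For example, a minimal query against a crawl database using the `sqlite3` shell might look like this. The table name `crawler_page` is an assumption for illustration; run `.tables` first to confirm the actual schema.

```sh
# List the tables in the crawl database to confirm the schema.
sqlite3 crawl.sqlite3 ".tables"

# Hypothetical query assuming pages are stored in a crawler_page table.
sqlite3 crawl.sqlite3 "SELECT url, title FROM crawler_page LIMIT 10;"
```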

Finally,
[run the viewer application](#running-the-viewer-application)
in this repository to explore website content in your browser.
This repository also contains a Django-based
[web application](#running-the-viewer-application)
to explore crawled website content in your browser.
Make queries through an easy-to-use web form, review page details,
and export results as CSV or JSON reports.

## Generating a crawl database from a WARC file

A [WARC](https://archive-it.org/blog/post/the-stack-warc-file/)
(Web ARChive) is a file standard for storing web content in its original context,
maintained by the International Internet Preservation Consortium (IIPC).

Many tools exist to generate WARCs.
The Internet Archive maintains the
[Heritrix](https://github.com/internetarchive/heritrix3) web crawler that can generate WARCs;
a longer list of additional tools for this purpose can be found
[here](http://dhamaniasad.github.io/WARCTools/).
## Crawling a website

The common command-line tool
[wget](https://wiki.archiveteam.org/index.php/Wget_with_WARC_output)
can also be used to generate WARCs. A sample script to do so can be found in this repository,
and can be invoked like this:
Create a Python virtual environment and install required packages:

```sh
./wget_crawl.sh https://www.consumerfinance.gov/
```

This will generate a WARC archive file named `crawl.warc.gz`.
This file can then be converted to a SQLite database using a command like:

```sh
./manage.py warc_to_db crawl.warc.gz crawl.sqlite3
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
```

Alternatively, to dump a WARC archive file to a set of CSVs:
Crawl a website:

```sh
./manage.py warc_to_csv crawl.warc.gz
./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```
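
Once the crawl finishes, you can sanity-check the resulting file before querying it. This is a sketch using a standard SQLite pragma:

```sh
# Confirm the crawl produced a valid, uncorrupted SQLite database.
sqlite3 crawl.sqlite3 "PRAGMA integrity_check;"
```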

## Searching the crawl database
@@ -174,7 +147,7 @@ pip install -r requirements/base.txt
Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:

```
export CRAWL_DATABASE=cfgov.sqlite3
export CRAWL_DATABASE=crawl.sqlite3
```

Finally, run the Django webserver:
@@ -237,13 +210,12 @@ yarn fix

### Sample test data

This repository includes sample web archive and database files for testing
purposes at `/sample/crawl.warc.gz` and `/sample/sample.sqlite3`.
This repository includes a sample database file for testing purposes at `/sample/sample.sqlite3`.

The sample database file is used by the viewer application when no other crawl
database file has been specified.

The source website content used to generate these files is included in this repository
The source website content used to generate this file is included in this repository
under the `/sample/src` subdirectory.
To regenerate the sample database, first serve the sample website locally:

@@ -256,24 +228,10 @@ This starts the sample website running at http://localhost:8000.
Then, in another terminal, start a crawl against the locally running site:

```
./wget_crawl.sh http://localhost:8000
```

This will create a WARC archive named `crawl.warc.gz` in your working directory.

Next, convert this to a test database file:

```
./manage.py warc_to_db crawl.warc.gz sample.sqlite3
./manage.py crawl http://localhost:8000/ --recreate ./sample/sample.sqlite3
```

This will create a SQLite database named `sample.sqlite3` in your working directory.

Finally, use these newly created files to replace the existing ones in the `/sample` subdirectory:

```
mv crawl.warc.gz sample.sqlite3 ./sample
```
This will overwrite the test database with a fresh crawl.

## Deployment

Binary file modified sample/sample.sqlite3
Binary file not shown.
2 changes: 1 addition & 1 deletion viewer/tests/test_csv_export.py
@@ -12,5 +12,5 @@ def test_csv_generation(self):
self.assertEqual(response["Content-Type"], "text/csv; charset=utf-8")

rows = BytesIO(response.getvalue()).readlines()
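# Expect a header row plus three data rows from the regenerated sample database.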
self.assertEqual(len(rows), 3)
self.assertEqual(len(rows), 4)
self.assertEqual(rows[0], codecs.BOM_UTF8 + b"url,title,language\r\n")