
Commit

Update README and test database
This change updates the repository README due to the recent rename
from "crawsqueal" to "website-indexer".

It also documents the new wpull-based crawler added in PR 81.

Additionally, it updates the test database with some test data that
should have come along with that PR.
chosak committed Nov 2, 2023
1 parent bcd66f0 commit ebc94e6
Showing 2 changed files with 18 additions and 60 deletions.
78 changes: 18 additions & 60 deletions README.md
@@ -1,17 +1,8 @@
-# crawsqueal = "crawl" + "SQL" 🦜
+# website-indexer 🪱

-Explore a website archive in your browser.
+This repository crawls a website and stores its content in a SQLite database file.

-First, you'll need a
-[Website ARChive (WARC) file](https://archive-it.org/blog/post/the-stack-warc-file/)
-generated by crawling your website of interest. This repository contains
-[one method to run a crawler](#generating-a-crawl-database-from-a-warc-file),
-although numerous other popular tools exist for this purpose. Alternatively,
-you can use an existing WARC from another source, for example the
-[Internet Archive](https://archive.org/search.php?query=mediatype%3A%28web%29).

-Next, use this repository to convert your WARC file into a SQLite database file
-for easier querying. Use the SQLite command-line interface to
+Use the SQLite command-line interface to
[make basic queries](#searching-the-crawl-database)
about website content including:

@@ -24,44 +15,26 @@ about website content including:
- Crawler errors (404s and more)
- Redirects
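
For example, crawler errors can be queried straight from the database with the sqlite3 command-line tool. This is a minimal sketch: the `crawler_error` table and column names are illustrative assumptions, not the repository's documented schema.

```sh
# Table/column names below are assumed; inspect the real schema first:
sqlite3 crawl.sqlite3 ".schema"
# Then query, e.g. a sample of crawler errors (hypothetical table name):
sqlite3 crawl.sqlite3 "SELECT url, status_code FROM crawler_error LIMIT 10;"
```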

-Finally,
-[run the viewer application](#running-the-viewer-application)
-in this repository to explore website content in your browser.
+This repository also contains a Django-based
+[web application](#running-the-viewer-application)
+to explore crawled website content in your browser.
Make queries through an easy-to-use web form, review page details,
and export results as CSV or JSON reports.

-## Generating a crawl database from a WARC file
-
-A [WARC](https://archive-it.org/blog/post/the-stack-warc-file/)
-(Web ARChive) is a file standard for storing web content in its original context,
-maintained by the International Internet Preservation Consortium (IIPC).
-
-Many tools exist to generate WARCs.
-The Internet Archive maintains the
-[Heritrix](https://github.com/internetarchive/heritrix3) web crawler that can generate WARCs;
-a longer list of additional tools for this purpose can be found
-[here](http://dhamaniasad.github.io/WARCTools/).
+## Crawling a website

-The common command-line tool
-[wget](https://wiki.archiveteam.org/index.php/Wget_with_WARC_output)
-can also be used to generate WARCs. A sample script to do so can be found in this repository,
-and can be invoked like this:
+Create a Python virtual environment and install required packages:

```sh
-./wget_crawl.sh https://www.consumerfinance.gov/
-```
-
-This will generate a WARC archive file named `crawl.warc.gz`.
-This file can then be converted to a SQLite database using a command like:
-
-```sh
-./manage.py warc_to_db crawl.warc.gz crawl.sqlite3
+python3.6 -m venv venv
+source venv/bin/activate
+pip install -r requirements/base.txt
```

-Alternatively, to dump a WARC archive file to a set of CSVs:
+Crawl a website:

```sh
-./manage.py warc_to_csv crawl.warc.gz
+./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
```
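
The sample-data instructions later in this README also show a `--recreate` flag that overwrites an existing database with a fresh crawl; applied to the example above, it would look like this:

```sh
# --recreate replaces any existing crawl.sqlite3 rather than reusing it
./manage.py crawl https://www.consumerfinance.gov --recreate crawl.sqlite3
```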

## Searching the crawl database
@@ -174,7 +147,7 @@ pip install -r requirements/base.txt
Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:

```
-export CRAWL_DATABASE=cfgov.sqlite3
+export CRAWL_DATABASE=crawl.sqlite3
```

Finally, run the Django webserver:
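
The command itself is collapsed in this diff; for a Django project such as this one, the standard development server invocation would be expected to work (an assumption, not confirmed by the visible lines):

```sh
# Standard Django development server; listens on port 8000 by default
./manage.py runserver
```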
@@ -237,13 +210,12 @@ yarn fix

### Sample test data

-This repository includes sample web archive and database files for testing
-purposes at `/sample/crawl.warc.gz` and `/sample/sample.sqlite3`.
+This repository includes a sample database file for testing purposes at `/sample/sample.sqlite3`.

The sample database file is used by the viewer application when no other crawl
database file has been specified.

-The source website content used to generate these files is included in this repository
+The source website content used to generate this file is included in this repository
under the `/sample/src` subdirectory.
To regenerate these files, first serve the sample website locally:

@@ -256,24 +228,10 @@ This starts the sample website running at http://localhost:8000.
Then, in another terminal, start a crawl against the locally running site:

```
-./wget_crawl.sh http://localhost:8000
-```
-
-This will create a WARC archive named `crawl.warc.gz` in your working directory.
-
-Next, convert this to a test database file:
-
-```
-./manage.py warc_to_db crawl.warc.gz sample.sqlite3
+./manage.py crawl http://localhost:8000/ --recreate ./sample/src/sample.sqlite3
```

-This will create a SQLite database named `sample.sqlite3` in your working directory.
-
-Finally, use these newly created files to replace the existing ones in the `/sample` subdirectory:
-
-```
-mv crawl.warc.gz sample.sqlite3 ./sample
-```
+This will overwrite the test database with a fresh crawl.
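
A generic way to sanity-check the regenerated file is sqlite3's built-in `.tables` command; this relies only on standard sqlite3 CLI features, with the database path taken from the command above:

```sh
# List the tables present in the regenerated sample database
sqlite3 ./sample/src/sample.sqlite3 ".tables"
```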

## Deployment

Binary file modified sample/sample.sqlite3

