diff --git a/README.md b/README.md
index 7a660cb..2e84ae7 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,8 @@
-# crawsqueal = "crawl" + "SQL" 🦜
+# website-indexer 🪱
 
-Explore a website archive in your browser.
+This repository crawls a website and stores its content in a SQLite database file.
 
-First, you'll need a
-[Website ARChive (WARC) file](https://archive-it.org/blog/post/the-stack-warc-file/)
-generated by crawling your website of interest. This repository contains
-[one method to run a crawler](#generating-a-crawl-database-from-a-warc-file),
-although numerous other popular tools exist for this purpose. Alternatively,
-you can use an existing WARC from another source, for example the
-[Internet Archive](https://archive.org/search.php?query=mediatype%3A%28web%29).
-
-Next, use this repository to convert your WARC file into a SQLite database file
-for easier querying. Use the SQLite command-line interface to
+Use the SQLite command-line interface to
 [make basic queries](#searching-the-crawl-database)
 about website content including:
 
@@ -24,44 +15,26 @@ about website content including:
 - Crawler errors (404s and more)
 - Redirects
 
-Finally,
-[run the viewer application](#running-the-viewer-application)
-in this repository to explore website content in your browser.
+This repository also contains a Django-based
+[web application](#running-the-viewer-application)
+to explore crawled website content in your browser.
 Make queries through an easy-to-use web form,
 review page details, and export results
 as CSV or JSON reports.
 
-## Generating a crawl database from a WARC file
-
-A [WARC](https://archive-it.org/blog/post/the-stack-warc-file/)
-(Web ARChive) is a file standard for storing web content in its original context,
-maintained by the International Internet Preservation Consortium (IIPC).
-
-Many tools exist to generate WARCs.
-The Internet Archive maintains the
-[Heritrix](https://github.com/internetarchive/heritrix3) web crawler that can generate WARCs;
-a longer list of additional tools for this purpose can be found
-[here](http://dhamaniasad.github.io/WARCTools/).
+## Crawling a website
 
-The common command-line tool
-[wget](https://wiki.archiveteam.org/index.php/Wget_with_WARC_output)
-can also be used to generate WARCs. A sample script to do so can be found in this repository,
-and can be invoked like this:
+Create a Python virtual environment and install required packages:
 
-```sh
-./wget_crawl.sh https://www.consumerfinance.gov/
 ```
-
-This will generate a WARC archive file named `crawl.warc.gz`.
-This file can then be converted to a SQLite database using a command like:
-
-```sh
-./manage.py warc_to_db crawl.warc.gz crawl.sqlite3
+python3.6 -m venv venv
+source venv/bin/activate
+pip install -r requirements/base.txt
 ```
 
-Alternatively, to dump a WARC archive file to a set of CSVs:
+Crawl a website:
 
 ```sh
-./manage.py warc_to_csv crawl.warc.gz
+./manage.py crawl https://www.consumerfinance.gov crawl.sqlite3
 ```
 
 ## Searching the crawl database
@@ -174,7 +147,7 @@ pip install -r requirements/base.txt
 Optionally set the `CRAWL_DATABASE` environment variable to point to a local crawl database:
 
 ```
-export CRAWL_DATABASE=cfgov.sqlite3
+export CRAWL_DATABASE=crawl.sqlite3
 ```
 
 Finally, run the Django webserver:
@@ -237,13 +210,12 @@ yarn fix
 
 ### Sample test data
 
-This repository includes sample web archive and database files for testing
-purposes at `/sample/crawl.warc.gz` and `/sample/sample.sqlite3`.
+This repository includes a sample database file for testing purposes at `/sample/sample.sqlite3`.
 The sample database file is used by the viewer application
 when no other crawl database file has been specified.
 
-The source website content used to generate these files is included in this repository
+The source website content used to generate this file is included in this repository
 under the `/sample/src` subdirectory.
 To regenerate these files, first serve the sample website locally:
 
@@ -256,24 +228,10 @@ This starts the sample website running at http://localhost:8000.
 Then, in another terminal, start a crawl against the locally running site:
 
 ```
-./wget_crawl.sh http://localhost:8000
-```
-
-This will create a WARC archive named `crawl.warc.gz` in your working directory.
-
-Next, convert this to a test database file:
-
-```
-./manage.py warc_to_db crawl.warc.gz sample.sqlite3
+./manage.py crawl http://localhost:8000/ --recreate ./sample/src/sample.sqlite3
 ```
 
-This will create a SQLite database named `sample.sqlite3` in your working directory.
-
-Finally, use these newly created files to replace the existing ones in the `/sample` subdirectory:
-
-```
-mv crawl.warc.gz sample.sqlite3 ./sample
-```
+This will overwrite the test database with a fresh crawl.
 
 ## Deployment
 
diff --git a/sample/sample.sqlite3 b/sample/sample.sqlite3
index 01b0e9b..c5c44fe 100644
Binary files a/sample/sample.sqlite3 and b/sample/sample.sqlite3 differ
diff --git a/viewer/tests/test_csv_export.py b/viewer/tests/test_csv_export.py
index e6e9015..049c5b5 100644
--- a/viewer/tests/test_csv_export.py
+++ b/viewer/tests/test_csv_export.py
@@ -12,5 +12,5 @@ def test_csv_generation(self):
         self.assertEqual(response["Content-Type"], "text/csv; charset=utf-8")
 
         rows = BytesIO(response.getvalue()).readlines()
-        self.assertEqual(len(rows), 3)
+        self.assertEqual(len(rows), 4)
         self.assertEqual(rows[0], codecs.BOM_UTF8 + b"url,title,language\r\n")
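A note on the workflow this diff documents: once `./manage.py crawl` has produced a database such as `crawl.sqlite3` (or the bundled `sample/sample.sqlite3`), it is an ordinary SQLite file that any SQLite client can query. Below is a minimal sketch using Python's standard `sqlite3` module; the crawler's table and column names are not shown in this diff, so the sketch discovers them from the file itself rather than assuming a schema, and the `crawl.sqlite3` path is an assumption:

```python
import sqlite3

# Open a crawl database produced by `./manage.py crawl`. The filename is
# an assumption; use whatever database file your crawl created (the
# bundled sample/sample.sqlite3 also works).
conn = sqlite3.connect("crawl.sqlite3")

# The crawler's schema is not shown in this diff, so list the tables it
# actually created instead of assuming their names.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)

# Once the table holding crawled pages is identified, ordinary SQL
# applies, for example (hypothetical table name):
#   conn.execute("SELECT url, title FROM pages LIMIT 10")
conn.close()
```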