Robot code for accessioning and preservation of Web Archiving Service Seed and Crawl objects.
[Deprecated] Check the Wiki in the robot-master repo.
To run, use the lyber-core infrastructure, which uses bundle exec controller boot to start all robots defined in config/environments/robots_ENV.yml.
Various dependencies, including cdxj-indexer, which is installed via pip3 and poetry, can be found in config/settings.yml and shared_configs (was-robotsxxx branches). To install cdxj-indexer:
$ poetry install
And then to run it:
$ poetry run cdxj-indexer --args --follow --here
See below.
See consul pages in Web Archival portal, esp Web Archiving Development Documentation
Preassembly workflow for web archiving crawl objects (that include WARC or ARC files) to extract and create metadata. It consists of these robots:
- build-was-crawl-druid-tree: reads the crawl object content (ARCs or WARCs and logs) from the directory defined by the crawl object label, builds a druid tree, and copies the content to the druid tree content directory.
- end_was_crawl_preassembly: initiates the accessionWF (of common-accessioning).
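To illustrate what building a druid tree involves, here is a minimal Python sketch (not the robot's actual code). It assumes the usual druid tree layout, in which an identifier such as druid:bc123df4567 maps to the nested path bc/123/df/4567/bc123df4567. The function names, the base directory, and the loose identifier pattern are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: a druid is two letters, three digits, two letters, four digits,
# optionally prefixed with "druid:". Real validation may be stricter.
DRUID_RE = re.compile(r"^(?:druid:)?([a-z]{2})([0-9]{3})([a-z]{2})([0-9]{4})$")


def druid_tree(druid: str, base_dir: str) -> PurePosixPath:
    """Return the nested druid tree directory for a druid under base_dir."""
    match = DRUID_RE.match(druid)
    if match is None:
        raise ValueError(f"not a valid druid: {druid!r}")
    segments = match.groups()
    # One directory per segment, then a final directory named after the bare id.
    return PurePosixPath(base_dir, *segments, "".join(segments))


def content_dir(druid: str, base_dir: str) -> PurePosixPath:
    """Return the content directory inside the druid tree."""
    return druid_tree(druid, base_dir) / "content"
```

For example, content_dir("druid:bc123df4567", "/workspace") yields /workspace/bc/123/df/4567/bc123df4567/content, which is where the ARC/WARC files and logs would be copied under these assumptions.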
Dissemination workflow for web archiving crawl objects. It is kicked off by end-accession, the last step of common-accessioning, which reads the disseminationWF that is suitable for this object type based on the APO. It consists of these robots:
- warc-extractor: extracts WARC files from WACZ files.
- cdxj-generator: performs the basic indexing for the WARC/ARC files and generates CDXJ files (web archiving index files used by pywb). Generates 1 CDXJ file for each WARC file; the generated CDXJ files will be copied to /web-archiving-stacks.
- cdxj-merge: performs two main tasks: 1) merges the individual CDXJ files generated in the previous step with the main index file (/web-archiving-stacks/data/indexes/cdxj/level0.cdxj); 2) sorts the newly generated index file.
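The merge-then-sort behaviour of cdxj-merge can be sketched in Python as follows. This is an illustration only, not the robot's implementation: the function name is hypothetical, the index is assumed to fit in memory, and byte-wise ordering is assumed to be the desired sort order.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def merge_and_sort_cdxj(main_index: Path, new_files: Iterable[Path]) -> int:
    """Merge per-WARC CDXJ files into main_index and sort the result.

    Returns the number of index lines written.
    """
    lines = []
    for path in [main_index, *new_files]:
        if not path.exists():
            continue  # e.g. the main index has not been created yet
        with path.open("rb") as handle:
            # Keep non-empty lines without their line endings.
            lines.extend(line.rstrip(b"\r\n") for line in handle if line.strip())

    # Assumption: plain byte-wise ordering is what the index consumer expects.
    lines.sort()

    # Write to a temporary file in the same directory, then swap it in, so a
    # reader never sees a half-written index.
    fd, tmp_name = tempfile.mkstemp(dir=main_index.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        for line in lines:
            tmp.write(line + b"\n")
    os.replace(tmp_name, main_index)
    return len(lines)
```

A production version would more likely stream a merge of pre-sorted files rather than load everything at once, since level0.cdxj can grow large.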
Preassembly workflow for web archiving seed objects.
It consists of 4 robots:
- desc-metadata-generator: generates the descMetadata in MODS format for the seed object.
- thumbnail-generator: captures a screenshot for the first memento using puppeteer and includes it as the main image for the object. This image will be used in Argo and SearchWorks's sul-embed. If the robot fails to generate a thumbnail, it shows as an error in Argo.
- content-metadata-generator: generates contentMetadata.xml for the thumbnail by processing the contentMetadata.XSLT template against the available thumbnail.jp2.
- end-was-seed-preassembly: initiates the accessionWF (of common-accessioning) and opens/closes version for the old object.
Workflow to route web archiving objects to wasCrawlDisseminationWF based on content type. Note that the wasDisseminationWF itself is fired off by the accessionWF by using the administrative.disseminationWorkflow value in the APO. For example, if the APO has the following, it'll fire off wasDisseminationWF:
"administrative": {
"disseminationWorkflow": "wasDisseminationWF",
It consists of 1 robot:
- start_special_dissemination: sends objects with content type webarchive-binary to wasCrawlDisseminationWF.
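The two routing decisions described above can be sketched in Python as follows. The workflow names, the content type, and the administrative.disseminationWorkflow key come from the text; the function names and the dictionary representation of the APO are hypothetical, and the real robots do not necessarily look like this.

```python
from typing import Optional

CRAWL_CONTENT_TYPE = "webarchive-binary"
CRAWL_WORKFLOW = "wasCrawlDisseminationWF"


def dissemination_workflow(apo: dict) -> Optional[str]:
    """Return the workflow named in the APO's administrative section, if any."""
    return apo.get("administrative", {}).get("disseminationWorkflow")


def special_dissemination_target(content_type: str) -> Optional[str]:
    """Route crawl objects to the crawl dissemination workflow.

    Objects with any other content type are not routed (None).
    """
    if content_type == CRAWL_CONTENT_TYPE:
        return CRAWL_WORKFLOW
    return None
```

Under these assumptions, an APO carrying "disseminationWorkflow": "wasDisseminationWF" selects wasDisseminationWF, and only objects typed webarchive-binary continue on to wasCrawlDisseminationWF.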
There is a scheduled task to roll up the level0.cdxj files into level1 each night, plus additional rollups to level2 and level3, monthly and yearly respectively.
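As a rough illustration of that cadence, the Python sketch below lists which rollup targets would be due on a given date. The text only states nightly, monthly, and yearly; running the monthly rollup on the first of the month and the yearly rollup on 1 January is an assumption made for the example, and the function name is hypothetical.

```python
from datetime import date
from typing import List


def rollups_due(today: date) -> List[str]:
    """Return the index levels that receive a rollup on the given date."""
    due = ["level1"]  # nightly rollup of level0.cdxj
    if today.day == 1:
        due.append("level2")  # assumed: monthly rollup runs on the 1st
        if today.month == 1:
            due.append("level3")  # assumed: yearly rollup runs on 1 January
    return due
```

For instance, rollups_due(date(2024, 3, 15)) returns only level1, while the first day of a month adds level2.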
- Kakadu Proprietary Software Binaries - for JP2 generation
- libvips
- Exiftool
- Puppeteer
- Google Chrome
Download and install demonstration binaries from Kakadu: http://kakadusoftware.com/downloads/
NOTE: If you have upgraded to El Capitan on OS X, you will need to download and re-install the latest version of Kakadu, due to changes made with SIP. These changes moved the old executable binaries to an inaccessible location.
brew install libvips
sudo apt install libvips42
Download latest version from: http://www.sno.phy.queensu.ca/~phil/exiftool
tar -xf Image-ExifTool-#.##.tar.gz
cd Image-ExifTool-#.##
perl Makefile.PL
make test
sudo make install
yarn install
- Verify there are no jobs on the was-robots at https://robot-console-stage.stanford.edu/busy
- Clear collections:
rm -rf /web-archiving-stacks/data/collections/*
- Clear indexes:
rm -rf /web-archiving-stacks/data/indexes/*
- Clear seeds:
rm -rf /was_unaccessioned_data/seed/*
- Clear jobs:
rm -rf /was_unaccessioned_data/jobs/*