Andrés Aguilar-Umana edited this page Jul 24, 2017 · 7 revisions

Welcome to the heritrix-connector wiki!

The Heritrix Connector is designed to be used in conjunction with the Aspire Content Processing system.

It's a web crawler (obviously!) based on the Heritrix engine. It accepts a number of URL seeds (and all the other usual Heritrix parameters), starts the Heritrix engine, and passes it a job that performs all the hard work.

URLs found by Heritrix are passed to the Aspire Content Processing system as adds. The connector also handles deletes: its settings let a configurable number of crawl iterations, or a number of days since a URL was last seen, pass before the URL is sent to Aspire as a delete.
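
The delete handling described above can be sketched roughly as follows. This is only an illustration of the idea, not the connector's actual code: the names `DeletePolicy`, `UrlRecord`, `should_delete` and `end_of_crawl` are hypothetical, and it assumes a URL becomes a delete as soon as either configured threshold (missed iterations or days unseen) is exceeded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class DeletePolicy:
    """Hypothetical settings; not the connector's real configuration names."""
    max_missed_iterations: Optional[int] = None  # crawls a URL may be missing from
    max_days_unseen: Optional[int] = None        # days a URL may go unseen


@dataclass
class UrlRecord:
    url: str
    last_seen: datetime
    missed_iterations: int = 0


def should_delete(record: UrlRecord, policy: DeletePolicy, now: datetime) -> bool:
    """Assumption: exceeding either configured threshold triggers a delete."""
    if (policy.max_missed_iterations is not None
            and record.missed_iterations > policy.max_missed_iterations):
        return True
    if (policy.max_days_unseen is not None
            and now - record.last_seen > timedelta(days=policy.max_days_unseen)):
        return True
    return False


def end_of_crawl(records: Dict[str, UrlRecord], seen_urls: Set[str],
                 policy: DeletePolicy, now: datetime) -> Tuple[List[str], List[str]]:
    """Update the URL records after one crawl and return (adds, deletes)."""
    # Every URL found in this crawl is reported as an add and its counters reset.
    for url in seen_urls:
        records[url] = UrlRecord(url, last_seen=now)
    adds = sorted(seen_urls)

    # URLs known from earlier crawls but not found this time count as missed.
    deletes = []
    for url, record in list(records.items()):
        if url in seen_urls:
            continue
        record.missed_iterations += 1
        if should_delete(record, policy, now):
            deletes.append(url)
            del records[url]
    return adds, sorted(deletes)
```

With `DeletePolicy(max_missed_iterations=2)`, for example, a URL that disappears from the site is kept for two further crawls and reported as a delete on the third crawl that misses it.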

In order to use this connector "as is", you'll need to download and install the Aspire Content Processing system.

Downloading and Installing Aspire

Building the Heritrix Connector

Using the Heritrix Connector