Home

Welcome to the heritrix-connector wiki!

The Heritrix Connector is designed to be used in conjunction with the Aspire Content Processing system.

It's a web crawler (obviously!) based on the Heritrix engine. It accepts a number of URL seeds (along with all the other usual Heritrix parameters), starts the Heritrix engine, and passes it a job that performs all the hard work.

URLs found by Heritrix are passed to the Aspire Content Processing system as adds. The connector handles deletes through settings that allow a number of crawl iterations, or a number of days since a URL was last seen, to pass before the URL is sent to Aspire as a delete.
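The delete rule can be pictured with a small sketch. This is only an illustration, not the connector's code: the names `UrlRecord`, `should_delete`, `max_missed_iterations` and `max_days_unseen` are hypothetical, and it assumes that either threshold alone is enough to trigger a delete and that reaching a threshold exactly counts as exceeding it. The real setting names and semantics may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UrlRecord:
    """Hypothetical bookkeeping entry for one previously crawled URL."""
    url: str
    last_seen: datetime          # when a crawl last found this URL
    missed_iterations: int = 0   # consecutive crawls that did not find it


def should_delete(
    record: UrlRecord,
    now: datetime,
    max_missed_iterations: Optional[int] = None,
    max_days_unseen: Optional[int] = None,
) -> bool:
    """Return True if the URL should be sent on as a delete.

    Assumption: a threshold left as None is disabled, and either
    enabled threshold being reached is sufficient.
    """
    if (max_missed_iterations is not None
            and record.missed_iterations >= max_missed_iterations):
        return True
    if (max_days_unseen is not None
            and now - record.last_seen >= timedelta(days=max_days_unseen)):
        return True
    return False


# Example: a URL last seen three days ago and missed by two crawls.
record = UrlRecord("http://example.com/a", datetime(2024, 1, 1), missed_iterations=2)
now = datetime(2024, 1, 4)
delete_by_iterations = should_delete(record, now, max_missed_iterations=3)
delete_by_days = should_delete(record, now, max_days_unseen=3)
```

Waiting for several iterations or days before deleting means that a page which is only temporarily unreachable during one crawl is not removed straight away.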

In order to use this connector "as is", you'll need to download and install the Aspire Content Processing system.

Downloading and Installing Aspire

Building the Heritrix Connector

Using the Heritrix Connector
