This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

IO Module

butlermh edited this page Jun 4, 2011 · 6 revisions

The IO commands are packaged in behemoth-io.job.

usage: com.digitalpebble.behemoth.io.nutch.NutchSegmentConverter <segment> <output>
<segment>             The Nutch segment on HDFS.
<output>              The output path on HDFS.

Converts a Nutch segment into a Behemoth corpus.

usage: com.digitalpebble.behemoth.io.warc.WARCConverterJob <archive> <output>
<archive>             The WARC archive on HDFS.
<output>              The output path on HDFS.

Converts a WARC archive into a Behemoth corpus.
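
As a rough sketch of how the two converters above might be launched, the script below passes each class name to `hadoop jar` together with behemoth-io.job. Launching through `hadoop jar` is an assumption based on ordinary Hadoop usage, not something this page states, and every path shown is hypothetical. The `run` helper is also made up for this example: it only prints each command unless `RUN=1` is set, so the script can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical locations, chosen only for illustration.
JOB=behemoth-io.job
SEGMENT=/data/nutch/segments/20110604000000
ARCHIVE=/data/warc/sample.warc

# Illustrative helper: print the command by default,
# execute it only when RUN=1 is set in the environment.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Nutch segment -> Behemoth corpus (arguments: <segment> <output>)
run hadoop jar "$JOB" com.digitalpebble.behemoth.io.nutch.NutchSegmentConverter \
  "$SEGMENT" /data/behemoth/nutch-corpus

# WARC archive -> Behemoth corpus (arguments: <archive> <output>)
run hadoop jar "$JOB" com.digitalpebble.behemoth.io.warc.WARCConverterJob \
  "$ARCHIVE" /data/behemoth/warc-corpus
```

Keeping the dry run as the default makes it easy to check the class names and paths before anything is submitted to the cluster.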

