This repository has been archived by the owner on Jul 10, 2019. It is now read-only.
jnioche released this 10 Feb 12:04 · 29 commits to master since this release
- core classes are unpacked in the job archives so that the core comman…
- changed version of the job files in behemoth script
- TikaDriver displays help if missing options
- Fixed UIMA processor so that annotations of type Annotations (and not…
- UIMAMapper iterates on AnnotationFS instead of casting to AnnotationImpl
- Index file created by ContentExtractor points to the right part number
- Tika 1.3 + copyright year on license
- behemoth script calls jobs relative to itself
- example custom config uses block compression
- IO : moved lemur code to original package + skip parsing of http resp…
- SparseVectorsFromBehemoth dumps the usage if the input or output is m…
- bugfix toString() BehemothDocument (ArrayOutOfBoundsException) + avoi…
- Applied code formatting + added timings to MapReduce jobs
- CorpusGenerator has timings + more compression and archive formats re…
- Compatibility with CDH 4.1
- Merge pull request #44 from mumrah/master
- Upgrade to Solr 4.3 (thanks to LucidWorks)
- Updated version of Javadoc plugin
- Upgraded to Tika 1.4
- Upgraded version of commons-compress to 1.5
- Can specify AS name for input to GATE doc
- Bugfix NPE when using the GATECorpusGenerator
- updating gate version to 7.1
- Nutch converter takes dir as input + prints out timings
- POM sign artefacts when releasing
- Upgrade hadoop to 1.2.1 and add override method to upgrade
- Add metadata fields as solr dynamic fields if dynamic.fields param is…
- Use prefixes for dynamic fields on annotations and metadata
- Merge pull request #47 from kiranchitturi/master
- Corpusreader uses the filesystem specified in the input path before r…
- Update LICENSE.txt
- WarcFileRecordReader can read from S3 + WARCConverterJob stores http …
- Merge branch 'master' of github.com:DigitalPebble/behemoth
- Get IP address from WARC metadata and store in MD
- WARCConverterJob uses filters
- exclude asm dependency as breaks builds
- GATEDriver returns -1 on error
- GATE documents generated from plain text are marked as not markup awa…
- bugfix httpresponse content length skipped when empty
- Added option to force reparse with Tika