This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

Language id

Carmen-digitalPebble edited this page Aug 30, 2012 · 3 revisions

Job jar: behemoth-language-id*job.jar

For simple processing with language id:

usage: 
hadoop jar ./behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i Tika-corpus -o Tikacorpus-lang

For processing and filtering on a specific language:

usage:
hadoop jar behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i Tika-corpus -o Tikacorpus-EN
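
The two usages above differ only in the optional `-D document.filter.md.keep.lang=...` property. As a rough sketch, a small wrapper can build either command line from the same arguments. The function name `langid_cmd` and the `JOB_JAR` variable are hypothetical and not part of Behemoth; the helper only prints the command so it can be checked before the job is submitted.

```shell
#!/bin/sh
# Hypothetical helper (not part of Behemoth): prints the hadoop command
# instead of running it, so the arguments can be reviewed first.

# Assumed jar pattern; override JOB_JAR to point at the actual job jar.
JOB_JAR="${JOB_JAR:-behemoth-lang*job.jar}"
DRIVER="com.digitalpebble.behemoth.languageidentification.LanguageIdDriver"

# langid_cmd INPUT OUTPUT [LANG]
langid_cmd() {
    input="$1"
    output="$2"
    lang="$3"
    filter=""
    # Add the filter property only when a language code is given.
    if [ -n "$lang" ]; then
        filter=" -D document.filter.md.keep.lang=$lang"
    fi
    printf '%s\n' "hadoop jar $JOB_JAR $DRIVER$filter -i $input -o $output"
}

# Language id only
langid_cmd Tika-corpus Tikacorpus-lang
# Language id, keeping only English documents
langid_cmd Tika-corpus Tikacorpus-EN en
```

To actually submit a job, the printed line can be pasted into a shell on a machine where `hadoop` is on the PATH.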
