This repository has been archived by the owner on Jul 10, 2019. It is now read-only.
Mahout module
Julien Nioche edited this page May 16, 2013 · 3 revisions
Mahout commands are found in behemoth-mahout.job.
usage: com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth
-i <input> -o <output> -t <typeToken> -f <featureName> --analyzerName <analyzerName>
[--minSupport <minSupport> --chunkSize <chunkSize> --minDF <minDF>
--maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
--numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
--sequentialAccessVector --namedVector --logNormalize]
Options
--minSupport (-s) minSupport      (Optional) Minimum Support. Default Value: 2
--typeToken (-t) typeToken        The annotation type for Tokens
--featureName (-f) featureName    The name of the feature containing the token values
--analyzerName (-a) analyzerName  The class name of the Lucene Analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. 100-10000 MB
--output (-o) output              The output directory
--input (-i) input                input dir containing the documents in sequence file format
--minDF (-md) minDF               The minimum document frequency. Default is 1
--maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF. Can be used to remove really high frequency terms. Expressed as an integer between 0 and 100. Default is 99.
--weight (-wt) weight             The kind of weight to use. Currently TF or TFIDF
--norm (-n) norm                  The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize
--minLLR (-ml) minLLR             (Optional) The minimum Log Likelihood Ratio (Float). Default is 1.0
--numReducers (-nr) numReducers   (Optional) Number of reduce tasks. Default Value: 1
--maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams to create (2 = bigrams, 3 = trigrams, etc). Default Value: 1
--overwrite (-ow)                 If set, overwrite the output directory
--help (-h)                       Print out help
--sequentialAccessVector (-seq)   (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false
--namedVector (-nv)               (Optional) Whether output vectors should be NamedVectors. If set true else false
--logNormalize (-lnorm)           (Optional) Whether output vectors should be logNormalize. If set true else false
This converts a set of Behemoth documents to vectors. It is based on DictionaryVectorizer in Mahout.
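As a rough illustration of how the options above fit together, the following sketch assembles a command line and prints it rather than running it. Several things here are assumptions and not taken from this page: that the job file is launched with `hadoop jar`, the input and output paths, and the annotation type `Token` with feature name `string`. The usage line also lists --analyzerName, which this sketch leaves out on the assumption that the corpus already carries token annotations. Adjust all of these to your own setup.

```shell
# Dry-run sketch: builds the command line and prints it instead of launching a job.
# Paths and annotation names are placeholders (assumptions), not values from this page.
JOB=behemoth-mahout.job
MAIN=com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth
INPUT=/data/behemoth-corpus      # hypothetical Behemoth corpus in sequence file format
OUTPUT=/data/behemoth-vectors    # hypothetical output directory
TYPE_TOKEN=Token                 # assumed annotation type for tokens
FEATURE=string                   # assumed feature holding the token values

# Launching through "hadoop jar" is an assumption about how the .job file is run.
CMD="hadoop jar $JOB $MAIN -i $INPUT -o $OUTPUT -t $TYPE_TOKEN -f $FEATURE"

# Optional settings from the option list: TFIDF weighting, bigrams, 2-norm, overwrite.
CMD="$CMD --weight TFIDF --maxNGramSize 2 --norm 2 --overwrite"

# Print the assembled command; run it by hand once the values look right.
echo "$CMD"
```

Printing the command first makes it easy to check paths and option values before submitting anything to a cluster.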
An example of how to use this with the standard Mahout commands can be found here.