This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

Mahout module

Julien Nioche edited this page May 16, 2013 · 3 revisions

The Mahout commands are packaged in the behemoth-mahout.job file.

usage: com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth
-i <input> -o <output> -t <typeToken> -f <featureName> --analyzerName <analyzerName> 
[--minSupport <minSupport> --chunkSize <chunkSize>   --minDF <minDF>       
--maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> 
--numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help       
--sequentialAccessVector --namedVector --logNormalize]                          
Options                                                                         
--minSupport (-s) minSupport        (Optional) Minimum Support. Default       
                                    Value: 2                                  
--typeToken (-t) typeToken          The annotation type for Tokens            
--featureName (-f) featureName      The name of the feature containing the    
                                    token values 
--analyzerName (-a) analyzerName    The class name of the Lucene Analyzer                                                                                      
--chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB  
--output (-o) output                The output directory                      
--input (-i) input                  input dir containing the documents in     
                                    sequence file format                      
--minDF (-md) minDF                 The minimum document frequency.  Default  
                                    is 1                                      
--maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.    
                                    Can be used to remove really high         
                                    frequency terms. Expressed as an integer  
                                    between 0 and 100. Default is 99.         
--weight (-wt) weight               The kind of weight to use. Currently TF   
                                    or TFIDF                                  
--norm (-n) norm                    The norm to use, expressed as either a    
                                    float or "INF" if you want to use the     
                                    Infinite norm.  Must be greater or equal  
                                    to 0.  The default is not to normalize    
--minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood      
                                    Ratio(Float)  Default is 1.0              
--numReducers (-nr) numReducers     (Optional) Number of reduce tasks.        
                                    Default Value: 1                          
--maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
                                    create (2 = bigrams, 3 = trigrams, etc)   
                                    Default Value:1                           
--overwrite (-ow)                   If set, overwrite the output directory    
--help (-h)                         Print out help                            
--sequentialAccessVector (-seq)     (Optional) Whether output vectors should  
                                    be SequentialAccessVectors. If set true   
                                    else false                                
--namedVector (-nv)                 (Optional) Whether output vectors should  
                                    be NamedVectors. If set true else false   
--logNormalize (-lnorm)             (Optional) Whether output vectors should  
                                    be logNormalize. If set true else false 

SparseVectorsFromBehemoth converts a set of Behemoth documents into sparse vectors. It is based on Mahout's DictionaryVectorizer.
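As a rough illustration, an invocation through Hadoop might look like the sketch below. Only the class name, the job file name and the option names come from the usage text above. The input and output paths, the annotation type `Token`, the feature name `string` and the chosen thresholds are hypothetical placeholders. The usage line also lists --analyzerName; whether it is needed alongside -t and -f is not stated here, so it is left out.

```shell
#!/bin/sh
# Hypothetical locations; replace with your own Behemoth corpus and output directory.
INPUT="corpus-tokenized"
OUTPUT="corpus-vectors"

# Build the command as positional parameters so it can be inspected before running.
# "Token" and "string" are assumed annotation type / feature names, not defaults.
set -- hadoop jar behemoth-mahout.job \
  com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth \
  -i "$INPUT" -o "$OUTPUT" \
  -t Token -f string \
  --weight TFIDF --minDF 2 --maxDFPercent 90 \
  --namedVector --overwrite

# Print the full command line for review.
echo "$*"

# Execute only when explicitly requested, e.g. RUN=1 sh vectorize.sh
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

With RUN unset the script only prints the command, which makes it easy to check the options before submitting the job to a cluster.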

An example of how to use this with the standard Mahout commands can be found here.

