This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

Mahout module

Julien Nioche edited this page May 16, 2013 · 3 revisions

Mahout commands are found in behemoth-mahout.job.

usage: com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth
-i <input> -o <output> -t <typeToken> -f <featureName> --analyzerName <analyzerName>
[--minSupport <minSupport> --chunkSize <chunkSize> --minDF <minDF>
--maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
--numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
--sequentialAccessVector --namedVector --logNormalize]
--minSupport (-s) minSupport        (Optional) Minimum Support. Default       
                                    Value: 2                                  
--typeToken (-t) typeToken          The annotation type for Tokens            
--featureName (-f) featureName      The name of the feature containing the    
                                    token values 
--analyzerName (-a) analyzerName    The class name of the Lucene Analyzer
--chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB  
--output (-o) output                The output directory                      
--input (-i) input                  input dir containing the documents in     
                                    sequence file format                      
--minDF (-md) minDF                 The minimum document frequency.  Default  
                                    is 1                                      
--maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.    
                                    Can be used to remove really high         
                                    frequency terms. Expressed as an integer  
                                    between 0 and 100. Default is 99.         
--weight (-wt) weight               The kind of weight to use. Currently TF   
                                    or TFIDF                                  
--norm (-n) norm                    The norm to use, expressed as either a
                                    float or "INF" if you want to use the
                                    infinity norm. Must be greater than or
                                    equal to 0. The default is not to normalize.
--minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
                                    Ratio (float). Default is 1.0
--numReducers (-nr) numReducers     (Optional) Number of reduce tasks.        
                                    Default Value: 1                          
--maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
                                    create (2 = bigrams, 3 = trigrams, etc)   
                                    Default Value: 1
--overwrite (-ow)                   If set, overwrite the output directory    
--help (-h)                         Print out help                            
--sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                    be SequentialAccessVectors. True if set,
                                    false otherwise
--namedVector (-nv)                 (Optional) Whether output vectors should
                                    be NamedVectors. True if set, false
                                    otherwise
--logNormalize (-lnorm)             (Optional) Whether output vectors should
                                    be log-normalized. True if set, false
                                    otherwise

SparseVectorsFromBehemoth converts a set of Behemoth documents to Mahout vectors. It is based on Mahout's DictionaryVectorizer.
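
As a rough illustration of how the options above fit together, the following sketch assembles a command line and prints it as a dry run rather than executing it. Several things here are assumptions and not taken from this page: that the job file is launched with `hadoop jar`, the input and output paths, the annotation type `Token` and the feature name `string`, and the chosen values for `--minDF` and `--maxDFPercent`. The helper name `build_vectorizer_cmd` is hypothetical. The usage text does not say whether `-t`/`-f` and `--analyzerName` are alternatives or can be combined, so only `-t`/`-f` are used.

```shell
#!/bin/sh
# Sketch only: build a SparseVectorsFromBehemoth command line and print it.
# Assumed (not stated on this page): launching via `hadoop jar`, and the
# placeholder values Token / string for the annotation type and feature.

build_vectorizer_cmd() {
    input_dir=$1    # directory of Behemoth documents in sequence file format
    output_dir=$2   # directory that will receive the vectors

    # printf repeats the '%s ' format for every argument, giving one line.
    printf '%s ' hadoop jar behemoth-mahout.job \
        com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth \
        -i "$input_dir" -o "$output_dir" \
        -t Token -f string \
        --weight TFIDF --minDF 2 --maxDFPercent 90 \
        --namedVector --overwrite
    printf '\n'
}

# Dry run with hypothetical paths: prints the command, does not run it.
build_vectorizer_cmd /data/behemoth-corpus /data/behemoth-vectors
```

Printing the command first makes it easy to check the option values before submitting a job to a cluster; once the line looks right it can be pasted into a shell or passed to `eval`.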

An example of how to use this with the standard Mahout commands can be found here.

