Mahout Commands for Different Processing Steps

Carmen-digitalPebble edited this page Jul 26, 2012 · 1 revision


Mahout commands for kmeans:


Usage:
 [--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a         
                            SequenceFile of Writable, Cluster/Canopy.  If k is  
                            also specified, then a random set of vectors will   
                            be selected and written out to this path first
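
As a rough illustration of how the flags above fit together, here is a minimal sketch of a kmeans invocation. The paths, the value of k, the iteration limit, the convergence delta and the choice of distance measure are all hypothetical placeholders, not values from this project. Because both --clusters and --numClusters are given, the seed centroids would be picked at random and written to the --clusters path first, as described above. The script only prints the command unless RUN_MAHOUT=1 is set.

```shell
# Hypothetical paths and parameter values; adjust them to your own data.
INPUT="vectors/tfidf-vectors"   # assumed directory of input vectors
OUTPUT="kmeans-output"          # assumed output directory
SEEDS="kmeans-seeds"            # assumed path for the initial centroids
K=20                            # assumed number of clusters

# Assemble the command line using only flags listed in the usage text.
KMEANS_CMD="mahout kmeans \
--input $INPUT \
--output $OUTPUT \
--clusters $SEEDS \
--numClusters $K \
--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
--maxIter 10 \
--convergenceDelta 0.01 \
--overwrite \
--clustering"

# Show the command; execute it only when explicitly requested.
echo "$KMEANS_CMD"
if [ "${RUN_MAHOUT:-0}" = "1" ]; then
  $KMEANS_CMD
fi
```

Printing the command before running it makes it easy to check the paths before a long job is submitted to the cluster.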

Mahout commands for clusterdump:

Usage:

  --input (-i) input                         Path to job input directory.
  --output (-o) output                       The directory pathname for output. 
  --outputFormat (-of) outputFormat          The optional output format to      
                                             write the results as.  Options:    
                                             TEXT, CSV or GRAPH_ML              
  --substring (-b) substring                 The number of chars of the         
                                             asFormatString() to print          
  --numWords (-n) numWords                   The number of top terms to print   
  --pointsDir (-p) pointsDir                 The directory containing points    
                                             sequence files mapping input       
                                             vectors to their cluster.  If      
                                             specified, then the program will   
                                             output the points associated with  
                                             a cluster                          
  --samplePoints (-sp) samplePoints          Specifies the maximum number of    
                                             points to include _per_ cluster.   
                                             The default is to include all      
                                             points                             
  --dictionary (-d) dictionary               The dictionary file                
  --dictionaryType (-dt) dictionaryType      The dictionary file type           
                                             (text|sequencefile)                
  --evaluate (-e)                            Run ClusterEvaluator and           
                                             CDbwEvaluator over the input.  The 
                                             output will be appended to the     
                                             rest of the output at the end.     
  --distanceMeasure (-dm) distanceMeasure    The classname of the               
                                             DistanceMeasure. Default is        
                                             SquaredEuclidean                   
  --help (-h)                                Print out help                     
  --tempDir tempDir                          Intermediate output directory      
  --startPhase startPhase                    First phase to run                 
  --endPhase endPhase                        Last phase to run
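
To show how these options combine, here is a minimal sketch of a clusterdump invocation. Every path below (the clusters directory, the points directory, the dictionary file and the output file) is a hypothetical placeholder, as are the values for --numWords and --samplePoints. The script only prints the command unless RUN_MAHOUT=1 is set.

```shell
# Hypothetical paths; replace them with the directories from your own run.
CLUSTERS="kmeans-output/clusters-10-final"   # assumed clusters directory
POINTS="kmeans-output/clusteredPoints"       # assumed points directory
DICT="vectors/dictionary.file-0"             # assumed dictionary file
DUMP="clusterdump.txt"                       # assumed output file

# Assemble the command line using only options listed in the usage text.
CLUSTERDUMP_CMD="mahout clusterdump \
--input $CLUSTERS \
--output $DUMP \
--outputFormat TEXT \
--pointsDir $POINTS \
--dictionary $DICT \
--dictionaryType sequencefile \
--numWords 10 \
--samplePoints 5"

# Show the command; execute it only when explicitly requested.
echo "$CLUSTERDUMP_CMD"
if [ "${RUN_MAHOUT:-0}" = "1" ]; then
  $CLUSTERDUMP_CMD
fi
```

With --pointsDir set, the dump would also list points assigned to each cluster, and --samplePoints caps how many are printed per cluster.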
