This repository has been archived by the owner on Jul 10, 2019. It is now read-only.
Mahout Commands for Different Processing Steps
Carmen-digitalPebble edited this page Jul 26, 2012 · 1 revision
Mahout commands for kmeans:
Usage:
[--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors. Must be a
                            SequenceFile of Writable, Cluster/Canopy. If k
                            (--numClusters) is also specified, a random set
                            of vectors will be selected and written out to
                            this path first.
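As a rough illustration of how the kmeans options fit together, the sketch below builds one invocation from flags that appear in the usage text. Everything concrete in it is an assumption made for the example: the three paths, k = 10, the iteration cap, the convergence delta, the choice of `CosineDistanceMeasure`, and the `mahout` launcher being on `PATH`. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of Mahout; by default the script only prints the command.

```shell
#!/bin/sh
# Hypothetical locations; replace with your own HDFS or local paths.
INPUT="vectors/tfidf-vectors"        # input vectors (assumed path)
CLUSTERS="kmeans/initial-clusters"   # initial centroids are written here
OUTPUT="kmeans/output"               # clustering output directory

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Because --numClusters is given together with --clusters, a random set of
# input vectors is selected and written to $CLUSTERS before iterating.
kmeans_step() {
  run mahout kmeans \
    --input "$INPUT" \
    --output "$OUTPUT" \
    --clusters "$CLUSTERS" \
    --numClusters 10 \
    --maxIter 20 \
    --convergenceDelta 0.01 \
    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
    --overwrite \
    --clustering
}

kmeans_step
```

Setting `DRY_RUN=0` in the environment makes the same script execute the command instead of printing it, which is a convenient way to review the arguments before committing cluster time.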
Mahout commands for clusterdump:
Usage:
--input (-i) input                        Path to job input directory.
--output (-o) output                      The directory pathname for output.
--outputFormat (-of) outputFormat         The optional output format to
                                          write the results as. Options:
                                          TEXT, CSV or GRAPH_ML
--substring (-b) substring                The number of chars of the
                                          asFormatString() to print
--numWords (-n) numWords                  The number of top terms to print
--pointsDir (-p) pointsDir                The directory containing points
                                          sequence files mapping input
                                          vectors to their cluster. If
                                          specified, then the program will
                                          output the points associated with
                                          a cluster
--samplePoints (-sp) samplePoints         Specifies the maximum number of
                                          points to include _per_ cluster.
                                          The default is to include all
                                          points
--dictionary (-d) dictionary              The dictionary file
--dictionaryType (-dt) dictionaryType     The dictionary file type
                                          (text|sequencefile)
--evaluate (-e)                           Run ClusterEvaluator and
                                          CDbwEvaluator over the input. The
                                          output will be appended to the
                                          rest of the output at the end.
--distanceMeasure (-dm) distanceMeasure   The classname of the
                                          DistanceMeasure. Default is
                                          SquaredEuclidean
--help (-h)                               Print out help
--tempDir tempDir                         Intermediate output directory
--startPhase startPhase                   First phase to run
--endPhase endPhase                       Last phase to run
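
A minimal sketch of a clusterdump call that uses only the options listed above. The paths are assumptions for illustration: the final-clusters directory name (including the iteration number 10), the `clusteredPoints` directory, the dictionary file, and the output file all depend on how the earlier steps were run. The `run` helper and `DRY_RUN` variable are hypothetical and exist only so the script prints the command by default.

```shell
#!/bin/sh
# Hypothetical locations; adjust to match the output of your kmeans run.
FINAL_CLUSTERS="kmeans/output/clusters-10-final"  # assumed iteration number
POINTS_DIR="kmeans/output/clusteredPoints"        # assumed points directory
DICTIONARY="vectors/dictionary.file-0"            # assumed dictionary file
DUMP_FILE="kmeans/clusterdump.txt"                # where the dump is written

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dump the top 20 terms per cluster as text, plus up to 5 member points
# per cluster taken from the points directory.
clusterdump_step() {
  run mahout clusterdump \
    --input "$FINAL_CLUSTERS" \
    --output "$DUMP_FILE" \
    --pointsDir "$POINTS_DIR" \
    --samplePoints 5 \
    --dictionary "$DICTIONARY" \
    --dictionaryType sequencefile \
    --numWords 20 \
    --outputFormat TEXT
}

clusterdump_step
```

Passing `--dictionary` is what lets the dump show terms rather than numeric feature indices, and `--samplePoints` keeps the output manageable when clusters are large, since the default is to include every point.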