Quick Start
jpatanooga edited this page Dec 4, 2012 · 34 revisions
- A CDH4.1 Hadoop cluster with YARN installed
- Iterative Reduce Framework
- Build Iterative Reduce
- Needs https://github.com/emsixteeen/IterativeReduce
- git clone https://github.com/emsixteeen/IterativeReduce.git
- cd IterativeReduce
- mvn package -DskipTests
- Get/Build Knitting Boar
- Needs the Iterative Reduce jar
- git clone https://github.com/jpatanooga/KnittingBoar.git
- cd KnittingBoar
- mvn package -DskipTests
- Download the 20newsgroups dataset: [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz)
- Run the converter tool as [described here](https://github.com/jpatanooga/KnittingBoar/wiki/Command-Line-Usage)
- Unzip the data
- Run the converter creating a single large archive in the correct format:
- ./convert_20newsgroups.sh --input ./20news-bydate-train/ --output ./ --recordsPerBlock 12000
- Copy the dataset to your Hadoop Cluster (HDFS)
- Run the yarn job via the command line driver
We want to send a single file to Knitting Boar (multiple-file input is not working yet) and also set the "partition size" of the file. The partition size is the size of each split that a worker in Iterative Reduce processes, and it is set through the HDFS block size when the file is copied into HDFS:
hdfs dfs -Ddfs.block.size=bytes -put <src> <dst>
In this case we'll use the 20newsgroups dataset, which is around 16MB once converted. We want to set the block size to 4MB (4194304 bytes). The command looks like:
hdfs dfs -Ddfs.block.size=4194304 -put /my/local/dir/20news.txt hdfs:///somewhere/in/hdfs/
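As a quick sanity check on the numbers above, here is a minimal Python sketch (not part of Knitting Boar) that converts a block size in MiB to the byte value passed to -Ddfs.block.size and estimates how many block-sized splits a file would produce. The helper names are hypothetical, and the exact 16 MiB file size is an assumption, since the converted file is only described as "around 16MB".

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def block_size_bytes(mebibytes: int) -> int:
    """Convert a block size in MiB to the byte count used with -Ddfs.block.size."""
    return mebibytes * MIB


def expected_splits(file_size_bytes: int, block_size: int) -> int:
    """Estimate the number of block-sized splits for a single file (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


block = block_size_bytes(4)
print(block)  # 4194304

# Assumption: the converted file is exactly 16 MiB; the real file is only "around 16MB".
print(expected_splits(16 * MIB, block))  # 4
```

With these inputs the file would be cut into four splits, which is consistent with the four workers that report in the sample output at the end of this page.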
- Primer on running YARN Apps:
- Make sure your input data is in HDFS
- Compile Knitting Boar
- link here
- Create a custom app.properties file
- Use this example: https://github.com/emsixteeen/IterativeReduce/blob/master/app.properties
- We need to add a few properties for our custom setup
- We've done this for you here:
- https://github.com/jpatanooga/KnittingBoar/blob/master/app.properties
- You'll need to modify a few things such as:
app.input.path
app.output.path
git pull
mvn package -DskipTests
rsync app.properties target/KnittingBoar-1.0-SNAPSHOT-jar-with-dependencies.jar target/lib/iterativereduce-0.1-SNAPSHOT.jar target/lib/avro-1.7.1.jar target/lib/avro-ipc-1.7.1.jar [email protected]:~/hommies/
[cluster01]$ hdfs dfs -Ddfs.block.size=4194304 -put /my/local/dir/20news.txt hdfs:///somewhere/in/hdfs/
[cluster01]$ yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
On a YARN-client execute the following:
# yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
app.properties is the local filesystem path to your application properties file. If it is not specified, IterativeReduce looks for the file ./app.properties
# This is the path for the IterativeReduce JAR
iterativereduce.jar.path=iterativereduce-0.1-SNAPSHOT.jar
# Path to your application (which was compiled against IR!)
app.jar.path=KnittingBoar-1.0-SNAPSHOT-jar-with-dependencies.jar
# Comma separated list of other JARs required for dependencies
app.lib.jar.path=avro-1.7.1.jar,avro-ipc-1.7.1.jar
# Input file(s) to process
app.input.path=hdfs:///user/josh/datasets/20news/four_shards/kboar.txt
# Output results to
app.output.path=/tmp/josh/d11_20_2012/model_12_55pm.model
# Number of iterations
app.iteration.count=3
app.name=IR_SGD_Broski
# Requested memory for YARN clients
yarn.memory=512
# The main() class/entry for the AppMaster
yarn.master.main=com.cloudera.knittingboar.sgd.iterativereduce.POLRMasterNode
# Any extra command-line args
yarn.master.args=
# The main() class/entry for the AppWorker
yarn.worker.main=com.cloudera.knittingboar.sgd.iterativereduce.POLRWorkerNode
# Any extra command-line args
yarn.worker.args=
# Any other configuration params, will be pushed down to clients
com.cloudera.knittingboar.setup.FeatureVectorSize=10000
com.cloudera.knittingboar.setup.numCategories=20
com.cloudera.knittingboar.setup.RecordFactoryClassname=com.cloudera.knittingboar.records.TwentyNewsgroupsRecordFactory
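Before submitting the job it can help to confirm that the properties file has the entries you expect. The following is a minimal Python sketch, not a tool shipped with Knitting Boar or IterativeReduce. It handles only simple key=value lines and # comments (not the full Java properties syntax), and the list of "required" keys is an assumption drawn from the example above, not something the framework documents.

```python
# Assumption: these keys are treated as required for this check only.
REQUIRED_KEYS = [
    "iterativereduce.jar.path",
    "app.jar.path",
    "app.input.path",
    "app.output.path",
    "app.iteration.count",
    "yarn.master.main",
    "yarn.worker.main",
]


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_keys(props: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not props.get(key)]


sample = """\
# Number of iterations
app.iteration.count=3
# Any extra command-line args
yarn.master.args=
app.input.path=hdfs:///user/josh/datasets/20news/four_shards/kboar.txt
"""

props = parse_properties(sample)
print(props["app.iteration.count"])
print(missing_keys(props))
```

To check a real file, pass the contents of your own app.properties to parse_properties instead of the sample string.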
- Go to the cluster web interface
- http://(cluster-address):8088/cluster
- Find the job name, something like:
- application_1352770589658_0020
- For the Application Master click on "logs"
- Then click on stdout
- The output should look similar to the following:
-----------------------------------------
# Master Conf #
Number Iterations: 3
-----------------------------------------
Master Compute: SuperStep - Worker Info -----
[Master] WorkerReport[0]: I: 0, IC: 1 Trained Recs: 2850 AvgLogLikelihood: -1.8742067 PercentCorrect: 70.4782
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 0, IC: 1 Trained Recs: 2664 AvgLogLikelihood: -1.7493868 PercentCorrect: 71.17317
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 0, IC: 1 Trained Recs: 2991 AvgLogLikelihood: -1.7783203 PercentCorrect: 70.35493
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 0, IC: 1 Trained Recs: 2809 AvgLogLikelihood: -1.8381199 PercentCorrect: 73.0591
> worker 3 is done with current iteration
Master Compute: SuperStep - Worker Info -----
[Master] WorkerReport[0]: I: 1, IC: 1 Trained Recs: 5700 AvgLogLikelihood: -1.7117476 PercentCorrect: 78.23344
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 1, IC: 1 Trained Recs: 5328 AvgLogLikelihood: -1.5498475 PercentCorrect: 81.4208
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 1, IC: 1 Trained Recs: 5982 AvgLogLikelihood: -1.6347487 PercentCorrect: 80.863655
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 1, IC: 1 Trained Recs: 5618 AvgLogLikelihood: -1.6497002 PercentCorrect: 78.26145
> worker 3 is done with current iteration
Master Compute: SuperStep - Worker Info -----
[Master] WorkerReport[0]: I: 2, IC: 1 Trained Recs: 8550 AvgLogLikelihood: -1.7519032 PercentCorrect: 77.998276
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 2, IC: 1 Trained Recs: 7992 AvgLogLikelihood: -1.5914761 PercentCorrect: 80.05547
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 2, IC: 1 Trained Recs: 8973 AvgLogLikelihood: -1.6793529 PercentCorrect: 80.11336
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 2, IC: 1 Trained Recs: 8427 AvgLogLikelihood: -1.6903981 PercentCorrect: 77.78894
> worker 3 is done with current iteration
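If you copy the Application Master stdout into a text file, a short script can summarise the worker reports per iteration. This is a minimal Python sketch that assumes the log lines have exactly the shape shown above; the regular expression and function name are illustrative and are not part of Knitting Boar.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [Master] WorkerReport[0]: I: 0, IC: 1 Trained Recs: 2850 AvgLogLikelihood: -1.87 PercentCorrect: 70.4
REPORT = re.compile(
    r"WorkerReport\[(?P<worker>\d+)\]:\s*I:\s*(?P<iteration>\d+),\s*IC:\s*\d+\s+"
    r"Trained Recs:\s*(?P<recs>\d+)\s+"
    r"AvgLogLikelihood:\s*(?P<ll>-?[\d.]+)\s+"
    r"PercentCorrect:\s*(?P<pct>[\d.]+)"
)


def summarize(log_text: str) -> dict:
    """Group worker reports by iteration and compute simple aggregates."""
    by_iteration = defaultdict(list)
    for match in REPORT.finditer(log_text):
        by_iteration[int(match["iteration"])].append(
            (int(match["recs"]), float(match["pct"]))
        )
    summary = {}
    for iteration, reports in sorted(by_iteration.items()):
        summary[iteration] = {
            "workers": len(reports),
            "trained_recs": sum(recs for recs, _ in reports),
            "mean_percent_correct": sum(pct for _, pct in reports) / len(reports),
        }
    return summary


# The four iteration-0 reports from the sample output above.
SAMPLE_LOG = """\
[Master] WorkerReport[0]: I: 0, IC: 1 Trained Recs: 2850 AvgLogLikelihood: -1.8742067 PercentCorrect: 70.4782
[Master] WorkerReport[1]: I: 0, IC: 1 Trained Recs: 2664 AvgLogLikelihood: -1.7493868 PercentCorrect: 71.17317
[Master] WorkerReport[2]: I: 0, IC: 1 Trained Recs: 2991 AvgLogLikelihood: -1.7783203 PercentCorrect: 70.35493
[Master] WorkerReport[3]: I: 0, IC: 1 Trained Recs: 2809 AvgLogLikelihood: -1.8381199 PercentCorrect: 73.0591
"""

for iteration, stats in summarize(SAMPLE_LOG).items():
    print(iteration, stats["workers"], stats["trained_recs"],
          round(stats["mean_percent_correct"], 2))
```

Note that the Trained Recs values in the full output grow with each iteration, so the per-iteration sums are running totals, not per-pass counts.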