
Quick Start

jpatanooga edited this page Dec 4, 2012 · 34 revisions

Pre-Requisites

  1. A CDH4.1 Hadoop cluster with YARN installed
  2. Iterative Reduce Framework

Run Knitting Boar in a Few Easy Steps

  1. Build Iterative Reduce
  2. Get/Build Knitting Boar
  3. Download the 20newsgroups dataset: [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz)
    • Run the converter tool as [described here](https://github.com/jpatanooga/KnittingBoar/wiki/Command-Line-Usage)
    • Unzip the data
    • Run the converter creating a single large archive in the correct format:
      • ./convert_20newsgroups.sh --input ./20news-bydate-train/ --output ./ --recordsPerBlock 12000
    • Copy the dataset to your Hadoop Cluster (HDFS)
  4. Run the yarn job via the command line driver

Copying data to HDFS

We want to send a single file to Knitting Boar (multiple-file input doesn't work yet) and also set the "partition size" of the file. The partition size is the size of each split that a worker in Iterative Reduce processes.

hdfs dfs -Ddfs.block.size=bytes -put <src> <dst>

In this case we'll use the 20newsgroups dataset which is around 16MB converted. We want to set the block size to 4MB (4194304 bytes). The command looks like:

hdfs dfs -Ddfs.block.size=4194304 -put /my/local/dir/20news.txt hdfs:///somewhere/in/hdfs/
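To sanity-check the partitioning, you can estimate how many splits (and therefore Iterative Reduce workers) a given file and block size will produce. This is just back-of-the-envelope arithmetic, not part of Knitting Boar; the file size below is the approximate size of the converted dataset mentioned above:

```shell
# Estimate the split count: file size divided by block size, rounded up.
FILE_BYTES=16777216   # ~16 MB, approximate size of the converted 20news.txt
BLOCK_BYTES=4194304   # 4 MB, the value passed via -Ddfs.block.size
SPLITS=$(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))
echo "$SPLITS"        # -> 4, so four workers, one per split
```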

Running Knitting Boar as a CDH4 YARN Application

From Your Laptop

git pull
mvn package -DskipTests
rsync app.properties target/KnittingBoar-1.0-SNAPSHOT-jar-with-dependencies.jar target/lib/iterativereduce-0.1-SNAPSHOT.jar target/lib/avro-1.7.1.jar target/lib/avro-ipc-1.7.1.jar [email protected]:~/hommies/

[cluster01]$ hdfs dfs -Ddfs.block.size=4194304 -put /my/local/dir/20news.txt hdfs:///somewhere/in/hdfs/
[cluster01]$ yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties

Launch the KnittingBoar Application

On a YARN-client execute the following:

# yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties

app.properties is the local-filesystem path to your application properties file. If it isn't specified, IterativeReduce looks for ./app.properties in the current directory.
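That default-lookup behavior is easy to reproduce in a wrapper script with ordinary shell parameter expansion. A minimal sketch (the wrapper itself is hypothetical, not part of IterativeReduce):

```shell
# Use the first argument if given, otherwise fall back to ./app.properties,
# mirroring IterativeReduce's default lookup described above.
PROPS="${1:-./app.properties}"
echo "Using properties file: $PROPS"
```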

What Goes Into app.properties ?

# This is the path for the KnittingBoar JAR
iterativereduce.jar.path=iterativereduce-0.1-SNAPSHOT.jar

# Path to your application (which was compiled against KB!)
app.jar.path=KnittingBoar-1.0-SNAPSHOT-jar-with-dependencies.jar

# Comma-separated list of other JARs required as dependencies
app.lib.jar.path=avro-1.7.1.jar,avro-ipc-1.7.1.jar

# Input file(s) to process
app.input.path=hdfs:///user/josh/datasets/20news/four_shards/kboar.txt

# Output results to
app.output.path=/tmp/josh/d11_20_2012/model_12_55pm.model

# Number of iterations
app.iteration.count=3

app.name=IR_SGD_Broski

# Requested memory for YARN clients
yarn.memory=512
# The main() class/entry for the AppMaster
yarn.master.main=com.cloudera.knittingboar.sgd.iterativereduce.POLRMasterNode
# Any extra command-line args
yarn.master.args=

# The main() class/entry for the AppWorker
yarn.worker.main=com.cloudera.knittingboar.sgd.iterativereduce.POLRWorkerNode
# Any extra command-line args
yarn.worker.args=

# Any other configuration params, will be pushed down to clients
com.cloudera.knittingboar.setup.FeatureVectorSize=10000
com.cloudera.knittingboar.setup.numCategories=20
com.cloudera.knittingboar.setup.RecordFactoryClassname=com.cloudera.knittingboar.records.TwentyNewsgroupsRecordFactory
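Since any extra keys like the last three are pushed down to the workers as-is, a quick grep is a handy way to confirm a key actually made it into the file before launching. A minimal sketch (the file contents here are trimmed from the example above):

```shell
# Write a trimmed example properties file and read one key back out.
cat > app.properties <<'EOF'
app.iteration.count=3
com.cloudera.knittingboar.setup.numCategories=20
EOF

# Match the key at the start of a line, then cut off everything up to '='.
NUM_CATS=$(grep '^com.cloudera.knittingboar.setup.numCategories=' app.properties | cut -d= -f2)
echo "$NUM_CATS"   # -> 20
```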

How Do I Check Results?

  • Go to the cluster web interface
    • http://(cluster-address):8088/cluster
  • Find your application, with an ID like
    • application_1352770589658_0020
  • For the Application Master, click on "logs"
  • Then click on stdout
    • The output should look similar to below
-----------------------------------------
# Master Conf #
Number Iterations: 3
-----------------------------------------



Master Compute: SuperStep - Worker Info ----- 
[Master] WorkerReport[0]: I: 0, IC: 1 Trained Recs: 2850 AvgLogLikelihood: -1.8742067 PercentCorrect: 70.4782
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 0, IC: 1 Trained Recs: 2664 AvgLogLikelihood: -1.7493868 PercentCorrect: 71.17317
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 0, IC: 1 Trained Recs: 2991 AvgLogLikelihood: -1.7783203 PercentCorrect: 70.35493
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 0, IC: 1 Trained Recs: 2809 AvgLogLikelihood: -1.8381199 PercentCorrect: 73.0591
> worker 3 is done with current iteration

Master Compute: SuperStep - Worker Info ----- 
[Master] WorkerReport[0]: I: 1, IC: 1 Trained Recs: 5700 AvgLogLikelihood: -1.7117476 PercentCorrect: 78.23344
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 1, IC: 1 Trained Recs: 5328 AvgLogLikelihood: -1.5498475 PercentCorrect: 81.4208
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 1, IC: 1 Trained Recs: 5982 AvgLogLikelihood: -1.6347487 PercentCorrect: 80.863655
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 1, IC: 1 Trained Recs: 5618 AvgLogLikelihood: -1.6497002 PercentCorrect: 78.26145
> worker 3 is done with current iteration

Master Compute: SuperStep - Worker Info ----- 
[Master] WorkerReport[0]: I: 2, IC: 1 Trained Recs: 8550 AvgLogLikelihood: -1.7519032 PercentCorrect: 77.998276
> worker 0 is done with current iteration
[Master] WorkerReport[1]: I: 2, IC: 1 Trained Recs: 7992 AvgLogLikelihood: -1.5914761 PercentCorrect: 80.05547
> worker 1 is done with current iteration
[Master] WorkerReport[2]: I: 2, IC: 1 Trained Recs: 8973 AvgLogLikelihood: -1.6793529 PercentCorrect: 80.11336
> worker 2 is done with current iteration
[Master] WorkerReport[3]: I: 2, IC: 1 Trained Recs: 8427 AvgLogLikelihood: -1.6903981 PercentCorrect: 77.78894
> worker 3 is done with current iteration
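
The per-worker PercentCorrect figures can be combined into one rough overall number by weighting each worker by the records it trained on. This is just arithmetic over the sample output above; the tool itself doesn't print it:

```shell
# Weight each worker's final-iteration PercentCorrect by its record count
# (numbers taken from the third superstep in the sample output above).
OVERALL=$(awk 'BEGIN {
  recs[0] = 8550; pct[0] = 77.998276;
  recs[1] = 7992; pct[1] = 80.05547;
  recs[2] = 8973; pct[2] = 80.11336;
  recs[3] = 8427; pct[3] = 77.78894;
  for (i = 0; i < 4; i++) { total += recs[i]; weighted += recs[i] * pct[i] }
  printf "%.2f", weighted / total
}')
echo "$OVERALL"   # -> 78.99, overall percent correct across the four workers
```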