SparkR Example: Digit Recognition on EC2
SparkR provides a digit recognition example program. To try it out on EC2, follow the steps below.
First, launch a Spark cluster on EC2; you can follow the instructions here. Note that for SparkR to work, the Spark version must be 0.9.0 or later.
The solver program uses the popular R package Matrix. To make sure it is available, do the following:
cd /root
wget http://cran.cnr.berkeley.edu/src/contrib/Matrix_1.1-2-2.tar.gz
tar xvzf Matrix_1.1-2-2.tar.gz
R CMD INSTALL Matrix
/root/spark-ec2/copy-dir Matrix_1.1-2-2.tar.gz
/root/spark/sbin/slaves.sh R CMD INSTALL ~/Matrix_1.1-2-2.tar.gz
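Before moving on, it can be worth confirming that R can actually load Matrix. The following is a minimal sketch rather than part of SparkR: the function name `check_matrix` is hypothetical, and it assumes `Rscript` is on the PATH of the node where it runs.

```shell
# Hypothetical helper: succeed only if R can load the Matrix package.
check_matrix() {
    # A missing package makes library() raise an error, which causes
    # Rscript to exit with a non-zero status.
    Rscript -e 'suppressMessages(library(Matrix))' >/dev/null 2>&1
}

# Usage on the master (or on a slave after logging in to it):
#   check_matrix && echo "Matrix OK" || echo "Matrix missing"
```

Running the same check on a slave is a quick way to tell whether the install step reached that node.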
To obtain the MNIST data sets, we use s3cmd. Use the following commands to download and configure it:
cd /root
git clone https://github.com/s3tools/s3cmd.git
cd s3cmd
./s3cmd --configure
The last command starts an interactive setup in which you configure s3cmd, enter your AWS credentials, etc.
After this is done, run:
./s3cmd get s3://mnist-data/train-mnist-dense-with-labels.data /data/train-mnist-dense-with-labels.data
./s3cmd get s3://mnist-data/test-mnist-dense-with-labels.data /data/test-mnist-dense-with-labels.data
/root/spark-ec2/copy-dir /data/
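A failed or interrupted download can leave a missing or empty file behind, so a quick sanity check before launching the job may save time. This is a minimal sketch; the function name `check_mnist_files` is hypothetical, and it only tests that the two files named above exist and are non-empty, not that their contents are valid.

```shell
# Hypothetical helper: check that both MNIST files exist and are non-empty
# in the given directory (defaults to /data, the path used above).
check_mnist_files() {
    dir="${1:-/data}"
    for name in train-mnist-dense-with-labels.data test-mnist-dense-with-labels.data; do
        # -s is true only for a file that exists and has a size above zero
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            return 1
        fi
    done
    echo "both MNIST files present in $dir"
}

# Usage:
#   check_mnist_files /data
```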
If you wish to store the data on ephemeral disks instead of EBS, you can run /root/ephemeral-hdfs/bin/hadoop fs -copyFromLocal /data/train-mnist-dense-with-labels.data / and change the textFile() call to take the corresponding HDFS path.
As the last step, we launch the linear solver program provided in SparkR-pkg/examples:
source /root/spark/conf/spark-env.sh
cd /root/SparkR-pkg
SPARK_MEM=6g ./sparkR examples/linear_solver_mnist.R `cat ~/spark-ec2/cluster-url`
If you are using an instance type that has more memory, you can set a larger executor memory size via SPARK_MEM in the last command. The above should work for m1.large with one slave.
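One way to pick a value for larger instance types is to leave some headroom for the operating system and round down to whole gigabytes. The rule below is a hypothetical heuristic, not something SparkR or Spark prescribes; the function name `suggest_spark_mem` and the 1.5 GB headroom are assumptions, chosen so that an m1.large (7.5 GiB of memory) yields the 6g used above.

```shell
# Hypothetical heuristic: reserve about 1.5 GB for the OS and other
# daemons, then round down to whole gigabytes.
# Argument: total memory of a slave node, in MB.
suggest_spark_mem() {
    total_mb="$1"
    headroom_mb=1536
    usable_gb=$(( (total_mb - headroom_mb) / 1024 ))
    # Never suggest less than 1g
    if [ "$usable_gb" -lt 1 ]; then
        usable_gb=1
    fi
    echo "${usable_gb}g"
}

# Example: m1.large has 7.5 GiB (7680 MB) of memory
suggest_spark_mem 7680   # prints 6g
```

The result can then be substituted for 6g in the launch command, for example `SPARK_MEM=$(suggest_spark_mem 15360)`.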
You can now monitor the job progress using Spark's web UI, at http://<master_hostname>:4040.
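The UI on port 4040 is served by the running application, so it may take a few seconds to appear after launch. The following sketch polls the address until it responds; the function name `wait_for_ui` and the retry defaults are hypothetical, and it assumes `curl` is installed and that port 4040 is reachable from where it runs.

```shell
# Hypothetical helper: poll a URL until it answers or the tries run out.
# Arguments: URL, optional number of tries (default 10).
wait_for_ui() {
    url="$1"
    tries="${2:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors (400 and above) as failure,
        # -o /dev/null: discard the page body
        if curl -s -f -o /dev/null "$url"; then
            echo "UI is up at $url"
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    echo "UI not reachable at $url" >&2
    return 1
}

# Usage, substituting the master's hostname:
#   wait_for_ui "http://<master_hostname>:4040"
```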