Shivaram Venkataraman edited this page Feb 18, 2015 · 7 revisions

This page describes steps to use SparkR on EC2.

Cluster launch

First, launch an EC2 cluster using Spark's EC2 scripts. Note that SparkR requires the EC2 scripts that ship with Spark >= 0.9.0 to work correctly.
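As a sketch, a launch invocation might look like the following; the keypair name, key file, cluster name, and slave count below are all hypothetical placeholders you must replace with your own values:

```shell
# Hypothetical values -- substitute your own AWS keypair, key file, and cluster name.
KEYPAIR=my-keypair          # AWS keypair name
KEY_FILE=my-keypair.pem     # path to the matching private key file
CLUSTER_NAME=sparkr-test    # any name you choose for the cluster
NUM_SLAVES=2                # -s flag: number of slave nodes
# The assembled command (shown here rather than executed):
echo "./spark-ec2 -k $KEYPAIR -i $KEY_FILE -s $NUM_SLAVES launch $CLUSTER_NAME"
```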

Installing dependencies

Next, log in to the EC2 cluster by running ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>.

Install SparkR

Now we are ready to install SparkR on the EC2 cluster. To do this, we need to build SparkR against the same Spark version that is running on the cluster (you can find this by running cat /root/spark/RELEASE). To install SparkR on all the machines in your cluster, run:

cd /root
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
# Build against the Spark version listed in /root/spark/RELEASE
SPARK_VERSION=1.2.1 ./install-dev.sh
# Install the built package into the R library path on the master
cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
# Copy the built package to all slaves
/root/spark-ec2/copy-dir /root/SparkR-pkg
# Install the package into the R library path on every slave
/root/spark/sbin/slaves.sh cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
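The SPARK_VERSION passed to install-dev.sh must match the version recorded in /root/spark/RELEASE. As a sketch, the version number can be extracted from that file automatically; the RELEASE_LINE variable below is a stand-in for the real file's contents, which typically look like the string shown:

```shell
# Stand-in for $(cat /root/spark/RELEASE); the real file contains a line like this.
RELEASE_LINE="Spark 1.2.1 built for Hadoop 1.0.4"
# The second whitespace-separated field is the version number.
SPARK_VERSION=$(echo "$RELEASE_LINE" | awk '{print $2}')
echo "$SPARK_VERSION"
```

On the cluster itself you would read the real file instead, e.g. SPARK_VERSION=$(awk '{print $2}' /root/spark/RELEASE), before invoking install-dev.sh.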

Launch SparkR

Finally, to launch SparkR and connect to the Spark EC2 cluster, run

MASTER=spark://<master_hostname>:7077 ./sparkR

where <master_hostname> can be found by running:

cat /root/spark-ec2/cluster-url
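The two steps can be combined by reading the master URL directly into MASTER. The sketch below simulates the cluster-url file under /tmp so it is self-contained; the hostname is made up, and on a real cluster you would read /root/spark-ec2/cluster-url instead:

```shell
# Simulate the file spark-ec2 writes; on the cluster use /root/spark-ec2/cluster-url.
mkdir -p /tmp/spark-ec2-demo
echo "spark://ec2-54-12-34-56.compute-1.amazonaws.com:7077" > /tmp/spark-ec2-demo/cluster-url
# On the cluster this would be: MASTER=$(cat /root/spark-ec2/cluster-url) ./sparkR
MASTER=$(cat /tmp/spark-ec2-demo/cluster-url)
echo "$MASTER"
```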

You can confirm that you are connected to the EC2 cluster by visiting Spark's Web UI at http://<master_hostname>:8080.

Troubleshooting

If you experience connectivity problems, first check that port 7077 on your master node is open to the machine where you are running R. Then consider the troubleshooting steps at http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark
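One quick way to check the port is a sketch like the following, which uses bash's built-in /dev/tcp redirection so no extra tools are needed; the master hostname below is a placeholder:

```shell
# Hypothetical master hostname -- replace with your cluster's actual hostname.
MASTER_HOST=ec2-54-12-34-56.compute-1.amazonaws.com
# bash can open TCP connections via /dev/tcp; timeout guards against a hang.
if timeout 5 bash -c "exec 3<>/dev/tcp/$MASTER_HOST/7077" 2>/dev/null; then
  echo "port 7077 reachable"
else
  echo "port 7077 unreachable"
fi
```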