SparkR on EC2
This page describes steps to use SparkR on EC2.
First, launch an EC2 cluster using Spark's EC2 scripts. Use the EC2 scripts that ship with Spark 0.9.0 or later; SparkR needs them to work correctly.
Next, log in to the EC2 cluster by running:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
Now we are ready to install SparkR on the EC2 cluster. SparkR must be built against the same Spark version that is running on the cluster. You can find that version by running:
cat /root/spark/RELEASE
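As a minimal sketch, the version can be pulled out of the RELEASE file and reused as SPARK_VERSION for the build. This assumes the first line of the file looks like "Spark 1.2.1 built for Hadoop ...", which may not hold for every Spark build; spark_version_from_release is a hypothetical helper name, not part of Spark or SparkR.

```shell
#!/bin/sh
# Hypothetical helper: print the Spark version found on the first line
# of a RELEASE file. Assumes that line starts with "Spark <version>";
# prints nothing if it does not.
spark_version_from_release() {
    sed -n '1s/^Spark \([0-9][0-9A-Za-z.-]*\).*/\1/p' "$1"
}

# Possible use on the master node (path taken from the text above):
#   SPARK_VERSION=$(spark_version_from_release /root/spark/RELEASE)
#   echo "Cluster runs Spark $SPARK_VERSION"
```

If the helper prints nothing, the file does not match the assumed layout and the version should be read by hand.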
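To install SparkR on all your machines, run the following on the master node, setting SPARK_VERSION to your cluster's Spark version (1.2.1 is shown as an example):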
cd /root
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
SPARK_VERSION=1.2.1 ./install-dev.sh
cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
/root/spark-ec2/copy-dir /root/SparkR-pkg
/root/spark/sbin/slaves.sh cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
Finally, to launch SparkR and connect to the Spark EC2 cluster, run:
MASTER=spark://<master_hostname>:7077 ./sparkR
where <master_hostname> can be found by running:
cat /root/spark-ec2/cluster-url
You can confirm that you are using the EC2 cluster by checking Spark's Web UI at:
http://<master_hostname>:8080
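The two URLs above differ only in scheme and port, so they can be derived from one value. The sketch below assumes the output of cluster-url is either a bare hostname or a full spark://host:port URL and handles both; master_host, master_url and webui_url are hypothetical helper names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reduce a bare hostname or a spark://host:port URL
# to just the hostname.
master_host() {
    # Strip an optional "spark://" prefix and an optional ":<port>" suffix.
    printf '%s\n' "$1" | sed -e 's|^spark://||' -e 's|:[0-9]*$||'
}

# Build the value for MASTER (port 7077, as used in the text above).
master_url() { printf 'spark://%s:7077\n' "$(master_host "$1")"; }

# Build the Web UI address (port 8080, as used in the text above).
webui_url() { printf 'http://%s:8080\n' "$(master_host "$1")"; }

# Possible use on the master node:
#   MASTER=$(master_url "$(cat /root/spark-ec2/cluster-url)") ./sparkR
```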
If you experience connectivity problems, first check that port 7077 on your master node is open to the machine where you are running R. Then consider the troubleshooting steps here: http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark
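One rough way to test the port from the machine running R is sketched below. It assumes that bash (for its /dev/tcp redirection) and the coreutils timeout command are installed on that machine; port_open is a hypothetical helper name, and a failure only tells you the connection could not be opened within 5 seconds, not why.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 5 seconds. Uses bash's /dev/tcp redirection and the
# coreutils `timeout` command (both assumed to be available).
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Possible use, replacing the placeholder with your master's hostname:
#   if port_open "<master_hostname>" 7077; then
#       echo "port 7077 is reachable"
#   else
#       echo "port 7077 is not reachable; check the EC2 security group"
#   fi
```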