########################################################################################################################
##### Spark deploy instructions - First step of A2 #####
# These are instructions for deploying Apache Spark.
# They include some hacks relevant to the SNIC cloud that we wouldn't use for a production system.
# For the lab, a Spark cluster has already been deployed -- this will be an experiment!
# 0. Check that the Spark and HDFS cluster is operating by opening these links in your browser (on campus):
# http://130.238.29.245:8080
# http://130.238.29.245:50070
# 0. Create a virtual machine (or use an existing one); this instance will be used to connect to the Spark cluster as a client.
# Your virtual machine should use the 'ssc.small' flavor.
# Use Ubuntu 18.04 LTS as the source image.
# 0. Add it to the 'spark-cluster-client' security group so that it works correctly with Spark.
# (The machines in the Spark cluster need to be able to connect to your VM.)
# 0. Add a floating IP to the VM
# 0. Configure the ~/.ssh/config on your local laptop/desktop/lab computer like this (this applies to the university lab machines and other Unix-like systems, including WSL; you may need to adapt it for other systems):
# Replace 130.238.x.y and ~/.ssh/id_rsa with your floating IP and key path as appropriate.
Host 130.238.x.y
User ubuntu
IdentityFile ~/.ssh/id_rsa
LocalForward 8888 localhost:8888
LocalForward 4040 localhost:4040
LocalForward 4041 localhost:4041
LocalForward 4042 localhost:4042
LocalForward 4043 localhost:4043
This will open SSH tunnels (for the given ports) between your local computer and the VM, enabling you to access e.g. a Jupyter
notebook in your browser at http://localhost:8888
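# With the config above saved, connecting from your local machine is simply (use your own floating IP in place of 130.238.x.y):
# ssh 130.238.x.y
# Keep that ssh session open while you work -- the forwarded ports only exist while it is connected.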
#####################
# Hack to fix an issue with the default Ubuntu package mirror (point apt at se.archive.ubuntu.com instead)
sudo sed -ie 's/nova.clouds.archive.ubuntu.com/se.archive.ubuntu.com/' /etc/apt/sources.list
## For this example, we'll install the Spark worker and master on the same virtual machine. Normally we'd put the master on its own machine.
# update apt repo metadata
sudo apt update
# install java
sudo apt-get install -y openjdk-8-jdk
# manually define a hostname for all the hosts in the ldsa project; this will make networking easier with Spark:
# NOTE! If you have added entries to /etc/hosts yourself, you need to remove those.
for i in {1..255}; do echo "192.168.1.$i host-192-168-1-$i-ldsa" | sudo tee -a /etc/hosts; done
for i in {1..255}; do echo "192.168.2.$i host-192-168-2-$i-ldsa" | sudo tee -a /etc/hosts; done
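# optional sanity check: the new entries should now resolve via /etc/hosts, e.g.
getent hosts host-192-168-1-1-ldsa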
# set the hostname according to the scheme above:
sudo hostname host-$(hostname -I | awk '{$1=$1};1' | sed 's/\./-/'g)-ldsa ; hostname
# ... the next steps depend on what we want this node to be: a master, a worker, or a client running a Python notebook.
# For the lab, we'll just create a Python notebook, and use the existing Spark cluster.
########################################################################################################################
##### Start Spark Master/Worker -- !!!SKIP THIS FOR THE LAB SESSION!!! Proceed to 'Install the Python Notebook...' #####
cd ~
# download Spark
# (Get download link for Hadoop 2.7)
wget http://apache.mirrors.spacedump.net/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
tar -zxvf spark-2.4.1-bin-hadoop2.7.tgz
# Set SPARK_HOME so the Spark scripts and tools can find the installation (the Python version the workers use is configured later, in the notebook section).
echo "export SPARK_HOME=~/spark-2.4.1-bin-hadoop2.7" >> ~/.bashrc
source ~/.bashrc
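# quick check that the variable is set in this shell (you could equally do 'cd $SPARK_HOME' in the next step):
echo $SPARK_HOME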
cd ~/spark-2.4.1-bin-hadoop2.7/
# let's have a look at some of the Spark directories:
ls -l
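# (you should see, among others: bin/ with spark-submit and pyspark, sbin/ with the cluster start/stop scripts, and conf/ with configuration templates)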
# start the master on the current machine:
~/spark-2.4.1-bin-hadoop2.7/sbin/start-master.sh
# -or-
# start the worker on the current machine (and tell it where the master is listening -- the IP address of our master node from above):
~/spark-2.4.1-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.X.Y:7077
netstat -tna
# is it running?
jps
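# when you want to shut the daemons down again later, the matching stop scripts are in the same sbin/ directory -- run whichever corresponds to what you started:
~/spark-2.4.1-bin-hadoop2.7/sbin/stop-slave.sh
~/spark-2.4.1-bin-hadoop2.7/sbin/stop-master.sh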
########################################################################################################################
##### Install the Python Notebook -- RESUME HERE FOR THE LAB #####
# Env variable so the workers know which Python to use...
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
# install git
sudo apt-get install -y git
# install python dependencies, start notebook
# install the Python package manager 'pip' -- here we install it directly via apt
sudo apt-get install -y python3-pip
# this is a very old version of pip:
python3 -m pip --version
# upgrade it (the new pip goes into ~/.local for the ubuntu user, which 'python3 -m pip' will then pick up)
python3 -m pip install --upgrade pip --user
# install jupyter (installing via pip seems to be broken)
sudo apt install -y jupyter-notebook
# install pyspark and some other useful Python packages
python3 -m pip install pyspark --user
python3 -m pip install pandas --user
python3 -m pip install matplotlib --user
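# quick sanity check that the user-installed packages import cleanly under python3:
python3 -c "import pyspark, pandas, matplotlib; print(pyspark.__version__)"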
# clone the examples from the lectures, so we have a copy to experiment with
git clone https://github.com/benblamey/jupyters-public.git
# start the notebook!
jupyter notebook
# Follow the instructions you see:
#
# Copy/paste this URL into your browser when you connect for the first time,
# to login with a token:
# http://localhost:8888/?token=8af4be03b08713c66d8d093a7d684108c69c86f5b63dd
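# (On a headless VM, Jupyter cannot open a browser itself; it just prints the URL above. You can make this explicit with the
#  standard flags: jupyter notebook --no-browser --port=8888 )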
# Now you can run the examples from the lectures in your own notebook.
# Start with ldsa-2019/Lecture1_Example1_ArraySquareandSum.ipynb
# You'll need to change the host names for the Spark master and the HDFS namenode to:
# 192.168.1.153
# When you start your application, you'll see it running in the Spark master web GUI (link at the top).
# If you hover over the link to your application, you'll see the port number for the Web GUI for your application.
# It will be 4040,4041,...
# You can open the GUI in your web browser like this (e.g.):
# http://localhost:4040
########################################################################################################################
##### Creating your own notebook that deploys spark jobs to the cluster #####
# When working on your own notebooks, save them in your own git repository (the one you created in A1 -- clone it onto this VM), and
# make sure to commit and push changes often (for backup purposes).
# You can use ldsa-2019/Lecture1_Example1_ArraySquareandSum.ipynb as a template for your notebooks
# You need to share the Spark cluster with the other students:
# 1. Start your application with dynamic allocation enabled, a timeout of no more than 30 seconds, and a cap on CPU cores:
#spark_session = SparkSession\
# .builder\
# .master("spark://master:7077") \
# .appName("blameyben_lecture1_simple_example")\
# .config("spark.dynamicAllocation.enabled", True)\
# .config("spark.shuffle.service.enabled", True)\
# .config("spark.dynamicAllocation.executorIdleTimeout","30s")\
# .config("spark.executor.cores",4)\
# .getOrCreate()
# 2. Put your name in the name of your application.
# 3. Kill your application when you have finished with it (e.g. by calling spark_session.stop() at the end of your notebook).
# 4. Don't interfere with any of the virtual machines in the cluster. Don't attempt to add your own machine to the cluster.
# 5. Run one app at a time.
# 6. When the lab is not running, you can use more resources, but keep an eye on other people using the system.