########################################################################################################################
##### Spark deploy instructions - First step of A2 #####
# These are instructions for deploying Apache Spark.
# They include some hacks relevant to the SNIC cloud that we wouldn't use for a production system.
# For the lab, a Spark cluster has already been deployed -- this will be an experiment!
# 0. Check that the Spark and HDFS cluster is operating by opening these links in your browser (on campus):
# http://130.238.29.245:8080
# http://130.238.29.245:50070
# 0. Create a virtual machine (or use an existing one); this instance will be used to connect to the Spark cluster as a client.
# Your virtual machine should use the 'ssc.small' flavor.
# Use Ubuntu 18.04 LTS as the source image.
# 0. Add it to the 'spark-cluster-client' security group so that it works correctly with Spark.
# (The machines in the Spark cluster need to be able to connect to your VM.)
# 0. Add a floating IP to the VM
# 0. Configure the ~/.ssh/config on your local laptop/desktop/lab computer like this (this applies to the university lab machines and other Unix-like systems, including WSL; you may need to adapt it for other systems):
# Replace 130.238.x.y and ~/.ssh/id_rsa with your floating IP and key path as appropriate.
Host 130.238.x.y
User ubuntu
IdentityFile ~/.ssh/id_rsa
LocalForward 8888 localhost:8888
LocalForward 4040 localhost:4040
LocalForward 4041 localhost:4041
LocalForward 4042 localhost:4042
LocalForward 4043 localhost:4043
This will open SSH tunnels (for the given ports) between your local computer and the VM, enabling you to access e.g. a Jupyter
notebook in your browser at http://localhost:8888
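# With the config above saved, connecting from your local machine is simply (use your own floating IP in place of 130.238.x.y):
# ssh 130.238.x.y
# Keep that ssh session open while you work -- the forwarded ports only exist while it is connected.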
#####################
# Hack to fix an issue with the default Ubuntu package mirror (point apt at se.archive.ubuntu.com instead)
sudo sed -ie 's/nova.clouds.archive.ubuntu.com/se.archive.ubuntu.com/' /etc/apt/sources.list
## For this example, we'll install the Spark worker and master on the same virtual machine. Normally we'd put the master on its own machine.
# update apt repo metadata
sudo apt update
# install java
sudo apt-get install -y openjdk-8-jdk
# manually define a hostname for all the hosts in the ldsa project; this will make networking easier with Spark:
# NOTE! If you have added entries to /etc/hosts yourself, you need to remove those.
for i in {1..255}; do echo "192.168.1.$i host-192-168-1-$i-ldsa" | sudo tee -a /etc/hosts; done
for i in {1..255}; do echo "192.168.2.$i host-192-168-2-$i-ldsa" | sudo tee -a /etc/hosts; done
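# optional sanity check: the new entries should now resolve via /etc/hosts, e.g.
getent hosts host-192-168-1-1-ldsa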
# set the hostname according to the scheme above:
sudo hostname host-$(hostname -I | awk '{$1=$1};1' | sed 's/\./-/'g)-ldsa ; hostname
# ... the next steps depend on what we want this node to be: a master, a worker, or a client running a Python notebook.
# For the lab, we'll just create a Python notebook, and use the existing Spark cluster.
########################################################################################################################
##### Start Spark Master/Worker -- !!!SKIP THIS FOR THE LAB SESSION!!! Proceed to 'Install the Python Notebook...' #####
cd ~
# download Spark
# (Get download link for Hadoop 2.7)
wget http://apache.mirrors.spacedump.net/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
tar -zxvf spark-2.4.1-bin-hadoop2.7.tgz
# Set SPARK_HOME so the Spark scripts and tools can find the installation (the Python version the workers use is configured later, in the notebook section).
echo "export SPARK_HOME=~/spark-2.4.1-bin-hadoop2.7" >> ~/.bashrc
source ~/.bashrc
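# quick check that the variable is set in this shell (you could equally do 'cd $SPARK_HOME' in the next step):
echo $SPARK_HOME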
cd ~/spark-2.4.1-bin-hadoop2.7/
# let's have a look at some of the Spark directories:
ls -l
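# (you should see, among others: bin/ with spark-submit and pyspark, sbin/ with the cluster start/stop scripts, and conf/ with configuration templates)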
# start the master on the current machine:
~/spark-2.4.1-bin-hadoop2.7/sbin/start-master.sh
# -or-
# start the worker on the current machine (and tell it where the master is listening -- the IP address of our master node from above):
~/spark-2.4.1-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.X.Y:7077
netstat -tna
# is it running?
jps
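# when you want to shut the daemons down again later, the matching stop scripts are in the same sbin/ directory -- run whichever corresponds to what you started:
~/spark-2.4.1-bin-hadoop2.7/sbin/stop-slave.sh
~/spark-2.4.1-bin-hadoop2.7/sbin/stop-master.sh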
########################################################################################################################
##### Install the Python Notebook -- RESUME HERE FOR THE LAB #####
# Env variable so the workers know which Python to use...
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
# install git
sudo apt-get install -y git
# install python dependencies, start notebook
# install the Python package manager 'pip' -- here we install it directly via apt
sudo apt-get install -y python3-pip
# this is a very old version of pip:
python3 -m pip --version
# upgrade it (the new pip goes into ~/.local for the ubuntu user, which 'python3 -m pip' will then pick up)
python3 -m pip install --upgrade pip --user
# install jupyter (installing via pip seems to be broken)
sudo apt install -y jupyter-notebook
# install pyspark and some other useful Python packages
python3 -m pip install pyspark --user
python3 -m pip install pandas --user
python3 -m pip install matplotlib --user
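# quick sanity check that the user-installed packages import cleanly under python3:
python3 -c "import pyspark, pandas, matplotlib; print(pyspark.__version__)"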
# clone the examples from the lectures, so we have a copy to experiment with
git clone https://github.com/benblamey/jupyters-public.git
# start the notebook!
jupyter notebook
# Follow the instructions you see:
#
# Copy/paste this URL into your browser when you connect for the first time,
# to login with a token:
# http://localhost:8888/?token=8af4be03b08713c66d8d093a7d684108c69c86f5b63dd
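# (On a headless VM, Jupyter cannot open a browser itself; it just prints the URL above. You can make this explicit with the
#  standard flags: jupyter notebook --no-browser --port=8888 )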
# Now you can run the examples from the lectures in your own notebook.
# Start with ldsa-2019/Lecture1_Example1_ArraySquareandSum.ipynb
# You'll need to change the host names for the Spark master and the HDFS namenode to:
# 192.168.1.153
# When you start your application, you'll see it running in the Spark master web GUI (link at the top).
# If you hover over the link to your application, you'll see the port number for the Web GUI for your application.
# It will be 4040,4041,...
# You can open the GUI in your web browser like this (e.g.):
# http://localhost:4040
########################################################################################################################
##### Creating your own notebook that deploys spark jobs to the cluster #####
# When working on your own notebooks, save them in your own git repository (the one you created in A1 -- clone it onto this VM), and
# make sure to commit and push changes often (for backup purposes).
# You can use ldsa-2019/Lecture1_Example1_ArraySquareandSum.ipynb as a template for your notebooks
# You need to share the Spark cluster with the other students:
# 1. Start your application with dynamic allocation enabled, a timeout of no more than 30 seconds, and a cap on CPU cores:
#spark_session = SparkSession\
# .builder\
# .master("spark://master:7077") \
# .appName("blameyben_lecture1_simple_example")\
# .config("spark.dynamicAllocation.enabled", True)\
# .config("spark.shuffle.service.enabled", True)\
# .config("spark.dynamicAllocation.executorIdleTimeout","30s")\
# .config("spark.executor.cores",4)\
# .getOrCreate()
# 2. Put your name in the name of your application.
# 3. Kill your application when you have finished with it (e.g. by calling spark_session.stop() at the end of your notebook).
# 4. Don't interfere with any of the virtual machines in the cluster. Don't attempt to add your own machine to the cluster.
# 5. Run one app at a time.
# 6. When the lab is not running, you can use more resources, but keep an eye on other people using the system.