Day 001: How to install a local Spark environment
=================================================
The challenge for my first day with Apache Spark is to get it installed and running on my Ubuntu laptop.
Even though Apache Spark can also run on Windows machines, I prefer using a Linux environment. Spark is intended to run
on big clusters of, in most cases, Linux machines. Therefore, most of the Spark documentation and resources I find
on the internet relate to Linux. By using Ubuntu instead of Windows, I can replay most tutorials nearly
one-to-one on my laptop without being confused by backslash-formatted paths, different filesystem structures and
environment variable schemes, or PowerShell vs. bash syntax.
Let's get started.
Step 1: Java
------------
Apache Spark is natively written in Scala but runs on the JVM, therefore I need a JRE on my system.
The current pyspark version (2.4.5) requires Java 8, so the first step is to check whether it is already installed
on my system.
jie@prime73:~$ java -version
jie@prime73:~$
OK, it seems I have no Java at all, so I have to install it right now.
sudo apt install openjdk-8-jdk
sudo update-alternatives --config java
Now I select the option "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java".
Edit ~/.bashrc to set the JAVA_HOME environment variable and include JAVA_HOME/bin in PATH:
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH=$JAVA_HOME/bin:$PATH
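To make the change visible in my current shell, here is a small sketch (assuming the two lines above went into ~/.bashrc):
source ~/.bashrc
echo $JAVA_HOME   # should print /usr/lib/jvm/java-8-openjdk-amd64
which java        # should now resolve to $JAVA_HOME/bin/java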
Let's see whether the Java version check succeeds now.
jie@prime73:~$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~19.10-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
Fine, so let's move on.
Step 2: Downloading Spark
-------------------------
Apache Spark is open source, so I can download the latest version (spark-2.4.5-bin-hadoop2.7.tgz) directly from
https://spark.apache.org/downloads.html.
Now I extract the archive into my spark folder and edit ~/.bashrc again to set the SPARK_HOME environment variable
and include SPARK_HOME/bin in my PATH:
export SPARK_HOME="/home/jie/spark/spark-2.4.5-bin-hadoop2.7"
export PATH=$SPARK_HOME/bin:$PATH
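The extraction itself is just unpacking the tarball. A minimal sketch, assuming the archive landed in ~/Downloads and my Spark installations live under ~/spark:
mkdir -p ~/spark
tar -xzf ~/Downloads/spark-2.4.5-bin-hadoop2.7.tgz -C ~/spark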
Step 3: Python and pyspark
--------------------------
Besides Scala and Java, Spark provides a Python API, which I prefer here because Python is the most popular language in the
data science area and integrates well with Python-based libraries like pandas, numpy, scikit-learn, etc.
The current pyspark version (2.4.5) requires Python 3.7 as interpreter, which is already the default version on my
Ubuntu 19.10:
jie@prime73:/usr/bin$ ls -la python3
lrwxrwxrwx 1 root root 9 Feb 22 02:06 python3 -> python3.7
On older Ubuntu versions I had to install it manually:
sudo apt install python3.7
The easiest way to install pyspark is by using pip. To get pyspark running, I need to install some further packages it
depends on:
pip install findspark
pip install jupyter
pip install numpy
pip install pandas
pip install pypandoc
pip install pyspark
and that's it. Installation completed.
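As a quick sanity check (a minimal sketch on my part, assuming the SPARK_HOME set above), I can verify from a plain Python interpreter that pyspark is importable and reports the expected version:
import findspark
findspark.init()   # picks up SPARK_HOME and puts pyspark on sys.path

import pyspark
print(pyspark.__version__)   # expecting 2.4.5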
The jupyter package is not mandatory for running Spark, but I want to use it as a convenient way
to explore Spark interactively, step by step: a notebook lets me combine directly runnable code, Markdown comments for
documentation, and the observed results for demonstration in one place. Furthermore, I can easily share
my Jupyter notebooks with other people.
Now I want to check whether Spark runs on my system.
Step 4: Starting an interactive Spark shell
-------------------------------------------
I created a virtual Python environment for my pyspark-tutorial project to avoid side effects on other Python projects
on my system, so I change to my pyspark-tutorial venv folder (a sketch of the venv setup follows below).
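For reference, a minimal sketch of how such a venv can be created and activated (the path is just the one from my machine):
python3.7 -m venv ~/python/python3-7/pyspark-tutorial
source ~/python/python3-7/pyspark-tutorial/bin/activate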
The easiest way to check whether my Spark installation was successful is to start Spark's interactive Python shell.
jie@prime73:~/python/python3-7/pyspark-tutorial$ ./bin/pyspark
Python 2.7.17 (default, Nov 7 2019, 10:07:09)
[GCC 9.2.1 20191008] on linux2
Type "help", "copyright", "credits" or "license" for more information.
20/03/29 13:21:03 WARN Utils: Your hostname, prime73 resolves to a loopback address: 127.0.1.1; using 192.168.178.29 instead (on interface wlp116s0)
20/03/29 13:21:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/03/29 13:21:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 2.7.17 (default, Nov 7 2019 10:07:09)
SparkSession available as 'spark'.
>>>
Looks good. The prompt >>> indicates that I can enter Python code directly into Spark now. So it's time for the 'hello
world' moment. I want to generate a sequence of 100 integers as my test data and put them into something table-like. To
keep the output short, I just want to see the first 10 rows.
>>> myFirstDataFrame = spark.range(100).toDF("number")
>>> myFirstDataFrame.show(10)
+------+
|number|
+------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+------+
only showing top 10 rows
>>>
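Before leaving the shell, a small optional extension (my own sketch, not part of the original session) is to filter the DataFrame with a SQL-style condition, e.g. keeping only the even numbers:
>>> myFirstDataFrame.where("number % 2 = 0").show(5)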
Great. I think that's pretty good for my first day, so I leave the Spark Python shell by pressing Ctrl-D.
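One detail I noticed in the banner above: the shell picked up Python 2.7.17 rather than the Python 3.7 interpreter I intended. A hedged fix (PYSPARK_PYTHON is Spark's standard environment variable for choosing the Python interpreter) would be to set it before launching the shell:
export PYSPARK_PYTHON=python3.7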
Step 5: Launching Jupyter Notebook
----------------------------------
Before I finish for today, I'd like to test that Jupyter notebooks are working on my machine. From my Python venv for
this project, I just have to issue the jupyter notebook command and pass the root path of my project, so the notebook
will open up right in the folder I want to store my project files in.
jie@prime73:~/python/python3-7/pyspark-tutorial/bin$ ./jupyter notebook /home/jie/github/pyspark-tutorial
If my installation was successful, I should get the following response, including the URL of the notebook server to open in
my web browser.
[I 15:48:27.655 NotebookApp] The Jupyter Notebook is running at:
[I 15:48:27.655 NotebookApp] http://localhost:8888/?token=b826df9a204526edb0d1a5547fe193a61ccf0da6e25467e1
[I 15:48:27.655 NotebookApp] or http://127.0.0.1:8888/?token=b826df9a204526edb0d1a5547fe193a61ccf0da6e25467e1
[I 15:48:27.655 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:48:27.688 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/jie/.local/share/jupyter/runtime/nbserver-7384-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=b826df9a204526edb0d1a5547fe193a61ccf0da6e25467e1
or http://127.0.0.1:8888/?token=b826df9a204526edb0d1a5547fe193a61ccf0da6e25467e1
Cool, it works. So from day 2 on, I can develop and demonstrate my Spark examples in the more convenient Jupyter
environment.
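Unlike the pyspark shell, a notebook does not come with a ready-made SparkSession, so the first cell of my future notebooks will probably look something like this (a sketch, assuming findspark can pick up the SPARK_HOME set earlier):
import findspark
findspark.init()   # makes the pyspark package from SPARK_HOME importable

from pyspark.sql import SparkSession

# local[*] runs Spark locally, using all available CPU cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-tutorial") \
    .getOrCreate()

spark.range(5).show()   # quick smoke test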
There is only one question left: where can I get data to play around with in Spark?
Step 6: Downloading test data
-----------------------------
These days, more and more open data is published on the internet, often by public institutions. For my very first
learning steps I will use the data sets published on GitHub by Databricks, the company that is the main contributor to the
Apache Spark project:
https://github.com/databricks/Spark-The-Definitive-Guide
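Getting the data onto my machine is then a single clone (a sketch; placing it next to my project repo is just my personal choice):
git clone https://github.com/databricks/Spark-The-Definitive-Guide.git ~/github/Spark-The-Definitive-Guide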
Summary
-------
Challenge completed. My Spark environment is up and running on my Ubuntu laptop and has the following setup:
- OpenJDK 1.8.0_242 with JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
- Spark 2.4.5 with SPARK_HOME="/home/jie/spark/spark-2.4.5-bin-hadoop2.7"
- Python 3.7
- pyspark 2.4.5
- my project repo in /home/jie/github/pyspark-tutorial
It's time to have a beer.