This repo shows how to set up Jupyter Notebook with PySpark.
- Install Java.
- Download Apache Spark.
- `.bash_profile` must look like this:

```bash
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_271)
export SPARK_HOME=/usr/local/spark-3.2.0-bin-hadoop3.2
export PATH=${PATH}:$JAVA_HOME/bin:$SPARK_HOME/bin
```
- Create a conda environment:

```bash
conda env remove -n pyspark
conda create -n pyspark
conda activate pyspark
conda install -c conda-forge jupyterlab
```
- Start pyspark, which will start JupyterLab:

```bash
# Set SPARK_HOME to switch to different versions.
export SPARK_HOME='/usr/local/spark-3.2.1-bin-hadoop3.2'
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --ip=0.0.0.0 --port 8888 --allow-root --no-browser --NotebookApp.token=""'
${SPARK_HOME}/bin/pyspark \
  --master local[*] \
  --conf spark.driver.memory=31G \
  --conf spark.ui.enabled=true \
  --conf spark.ui.showConsoleProgress=true \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.offHeap.size=31G \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.kryo.unsafe=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.submit.pyFiles='bdt-3.0.0+snapshot-py3.10.egg' \
  --conf spark.jars='bdt-3.0.0-3.2.0-2.12-SNAPSHOT.jar'
```
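Once JupyterLab comes up, a quick sanity check in the first cell confirms the kernel is really driven by pyspark. This is a minimal sketch; `spark` and `sc` are the session and context that the pyspark launcher pre-creates, and the Arrow setting echoes the `--conf` above.

```python
# Run in the first notebook cell. `spark` and `sc` are pre-created by the
# pyspark launcher above; no imports are needed for them.
print(spark.version)  # should match the Spark under SPARK_HOME
print(sc.master)      # local[*]
print(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))  # true

# Quick smoke test on a small DataFrame.
df = spark.range(1_000_000)
df.selectExpr("sum(id) AS total").show()
```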
- (Optional) In IntelliJ, add a Python SDK.
I couldn't figure out how to set up PySpark installed from conda-forge with Jupyter Notebook, without downloading the full Spark distribution. The following instructions don't work:
```bash
conda create -n pyspark
conda activate pyspark
conda install -c conda-forge pyspark=3.2.1 jupyterlab
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark --master local[*]
```
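One alternative worth trying with a conda-forge-only install is to skip the `pyspark` launcher entirely: start `jupyter lab` as usual from the env and create the session inside the notebook. This is a sketch, assuming `pyspark=3.2.1` from conda-forge is importable and Java is on the path; the app name is arbitrary.

```python
# In a notebook started with plain `jupyter lab` from the conda env
# (pyspark installed from conda-forge, no separate Spark download).
# Java must still be available via JAVA_HOME or PATH.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jupyter-pyspark")  # arbitrary name
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

print(spark.version)
spark.range(10).show()
```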
- Set up for `local-cluster` and `spark://` masters (see the sketch after the links below).
- Try `spark.pyspark.python` and `spark.pyspark.driver.python`.
- `$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark`. You can customize the `jupyter` commands by setting `PYSPARK_DRIVER_PYTHON_OPTS`.
- https://spark.apache.org/docs/latest/rdd-programming-guide.html#using-the-shell
- https://spark.apache.org/docs/latest/configuration.html#environment-variables
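For the `local-cluster` and `spark://` follow-ups above, here is a rough sketch of what the in-notebook setup could look like. The master URL and host are placeholders, and whether `spark.pyspark.python` has to be passed on the launch command line instead of via the builder is exactly what that item proposes to try.

```python
import sys
from pyspark.sql import SparkSession

# Placeholder master URLs: a standalone cluster, or local-cluster for testing
# (local-cluster generally needs the full Spark distribution under SPARK_HOME).
master_url = "spark://master-host:7077"    # hypothetical host:port
# master_url = "local-cluster[2,1,1024]"   # 2 workers, 1 core, 1024 MB each

spark = (
    SparkSession.builder
    .master(master_url)
    # Point executors at this env's Python; the driver already runs it.
    .config("spark.pyspark.python", sys.executable)
    .getOrCreate()
)

print(spark.sparkContext.master)
```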