add spark helper
Tianhao-Gu committed May 20, 2024
1 parent b09aa49 commit 1921729
Showing 8 changed files with 51 additions and 2 deletions.
3 changes: 3 additions & 0 deletions Dockerfile
@@ -12,6 +12,9 @@ RUN apt-get update && apt-get install -y \
# Install Jupyterlab and other python dependencies
RUN pip3 install jupyterlab==4.2.0 pyspark==3.5.1

COPY ./src/ /src
ENV PYTHONPATH "${PYTHONPATH}:/src"

COPY scripts/entrypoint.sh /opt/
RUN chmod a+x /opt/entrypoint.sh

18 changes: 16 additions & 2 deletions README.md
@@ -63,23 +63,37 @@ When running Spark in the Jupyter notebook container, the default `spark.driver.host`
configuration should be set to the hostname (`SPARK_DRIVER_HOST`) of the container.
In addition, the environment variable `SPARK_MASTER_URL` should also be configured.

#### Using the Predefined SparkSession from `spark.utils.get_spark_session`
```python
from spark.utils import get_spark_session

spark = get_spark_session('TestApp')
```

#### Manually Configuring SparkSession/SparkContext

If you want to configure the SparkSession manually, you can do so as follows:

#### Example SparkSession Configuration
```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
    .getOrCreate()
```

#### Example SparkContext Configuration
```python
import os

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setMaster(os.environ['SPARK_MASTER_URL']) \
    .setAppName("TestSparkJob")
sc = SparkContext(conf=conf)
```

#### Submitting a Job Using Terminal
```bash
/opt/bitnami/spark/bin/spark-submit \
--master $SPARK_MASTER_URL \
/opt/bitnami/spark/examples/src/main/python/pi.py 10 \
2>/dev/null
```
Empty file added src/__init__.py
Empty file.
Binary file added src/__pycache__/__init__.cpython-311.pyc
Binary file not shown.
Empty file added src/spark/__init__.py
Empty file.
Binary file added src/spark/__pycache__/__init__.cpython-311.pyc
Binary file not shown.
Binary file added src/spark/__pycache__/utils.cpython-311.pyc
Binary file not shown.
32 changes: 32 additions & 0 deletions src/spark/utils.py
@@ -0,0 +1,32 @@
import os

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession


def get_spark_session(app_name: str, local: bool = False) -> SparkSession:
"""
Helper to get and manage the `SparkSession` and keep all of our spark configuration params in one place.
:param app_name: The name of the application
:param local: Whether to run the spark session locally or not
:return: A `SparkSession` object
"""

if local:
return SparkSession.builder.appName(app_name).getOrCreate()

spark_conf = SparkConf()

spark_conf.setAll(
[
(
"spark.master",
os.environ.get("SPARK_MASTER_URL", "spark://spark-master:7077"),
),
("spark.app.name", app_name),
]
)

return SparkSession.builder.config(conf=spark_conf).getOrCreate()
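The master-URL fallback above is a plain `os.environ.get` lookup. A standalone sketch of just that resolution step, with no Spark dependency (the helper name `resolve_master` is illustrative and not part of the module):

```python
import os

def resolve_master(default: str = "spark://spark-master:7077") -> str:
    # Mirror get_spark_session's fallback: prefer SPARK_MASTER_URL,
    # otherwise use the compose-network default.
    return os.environ.get("SPARK_MASTER_URL", default)

os.environ["SPARK_MASTER_URL"] = "spark://localhost:7077"
print(resolve_master())  # spark://localhost:7077

del os.environ["SPARK_MASTER_URL"]
print(resolve_master())  # spark://spark-master:7077
```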
