
add spark helper #4

Merged: 5 commits · May 20, 2024
Changes from 2 commits
.gitignore (2 changes: 1 addition & 1 deletion)
@@ -1,4 +1,4 @@
.DS_Store
.idea
.coverage

*_pycache__
Dockerfile (3 changes: 3 additions & 0 deletions)
@@ -12,6 +12,9 @@ RUN apt-get update && apt-get install -y \
# Install Jupyterlab and other python dependencies
RUN pip3 install jupyterlab==4.2.0 pyspark==3.5.1

COPY ./src/ /src
ENV PYTHONPATH "${PYTHONPATH}:/src"

COPY scripts/entrypoint.sh /opt/
RUN chmod a+x /opt/entrypoint.sh

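Since the Dockerfile copies `./src/` to `/src` and appends it to `PYTHONPATH`, the helper package becomes importable as a top-level module inside the container. A quick sanity check (a hypothetical interactive session, not part of this PR):

```python
# /src is on PYTHONPATH inside the container, so the package under
# src/spark/ imports as a top-level name:
from spark.utils import get_spark_session  # resolves to /src/spark/utils.py
```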
README.md (18 changes: 16 additions & 2 deletions)
@@ -63,23 +63,37 @@ When running Spark in the Jupyter notebook container, the default `spark.driver.host` should be set to
the hostname (`SPARK_DRIVER_HOST`) of the container.
In addition, the environment variable `SPARK_MASTER_URL` should also be configured.
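
A minimal sketch of that wiring, assuming both `SPARK_DRIVER_HOST` and `SPARK_MASTER_URL` are exported in the notebook container (the app name here is purely illustrative):

```python
import os

from pyspark.sql import SparkSession

# Assumes SPARK_DRIVER_HOST and SPARK_MASTER_URL are set in the
# container environment, per the configuration described above.
spark = (
    SparkSession.builder
    .master(os.environ["SPARK_MASTER_URL"])
    .config("spark.driver.host", os.environ["SPARK_DRIVER_HOST"])
    .appName("DriverHostExample")  # illustrative name
    .getOrCreate()
)
```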

#### Using the Predefined SparkSession from `spark.utils.get_spark_session`
```python
from spark.utils import get_spark_session

spark = get_spark_session('TestApp')
```

Review thread on `from spark.utils import get_spark_session`:

Member: Will `spark` not conflict with other package names?

Collaborator (Author): I don't think so, but just in case I changed the repo name to `common`.

Member: `common` seems like it'd conflict too.

Member: https://pypi.org/project/common/ seems dead. That being said, https://pypi.org/project/spark/ seems super dead.

Member: So `spark` seems safe and is probably a better name, sorry about that.

Collaborator (Author): 👍

#### Manually Configuring SparkSession/SparkContext

If you want to configure the SparkSession manually, you can do so as follows:

#### Example SparkSession Configuration
```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
    .getOrCreate()
```

#### Example SparkContext Configuration
```python
import os

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setMaster(os.environ['SPARK_MASTER_URL']) \
    .setAppName("TestSparkJob")
sc = SparkContext(conf=conf)
```

#### Submitting a Job Using Terminal
```bash
/opt/bitnami/spark/bin/spark-submit \
    --master $SPARK_MASTER_URL \
    /opt/bitnami/spark/examples/src/main/python/pi.py 10 \
    2>/dev/null
```
Empty file added: src/__init__.py

Review thread on `src/__init__.py`:

Member: I don't think `__init__` files are necessary any more.

Collaborator (Author): 👍 removed.
Empty file added: src/spark/__init__.py
src/spark/utils.py (32 changes: 32 additions & 0 deletions)
@@ -0,0 +1,32 @@
import os

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession


def get_spark_session(app_name: str, local: bool = False) -> SparkSession:
    """
    Helper to get and manage the `SparkSession` and keep all of our spark configuration params in one place.

    :param app_name: The name of the application
    :param local: Whether to run the spark session locally or not

    :return: A `SparkSession` object
    """

    if local:
        return SparkSession.builder.appName(app_name).getOrCreate()

    spark_conf = SparkConf()

    spark_conf.setAll(
        [
            (
                "spark.master",
                os.environ.get("SPARK_MASTER_URL", "spark://spark-master:7077"),
            ),
            ("spark.app.name", app_name),
        ]
    )

    return SparkSession.builder.config(conf=spark_conf).getOrCreate()
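
As a usage sketch (names taken from this PR; the cluster-mode call reads `SPARK_MASTER_URL` and falls back to `spark://spark-master:7077`, per the code above):

```python
from spark.utils import get_spark_session

# Cluster mode: master taken from SPARK_MASTER_URL, with the
# spark://spark-master:7077 fallback baked into the helper.
spark = get_spark_session("TestApp")

# Local mode: skips the master lookup and runs Spark in-process.
local_spark = get_spark_session("TestApp", local=True)
```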