
add driver_host to spark config #2

Merged · 3 commits · May 16, 2024
18 changes: 8 additions & 10 deletions README.md
@@ -59,29 +59,27 @@ sc.stop()

### Spark Session/Context Configuration

When running Spark in the Jupyter notebook container, the default `spark.driver.host` configuration is set to
the hostname (`SPARK_DRIVER_HOST`) of the container.
In addition, the environment variable `SPARK_MASTER_URL` should also be configured.

#### Example SparkSession Configuration
```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
    .config("spark.driver.host", os.environ['SPARK_DRIVER_HOST']) \
    .getOrCreate()
```
Or

#### Example SparkContext Configuration
```python
import os

from pyspark import SparkConf, SparkContext

conf = SparkConf(). \
    setMaster(os.environ['SPARK_MASTER_URL']). \
    setAppName("TestSparkJob")
sc = SparkContext(conf=conf)
```

Comment on lines 75 to 77:

Member: Much easier for the user

Collaborator Author: yup
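Either way, the setting can be read back from the live session to confirm it took effect. A minimal sketch, assuming the `spark` session from the first example and the `SPARK_DRIVER_HOST` value from the compose file:

```python
# Read the effective driver host back from the running session; it should
# match the notebook container's hostname (e.g. "spark-notebook").
print(spark.conf.get("spark.driver.host"))
```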

#### Submitting a Job Using Terminal
```bash
/opt/bitnami/spark/bin/spark-submit \
  --master $SPARK_MASTER_URL \
  --conf spark.driver.host=$SPARK_DRIVER_HOST \
  /opt/bitnami/spark/examples/src/main/python/pi.py 10 \
  2>/dev/null
```
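If the job reaches the cluster, it prints an estimate similar to the line below (the digits vary between runs, since `pi.py` is a Monte Carlo sampler):

```
Pi is roughly 3.142880
```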
3 changes: 2 additions & 1 deletion docker-compose.yaml
@@ -62,4 +62,5 @@ services:
```yaml
      - spark-master
    environment:
      - NOTEBOOK_PORT=4041
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_DRIVER_HOST=spark-notebook
```
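With those values injected, the notebook examples above can rely on the environment alone. A quick sanity check from inside the notebook container (expected values taken from this compose file):

```python
import os

# Both variables are injected by docker-compose; a missing key raises KeyError.
print(os.environ['SPARK_MASTER_URL'])   # spark://spark-master:7077
print(os.environ['SPARK_DRIVER_HOST'])  # spark-notebook
```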
10 changes: 10 additions & 0 deletions scripts/entrypoint.sh
Member: Does this work in the CI or CDM stack?

Collaborator Author: yea. I already deployed the new image in the CDM namespace and tested it.

Member: 👍
@@ -2,6 +2,16 @@

echo "starting jupyter notebook"

if [ -n "$SPARK_DRIVER_HOST" ]; then
echo "Setting spark.driver.host to $SPARK_DRIVER_HOST"
source /opt/bitnami/scripts/spark-env.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to get the conf file env var?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. I think this is the script bitnami is using to load all env variables. It only has export commands.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if [ -z "$SPARK_CONF_FILE" ]; then
Copy link
Collaborator Author

@Tianhao-Gu Tianhao-Gu May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrCreosote I think the previous attempt to write to the Spark config file failed because we didn't set $SPARK_CONF_FILE.

echo "Error: unable to find SPARK_CONF_FILE path"
exit 1
fi
echo "spark.driver.host $SPARK_DRIVER_HOST" >> $SPARK_CONF_FILE
fi

WORKSPACE_DIR="/cdm_shared_workspace"
mkdir -p "$WORKSPACE_DIR"
cd "$WORKSPACE_DIR"