add jupyter notebook #1

Merged · 10 commits · May 16, 2024

Changes from 9 commits

21 changes: 20 additions & 1 deletion Dockerfile
@@ -1,4 +1,23 @@
FROM bitnami/spark:3.5.1

RUN export ORI_USER=$(id -u)
# Switch to root to install packages
USER root

ENTRYPOINT ["sleep 10s"]
ENV PYTHON_VER=python3.11
Member:

Now that I think about it, I'm pretty sure bitnami installs python in the image. Is there a reason you're not using that python, assuming I'm right? If the python version here doesn't match what's in the cluster, the job won't run.

Collaborator Author:

Yeah, I can use the python from the image. But I remember you wanted to control which python version we are using. I have no strong opinion on that; I am fine either way.

Member:

In the prior repo the version was set for both the cluster and the test image (assuming you were using the same image). Here the python version could be different for the cluster and the notebook, which would break things.

Collaborator Author:

👍 Switched to use the image's python. Both base images have python installed. I don't see the differences.

Member:

> Both base images have python installed.

What do you mean by both base images?

> I don't see the differences.

Not sure what you mean here either.

Collaborator Author:

Both bitnami/spark and the image you used previously have python. I don't see why we spent a lot of time trying to install python before; we could just have used the python from the image.

Member:

The old image had 3.10. If it had 3.11, I wouldn't have bothered.

Collaborator Author:

👍


# Install necessary packages
RUN apt-get update && apt-get install -y \
    $PYTHON_VER python3-pip $PYTHON_VER-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Jupyterlab and other python dependencies
RUN pip3 install jupyterlab==4.2.0 pyspark==3.5.1

COPY scripts/entrypoint.sh /opt/
RUN chmod a+x /opt/entrypoint.sh

# Switch back to the original user
USER ${ORI_USER}

ENTRYPOINT ["/opt/entrypoint.sh"]
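A quick way to sanity-check the concern raised in the thread above (the notebook's Python version drifting from the cluster's) is to compare the interpreters inside the running containers. A minimal sketch, assuming the container names defined in the `docker-compose.yaml` later in this PR:

```bash
# Compare the Python used by the notebook image with the one on the Spark master.
# Container names (spark-notebook, spark-master) come from docker-compose.yaml below.
docker exec spark-notebook python3 --version
docker exec spark-master python3 --version
```

PySpark generally refuses to run a job when the driver and the workers disagree on the Python minor version, so the two outputs should match.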
92 changes: 91 additions & 1 deletion README.md
@@ -1,4 +1,94 @@
# CDM Jupyterhub dockerfiles (Prototype)

This prototype establishes a Docker container configuration for JupyterHub, designed to furnish a multi-user
environment tailored for executing Spark jobs via Jupyter notebooks.

## Using `docker-compose.yaml`

To deploy the JupyterHub container and Spark nodes locally, execute the following command:

```bash
Member:

You can daemonize with -d

Collaborator Author:

I know. But I usually want to see the logs.

Member:

docker compose logs

but if you want to leave it as is, that's fine.

Collaborator Author:

I know you can view logs that way, but I think it's convenient to view them as they're happening. I will just leave it as is for now; it's an individual dev's choice, I guess.

docker-compose up --build
```
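As noted in the review thread above, the stack can also be daemonized and the logs followed separately. A small sketch of that workflow, using the service names defined in `docker-compose.yaml`:

```bash
# Start the stack in the background
docker-compose up --build -d

# Follow the notebook service logs (Ctrl+C stops following, not the containers)
docker-compose logs -f notebook

# Tear everything down when finished
docker-compose down
```

Whether to daemonize or keep the logs in the foreground is left to individual preference, as discussed above.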

## Test Submitting a Spark Job Locally

### Submitting a Spark Job via spark-test-node
```bash
docker exec -it spark-test-node \
  sh -c '
  /opt/bitnami/spark/bin/spark-submit \
    --master $SPARK_MASTER_URL \
    examples/src/main/python/pi.py 10 \
    2>/dev/null
  '
```

### Submitting a Spark Job via Jupyter Notebook
After launching the [Jupyter Notebook](http://localhost:4041/), establish a Spark context or session with the Spark
master set to the environment variable `SPARK_MASTER_URL` and proceed to submit your job. Once the job is submitted,
you can monitor the job status and logs in the [Spark UI](http://localhost:8080/).

Sample code to calculate Pi using `SparkContext`:
```python
from pyspark import SparkConf, SparkContext
import random
import os

spark_master_url = os.environ['SPARK_MASTER_URL']

conf = SparkConf().setMaster(spark_master_url).setAppName("Pi")
sc = SparkContext(conf=conf)

num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```

## Rancher Deployment

### Environment Variables
- `SPARK_MASTER_URL`: `spark://spark-master:7077`
- `NOTEBOOK_PORT`: 4041
- `SPARK_DRIVER_HOST`: `notebook` (the hostname of the Jupyter notebook container).

### Spark Session/Context Configuration

Make sure to configure `spark.driver.host` so that the Spark driver binds to the Jupyter notebook container's hostname:

```python
from pyspark.sql import SparkSession
import os

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
Member:

We should just add this to the config in the image in the entrypoint so the user doesn't have to worry about it

Collaborator Author:

Are you referring to making changes to the Spark configuration file? I can do it in my next PR. I think it will involve a lot of testing commits, since locally it always works.

Member:

That's what I was thinking, but regardless of the implementation, the behavior should be that we set some environment var in rancher to the notebook hostname, and then the user doesn't need to worry about the host name at all; it Just Works.

Next PR is fine.

    .config("spark.driver.host", os.environ['SPARK_DRIVER_HOST']) \
    .getOrCreate()
```
Or
```python
conf = SparkConf(). \
    setMaster(os.environ['SPARK_MASTER_URL']). \
    setAppName("TestSparkJob"). \
    set("spark.driver.host", os.environ['SPARK_DRIVER_HOST'])
Member:

Do you have to set up both the session and the context? Since they're in different code blocks, it seems like they're different things.

Collaborator Author:

No, you don't need to set up both. SparkContext gives more control over Spark but is more complicated to configure. Some people have a preference for one or the other, so I just give two examples.

Member:

Maybe put "OR" between them?

Collaborator Author:

👍

Member:

👍

sc = SparkContext(conf=conf)
```

Submitting a job using the terminal:
```bash
/opt/bitnami/spark/bin/spark-submit \
  --master $SPARK_MASTER_URL \
  --conf spark.driver.host=$SPARK_DRIVER_HOST \
  /opt/bitnami/spark/examples/src/main/python/pi.py 10 \
  2>/dev/null
```







65 changes: 65 additions & 0 deletions docker-compose.yaml
@@ -0,0 +1,65 @@
version: '3'

# This docker-compose is for developer convenience, not for running in production.

services:

  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
Member:

Is there a reason for providing container names that appear to be redundant?

Collaborator Author:

Otherwise the container name gets the `_1` suffix. I just don't like the suffix when I need to `docker exec -it`.

Member:

👍

    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_HOST=0.0.0.0

  spark-worker-1:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_WEBUI_PORT=8081

  spark-worker-2:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8082"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_WEBUI_PORT=8082

  spark-test-node:
    image: bitnami/spark:3.5.1
    container_name: spark-test-node
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER_URL=spark://spark-master:7077
Tianhao-Gu marked this conversation as resolved.

  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: spark-notebook
    ports:
      - "4041:4041"
    depends_on:
      - spark-master
    environment:
      - NOTEBOOK_PORT=4041
      - SPARK_MASTER_URL=spark://spark-master:7077
16 changes: 16 additions & 0 deletions scripts/entrypoint.sh
@@ -0,0 +1,16 @@
#!/bin/bash

echo "starting jupyter notebook"

WORKSPACE_DIR="/cdm_shared_workspace"
mkdir -p "$WORKSPACE_DIR"
cd "$WORKSPACE_DIR"

# Start Jupyter Lab
jupyter lab --ip=0.0.0.0 \
  --port=$NOTEBOOK_PORT \
  --no-browser \
  --allow-root \
  --notebook-dir="$WORKSPACE_DIR" \
  --ServerApp.token='' \
  --ServerApp.password=''
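
The review thread on the README suggests baking `spark.driver.host` into the image so notebook users never have to set it themselves. A possible follow-up, sketched here rather than implemented in this PR: have the entrypoint write the value into `spark-defaults.conf` whenever `SPARK_DRIVER_HOST` is set. The config path is an assumption based on the bitnami/spark image layout, and the notebook's pyspark would need to be pointed at that directory (e.g. via `SPARK_CONF_DIR`).

```bash
#!/bin/bash
# Sketch only (not part of this PR): default the Spark driver host from the environment
# so users don't have to call .config("spark.driver.host", ...) in every notebook.
SPARK_DEFAULTS="/opt/bitnami/spark/conf/spark-defaults.conf"  # assumed bitnami layout

if [ -n "$SPARK_DRIVER_HOST" ]; then
    echo "spark.driver.host $SPARK_DRIVER_HOST" >> "$SPARK_DEFAULTS"
fi
```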