add jupyter notebook #1
Changes from 9 commits
**Dockerfile** (`@@ -1,4 +1,23 @@`) — the previous `ENTRYPOINT ["sleep 10s"]` placeholder is removed and the file now reads:

```dockerfile
FROM bitnami/spark:3.5.1

RUN export ORI_USER=$(id -u)
# Switch to root to install packages
USER root

ENV PYTHON_VER=python3.11

# Install necessary packages
RUN apt-get update && apt-get install -y \
    $PYTHON_VER python3-pip $PYTHON_VER-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Jupyterlab and other python dependencies
RUN pip3 install jupyterlab==4.2.0 pyspark==3.5.1

COPY scripts/entrypoint.sh /opt/
RUN chmod a+x /opt/entrypoint.sh

# Switch back to the original user
USER ${ORI_USER}

ENTRYPOINT ["/opt/entrypoint.sh"]
```
**README.md** (`@@ -1,4 +1,94 @@`)

# CDM Jupyterhub dockerfiles (Prototype)

This prototype establishes a Docker container configuration for JupyterHub, providing a multi-user environment for executing Spark jobs via Jupyter notebooks.
## Using `docker-compose.yaml`

To deploy the JupyterHub container and Spark nodes locally, execute the following command:
```bash
docker-compose up --build
```

> **Review discussion** (on `docker-compose up --build`)
>
> "You can daemonize with `-d`."
>
> "I know. But I usually want to see the logs."
>
> "… but if you want to leave it as is, that's fine."
>
> "I know you can view logs that way. I think it's convenient to view it as it's happening. I will just leave it as is for now. It's the individual dev's choice, I guess."
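For anyone who does prefer detached mode, a minimal sketch (it assumes the `notebook` service name from the `docker-compose.yaml` in this PR):

```bash
# Start the stack in the background, then follow the notebook logs on demand
docker-compose up --build -d
docker-compose logs -f notebook

# Tear everything down when finished
docker-compose down
```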
## Test Submitting a Spark Job Locally

### Submitting a Spark Job via spark-test-node

```bash
docker exec -it spark-test-node \
    sh -c '
    /opt/bitnami/spark/bin/spark-submit \
        --master $SPARK_MASTER_URL \
        examples/src/main/python/pi.py 10 \
        2>/dev/null
    '
```
### Submitting a Spark Job via Jupyter Notebook

After launching the [Jupyter Notebook](http://localhost:4041/), establish a Spark context or session with the Spark
master set to the environment variable `SPARK_MASTER_URL` and proceed to submit your job. Once the job is submitted,
you can monitor the job status and logs in the [Spark UI](http://localhost:8080/).

Sample code to calculate Pi using `SparkContext`:

```python
from pyspark import SparkConf, SparkContext
import random
import os

spark_master_url = os.environ['SPARK_MASTER_URL']

conf = SparkConf().setMaster(spark_master_url).setAppName("Pi")
sc = SparkContext(conf=conf)

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```
## Rancher Deployment
### Environment Variables

- `SPARK_MASTER_URL`: `spark://spark-master:7077`
- `NOTEBOOK_PORT`: `4041`
- `SPARK_DRIVER_HOST`: `notebook` (the hostname of the Jupyter notebook container)
### Spark Session/Context Configuration

Make sure to set `spark.driver.host` so that the Spark driver binds to the Jupyter notebook container's hostname:
```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
    .config("spark.driver.host", os.environ['SPARK_DRIVER_HOST']) \
    .getOrCreate()
```

Or

```python
conf = SparkConf() \
    .setMaster(os.environ['SPARK_MASTER_URL']) \
    .setAppName("TestSparkJob") \
    .set("spark.driver.host", os.environ['SPARK_DRIVER_HOST'])

sc = SparkContext(conf=conf)
```

> **Review discussion** (on the `spark.driver.host` setting)
>
> "We should just add this to the config in the image in the entrypoint so the user doesn't have to worry about it."
>
> "Are you referring to making changes to the Spark configuration file? I can do it in my next PR. I think it will involve a lot of testing commits, since locally it always works."
>
> "That's what I was thinking, but regardless of the implementation, the behavior should be that we set some environment var in Rancher to the notebook hostname and then the user doesn't need to worry about the hostname at all; it Just Works. Next PR is fine."

> **Review discussion** (on lines +72 to +77, the two examples above)
>
> "Do you have to set up both the session and the context? Since they're in different code blocks it seems like they're different things."
>
> "No, you don't need to set up both. SparkContext gives more control over Spark but is more complicated to configure. Some people have a preference for one or the other, so I just give two examples."
>
> "Maybe put 'OR' between them?"
>
> "👍"
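As an illustration of the reviewer's suggestion above, a minimal sketch of what the entrypoint change might look like (hypothetical, not part of this PR; it assumes the bitnami layout under `/opt/bitnami/spark` and the `SPARK_DRIVER_HOST` variable described earlier):

```bash
# Hypothetical snippet for scripts/entrypoint.sh: pre-populate spark-defaults.conf
# so notebook users never have to set spark.driver.host themselves.
SPARK_DEFAULTS="/opt/bitnami/spark/conf/spark-defaults.conf"

if [ -n "$SPARK_DRIVER_HOST" ]; then
    echo "spark.driver.host $SPARK_DRIVER_HOST" >> "$SPARK_DEFAULTS"
fi
```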
### Submitting a Job via the Terminal

```bash
/opt/bitnami/spark/bin/spark-submit \
    --master $SPARK_MASTER_URL \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    /opt/bitnami/spark/examples/src/main/python/pi.py 10 \
    2>/dev/null
```
**docker-compose.yaml** (`@@ -0,0 +1,65 @@`)
```yaml
version: '3'

# This docker-compose is for developer convenience, not for running in production.

services:

  spark-master:
    image: bitnami/spark:3.5.1
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_HOST=0.0.0.0

  spark-worker-1:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_WEBUI_PORT=8081

  spark-worker-2:
    image: bitnami/spark:3.5.1
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8082"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_WEBUI_PORT=8082

  spark-test-node:
    image: bitnami/spark:3.5.1
    container_name: spark-test-node
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER_URL=spark://spark-master:7077

  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: spark-notebook
    ports:
      - "4041:4041"
    depends_on:
      - spark-master
    environment:
      - NOTEBOOK_PORT=4041
      - SPARK_MASTER_URL=spark://spark-master:7077
```

> **Review discussion** (on the `container_name` entries)
>
> "Is there a reason for providing container names that appear to be redundant?"
>
> "Otherwise the container name gets the `_1` suffix. I just don't like the suffix when I need to …"
>
> "👍"
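As a usage note for the compose file above, a quick way to confirm the stack came up (the port mappings are the ones declared above):

```bash
# List the running services and their published ports
docker-compose ps

# With the mappings above, the UIs are served at:
#   Spark master UI : http://localhost:8080
#   Worker UIs      : http://localhost:8081 and http://localhost:8082
#   JupyterLab      : http://localhost:4041
```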
**scripts/entrypoint.sh** (`@@ -0,0 +1,16 @@`)
```bash
#!/bin/bash

echo "starting jupyter notebook"

WORKSPACE_DIR="/cdm_shared_workspace"
mkdir -p "$WORKSPACE_DIR"
cd "$WORKSPACE_DIR"

# Start Jupyter Lab
jupyter lab --ip=0.0.0.0 \
    --port=$NOTEBOOK_PORT \
    --no-browser \
    --allow-root \
    --notebook-dir="$WORKSPACE_DIR" \
    --ServerApp.token='' \
    --ServerApp.password=''
```
> **Review discussion** (on installing Python 3.11 in the Dockerfile)
>
> "Now that I think about it, I'm pretty sure bitnami installs Python in the image. Is there a reason you're not using that Python, assuming I'm right? If the Python version here doesn't match what's in the cluster, the job won't run."
>
> "Yeah, I can use the Python from the image. But I remember you wanted to control which Python version we are using. I have no strong opinion on that; I am fine either way."
>
> "In the prior repo the version was set for both the cluster and the test image (assuming you were using the same image). Here the Python version could be different for the cluster and the notebook, which would break things."
>
> "👍 Switched to use the image's Python. Both base images have Python installed; I don't see the difference."
>
> "What do you mean by both base images? Not sure what you mean here either."
>
> "Both `bitnami/spark` and the image you used previously have Python. I don't see why we spent a lot of time trying to install Python before. We could have just used the Python from the image."
>
> "The old image had 3.10. If it had 3.11 I wouldn't have bothered."
>
> "👍"
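A minimal sketch of the version-matching concern discussed above (hypothetical, not part of this PR): PySpark's standard `PYSPARK_PYTHON` / `PYSPARK_DRIVER_PYTHON` environment variables can pin which interpreter the driver and executors use, and the container names below are the ones declared in this PR's `docker-compose.yaml`.

```bash
# The driver (notebook) and the executors (workers) must run the same Python
# minor version, or submitted jobs will fail at runtime.
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Compare the versions reported inside the notebook container and a worker:
docker exec spark-notebook python3 --version
docker exec spark-worker-1 python3 --version
```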