
add JUPYTER_MODE for JupyterHub #77

Merged: 5 commits, Sep 3, 2024
Changes from 3 commits
16 changes: 14 additions & 2 deletions Dockerfile
@@ -13,7 +13,7 @@ RUN groupadd -r spark && useradd -r -g spark spark_user

RUN apt-get update && apt-get install -y \
# GCC required to resolve error during JupyterLab installation: psutil could not be installed from sources because gcc is not installed.
-    gcc curl git graphviz graphviz-dev libgdal-dev build-essential python3-dev\
+    gcc curl git npm nodejs graphviz graphviz-dev libgdal-dev build-essential python3-dev\
&& rm -rf /var/lib/apt/lists/*

ENV HADOOP_AWS_VER=3.3.4
@@ -42,13 +42,25 @@ RUN pipenv sync --system

RUN chown -R spark_user:spark /opt/bitnami

-# Set up Jupyter directories
+# Set up Jupyter Lab directories
ENV JUPYTER_CONFIG_DIR=/.jupyter
ENV JUPYTER_RUNTIME_DIR=/.jupyter/runtime
ENV JUPYTER_DATA_DIR=/.jupyter/data
RUN mkdir -p ${JUPYTER_CONFIG_DIR} ${JUPYTER_RUNTIME_DIR} ${JUPYTER_DATA_DIR}
RUN chown -R spark_user:spark /.jupyter

# Set up Jupyter Hub directories
ENV JUPYTERHUB_CONFIG_DIR=/srv/jupyterhub
RUN mkdir -p ${JUPYTERHUB_CONFIG_DIR}
COPY ./src/notebook_utils/startup.py ${JUPYTERHUB_CONFIG_DIR}/startup.py
RUN chown -R spark_user:spark ${JUPYTERHUB_CONFIG_DIR}

# Jupyter Hub user home directory
RUN mkdir -p /home
RUN chown -R spark_user:spark /home

RUN npm install -g configurable-http-proxy

COPY ./src/ /src
ENV PYTHONPATH "${PYTHONPATH}:/src"
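The npm/nodejs additions and the `configurable-http-proxy` install above exist because JupyterHub routes all user traffic through that Node.js proxy. A hypothetical fail-fast check an entrypoint could run before launching — `require_cmds` is an illustrative helper, not something this PR adds:

```shell
#!/bin/bash
# Hypothetical helper (not in this PR): fail fast if binaries the hub
# needs are missing from PATH, instead of failing mid-startup.
require_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || {
            echo "missing required command: $cmd" >&2
            return 1
        }
    done
}

# e.g. in the image: require_cmds jupyterhub configurable-http-proxy node
```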

73 changes: 67 additions & 6 deletions docker-compose.yaml
@@ -123,21 +123,22 @@ services:
- ./config/yarn-write-policy.json:/config/yarn-write-policy.json
- ./scripts/minio_create_bucket_entrypoint.sh:/scripts/minio_create_bucket_entrypoint.sh

-dev_notebook:
+dev_jupyterlab:
build:
context: .
dockerfile: Dockerfile
-container_name: spark-dev-notebook
+container_name: dev-jupyterlab
ports:
- "4041:4041"
depends_on:
- spark-master
- minio-create-bucket
environment:
- NOTEBOOK_PORT=4041
- JUPYTER_MODE=jupyterlab
- YARN_RESOURCE_MANAGER_URL=http://yarn-resourcemanager:8032
- SPARK_MASTER_URL=spark://spark-master:7077
-- SPARK_DRIVER_HOST=spark-dev-notebook
+- SPARK_DRIVER_HOST=dev-jupyterlab
- MINIO_URL=http://minio:9002
- MINIO_ACCESS_KEY=minio-readwrite
- MINIO_SECRET_KEY=minio123
@@ -151,21 +151,22 @@ services:
volumes:
- ./cdr/cdm/jupyter:/cdm_shared_workspace

-user_notebook:
+user-jupyterlab:
build:
context: .
dockerfile: Dockerfile
-container_name: spark-user-notebook
+container_name: user-jupyterlab
ports:
- "4042:4042"
depends_on:
- spark-master
- minio-create-bucket
environment:
- NOTEBOOK_PORT=4042
- JUPYTER_MODE=jupyterlab
- YARN_RESOURCE_MANAGER_URL=http://yarn-resourcemanager:8032
- SPARK_MASTER_URL=spark://spark-master:7077
-- SPARK_DRIVER_HOST=spark-user-notebook
+- SPARK_DRIVER_HOST=user-jupyterlab
- MINIO_URL=http://minio:9002
- MINIO_ACCESS_KEY=minio-readonly
- MINIO_SECRET_KEY=minio123
@@ -179,6 +181,65 @@ services:
volumes:
- ./cdr/cdm/jupyter/user_shared_workspace:/cdm_shared_workspace/user_shared_workspace

dev_jupyterhub:
Collaborator Author: We probably only need one jupyterhub instance eventually and control permissions based on different user groups, but I haven't completely sorted it out yet.
build:
context: .
dockerfile: Dockerfile
container_name: dev-jupyterhub
ports:
- "4043:4043"
depends_on:
- spark-master
- minio-create-bucket
environment:
- NOTEBOOK_PORT=4043
- JUPYTER_MODE=jupyterhub
- YARN_RESOURCE_MANAGER_URL=http://yarn-resourcemanager:8032
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_DRIVER_HOST=dev-jupyterhub
- MINIO_URL=http://minio:9002
- MINIO_ACCESS_KEY=minio-readwrite
- MINIO_SECRET_KEY=minio123
- S3_YARN_BUCKET=yarn
- MAX_EXECUTORS=4
- POSTGRES_USER=hive
- POSTGRES_PASSWORD=hivepassword
- POSTGRES_DB=hive
- POSTGRES_URL=postgres:5432
- USAGE_MODE=dev
volumes:
- ./cdr/cdm/jupyter:/cdm_shared_workspace
- ./cdr/cdm/jupyter/home:/home

user_jupyterhub:
build:
context: .
dockerfile: Dockerfile
container_name: user-jupyterhub
ports:
- "4044:4044"
depends_on:
- spark-master
- minio-create-bucket
environment:
- NOTEBOOK_PORT=4044
- JUPYTER_MODE=jupyterhub
- YARN_RESOURCE_MANAGER_URL=http://yarn-resourcemanager:8032
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_DRIVER_HOST=user-jupyterhub
- MINIO_URL=http://minio:9002
- MINIO_ACCESS_KEY=minio-readonly
- MINIO_SECRET_KEY=minio123
- S3_YARN_BUCKET=yarn
- MAX_EXECUTORS=4
- POSTGRES_USER=hive
- POSTGRES_PASSWORD=hivepassword
- POSTGRES_DB=hive
- POSTGRES_URL=postgres:5432
Comment on lines +236 to +239

Member: Not really the point of this PR, but it makes me nervous that users have access to these creds...

Collaborator Author: Yea, we do have a todo item for this:

# TODO: create postgres user w/ only write access to the hive tables

Member: Even that would make me nervous. It means that any user could blow away the hive tables.

Collaborator Author: I updated the todo item to:

# TODO: create postgres user w/ only read access to the hive tables

Member: Users won't need to create tables?

Member: I guess another question is: if users and devs are all in the same jupyterhub instance, is it possible to have different environments?

Collaborator Author: Users won't, since they don't have minIO write permission. I am hoping it can be configured, but TBD.

Member: Ok, so users = read only, devs = write, essentially.

Member: If we could ever get the remote metastore working that might provide some protection as well, not sure.
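The read-only role discussed above could be sketched as follows. This is a hypothetical sketch, not part of the PR: the role name, password, and schema are assumptions, and the grants would need to match wherever Hive actually creates its tables.

```shell
# Hypothetical sketch (not in this PR): a read-only Postgres role for the
# hive database. The connection string mirrors the compose file; the
# 'hive_reader' role and its password are made up for illustration.
psql "postgresql://hive:hivepassword@postgres:5432/hive" <<'SQL'
CREATE ROLE hive_reader LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE hive TO hive_reader;
GRANT USAGE ON SCHEMA public TO hive_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO hive_reader;
-- cover tables Hive creates after this script runs
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO hive_reader;
SQL
```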

volumes:
- ./cdr/cdm/jupyter/home:/home
Member: No shared workspace?

Collaborator Author: No. Each user will have their own dir and python environment.

Member: Should dev_jupyterhub have the shared workspace mounted in that case?

Collaborator Author: Both instances mount - ./cdr/cdm/jupyter/jupyterhub/users_home:/jupyterhub/users_home. Each user will have their own dir inside of /jupyterhub/users_home and create notebooks there.

Member: My point is that dev_jupyterhub also mounts ./cdr/cdm/jupyter:/cdm_shared_workspace. I'm asking if that line needs to be there or if it should be removed to match user_jupyterhub.

Collaborator Author: I don't see any issues with using - ./cdr/cdm/jupyter:/cdm_shared_workspace for the dev environment.

Member: I'm not saying there's an issue, I just wasn't sure why it was there. Can you remind me what's in the hive_metastore and cdm-postgres dirs? Are they both still relevant with the standalone postgres service?

Collaborator Author: I think cdm-postgres is no longer used; it hasn't been updated for 2 months. In hive_metastore there are <db_name>.db files. I think they are used when users run spark select statements.

[Screenshot 2024-09-03 at 3 01 27 PM]

Member: Ok, maybe cdm-postgres was the local DB, and hive_metastore is cache data?

Collaborator Author: Not exactly sure, but I want to worry about those later.


postgres:
image: postgres:16.3
restart: always
37 changes: 23 additions & 14 deletions scripts/notebook_entrypoint.sh
@@ -1,7 +1,5 @@
#!/bin/bash

-echo "starting jupyter notebook"
-
# Ensure NOTEBOOK_DIR is set
if [ -z "$NOTEBOOK_DIR" ]; then
echo "ERROR: NOTEBOOK_DIR is not set. Please run setup.sh first."
@@ -10,17 +8,28 @@ fi

mkdir -p "$NOTEBOOK_DIR" && cd "$NOTEBOOK_DIR"

-# install Plotly extension
-jupyter labextension install [email protected]
-
-# install ipywidgets extension
-jupyter labextension install @jupyter-widgets/[email protected]
-
-# Start Jupyter Lab
-jupyter lab --ip=0.0.0.0 \
-    --port="$NOTEBOOK_PORT" \
-    --no-browser \
-    --allow-root \
-    --notebook-dir="$NOTEBOOK_DIR" \
-    --ServerApp.token='' \
-    --ServerApp.password=''
+if [ "$JUPYTER_MODE" = "jupyterlab" ]; then
+    echo "starting jupyterlab"
+    # install Plotly extension
+    jupyter labextension install [email protected]
+
+    # install ipywidgets extension
+    jupyter labextension install @jupyter-widgets/[email protected]
+
+    # Start Jupyter Lab
+    jupyter lab --ip=0.0.0.0 \
+        --port="$NOTEBOOK_PORT" \
+        --no-browser \
+        --allow-root \
+        --notebook-dir="$NOTEBOOK_DIR" \
+        --ServerApp.token='' \
+        --ServerApp.password=''
+elif [ "$JUPYTER_MODE" = "jupyterhub" ]; then
+    echo "starting jupyterhub"
+    echo "TO BE IMPLEMENTED"
+else
+    echo "ERROR: JUPYTER_MODE is not set to jupyterlab or jupyterhub. Please set JUPYTER_MODE to either jupyterlab or jupyterhub."
+    exit 1
+fi
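The jupyterhub branch is still a stub in this PR. The JUPYTER_MODE contract itself can be factored into a small function — a sketch only, and `start_jupyter` is an illustrative name, not code from the PR:

```shell
#!/bin/bash
# Sketch of the JUPYTER_MODE dispatch in notebook_entrypoint.sh,
# factored into a function ('start_jupyter' is an illustrative name).
start_jupyter() {
    case "$1" in
        jupyterlab) echo "starting jupyterlab" ;;
        jupyterhub) echo "starting jupyterhub" ;;  # real launch TBD in the PR
        *)
            echo "ERROR: JUPYTER_MODE must be jupyterlab or jupyterhub" >&2
            return 1
            ;;
    esac
}
```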