[GSProcessing] Make SageMaker image build faster and smaller. #731

Merged
merged 1 commit into from
Feb 14, 2024
@@ -31,7 +31,7 @@ RUN yum erase -y openssl-devel && \
sudo \
xz-devel && \
rm -rf /var/cache/yum
RUN git clone https://github.com/pyenv/pyenv.git ${PYENV_ROOT} && \
RUN git clone https://github.com/pyenv/pyenv.git ${PYENV_ROOT} --single-branch && \
pyenv install ${PYTHON_VERSION} && \
pyenv global ${PYTHON_VERSION}

@@ -41,7 +41,7 @@ WORKDIR /usr/lib/spark/code/
COPY requirements.txt requirements.txt
# Use --mount=type=cache,target=/root/.cache when Buildkit CI issue is fixed:
# https://github.com/moby/buildkit/issues/1512
RUN pip install -r /usr/lib/spark/code/requirements.txt \
RUN pip install --no-cache-dir -r /usr/lib/spark/code/requirements.txt \
&& rm -rf /root/.cache

# GSProcessing codebase
@@ -63,7 +63,8 @@ RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm_processing-*.
rm /usr/lib/spark/code/graphstorm_processing-*.whl && rm -rf /root/.cache

FROM runtime AS test
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ && rm -rf /root/.cache
RUN python -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ mock && \
rm -rf /root/.cache

USER hadoop:hadoop
WORKDIR /home/hadoop
36 changes: 11 additions & 25 deletions graphstorm-processing/docker/0.2.2/sagemaker/Dockerfile.cpu
@@ -11,32 +11,17 @@ ENV LC_ALL=C.UTF-8
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/conda/lib"
ENV PATH=/opt/conda/bin:$PATH
ENV PIP_NO_CACHE_DIR=1

# Install GSProcessing requirements to pipenv Python
RUN pipenv install \
boto3==1.28.38 \
joblib==1.3.1 \
mock==5.1.0 \
pandas==1.3.5 \
pip==23.1.2 \
protobuf==3.20.3 \
psutil==5.9.5 \
pyarrow==13.0.0 \
pyspark==3.4.1 \
scipy==1.11.3 \
setuptools \
transformers==4.37.1 \
spacy==3.6.0 \
torch==2.1.0 \
wheel \
&& rm -rf /root/.cache
# Do a pipenv sync so our base libs are independent from our editable code, making them cacheable
RUN pipenv sync --system && python3 -m spacy download en_core_web_lg \
WORKDIR /usr/lib/spark/code/

# Install GSProcessing dependencies to system Python 3.9
COPY requirements.txt requirements.txt
RUN /usr/local/bin/python3.9 -m pip install --no-cache-dir -r /usr/lib/spark/code/requirements.txt \
&& rm -rf /root/.cache

# Graphloader codebase
COPY code/ /usr/lib/spark/code/
WORKDIR /usr/lib/spark/code/

# Base container assumes this is the workdir
ENV SPARK_HOME /usr/lib/spark
@@ -60,10 +45,11 @@ fi
ENTRYPOINT ["bash", "/usr/lib/spark/code/docker-entry.sh"]

FROM base AS prod
RUN python3 -m pip install /usr/lib/spark/code/graphstorm_processing-*.whl && \
rm /usr/lib/spark/code/graphstorm_processing-*.whl
RUN python3 -m pip install --no-deps /usr/lib/spark/code/graphstorm_processing-*.whl && \
rm /usr/lib/spark/code/graphstorm_processing-*.whl && rm -rf /root/.cache
CMD ["gs-processing"]

FROM base AS test
RUN python3 -m pip install /usr/lib/spark/code/graphstorm-processing/
CMD ["sh", "-c", "pytest ./code/tests/"]
RUN python3 -m pip install --no-deps /usr/lib/spark/code/graphstorm-processing/ mock && \
rm -rf /root/.cache
CMD ["sh", "-c", "pytest /usr/lib/spark/code/graphstorm-processing/tests/"]
44 changes: 29 additions & 15 deletions graphstorm-processing/docker/README.md
@@ -12,34 +12,48 @@ with Amazon SageMaker see docs/source/usage/distributed-processing-setup.rst.
## Building the image

To build the image you will run `bash build_gsprocessing_image.sh`
script that has one required parameter, `--target` that can take
one of two values, `prod` and `test` that determine whether we
include the source and tests on the image (when `test` is used),
or just install the libary on the image (when `prod` is used).
script that has one required parameter, `--environment`, which
determines the intended execution environment of the image.
We currently support either `sagemaker` or `emr-serverless`.

The script copies the necessary code, optionally builds and packages
the library as a `wheel` file and builds and tags the image.

You can get the other parameters of the script using
`bash build_gsprocessing_image.sh -h/--help` that include:

* `-p, --path` Path to graphstorm-processing directory, default is one level above this script.
* `-i, --image` Docker image name, default is 'graphstorm-processing'.
* `-v, --version` Docker version tag, default is the library's current version (`poetry version --short`)
* `-b, --build` Docker build directory, default is '/tmp/`


* `-e, --environment` Intended execution environment, must be one of `sagemaker` or `emr-serverless`. Required.
* `-p, --path` Path to graphstorm-processing directory, default is one level above this script.
* `-i, --image` Docker image name, default is 'graphstorm-processing'.
* `-v, --version` Docker version tag, default is the library's current version (`poetry version --short`)
* `-b, --build` Docker build directory, default is `/tmp/`
* `-a, --architecture` Target architecture for the image. Both execution environments support `x86_64`, while
EMR Serverless also supports `arm64`.
* `-s, --suffix` A suffix to add to the image tag, e.g. `-test` will name the image
`graphstorm-processing-${ENVIRONMENT}:${VERSION}-${ARCH}-test`.
* `-t, --target` Target of the image. Use `test` if you intend to use the image for testing
new library functionality, otherwise `prod`. Default: `prod`
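
As a hedged illustration of the options listed above, the following sketch assembles one possible invocation of the build script for a SageMaker test image. The chosen values (`sagemaker`, `x86_64`, `-test`) are examples only, and the command is printed rather than executed so that the sketch can run outside the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Example option values; only flags documented in the list above are used.
BUILD_CMD=(
    bash build_gsprocessing_image.sh
    --environment sagemaker
    --architecture x86_64
    --suffix "-test"
    --target test
)

# Print instead of executing so the sketch runs without Docker or the repo.
echo "${BUILD_CMD[*]}"
```

To actually build the image, run the array instead of echoing it (`"${BUILD_CMD[@]}"`) from the directory that contains the script.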

## Pushing the image

After having built the image you will run `bash push_gsprocessing_image.sh`
to push the image to ECR. By default the script will optionally create
a repository on ECR named `graphstorm-processing` in the `us-west-2` region
a repository on ECR named `graphstorm-processing-${ENVIRONMENT}` in the `us-west-2` region
and push the image we just built to it.

You can change these default values using the other parameters of the script:

* `-i, --image` Docker image name, default is 'graphstorm-processing'.
* `-v, --version` Docker version tag, default is the library's current version (`poetry version --short`)
* `-r, --region` AWS Region to which we'll push the image. By default will get from aws-cli configuration.
* `-a, --account` AWS Account ID. By default will get from aws-cli configuration.
* `-e, --environment` Intended execution environment, must be one of `sagemaker` or `emr-serverless`. Required.
* `-i, --image` Docker image name prefix, default is `graphstorm-processing-${ENVIRONMENT}`.
* `-v, --version` Docker version tag, default is the library's current version (`poetry version --short`)
* `-r, --region` AWS Region to which we'll push the image. By default will get from aws-cli configuration.
* `-a, --account` AWS Account ID. By default will get from aws-cli configuration.
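
The sketch below shows how the pushed image reference might be composed from these parameters. The tag and repository name follow the `TAG` and `IMAGE_WITH_ENV` variables in `push_gsprocessing_image.sh`; the account ID is a placeholder, and the final URI uses the standard ECR layout, which is an assumption about what the script pushes to rather than something shown in this diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values for illustration only; substitute your own.
ACCOUNT="123456789012"   # hypothetical AWS account ID
REGION="us-west-2"
EXEC_ENV="sagemaker"
IMAGE="graphstorm-processing"
VERSION="0.2.2"
ARCH="x86_64"
SUFFIX=""

# Same composition as the push script's TAG and IMAGE_WITH_ENV variables.
TAG="${VERSION}-${ARCH}${SUFFIX}"
IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"

# Standard ECR repository URI layout (assumed, not taken from the script).
ECR_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_WITH_ENV}:${TAG}"

echo "${ECR_URI}"
```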

## Testing the image

If you build the image with the argument `--target test` the
build script will include the source and tests on the image.

To run the unit tests inside a container created from the image, which helps ensure the deployed container will
behave as expected, run `docker run -it --rm --name gsp graphstorm-processing-${ENV}:0.2.2-${ARCH}${SUFFIX}`.
This executes the library's unit tests inside a local instance of the image.
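
A minimal sketch of filling in the image name used by the `docker run` command above. The helper `gsp_image_name` is hypothetical and exists only for this illustration; the command is printed rather than run so the sketch works without Docker.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: compose the image name from environment, version,
# architecture, and an optional tag suffix.
gsp_image_name() {
    local exec_env="$1" version="$2" arch="$3" suffix="${4:-}"
    echo "graphstorm-processing-${exec_env}:${version}-${arch}${suffix}"
}

IMAGE_NAME="$(gsp_image_name sagemaker 0.2.2 x86_64 -test)"

# Print the command rather than executing it.
echo "docker run -it --rm --name gsp ${IMAGE_NAME}"
```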
21 changes: 12 additions & 9 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -7,9 +7,9 @@ SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd -P)

usage() {
cat <<EOF
Usage: $(basename "${BASH_SOURCE[0]}") [-h] [-x] -t prod
Usage: $(basename "${BASH_SOURCE[0]}") [-h] [-x] -e sagemaker

Script description here.
Builds the GraphStorm Processing Docker image.

Available options:

@@ -18,13 +18,13 @@ Available options:
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-a, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
Note that only x86_64 architecture is supported for SageMaker.
-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is the current directory.
-i, --image Docker image name, default is 'graphstorm-processing'.
-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'prod'.
-p, --path Path to graphstorm-processing root directory, default is one level above this script's location.
-i, --image Docker image name, default is 'graphstorm-processing-\${environment}'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory, default is '/tmp/'.
-m, --hf-model Huggingface Model name that needs to be packed into the docker image. Default is "".
-b, --build Docker build directory prefix, default is '/tmp/'.
-m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "".

EOF
exit
@@ -47,7 +47,7 @@ parse_params() {
IMAGE_NAME='graphstorm-processing'
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
TARGET='test'
TARGET='prod'
ARCH='x86_64'
SUFFIX=""
MODEL=""
@@ -133,6 +133,8 @@ if [[ ${EXEC_ENV} == "sagemaker" && ${ARCH} == "arm64" ]]; then
die "arm64 architecture is not supported for SageMaker"
fi

# TODO: Ensure that the version requested has a corresponding directory

# script logic here
msg "Execution parameters:"
msg "- ENVIRONMENT: ${EXEC_ENV}"
@@ -155,7 +157,8 @@ if [[ ${TARGET} == "prod" ]]; then
"${BUILD_DIR}/docker/code"
else
# Copy library source code along with test files
rsync -r ${GSP_HOME} "${BUILD_DIR}/docker/code/graphstorm-processing/" --exclude .venv --exclude dist
rsync -r ${GSP_HOME} "${BUILD_DIR}/docker/code/graphstorm-processing/" --exclude .venv --exclude dist \
--exclude "*__pycache__" --exclude "*.pytest_cache" --exclude "*.mypy_cache"
cp ${GSP_HOME}/../graphstorm_job.sh "${BUILD_DIR}/docker/code/"
fi

6 changes: 3 additions & 3 deletions graphstorm-processing/docker/push_gsprocessing_image.sh
@@ -14,10 +14,10 @@ Script description here.
Available options:

-h, --help Print this help and exit
-x, --verbose Print script debug info
-x, --verbose Print script debug info (set -x)
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-c, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
-i, --image Docker image name, default is 'graphstorm-processing'.
-i, --image Docker image name, default is 'graphstorm-processing-\${environment}'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-r, --region AWS Region to which we'll push the image. By default will get from aws-cli configuration.
@@ -117,7 +117,7 @@ msg "- REGION: ${REGION}"
msg "- ACCOUNT: ${ACCOUNT}"

TAG="${VERSION}-${ARCH}${SUFFIX}"
LATEST_TAG="latest-${ARCH}"
LATEST_TAG="latest-${ARCH}${SUFFIX}"
IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"

