[GSProcessing] Bump version to 0.2.1, add support for arm64 images for EMR-S.
thvasilo committed Nov 10, 2023
1 parent 56f8851 commit 0bf95e3
Showing 7 changed files with 98 additions and 19 deletions.
2 changes: 1 addition & 1 deletion docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -45,7 +45,7 @@ job, followed by the re-partitioning job, both on SageMaker:
INSTANCE_TYPE="ml.t3.xlarge"
NUM_FILES="4"
-IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
61 changes: 60 additions & 1 deletion docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -104,13 +104,65 @@ the following to build the SageMaker image:
bash docker/build_gsprocessing_image.sh --environment sagemaker
The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
-build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
+build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}-x86_64`` where
``${VERSION}`` will be the latest available GSProcessing version (e.g. ``0.2.1``).

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.
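
As a rough illustration of the tag scheme described above, the sketch below splits an image reference into its version and architecture parts using plain Bash parameter expansion. The helper name ``parse_gsp_image`` is hypothetical and not part of GSProcessing, and it assumes the tag already ends in ``-<arch>``; tags built before this change have no such suffix and would not parse meaningfully.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GSProcessing): split an image reference
# whose tag follows the "<version>-<arch>" scheme into its two parts.
# Assumption: the tag ends in "-<arch>" and the architecture has no hyphen.
parse_gsp_image() {
    local ref="$1"
    local tag="${ref##*:}"      # text after the last ':', e.g. 0.2.1-x86_64
    local version="${tag%-*}"   # drop the trailing "-<arch>"
    local arch="${tag##*-}"     # keep only the trailing "<arch>"
    printf 'version=%s arch=%s\n' "${version}" "${arch}"
}

parse_gsp_image "graphstorm-processing-sagemaker:0.2.1-x86_64"
```

The same function should also work on a full ECR URI, since only the text after the last ``:`` is inspected.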

Support for arm64 architecture
------------------------------

You might have noticed that we include the image's architecture, ``x86_64``, in the image tag.
For EMR Serverless images, it is possible to build images that support ``arm64`` instances,
which can lead to improved runtime and cost compared to ``x86_64``. To build ``arm64`` images
on an ``x86_64`` host you need to enable multi-platform builds for Docker. The easiest way
to do so is to use QEMU emulation. To install the QEMU-related libraries, you can run the following.

On Ubuntu:

.. code-block:: bash

    sudo apt install -y qemu binfmt-support qemu-user-static

On Amazon Linux/CentOS:

.. code-block:: bash

    sudo yum install -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
        libvirt virt-install libvirt-python libguestfs-tools-c

Finally, ensure that ``binfmt_misc`` is configured for the different platforms by running:

.. code-block:: bash

    docker run --privileged --rm tonistiigi/binfmt --install all

To verify that your Docker installation is ready for multi-platform builds, you can run:

.. code-block:: bash

    docker buildx ls

    NAME/NODE    DRIVER/ENDPOINT    STATUS     BUILDKIT        PLATFORMS
    default *    docker
      default    default            running    v0.8+unknown    linux/amd64, linux/arm64

To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you can run:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

    Building images under emulation using QEMU can be significantly slower than native builds
    (more than 20 minutes to build the GSProcessing ``arm64`` image).
    To speed up the build process you can look into using ``buildx`` with multiple native nodes,
    or cross-compilation.
    See `the official Docker documentation <https://docs.docker.com/build/building/multi-platform/>`_ for details.
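
The verification step above can also be scripted. The following is a minimal sketch, assuming the ``PLATFORMS`` column of ``docker buildx ls`` lists entries such as ``linux/arm64``; the helper name ``has_buildx_platform`` is hypothetical. It reads the listing from standard input, so in practice you would pipe ``docker buildx ls`` into it. The example feeds it a sample line instead so that it runs without Docker.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if "linux/<arch>" appears as a whole platform
# entry in `docker buildx ls` output read from stdin.
has_buildx_platform() {
    local arch="$1"
    grep -Eq "(^|[[:space:],])linux/${arch}([[:space:],/*]|\$)"
}

# Sample line taken from the listing shown in the docs; replace the
# here-string with `docker buildx ls | has_buildx_platform arm64` in practice.
sample_output='default default running v0.8+unknown linux/amd64, linux/arm64'
if has_buildx_platform "arm64" <<< "${sample_output}"; then
    echo "arm64 builds are available"
else
    echo "arm64 builds are NOT available; check the QEMU/binfmt setup" >&2
fi
```

Matching on whole entries rather than substrings avoids treating ``linux/arm`` as satisfied by ``linux/arm64``.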

Push the image to the Amazon Elastic Container Registry (ECR)
-------------------------------------------------------------

@@ -136,6 +188,13 @@ Example:
bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"

To push an EMR Serverless ``arm64`` image, you would similarly run:

.. code-block:: bash

    bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
        -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"

.. _gsp-upload-data-ref:

Upload data to S3
10 changes: 5 additions & 5 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -88,14 +88,14 @@ Here we will just show the custom image application creation using the AWS CLI:
aws emr-serverless create-application \
--name gsprocessing-0.2.1 \
---release-label emr-6.11.0 \
+--release-label emr-6.13.0 \
--type SPARK \
--image-configuration '{
-"imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1"
+"imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-<arch>"
}'
-Here you will need to replace ``<aws-account-id>`` and ``<region>`` with the correct values
-from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.11.0`` as its
+Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``arm64``), and ``<region>`` with the correct values
+from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.
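
When the image was built for ``arm64``, the application presumably has to request a matching architecture as well. The AWS CLI's ``create-application`` command accepts an ``--architecture`` option taking ``X86_64`` or ``ARM64``. The sketch below maps the image tag suffix to that value. The helper name ``emr_architecture`` is hypothetical, and whether the option is required in your setup is an assumption to verify against the EMR Serverless documentation. The command is only echoed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate the image tag suffix ('x86_64' or 'arm64')
# into the value accepted by
# `aws emr-serverless create-application --architecture`.
emr_architecture() {
    case "$1" in
        x86_64) echo "X86_64" ;;
        arm64)  echo "ARM64" ;;
        *)      echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

ARCH="arm64"
# Print the command instead of running it, so the sketch has no side effects.
echo aws emr-serverless create-application \
    --name gsprocessing-0.2.1 \
    --release-label emr-6.13.0 \
    --type SPARK \
    --architecture "$(emr_architecture "${ARCH}")"
```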


@@ -234,7 +234,7 @@ and building the GSProcessing SageMaker ECR image:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
-IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
@@ -1,4 +1,7 @@
-FROM public.ecr.aws/emr-serverless/spark/emr-6.11.0:20230629-x86_64 as runtime
+ARG ARCH=x86_64
+FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
+FROM base as runtime

USER root
ENV PYTHON_VERSION=3.9.18

24 changes: 17 additions & 7 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -16,6 +16,8 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info (set -x)
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-a, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
Note that only x86_64 architecture is supported for SageMaker.
-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is the current directory.
-i, --image Docker image name, default is 'graphstorm-processing'.
@@ -43,6 +45,7 @@ parse_params() {
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
TARGET='test'
ARCH='x86_64'
while :; do
case "${1-}" in
@@ -57,6 +60,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
-a | --architecture)
ARCH="${2-}"
shift
;;
-p | --path)
GSP_HOME="${2-}"
shift
@@ -103,15 +110,20 @@ else
die "--target parameter needs to be one of 'prod' or 'test', got ${TARGET}"
fi
-if [[ ${EXEC_ENV} == "sagemaker" || ${EXEC_ENV} == "emr-serverless" ]]; then
+if [[ ${ARCH} == "x86_64" || ${ARCH} == "arm64" ]]; then
: # Do nothing
else
-die "--environment parameter needs to be one of 'emr-serverless' or 'sagemaker', got ${EXEC_ENV}"
+die "--architecture parameter needs to be one of 'arm64' or 'x86_64', got ${ARCH}"
fi
if [[ ${EXEC_ENV} == "sagemaker" && ${ARCH} == "arm64" ]]; then
die "arm64 architecture is not supported for SageMaker"
fi
# script logic here
msg "Execution parameters:"
msg "- ENVIRONMENT: ${EXEC_ENV}"
msg "- ARCHITECTURE: ${ARCH}"
msg "- TARGET: ${TARGET}"
msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
@@ -139,18 +151,16 @@ cp ${GSP_HOME}/docker-entry.sh "${BUILD_DIR}/docker/code/"
poetry export -f requirements.txt --output "${BUILD_DIR}/docker/requirements.txt"
# Set image name
-DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}"
+DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}-${ARCH}"
# Login to ECR to be able to pull source SageMaker image
if [[ ${EXEC_ENV} == "sagemaker" ]]; then
aws ecr get-login-password --region us-west-2 \
| docker login --username AWS --password-stdin 153931337802.dkr.ecr.us-west-2.amazonaws.com
else
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
-# aws ecr get-login-password --region us-west-2 \
-# | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
fi
echo "Build a Docker image ${DOCKER_FULLNAME}"
-DOCKER_BUILDKIT=1 docker build -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
-"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET}
+DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
+"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH}
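
The argument checks in this script can be tried in isolation. The sketch below restates them as a standalone function, assuming the environment check is still wanted alongside the new architecture check. ``check_build_args`` is a hypothetical name, and unlike the script's ``die`` helper it returns non-zero instead of exiting. On success it prints the value that would be passed to ``docker build --platform``.

```bash
#!/usr/bin/env bash
# Sketch only: restates the environment/architecture checks from the build
# script. check_build_args is a hypothetical name, not part of GSProcessing.
check_build_args() {
    local exec_env="$1" arch="${2:-x86_64}"

    if [[ "${exec_env}" != "sagemaker" && "${exec_env}" != "emr-serverless" ]]; then
        echo "--environment must be 'emr-serverless' or 'sagemaker', got ${exec_env}" >&2
        return 1
    fi
    if [[ "${arch}" != "x86_64" && "${arch}" != "arm64" ]]; then
        echo "--architecture must be 'arm64' or 'x86_64', got ${arch}" >&2
        return 1
    fi
    if [[ "${exec_env}" == "sagemaker" && "${arch}" == "arm64" ]]; then
        echo "arm64 architecture is not supported for SageMaker" >&2
        return 1
    fi
    # The value Docker receives through --platform in the build command.
    echo "linux/${arch}"
}

check_build_args "emr-serverless" "arm64"
check_build_args "sagemaker" "arm64" || echo "rejected: SageMaker images are x86_64 only"
```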
11 changes: 9 additions & 2 deletions graphstorm-processing/docker/push_gsprocessing_image.sh
@@ -16,6 +16,7 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-c, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-r, --region AWS Region to which we'll push the image. By default will get from aws-cli configuration.
@@ -43,6 +44,7 @@ parse_params() {
REGION=$(aws configure get region)
REGION=${REGION:-us-west-2}
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ARCH='x86_64'


while :; do
@@ -54,6 +56,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
-c | --architecture)
ARCH="${2-}"
shift
;;
-i | --image)
IMAGE="${2-}"
shift
@@ -98,13 +104,14 @@ fi
# script logic here
msg "Execution parameters: "
msg "- ENVIRONMENT: ${EXEC_ENV}"
msg "- ARCHITECTURE: ${ARCH}"
msg "- IMAGE: ${IMAGE}"
msg "- VERSION: ${VERSION}"
msg "- REGION: ${REGION}"
msg "- ACCOUNT: ${ACCOUNT}"

-SUFFIX="${VERSION}"
-LATEST_SUFFIX="latest"
+SUFFIX="${VERSION}-${ARCH}"
+LATEST_SUFFIX="latest-${ARCH}"
IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"
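
To see what the new suffixes look like once combined with the registry address, the sketch below prints the two image URIs one would expect for a given set of arguments. It assumes the push script combines ``IMAGE_WITH_ENV`` with ``SUFFIX`` and ``LATEST_SUFFIX`` under the usual ``<account>.dkr.ecr.<region>.amazonaws.com`` registry host, as the ``IMAGE_URI`` examples in the docs suggest. ``gsp_ecr_uris`` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the versioned and "latest" ECR URIs for an image,
# following the SUFFIX / LATEST_SUFFIX pattern from the push script.
gsp_ecr_uris() {
    local account="$1" region="$2" image="$3" exec_env="$4" version="$5"
    local arch="${6:-x86_64}"
    local repo="${account}.dkr.ecr.${region}.amazonaws.com/${image}-${exec_env}"
    printf '%s:%s-%s\n' "${repo}" "${version}" "${arch}"
    printf '%s:latest-%s\n' "${repo}" "${arch}"
}

gsp_ecr_uris "1234567890" "us-west-2" "graphstorm-processing" "emr-serverless" "0.2.1" "arm64"
```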


4 changes: 2 additions & 2 deletions graphstorm-processing/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "graphstorm_processing"
-version = "0.1.0"
+version = "0.2.1"
description = "Distributed graph pre-processing for GraphStorm"
readme = "README.md"
packages = [{include = "graphstorm_processing"}]
@@ -10,7 +10,7 @@ authors = [

[tool.poetry.dependencies]
python = "~3.9.12"
-pyspark = "~3.3.0"
+pyspark = ">=3.3.0, < 3.5.0"
pyarrow = "~13.0.0"
spacy = "3.6.0"
boto3 = "~1.28.1"
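
The relaxed ``pyspark`` constraint accepts any release from ``3.3.0`` up to, but not including, ``3.5.0``. A rough way to check a version string against such a range in shell is sketched below. ``version_in_range`` is a hypothetical helper, and it relies on ``sort -V`` (version sort from GNU coreutils), which is an assumption about the environment. It does not handle pre-release suffixes the way Poetry does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if min <= version < max, comparing dotted
# version strings with `sort -V` (GNU coreutils).
version_in_range() {
    local version="$1" min="$2" max="$3"
    # version >= min: min must sort first (or be equal to version).
    [[ "$(printf '%s\n' "${min}" "${version}" | sort -V | head -n 1)" == "${min}" ]] || return 1
    # version < max: version must differ from max and sort before it.
    [[ "${version}" != "${max}" ]] || return 1
    [[ "$(printf '%s\n' "${version}" "${max}" | sort -V | head -n 1)" == "${version}" ]]
}

for candidate in 3.2.4 3.3.0 3.4.1 3.5.0; do
    if version_in_range "${candidate}" "3.3.0" "3.5.0"; then
        echo "pyspark ${candidate}: allowed"
    else
        echo "pyspark ${candidate}: outside >=3.3.0, <3.5.0"
    fi
done
```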