[GSProcessing] Bump version to 0.2.1, add support for arm64 images for EMR Serverless #630

Merged
merged 2 commits into from
Nov 10, 2023
2 changes: 1 addition & 1 deletion docs/source/gs-processing/usage/amazon-sagemaker.rst
Original file line number Diff line number Diff line change
@@ -45,7 +45,7 @@ job, followed by the re-partitioning job, both on SageMaker:
INSTANCE_TYPE="ml.t3.xlarge"
NUM_FILES="4"

IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"

OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
66 changes: 65 additions & 1 deletion docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -104,13 +104,70 @@ the following to build the SageMaker image:
bash docker/build_gsprocessing_image.sh --environment sagemaker

The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}-x86_64`` where
``${VERSION}`` will be the latest available GSProcessing version (e.g. ``0.2.1``).

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.
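
The tag format described above can be sketched as a small helper. This is only an illustration: ``gsp_image_tag`` is a hypothetical function name and is not part of the GSProcessing scripts, and the defaults simply mirror the values mentioned in the text.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the GSProcessing scripts): compose the
# local image tag as <image>-<environment>:<version>-<architecture>.
gsp_image_tag() {
    local image_name="${1:-graphstorm-processing}"
    local exec_env="$2"
    local version="$3"
    local arch="${4:-x86_64}"
    echo "${image_name}-${exec_env}:${version}-${arch}"
}

gsp_image_tag "graphstorm-processing" "sagemaker" "0.2.1" "x86_64"
# prints graphstorm-processing-sagemaker:0.2.1-x86_64
```

Keeping the architecture in the tag lets ``x86_64`` and ``arm64`` builds of the same version live side by side in one repository.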

Support for arm64 architecture
------------------------------

For EMR Serverless images, it is possible to build images that support ``arm64`` instances,
which can lead to improved runtime and cost compared to ``x86_64``. You can build an ``arm64``
image natively by installing Docker and following the above process on an ARM instance such
as ``M6G`` or ``M7G``. See the `AWS documentation <https://aws.amazon.com/ec2/graviton/>`_
for instances powered by the Graviton processor.
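
Whether an image can be built natively depends on the CPU architecture of the build host. A minimal sketch for checking this with ``uname -m`` follows; ``normalize_arch`` is a hypothetical helper name, and the mapping covers only the two architectures discussed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a `uname -m` value to the architecture names
# used in this guide (x86_64 or arm64).
normalize_arch() {
    case "$1" in
        x86_64 | amd64) echo "x86_64" ;;
        aarch64 | arm64) echo "arm64" ;;
        *) echo "unknown" ;;
    esac
}

HOST_ARCH="$(normalize_arch "$(uname -m)")"
if [ "${HOST_ARCH}" = "arm64" ]; then
    echo "Native arm64 host: arm64 images can be built without emulation"
else
    echo "Host is ${HOST_ARCH}: arm64 images need a multi-platform build setup"
fi
```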

To build ``arm64`` images on an ``x86_64`` host you need to enable multi-platform builds for Docker.
The easiest way to do so is to use QEMU emulation. To install the QEMU-related libraries,
run the command for your distribution.

On Ubuntu:

.. code-block:: bash

    sudo apt install -y qemu binfmt-support qemu-user-static

On Amazon Linux/CentOS:

.. code-block:: bash

    sudo yum install -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
        libvirt virt-install libvirt-python libguestfs-tools-c

Finally, ensure that ``binfmt_misc`` is configured for the different platforms by running:

.. code-block:: bash

    docker run --privileged --rm tonistiigi/binfmt --install all

To verify your Docker installation is ready for multi-platform builds you can run:

.. code-block:: bash

    docker buildx ls

    NAME/NODE    DRIVER/ENDPOINT STATUS  BUILDKIT     PLATFORMS
    default *    docker
      default    default         running v0.8+unknown linux/amd64, linux/arm64
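
If you want to check for ``arm64`` support from a script instead of reading the listing by eye, a minimal sketch could look like the following. ``buildx_supports`` is a hypothetical helper that reads ``docker buildx ls`` output from standard input; the sample line is copied from the listing above so the example runs without Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the `docker buildx ls` output passed on
# stdin mentions the requested platform string.
buildx_supports() {
    grep -qF -- "$1"
}

# Sample line copied from the listing above. In practice you would run:
#   docker buildx ls | buildx_supports "linux/arm64"
SAMPLE='default default running v0.8+unknown linux/amd64, linux/arm64'
if echo "${SAMPLE}" | buildx_supports "linux/arm64"; then
    echo "arm64 builds are available"
else
    echo "arm64 builds are not available; check the QEMU/binfmt setup"
fi
```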

To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you can run:

.. code-block:: bash

    bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64

.. note::

    Building images under emulation using QEMU can be significantly slower than native builds
    (more than 20 minutes to build the GSProcessing ``arm64`` image).
    To speed up the build process you can build on an ARM instance,
    look into using ``buildx`` with multiple native nodes, or use cross-compilation.
    See `the official Docker documentation <https://docs.docker.com/build/building/multi-platform/>`_
    for details.

Push the image to the Amazon Elastic Container Registry (ECR)
-------------------------------------------------------------

@@ -136,6 +193,13 @@ Example:

bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"

To push an EMR Serverless ``arm64`` image you'd similarly run:

.. code-block:: bash

    bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
        -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
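
As a rough sketch of where the pushed image ends up, the helper below composes the ECR image URI from the same values passed to the push script. ``ecr_image_uri`` is a hypothetical name used only for illustration; the URI layout and the ``latest-<arch>`` tag follow the conventions shown elsewhere in this change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose the ECR URI for an image tag, following the
# <account>.dkr.ecr.<region>.amazonaws.com/<image>-<environment>:<tag> layout.
ecr_image_uri() {
    local account="$1" region="$2" image="$3" exec_env="$4" tag="$5"
    echo "${account}.dkr.ecr.${region}.amazonaws.com/${image}-${exec_env}:${tag}"
}

VERSION="0.2.1"
ARCH="arm64"
# Versioned tag and the matching architecture-specific "latest" tag.
ecr_image_uri "1234567890" "us-west-2" "graphstorm-processing" "emr-serverless" "${VERSION}-${ARCH}"
ecr_image_uri "1234567890" "us-west-2" "graphstorm-processing" "emr-serverless" "latest-${ARCH}"
```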

.. _gsp-upload-data-ref:

Upload data to S3
10 changes: 5 additions & 5 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -88,14 +88,14 @@ Here we will just show the custom image application creation using the AWS CLI:

aws emr-serverless create-application \
--name gsprocessing-0.2.1 \
--release-label emr-6.11.0 \
--release-label emr-6.13.0 \
--type SPARK \
--image-configuration '{
"imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1"
"imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-<arch>"
}'

Here you will need to replace ``<aws-account-id>`` and ``<region>`` with the correct values
from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.11.0`` as its
Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``arm64``), and ``<region>`` with the correct values
from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.
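
An EMR Serverless application also has its own architecture setting, and it should match the architecture of the custom image. The sketch below assumes the ``--architecture`` option of ``aws emr-serverless create-application``, which to our knowledge accepts ``X86_64`` and ``ARM64``; please confirm against the AWS CLI reference for your version. ``emr_architecture`` is a hypothetical helper that maps the image tag suffix to that value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the image tag suffix used in this guide to the
# architecture value assumed for `aws emr-serverless create-application`.
emr_architecture() {
    case "$1" in
        x86_64) echo "X86_64" ;;
        arm64) echo "ARM64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

ARCH="arm64"
# Print the extra option to add to the create-application command.
echo "--architecture $(emr_architecture "${ARCH}")"
```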


@@ -234,7 +234,7 @@ and building the GSProcessing SageMaker ECR image:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}

SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"

Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
FROM public.ecr.aws/emr-serverless/spark/emr-6.11.0:20230629-x86_64 as runtime
ARG ARCH=x86_64
FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
FROM base as runtime

USER root
ENV PYTHON_VERSION=3.9.18

24 changes: 17 additions & 7 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -16,6 +16,8 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info (set -x)
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-a, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
Note that only x86_64 architecture is supported for SageMaker.
-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is the current directory.
-i, --image Docker image name, default is 'graphstorm-processing'.
@@ -43,6 +45,7 @@ parse_params() {
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
TARGET='test'
ARCH='x86_64'

while :; do
case "${1-}" in
@@ -57,6 +60,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
-a | --architecture)
ARCH="${2-}"
shift
;;
-p | --path)
GSP_HOME="${2-}"
shift
@@ -103,15 +110,20 @@ else
die "--target parameter needs to be one of 'prod' or 'test', got ${TARGET}"
fi

if [[ ${EXEC_ENV} == "sagemaker" || ${EXEC_ENV} == "emr-serverless" ]]; then
if [[ ${ARCH} == "x86_64" || ${ARCH} == "arm64" ]]; then
: # Do nothing
else
die "--environment parameter needs to be one of 'emr-serverless' or 'sagemaker', got ${EXEC_ENV}"
die "--architecture parameter needs to be one of 'arm64' or 'x86_64', got ${ARCH}"
fi

if [[ ${EXEC_ENV} == "sagemaker" && ${ARCH} == "arm64" ]]; then
die "arm64 architecture is not supported for SageMaker"
fi

# script logic here
msg "Execution parameters:"
msg "- ENVIRONMENT: ${EXEC_ENV}"
msg "- ARCHITECTURE: ${ARCH}"
msg "- TARGET: ${TARGET}"
msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
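
The checks in this hunk can be restated as a standalone function for quick experimentation. This is a sketch, not the script's actual code: ``validate_build_args`` is a hypothetical name, it returns a non-zero status instead of calling ``die``, and the environment check is included only to make the example self-contained.

```shell
#!/usr/bin/env bash
# Hypothetical restatement of the argument checks: accept only known
# environments and architectures, and reject arm64 for SageMaker.
validate_build_args() {
    local exec_env="$1" arch="$2"
    case "${exec_env}" in
        sagemaker | emr-serverless) ;;
        *) echo "unknown environment: ${exec_env}" >&2; return 1 ;;
    esac
    case "${arch}" in
        x86_64 | arm64) ;;
        *) echo "unknown architecture: ${arch}" >&2; return 1 ;;
    esac
    if [ "${exec_env}" = "sagemaker" ] && [ "${arch}" = "arm64" ]; then
        echo "arm64 architecture is not supported for SageMaker" >&2
        return 1
    fi
}

validate_build_args "emr-serverless" "arm64" && echo "accepted: emr-serverless/arm64"
validate_build_args "sagemaker" "arm64" 2>/dev/null || echo "rejected: sagemaker/arm64"
```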
@@ -139,18 +151,16 @@ cp ${GSP_HOME}/docker-entry.sh "${BUILD_DIR}/docker/code/"
poetry export -f requirements.txt --output "${BUILD_DIR}/docker/requirements.txt"

# Set image name
DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}"
DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}-${ARCH}"

# Login to ECR to be able to pull source SageMaker image
if [[ ${EXEC_ENV} == "sagemaker" ]]; then
aws ecr get-login-password --region us-west-2 \
| docker login --username AWS --password-stdin 153931337802.dkr.ecr.us-west-2.amazonaws.com
else
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
# aws ecr get-login-password --region us-west-2 \
# | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
fi

echo "Build a Docker image ${DOCKER_FULLNAME}"
DOCKER_BUILDKIT=1 docker build -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET}
DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
"${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH}
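
The ``--platform`` value above is built directly from ``${ARCH}``, which yields ``linux/x86_64`` for the default architecture. As far as we know Docker normalizes ``x86_64`` to ``amd64``, so an explicit mapping is likely optional; the sketch below, with the hypothetical helper ``docker_platform``, only shows what the canonical platform names would be.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an architecture name to the canonical Docker
# platform string (linux/amd64 or linux/arm64).
docker_platform() {
    case "$1" in
        x86_64 | amd64) echo "linux/amd64" ;;
        arm64 | aarch64) echo "linux/arm64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

docker_platform "x86_64"   # prints linux/amd64
docker_platform "arm64"    # prints linux/arm64
```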
11 changes: 9 additions & 2 deletions graphstorm-processing/docker/push_gsprocessing_image.sh
@@ -16,6 +16,7 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
-c, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-r, --region AWS Region to which we'll push the image. By default will get from aws-cli configuration.
@@ -43,6 +44,7 @@ parse_params() {
REGION=$(aws configure get region)
REGION=${REGION:-us-west-2}
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ARCH='x86_64'


while :; do
@@ -54,6 +56,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
-c | --architecture)
ARCH="${2-}"
shift
;;
-i | --image)
IMAGE="${2-}"
shift
@@ -98,13 +104,14 @@ fi
# script logic here
msg "Execution parameters: "
msg "- ENVIRONMENT: ${EXEC_ENV}"
msg "- ARCHITECTURE: ${ARCH}"
msg "- IMAGE: ${IMAGE}"
msg "- VERSION: ${VERSION}"
msg "- REGION: ${REGION}"
msg "- ACCOUNT: ${ACCOUNT}"

SUFFIX="${VERSION}"
LATEST_SUFFIX="latest"
SUFFIX="${VERSION}-${ARCH}"
LATEST_SUFFIX="latest-${ARCH}"
IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"


4 changes: 2 additions & 2 deletions graphstorm-processing/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "graphstorm_processing"
version = "0.1.0"
version = "0.2.1"
description = "Distributed graph pre-processing for GraphStorm"
readme = "README.md"
packages = [{include = "graphstorm_processing"}]
@@ -10,7 +10,7 @@ authors = [

[tool.poetry.dependencies]
python = "~3.9.12"
pyspark = "~3.3.0"
pyspark = ">=3.3.0, < 3.5.0"
pyarrow = "~13.0.0"
spacy = "3.6.0"
boto3 = "~1.28.1"
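
To illustrate what the relaxed ``pyspark`` constraint allows, the sketch below compares a few version numbers against the old range (``~3.3.0``, which Poetry treats as ``>=3.3.0, <3.4.0``) and the new one. The helper names are hypothetical, and the comparison relies on GNU ``sort -V``.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (uses GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# in_range V LOW HIGH: succeeds when LOW <= V < HIGH.
in_range() {
    version_ge "$1" "$2" && ! version_ge "$1" "$3"
}

for v in 3.3.2 3.4.1 3.5.0; do
    old="no"; new="no"
    in_range "$v" 3.3.0 3.4.0 && old="yes"   # old constraint: ~3.3.0
    in_range "$v" 3.3.0 3.5.0 && new="yes"   # new constraint: >=3.3.0, <3.5.0
    echo "pyspark ${v}: old=${old} new=${new}"
done
```

With these inputs, only ``3.4.1`` changes status: it is rejected by the old constraint and accepted by the new one.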
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@
],
"separator": ","
},
"type": "movies",
"type": "movie",
"column": "~id"
},
{