diff --git a/docs/source/gs-processing/usage/amazon-sagemaker.rst b/docs/source/gs-processing/usage/amazon-sagemaker.rst
index 78621c4909..624025914f 100644
--- a/docs/source/gs-processing/usage/amazon-sagemaker.rst
+++ b/docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -45,7 +45,7 @@ job, followed by the re-partitioning job, both on SageMaker:
INSTANCE_TYPE="ml.t3.xlarge"
NUM_FILES="4"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
diff --git a/docs/source/gs-processing/usage/distributed-processing-setup.rst b/docs/source/gs-processing/usage/distributed-processing-setup.rst
index d003b93579..261c0ce9a9 100644
--- a/docs/source/gs-processing/usage/distributed-processing-setup.rst
+++ b/docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -104,13 +104,70 @@ the following to build the SageMaker image:
bash docker/build_gsprocessing_image.sh --environment sagemaker
The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
-build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
+build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}-x86_64`` where
``${VERSION}`` will be the latest available GSProcessing version (e.g. ``0.2.1``).
The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
for more information.
+Support for arm64 architecture
+------------------------------
+
+For EMR Serverless images, it is possible to build images that support ``arm64`` instances,
+which can lead to improved runtime and cost compared to ``x86_64``. You can build an ``arm64``
+image natively by installing Docker and following the above process on an ARM instance such
+as ``M6G`` or ``M7G``. See the `AWS documentation `_
+for instances powered by the Graviton processor.
+
+To build ``arm64`` images
+on an ``x86_64`` host you need to enable multi-platform builds for Docker. The easiest way
+to do so is to use QEMU emulation. To install the QEMU-related libraries, run the
+command for your distribution.
+
+On Ubuntu:
+
+.. code-block:: bash
+
+ sudo apt install -y qemu binfmt-support qemu-user-static
+
+On Amazon Linux/CentOS:
+
+.. code-block:: bash
+
+ sudo yum install -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
+ libvirt virt-install libvirt-python libguestfs-tools-c
+
+Finally, ensure ``binfmt_misc`` is configured for different platforms by running:
+
+.. code-block:: bash
+
+ docker run --privileged --rm tonistiigi/binfmt --install all
+
+To verify your Docker installation is ready for multi-platform builds you can run:
+
+.. code-block:: bash
+
+ docker buildx ls
+
+ NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
+ default * docker
+ default default running v0.8+unknown linux/amd64, linux/arm64
+
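The check above can also be scripted. The following is a minimal sketch, assuming the ``docker buildx ls`` output lists supported platforms as plain text like the sample above; ``has_platform`` is a hypothetical helper and not part of GSProcessing.

```shell
# Hypothetical helper (not part of GSProcessing): succeeds only if the
# `docker buildx ls` output read from stdin mentions the requested platform.
# Assumption: platforms appear as plain strings such as "linux/arm64".
has_platform() {
    # $1: platform string to look for, e.g. linux/arm64
    grep -qF -- "$1"
}

# Example usage (requires Docker, so it is left commented out here):
# docker buildx ls | has_platform linux/arm64 \
#     || echo "linux/arm64 not available, set up QEMU emulation first" >&2
```

The helper only inspects text, so it does not prove that an emulated build will succeed. It is meant as a quick pre-flight check before a long build.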
+To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you can run:
+
+.. code-block:: bash
+
+ bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64
+
+.. note::
+
+ Building images under emulation using QEMU can be significantly slower than native builds
+ (more than 20 minutes to build the GSProcessing ``arm64`` image).
+ To speed up the build process you can build on an ARM instance,
+ look into using ``buildx`` with multiple native nodes, or use cross-compilation.
+ See `the official Docker documentation `_
+ for details.
+
Push the image to the Amazon Elastic Container Registry (ECR)
-------------------------------------------------------------
@@ -136,6 +193,13 @@ Example:
bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+To push an EMR Serverless ``arm64`` image you'd similarly run:
+
+.. code-block:: bash
+
+ bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
+ -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+
.. _gsp-upload-data-ref:
Upload data to S3
diff --git a/docs/source/gs-processing/usage/emr-serverless.rst b/docs/source/gs-processing/usage/emr-serverless.rst
index adef4a4a05..35b54e9f1d 100644
--- a/docs/source/gs-processing/usage/emr-serverless.rst
+++ b/docs/source/gs-processing/usage/emr-serverless.rst
@@ -88,14 +88,14 @@ Here we will just show the custom image application creation using the AWS CLI:
aws emr-serverless create-application \
--name gsprocessing-0.2.1 \
- --release-label emr-6.11.0 \
+ --release-label emr-6.13.0 \
--type SPARK \
--image-configuration '{
- "imageUri": ".dkr.ecr..amazonaws.com/graphstorm-processing-emr-serverless:0.2.1"
+ "imageUri": ".dkr.ecr..amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-"
}'
-Here you will need to replace ```` and ```` with the correct values
-from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.11.0`` as its
+Here you will need to replace ````, ```` (``x86_64`` or ``arm64``), and ```` with the correct values
+from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.
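As a sketch of how the image URI is composed, the hypothetical helper below follows the ``${IMAGE_NAME}-${EXEC_ENV}:${VERSION}-${ARCH}`` tag convention used by the build script in this change. The function name and argument order are assumptions made for illustration and are not part of GSProcessing.

```shell
# Hypothetical helper (not part of GSProcessing): compose an ECR image URI
# using the "<image>-<environment>:<version>-<arch>" convention.
# Assumption: the image name is the default 'graphstorm-processing'.
gsp_image_uri() {
    local account="$1" region="$2" exec_env="$3" version="$4" arch="${5:-x86_64}"

    # Only the two architectures accepted by the build script are allowed.
    case "${arch}" in
        x86_64 | arm64) ;;
        *) echo "unsupported architecture: ${arch}" >&2; return 1 ;;
    esac

    # Mirror the build script's restriction: no arm64 images for SageMaker.
    if [ "${exec_env}" = "sagemaker" ] && [ "${arch}" = "arm64" ]; then
        echo "arm64 architecture is not supported for SageMaker" >&2
        return 1
    fi

    echo "${account}.dkr.ecr.${region}.amazonaws.com/graphstorm-processing-${exec_env}:${version}-${arch}"
}

# Example usage with the placeholder account ID used elsewhere in these docs:
# IMAGE_URI=$(gsp_image_uri "1234567890" "us-west-2" "emr-serverless" "0.2.1" "arm64")
```

Validating the architecture before composing the URI keeps a mistyped tag from surfacing later as a failed image pull at application creation time.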
@@ -234,7 +234,7 @@ and building the GSProcessing SageMaker ECR image:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
diff --git a/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu b/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
index 267f986358..8ef9d7bca6 100644
--- a/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
+++ b/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
@@ -1,4 +1,7 @@
-FROM public.ecr.aws/emr-serverless/spark/emr-6.11.0:20230629-x86_64 as runtime
+ARG ARCH=x86_64
+FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
+FROM base as runtime
+
USER root
ENV PYTHON_VERSION=3.9.18
diff --git a/graphstorm-processing/docker/build_gsprocessing_image.sh b/graphstorm-processing/docker/build_gsprocessing_image.sh
index 7ecf1e3094..4c53f74416 100644
--- a/graphstorm-processing/docker/build_gsprocessing_image.sh
+++ b/graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -16,6 +16,8 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info (set -x)
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
+-a, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
+ Note that only x86_64 architecture is supported for SageMaker.
-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is the current directory.
-i, --image Docker image name, default is 'graphstorm-processing'.
@@ -43,6 +45,7 @@ parse_params() {
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
TARGET='test'
+ ARCH='x86_64'
while :; do
case "${1-}" in
@@ -57,6 +60,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
+ -a | --architecture)
+ ARCH="${2-}"
+ shift
+ ;;
-p | --path)
GSP_HOME="${2-}"
shift
@@ -103,15 +110,20 @@ else
die "--target parameter needs to be one of 'prod' or 'test', got ${TARGET}"
fi
-if [[ ${EXEC_ENV} == "sagemaker" || ${EXEC_ENV} == "emr-serverless" ]]; then
+if [[ ${ARCH} == "x86_64" || ${ARCH} == "arm64" ]]; then
: # Do nothing
else
- die "--environment parameter needs to be one of 'emr-serverless' or 'sagemaker', got ${EXEC_ENV}"
+ die "--architecture parameter needs to be one of 'arm64' or 'x86_64', got ${ARCH}"
+fi
+
+if [[ ${EXEC_ENV} == "sagemaker" && ${ARCH} == "arm64" ]]; then
+ die "arm64 architecture is not supported for SageMaker"
fi
# script logic here
msg "Execution parameters:"
msg "- ENVIRONMENT: ${EXEC_ENV}"
+msg "- ARCHITECTURE: ${ARCH}"
msg "- TARGET: ${TARGET}"
msg "- GSP_HOME: ${GSP_HOME}"
msg "- IMAGE_NAME: ${IMAGE_NAME}"
@@ -139,7 +151,7 @@ cp ${GSP_HOME}/docker-entry.sh "${BUILD_DIR}/docker/code/"
poetry export -f requirements.txt --output "${BUILD_DIR}/docker/requirements.txt"
# Set image name
-DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}"
+DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}-${ARCH}"
# Login to ECR to be able to pull source SageMaker image
if [[ ${EXEC_ENV} == "sagemaker" ]]; then
@@ -147,10 +159,8 @@ if [[ ${EXEC_ENV} == "sagemaker" ]]; then
| docker login --username AWS --password-stdin 153931337802.dkr.ecr.us-west-2.amazonaws.com
else
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
- # aws ecr get-login-password --region us-west-2 \
- # | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
fi
echo "Build a Docker image ${DOCKER_FULLNAME}"
-DOCKER_BUILDKIT=1 docker build -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
- "${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET}
+DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
+ "${BUILD_DIR}/docker/" -t "${DOCKER_FULLNAME}" --target "${TARGET}" --build-arg ARCH="${ARCH}"
diff --git a/graphstorm-processing/docker/push_gsprocessing_image.sh b/graphstorm-processing/docker/push_gsprocessing_image.sh
index 5d6753d083..eaab38876a 100644
--- a/graphstorm-processing/docker/push_gsprocessing_image.sh
+++ b/graphstorm-processing/docker/push_gsprocessing_image.sh
@@ -16,6 +16,7 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info
-e, --environment Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
+-c, --architecture Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-r, --region AWS Region to which we'll push the image. By default will get from aws-cli configuration.
@@ -43,6 +44,7 @@ parse_params() {
REGION=$(aws configure get region)
REGION=${REGION:-us-west-2}
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
+ ARCH='x86_64'
while :; do
@@ -54,6 +56,10 @@ parse_params() {
EXEC_ENV="${2-}"
shift
;;
+ -c | --architecture)
+ ARCH="${2-}"
+ shift
+ ;;
-i | --image)
IMAGE="${2-}"
shift
@@ -98,13 +104,14 @@ fi
# script logic here
msg "Execution parameters: "
msg "- ENVIRONMENT: ${EXEC_ENV}"
+msg "- ARCHITECTURE: ${ARCH}"
msg "- IMAGE: ${IMAGE}"
msg "- VERSION: ${VERSION}"
msg "- REGION: ${REGION}"
msg "- ACCOUNT: ${ACCOUNT}"
-SUFFIX="${VERSION}"
-LATEST_SUFFIX="latest"
+SUFFIX="${VERSION}-${ARCH}"
+LATEST_SUFFIX="latest-${ARCH}"
IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"
diff --git a/graphstorm-processing/pyproject.toml b/graphstorm-processing/pyproject.toml
index 7bb87f2752..16a5749533 100644
--- a/graphstorm-processing/pyproject.toml
+++ b/graphstorm-processing/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "graphstorm_processing"
-version = "0.1.0"
+version = "0.2.1"
description = "Distributed graph pre-processing for GraphStorm"
readme = "README.md"
packages = [{include = "graphstorm_processing"}]
@@ -10,7 +10,7 @@ authors = [
[tool.poetry.dependencies]
python = "~3.9.12"
-pyspark = "~3.3.0"
+pyspark = ">=3.3.0, < 3.5.0"
pyarrow = "~13.0.0"
spacy = "3.6.0"
boto3 = "~1.28.1"
diff --git a/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json b/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
index 48b3b2deb8..1ea789b69b 100644
--- a/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
+++ b/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
@@ -21,7 +21,7 @@
],
"separator": ","
},
- "type": "movies",
+ "type": "movie",
"column": "~id"
},
{