diff --git a/docs/source/gs-processing/usage/amazon-sagemaker.rst b/docs/source/gs-processing/usage/amazon-sagemaker.rst
index 78621c4909..624025914f 100644
--- a/docs/source/gs-processing/usage/amazon-sagemaker.rst
+++ b/docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -45,7 +45,7 @@ job, followed by the re-partitioning job, both on SageMaker:
     INSTANCE_TYPE="ml.t3.xlarge"
     NUM_FILES="4"
 
-    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
     ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
 
     OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
diff --git a/docs/source/gs-processing/usage/distributed-processing-setup.rst b/docs/source/gs-processing/usage/distributed-processing-setup.rst
index d003b93579..261c0ce9a9 100644
--- a/docs/source/gs-processing/usage/distributed-processing-setup.rst
+++ b/docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -104,13 +104,70 @@ the following to build the SageMaker image:
     bash docker/build_gsprocessing_image.sh --environment sagemaker
 
 The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
-build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
+build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}-x86_64`` where
 ``${VERSION}`` will take be the latest available GSProcessing version (e.g. ``0.2.1``).
 
 The script also supports other arguments to customize the image name,
 tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
 for more information.
 
+Support for arm64 architecture
+------------------------------
+
+For EMR Serverless images, it is possible to build images that support ``arm64`` instances,
+which can lead to improved runtime and cost compared to ``x86_64``. You can build an ``arm64``
+image natively by installing Docker and following the above process on an ARM instance such
+as ``M6G`` or ``M7G``. See the `AWS documentation `_
+for instances powered by the Graviton processor.
+
+To build ``arm64`` images
+on an ``x86_64`` host you need to enable multi-platform builds for Docker. The easiest way
+to do so is to use QEMU emulation. To install the QEMU-related libraries you can run the following.
+
+On Ubuntu:
+
+.. code-block:: bash
+
+    sudo apt install -y qemu binfmt-support qemu-user-static
+
+On Amazon Linux/CentOS:
+
+.. code-block:: bash
+
+    sudo yum install -y qemu-system-arm qemu qemu-user qemu-kvm qemu-kvm-tools \
+        libvirt virt-install libvirt-python libguestfs-tools-c
+
+Finally, you need to ensure ``binfmt_misc`` is configured for different platforms by running:
+
+.. code-block:: bash
+
+    docker run --privileged --rm tonistiigi/binfmt --install all
+
+To verify your Docker installation is ready for multi-platform builds you can run:
+
+.. code-block:: bash
+
+    docker buildx ls
+
+    NAME/NODE    DRIVER/ENDPOINT STATUS  BUILDKIT     PLATFORMS
+    default *    docker
+      default    default         running v0.8+unknown linux/amd64, linux/arm64
+
+To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you can run:
+
+.. code-block:: bash
+
+    bash docker/build_gsprocessing_image.sh --environment emr-serverless --architecture arm64
+
+.. note::
+
+    Building images under emulation using QEMU can be significantly slower than native builds
+    (more than 20 minutes to build the GSProcessing ``arm64`` image).
+    To speed up the build process you can build on an ARM instance,
+    look into using ``buildx`` with multiple native nodes, or use cross-compilation.
+    See `the official Docker documentation `_
+    for details.
+
 Push the image to the Amazon Elastic Container Registry (ECR)
 -------------------------------------------------------------
 
@@ -136,6 +193,13 @@ Example:
 
     bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
 
+To push an EMR Serverless ``arm64`` image you'd similarly run:
+
+.. code-block:: bash
+
+    bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
+        -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+
 .. _gsp-upload-data-ref:
 
 Upload data to S3
diff --git a/docs/source/gs-processing/usage/emr-serverless.rst b/docs/source/gs-processing/usage/emr-serverless.rst
index adef4a4a05..35b54e9f1d 100644
--- a/docs/source/gs-processing/usage/emr-serverless.rst
+++ b/docs/source/gs-processing/usage/emr-serverless.rst
@@ -88,14 +88,14 @@ Here we will just show the custom image application creation using the AWS CLI:
 
     aws emr-serverless create-application \
         --name gsprocessing-0.2.1 \
-        --release-label emr-6.11.0 \
+        --release-label emr-6.13.0 \
         --type SPARK \
         --image-configuration '{
-            "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1"
+            "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-<arch>"
         }'
 
-Here you will need to replace ``<aws-account-id>`` and ``<region>`` with the correct values
-from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.11.0`` as its
+Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``arm64``), and ``<region>`` with the correct values
+from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
 base image, so we need to ensure our application uses the same release.
 
@@ -234,7 +234,7 @@ and building the GSProcessing SageMaker ECR image:
     bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
 
     SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
-    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
     ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
 
     INSTANCE_TYPE="ml.t3.xlarge"
diff --git a/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu b/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
index 267f986358..8ef9d7bca6 100644
--- a/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
+++ b/graphstorm-processing/docker/0.2.1/emr-serverless/Dockerfile.cpu
@@ -1,4 +1,7 @@
-FROM public.ecr.aws/emr-serverless/spark/emr-6.11.0:20230629-x86_64 as runtime
+ARG ARCH=x86_64
+FROM public.ecr.aws/emr-serverless/spark/emr-6.13.0:20230906-${ARCH} as base
+FROM base as runtime
+
 USER root
 ENV PYTHON_VERSION=3.9.18
 
diff --git a/graphstorm-processing/docker/build_gsprocessing_image.sh b/graphstorm-processing/docker/build_gsprocessing_image.sh
index 7ecf1e3094..4c53f74416 100644
--- a/graphstorm-processing/docker/build_gsprocessing_image.sh
+++ b/graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -16,6 +16,8 @@ Available options:
 -h, --help          Print this help and exit
 -x, --verbose       Print script debug info (set -x)
 -e, --environment   Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
+-a, --architecture  Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
+                    Note that only x86_64 architecture is supported for SageMaker.
 -t, --target        Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
 -p, --path          Path to graphstorm-processing directory, default is the current directory.
 -i, --image         Docker image name, default is 'graphstorm-processing'.
@@ -43,6 +45,7 @@ parse_params() {
     VERSION=`poetry version --short`
     BUILD_DIR='/tmp'
     TARGET='test'
+    ARCH='x86_64'
 
     while :; do
         case "${1-}" in
@@ -57,6 +60,10 @@ parse_params() {
             EXEC_ENV="${2-}"
             shift
             ;;
+        -a | --architecture)
+            ARCH="${2-}"
+            shift
+            ;;
         -p | --path)
             GSP_HOME="${2-}"
             shift
@@ -103,15 +110,20 @@ else
     die "--target parameter needs to be one of 'prod' or 'test', got ${TARGET}"
 fi
 
-if [[ ${EXEC_ENV} == "sagemaker" || ${EXEC_ENV} == "emr-serverless" ]]; then
+if [[ ${ARCH} == "x86_64" || ${ARCH} == "arm64" ]]; then
     : # Do nothing
 else
-    die "--environment parameter needs to be one of 'emr-serverless' or 'sagemaker', got ${EXEC_ENV}"
+    die "--architecture parameter needs to be one of 'arm64' or 'x86_64', got ${ARCH}"
+fi
+
+if [[ ${EXEC_ENV} == "sagemaker" && ${ARCH} == "arm64" ]]; then
+    die "arm64 architecture is not supported for SageMaker"
 fi
 
 # script logic here
 msg "Execution parameters:"
 msg "- ENVIRONMENT: ${EXEC_ENV}"
+msg "- ARCHITECTURE: ${ARCH}"
 msg "- TARGET: ${TARGET}"
 msg "- GSP_HOME: ${GSP_HOME}"
 msg "- IMAGE_NAME: ${IMAGE_NAME}"
@@ -139,7 +151,7 @@ cp ${GSP_HOME}/docker-entry.sh "${BUILD_DIR}/docker/code/"
 poetry export -f requirements.txt --output "${BUILD_DIR}/docker/requirements.txt"
 
 # Set image name
-DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}"
+DOCKER_FULLNAME="${IMAGE_NAME}-${EXEC_ENV}:${VERSION}-${ARCH}"
 
 # Login to ECR to be able to pull source SageMaker image
 if [[ ${EXEC_ENV} == "sagemaker" ]]; then
@@ -147,10 +159,8 @@ if [[ ${EXEC_ENV} == "sagemaker" ]]; then
         | docker login --username AWS --password-stdin 153931337802.dkr.ecr.us-west-2.amazonaws.com
 else
     aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
-    # aws ecr get-login-password --region us-west-2 \
-    # | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
 fi
 
 echo "Build a Docker image ${DOCKER_FULLNAME}"
-DOCKER_BUILDKIT=1 docker build -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
-    "${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET}
+DOCKER_BUILDKIT=1 docker build --platform "linux/${ARCH}" -f "${GSP_HOME}/docker/${VERSION}/${EXEC_ENV}/Dockerfile.cpu" \
+    "${BUILD_DIR}/docker/" -t $DOCKER_FULLNAME --target ${TARGET} --build-arg ARCH=${ARCH}
diff --git a/graphstorm-processing/docker/push_gsprocessing_image.sh b/graphstorm-processing/docker/push_gsprocessing_image.sh
index 5d6753d083..eaab38876a 100644
--- a/graphstorm-processing/docker/push_gsprocessing_image.sh
+++ b/graphstorm-processing/docker/push_gsprocessing_image.sh
@@ -16,6 +16,7 @@ Available options:
 -h, --help          Print this help and exit
 -x, --verbose       Print script debug info
 -e, --environment   Image execution environment. Must be one of 'emr-serverless' or 'sagemaker'. Required.
+-c, --architecture  Image architecture. Must be one of 'x86_64' or 'arm64'. Default is 'x86_64'.
 -i, --image         Docker image name, default is 'graphstorm-processing'.
 -v, --version       Docker version tag, default is the library's current version (`poetry version --short`)
 -r, --region        AWS Region to which we'll push the image. By default will get from aws-cli configuration.
@@ -43,6 +44,7 @@ parse_params() {
     REGION=$(aws configure get region)
     REGION=${REGION:-us-west-2}
     ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
+    ARCH='x86_64'
 
     while :; do
@@ -54,6 +56,10 @@ parse_params() {
             EXEC_ENV="${2-}"
             shift
             ;;
+        -c | --architecture)
+            ARCH="${2-}"
+            shift
+            ;;
         -i | --image)
             IMAGE="${2-}"
             shift
@@ -98,13 +104,14 @@ fi
 # script logic here
 msg "Execution parameters: "
 msg "- ENVIRONMENT: ${EXEC_ENV}"
+msg "- ARCHITECTURE: ${ARCH}"
 msg "- IMAGE: ${IMAGE}"
 msg "- VERSION: ${VERSION}"
 msg "- REGION: ${REGION}"
 msg "- ACCOUNT: ${ACCOUNT}"
 
-SUFFIX="${VERSION}"
-LATEST_SUFFIX="latest"
+SUFFIX="${VERSION}-${ARCH}"
+LATEST_SUFFIX="latest-${ARCH}"
 
 IMAGE_WITH_ENV="${IMAGE}-${EXEC_ENV}"
 
diff --git a/graphstorm-processing/pyproject.toml b/graphstorm-processing/pyproject.toml
index 7bb87f2752..16a5749533 100644
--- a/graphstorm-processing/pyproject.toml
+++ b/graphstorm-processing/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "graphstorm_processing"
-version = "0.1.0"
+version = "0.2.1"
 description = "Distributed graph pre-processing for GraphStorm"
 readme = "README.md"
 packages = [{include = "graphstorm_processing"}]
@@ -10,7 +10,7 @@ authors = [
 
 [tool.poetry.dependencies]
 python = "~3.9.12"
-pyspark = "~3.3.0"
+pyspark = ">=3.3.0, < 3.5.0"
 pyarrow = "~13.0.0"
 spacy = "3.6.0"
 boto3 = "~1.28.1"
diff --git a/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json b/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
index 48b3b2deb8..1ea789b69b 100644
--- a/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
+++ b/graphstorm-processing/tests/resources/small_heterogeneous_graph/gsprocessing-config.json
@@ -21,7 +21,7 @@
                 ],
                 "separator": ","
             },
-            "type": "movies",
+            "type": "movie",
             "column": "~id"
         },
         {