Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSProcessing] Add support for EMR Serverless execution #613

Merged
merged 10 commits into from
Nov 8, 2023
11 changes: 6 additions & 5 deletions docs/source/gs-processing/developer/input-configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -381,20 +381,21 @@ arguments.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``normalizer`` (String, optional): Applies a normalization to the data, after imputation.
Can take the following values:

- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
apply no imputation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
Expand Down
31 changes: 19 additions & 12 deletions docs/source/gs-processing/gs-processing-getting-started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -138,34 +138,41 @@ make sure the produced data matches the assumptions of DGL [#f1]_.

.. code-block:: bash

gs-repartition \
--input-prefix /path/to/output/data
gs-repartition --input-prefix /path/to/output/data

Once this script completes, the data are ready to be fed into DGL's distributed
partitioning pipeline.
See `this guide <https://github.com/awslabs/graphstorm/blob/main/sagemaker/README.md#launch-graph-partitioning-task>`_
for more details on how to use GraphStorm distributed partitioning on SageMaker.
See `this guide <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
for more details on how to use GraphStorm distributed partitioning and training on SageMaker.

See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm/DGL
partitioning pipeline.
wrangle data into a format that's ready to be consumed by the GraphStorm
distributed training pipeline.


Using with Amazon SageMaker
---------------------------
Running on AWS resources
------------------------

To run distributed jobs on Amazon SageMaker we will have to build a Docker image
GSProcessing supports Amazon SageMaker and EMR Serverless as execution environments.
To run distributed jobs on AWS resources we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup` and run a SageMaker Processing
job which we describe in :doc:`usage/amazon-sagemaker`.
job which we describe in :doc:`usage/amazon-sagemaker`, or EMR Serverless
job that is covered in :doc:`usage/emr-serverless`.


Input configuration
-------------------

GSProcessing supports both the GConstruct JSON configuration format,
as well as its own GSProcessing config. You can learn about the
GSProcessing JSON configuration in :doc:`developer/input-configuration`.


Developer guide
---------------

To get started with developing the package refer to :doc:`developer/developer-guide`.
To see the input configuration format that GSProcessing uses internally see
:doc:`developer/input-configuration`.


.. rubric:: Footnotes
Expand Down
55 changes: 9 additions & 46 deletions docs/source/gs-processing/usage/amazon-sagemaker.rst
Original file line number Diff line number Diff line change
@@ -1,43 +1,15 @@
Running distributed jobs on Amazon SageMaker
============================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete, we can
Once the :doc:`Amazon SageMaker setup <distributed-processing-setup>` is complete, we can
use the Amazon SageMaker launch scripts to launch distributed processing
jobs that use AWS resources.

To demonstrate the usage of GSProcessing on Amazon SageMaker, we will execute the same job we used in our local
execution example, but this time use Amazon SageMaker to provide the compute resources instead of our
local machine.

Upload data to S3
-----------------

Amazon SageMaker uses S3 as its storage target, so before starting
we'll need to upload our test data to S3. To do so you will need
to have read/write access to an S3 bucket, and the requisite AWS credentials
and permissions.

We will use the AWS CLI to upload data so make sure it is
`installed <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html>`_
and `configured <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
in you local environment.

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the test data to S3 using:

.. code-block:: bash

MY_BUCKET="enter-your-bucket-name-here"
REGION="bucket-region" # e.g. us-west-2
aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
"${MY_BUCKET}/gsprocessing-input"

.. note::

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
you pushed in :doc:`distributed-processing-setup`.

Before starting make sure you have uploaded the input data as described in :ref:`gsp-upload-data-ref`.

Launch the GSProcessing job on Amazon SageMaker
-----------------------------------------------
Expand Down Expand Up @@ -65,39 +37,30 @@ job, followed by the re-partitioning job, both on SageMaker:
MY_BUCKET="enter-your-bucket-name-here"
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
REGION="bucket-region" # e.g. us-west-2
DATASET_S3_PATH="s3://${MY_BUCKET}/gsprocessing-input"
INPUT_PREFIX="s3://${MY_BUCKET}/gsprocessing-input"
OUTPUT_BUCKET=${MY_BUCKET}
DATASET_NAME="small-graph"
GRAPH_NAME="small-graph"
CONFIG_FILE="gconstruct-config.json"
INSTANCE_COUNT="2"
INSTANCE_TYPE="ml.t3.xlarge"
NUM_FILES="4"

IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing:0.1.0"
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"

OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/${DATASET_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"

# Conditionally delete data at output
echo "Delete all data under output path? ${OUTPUT_PREFIX}"
select yn in "Yes" "No"; do
case $yn in
Yes ) aws s3 rm --recursive ${OUTPUT_PREFIX} --quiet; break;;
No ) break;;
esac
done
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"

# This will run and block until the GSProcessing job is done
python scripts/run_distributed_processing.py \
--s3-input-prefix ${DATASET_S3_PATH} \
--s3-input-prefix ${INPUT_PREFIX} \
--s3-output-prefix ${OUTPUT_PREFIX} \
--role ${ROLE} \
--image ${IMAGE_URI} \
--region ${REGION} \
--config-filename ${CONFIG_FILE} \
--instance-count ${INSTANCE_COUNT} \
--instance-type ${INSTANCE_TYPE} \
--job-name "${DATASET_NAME}-${INSTANCE_COUNT}x-${INSTANCE_TYPE//./-}-${NUM_FILES}files" \
--job-name "${GRAPH_NAME}-${INSTANCE_COUNT}x-${INSTANCE_TYPE//./-}-${NUM_FILES}files" \
--num-output-files ${NUM_FILES} \
--wait-for-job

Expand Down Expand Up @@ -150,5 +113,5 @@ Run distributed partitioning and training on Amazon SageMaker
-------------------------------------------------------------

With the data now processed you can follow the
`GraphStorm Amazon SageMaker guide <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_
`GraphStorm Amazon SageMaker guide <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
to partition your data and run training on AWS.
74 changes: 62 additions & 12 deletions docs/source/gs-processing/usage/distributed-processing-setup.rst
Original file line number Diff line number Diff line change
Expand Up @@ -87,17 +87,25 @@ Once Docker and Poetry are installed, and your AWS credentials are set up,
we can use the provided scripts
in the ``graphstorm-processing/docker`` directory to build the image.

GSProcessing supports Amazon SageMaker and EMR Serverless as
execution environments, so we need to choose which image we want
to build first.

The ``build_gsprocessing_image.sh`` script can build the image
locally and tag it. For example, assuming our current directory is where
we cloned ``graphstorm/graphstorm-processing``:
locally and tag it, provided the intended execution environment,
using the ``-e/--environment`` argument. The supported environments
are ``sagemaker`` and ``emr-serverless``.
For example, assuming our current directory is where
we cloned ``graphstorm/graphstorm-processing``, we can use
the following to build the SageMaker image:

.. code-block:: bash

bash docker/build_gsprocessing_image.sh
bash docker/build_gsprocessing_image.sh --environment sagemaker

The above will use the Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing:${VERSION}`` where
``${VERSION}`` will take be the latest available GSProcessing version (e.g. ``0.1.0``).
The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
``${VERSION}`` will take be the latest available GSProcessing version (e.g. ``0.2.1``).

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
Expand All @@ -109,28 +117,70 @@ Push the image to the Amazon Elastic Container Registry (ECR)
Once the image is built we can use the ``push_gsprocessing_image.sh`` script
that will create an ECR repository if needed and push the image we just built.

The script does not require any arguments and by default will
create a repository named ``graphstorm-processing`` in the ``us-west-2`` region,
The script again requires us to provide the intended execution environment using
the ``-e/--environment`` argument,
and by default will create a repository named ``graphstorm-processing-<environment>`` in the ``us-west-2`` region,
on the default AWS account ``aws-cli`` is configured for,
and push the image tagged with the latest version of GSProcessing.

The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing``
2. Image tag. Default: (``-v/--version``) ``<latest_library_version>`` e.g. ``0.1.0``.
1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-<environment>``
2. Image tag. Default: (``-v/--version``) ``<latest_library_version>`` e.g. ``0.2.1``.
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
3. ECR region. Default: (``-r/--region``) ``us-west-2``.
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash

bash push_gsprocessing_image.sh -i "graphstorm-processing" -v "0.1.0" -r "us-west-2" -a "1234567890"
bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"

.. _gsp-upload-data-ref:

Upload data to S3
-----------------

For distributed jobs we use S3 as our storage source and target, so before
running any example
we'll need to upload our test data to S3. To do so you will need
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
to have read/write access to an S3 bucket, and the requisite AWS credentials
and permissions.

We will use the AWS CLI to upload data so make sure it is
`installed <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html>`_
and `configured <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
in you local environment.

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the test data to S3 using:
thvasilo marked this conversation as resolved.
Show resolved Hide resolved

.. code-block:: bash

MY_BUCKET="enter-your-bucket-name-here"
REGION="bucket-region" # e.g. us-west-2
aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
"s3://${MY_BUCKET}/gsprocessing-input"

.. note::

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
you pushed.

Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:doc:`SageMaker Processing job guide <amazon-sagemaker>`
to launch your distributed processing job using AWS resources.
to launch your distributed processing job using Amazon SageMaker resources.

Launch an EMR Serverless job using the example scripts.
------------------------------------------------------------

Amazon SageMaker has a limitation of cluster sizes of 20 instances.
If your data are larger (e.g. 30+ billion edges), you can use EMR Serverless
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
as an execution environment. Its setup is more involved so we only recommend
it for experienced AWS users.
Follow the :doc:`EMR Serverless job guide <emr-serverless>`
to launch your distributed processing job using EMR Serverless resources.
Loading
Loading