[GSProcessing] Add support for EMR Serverless execution #613

Merged (10 commits) on Nov 8, 2023
docs/source/gs-processing/developer/input-configuration.rst (4 additions, 3 deletions)
@@ -381,8 +381,9 @@ arguments.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``normalizer`` (String, optional): Applies a normalization to the data, after imputation.
Can take the following values:

- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
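In formula form, the ``min-max`` option described above rescales each value into the ``[0, 1]`` range:

.. math::

   x' = \frac{x - \min(x)}{\max(x) - \min(x)}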
@@ -399,7 +400,7 @@ arguments.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
apply no imputation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
docs/source/gs-processing/gs-processing-getting-started.rst (19 additions, 12 deletions)
@@ -138,34 +138,41 @@ make sure the produced data matches the assumptions of DGL [#f1]_.

.. code-block:: bash

gs-repartition \
--input-prefix /path/to/output/data
gs-repartition --input-prefix /path/to/output/data

Once this script completes, the data are ready to be fed into DGL's distributed
partitioning pipeline.
See `this guide <https://github.com/awslabs/graphstorm/blob/main/sagemaker/README.md#launch-graph-partitioning-task>`_
for more details on how to use GraphStorm distributed partitioning on SageMaker.
See `this guide <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
for more details on how to use GraphStorm distributed partitioning and training on SageMaker.

See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm/DGL
partitioning pipeline.
wrangle data into a format that's ready to be consumed by the GraphStorm
distributed training pipeline.


Using with Amazon SageMaker
---------------------------
Running on AWS resources
------------------------

To run distributed jobs on Amazon SageMaker we will have to build a Docker image
GSProcessing supports Amazon SageMaker and EMR Serverless as execution environments.
To run distributed jobs on AWS resources we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup` and run a SageMaker Processing
job which we describe in :doc:`usage/amazon-sagemaker`.
job, which we describe in :doc:`usage/amazon-sagemaker`, or an EMR Serverless
job, which is covered in :doc:`usage/emr-serverless`.
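At a high level, a run on either environment follows the same sequence: build the image for the target environment, push it to ECR, then launch the job with the corresponding launch script. A minimal sketch, assuming ``graphstorm/graphstorm-processing`` is the current working directory (each step is covered in detail in the guides linked above):

.. code-block:: bash

   # 1. Build the image for the chosen execution environment (sagemaker or emr-serverless)
   bash docker/build_gsprocessing_image.sh --environment sagemaker

   # 2. Push the image to Amazon ECR (creates the repository if needed)
   bash docker/push_gsprocessing_image.sh -e sagemaker

   # 3. Launch the processing job using the launch script for the chosen environment,
   #    e.g. scripts/run_distributed_processing.py for Amazon SageMaker.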


Input configuration
-------------------

GSProcessing supports both the GConstruct JSON configuration format
and its own GSProcessing config. You can learn about the
GSProcessing JSON configuration in :doc:`developer/input-configuration`.


Developer guide
---------------

To get started with developing the package refer to :doc:`developer/developer-guide`.
To see the input configuration format that GSProcessing uses internally see
:doc:`developer/input-configuration`.


.. rubric:: Footnotes
docs/source/gs-processing/usage/amazon-sagemaker.rst (11 additions, 47 deletions)
@@ -1,43 +1,15 @@
Running distributed jobs on Amazon SageMaker
============================================

Once the :doc:`distributed processing setup <distributed-processing-setup>` is complete, we can
Once the :doc:`Amazon SageMaker setup <distributed-processing-setup>` is complete, we can
use the Amazon SageMaker launch scripts to launch distributed processing
jobs that use AWS resources.

To demonstrate the usage of GSProcessing on Amazon SageMaker, we will execute the same job we used in our local
execution example, but this time use Amazon SageMaker to provide the compute resources instead of our
local machine.

Upload data to S3
-----------------

Amazon SageMaker uses S3 as its storage target, so before starting
we'll need to upload our test data to S3. To do so you will need
to have read/write access to an S3 bucket, and the requisite AWS credentials
and permissions.

We will use the AWS CLI to upload data so make sure it is
`installed <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html>`_
and `configured <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
in your local environment.

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the test data to S3 using:

.. code-block:: bash

MY_BUCKET="enter-your-bucket-name-here"
REGION="bucket-region" # e.g. us-west-2
aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
"${MY_BUCKET}/gsprocessing-input"

.. note::

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
you pushed in :doc:`distributed-processing-setup`.

Before starting, make sure you have uploaded the input data as described in :ref:`gsp-upload-data-ref`.

Launch the GSProcessing job on Amazon SageMaker
-----------------------------------------------
@@ -49,7 +21,7 @@ to run a GSProcessing job on Amazon SageMaker.
For this example we'll use a SageMaker Spark cluster with 2 ``ml.t3.xlarge`` instances
since this is a tiny dataset. Using SageMaker you'll be able to create clusters
of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like `ml.r5.24xlarge`.
using larger instances like ``ml.r5.24xlarge``.

Since we're now executing on AWS, we'll need access to an execution role
for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`.
@@ -65,39 +37,30 @@ job, followed by the re-partitioning job, both on SageMaker:
MY_BUCKET="enter-your-bucket-name-here"
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
REGION="bucket-region" # e.g. us-west-2
DATASET_S3_PATH="s3://${MY_BUCKET}/gsprocessing-input"
INPUT_PREFIX="s3://${MY_BUCKET}/gsprocessing-input"
OUTPUT_BUCKET=${MY_BUCKET}
DATASET_NAME="small-graph"
GRAPH_NAME="small-graph"
CONFIG_FILE="gconstruct-config.json"
INSTANCE_COUNT="2"
INSTANCE_TYPE="ml.t3.xlarge"
NUM_FILES="4"

IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing:0.1.0"
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"

OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/${DATASET_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"

# Conditionally delete data at output
echo "Delete all data under output path? ${OUTPUT_PREFIX}"
select yn in "Yes" "No"; do
case $yn in
Yes ) aws s3 rm --recursive ${OUTPUT_PREFIX} --quiet; break;;
No ) break;;
esac
done
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"

# This will run and block until the GSProcessing job is done
python scripts/run_distributed_processing.py \
--s3-input-prefix ${DATASET_S3_PATH} \
--s3-input-prefix ${INPUT_PREFIX} \
--s3-output-prefix ${OUTPUT_PREFIX} \
--role ${ROLE} \
--image ${IMAGE_URI} \
--region ${REGION} \
--config-filename ${CONFIG_FILE} \
--instance-count ${INSTANCE_COUNT} \
--instance-type ${INSTANCE_TYPE} \
--job-name "${DATASET_NAME}-${INSTANCE_COUNT}x-${INSTANCE_TYPE//./-}-${NUM_FILES}files" \
--job-name "${GRAPH_NAME}-${INSTANCE_COUNT}x-${INSTANCE_TYPE//./-}-${NUM_FILES}files" \
--num-output-files ${NUM_FILES} \
--wait-for-job
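
Because ``--wait-for-job`` makes the launch script block until the processing job completes, you may want to monitor progress from a separate shell. A small sketch using the AWS CLI (the name filter below matches the ``--job-name`` passed above):

.. code-block:: bash

   # List recent SageMaker Processing jobs whose name contains the graph name
   aws --region ${REGION} sagemaker list-processing-jobs --name-contains ${GRAPH_NAME}

   # Inspect the status of a specific job once you know its full name
   aws --region ${REGION} sagemaker describe-processing-job \
       --processing-job-name "enter-the-full-job-name-here"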

@@ -150,5 +113,6 @@ Run distributed partitioning and training on Amazon SageMaker
-------------------------------------------------------------

With the data now processed you can follow the
`GraphStorm Amazon SageMaker guide <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_
`GraphStorm Amazon SageMaker guide
<https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html#run-graphstorm-on-sagemaker>`_
to partition your data and run training on AWS.
docs/source/gs-processing/usage/distributed-processing-setup.rst (67 additions, 16 deletions)
@@ -1,5 +1,5 @@
GraphStorm Processing setup for Amazon SageMaker
================================================
GraphStorm Processing Distributed Setup
=======================================

In this guide we'll demonstrate how to prepare your environment to run
GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker.
@@ -15,7 +15,7 @@ The steps required are:
- Set up AWS access.
- Build the GraphStorm Processing image using Docker.
- Push the image to the Amazon Elastic Container Registry (ECR).
- Launch a SageMaker Processing job using the example scripts.
- Launch a SageMaker Processing or EMR Serverless job using the example scripts.

Clone the GraphStorm repository
-------------------------------
Expand Down Expand Up @@ -87,17 +87,25 @@ Once Docker and Poetry are installed, and your AWS credentials are set up,
we can use the provided scripts
in the ``graphstorm-processing/docker`` directory to build the image.

GSProcessing supports Amazon SageMaker and EMR Serverless as
execution environments, so we need to choose which image we want
to build first.

The ``build_gsprocessing_image.sh`` script can build the image
locally and tag it. For example, assuming our current directory is where
we cloned ``graphstorm/graphstorm-processing``:
locally and tag it for the intended execution environment,
specified using the ``-e/--environment`` argument. The supported environments
are ``sagemaker`` and ``emr-serverless``.
For example, assuming our current directory is where
we cloned ``graphstorm/graphstorm-processing``, we can use
the following to build the SageMaker image:

.. code-block:: bash

bash docker/build_gsprocessing_image.sh
bash docker/build_gsprocessing_image.sh --environment sagemaker

The above will use the Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing:${VERSION}`` where
``${VERSION}`` will be the latest available GSProcessing version (e.g. ``0.1.0``).
The above will use the SageMaker-specific Dockerfile of the latest available GSProcessing version,
build an image and tag it as ``graphstorm-processing-sagemaker:${VERSION}`` where
``${VERSION}`` will be the latest available GSProcessing version (e.g. ``0.2.1``).
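
To build the EMR Serverless variant instead, only the environment argument changes; following the same naming pattern, the resulting image would be tagged ``graphstorm-processing-emr-serverless:${VERSION}``:

.. code-block:: bash

   bash docker/build_gsprocessing_image.sh --environment emr-serverless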

The script also supports other arguments to customize the image name,
tag and other aspects of the build. See ``bash docker/build_gsprocessing_image.sh --help``
@@ -109,28 +117,71 @@ Push the image to the Amazon Elastic Container Registry (ECR)
Once the image is built we can use the ``push_gsprocessing_image.sh`` script
that will create an ECR repository if needed and push the image we just built.

The script does not require any arguments and by default will
create a repository named ``graphstorm-processing`` in the ``us-west-2`` region,
The script again requires us to provide the intended execution environment using
the ``-e/--environment`` argument,
and by default will create a repository named ``graphstorm-processing-<environment>`` in the ``us-west-2`` region,
on the default AWS account ``aws-cli`` is configured for,
and push the image tagged with the latest version of GSProcessing.

The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing``
2. Image tag. Default: (``-v/--version``) ``<latest_library_version>`` e.g. ``0.1.0``.
3. ECR region. Default: (``-r/--region``) ``us-west-2``.
1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-<environment>``
2. Image tag. (``-v/--version``) Default: ``<latest_library_version>`` e.g. ``0.2.1``.
3. ECR region. (``-r/--region``) Default: ``us-west-2``.
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash

bash push_gsprocessing_image.sh -i "graphstorm-processing" -v "0.1.0" -r "us-west-2" -a "1234567890"
bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
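
To push the EMR Serverless image instead, swap the environment argument; with no other options, the defaults listed above apply:

.. code-block:: bash

   bash docker/push_gsprocessing_image.sh -e emr-serverless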

.. _gsp-upload-data-ref:

Upload data to S3
-----------------

For distributed jobs we use S3 as our storage source and target, so before
running any example
we'll need to upload our data to S3. To do so you will need
to have read/write access to an S3 bucket, and the requisite AWS credentials
and permissions.

We will use the AWS CLI to upload data so make sure it is
`installed <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html>`_
and `configured <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
in your local environment.

Assuming ``graphstorm/graphstorm-processing`` is our current working
directory we can upload the data to S3 using:

.. code-block:: bash

MY_BUCKET="enter-your-bucket-name-here"
REGION="bucket-region" # e.g. us-west-2
aws --region ${REGION} s3 sync ./tests/resources/small_heterogeneous_graph/ \
"s3://${MY_BUCKET}/gsprocessing-input"

.. note::

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
you pushed.

Launch a SageMaker Processing job using the example scripts.
------------------------------------------------------------

Once the setup is complete, you can follow the
:doc:`SageMaker Processing job guide <amazon-sagemaker>`
to launch your distributed processing job using AWS resources.
to launch your distributed processing job using Amazon SageMaker resources.

Launch an EMR Serverless job using the example scripts.
------------------------------------------------------------

In addition to Amazon SageMaker, you can also use EMR Serverless
as an execution environment, allowing you to scale to even larger datasets
(recommended when your graph has 30B+ edges).
Its setup is more involved than Amazon SageMaker's, so we only recommend
it for experienced AWS users.
Follow the :doc:`EMR Serverless job guide <emr-serverless>`
to launch your distributed processing job using EMR Serverless resources.