[GSProcessing] Documentation updates for 0.2.2

thvasilo committed Feb 14, 2024
1 parent 767e929 commit d654146
Showing 7 changed files with 28 additions and 29 deletions.
6 changes: 3 additions & 3 deletions docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -43,9 +43,9 @@ job, followed by the re-partitioning job, both on SageMaker:
CONFIG_FILE="gconstruct-config.json"
INSTANCE_COUNT="2"
INSTANCE_TYPE="ml.t3.xlarge"
- NUM_FILES="4"
+ NUM_FILES="-1"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
@@ -84,7 +84,7 @@ You can see that we provided a parameter named
``--num-output-files`` to ``run_distributed_processing.py``. This is an
important parameter, as it provides a hint to set the parallelism for Spark.

- It can safely be skipped and let Spark decide the proper value based on the cluster's
+ It can safely be skipped (or set to ``-1``) to let Spark decide the proper value based on the cluster's
instance type and count. If setting it yourself a good value to use is
``num_instances * num_cores_per_instance * 2``, which will ensure good
utilization of the cluster resources.
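As a sketch, the suggested value can be computed directly from the cluster shape (the numbers below are hypothetical, assuming 2 instances with 4 vCPUs each, e.g. ``ml.t3.xlarge``):

```shell
# Hypothetical sizing sketch: derive a --num-output-files value from the
# cluster shape, using the num_instances * num_cores_per_instance * 2 rule.
INSTANCE_COUNT=2            # assumed instance count
CORES_PER_INSTANCE=4        # e.g. an ml.t3.xlarge has 4 vCPUs
NUM_FILES=$(( INSTANCE_COUNT * CORES_PER_INSTANCE * 2 ))
echo "${NUM_FILES}"  # 16
```

With these assumed values the job would request 16 output files, giving each core roughly two partitions to work on.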
@@ -179,7 +179,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you
(more than 20 minutes to build the GSProcessing ``arm64`` image).
After the first build, follow up builds that only change the GSProcessing code
will be less than a minute thanks to Docker's caching.
- To speed up the build process you can build on an ARM instances,
+ To speed up the build process you can build on an ARM-native instance,
look into using ``buildx`` with multiple native nodes, or use cross-compilation.
See `the official Docker documentation <https://docs.docker.com/build/building/multi-platform/>`_
for details.
@@ -199,22 +199,22 @@ and push the image tagged with the latest version of GSProcessing.
The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-<environment>``
- 2. Image tag. (``-v/--version``) Default: ``<latest_library_version>`` e.g. ``0.2.1``.
+ 2. Image tag. (``-v/--version``) Default: ``<latest_library_version>`` e.g. ``0.2.2``.
3. ECR region. (``-r/--region``) Default: ``us-west-2``.
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash
- bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+ bash docker/push_gsprocessing_image.sh -e sagemaker -r "us-west-2" -a "1234567890"
To push an EMR Serverless ``arm64`` image you'd similarly run:

.. code-block:: bash
bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
- -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+ -r "us-west-2" -a "1234567890"
.. _gsp-upload-data-ref:

12 changes: 6 additions & 6 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -87,15 +87,15 @@ Here we will just show the custom image application creation using the AWS CLI:
.. code-block:: bash
aws emr-serverless create-application \
- --name gsprocessing-0.2.1 \
+ --name gsprocessing-0.2.2 \
--release-label emr-6.13.0 \
--type SPARK \
--image-configuration '{
- "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-<arch>"
+ "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.2-<arch>"
}'
Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``arm64``), and ``<region>`` with the correct values
- from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
+ from the image you just created. GSProcessing version ``0.2.2`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.

Additionally, if it is required to use text feature transformation with Huggingface model, it is suggested to download the model cache inside the emr-serverless
@@ -179,7 +179,7 @@ as described in :ref:`gsp-upload-data-ref`.
OUTPUT_BUCKET=${MY_BUCKET}
GRAPH_NAME="small-graph"
CONFIG_FILE="gconstruct-config.json"
- NUM_FILES="4"
+ NUM_FILES="-1"
GSP_HOME="enter/path/to/graphstorm/graphstorm-processing/"
LOCAL_ENTRY_POINT=$GSP_HOME/graphstorm_processing/distributed_executor.py
@@ -240,7 +240,7 @@ and building the GSProcessing SageMaker ECR image:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
@@ -253,7 +253,7 @@ Note that ``${OUTPUT_PREFIX}`` here will need to match the value assigned when launching
the EMR-S job, i.e. ``"s3://${OUTPUT_BUCKET}/gsprocessing/emr-s/small-graph/4files/"``

For more details on the re-partitioning step see
- ::doc:`row-count-alignment`.
+ :doc:`row-count-alignment`.

Examine the output
------------------
22 changes: 10 additions & 12 deletions docs/source/gs-processing/usage/example.rst
@@ -95,7 +95,7 @@ The contents of the ``gconstruct-config.json`` can be:
"edges" : [
{
# Note that the file is a relative path
- "files": ["edges/movie-included_in-genre.csv"],
+ "files": ["edge_data/movie-included_in-genre.csv"],
"format": {
"name": "csv",
"separator" : ","
@@ -130,22 +130,24 @@ file:
> python run_distributed_processing.py --input-data s3://my-bucket/data \
--config-filename gconstruct-config.json
- Node files are optional
- ^^^^^^^^^^^^^^^^^^^^^^^
+ Node files are optional (but recommended)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

GSProcessing does not require node files to be provided for
- every node type. If a node type appears in one of the edges,
+ every node type. Any node type that appears as source or destination in one of the edges,
its unique node identifiers will be determined by the edge files.

In the example GConstruct file above (``gconstruct-config.json``), the node IDs for the node types
``movie`` and ``genre`` will be extracted from the edge list provided.
However, this is an expensive operation, so if you know your node ID
space from the start we recommend providing node input files for each
node type. You can also have a mix of some node types being provided
and others inferred by the edges.
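For instance, a hypothetical ``nodes`` entry for the ``movie`` type might look like the sketch below (the file path and ID column name are assumptions for illustration, not taken from the example dataset):

```json
{
    "nodes": [
        {
            "node_type": "movie",
            "format": {"name": "csv", "separator": ","},
            "files": ["node_data/movie.csv"],
            "node_id_col": "movie_id"
        }
    ]
}
```

Providing such an entry lets GSProcessing read the node ID space directly instead of computing unique IDs from the edge files.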

Example data and configuration
------------------------------

For this example we use a small heterogeneous graph inspired by the Movielens dataset.
You can see the configuration file under
- ``graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``
+ ``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``

We have 4 node types, ``movie``, ``genre``, ``director``, and ``user``. The graph has 3
edge types, ``movie:included_in:genre``, ``user:rated:movie``, and ``director:directed:movie``.
@@ -166,18 +168,14 @@ to process the data and create the output on our local storage.
We will provide an input and output prefix for our data, passing
local paths to the script.

- We also provide the argument ``--num-output-files`` that instructs PySpark
- to try and create output with 4 partitions [#f1]_.

Assuming our working directory is ``graphstorm/graphstorm-processing/``
we can use the following command to run the processing job locally:

.. code-block:: bash
gs-processing --config-filename gconstruct-config.json \
--input-prefix ./tests/resources/small_heterogeneous_graph \
- --output-prefix /tmp/gsprocessing-example/ \
- --num-output-files 4
+ --output-prefix /tmp/gsprocessing-example/
To finalize processing and to wrangle the data into the structure that
4 changes: 2 additions & 2 deletions docs/source/gs-processing/usage/row-count-alignment.rst
@@ -96,7 +96,7 @@ on SageMaker:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
@@ -137,7 +137,7 @@ The file streaming implementation will hold at most 2 files worth of data
in memory, so by choosing an appropriate file number when processing you should
be able to process any data size.
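As a rough sketch (all numbers below are hypothetical), you can estimate the peak memory the streaming implementation would hold for a given file count, and pick the file count accordingly:

```shell
# Hypothetical sizing sketch: with at most 2 files held in memory,
# peak usage is roughly twice the per-file size.
TOTAL_DATA_GB=64   # assumed total size of the processed data
NUM_FILES=32       # assumed file count chosen during processing
PEAK_GB=$(( 2 * TOTAL_DATA_GB / NUM_FILES ))
echo "${PEAK_GB}"  # 4 (GB held in memory at peak)
```

Under these assumptions, 32 files keep peak usage around 4 GB; doubling the file count would roughly halve it.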

- .. note:: text
+ .. note::

The file streaming implementation will be much slower than the in-memory
one, so only use in case no instance size can handle your data.
3 changes: 2 additions & 1 deletion graphstorm-processing/docker/README.md
@@ -7,7 +7,7 @@ To build the image you will use `build_gsprocessing_image.sh` and to
push it to ECR you will use `push_gsprocessing_image.sh`.

For a tutorial on building and pushing the images to ECR to use
- with Amazon SageMaker see docs/source/usage/distributed-processing-setup.rst.
+ with Amazon SageMaker see https://graphstorm.readthedocs.io/en/latest/gs-processing/usage/distributed-processing-setup.html.

## Building the image

@@ -33,6 +33,7 @@ You can get the other parameters of the script using
`graphstorm-processing-${ENVIRONMENT}:${VERSION}-${ARCH}-test`.
* `-t, --target` Target of the image. Use `test` if you intend to use the image for testing
new library functionality, otherwise `prod`. Default: `prod`
+ * `-m, --hf-model` When provided with a valid Huggingface model name, will include it in the image.

## Pushing the image

2 changes: 1 addition & 1 deletion graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -24,7 +24,7 @@ Available options:
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory prefix, default is '/tmp/'.
- -m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "".
+ -m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "" (no model included).
EOF
exit
