From d6541465b39efd8ca0430fc08aeb0bc363107558 Mon Sep 17 00:00:00 2001
From: Theodore Vasiloudis
Date: Wed, 14 Feb 2024 00:46:38 +0000
Subject: [PATCH] [GSProcessing] Documentation updates for 0.2.2

---
 .../gs-processing/usage/amazon-sagemaker.rst    |  6 ++---
 .../usage/distributed-processing-setup.rst      |  8 +++----
 .../gs-processing/usage/emr-serverless.rst      | 12 +++++-----
 docs/source/gs-processing/usage/example.rst     | 22 +++++++++----------
 .../usage/row-count-alignment.rst               |  4 ++--
 graphstorm-processing/docker/README.md          |  3 ++-
 .../docker/build_gsprocessing_image.sh          |  2 +-
 7 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/docs/source/gs-processing/usage/amazon-sagemaker.rst b/docs/source/gs-processing/usage/amazon-sagemaker.rst
index 624025914f..fb1d82b898 100644
--- a/docs/source/gs-processing/usage/amazon-sagemaker.rst
+++ b/docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -43,9 +43,9 @@ job, followed by the re-partitioning job, both on SageMaker:
     CONFIG_FILE="gconstruct-config.json"
     INSTANCE_COUNT="2"
     INSTANCE_TYPE="ml.t3.xlarge"
-    NUM_FILES="4"
+    NUM_FILES="-1"

-    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
     ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"

     OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
@@ -84,7 +84,7 @@ You can see that we provided a parameter named ``--num-output-files`` to
 ``run_distributed_processing.py``. This is an important parameter, as it
 provides a hint to set the parallelism for Spark.

-It can safely be skipped and let Spark decide the proper value based on the cluster's
+It can safely be skipped (or set to ``-1``) to let Spark decide the proper value based on the cluster's
 instance type and count.
 If setting it yourself a good value to use is
 ``num_instances * num_cores_per_instance * 2``, which will ensure good
 utilization of the cluster resources.
diff --git a/docs/source/gs-processing/usage/distributed-processing-setup.rst b/docs/source/gs-processing/usage/distributed-processing-setup.rst
index 053837f1d3..5083e927ce 100644
--- a/docs/source/gs-processing/usage/distributed-processing-setup.rst
+++ b/docs/source/gs-processing/usage/distributed-processing-setup.rst
@@ -179,7 +179,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you
     (more than 20 minutes to build the GSProcessing ``arm64`` image).
     After the first build, follow up builds that only change the GSProcessing code
     will be less than a minute thanks to Docker's caching.
-    To speed up the build process you can build on an ARM instances,
+    To speed up the build process you can build on an ARM-native instance,
     look into using ``buildx`` with multiple native nodes, or use cross-compilation.
     See `the official Docker documentation `_ for details.
@@ -199,7 +199,7 @@ and push the image tagged with the latest version of GSProcessing.
 The script supports 4 optional arguments:

 1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-``
-2. Image tag. (``-v/--version``) Default: ```` e.g. ``0.2.1``.
+2. Image tag. (``-v/--version``) Default: ```` e.g. ``0.2.2``.
 3. ECR region. (``-r/--region``) Default: ``us-west-2``.
 4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by
    the ``aws-cli``.
@@ -207,14 +207,14 @@ Example:

 .. code-block:: bash

-    bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+    bash docker/push_gsprocessing_image.sh -e sagemaker -r "us-west-2" -a "1234567890"

 To push an EMR Serverless ``arm64`` image you'd similarly run:

 .. code-block:: bash

     bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
-        -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+        -r "us-west-2" -a "1234567890"

 .. _gsp-upload-data-ref:
diff --git a/docs/source/gs-processing/usage/emr-serverless.rst b/docs/source/gs-processing/usage/emr-serverless.rst
index e399b9872e..67977826c9 100644
--- a/docs/source/gs-processing/usage/emr-serverless.rst
+++ b/docs/source/gs-processing/usage/emr-serverless.rst
@@ -87,15 +87,15 @@ Here we will just show the custom image application creation using the AWS CLI:

 .. code-block:: bash

     aws emr-serverless create-application \
-        --name gsprocessing-0.2.1 \
+        --name gsprocessing-0.2.2 \
         --release-label emr-6.13.0 \
         --type SPARK \
         --image-configuration '{
-            "imageUri": ".dkr.ecr..amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-"
+            "imageUri": ".dkr.ecr..amazonaws.com/graphstorm-processing-emr-serverless:0.2.2-"
         }'

 Here you will need to replace ````, ```` (``x86_64`` or ``arm64``), and ```` with the correct values
-from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
+from the image you just created. GSProcessing version ``0.2.2`` uses ``emr-6.13.0`` as its
 base image, so we need to ensure our application uses the same release.
 Additionally, if it is required to use text feature transformation with Huggingface model,
 it is suggested to download the model cache inside the emr-serverless
@@ -179,7 +179,7 @@ as described in :ref:`gsp-upload-data-ref`.
     OUTPUT_BUCKET=${MY_BUCKET}
     GRAPH_NAME="small-graph"
     CONFIG_FILE="gconstruct-config.json"
-    NUM_FILES="4"
+    NUM_FILES="-1"

     GSP_HOME="enter/path/to/graphstorm/graphstorm-processing/"
     LOCAL_ENTRY_POINT=$GSP_HOME/graphstorm_processing/distributed_executor.py
@@ -240,7 +240,7 @@ and building the GSProcessing SageMaker ECR image:

     bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}

     SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
-    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
     ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
     INSTANCE_TYPE="ml.t3.xlarge"
@@ -253,7 +253,7 @@ Note that ``${OUTPUT_PREFIX}`` here will need to match the value assigned when launching
 the EMR-S job, i.e. ``"s3://${OUTPUT_BUCKET}/gsprocessing/emr-s/small-graph/4files/"``

 For more details on the re-partitioning step see
-::doc:`row-count-alignment`.
+:doc:`row-count-alignment`.

 Examine the output
 ------------------
diff --git a/docs/source/gs-processing/usage/example.rst b/docs/source/gs-processing/usage/example.rst
index 6fe435f355..2aef802277 100644
--- a/docs/source/gs-processing/usage/example.rst
+++ b/docs/source/gs-processing/usage/example.rst
@@ -95,7 +95,7 @@ The contents of the ``gconstruct-config.json`` can be:
     "edges" : [
         {
             # Note that the file is a relative path
-            "files": ["edges/movie-included_in-genre.csv"],
+            "files": ["edge_data/movie-included_in-genre.csv"],
             "format": {
                 "name": "csv",
                 "separator" : ","
@@ -130,22 +130,24 @@ file:

     > python run_distributed_processing.py --input-data s3://my-bucket/data \
         --config-filename gconstruct-config.json

-Node files are optional
-^^^^^^^^^^^^^^^^^^^^^^^
+Node files are optional (but recommended)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 GSProcessing does not require node files to be provided for
-every node type. If a node type appears in one of the edges,
+every node type. For any node type that appears as source or destination in one of the edges,
 its unique node identifiers will be determined by the edge files.
-In the example GConstruct file above (`gconstruct-config.json`), the node ids for the node types
-``movie`` and ``genre`` will be extracted from the edge list provided.
+However, this is an expensive operation, so if you know your node ID
+space from the start, we recommend providing node input files for each
+node type. You can also have a mix of some node types being provided
+and others inferred from the edges.

 Example data and configuration
 ------------------------------

 For this example we use a small heterogeneous graph inspired by the Movielens
 dataset. You can see the configuration file under
-``graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``
+``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``

 We have 4 node types, ``movie``, ``genre``, ``director``, and ``user``. The graph
 has 3 edge types, ``movie:included_in:genre``, ``user:rated:movie``, and
 ``director:directed:movie``.
@@ -166,9 +168,6 @@ to process the data and create the output on our local storage.

 We will provide an input and output prefix for our data, passing
 local paths to the script.

-We also provide the argument ``--num-output-files`` that instructs PySpark
-to try and create output with 4 partitions [#f1]_.
-
 Assuming our working directory is ``graphstorm/graphstorm-processing/``
 we can use the following command to run the processing job locally:
@@ -176,8 +175,7 @@ we can use the following command to run the processing job locally:

 .. code-block:: bash

     gs-processing --config-filename gconstruct-config.json \
         --input-prefix ./tests/resources/small_heterogeneous_graph \
-        --output-prefix /tmp/gsprocessing-example/ \
-        --num-output-files 4
+        --output-prefix /tmp/gsprocessing-example/

 To finalize processing and to wrangle the data into the structure that
diff --git a/docs/source/gs-processing/usage/row-count-alignment.rst b/docs/source/gs-processing/usage/row-count-alignment.rst
index 4ff8d8aaab..b256ef6ff7 100644
--- a/docs/source/gs-processing/usage/row-count-alignment.rst
+++ b/docs/source/gs-processing/usage/row-count-alignment.rst
@@ -96,7 +96,7 @@ on SageMaker:

     bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}

     SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
-    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+    IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
     ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
     INSTANCE_TYPE="ml.t3.xlarge"
@@ -137,7 +137,7 @@ The file streaming implementation will hold at most 2 files worth of data
 in memory, so by choosing an appropriate file number when processing you
 should be able to process any data size.

-.. note:: text
+.. note::

     The file streaming implementation will be much slower than the in-memory
     one, so only use in case no instance size can handle your data.
diff --git a/graphstorm-processing/docker/README.md b/graphstorm-processing/docker/README.md
index 1e78ab613f..8e829c7d7c 100644
--- a/graphstorm-processing/docker/README.md
+++ b/graphstorm-processing/docker/README.md
@@ -7,7 +7,7 @@ To build the image you will use `build_gsprocessing_image.sh` and
 to push it to ECR you will use `push_gsprocessing_image.sh`.

 For a tutorial on building and pushing the images to ECR to use
-with Amazon SageMaker see docs/source/usage/distributed-processing-setup.rst.
+with Amazon SageMaker see https://graphstorm.readthedocs.io/en/latest/gs-processing/usage/distributed-processing-setup.html.

 ## Building the image
@@ -33,6 +33,7 @@ You can get the other parameters of the script using
   `graphstorm-processing-${ENVIRONMENT}:${VERSION}-${ARCH}-test`.
 * `-t, --target` Target of the image. Use `test` if you intend to use the image for testing
   new library functionality, otherwise `prod`. Default: `prod`
+* `-m, --hf-model` When provided with a valid Huggingface model name, the model will be included in the image.

 ## Pushing the image
diff --git a/graphstorm-processing/docker/build_gsprocessing_image.sh b/graphstorm-processing/docker/build_gsprocessing_image.sh
index e92edfebf5..5b53ace508 100644
--- a/graphstorm-processing/docker/build_gsprocessing_image.sh
+++ b/graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -24,7 +24,7 @@ Available options:
 -v, --version Docker version tag, default is the library's current version (`poetry version --short`)
 -s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
 -b, --build Docker build directory prefix, default is '/tmp/'.
--m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "".
+-m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "", meaning no model is included.
 EOF
 exit
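A note for reviewers on the ``num_instances * num_cores_per_instance * 2`` heuristic referenced in ``amazon-sagemaker.rst``: the arithmetic can be sanity-checked with a small shell sketch. The variable names below are illustrative only (they are not GSProcessing parameters), and the core count assumes the ``ml.t3.xlarge`` instance type used in the docs' examples (4 vCPUs).

```shell
# Sketch of the --num-output-files heuristic from the docs:
#   num_instances * num_cores_per_instance * 2
# Variable names here are examples, not GSProcessing parameters.
INSTANCE_COUNT=2        # matches INSTANCE_COUNT="2" in the example
CORES_PER_INSTANCE=4    # ml.t3.xlarge provides 4 vCPUs
NUM_FILES=$((INSTANCE_COUNT * CORES_PER_INSTANCE * 2))
echo "Suggested --num-output-files: ${NUM_FILES}"
```

With the values above this suggests 16 output files; passing ``-1`` instead (the new default in this patch) lets Spark choose the parallelism itself.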