[GSProcessing] Documentation updates for 0.2.2

thvasilo committed Feb 14, 2024
1 parent 767e929 commit d654146
Showing 7 changed files with 28 additions and 29 deletions.
6 changes: 3 additions & 3 deletions docs/source/gs-processing/usage/amazon-sagemaker.rst
@@ -43,9 +43,9 @@ job, followed by the re-partitioning job, both on SageMaker:
CONFIG_FILE="gconstruct-config.json"
INSTANCE_COUNT="2"
INSTANCE_TYPE="ml.t3.xlarge"
- NUM_FILES="4"
+ NUM_FILES="-1"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
OUTPUT_PREFIX="s3://${OUTPUT_BUCKET}/gsprocessing/sagemaker/${GRAPH_NAME}/${INSTANCE_COUNT}x-${INSTANCE_TYPE}-${NUM_FILES}files/"
@@ -84,7 +84,7 @@ You can see that we provided a parameter named
``--num-output-files`` to ``run_distributed_processing.py``. This is an
important parameter, as it provides a hint to set the parallelism for Spark.

- It can safely be skipped and let Spark decide the proper value based on the cluster's
+ It can safely be skipped (or set to ``-1``) to let Spark decide the proper value based on the cluster's
instance type and count. If setting it yourself a good value to use is
``num_instances * num_cores_per_instance * 2``, which will ensure good
utilization of the cluster resources.
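As a sketch, the suggested value can be computed directly from the cluster shape (the numbers below are hypothetical, assuming 2 instances with 4 vCPUs each, e.g. ``ml.t3.xlarge``):

```shell
# Hypothetical sizing sketch: derive a --num-output-files value from the
# cluster shape, using the num_instances * num_cores_per_instance * 2 rule.
INSTANCE_COUNT=2            # assumed instance count
CORES_PER_INSTANCE=4        # e.g. an ml.t3.xlarge has 4 vCPUs
NUM_FILES=$(( INSTANCE_COUNT * CORES_PER_INSTANCE * 2 ))
echo "${NUM_FILES}"  # 16
```

With these assumed values the job would request 16 output files, giving each core roughly two partitions to work on.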
@@ -179,7 +179,7 @@ To build an EMR Serverless GSProcessing image for the ``arm64`` architecture you
(more than 20 minutes to build the GSProcessing ``arm64`` image).
After the first build, follow up builds that only change the GSProcessing code
will be less than a minute thanks to Docker's caching.
- To speed up the build process you can build on an ARM instances,
+ To speed up the build process you can build on an ARM-native instance,
look into using ``buildx`` with multiple native nodes, or use cross-compilation.
See `the official Docker documentation <https://docs.docker.com/build/building/multi-platform/>`_
for details.
@@ -199,22 +199,22 @@ and push the image tagged with the latest version of GSProcessing.
The script supports 4 optional arguments:

1. Image name/repository. (``-i/--image``) Default: ``graphstorm-processing-<environment>``
- 2. Image tag. (``-v/--version``) Default: ``<latest_library_version>`` e.g. ``0.2.1``.
+ 2. Image tag. (``-v/--version``) Default: ``<latest_library_version>`` e.g. ``0.2.2``.
3. ECR region. (``-r/--region``) Default: ``us-west-2``.
4. AWS Account ID. (``-a/--account``) Default: Uses the account ID detected by the ``aws-cli``.

Example:

.. code-block:: bash
- bash docker/push_gsprocessing_image.sh -e sagemaker -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+ bash docker/push_gsprocessing_image.sh -e sagemaker -r "us-west-2" -a "1234567890"
To push an EMR Serverless ``arm64`` image you'd similarly run:

.. code-block:: bash
bash docker/push_gsprocessing_image.sh -e emr-serverless --architecture arm64 \
- -i "graphstorm-processing" -v "0.2.1" -r "us-west-2" -a "1234567890"
+ -r "us-west-2" -a "1234567890"
.. _gsp-upload-data-ref:

12 changes: 6 additions & 6 deletions docs/source/gs-processing/usage/emr-serverless.rst
@@ -87,15 +87,15 @@ Here we will just show the custom image application creation using the AWS CLI:
.. code-block:: bash
aws emr-serverless create-application \
- --name gsprocessing-0.2.1 \
+ --name gsprocessing-0.2.2 \
--release-label emr-6.13.0 \
--type SPARK \
--image-configuration '{
- "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.1-<arch>"
+ "imageUri": "<aws-account-id>.dkr.ecr.<region>.amazonaws.com/graphstorm-processing-emr-serverless:0.2.2-<arch>"
}'
Here you will need to replace ``<aws-account-id>``, ``<arch>`` (``x86_64`` or ``arm64``), and ``<region>`` with the correct values
- from the image you just created. GSProcessing version ``0.2.1`` uses ``emr-6.13.0`` as its
+ from the image you just created. GSProcessing version ``0.2.2`` uses ``emr-6.13.0`` as its
base image, so we need to ensure our application uses the same release.

Additionally, if it is required to use text feature transformation with Huggingface model, it is suggested to download the model cache inside the emr-serverless
@@ -179,7 +179,7 @@ as described in :ref:`gsp-upload-data-ref`.
OUTPUT_BUCKET=${MY_BUCKET}
GRAPH_NAME="small-graph"
CONFIG_FILE="gconstruct-config.json"
- NUM_FILES="4"
+ NUM_FILES="-1"
GSP_HOME="enter/path/to/graphstorm/graphstorm-processing/"
LOCAL_ENTRY_POINT=$GSP_HOME/graphstorm_processing/distributed_executor.py
@@ -240,7 +240,7 @@ and building the GSProcessing SageMaker ECR image:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1-x86_64"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
@@ -253,7 +253,7 @@ Note that ``${OUTPUT_PREFIX}`` here will need to match the value assigned when launching
the EMR-S job, i.e. ``"s3://${OUTPUT_BUCKET}/gsprocessing/emr-s/small-graph/4files/"``

For more details on the re-partitioning step see
- ::doc:`row-count-alignment`.
+ :doc:`row-count-alignment`.

Examine the output
------------------
22 changes: 10 additions & 12 deletions docs/source/gs-processing/usage/example.rst
@@ -95,7 +95,7 @@ The contents of the ``gconstruct-config.json`` can be:
"edges" : [
{
# Note that the file is a relative path
- "files": ["edges/movie-included_in-genre.csv"],
+ "files": ["edge_data/movie-included_in-genre.csv"],
"format": {
"name": "csv",
"separator" : ","
@@ -130,22 +130,24 @@ file:
> python run_distributed_processing.py --input-data s3://my-bucket/data \
--config-filename gconstruct-config.json
- Node files are optional
- ^^^^^^^^^^^^^^^^^^^^^^^
+ Node files are optional (but recommended)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

GSProcessing does not require node files to be provided for
- every node type. If a node type appears in one of the edges,
+ every node type. Any node type that appears as source or destination in one of the edges,
its unique node identifiers will be determined by the edge files.

In the example GConstruct file above (``gconstruct-config.json``), the node IDs for the node types
``movie`` and ``genre`` will be extracted from the edge list provided.
However, this is an expensive operation, so if you know your node ID
space from the start we recommend providing node input files for each
node type. You can also have a mix of some node types being provided
and others inferred by the edges.
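For instance, a hypothetical ``nodes`` entry for the ``movie`` type might look like the sketch below (the file path and ID column name are assumptions for illustration, not taken from the example dataset):

```json
{
    "nodes": [
        {
            "node_type": "movie",
            "format": {"name": "csv", "separator": ","},
            "files": ["node_data/movie.csv"],
            "node_id_col": "movie_id"
        }
    ]
}
```

Providing such an entry lets GSProcessing read the node ID space directly instead of computing unique IDs from the edge files.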

Example data and configuration
------------------------------

For this example we use a small heterogeneous graph inspired by the Movielens dataset.
You can see the configuration file under
- ``graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``
+ ``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph/gconstruct-config.json``

We have 4 node types, ``movie``, ``genre``, ``director``, and ``user``. The graph has 3
edge types, ``movie:included_in:genre``, ``user:rated:movie``, and ``director:directed:movie``.
@@ -166,18 +168,14 @@ to process the data and create the output on our local storage.
We will provide an input and output prefix for our data, passing
local paths to the script.

- We also provide the argument ``--num-output-files`` that instructs PySpark
- to try and create output with 4 partitions [#f1]_.

Assuming our working directory is ``graphstorm/graphstorm-processing/``
we can use the following command to run the processing job locally:

.. code-block:: bash
gs-processing --config-filename gconstruct-config.json \
--input-prefix ./tests/resources/small_heterogeneous_graph \
- --output-prefix /tmp/gsprocessing-example/ \
- --num-output-files 4
+ --output-prefix /tmp/gsprocessing-example/
To finalize processing and to wrangle the data into the structure that
4 changes: 2 additions & 2 deletions docs/source/gs-processing/usage/row-count-alignment.rst
@@ -96,7 +96,7 @@ on SageMaker:
bash docker/push_gsprocessing_image.sh --environment sagemaker --region ${REGION}
SAGEMAKER_ROLE_NAME="enter-your-sagemaker-execution-role-name-here"
- IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:0.2.1"
+ IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/graphstorm-processing-sagemaker:latest-x86_64"
ROLE="arn:aws:iam::${ACCOUNT}:role/service-role/${SAGEMAKER_ROLE_NAME}"
INSTANCE_TYPE="ml.t3.xlarge"
@@ -137,7 +137,7 @@ The file streaming implementation will hold at most 2 files worth of data
in memory, so by choosing an appropriate file number when processing you should
be able to process any data size.
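As a rough sketch (all numbers below are hypothetical), you can estimate the peak memory the streaming implementation would hold for a given file count, and pick the file count accordingly:

```shell
# Hypothetical sizing sketch: with at most 2 files held in memory,
# peak usage is roughly twice the per-file size.
TOTAL_DATA_GB=64   # assumed total size of the processed data
NUM_FILES=32       # assumed file count chosen during processing
PEAK_GB=$(( 2 * TOTAL_DATA_GB / NUM_FILES ))
echo "${PEAK_GB}"  # 4 (GB held in memory at peak)
```

Under these assumptions, 32 files keep peak usage around 4 GB; doubling the file count would roughly halve it.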

- .. note:: text
+ .. note::

The file streaming implementation will be much slower than the in-memory
one, so only use in case no instance size can handle your data.
3 changes: 2 additions & 1 deletion graphstorm-processing/docker/README.md
@@ -7,7 +7,7 @@ To build the image you will use `build_gsprocessing_image.sh` and to
push it to ECR you will use `push_gsprocessing_image.sh`.

For a tutorial on building and pushing the images to ECR to use
- with Amazon SageMaker see docs/source/usage/distributed-processing-setup.rst.
+ with Amazon SageMaker see https://graphstorm.readthedocs.io/en/latest/gs-processing/usage/distributed-processing-setup.html.

## Building the image

@@ -33,6 +33,7 @@ You can get the other parameters of the script using
`graphstorm-processing-${ENVIRONMENT}:${VERSION}-${ARCH}-test`.
* `-t, --target` Target of the image. Use `test` if you intend to use the image for testing
new library functionality, otherwise `prod`. Default: `prod`
+ * `-m, --hf-model` When provided with a valid Huggingface model name, will include it in the image.

## Pushing the image

2 changes: 1 addition & 1 deletion graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -24,7 +24,7 @@ Available options:
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
-s, --suffix Suffix for the image tag, can be used to push custom image tags. Default is "".
-b, --build Docker build directory prefix, default is '/tmp/'.
- -m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "".
+ -m, --hf-model Provide a Huggingface Model name to be packed into the docker image. Default is "" (no model included).
EOF
exit
