Move GSProcessing docs to main repo documentation. (#502)
*Issue #, if available:*

*Description of changes:*

Move the GSProcessing docs under the main repo to allow publishing under a
common readthedocs project.

Add a new "Distributed Processing" section at the index root, rename
"Scale to Giant Graphs" to "Distributed Training" to differentiate
between processing and training.


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
thvasilo and classicsong authored Sep 28, 2023
1 parent 97288af commit 601701a
Showing 11 changed files with 94 additions and 167 deletions.
@@ -34,7 +34,7 @@ On Amazon Linux 2 you can use:
sudo yum install java-11-amazon-corretto-devel
Install ``pyenv``
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~

``pyenv`` is a tool to manage multiple Python version installations. It
can be installed through the installer below on a Linux machine:
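The installer invocation itself is elided from this hunk; as a sketch, pyenv's upstream installer can typically be run as follows (an assumption — verify against the pyenv documentation):

.. code-block:: bash

   # Upstream pyenv installer (assumed invocation; check the pyenv docs)
   curl https://pyenv.run | bash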
@@ -50,7 +50,7 @@ or use ``brew`` on a Mac:
brew update
brew install pyenv
-For more info on ``pyenv`` see `its documentation. <https://github.com/pyenv/pyenv>`
+For more info on ``pyenv`` see `its documentation. <https://github.com/pyenv/pyenv>`_

Create a Python 3.9 env and activate it.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -72,7 +72,7 @@ training.
dependencies.

Install ``poetry``
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~

``poetry`` is a dependency and build management system for Python. To install it
use:
@@ -82,7 +82,7 @@ use:
curl -sSL https://install.python-poetry.org | python3 -
Install dependencies through ``poetry``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we are ready to install our dependencies through ``poetry``.
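A minimal sketch of that step, run from the package directory:

.. code-block:: bash

   cd graphstorm/graphstorm-processing
   poetry install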

@@ -176,8 +176,8 @@ ensure your code conforms to the expectation by running
on your code before commits. To make this easier we include
a pre-commit hook below.

-Use a pre-commit hook to ensure ``black`` and ``pylint`` runs before commits
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Use a pre-commit hook to ensure ``black`` and ``pylint`` run before commits
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To make code formatting and ``pylint`` checks easier for graphstorm-processing
developers, we recommend using a pre-commit hook.
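The hook configuration itself lies outside this hunk; a hypothetical ``.pre-commit-config.yaml`` along these lines (the repository URL, ``rev`` pin, and local-hook setup are assumptions, not the project's actual file) wires up both tools:

.. code-block:: yaml

   # Hypothetical hook configuration; the repository ships its own version
   repos:
     - repo: https://github.com/psf/black
       rev: 23.7.0
       hooks:
         - id: black
     - repo: local
       hooks:
         - id: pylint
           name: pylint
           entry: pylint
           language: system
           types: [python]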
@@ -216,14 +216,14 @@ And then run:
pre-commit install
-which will install the ``black`` and ``pylin`` hooks into your local repository and
+which will install the ``black`` and ``pylint`` hooks into your local repository and
ensure it runs before every commit.

.. note::

The pre-commit hook will also apply to all commits you make to the root
GraphStorm repository. Since GraphStorm doesn't use ``black``, you might
-want to remove the hooks. You can do so from the root repo
+want to remove the ``black`` hook. You can do so from the root repo
using ``rm -rf .git/hooks``.

Both projects use ``pylint`` to check Python files so we'd still recommend using
@@ -1,24 +1,9 @@
-.. graphstorm-processing documentation master file, created by
-   sphinx-quickstart on Tue Aug 1 02:04:45 2023.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
+GraphStorm Processing Getting Started
+=====================================

-Welcome to GraphStorm Distributed Data Processing documentation!
-=================================================

-.. toctree::
-   :maxdepth: 1
-   :caption: Contents:
-
-   Example <usage/example>
-   Distributed processing setup <usage/distributed-processing-setup>
-   Running on Amazon Sagemaker <usage/amazon-sagemaker>
-   Developer Guide <developer/developer-guide>
-   Input configuration <developer/input-configuration>


-GraphStorm Distributed Data Processing allows you to process and prepare massive graph data
-for training with GraphStorm. GraphStorm Processing takes care of generating
+GraphStorm Distributed Data Processing (GSProcessing) allows you to process and prepare massive graph data
+for training with GraphStorm. GSProcessing takes care of generating
unique ids for nodes, using them to encode edge structure files, process
individual features and prepare the data to be passed into the
distributed partitioning and training pipeline of GraphStorm.
@@ -27,11 +12,17 @@ We use PySpark to achieve
horizontal parallelism, allowing us to scale to graphs with billions of nodes
and edges.

-.. _installation-ref:
+.. _gsp-installation-ref:

Installation
------------

+The project needs Python 3.9 and Java 8 or 11 installed. Below we provide brief
+guides for each requirement.
+
+Install Python 3.9
+^^^^^^^^^^^^^^^^^^

The project uses Python 3.9. We recommend using `PyEnv <https://github.com/pyenv/pyenv>`_
to have isolated Python installations.

@@ -42,13 +33,37 @@ With PyEnv installed you can create and activate a Python 3.9 environment using
pyenv install 3.9
pyenv local 3.9
+Install GSProcessing from source
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With a recent version of ``pip`` installed (we recommend ``pip>=21.3``), you can simply run ``pip install .``
from the root directory of the project (``graphstorm/graphstorm-processing``),
-which should install the library into your environment and pull in all dependencies.
+which should install the library into your environment and pull in all dependencies:

+.. code-block:: bash
+
+   # Ensure Python is at least 3.9
+   python -V
+   cd graphstorm/graphstorm-processing
+   pip install .
-Install Java 8, 11, or 17
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Install GSProcessing using poetry
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+You can also create a local virtual environment using `poetry <https://python-poetry.org/docs/>`_.
+With Python 3.9 and ``poetry`` installed you can run:
+
+.. code-block:: bash
+
+   cd graphstorm/graphstorm-processing
+   # This will create a virtual env under graphstorm-processing/.venv
+   poetry install
+   # This will activate the .venv
+   poetry shell
+Install Java 8 or 11
+^^^^^^^^^^^^^^^^^^^^

Spark has a runtime dependency on the JVM to run, so you'll need to ensure
Java is installed and available on your system.
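You can confirm that a compatible JVM is visible on your ``PATH`` with:

.. code-block:: bash

   java -version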
@@ -87,16 +102,19 @@ See the provided :doc:`usage/example` for an example of how to start with tabular
data and convert them into a graph representation before partitioning and
training with GraphStorm.

-Usage
------
+Running locally
+---------------

+For data that fit into the memory of one machine, you can run jobs locally instead of a
+cluster.

To use the library to process your data, you will need to have your data
in a tabular format, and a corresponding JSON configuration file that describes the
data. The input data can be in CSV (with header(s)) or Parquet format.

The configuration file can be in GraphStorm's GConstruct format,
-with the caveat that the file paths need to be relative to the
-location of the config file. See :doc:`/usage/example` for more details.
+**with the caveat that the file paths need to be relative to the
+location of the config file.** See :ref:`gsp-relative-paths` for more details.
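To make the expected input concrete, here is a minimal sketch of a GConstruct-style configuration (node and edge types, file names, and column names are hypothetical; consult the GConstruct documentation for the authoritative schema):

.. code-block:: json

   {
       "nodes": [
           {
               "node_type": "user",
               "format": {"name": "parquet"},
               "files": ["nodes/user.parquet"],
               "node_id_col": "user_id"
           }
       ],
       "edges": [
           {
               "relation": ["user", "follows", "user"],
               "format": {"name": "parquet"},
               "files": ["edges/follows.parquet"],
               "source_id_col": "src",
               "dest_id_col": "dst"
           }
       ]
   }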

After installing the library, executing a processing job locally can be done using:
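The invocation itself sits outside this hunk; as a sketch (the ``gs-processing`` entry point and flag names are assumptions about the library's CLI, and the file names are hypothetical), a local run might look like:

.. code-block:: bash

   gs-processing \
       --config-filename gconstruct-config.json \
       --input-prefix ./input-data \
       --output-prefix ./output-data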

@@ -126,7 +144,7 @@ partitioning pipeline.
See `this guide <https://github.com/awslabs/graphstorm/blob/main/sagemaker/README.md#launch-graph-partitioning-task>`_
for more details on how to use GraphStorm distributed partitioning on SageMaker.

-See :doc:`/usage/example` for a detailed walkthrough of using GSProcessing to
+See :doc:`usage/example` for a detailed walkthrough of using GSProcessing to
wrangle data into a format that's ready to be consumed by the GraphStorm/DGL
partitioning pipeline.

@@ -137,13 +155,15 @@ Using with Amazon SageMaker
To run distributed jobs on Amazon SageMaker we will have to build a Docker image
and push it to the Amazon Elastic Container Registry, which we cover in
:doc:`usage/distributed-processing-setup` and run a SageMaker Processing
-job which we describe in :doc:`/usage/amazon-sagemaker`.
+job which we describe in :doc:`usage/amazon-sagemaker`.
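As a quick preview, the image build relies on the repository's build script; the ``--target`` option is taken from the script's own help text, while the invocation below is illustrative:

.. code-block:: bash

   cd graphstorm/graphstorm-processing
   bash docker/build_gsprocessing_image.sh --target prod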


Developer guide
---------------

-To get started with developing the package refer to :doc:`/developer/developer-guide`.
+To get started with developing the package refer to :doc:`developer/developer-guide`.
+To see the input configuration format that GSProcessing uses internally see
+:doc:`developer/input-configuration`.


.. rubric:: Footnotes
@@ -36,7 +36,7 @@ directory we can upload the test data to S3 using:

Make sure you are uploading your data to a bucket
that was created in the same region as the ECR image
-you pushed in :doc:`/usage/distributed-processing-setup`.
+you pushed in :doc:`distributed-processing-setup`.
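For example (bucket and prefix names here are hypothetical), an upload with the AWS CLI could look like:

.. code-block:: bash

   aws s3 sync ./gsprocessing-input s3://my-gsprocessing-bucket/input/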


Launch the GSProcessing job on Amazon SageMaker
@@ -52,12 +52,12 @@ of up to 20 instances, allowing you to scale your processing to massive graphs,
using larger instances like `ml.r5.24xlarge`.

Since we're now executing on AWS, we'll need access to an execution role
-for SageMaker and the ECR image URI we created in :doc:`/usage/distributed-processing-setup`.
+for SageMaker and the ECR image URI we created in :doc:`distributed-processing-setup`.
For instructions on how to create an execution role for SageMaker
see the `AWS SageMaker documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-create-execution-role>`_.

-Let's set up a small bash script that will run the parametrized processing
-job, followed by the re-partitioning job, both on SageMaker
+Let's set up a small ``bash`` script that will run the parametrized processing
+job, followed by the re-partitioning job, both on SageMaker:

.. code-block:: bash
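   # The script body is elided in this hunk. The sketch below is hypothetical:
   # the launcher script names and flags are assumptions, not verified CLI.
   MY_BUCKET="my-gsprocessing-bucket"                  # hypothetical bucket
   ROLE="arn:aws:iam::123456789012:role/SageMakerRole" # hypothetical role
   IMAGE_URI="123456789012.dkr.ecr.us-east-1.amazonaws.com/graphstorm-processing:0.1.0"

   python scripts/run_distributed_processing.py \
       --s3-input-prefix "s3://${MY_BUCKET}/input" \
       --s3-output-prefix "s3://${MY_BUCKET}/output" \
       --role "${ROLE}" \
       --image "${IMAGE_URI}"

   python scripts/run_repartitioning.py \
       --s3-input-prefix "s3://${MY_BUCKET}/output" \
       --role "${ROLE}" \
       --image "${IMAGE_URI}"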
@@ -131,7 +131,7 @@ Examine the output

Once both jobs are finished we can examine the output created, which
should match the output we saw when running the same jobs locally
-in :doc:`/usage/example`:
+in :ref:`gsp-examining-output`.


.. code-block:: bash
@@ -1,8 +1,8 @@
-Distributed Processing setup for Amazon SageMaker
-=================================================
+GraphStorm Processing setup for Amazon SageMaker
+================================================

In this guide we'll demonstrate how to prepare your environment to run
-GraphStorm Processing (GSP) jobs on Amazon SageMaker.
+GraphStorm Processing (GSProcessing) jobs on Amazon SageMaker.

We're assuming a Linux host environment used throughout
this tutorial, but other OS should work fine as well.
@@ -1,4 +1,4 @@
-GraphStorm Processing example
+GraphStorm Processing Example
=============================

To demonstrate how to use the library locally we will
@@ -13,7 +13,7 @@ To run the local example you will need to install the GSProcessing
library to your Python environment, and you'll need to clone the
GraphStorm repository to get access to the data.

-Follow the :ref:`installation-ref` guide to install the GSProcessing library.
+Follow the :ref:`gsp-installation-ref` guide to install the GSProcessing library.

You can clone the repository using
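for example, over HTTPS (the repository URL appears elsewhere in these docs):

.. code-block:: bash

   git clone https://github.com/awslabs/graphstorm.git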

@@ -48,7 +48,7 @@ Apart from the data, GSProcessing also requires a configuration file that describes the
data and the transformations we will need to apply to the features and any encoding needed for
labels.
We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
-, and the library's own GSProcessing format, described in :doc:`/developer/input-configuration`.
+, and the library's own GSProcessing format, described in :doc:`/gs-processing/developer/input-configuration`.

.. note::
   We expect end users to only provide a GConstruct configuration file,
@@ -61,7 +61,9 @@ We support both the `GConstruct configuration format <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
   as we do with GConstruct.

For a detailed description of all the entries of the GSProcessing configuration file see
-:doc:`/developer/input-configuration`.
+:doc:`/gs-processing/developer/input-configuration`.

+.. _gsp-relative-paths:

Relative file paths required
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -186,6 +188,7 @@ guarantees the data conform to the expectations of DGL:
gs-repartition --input-prefix /tmp/gsprocessing-example/
+.. _gsp-examining-output:

Examining the job output
------------------------
@@ -248,16 +251,19 @@ in an ``edge_data`` directory.
At this point you can use the DGL distributed partitioning pipeline
to partition your data, as described in the
`DGL documentation <https://docs.dgl.ai/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_
.

To simplify the process of partitioning and training, without the need
to manage your own infrastructure, we recommend using GraphStorm's
`SageMaker wrappers <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
that do all the hard work for you and allow
-you to focus on model development.
+you to focus on model development. In particular you can follow the GraphStorm documentation to run
+`distributed partitioning on SageMaker <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_.


To run GSProcessing jobs on Amazon SageMaker we'll need to follow
-:doc:`/usage/distributed-processing-setup` to set up our environment
-and :doc:`/usage/amazon-sagemaker` to execute the job.
+:doc:`/gs-processing/usage/distributed-processing-setup` to set up our environment
+and :doc:`/gs-processing/usage/amazon-sagemaker` to execute the job.


.. rubric:: Footnotes
15 changes: 13 additions & 2 deletions docs/source/index.rst
@@ -15,7 +15,18 @@ Welcome to the GraphStorm Documentation and Tutorials

.. toctree::
   :maxdepth: 1
-   :caption: Scale to Giant Graphs
+   :caption: Distributed Processing
+   :hidden:
+   :glob:
+
+   gs-processing/gs-processing-getting-started
+   gs-processing/usage/example
+   gs-processing/usage/distributed-processing-setup
+   gs-processing/usage/amazon-sagemaker
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Distributed Training
   :hidden:
   :glob:

@@ -52,7 +63,7 @@ Getting Started

For beginners, please first start with the :ref:`GraphStorm Docker environment setup<setup>`. This tutorial covers how to set up a Docker environment and build a GraphStorm Docker image, which serves as the Standalone running environment for GraphStorm. We are working on supporting more running environments for GraphStorm.

Once successfully set up the GraphStorm Docker running environment,

- follow the :ref:`GraphStorm Standalone Mode Quick-Start Tutorial<quick-start-standalone>` to run examples using GraphStorm built-in data and models, hence getting familiar with GraphStorm's usage of training and inference.
- follow the :ref:`Use Your Own Graph Data Tutorial<use-own-data>` to prepare your own graph data for using GraphStorm.
6 changes: 2 additions & 4 deletions graphstorm-processing/docker/build_gsprocessing_image.sh
@@ -15,7 +15,7 @@ Available options:
-h, --help Print this help and exit
-x, --verbose Print script debug info (set -x)
--t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'prod'.
+-t, --target Docker image target, must be one of 'prod' or 'test'. Default is 'test'.
-p, --path Path to graphstorm-processing directory, default is one level above this script.
-i, --image Docker image name, default is 'graphstorm-processing'.
-v, --version Docker version tag, default is the library's current version (`poetry version --short`)
@@ -41,6 +41,7 @@ parse_params() {
IMAGE_NAME='graphstorm-processing'
VERSION=`poetry version --short`
BUILD_DIR='/tmp'
+TARGET='test'
while :; do
case "${1-}" in
@@ -75,9 +76,6 @@ parse_params() {
args=("$@")
# check required params and arguments
-[[ -z "${TARGET-}" ]] && die "Missing required parameter: --target [prod|test]"
return 0
}
20 changes: 0 additions & 20 deletions graphstorm-processing/docs/Makefile

This file was deleted.

