[Doc] V0.3.1 Documentation and Tutorial update (#973)
*Issue #, if available:*

*Description of changes:*
This PR updates the overall Documentation and Tutorial organization. The
changes include:

- Grouped the main contents under two 1st-level menus, i.e., `COMMAND
LINE INTERFACE USER GUIDE` and `PROGRAMMING INTERFACE USER GUIDE`.
- In the CLI user guide, regrouped previous contents into two 2nd-level
menus, i.e., `GraphStorm Graph Construction` and `GraphStorm Model
Training and Inference`.
- In the `GraphStorm Graph Construction`, added a new document, `Input
Raw Data Specification`, to explain the specifications of the input
data, and provide a simple raw data example.
- Added a new document, `Single Machine Graph Construction`, to introduce
the `gconstruct` module, and provide a simple construction configuration
JSON example.
- In the `Distributed Graph Construction`, added some text to link
documents and renamed some titles.
- Renamed the existing 1st-level `DISTRIBUTED TRAINING` to `GraphStorm
Model Training and Inference` and moved the contents into the 2nd-level
menu under `COMMAND LINE INTERFACE USER GUIDE`.
- Added a new `Model Training and Inference on a Single Machine` to
explain the launch commands.
- Moved the `Model Training and Inference Configurations` under this
2nd-level menu.
- Added a new `GraphStorm Training and Inference Output` to explain the
intermediate outputs.
- Added a new `GraphStorm Output Node ID Remapping` to explain the CLI
outputs and the remapping operation.
- In the API user guide, merged the API doc string commits.


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: jalencato <[email protected]>
Co-authored-by: Oxfordblue7 <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Xiang Song <[email protected]>
8 people authored Aug 16, 2024
1 parent 3baab75 commit d739902
Showing 38 changed files with 1,428 additions and 448 deletions.
15 changes: 0 additions & 15 deletions README.md
@@ -43,27 +43,13 @@ python /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \

GraphStorm training relies on ssh to launch training jobs. The GraphStorm standalone mode uses ssh services on port 22.

In addition, to run GraphStorm training on a single machine, users need to create an ``ip_list.txt`` file that contains one row as shown below, which will facilitate ssh communication to the machine itself.

```127.0.0.1```

Users can use the following command to create the simple ip_list.txt file.

```
touch /tmp/ip_list.txt
echo 127.0.0.1 > /tmp/ip_list.txt
```

Third, run the below command to train an RGCN model to perform node classification on the partitioned arxiv graph.

```
python -m graphstorm.run.gs_node_classification \
--workspace /tmp/ogbn-arxiv-nc \
--num-trainers 1 \
--num-servers 1 \
--num-samplers 0 \
--part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \
--ip-config /tmp/ip_list.txt \
--ssh-port 22 \
--cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
--save-perf-results-path /tmp/ogbn-arxiv-nc/models
@@ -96,7 +82,6 @@ python -m graphstorm.run.gs_link_prediction \
--num-servers 1 \
--num-samplers 0 \
--part-config /tmp/ogbn_mag_lp_train_val_1p_4t/ogbn-mag.json \
--ip-config /tmp/ip_list.txt \
--ssh-port 22 \
--cf /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml \
--node-feat-name paper:feat \
49 changes: 48 additions & 1 deletion docs/source/advanced/link-prediction.rst
@@ -197,10 +197,57 @@ In general, GraphStorm covers the following cases:
The gconstruct pipeline of GraphStorm provides support to load hard negative data from raw input.
Hard destination negatives can be defined through ``edge_dst_hard_negative`` transformation.
The ``feature_col`` field of ``edge_dst_hard_negative`` must store the raw node IDs of hard destination nodes.
The following example shows how to define a hard negative feature for edges with the relation ``(node1, relation1, node1)``:

.. code-block:: json

    {
        ...
        "edges": [
            ...
            {
                "source_id_col": "src",
                "dest_id_col": "dst",
                "relation": ["node1", "relation1", "node1"],
                "format": {"name": "parquet"},
                "files": "edge_data.parquet",
                "features": [
                    {
                        "feature_col": "hard_neg",
                        "feature_name": "hard_neg_feat",
                        "transform": {"name": "edge_dst_hard_negative",
                                      "separator": ";"}
                    }
                ]
            }
        ]
    }

The hard negative data is stored in the column named ``hard_neg`` in the ``edge_data.parquet`` file.
The hard negatives will be stored in the constructed graph as the edge feature ``hard_neg_feat``.

GraphStorm accepts two types of hard negative inputs:

- **An array of strings or integers** When the input format is ``Parquet``, the ``feature_col`` can store string or integer arrays. In this case, each row stores a string/integer array representing the hard negative node IDs of the corresponding edge. For example, the ``feature_col`` can be a 2D string array, like ``[["e0_hard_0", "e0_hard_1"],["e1_hard_0"], ..., ["en_hard_0", "en_hard_1"]]``, or a 2D integer array (for integer node IDs), like ``[[10,2],[3], ..., [4,12]]``. Rows are not required to have the same number of hard negatives; GraphStorm will automatically handle the case when some edges do not have enough pre-defined hard negatives.
  For example, the file storing hard negatives should look like the following:

  .. code-block:: yaml

      src      | dst      | hard_neg
      "src_0"  | "dst_0"  | ["dst_10", "dst_11"]
      "src_0"  | "dst_1"  | ["dst_5"]
      ...
      "src_100"| "dst_41" | ["dst_0", "dst_2"]

- **A single string** The ``feature_col`` stores strings instead of string arrays (when the input format is ``Parquet`` or ``CSV``). In this case, a ``separator`` must be provided in the transformation definition to split the strings into node IDs. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The number of hard negatives can vary from row to row, and GraphStorm will automatically handle the case when some edges do not have enough hard negatives. (A short data-preparation sketch for both input formats is shown after this section.)
  For example, the file storing hard negatives should look like the following:

  .. code-block:: yaml

      src      | dst      | hard_neg
      "src_0"  | "dst_0"  | "dst_10;dst_11"
      "src_0"  | "dst_1"  | "dst_5"
      ...
      "src_100"| "dst_41" | "dst_0;dst_2"

GraphStorm will automatically translate the Raw Node IDs of hard negatives into Partition Node IDs in a DistDGL graph.
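
As a minimal, hypothetical sketch (not part of the GraphStorm API), the two accepted input formats above could be produced with pandas roughly as follows; the column names match the example configuration, and a parquet engine such as PyArrow is assumed:

.. code-block:: python

    import pandas as pd

    # Format 1: Parquet with array-valued hard negatives (rows may have different lengths).
    edges = pd.DataFrame({
        "src": ["src_0", "src_0", "src_100"],
        "dst": ["dst_0", "dst_1", "dst_41"],
        "hard_neg": [["dst_10", "dst_11"], ["dst_5"], ["dst_0", "dst_2"]],
    })
    edges.to_parquet("edge_data.parquet")

    # Format 2: a single string per row, joined with the separator declared in the
    # "edge_dst_hard_negative" transform (";" in the example configuration above).
    edges_str = edges.assign(hard_neg=edges["hard_neg"].map(";".join))
    edges_str.to_csv("edge_data.csv", index=False)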
126 changes: 126 additions & 0 deletions docs/source/advanced/multi-task-learning.rst
@@ -318,3 +318,129 @@ GraphStorm supports running multi-task inference on :ref:`SageMaker<distributed-s
--instance-count <INSTANCE_COUNT> \
--instance-type <INSTANCE_TYPE>
Multi-task Learning Output
--------------------------

Saved Node Embeddings
~~~~~~~~~~~~~~~~~~~~~~
When ``save_embed_path`` is provided in the training configuration or the inference configuration,
GraphStorm will save the node embeddings in the corresponding path.
In multi-task learning, by default, GraphStorm will save the node embeddings
produced by the GNN layer for every node type under the path specified by
``save_embed_path``. The output format follows the :ref:`GraphStorm saved node embeddings
format<gs-out-embs>`. Meanwhile, in multi-task learning, certain tasks might apply
task-specific normalization to node embeddings. For instance, a link prediction
task might apply L2 normalization to each node embedding. In such cases, GraphStorm
will also save the normalized node embeddings under the ``save_embed_path``.
The task-specific node embeddings are saved separately under different sub-directories
named with the corresponding task id. (A task id is formatted as ``<task_type>-<ntype/etype(s)>-<label>``.
For instance, the task id of a node classification task on the node type ``paper`` with the
label field ``venue`` will be ``node_classification-paper-venue``. As another example,
the task id of a link prediction task on the edge type ``(paper, cite, paper)`` will be
``link_prediction-paper_cite_paper``,
and the task id of an edge regression task on the edge type ``(paper, cite, paper)`` with
the label field ``year`` will be ``edge_regression-paper_cite_paper-year``.)
The output format of task-specific node embeddings follows
the :ref:`GraphStorm saved node embeddings format<gs-out-embs>`.
The contents of the ``save_embed_path`` in multi-task learning will look like the following:

.. code-block:: bash

    emb_dir/
        ntype0/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        ntype1/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        emb_info.json
        link_prediction-paper_cite_paper/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json
        edge_regression-paper_cite_paper-year/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json

In the above example, both the link prediction and edge regression tasks
apply task-specific normalization to node embeddings.
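
For illustration, assuming the shards are ordinary PyTorch tensor files written by ``torch.save`` (which is how the intermediate ``.pt`` outputs are typically stored), one embedding shard and its node IDs could be inspected as follows; the paths are hypothetical and follow the example layout above:

.. code-block:: python

    import torch

    # Inspect one shard of the task-specific embeddings for node type "ntype0".
    base = "emb_dir/link_prediction-paper_cite_paper/ntype0"
    emb = torch.load(f"{base}/embed-00000.pt")        # embedding tensor for this shard
    nids = torch.load(f"{base}/embed_nids-00000.pt")  # integer node IDs aligned row-by-row with `emb`
    print(emb.shape, nids.shape)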

**Note: The built-in GraphStorm training or inference pipeline
(launched by GraphStorm CLIs) will process the saved node embeddings
to convert the integer node IDs into the raw node IDs, which are usually string node IDs.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>`.

Saved Prediction Results
~~~~~~~~~~~~~~~~~~~~~~~~~
When ``save_prediction_path`` is provided in the inference configuration,
GraphStorm will save the prediction results in the corresponding path.
In multi-task learning inference, each prediction task will have its prediction
results saved separately under a different sub-directory named with the
corresponding task id. The output format of task-specific prediction results
follows the :ref:`GraphStorm saved prediction result format<gs-out-predictions>`.
The contents of the ``save_prediction_path`` in multi-task learning will look like the following:

.. code-block:: bash

    prediction_dir/
        edge_regression-paper_cite_paper-year/
            paper_cite_paper/
                predict-00000.pt
                predict-00001.pt
                ...
                src_nids-00000.pt
                src_nids-00001.pt
                ...
                dst_nids-00000.pt
                dst_nids-00001.pt
                ...
            result_info.json
        node_classification-paper-venue/
            paper/
                predict-00000.pt
                predict-00001.pt
                ...
                predict_nids-00000.pt
                predict_nids-00001.pt
                ...
            result_info.json
        ...

**Note: The built-in GraphStorm inference pipeline
(launched by GraphStorm CLIs) will process each saved prediction result
to convert the integer node IDs into the raw node IDs, which are usually string node IDs.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>`.
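
Similarly, a hypothetical snippet for reading one shard of the node classification predictions under the layout above (again assuming plain ``torch.save`` outputs):

.. code-block:: python

    import torch

    task_dir = "prediction_dir/node_classification-paper-venue/paper"
    preds = torch.load(f"{task_dir}/predict-00000.pt")       # predictions for the nodes in this shard
    nids = torch.load(f"{task_dir}/predict_nids-00000.pt")   # integer node IDs aligned with `preds`
    print(preds.shape, nids.shape)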
@@ -1,6 +1,6 @@
======================================
Running partition jobs on EC2 Clusters
======================================
===============================================
Running partition jobs on Amazon EC2 Clusters
===============================================

Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed,
users can start the partition jobs. This tutorial will provide instructions on how to setup an EC2 cluster and
@@ -64,7 +64,7 @@ Collect the IP address list
...........................
The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect all IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which is like:

.. figure:: ../../../../../tutorial/distributed_ips.png
.. figure:: ../../../../../../tutorial/distributed_ips.png
:align: center

.. note:: We recommend using **private IP addresses** on an AWS EC2 cluster to avoid any possible port constraints.
@@ -0,0 +1,24 @@
.. _gspartition_index:

=======================================
GraphStorm Distributed Graph Partition
=======================================

GraphStorm Distributed Graph Partition (GSPartition), which is built on top of the
DGL `distributed graph partitioning pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_, allows users to perform distributed partitioning on the outputs of :ref:`GSProcessing<gs-processing>`.

GSPartition consists of two steps: Graph Partitioning and Data Dispatching. The Graph Partitioning step assigns each node to one partition and saves the results as a set of files, called the partition assignment. The Data Dispatching step physically partitions the
graph data and dispatches it according to the partition assignment. It generates the graph data in the DGL distributed graph format, ready for GraphStorm distributed training and inference.

.. note::

    GraphStorm currently only supports running GSPartition on AWS infrastructure, i.e., `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_ and `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. However, users can easily create their own Linux clusters by following the GSPartition tutorial on Amazon EC2.

The first section includes instructions on how to run GSPartition on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_.
The second section includes instructions on how to run GSPartition on `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_.

.. toctree::
:maxdepth: 1
:glob:

sagemaker.rst
ec2-clusters.rst
@@ -0,0 +1,15 @@
================================================
Running GSProcessing jobs on AWS Infra
================================================

After successfully building the Docker image and pushing it to
`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_,
you can now initiate GSProcessing jobs with AWS resources.

.. toctree::
:maxdepth: 1
:titlesonly:

amazon-sagemaker.rst
emr-serverless.rst
emr.rst
@@ -1,6 +1,6 @@
.. _gsprocessing_distributed_setup:

GraphStorm Processing Distributed Setup
GSProcessing Distributed Setup
=======================================

In this guide we'll demonstrate how to prepare your environment to run
@@ -1,6 +1,6 @@
.. _gs-processing:

GraphStorm Processing Getting Started
GSProcessing Getting Started
=====================================

.. _gsp-installation-ref:
@@ -0,0 +1,27 @@
.. _gsprocessing_prerequisites_index:

========================================
GraphStorm Distributed Data Processing
========================================

GraphStorm Distributed Data Processing (GSProcessing) enables the processing and preparation of massive graph data for training with GraphStorm. GSProcessing handles generating unique node IDs, encoding edge structure files, processing individual features, and preparing data for the distributed partition stage.

.. note::

* We use PySpark for horizontal parallelism, enabling scalability to graphs with billions of nodes and edges.
* GraphStorm currently only supports running GSProcessing on AWS infrastructure, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

The following sections outline essential prerequisites and provide a detailed guide to using
GSProcessing.
The first section provides an introduction to GSProcessing, how to install it locally, and a quick example of its input configuration.
The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources.
The third section explains how to deploy GSProcessing jobs with AWS infrastructure. The last section offers details about generating a configuration file for GSProcessing jobs.

.. toctree::
:maxdepth: 1
:titlesonly:

gs-processing-getting-started.rst
distributed-processing-setup.rst
aws-infra/index.rst
input-configuration.rst
@@ -1,7 +1,7 @@
.. _gsprocessing_input_configuration:

GraphStorm Processing Input Configuration
=========================================
GSProcessing Input Configuration
================================

GraphStorm Processing uses a JSON configuration file to
parse and process the data into the format needed
24 changes: 24 additions & 0 deletions docs/source/cli/graph-construction/distributed/index.rst
@@ -0,0 +1,24 @@
.. _distributed-gconstruction:

Distributed Graph Construction
==============================

Beyond single machine graph construction, distributed graph construction offers enhanced scalability
and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing)
and GraphStorm Distributed Graph Partitioning (GSPartition). The diagram below gives an overview of the distributed graph construction workflow.

.. figure:: ../../../../../tutorial/distributed_construction.png
:align: center

* **GSProcessing**: It accepts tabular files in parquet/CSV format, and prepares the raw data into structured data for partitioning, including edge and node data, transformation details, and node ID mappings.
* **GSPartition**: It processes this structured data to create multiple partitions in the `DGL Distributed Graph <https://docs.dgl.ai/en/latest/api/python/dgl.distributed.html#distributed-graph>`_ format for distributed model training and inference.

The following sections provide guidance on running GSProcessing and GSPartition. In addition, this tutorial also offers an example that demonstrates the end-to-end distributed graph construction process.

.. toctree::
:maxdepth: 1
:glob:

gsprocessing/index.rst
gspartition/index.rst
example.rst
21 changes: 21 additions & 0 deletions docs/source/cli/graph-construction/index.rst
@@ -0,0 +1,21 @@
.. _graph_construction:

==============================
GraphStorm Graph Construction
==============================

In order to use GraphStorm's graph construction pipeline on a single machine or in a distributed environment, users should prepare their input raw data according to GraphStorm's specifications. Users can find more details of these specifications in the :ref:`Input Raw Data Explanations <input_raw_data>` section.

Once the raw data is ready, users can use GraphStorm :ref:`single machine graph construction CLIs <single-machine-gconstruction>` to handle most common academic graphs or small graphs sampled from enterprise data, typically with millions of nodes and up to one billion edges. It is recommended to use machines with large CPU memory; a general guideline is 1TB of memory for graphs with one billion edges.
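
As an illustration only, a single machine construction run typically takes a construction configuration JSON and an output directory; the module path and flags below follow common GraphStorm examples and may differ across versions, so consult the :ref:`single machine graph construction CLIs <single-machine-gconstruction>` documentation for the authoritative command:

.. code-block:: bash

    # Hypothetical single machine construction run; flag names may vary by version.
    python -m graphstorm.gconstruct.construct_graph \
           --conf-file /path/to/construction_config.json \
           --output-dir /path/to/constructed_graph \
           --num-parts 1 \
           --graph-name my_graph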

Many production-level enterprise graphs contain billions of nodes and edges, with features having hundreds or thousands of dimensions. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` help users manage these complex graphs. This is particularly useful for building automatic graph data processing pipelines in production environments. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` can be run on multiple AWS infrastructures, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_,
`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and
`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

.. toctree::
:maxdepth: 2
:glob:

raw_data
single-machine-gconstruct
distributed/index
