[Doc] V0.3.1 Documentation and Tutorial update (#973)
*Issue #, if available:*

*Description of changes:*
This PR updates the overall Documentation and Tutorial organization. The
changes include:

- Grouped the main contents under two 1st-level menus, i.e., `COMMAND
LINE INTERFACE USER GUIDE` and `PROGRAMMING INTERFACE USER GUIDE`.
- In the CLI user guide, regrouped previous contents into two 2nd-level
menus, i.e., `GraphStorm Graph Construction` and `GraphStorm Model
Training and Inference`.
- In the `GraphStorm Graph Construction`, added a new document, `Input
Raw Data Specification`, to explain the specifications of the input
data, and provide a simple raw data example.
- Added a new document, `Single Machine Graph Construction`, to introduce
the `gconstruct` module, and provide a simple construction configuration
JSON example.
- In the `Distributed Graph Construction`, added some text to link
documents and renamed some titles.
- Renamed the existing 1st-level `DISTRIBUTED TRAINING` to `GraphStorm
Model Training and Inference` and moved the contents into the 2nd-level
menu under `COMMAND LINE INTERFACE USER GUIDE`.
- Added a new `Model Training and Inference on a Single Machine` to
explain the launch commands.
- Moved the `Model Training and Inference Configurations` under this
2nd-level menu.
- Added a new `GraphStorm Training and Inference Output` to explain the
intermediate outputs.
- Added a new `GraphStorm Output Node ID Remapping` to explain the CLI
outputs and the remapping operation.
- In the API user guide, merged the API doc string commits.


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: jalencato <[email protected]>
Co-authored-by: Oxfordblue7 <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Xiang Song <[email protected]>
8 people authored Aug 16, 2024
1 parent 3baab75 commit d739902
Showing 38 changed files with 1,428 additions and 448 deletions.
15 changes: 0 additions & 15 deletions README.md
@@ -43,27 +43,13 @@ python /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \

GraphStorm training relies on ssh to launch training jobs. The GraphStorm standalone mode uses ssh services on port 22.

In addition, to run GraphStorm training on a single machine, users need to create an ``ip_list.txt`` file that contains one row as shown below, which will facilitate ssh communication to the machine itself.

```127.0.0.1```

Users can use the following command to create the simple ip_list.txt file.

```
touch /tmp/ip_list.txt
echo 127.0.0.1 > /tmp/ip_list.txt
```

Third, run the below command to train an RGCN model to perform node classification on the partitioned arxiv graph.

```
python -m graphstorm.run.gs_node_classification \
--workspace /tmp/ogbn-arxiv-nc \
--num-trainers 1 \
--num-servers 1 \
--num-samplers 0 \
--part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \
--ip-config /tmp/ip_list.txt \
--ssh-port 22 \
--cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
--save-perf-results-path /tmp/ogbn-arxiv-nc/models
@@ -96,7 +82,6 @@ python -m graphstorm.run.gs_link_prediction \
--num-servers 1 \
--num-samplers 0 \
--part-config /tmp/ogbn_mag_lp_train_val_1p_4t/ogbn-mag.json \
--ip-config /tmp/ip_list.txt \
--ssh-port 22 \
--cf /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml \
--node-feat-name paper:feat \
49 changes: 48 additions & 1 deletion docs/source/advanced/link-prediction.rst
@@ -197,10 +197,57 @@ In general, GraphStorm covers the following cases:
The gconstruct pipeline of GraphStorm provides support to load hard negative data from raw input.
Hard destination negatives can be defined through ``edge_dst_hard_negative`` transformation.
The ``feature_col`` field of ``edge_dst_hard_negative`` must store the raw node IDs of hard destination nodes.
The following example shows how to define a hard negative feature for edges with the relation ``(node1, relation1, node1)``:

.. code-block:: json

    {
        ...
        "edges": [
            ...
            {
                "source_id_col": "src",
                "dest_id_col": "dst",
                "relation": ["node1", "relation1", "node1"],
                "format": {"name": "parquet"},
                "files": "edge_data.parquet",
                "features": [
                    {
                        "feature_col": "hard_neg",
                        "feature_name": "hard_neg_feat",
                        "transform": {"name": "edge_dst_hard_negative",
                                      "separator": ";"}
                    }
                ]
            }
        ]
    }

The hard negative data is stored in the column named ``hard_neg`` in the ``edge_data.parquet`` file.
The hard negatives will be stored in the constructed graph as the edge feature ``hard_neg_feat``.

GraphStorm accepts two types of hard negative inputs:

- **An array of strings or integers** When the input format is ``Parquet``, the ``feature_col`` can store string or integer arrays. In this case, each row stores a string/integer array representing the hard negative node IDs of the corresponding edge. For example, the ``feature_col`` can be a 2D string array, like ``[["e0_hard_0", "e0_hard_1"],["e1_hard_0"], ..., ["en_hard_0", "en_hard_1"]]``, or a 2D integer array (for integer node IDs), like ``[[10,2],[3], ..., [4,12]]``. Rows are not required to have the same number of hard negatives; GraphStorm will automatically handle the case when some edges do not have enough pre-defined hard negatives.
  For example, the file storing hard negatives should look like the following:

  .. code-block:: yaml

      src      | dst      | hard_neg
      "src_0"  | "dst_0"  | ["dst_10", "dst_11"]
      "src_0"  | "dst_1"  | ["dst_5"]
      ...
      "src_100"| "dst_41" | ["dst_0", "dst_2"]

- **A single string** The ``feature_col`` stores strings instead of string arrays (when the input format is ``Parquet`` or ``CSV``). In this case, a ``separator`` must be provided in the transformation definition to split the strings into node IDs. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The number of hard negatives can vary from row to row, and GraphStorm will automatically handle the case when some edges do not have enough hard negatives. (A short data-preparation sketch for both input formats is shown after this section.)
  For example, the file storing hard negatives should look like the following:

  .. code-block:: yaml

      src      | dst      | hard_neg
      "src_0"  | "dst_0"  | "dst_10;dst_11"
      "src_0"  | "dst_1"  | "dst_5"
      ...
      "src_100"| "dst_41" | "dst_0;dst_2"

GraphStorm will automatically translate the Raw Node IDs of hard negatives into Partition Node IDs in a DistDGL graph.
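
As a minimal, hypothetical sketch (not part of the GraphStorm API), the two accepted input formats above could be produced with pandas roughly as follows; the column names match the example configuration, and a parquet engine such as PyArrow is assumed:

.. code-block:: python

    import pandas as pd

    # Format 1: Parquet with array-valued hard negatives (rows may have different lengths).
    edges = pd.DataFrame({
        "src": ["src_0", "src_0", "src_100"],
        "dst": ["dst_0", "dst_1", "dst_41"],
        "hard_neg": [["dst_10", "dst_11"], ["dst_5"], ["dst_0", "dst_2"]],
    })
    edges.to_parquet("edge_data.parquet")

    # Format 2: a single string per row, joined with the separator declared in the
    # "edge_dst_hard_negative" transform (";" in the example configuration above).
    edges_str = edges.assign(hard_neg=edges["hard_neg"].map(";".join))
    edges_str.to_csv("edge_data.csv", index=False)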
126 changes: 126 additions & 0 deletions docs/source/advanced/multi-task-learning.rst
@@ -318,3 +318,129 @@ GraphStorm supports running multi-task inference on :ref:`SageMaker<distributed-s
--instance-count <INSTANCE_COUNT> \
--instance-type <INSTANCE_TYPE>
Multi-task Learning Output
--------------------------

Saved Node Embeddings
~~~~~~~~~~~~~~~~~~~~~~
When ``save_embed_path`` is provided in the training configuration or the inference configuration,
GraphStorm will save the node embeddings in the corresponding path.
In multi-task learning, by default, GraphStorm will save the node embeddings
produced by the GNN layer for every node type under the path specified by
``save_embed_path``. The output format follows the :ref:`GraphStorm saved node embeddings
format<gs-out-embs>`. Meanwhile, in multi-task learning, certain tasks might apply
task-specific normalization to node embeddings. For instance, a link prediction
task might apply L2 normalization to each node embedding. In such cases, GraphStorm
will also save the normalized node embeddings under the ``save_embed_path``.
The task-specific node embeddings are saved separately under different sub-directories
named with the corresponding task id. (A task id is formatted as ``<task_type>-<ntype/etype(s)>-<label>``.
For instance, the task id of a node classification task on the node type ``paper`` with the
label field ``venue`` will be ``node_classification-paper-venue``. As another example,
the task id of a link prediction task on the edge type ``(paper, cite, paper)`` will be
``link_prediction-paper_cite_paper``,
and the task id of an edge regression task on the edge type ``(paper, cite, paper)`` with
the label field ``year`` will be ``edge_regression-paper_cite_paper-year``.)
The output format of task-specific node embeddings follows
the :ref:`GraphStorm saved node embeddings format<gs-out-embs>`.
The contents of the ``save_embed_path`` in multi-task learning will look like the following:

.. code-block:: bash

    emb_dir/
        ntype0/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        ntype1/
            embed_nids-00000.pt
            embed_nids-00001.pt
            ...
            embed-00000.pt
            embed-00001.pt
            ...
        emb_info.json
        link_prediction-paper_cite_paper/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json
        edge_regression-paper_cite_paper-year/
            ntype0/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            ntype1/
                embed_nids-00000.pt
                embed_nids-00001.pt
                ...
                embed-00000.pt
                embed-00001.pt
                ...
            emb_info.json

In the above example, both the link prediction and edge regression tasks
apply task-specific normalization to node embeddings.
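
For illustration, assuming the shards are ordinary PyTorch tensor files written by ``torch.save`` (which is how the intermediate ``.pt`` outputs are typically stored), one embedding shard and its node IDs could be inspected as follows; the paths are hypothetical and follow the example layout above:

.. code-block:: python

    import torch

    # Inspect one shard of the task-specific embeddings for node type "ntype0".
    base = "emb_dir/link_prediction-paper_cite_paper/ntype0"
    emb = torch.load(f"{base}/embed-00000.pt")        # embedding tensor for this shard
    nids = torch.load(f"{base}/embed_nids-00000.pt")  # integer node IDs aligned row-by-row with `emb`
    print(emb.shape, nids.shape)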

**Note: The built-in GraphStorm training or inference pipeline
(launched by GraphStorm CLIs) will process the saved node embeddings
to convert the integer node IDs into the raw node IDs, which are usually string node IDs.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>`.

Saved Prediction Results
~~~~~~~~~~~~~~~~~~~~~~~~~
When ``save_prediction_path`` is provided in the inference configuration,
GraphStorm will save the prediction results in the corresponding path.
In multi-task learning inference, each prediction task will have its prediction
results saved separately under a different sub-directory named with the
corresponding task id. The output format of task-specific prediction results
follows the :ref:`GraphStorm saved prediction result format<gs-out-predictions>`.
The contents of the ``save_prediction_path`` in multi-task learning will look like the following:

.. code-block:: bash

    prediction_dir/
        edge_regression-paper_cite_paper-year/
            paper_cite_paper/
                predict-00000.pt
                predict-00001.pt
                ...
                src_nids-00000.pt
                src_nids-00001.pt
                ...
                dst_nids-00000.pt
                dst_nids-00001.pt
                ...
            result_info.json
        node_classification-paper-venue/
            paper/
                predict-00000.pt
                predict-00001.pt
                ...
                predict_nids-00000.pt
                predict_nids-00001.pt
                ...
            result_info.json
        ...

**Note: The built-in GraphStorm inference pipeline
(launched by GraphStorm CLIs) will process each saved prediction result
to convert the integer node IDs into the raw node IDs, which are usually string node IDs.**
Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>`.
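
Similarly, a hypothetical snippet for reading one shard of the node classification predictions under the layout above (again assuming plain ``torch.save`` outputs):

.. code-block:: python

    import torch

    task_dir = "prediction_dir/node_classification-paper-venue/paper"
    preds = torch.load(f"{task_dir}/predict-00000.pt")       # predictions for the nodes in this shard
    nids = torch.load(f"{task_dir}/predict_nids-00000.pt")   # integer node IDs aligned with `preds`
    print(preds.shape, nids.shape)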
@@ -1,6 +1,6 @@
======================================
Running partition jobs on EC2 Clusters
======================================
===============================================
Running partition jobs on Amazon EC2 Clusters
===============================================

Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed,
users can start the partition jobs. This tutorial will provide instructions on how to setup an EC2 cluster and
@@ -64,7 +64,7 @@ Collect the IP address list
...........................
The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect all IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which is like:

.. figure:: ../../../../../tutorial/distributed_ips.png
.. figure:: ../../../../../../tutorial/distributed_ips.png
:align: center

.. note:: We recommend using **private IP addresses** on an AWS EC2 cluster to avoid any possible port constraints.
@@ -0,0 +1,24 @@
.. _gspartition_index:

=======================================
GraphStorm Distributed Graph Partition
=======================================

GraphStorm Distributed Graph Partition (GSPartition), which is built on top of the
DGL `distributed graph partitioning pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_, allows users to perform distributed partitioning on the outputs of :ref:`GSProcessing<gs-processing>`.

GSPartition consists of two steps: Graph Partitioning and Data Dispatching. The Graph Partitioning step assigns each node to one partition and saves the results as a set of files, called the partition assignment. The Data Dispatching step physically partitions the
graph data and dispatches it according to the partition assignment. It generates the graph data in the DGL distributed graph format, ready for GraphStorm distributed training and inference.

.. note::

    GraphStorm currently only supports running GSPartition on AWS infrastructure, i.e., `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_ and `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. However, users can easily create their own Linux clusters by following the GSPartition tutorial on Amazon EC2.

The first section includes instructions on how to run GSPartition on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_.
The second section includes instructions on how to run GSPartition on `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_.

.. toctree::
:maxdepth: 1
:glob:

sagemaker.rst
ec2-clusters.rst
@@ -0,0 +1,15 @@
================================================
Running GSProcessing jobs on AWS Infra
================================================

After successfully building the Docker image and pushing it to
`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_,
you can now initiate GSProcessing jobs with AWS resources.

.. toctree::
:maxdepth: 1
:titlesonly:

amazon-sagemaker.rst
emr-serverless.rst
emr.rst
@@ -1,6 +1,6 @@
.. _gsprocessing_distributed_setup:

GraphStorm Processing Distributed Setup
GSProcessing Distributed Setup
=======================================

In this guide we'll demonstrate how to prepare your environment to run
@@ -1,6 +1,6 @@
.. _gs-processing:

GraphStorm Processing Getting Started
GSProcessing Getting Started
=====================================

.. _gsp-installation-ref:
@@ -0,0 +1,27 @@
.. _gsprocessing_prerequisites_index:

========================================
GraphStorm Distributed Data Processing
========================================

GraphStorm Distributed Data Processing (GSProcessing) enables the processing and preparation of massive graph data for training with GraphStorm. GSProcessing handles generating unique node IDs, encoding edge structure files, processing individual features, and preparing data for the distributed partition stage.

.. note::

* We use PySpark for horizontal parallelism, enabling scalability to graphs with billions of nodes and edges.
* GraphStorm currently only supports running GSProcessing on AWS infrastructure, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

The following sections outline essential prerequisites and provide a detailed guide to using
GSProcessing.
The first section provides an introduction to GSProcessing, how to install it locally, and a quick example of its input configuration.
The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources.
The third section explains how to deploy GSProcessing jobs with AWS infrastructure. The last section offers details about generating a configuration file for GSProcessing jobs.

.. toctree::
:maxdepth: 1
:titlesonly:

gs-processing-getting-started.rst
distributed-processing-setup.rst
aws-infra/index.rst
input-configuration.rst
@@ -1,7 +1,7 @@
.. _gsprocessing_input_configuration:

GraphStorm Processing Input Configuration
=========================================
GSProcessing Input Configuration
================================

GraphStorm Processing uses a JSON configuration file to
parse and process the data into the format needed
24 changes: 24 additions & 0 deletions docs/source/cli/graph-construction/distributed/index.rst
@@ -0,0 +1,24 @@
.. _distributed-gconstruction:

Distributed Graph Construction
==============================

Beyond single machine graph construction, distributed graph construction offers enhanced scalability
and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing)
and GraphStorm Distributed Graph Partitioning (GSPartition). The diagram below gives an overview of the distributed graph construction workflow.

.. figure:: ../../../../../tutorial/distributed_construction.png
:align: center

* **GSProcessing**: It accepts tabular files in parquet/CSV format, and prepares the raw data into structured data for partitioning, including edge and node data, transformation details, and node ID mappings.
* **GSPartition**: It processes this structured data to create multiple partitions in the `DGL Distributed Graph <https://docs.dgl.ai/en/latest/api/python/dgl.distributed.html#distributed-graph>`_ format for distributed model training and inference.

The following sections provide guidance on running GSProcessing and GSPartition. In addition, this tutorial also offers an example that demonstrates the end-to-end distributed graph construction process.

.. toctree::
:maxdepth: 1
:glob:

gsprocessing/index.rst
gspartition/index.rst
example.rst
21 changes: 21 additions & 0 deletions docs/source/cli/graph-construction/index.rst
@@ -0,0 +1,21 @@
.. _graph_construction:

==============================
GraphStorm Graph Construction
==============================

In order to use GraphStorm's graph construction pipeline on a single machine or in a distributed environment, users should prepare their input raw data according to GraphStorm's specifications. Users can find more details of these specifications in the :ref:`Input Raw Data Explanations <input_raw_data>` section.

Once the raw data is ready, users can use GraphStorm :ref:`single machine graph construction CLIs <single-machine-gconstruction>` to handle most common academic graphs or small graphs sampled from enterprise data, typically with millions of nodes and up to one billion edges. It is recommended to use machines with large CPU memory; a general guideline is 1TB of memory for graphs with one billion edges.
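
As an illustration only, a single machine construction run typically takes a construction configuration JSON and an output directory; the module path and flags below follow common GraphStorm examples and may differ across versions, so consult the :ref:`single machine graph construction CLIs <single-machine-gconstruction>` documentation for the authoritative command:

.. code-block:: bash

    # Hypothetical single machine construction run; flag names may vary by version.
    python -m graphstorm.gconstruct.construct_graph \
           --conf-file /path/to/construction_config.json \
           --output-dir /path/to/constructed_graph \
           --num-parts 1 \
           --graph-name my_graph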

Many production-level enterprise graphs contain billions of nodes and edges, with features having hundreds or thousands of dimensions. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` help users manage these complex graphs. This is particularly useful for building automatic graph data processing pipelines in production environments. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` can be run on multiple AWS infrastructures, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_,
`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and
`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

.. toctree::
:maxdepth: 2
:glob:

raw_data
single-machine-gconstruct
distributed/index
