[Doc] V0.3.1 Documentation and Tutorial update (#973)
*Issue #, if available:*

*Description of changes:*

This PR updates the overall organization of the Documentation and Tutorial. The changes include:

- Grouped the main contents under two 1st-level menus, i.e., `COMMAND LINE INTERFACE USER GUIDE` and `PROGRAMMING INTERFACE USER GUIDE`.
- In the CLI user guide, regrouped the previous contents into two 2nd-level menus, i.e., `GraphStorm Graph Construction` and `GraphStorm Model Training and Inference`.
- In `GraphStorm Graph Construction`, added a new document, `Input Raw Data Specification`, to explain the specifications of the input data and provide a simple raw data example.
- Added a new document, `Single Machine Graph Construction`, to introduce the `gconstruct` module and provide a simple construction configuration JSON example.
- In `Distributed Graph Construction`, added some text to link documents and renamed some titles.
- Renamed the existing 1st-level `DISTRIBUTED TRAINING` to `GraphStorm Model Training and Inference` and moved its contents into the 2nd-level menu under `COMMAND LINE INTERFACE USER GUIDE`.
- Added a new `Model Training and Inference on a Single Machine` document to explain the launch commands.
- Moved `Model Training and Inference Configurations` under this 2nd-level menu.
- Added a new `GraphStorm Training and Inference Output` document to explain the intermediate outputs.
- Added a new `GraphStorm Output Node ID Remapping` document to explain the CLIs' output and the remapping operation.
- In the API user guide, merged the API doc string commits.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
---------

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: jalencato <[email protected]>
Co-authored-by: Oxfordblue7 <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
Co-authored-by: Xiang Song <[email protected]>
1 parent 3baab75, commit d739902
Showing 38 changed files with 1,428 additions and 448 deletions.
24 changes: 24 additions & 0 deletions
docs/source/cli/graph-construction/distributed/gspartition/index.rst
@@ -0,0 +1,24 @@
.. _gspartition_index:

=======================================
GraphStorm Distributed Graph Partition
=======================================

GraphStorm Distributed Graph Partition (GSPartition), which is built on top of the
DGL `distributed graph partitioning pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_, allows users to partition the outputs of :ref:`GSProcessing<gs-processing>` in a distributed manner.

GSPartition consists of two steps: Graph Partitioning and Data Dispatching. The Graph Partitioning step assigns each node to one partition and saves the results as a set of files, called the partition assignment. The Data Dispatching step physically partitions the
graph data and dispatches it according to the partition assignment. It generates the graph data in DGL distributed graph format, ready for GraphStorm distributed training and inference.
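
To make the two steps concrete, here is a toy, single-process sketch. It is not GSPartition's implementation: the function names are hypothetical, round-robin assignment stands in for whatever partitioning algorithm is actually used, and placing each edge with its destination node is an assumption made only for illustration.

```python
from collections import defaultdict


def assign_partitions(node_ids, num_parts):
    """Step 1 (Graph Partitioning): map every node to exactly one partition.

    Round-robin is used purely for illustration; it is an assumption,
    not the algorithm GSPartition applies.
    """
    return {nid: i % num_parts for i, nid in enumerate(node_ids)}


def dispatch_edges(edges, assignment):
    """Step 2 (Data Dispatching): physically group edges by partition.

    Assumption: an edge is stored with the partition of its destination node.
    """
    parts = defaultdict(list)
    for src, dst in edges:
        parts[assignment[dst]].append((src, dst))
    return dict(parts)


nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

assignment = assign_partitions(nodes, num_parts=2)  # the "partition assignment"
partitions = dispatch_edges(edges, assignment)      # per-partition edge lists

print(assignment)
print(partitions)
```

The point of the split is that the assignment is a small, reusable artifact, while dispatching is the data-heavy step that moves the graph data itself.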

.. note::
    GraphStorm currently only supports running GSPartition on AWS infrastructure, i.e., `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_ and `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. However, users can create their own Linux clusters by following the GSPartition tutorial on Amazon EC2.

The first section includes instructions on how to run GSPartition on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_.
The second section includes instructions on how to run GSPartition on `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_.

.. toctree::
    :maxdepth: 1
    :glob:

    sagemaker.rst
    ec2-clusters.rst
15 changes: 15 additions & 0 deletions
docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/index.rst
@@ -0,0 +1,15 @@
================================================
Running GSProcessing jobs on AWS Infra
================================================

After successfully building the Docker image and pushing it to
`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_,
you can now initiate GSProcessing jobs with AWS resources.

.. toctree::
    :maxdepth: 1
    :titlesonly:

    amazon-sagemaker.rst
    emr-serverless.rst
    emr.rst
2 changes: 1 addition & 1 deletion
...quisites/distributed-processing-setup.rst → ...ocessing/distributed-processing-setup.rst
2 changes: 1 addition & 1 deletion
...uisites/gs-processing-getting-started.rst → ...cessing/gs-processing-getting-started.rst
27 changes: 27 additions & 0 deletions
docs/source/cli/graph-construction/distributed/gsprocessing/index.rst
@@ -0,0 +1,27 @@
.. _gsprocessing_prerequisites_index:

========================================
GraphStorm Distributed Data Processing
========================================

GraphStorm Distributed Data Processing (GSProcessing) enables the processing and preparation of massive graph data for training with GraphStorm. GSProcessing handles generating unique node IDs, encoding edge structure files, processing individual features, and preparing data for the distributed partition stage.
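
As a rough illustration of what "generating unique node IDs" and "encoding edge structure" mean, the following plain-Python sketch maps raw string identifiers to consecutive integers and rewrites an edge list with them. This is not GSProcessing code (GSProcessing runs on PySpark); the helper names and the sample data are hypothetical.

```python
def build_node_id_map(raw_ids):
    """Assign consecutive integer IDs to raw node IDs in first-seen order."""
    id_map = {}
    for raw in raw_ids:
        if raw not in id_map:
            id_map[raw] = len(id_map)
    return id_map


def encode_edges(raw_edges, src_map, dst_map):
    """Rewrite (raw_src, raw_dst) pairs as pairs of integer node IDs."""
    return [(src_map[s], dst_map[d]) for s, d in raw_edges]


# Hypothetical "author writes paper" edges with string identifiers.
raw_edges = [("alice", "paper-7"), ("bob", "paper-7"), ("alice", "paper-9")]

# One mapping per node type, as each type gets its own ID space in this sketch.
author_map = build_node_id_map(s for s, _ in raw_edges)
paper_map = build_node_id_map(d for _, d in raw_edges)
encoded = encode_edges(raw_edges, author_map, paper_map)

print(author_map)
print(paper_map)
print(encoded)
```

Keeping the mappings around matters because later stages need them to translate integer IDs in the outputs back to the original identifiers.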

.. note::

    * We use PySpark for horizontal parallelism, enabling scalability to graphs with billions of nodes and edges.
    * GraphStorm currently only supports running GSProcessing on AWS infrastructure, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

The following sections outline essential prerequisites and provide a detailed guide to using
GSProcessing.
The first section provides an introduction to GSProcessing, how to install it locally, and a quick example of its input configuration.
The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources.
The third section explains how to deploy GSProcessing jobs on AWS infrastructure. The last section details how to generate a configuration file for GSProcessing jobs.

.. toctree::
    :maxdepth: 1
    :titlesonly:

    gs-processing-getting-started.rst
    distributed-processing-setup.rst
    aws-infra/index.rst
    input-configuration.rst
4 changes: 2 additions & 2 deletions
...ion/gs-processing/input-configuration.rst → ...uted/gsprocessing/input-configuration.rst
@@ -0,0 +1,24 @@
.. _distributed-gconstruction:

Distributed Graph Construction
==============================

Beyond single machine graph construction, distributed graph construction offers enhanced scalability
and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing)
and GraphStorm Distributed Graph Partitioning (GSPartition). The diagram below gives an overview of the workflow for distributed graph construction.

.. figure:: ../../../../../tutorial/distributed_construction.png
    :align: center

* **GSProcessing**: It accepts tabular files in Parquet/CSV format and turns the raw data into structured data for partitioning, including edge and node data, transformation details, and node ID mappings.
* **GSPartition**: It processes this structured data to create multiple partitions in `DGL Distributed Graph <https://docs.dgl.ai/en/latest/api/python/dgl.distributed.html#distributed-graph>`_ format for distributed model training and inference.

The following sections provide guidance on running GSProcessing and GSPartition. This tutorial also offers an example that demonstrates the end-to-end distributed graph construction process.

.. toctree::
    :maxdepth: 1
    :glob:

    gsprocessing/index.rst
    gspartition/index.rst
    example.rst
@@ -0,0 +1,21 @@
.. _graph_construction:

==============================
GraphStorm Graph Construction
==============================

To use GraphStorm's graph construction pipeline on a single machine or in a distributed environment, users should prepare their input raw data according to GraphStorm's specifications. Users can find more details of these specifications in the :ref:`Input Raw Data Explanations <input_raw_data>` section.

Once the raw data is ready, users can use the GraphStorm :ref:`single machine graph construction CLIs <single-machine-gconstruction>` to handle most common academic graphs or small graphs sampled from enterprise data, typically with millions of nodes and up to one billion edges. Machines with large CPU memory are recommended. As a general guideline, plan for 1TB of memory for a graph with one billion edges.

Many production-level enterprise graphs contain billions of nodes and edges, with features having hundreds or thousands of dimensions. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` help users manage these complex graphs. This is particularly useful for building automatic graph data processing pipelines in production environments. The GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` can run on several AWS services, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_,
`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and
`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_.

.. toctree::
    :maxdepth: 2
    :glob:

    raw_data
    single-machine-gconstruct
    distributed/index