From d73990259911bfab98d82a65cc60cf59279743e3 Mon Sep 17 00:00:00 2001
From: "Jian Zhang (James)" <6593865@qq.com>
Date: Fri, 16 Aug 2024 12:45:18 -0700
Subject: [PATCH] [Doc] V0.3.1 Documentation and Tutorial update (#973)

*Issue #, if available:*

*Description of changes:*

This PR updates the overall documentation and tutorial organization. The changes include:

- Grouped the main contents under two 1st-level menus, i.e., `COMMAND LINE INTERFACE USER GUIDE` and `PROGRAMMING INTERFACE USER GUIDE`.
- In the CLI user guide, regrouped the previous contents into two 2nd-level menus, i.e., `GraphStorm Graph Construction` and `GraphStorm Model Training and Inference`.
- In `GraphStorm Graph Construction`, added a new document, `Input Raw Data Specification`, to explain the specification of the input data and provide a simple raw data example.
- Added a new document, `Single Machine Graph Construction`, to introduce the `gconstruct` module and provide a simple construction configuration JSON example.
- In `Distributed Graph Construction`, added text to link related documents and renamed some titles.
- Renamed the existing 1st-level `DISTRIBUTED TRAINING` to `GraphStorm Model Training and Inference` and moved its contents into the 2nd-level menu under `COMMAND LINE INTERFACE USER GUIDE`.
- Added a new `Model Training and Inference on a Single Machine` to explain the launch commands.
- Moved the `Model Training and Inference Configurations` under this 2nd-level menu.
- Added a new `GraphStorm Training and Inference Output` to explain the intermediate outputs.
- Added a new `GraphStorm Output Node ID Remapping` to explain the CLI outputs and the remapping operation.
- In the API user guide, merged the API doc string commits.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
---------

Co-authored-by: Ubuntu
Co-authored-by: xiang song(charlie.song)
Co-authored-by: jalencato
Co-authored-by: Oxfordblue7
Co-authored-by: Theodore Vasiloudis
Co-authored-by: Theodore Vasiloudis
Co-authored-by: Xiang Song
---
 README.md                                     |  15 -
 docs/source/advanced/link-prediction.rst      |  49 ++-
 docs/source/advanced/multi-task-learning.rst  | 126 ++++++
 .../distributed}/example.rst                  |   0
 .../distributed}/gspartition/ec2-clusters.rst |   8 +-
 .../distributed/gspartition/index.rst         |  24 ++
 .../distributed}/gspartition/sagemaker.rst    |   0
 .../aws-infra/amazon-sagemaker.rst            |   0
 .../aws-infra/emr-serverless.rst              |   0
 .../gsprocessing}/aws-infra/emr.rst           |   0
 .../gsprocessing/aws-infra/index.rst          |  15 +
 .../aws-infra/row-count-alignment.rst         |   0
 .../gsprocessing}/developer-guide.rst         |   0
 .../distributed-processing-setup.rst          |   2 +-
 .../gs-processing-getting-started.rst         |   2 +-
 .../distributed/gsprocessing/index.rst        |  27 ++
 .../gsprocessing}/input-configuration.rst     |   4 +-
 .../graph-construction/distributed/index.rst  |  24 ++
 docs/source/cli/graph-construction/index.rst  |  21 +
 .../cli/graph-construction/raw_data.rst       | 190 +++++++++
 .../single-machine-gconstruct.rst             | 372 ++++++++++++++++++
 .../configuration-run.rst                     |  17 +-
 .../distributed/cluster.rst}                  |   8 +-
 .../distributed}/sagemaker.rst                |   5 +-
 .../cli/model-training-inference/index.rst    |  24 ++
 .../output-remapping.rst                      | 220 +++++++++++
 .../cli/model-training-inference/output.rst   | 215 ++++++++++
 .../single-machine-training-inference.rst     |  76 ++++
 .../configuration-gconstruction.rst           | 171 --------
 .../configuration/configuration-partition.rst |  51 ---
 docs/source/configuration/index.rst           |  21 -
 .../gs-processing/aws-infra/index.rst         |  37 --
 .../gs-processing/gspartition/index.rst       |  28 --
 .../gs-processing/index.rst                   |  35 --
 .../gs-processing/prerequisites/index.rst     |  28 --
 docs/source/graph-construction/index.rst      |  13 -
 docs/source/index.rst                         |  46 +--
 docs/source/tutorials/own-data.rst            |   2 +-
 38 files changed, 1428 insertions(+), 448 deletions(-)
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed}/example.rst (100%)
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed}/gspartition/ec2-clusters.rst (97%)
 create mode 100644 docs/source/cli/graph-construction/distributed/gspartition/index.rst
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed}/gspartition/sagemaker.rst (100%)
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed/gsprocessing}/aws-infra/amazon-sagemaker.rst (100%)
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed/gsprocessing}/aws-infra/emr-serverless.rst (100%)
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed/gsprocessing}/aws-infra/emr.rst (100%)
 create mode 100644 docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/index.rst
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed/gsprocessing}/aws-infra/row-count-alignment.rst (100%)
 rename docs/source/{graph-construction/gs-processing/prerequisites => cli/graph-construction/distributed/gsprocessing}/developer-guide.rst (100%)
 rename docs/source/{graph-construction/gs-processing/prerequisites => cli/graph-construction/distributed/gsprocessing}/distributed-processing-setup.rst (99%)
 rename docs/source/{graph-construction/gs-processing/prerequisites => cli/graph-construction/distributed/gsprocessing}/gs-processing-getting-started.rst (99%)
 create mode 100644 docs/source/cli/graph-construction/distributed/gsprocessing/index.rst
 rename docs/source/{graph-construction/gs-processing => cli/graph-construction/distributed/gsprocessing}/input-configuration.rst (99%)
 create mode 100644 docs/source/cli/graph-construction/distributed/index.rst
 create mode 100644 docs/source/cli/graph-construction/index.rst
 create mode 100644 docs/source/cli/graph-construction/raw_data.rst
 create mode 100644 docs/source/cli/graph-construction/single-machine-gconstruct.rst
 rename docs/source/{configuration => cli/model-training-inference}/configuration-run.rst (97%)
 rename docs/source/{scale/distributed.rst => cli/model-training-inference/distributed/cluster.rst} (97%)
 rename docs/source/{scale => cli/model-training-inference/distributed}/sagemaker.rst (99%)
 create mode 100644 docs/source/cli/model-training-inference/index.rst
 create mode 100644 docs/source/cli/model-training-inference/output-remapping.rst
 create mode 100644 docs/source/cli/model-training-inference/output.rst
 create mode 100644 docs/source/cli/model-training-inference/single-machine-training-inference.rst
 delete mode 100644 docs/source/configuration/configuration-gconstruction.rst
 delete mode 100644 docs/source/configuration/configuration-partition.rst
 delete mode 100644 docs/source/configuration/index.rst
 delete mode 100644 docs/source/graph-construction/gs-processing/aws-infra/index.rst
 delete mode 100644 docs/source/graph-construction/gs-processing/gspartition/index.rst
 delete mode 100644 docs/source/graph-construction/gs-processing/index.rst
 delete mode 100644 docs/source/graph-construction/gs-processing/prerequisites/index.rst
 delete mode 100644 docs/source/graph-construction/index.rst

diff --git a/README.md b/README.md
index 277be9cb7c..4844a3ed79 100644
--- a/README.md
+++ b/README.md
@@ -43,27 +43,13 @@ python /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \
 
 GraphStorm training relies on ssh to launch training jobs. The GraphStorm standalone mode uses ssh services in port 22.
 
-In addition, to run GraphStorm training in a single machine, users need to create a ``ip_list.txt`` file that contains one row as below, which will facilitate ssh communication to the machine itself.
-
-```127.0.0.1```
-
-Users can use the following command to create the simple ip_list.txt file.
-
-```
-touch /tmp/ip_list.txt
-echo 127.0.0.1 > /tmp/ip_list.txt
-```
-
 Third, run the below command to train an RGCN model to perform node classification on the partitioned arxiv graph.
 
 ```
 python -m graphstorm.run.gs_node_classification \
        --workspace /tmp/ogbn-arxiv-nc \
        --num-trainers 1 \
-       --num-servers 1 \
-       --num-samplers 0 \
        --part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \
-       --ip-config /tmp/ip_list.txt \
        --ssh-port 22 \
        --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
        --save-perf-results-path /tmp/ogbn-arxiv-nc/models
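+
+# Optional sanity check (a sketch, not part of the original quickstart): the
+# standalone mode launches training workers over ssh on port 22, so confirm
+# that passwordless ssh to the local machine works before launching:
+#   ssh -o BatchMode=yes -p 22 127.0.0.1 exit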
@@ -96,7 +82,6 @@ python -m graphstorm.run.gs_link_prediction \
        --num-servers 1 \
        --num-samplers 0 \
        --part-config /tmp/ogbn_mag_lp_train_val_1p_4t/ogbn-mag.json \
-       --ip-config /tmp/ip_list.txt \
        --ssh-port 22 \
        --cf /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml \
        --node-feat-name paper:feat \
diff --git a/docs/source/advanced/link-prediction.rst b/docs/source/advanced/link-prediction.rst
index 8d9df457cb..a9f97b0192 100644
--- a/docs/source/advanced/link-prediction.rst
+++ b/docs/source/advanced/link-prediction.rst
@@ -197,10 +197,57 @@ In general, GraphStorm covers following cases:
 
 The gconstruct pipeline of GraphStorm provides support to load hard negative data from raw input. Hard destination negatives can be defined through ``edge_dst_hard_negative`` transformation. The ``feature_col`` field of ``edge_dst_hard_negative`` must stores the raw node ids of hard destination nodes.
+The following example shows how to define a hard negative feature for edges with the relation ``(node1, relation1, node1)``:
+
+  .. code-block:: json
+
+    {
+        ...
+        "edges": [
+            ...
+            {
+                "source_id_col": "src",
+                "dest_id_col": "dst",
+                "relation": ["node1", "relation1", "node1"],
+                "format": {"name": "parquet"},
+                "files": "edge_data.parquet",
+                "features": [
+                    {
+                        "feature_col": "hard_neg",
+                        "feature_name": "hard_neg_feat",
+                        "transform": {"name": "edge_dst_hard_negative",
+                                      "separator": ";"}
+                    }
+                ]
+            }
+        ]
+    }
+
+The hard negative data is stored in the column named ``hard_neg`` in the ``edge_data.parquet`` file.
+The edge feature to store the hard negative will be ``hard_neg_feat``.
+
 GraphStorm accepts two types of hard negative inputs:
 
 - **An array of strings or integers** When the input format is ``Parquet``, the ``feature_col`` can store string or integer arrays. In this case, each row stores a string/integer array representing the hard negative node ids of the corresponding edge. For example, the ``feature_col`` can be a 2D string array, like ``[["e0_hard_0", "e0_hard_1"],["e1_hard_0"], ..., ["en_hard_0", "en_hard_1"]]`` or a 2D integer array (for integer node ids) like ``[[10,2],[3],...[4,12]]``. It is not required for each row to have the same dimension size. GraphStorm will automatically handle the case when some edges do not have enough pre-defined hard negatives.
+For example, the file storing hard negatives should look like the following:
+
+.. code-block:: yaml
+
+  src | dst | hard_neg
+  "src_0" | "dst_0" | ["dst_10", "dst_11"]
+  "src_0" | "dst_1" | ["dst_5"]
+  ...
+  "src_100" | "dst_41" | ["dst_0", "dst_2"]
+
+- **A single string** The ``feature_col`` stores strings instead of string arrays (when the input format is ``Parquet`` or ``CSV``). In this case, a ``separator`` must be provided in the transformation definition to split the strings into node ids. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The string length, i.e., number of hard negatives, can vary from row to row. GraphStorm will automatically handle the case when some edges do not have enough hard negatives.
+For example, the file storing hard negatives should look like the following:
+
+.. code-block:: yaml
 
-- **A single string** The ``feature_col`` stores strings instead of string arrays. (When the input format is ``Parquet`` or ``CSV``) In this case, a ``separator`` must be provided to split the strings into node ids. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The string length, i.e., number of hard negatives, can vary from row to row. GraphStorm will automatically handle the case when some edges do not have enough hard negatives.
+  src | dst | hard_neg
+  "src_0" | "dst_0" | "dst_10;dst_11"
+  "src_0" | "dst_1" | "dst_5"
+  ...
+  "src_100" | "dst_41" | "dst_0;dst_2"
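+
+For reference, below is a minimal sketch of producing both input layouts with
+``pandas`` (an illustration only; it assumes ``pandas`` with the ``pyarrow``
+engine installed, and the file and column names simply follow the examples
+above):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # Array-of-strings layout: each row stores a list of hard negative node ids.
+    # Rows may have different numbers of hard negatives.
+    edges = pd.DataFrame({
+        "src": ["src_0", "src_0", "src_100"],
+        "dst": ["dst_0", "dst_1", "dst_41"],
+        "hard_neg": [["dst_10", "dst_11"], ["dst_5"], ["dst_0", "dst_2"]],
+    })
+    edges.to_parquet("edge_data.parquet")  # list columns need the pyarrow engine
+
+    # Single-string layout: hard negatives joined with the configured ";" separator.
+    edges_str = edges.assign(hard_neg=edges["hard_neg"].map(";".join))
+    edges_str.to_parquet("edge_data_str.parquet")
+    # The same single-string layout also works for CSV input.
+    edges_str.to_csv("edge_data.csv", index=False)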
 
 GraphStorm will automatically translate the Raw Node IDs of hard negatives into Partition Node IDs in a DistDGL graph.
diff --git a/docs/source/advanced/multi-task-learning.rst b/docs/source/advanced/multi-task-learning.rst
index 214b1c22de..c6d68eb7c8 100644
--- a/docs/source/advanced/multi-task-learning.rst
+++ b/docs/source/advanced/multi-task-learning.rst
@@ -318,3 +318,129 @@ GraphStorm supports to run multi-task inference on :ref:`SageMaker
         \
         --instance-type
+
+Multi-task Learning Output
+--------------------------
+
+Saved Node Embeddings
+~~~~~~~~~~~~~~~~~~~~~~
+When ``save_embed_path`` is provided in the training configuration or the inference configuration,
+GraphStorm will save the node embeddings in the corresponding path.
+In multi-task learning, by default, GraphStorm will save the node embeddings
+produced by the GNN layer for every node type under the path specified by
+``save_embed_path``. The output format follows the :ref:`GraphStorm saved node embeddings
+format`. Meanwhile, in multi-task learning, certain tasks might apply
+task-specific normalization to node embeddings. For instance, a link prediction
+task might apply L2 normalization on each node embedding. In such cases, GraphStorm
+will also save the normalized node embeddings under the ``save_embed_path``.
+The task-specific node embeddings are saved separately under different sub-directories
+named with the corresponding task id. (A task id is formatted as ``--