diff --git a/README.md b/README.md index 277be9cb7c..4844a3ed79 100644 --- a/README.md +++ b/README.md @@ -43,27 +43,13 @@ python /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \ GraphStorm training relies on ssh to launch training jobs. The GraphStorm standalone mode uses ssh services in port 22. -In addition, to run GraphStorm training in a single machine, users need to create a ``ip_list.txt`` file that contains one row as below, which will facilitate ssh communication to the machine itself. - -```127.0.0.1``` - -Users can use the following command to create the simple ip_list.txt file. - -``` -touch /tmp/ip_list.txt -echo 127.0.0.1 > /tmp/ip_list.txt -``` - Third, run the below command to train an RGCN model to perform node classification on the partitioned arxiv graph. ``` python -m graphstorm.run.gs_node_classification \ --workspace /tmp/ogbn-arxiv-nc \ --num-trainers 1 \ - --num-servers 1 \ - --num-samplers 0 \ --part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \ - --ip-config /tmp/ip_list.txt \ --ssh-port 22 \ --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \ --save-perf-results-path /tmp/ogbn-arxiv-nc/models @@ -96,7 +82,6 @@ python -m graphstorm.run.gs_link_prediction \ --num-servers 1 \ --num-samplers 0 \ --part-config /tmp/ogbn_mag_lp_train_val_1p_4t/ogbn-mag.json \ - --ip-config /tmp/ip_list.txt \ --ssh-port 22 \ --cf /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml \ --node-feat-name paper:feat \ diff --git a/docs/source/advanced/link-prediction.rst b/docs/source/advanced/link-prediction.rst index 8d9df457cb..a9f97b0192 100644 --- a/docs/source/advanced/link-prediction.rst +++ b/docs/source/advanced/link-prediction.rst @@ -197,10 +197,57 @@ In general, GraphStorm covers following cases: The gconstruct pipeline of GraphStorm provides support to load hard negative data from raw input. Hard destination negatives can be defined through ``edge_dst_hard_negative`` transformation. The ``feature_col`` field of ``edge_dst_hard_negative`` must stores the raw node ids of hard destination nodes. +The follwing example shows how to define a hard negative feature for edges with the relation ``(node1, relation1, node1)``: + + .. code-block:: json + + { + ... + "edges": [ + ... + { + "source_id_col": "src", + "dest_id_col": "dst", + "relation": ("node1", "relation1", "node1"), + "format": {"name": "parquet"}, + "files": "edge_data.parquet", + "features": [ + { + "feature_col": "hard_neg", + "feature_name": "hard_neg_feat", + "transform": {"name": "edge_dst_hard_negative", + "separator": ";"}, + } + ] + } + ] + } + +The hard negative data is stored in the column named ``hard_neg`` in the ``edge_data.parquet`` file. +The edge feature to store the hard negative will be ``hard_neg_feat``. + GraphStorm accepts two types of hard negative inputs: - **An array of strings or integers** When the input format is ``Parquet``, the ``feature_col`` can store string or integer arrays. In this case, each row stores a string/integer array representing the hard negative node ids of the corresponding edge. For example, the ``feature_col`` can be a 2D string array, like ``[["e0_hard_0", "e0_hard_1"],["e1_hard_0"], ..., ["en_hard_0", "en_hard_1"]]`` or a 2D integer array (for integer node ids) like ``[[10,2],[3],...[4,12]]``. It is not required for each row to have the same dimension size. GraphStorm will automatically handle the case when some edges do not have enough pre-defined hard negatives. +For example, the file storing hard negatives should look like the following: + +.. code-block:: yaml + + src | dst | hard_neg + "src_0" | "dst_0" | ["dst_10", "dst_11"] + "src_0" | "dst_1" | ["dst_5"] + ... + "src_100"| "dst_41" | [dst0, dst_2] + +- **A single string** The ``feature_col`` stores strings instead of string arrays (When the input format is ``Parquet`` or ``CSV``). In this case, a ``separator`` must be provided int the transformation definition to split the strings into node ids. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The string length, i.e., number of hard negatives, can vary from row to row. GraphStorm will automatically handle the case when some edges do not have enough hard negatives. +For example, the file storing hard negatives should look like the following: + +.. code-block:: yaml -- **A single string** The ``feature_col`` stores strings instead of string arrays. (When the input format is ``Parquet`` or ``CSV``) In this case, a ``separator`` must be provided to split the strings into node ids. The ``feature_col`` will be a 1D string list, for example ``["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]``. The string length, i.e., number of hard negatives, can vary from row to row. GraphStorm will automatically handle the case when some edges do not have enough hard negatives. + src | dst | hard_neg + "src_0" | "dst_0" | "dst_10;dst_11" + "src_0" | "dst_1" | "dst_5" + ... + "src_100"| "dst_41"| "dst0;dst_2" GraphStorm will automatically translate the Raw Node IDs of hard negatives into Partition Node IDs in a DistDGL graph. diff --git a/docs/source/advanced/multi-task-learning.rst b/docs/source/advanced/multi-task-learning.rst index 214b1c22de..c6d68eb7c8 100644 --- a/docs/source/advanced/multi-task-learning.rst +++ b/docs/source/advanced/multi-task-learning.rst @@ -318,3 +318,129 @@ GraphStorm supports to run multi-task inference on :ref:`SageMaker<distributed-s --instance-count <INSTANCE_COUNT> \ --instance-type <INSTANCE_TYPE> +Multi-task Learning Output +-------------------------- + +Saved Node Embeddings +~~~~~~~~~~~~~~~~~~~~~~ +When ``save_embed_path`` is provided in the training configuration or the inference configuration, +GraphStorm will save the node embeddings in the corresponding path. +In multi-task learning, by default, GraphStorm will save the node embeddings +produced by the GNN layer for every node type under the path specified by +``save_embed_path``. The output format follows the :ref:`GraphStorm saved node embeddings +format<gs-out-embs>`. Meanwhile, in multi-task learning, certain tasks might apply +task specific normalization to node embeddings. For instance, a link prediction +task might apply l2 normalization on each node embeddings. In certain cases, GraphStorm +will also save the normalized node embeddings under the ``save_embed_path``. +The task specific node embeddings are saved separately under different sub-directories +named with the corresponding task id. (A task id is formated as ``<task_type>-<ntype/etype(s)>-<label>``. +For instance, the task id of a node classification task on the node type ``paper`` with the +label field ``venue`` will be ``node_classification-paper-venue``. As another example, +the task id of a link prediction task on the edge type ``(paper, cite, paper)`` will be +``link_prediction-paper_cite_paper`` +and the task id of a edge regression task on the edge type ``(paper, cite, paper)`` with +the label field ``year`` will be ``edge_regression-paper_cite_paper-year``). +The output format of task specific node embeddings follows +the :ref:`GraphStorm saved node embeddings format<gs-out-embs>`. +The contents of the ``save_embed_path`` in multi-task learning will look like following: + +.. code-block:: bash + + emb_dir/ + ntype0/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + ntype1/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + emb_info.json + link_prediction-paper_cite_paper/ + ntype0/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + ntype1/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + emb_info.json + edge_regression-paper_cite_paper-year/ + ntype0/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + ntype1/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + emb_info.json + +In the above example both the link prediction and edge regression tasks +apply task specific normalization on node embeddings. + +**Note: The built-in GraphStorm training or inference pipeline +(launched by GraphStorm CLIs) will process each saved node embeddings +to convert the integer node IDs into the raw node IDs, which are usually string node IDs.** +Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` + +Saved Prediction Results +~~~~~~~~~~~~~~~~~~~~~~~~~ +When ``save_prediction_path`` is provided in the inference configuration, +GraphStorm will save the prediction results in the corresponding path. +In multi-task learning inference, each prediction task will have its prediction +results saved separately under different sub-directories +named with the +corresponding task id. The output format of task specific prediction results +follows the :ref:`GraphStorm saved prediction result format<gs-out-predictions>`. +The contents of the ``save_prediction_path`` in multi-task learning will look like following: + +.. code-block:: bash + + prediction_dir/ + edge_regression-paper_cite_paper-year/ + paper_cite_paper/ + predict-00000.pt + predict-00001.pt + ... + src_nids-00000.pt + src_nids-00001.pt + ... + dst_nids-00000.pt + dst_nids-00001.pt + ... + result_info.json + node_classification-paper-venue/ + paper/ + predict-00000.pt + predict-00001.pt + ... + predict_nids-00000.pt + predict_nids-00001.pt + ... + result_info.json + ... + +**Note: The built-in GraphStorm inference pipeline +(launched by GraphStorm CLIs) will process each saved prediction result +to convert the integer node IDs into the raw node IDs, which are usually string node IDs.** +Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` diff --git a/docs/source/graph-construction/gs-processing/example.rst b/docs/source/cli/graph-construction/distributed/example.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/example.rst rename to docs/source/cli/graph-construction/distributed/example.rst diff --git a/docs/source/graph-construction/gs-processing/gspartition/ec2-clusters.rst b/docs/source/cli/graph-construction/distributed/gspartition/ec2-clusters.rst similarity index 97% rename from docs/source/graph-construction/gs-processing/gspartition/ec2-clusters.rst rename to docs/source/cli/graph-construction/distributed/gspartition/ec2-clusters.rst index e12253cbcb..d839615331 100644 --- a/docs/source/graph-construction/gs-processing/gspartition/ec2-clusters.rst +++ b/docs/source/cli/graph-construction/distributed/gspartition/ec2-clusters.rst @@ -1,6 +1,6 @@ -====================================== -Running partition jobs on EC2 Clusters -====================================== +=============================================== +Running partition jobs on Amazon EC2 Clusters +=============================================== Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed, users can start the partition jobs. This tutorial will provide instructions on how to setup an EC2 cluster and @@ -64,7 +64,7 @@ Collect the IP address list ........................... The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect all IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which is like: -.. figure:: ../../../../../tutorial/distributed_ips.png +.. figure:: ../../../../../../tutorial/distributed_ips.png :align: center .. note:: We recommend to use **private IP addresses** on AWS EC2 cluster to avoid any possible port constraints. diff --git a/docs/source/cli/graph-construction/distributed/gspartition/index.rst b/docs/source/cli/graph-construction/distributed/gspartition/index.rst new file mode 100644 index 0000000000..1b43b490ce --- /dev/null +++ b/docs/source/cli/graph-construction/distributed/gspartition/index.rst @@ -0,0 +1,24 @@ +.. _gspartition_index: + +======================================= +GraphStorm Distributed Graph Partition +======================================= + +GraphStorm Distributed Graph Partition (GSPartition), which is built on top of the +dgl `distributed graph partitioning pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_, allows users to do distributed partition on the outputs of :ref:`GSProcessing<gs-processing>`. + +GSPartition consists of two steps: Graph Partitioning and Data Dispatching. Graph Partitioning step assigns each node to one partition and save the results as a set of files, called partition assignment. Data Dispatching step will physically partition the +graph data and dispatch them according to the partition assignment. It will generate the graph data in DGL distributed graph format, ready for GraphStorm distributed training and inference. + +.. note:: + GraphStorm currently only supports running GSPartition on AWS infrastructure, i.e., `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_ and `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. But, users can easily create your own Linux clusters by following the GSPartition tutorial on Amazon EC2. + +The first section includes instructions on how to run GSPartition on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_. +The second section includes instructions on how to run GSPartition on `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. + +.. toctree:: + :maxdepth: 1 + :glob: + + sagemaker.rst + ec2-clusters.rst diff --git a/docs/source/graph-construction/gs-processing/gspartition/sagemaker.rst b/docs/source/cli/graph-construction/distributed/gspartition/sagemaker.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/gspartition/sagemaker.rst rename to docs/source/cli/graph-construction/distributed/gspartition/sagemaker.rst diff --git a/docs/source/graph-construction/gs-processing/aws-infra/amazon-sagemaker.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/amazon-sagemaker.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/aws-infra/amazon-sagemaker.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/amazon-sagemaker.rst diff --git a/docs/source/graph-construction/gs-processing/aws-infra/emr-serverless.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/emr-serverless.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/aws-infra/emr-serverless.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/emr-serverless.rst diff --git a/docs/source/graph-construction/gs-processing/aws-infra/emr.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/emr.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/aws-infra/emr.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/emr.rst diff --git a/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/index.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/index.rst new file mode 100644 index 0000000000..239025debe --- /dev/null +++ b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/index.rst @@ -0,0 +1,15 @@ +================================================ +Running GSProcessing jobs on AWS Infra +================================================ + +After successfully building the Docker image and pushing it to +`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_, +you can now initiate GSProcessing jobs with AWS resources. + +.. toctree:: + :maxdepth: 1 + :titlesonly: + + amazon-sagemaker.rst + emr-serverless.rst + emr.rst \ No newline at end of file diff --git a/docs/source/graph-construction/gs-processing/aws-infra/row-count-alignment.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/row-count-alignment.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/aws-infra/row-count-alignment.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/aws-infra/row-count-alignment.rst diff --git a/docs/source/graph-construction/gs-processing/prerequisites/developer-guide.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/developer-guide.rst similarity index 100% rename from docs/source/graph-construction/gs-processing/prerequisites/developer-guide.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/developer-guide.rst diff --git a/docs/source/graph-construction/gs-processing/prerequisites/distributed-processing-setup.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/distributed-processing-setup.rst similarity index 99% rename from docs/source/graph-construction/gs-processing/prerequisites/distributed-processing-setup.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/distributed-processing-setup.rst index bf2849688e..e36efc7d12 100644 --- a/docs/source/graph-construction/gs-processing/prerequisites/distributed-processing-setup.rst +++ b/docs/source/cli/graph-construction/distributed/gsprocessing/distributed-processing-setup.rst @@ -1,6 +1,6 @@ .. _gsprocessing_distributed_setup: -GraphStorm Processing Distributed Setup +GSProcessing Distributed Setup ======================================= In this guide we'll demonstrate how to prepare your environment to run diff --git a/docs/source/graph-construction/gs-processing/prerequisites/gs-processing-getting-started.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/gs-processing-getting-started.rst similarity index 99% rename from docs/source/graph-construction/gs-processing/prerequisites/gs-processing-getting-started.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/gs-processing-getting-started.rst index 36b01ef8dd..39259c4058 100644 --- a/docs/source/graph-construction/gs-processing/prerequisites/gs-processing-getting-started.rst +++ b/docs/source/cli/graph-construction/distributed/gsprocessing/gs-processing-getting-started.rst @@ -1,6 +1,6 @@ .. _gs-processing: -GraphStorm Processing Getting Started +GSProcessing Getting Started ===================================== .. _gsp-installation-ref: diff --git a/docs/source/cli/graph-construction/distributed/gsprocessing/index.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/index.rst new file mode 100644 index 0000000000..3cdeaff406 --- /dev/null +++ b/docs/source/cli/graph-construction/distributed/gsprocessing/index.rst @@ -0,0 +1,27 @@ +.. _gsprocessing_prerequisites_index: + +======================================== +GraphStorm Distributed Data Processing +======================================== + +GraphStorm Distributed Data Processing (GSProcessing) enables the processing and preparation of massive graph data for training with GraphStorm. GSProcessing handles generating unique node IDs, encoding edge structure files, processing individual features, and preparing data for the distributed partition stage. + +.. note:: + + * We use PySpark for horizontal parallelism, enabling scalability to graphs with billions of nodes and edges. + * GraphStorm currently only supports running GSProcessing on AWS Infras including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_. + +The following sections outline essential prerequisites and provide a detailed guide to use +GSProcessing. +The first section provides an introduction to GSProcessing, how to install it locally and a quick example of its input configuration. +The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources. +The third section explains how to deploy GSProcessing job with AWS infrastructure. The last section offers the details about generating a configuration file for GSProcessing jobs. + +.. toctree:: + :maxdepth: 1 + :titlesonly: + + gs-processing-getting-started.rst + distributed-processing-setup.rst + aws-infra/index.rst + input-configuration.rst diff --git a/docs/source/graph-construction/gs-processing/input-configuration.rst b/docs/source/cli/graph-construction/distributed/gsprocessing/input-configuration.rst similarity index 99% rename from docs/source/graph-construction/gs-processing/input-configuration.rst rename to docs/source/cli/graph-construction/distributed/gsprocessing/input-configuration.rst index f84cb9520f..d12367227f 100644 --- a/docs/source/graph-construction/gs-processing/input-configuration.rst +++ b/docs/source/cli/graph-construction/distributed/gsprocessing/input-configuration.rst @@ -1,7 +1,7 @@ .. _gsprocessing_input_configuration: -GraphStorm Processing Input Configuration -========================================= +GSProcessing Input Configuration +================================ GraphStorm Processing uses a JSON configuration file to parse and process the data into the format needed diff --git a/docs/source/cli/graph-construction/distributed/index.rst b/docs/source/cli/graph-construction/distributed/index.rst new file mode 100644 index 0000000000..8f331b897f --- /dev/null +++ b/docs/source/cli/graph-construction/distributed/index.rst @@ -0,0 +1,24 @@ +.. _distributed-gconstruction: + +Distributed Graph Construction +============================== + +Beyond single machine graph construction, distributed graph construction offers enhanced scalability +and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing) +and GraphStorm Distributed Graph Partitioning (GSPartition).The below diagram is an overview of the workflow for distributed graph construction. + +.. figure:: ../../../../../tutorial/distributed_construction.png + :align: center + +* **GSProcessing**: It accepts tabular files in parquet/CSV format, and prepares the raw data into structured data for partitioning, including edge and node data, transformation details, and node id mappings. +* **GSPartition**: It will process these structured data to create multiple partitions in `DGL Distributed Graph <https://docs.dgl.ai/en/latest/api/python/dgl.distributed.html#distributed-graph>`_ format for distributed model training and inference. + +The following sections provide guidance on doing GSProcessing and GSPartition. In addition, this tutorial also offers an example that demonstrates the end-to-end distributed graph construction process. + +.. toctree:: + :maxdepth: 1 + :glob: + + gsprocessing/index.rst + gspartition/index.rst + example.rst \ No newline at end of file diff --git a/docs/source/cli/graph-construction/index.rst b/docs/source/cli/graph-construction/index.rst new file mode 100644 index 0000000000..76d762eb2c --- /dev/null +++ b/docs/source/cli/graph-construction/index.rst @@ -0,0 +1,21 @@ +.. _graph_construction: + +============================== +GraphStorm Graph Construction +============================== + +In order to use GraphStorm's graph construction pipeline on a single machine or a distributed environment, users should prepare their input raw data accroding to GraphStorm's specifications. Users can find more details of these specifications in the :ref:`Input Raw Data Explanations <input_raw_data>` section. + +Once the raw data is ready, by using GraphStorm :ref:`single machine graph construction CLIs <single-machine-gconstruction>`, users can handle most common academic graphs or small graphs sampled from enterprise data, typically with millions of nodes and up to one billion edges. It's recommended to use machines with large CPU memory. A general guideline: 1TB of memory for graphs with one billion edges. + +Many production-level enterprise graphs contain billions of nodes and edges, with features having hundreds or thousands of dimensions. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` help users manage these complex graphs. This is particularly useful for building automatic graph data processing pipelines in production environments. GraphStorm :ref:`distributed graph construction CLIs <distributed-gconstruction>` could be applied on multiple Amazon infrastructures, including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, +`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and +`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_. + +.. toctree:: + :maxdepth: 2 + :glob: + + raw_data + single-machine-gconstruct + distributed/index \ No newline at end of file diff --git a/docs/source/cli/graph-construction/raw_data.rst b/docs/source/cli/graph-construction/raw_data.rst new file mode 100644 index 0000000000..4a629e568a --- /dev/null +++ b/docs/source/cli/graph-construction/raw_data.rst @@ -0,0 +1,190 @@ +.. _input_raw_data: + +Input Raw Data Specification +============================= + +In order to use GraphStorm's graph construction pipeline on a single machine or a distributed environment, users should prepare their input raw data accroding to GraphStorm's specifications explained below. + +Data tables +------------ +The main part of GraphStorm input raw data is composed of two sets of tables. One for nodes and one for edges. These data tables could be in one of three file formats: ``csv`` files, ``parquet`` files, or ``HDF5`` files. All of the three file formats store data in tables that contain headers, i.e., a list of column names, and values belonging to each column. + +Node tables +............ +GraphStorm requires each node type to have its own table(s). It is suggested to have one folder for one node type to store table(s). + +In the table for one node type, there **must** be one column that stores the IDs of nodes. The IDs could be non-integers, such as strings. GraphStorm will treat non-integer IDs as strings and convert them into interger IDs. + +If certain type of nodes has features, the features could be stored in multiple columns, each of which stores one type of features. These features could be numerical, categorial, or textual data. Similarly, training labels associated with certain type of nodes could be stored in multiple columns, each of which store one type of labels. + +Edge tables +............ +GraphStorm requires each edge type to have its own table(s). It is suggested to have one folder for one edge type to store tables(s). + +In the table for one edge type, there **must** be two columns. One column stores the IDs of source node type of the edge type, while another column stores the IDs of destination node type of the edge type. The source and destination node type should have their corresponding node tables. Same as node features and labels, edge features and labels could be stored in multiple columns. + +.. note:: + + * If the number of rows is too large, it is suggested to split and store the data into mutliple table files that have the identical schema. Doing so could speed up the data reading process during graph construction if use multiple processing. + * It is suggested to use **parquet** file format for its popularity and compressed file sizes. The **HDF5** format is only suggested for data with large volume of high dimension features. + * Users can also store columns in multiple sets of table files, for example, puting "node IDs" and "feature_1" in the set of "table1_1.parquet" file and "table1_2.parquet" file, and put "feature_2" in another set of "table2_1.h5" file and "table2_2.h5" file with the same row order. + +.. warning:: + + If users split both rows and columns into mutliple sets of table files, they need to make sure that after files are sorted according to the file names, the order of the rows of each column will still keep the same. + + Suppose the columns are split into two file sets. One set includes a list of files, i.e., ``table_1.h5, table_2.h5, ..., table_9.h5, table_10.h5, table_11.h5``, and another set also includes a list of files, i.e., ``table_001.h5, table_002.h5, ..., table_009.h5, table_010.h5, table_011.h5``. The order of rows in the two set of files is the same when using the original order of files in the two lists. However, after being sorted by Linux OS, we will get ``table_1.h5, table_10.h5, table_11.h5, table_2.h5, ..., table_9.h5`` for the first list, and get ``table_001.h5, table_002.h5, ..., table_009.h5, table_010.h5, table_011.h5`` for the second list. The order of files is different, which will cause mismatch between node IDs and node features. + + Therefore, it is **strongly** suggested to use the ``_000*`` file name template, like ``table_001, table_002, ..., table_009, table_010, table_011, ..., table_100, table_101, ...``. + +.. _customized-split-labels: + +Label split files (Optional) +----------------------- +In some cases, users may want to control which nodes or edges should be used for training, validation, or testing. To achieve this goal, users can set the customized label split information in three JSON files or parquet files. + +For node split files, users just need to list the node IDs used for training in one file, node IDs used for validation in one file, and node IDs used for testing in another file. If use JSON files, put one node ID in one line like :ref:`this example <node-split-json>` below. If use parquet files, place these node IDs in one column and assign a column name to it. + +Foe edge split files, users need to provide both source node IDs and destination node IDs in the split files. If use JSON files, put one edge as a JSON list with two elements, i.e., ``["source node ID", "destination node ID"]``, in one line. If use parquet files, place the source node IDs and destination node IDs into two columns, and assign column names to them like :ref:`this example <edge-split-parquet>` below. + +If there is no validation or testing set, users do not need to create the corresponding file(s). + +.. _simple-input-raw-data-example: + +A simple raw data example +-------------------------- +To better help users to prepare the input raw data artifacts, this section provides a simple example. + +This simple raw data has three types of nodes, ``paper``, ``subject``, ``author``, and two types of edges, ``paper, has, subject`` and ``paper, written-by, author``. + +``paper`` node tables +....................... +The ``paper`` table (``paper_nodes.parquet``) includes three columns, i.e., `nid` for node IDs, `aff` is a feature column with categorial values, `class` is a classification label column with 3 classes, and ``abs`` is a feature column with textual values. + +===== ======= ======= =============== +nid aff class abs +===== ======= ======= =============== +n1_1 NE 0 chips are +n1_2 MT 1 electricity +n1_3 UL 1 prime numbers +n1_4 TT 2 Questions are +===== ======= ======= =============== + + +``subject`` node table +....................... +The ``subject`` table (``subject_nodes.parquet``) includes one column only, i.e., `domain`, functioning as node IDs. + ++--------+ +| domain | ++========+ +| eee | ++--------+ +| mth | ++--------+ +| llm | ++--------+ + +.. _multi-set-table-examle: + +``author`` node table +....................... +The ``author`` table (``author_nodes.parquet``) includes two columns, i.e., `n_id` for node IDs, and `hdx` as a feature column with numerical values. + +===== ======= +n_id hdx +===== ======= +60 0.75 +70 25.34 +80 1.34 +===== ======= + +The ``author`` nodes also have a 2048 dimension embeddings pre-computed on a textual feature stored as an **HDF5** file (``author_node_embeddings.h5``) as shown below. + ++----------------------------------------------------------------+ +| embedding | ++================================================================+ +| 0.2964, 0.0779, 1.2763, 2.8971, ..., -0.2564, 0.9060, -0.8740 | ++----------------------------------------------------------------+ +| 1.6941, -1.6765, 0.1862, -0.4449, ..., 0.6474, 0.2358, -0.5952 | ++----------------------------------------------------------------+ +| -0.8417, 2.5096, -0.0393, -0.8208, ..., 0.9894, 2.3389, 0.9778 | ++----------------------------------------------------------------+ + +.. note:: The order of rows in the ``author_node_embeddings.h5`` file **MUST** be same as those in the ``author_nodes.parquet`` file, i.e., the first value row contains the embeddings for the ``author`` node with ``n_id`` as ``60``, and the second value row is for ``author`` node with ``n_id`` as ``70``, and so on. + +``paper, has, subject`` edge table +...................................... +The ``paper, has, subject`` edge table (``paper_has_subject_edges.parquet``) includes three columns, i.e., ``nid`` as the source node IDs, ``domain`` as the destination IDs, and ``cnt`` as the label field for a regression task. + +===== ======= ======= +nid domain cnt +===== ======= ======= +n1_1 eee 100 +n1_2 eee 1 +n1_3 mth 39 +n1_4 llm 4700 +===== ======= ======= + +``paper, written-by, author`` edge table +...................................... +The ``paper, written-by, author`` edge table (``paper_written-by_author_edges.parquet``) includes two columns, i.e., ``nid`` as the source node IDs, ``n_id`` as the destination IDs. + +===== ======= +nid n_id +===== ======= +n1_1 60 +n1_2 60 +n1_3 70 +n1_4 70 +===== ======= + +.. _node-split-json: + +Node split JSON files +...................... +This example sets customized node split files on the ``paper`` nodes for a node classification task in the JSON format. There are two nodes in the training set, one node for validation, and one node for testing. + +**train.json** contents + +.. code:: json + + n1_2 + n1_3 + +**val.json** contents + +.. code:: json + + n1_4 + +**test.json** contents + +.. code:: json + + n1_1 + +.. _edge-split-parquet: + +Edge split parquet files +......................... + +This example sets customized edge split files on the ``paper, has, subject`` edges for an edge regression task in the parquet format. There are three edges in the training set, one edge for validation, and no edge for testing. + +**train_edges.parquet** contents + +===== ======= +nid domain +===== ======= +n1_1 eee +n1_2 eee +n1_4 llm +===== ======= + +**val_edges.parquet** contents + +===== ======= +nid domain +===== ======= +n1_3 mth +===== ======= diff --git a/docs/source/cli/graph-construction/single-machine-gconstruct.rst b/docs/source/cli/graph-construction/single-machine-gconstruct.rst new file mode 100644 index 0000000000..3e604006df --- /dev/null +++ b/docs/source/cli/graph-construction/single-machine-gconstruct.rst @@ -0,0 +1,372 @@ +.. _single-machine-gconstruction: + +Single Machine Graph Construction +----------------------------------- + +Prerequisites +************** + +1. A machine with Linux operation system that has proper CPU memory according to the raw data size. +2. Following the :ref:`Setup GraphStorm with pip Packages <setup_pip>` guideline to install GraphStorm and its dependencies. +3. Following the :ref:`Input Raw Data Explanations <input_raw_data>` guideline to prepare the input raw data. + +Graph consturction command +**************************** + +GraphStorm provides a ``gconstruct.construct_graph`` module for graph construction in a signle machine. Users can run the ``gconstruct.construct_graph`` command by following the command template below. + +.. code:: python + + python -m graphstorm.gconstruct.construct_graph \ + --conf-file config.json \ + --output-dir /a_path \ + --num-parts 1 \ + --graph-name a_name + +This template provides the actual Python command, and it also indicates the three required command arguments, i.e., ``--conf-file`` specifies a JSON file containing graph construction configurations, ``--output-dir`` specifies the directory for outputs, and ``--graph-name`` specifies a string as a name given to the constructed graph. The ``--num-parts`` whose default given value is ``1`` is also an important argument. It determines how many partitions to be constructed. In distrusted model training and inference, the number of machines is determined by the number of partitions. + +.. _gconstruction-json: + +Configuration JSON Object Explanations +************************************** + +The configuration JSON file is the **key** input argument for graph construction. The file contains a JSON object that defines the overall graph schema in terms of node type and edge type. For each node and edge type, it defines where the node and edge data are stored and in what file format. When a type of node or edge has features, it defines which columns in the data table are features and what feature transformation operations will be used to encode the features. When a type of node or edge has labels, it defines which columns in the data table are labels and how to split the labels into the training, validation, and testing sets. + +In the highest level, the JSON object contains three fields: ``version``, ``nodes`` and ``edges``. + +``version`` (**Optional**) +.......................... +``version`` marks the version of the configuration file schema, allowing its identification to be self-contained for downstream applications. The current (and expected) version is ``gconstruct-v0.1``. + +``nodes`` (**Required**) +........................ +``nodes`` contains a list of node types and the information of a node type is stored in a dictionary. A node dictionary contains multiple fields and most fields are optional. + +* ``node_type``: (**Required**) specifies the node type. Think this as a name given to one type of nodes, e.g. `"author"` and `"paper"`. +* ``files``: (**Required**) specifies the input files for the node type. There are multiple options to specify the input files. For a single input file, it contains the path of a single file. For multiple files, it could contain the paths of files with a wildcard, e.g., `file_name*.parquet`, or a list of file paths, e.g., `["file_name001.parquet", "file_name002.parquet", ...]`. +* ``format``: (**Required**) specifies the input file format. Currently, the construction command supports three input file formats: ``csv``, ``parquet``, and ``HDF5``. The value of this field is a dictionary, where the key is ``name`` and the value is either ``csv``, ``parquet`` or ``HDF5``, e.g., `{"name":"csv"}`. The detailed format information could be found in the :ref:`Input Raw Data Explanations <input_raw_data>` guideline. +* ``node_id_col``: specifies the column name that contains the node IDs. This field is optional. If not provided, the construction command will create node IDs according to the total number of rows and consider each row in the node table is a unique node. If user choose to store columns of a node type in multiple sets of tables, only one of the set of tables require to specify the node ID column. For example of this multiple sets of tables, please refer to :ref:`the simple input data example <multi-set-table-examle>` document. +* ``features`` is a list of dictionaries that define how to get features and transform features. This is optional. The format of a feature dictionary is defined in the :ref:`Feature dictionary format <feat-format>` section below. +* ``labels`` is a list of dictionaries that define where to get labels and how to split the labels into training/validation/test set. This is optional. The format of a label dictionary is defined in the :ref:`Label dictionary format <label-format>` section below. + +``edges`` (**Required**) +........................ +Similarly, ``edges`` contains a list of edge types and the information of an edge type is stored in a dictionary. An edge dictionary also contains the same fields of ``files``, ``format``, ``features`` and ``labels`` as the ``nodes`` field. In addition, it contains the following unique fields: + +* ``source_id_col``: (**Required**) specifies the column name of the source node IDs. +* ``dest_id_col``: (**Required**) specifies the column name of the destination node IDs. +* ``relation``: (**Required**) is a list of three elements that contains the node type of the source nodes, the relation type of the edges, and the node type of the destination nodes. Values of node types should be same as the corresponding values specified in the ``node_type`` fields in ``nodes`` objects, e.g., `["author", "write", "paper"]`. + +.. _feat-format: + +**Feature dictionary format** + +* ``feature_col``: (**Required**) specifies the column name in the input file that contains the feature. The ``feature_col`` can accept either a string or a list. When ``feature_col`` is specified as a list with multiple columns, the same feature transformation operation will be applied to each column, and then the transformed feature will be concatenated to form the final feature. +* ``feature_name``: specifies the prefix of the column feature name. This is optional. If feature_name is not provided, ``feature_col`` is used as the feature name. If the feature transformation generates multiple tensors, ``feature_name`` becomes the prefix of the names of the generated tensors. If there are multiple columns defined in ``feature_col``, ``feature_name`` is required. +* ``out_dtype`` specifies the data type of the transformed feature. ``out_dtype`` is optional. If it is not set, no data type casting is applied to the transformed feature. If it is set, the output feature will be cast into the corresponding data type. Now only `float16`, `float32`, and `float64` are supported. +* ``transform``: specifies the actual feature transformation. This is a dictionary and its name field indicates the feature transformation operation. Each transformation operation has its own argument(s). The list of feature transformations supported by the pipeline are listed in the section of :ref:`Feature Transformation <feat-transform>` below. + +.. _label-format: + +**Label dictionary format** + +* ``task_type``: (**Required**) specifies the task defined on the nodes or edges. Currently, its value can be one of ``classification``, ``regression``, ``link_prediction``, and ``reconstruct_node_feat``. +* ``label_col``: specifies the column name in the input file that contains the labels. This has to be specified for ``classification`` and ``regression`` tasks. ``label_col`` is also used as the label name. +* ``split_pct``: specifies how to split the data into training/validation/test. If it's not specified, the data is split into 80% for training 10% for validation and 10% for testing. The pipeline constructs three additional vectors indicating the training/validation/test masks. For ``classification`` and ``regression`` tasks, the names of the mask tensors are ``train_mask``, ``val_mask`` and ``test_mask``. +* ``custom_split_filenames``: specifies the customized training/validation/test mask. It has field named ``train``, ``valid``, and ``test`` to specify the path of the mask files. It is possible that one of the subfield here leaves empty and it will be treated as none. It will override the ``split_pct`` once provided. Refer to :ref:`Label split files <customized-split-labels>` for detailed explanations. +* ``label_stats_type``: specifies the statistic type used to summarize labels. So far, only support one value, i.e., ``frequency_cnt``. + +.. _feat-transform: + +Feature transformation +......................... +GraphStorm provides a set of transformation operations for different types of feautures. + +* **HuggingFace tokenizer transformation** tokenizes text strings with a HuggingFace tokenizer. The ``name`` field in the feature transformation dictionary is ``tokenize_hf``. The dict should contain two additional fields. + + 1. ``bert_model`` specifies the LM model used for tokenization. Users can choose any `HuggingFace LM models <https://huggingface.co/models>`_ from one of the following types: ``"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert"``, such as ``"bert-base-uncased" and "roberta-base"`` + 2. ``max_seq_length`` specifies the maximal sequence length. + + Example: + + .. code:: json + + "transform": {"name": "tokenize_hf", + "bert_model": "bert-base-uncased", + "max_seq_length": 16}, + +* **HuggingFace LM transformation** encodes text strings with a HuggingFace LM model. The ``name`` field in the feature transformation dictionary is ``bert_hf``. The dict should contain two additional fields. + + 1. ``bert_model`` specifies the LM model used for embedding text. Users can choose any `HuggingFace LM models <https://huggingface.co/models>`_ from one of the following types: ``"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert"``, such as ``"bert-base-uncased" and "roberta-base"`` + 2. ``max_seq_length`` specifies the maximal sequence length. + + Example: + + .. code:: json + + "transform": {"name": "bert_hf", + "bert_model": "roberta-base", + "max_seq_length": 256}, + +* **Numerical MAX_MIN transformation** normalizes numerical input features with `val = (val-min)/(max-min)`, where `val` is the feature value, `max` is the maximum value in the feature and `min` is the minimum value in the feature. The ``name`` field in the feature transformation dictionary is ``max_min_norm``. The dictionary can contain four optional fields: ``max_bound``, ``min_bound``, ``max_val`` and ``min_val``. + + - ``max_bound`` specifies the maximum value allowed in the feature. Any number larger than ``max_bound`` will be set to ``max_bound``. Here, `max = min(np.amax(feats), ``max_bound``)`. + - ``min_bound`` specifies the minimum value allowed in the feature. Any number smaller than ``min_bound`` will be set to ``min_bound``. Here, `min` = max(np.amin(feats), ``min_bound``). + - ``max_val`` defines the `max` in the transformation formula. When ``max_val`` is provided, `max` is always equal to ``max_val``. + - ``min_val`` defines the `min` in the transformation formula. When ``min_val`` is provided, `min` is always equal to ``min_val``. + + ``max_val`` and ``min_val`` are mainly used in the inference stage, where we want to use the same `max` and `min` values computed in the training stage to normalize inference data. + + Example: + + .. code:: json + + "transform": {"name": "max_min_norm", + "max_bound": 2., + "min_bound": -2.} + +* **Numerical Rank Gauss transformation** normalizes numerical input features with rank gauss normalization. It maps the numeric feature values to gaussian distribution based on ranking. The method follows the description in the normalization section of `the Porto Seguro's Safe Driver Prediction kaggle competition <https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927>`_. The ``name`` field in the feature transformation dictionary is ``rank_gauss``. The dict can contains two optional fields, i.e., ``epsilon`` which is used to avoid ``INF`` float during computation and ``uniquify`` which controls whether deduplicating input features before computing rank gauss norm. + + Example: + + .. code:: json + + "transform": {"name": "rank_gauss", + "epsilon": 1e-5, + "uniquify": True, } + +* **Convert to categorical values** converts text data to categorial values. The ``name`` field is ``to_categorical``, and ``separator`` specifies how to split the string into multiple categorical values (this is only used to define multiple categorical values). If ``separator`` is not specified, the entire string is considered as a single categorical value. ``mapping`` (optional) is a dictionary that specifies how to map a string to an integer value that defines a categorical value. If ``mapping`` is provided, any string value which is not in the ``mapping`` will be ignored. The ``mapping`` field is mainly used in the inference stage when we want to keep the same categorical mapping as in the training stage. + + Example: + + .. code:: json + + "transform": {"name": "to_categorical"}, + +* **Numerical Bucket transformation** normalizes numerical input features with buckets. The input features are divided into one or multiple buckets. Each bucket stands for a range of floats. An input value can fall into one or more buckets depending on the transformation configuration. The ``name`` field in the feature transformation dictionary is ``bucket_numerical``. Users can to provide ``range`` and ``bucket_cnt`` fields, where ``range`` defines a numerical range, and ``bucket_cnt`` defines number of buckets among the range. All buckets will have same length, and each of them is left included. e.g, bucket ``[a, b)`` will include ``a``, but not ``b``. All input feature column data are categorized into respective buckets using this method. Any input data lower than the minimum value will be assigned to the first bucket, and any input data exceeding the maximum value will be assigned to the last bucket. For example, with ``range: [10,30]`` and ``bucket_cnt: 2``, input data ``1`` will fall into the bucket ``[10, 20]``, input data ``11`` will be mapped to ``[10, 20]``, input data ``21`` will be mapped to ``[20, 30]``, input data ``31`` will be mapped to ``[20, 30]``. Finally GraphStorm uses one-hot-encoding to encode the feature for each numerical bucket. If a user wants to make numeric values fall into more than one bucket, it is suggested to use the ``slide_window_size`` field. ``slide_window_size`` defines a number, e.g., ``s``. Then each value ``v`` will be transformed into a range from ``v - s/2`` through ``v + s/2`` , and assigns the value ``v`` to every bucket that the range covers. + + Example: + + .. code:: json + + "transform": {"name": "bucket_numerical", + "range": [10, 50], + "bucket_cnt": 2, + "slide_window_size": 10}, + +* **No-op vector truncation (experimental)** truncates feature vectors to the length requested. The ``name`` field can be empty (e.g., ``{name: }``), and an integer ``truncate_dim`` value will determine the length of the output vector. This can be useful when experimenting with input features that were trained using `Matryoshka Representation Learning <https://arxiv.org/abs/2205.13147>`_. + + Example: + + .. code:: json + + "transform": {"name": , + "truncate_dim": 24}, + +.. _gcon-output-format: + +Outputs of the graph consturction command +............................................ +The graph construction command outputs two formats: ``DistDGL`` or ``DGL`` specified by the argument **-\-output-format**. + +If select ``DGL``, the output includes an `DGLGraph <https://docs.dgl.ai/en/1.0.x/generated/dgl.save_graphs.html>`_ file, named ``<graph_name>.dgl`` under the folder specified by the **-\-output-dir** argument, where `<graph_name>` is the value of argument **-\-graph-name**. + +If select ``DistDGL``, the output will be a partitioned `DistDGL graph <https://doc.dgl.ai/guide/distributed-preprocessing.html#partitioning-api>`_. It includes a JSON file, named `<graph_name>.json` that describes the meta-information of the partitioned graph, a set of ``part*`` folders under the folder specified by the **-\-output-dir** argument, where the `*` is the number specified by the **-\-num-parts** argument. + +Besides the graph data, the graph construction command also generate other files that contain related metadata information associated with the graph data, including a set of node and edge ID mapping files, a new construction configuration JSON file that records the details of feature transformation operations, and lable statistic summary files if required in the ``label_stats_type`` field. + +.. _gs-id-mapping-files: + + - **Node and Edge Mapping Files:** + There are two node/edge id mapping stages during graph construction. The first mapping occurs when GraphStorm converts the original user provided node ids into integer-based node ids, and the second mapping happends when graph partition operation shuffles these integer-based node ids to each partition with new node ids. Meanwhile, graph construction also saves two sets of node id mapping files as parts of its outputs. + + Outputs of the first mapping stage are stored at the ``raw_id_mappings`` folder under the path specified by the **-\-output-dir** argument. For each node type, there is a dedicated folder named after the ``node_type`` filed, in which contains parquet format files named after ``part-*****.parquet``, where ``*****`` represents five digit numbers starting from ``00000``. + + Outputs of the second mapping stage are two PyTorch tensor files, i.e., ``node_mapping.pt`` and ``edge_mapping.pt``, each of which maps the node and edge in the partitoined graph into the integer original node and edge id space. The node ID mapping is stored as a dictionary of 1D tensors whose key is the node type and value is a 1D tensor mapping between shuffled node IDs and the original node IDs. The edge ID mapping is stored as a dictionary of 1D tensors whose key is the edge type and value is a 1D tensor mapping between shuffled edge IDs and the original edge IDs. + + - **New Construction Configuration JSON:** + By default, GraphStorm will regenerate a construction configuration JSON file that copies the contents in the given JSON file specified by the **--conf-file** argument. In addition if there are transformations of features occurred, this newly generated JSON file will include some additional information. For example, if the original configuration JSON file requires to perform a **Convert to categorical values** transformation without giving the ``mapping`` dictionary, the newly generated configuration JSON file will add this ``mapping`` dictionary with the actual values and their mapping ids. This added information could help construct new graphs for fine-tunning saved models or doing inference with saved models. + + If users provide a value of the **-\-output-conf-file** argument, the newly generated configuration file will use this value as the file name. Otherwise GraphStorm will save the configuration JSON file in the **-\-output-dir** with name ``data_transform_new.json``. + + - **Label Statistic Summary JSONs:** + If required in the ``label_stats_type`` field, the graph construction command will compute statistics of labels and save them in a ``node_label_stats.json`` or a ``edge_label_stats.json``. + +.. note:: These mapping files are important for mapping the training and inference outputs. Therefore, DO NOT move or delete them. + +A construction configuration JSON example +.......................................... + +This section provides a construction configuration JSON associated to the :ref:`simple raw data example <simple-input-raw-data-example>` as an example for refernece. + +.. code:: yaml + + { + "version": "gconstruct-v0.1", + "nodes": [ + { + "node_id_col": "nid", + "node_type": "paper", + "format": {"name": "parquet"}, + "files": "paper_nodes.parquet", + "features": [ + { + "feature_col": ["aff"], + "feature_name": "aff_feat", + "transform": {"name": "to_categorical", + "mapping": {"NE": 0, "MT": 1,"UL": 2, "TT": 3,"UC": 4}} + }, + { + "feature_col": "abs", + "feature_name": "abs_bert", + "out_dtype": "float32", + "transform": {"name": "bert_hf", + "bert_model": "roberta", + "max_seq_length": 16} + }, + ], + "labels": [ + { + "label_col": "class", + "task_type": "classification", + "custom_split_filenames": { + "train": "train.json", + "valid": "val.json", + "test": "test.json"}, + "label_stats_type": "frequency_cnt", + }, + ], + }, + { + "node_id_col": "domain", + "node_type": "subject", + "format": {"name": "parquet"}, + "files": "subject_nodes.parquet", + }, + { + "node_id_col": "n_id", + "node_type": "author", + "format": {"name": "parquet"}, + "files": "author_nodes.parquet", + "features": [ + { + "feature_col": ["hdx"], + "feature_name": "feat", + "out_dtype": 'float16', + "transform": {"name": "max_min_norm", + "max_bound": 1000., + "min_val": 0.} + }, + ], + }, + { + "node_type": "author", + "format": {"name": "hdf5"}, + "files": "author_node_embeddings.h5", + "features": [ + { + "feature_col": ["embedding"], + "feature_name": "embed", + "out_dtype": 'float16', + }, + ], + + }, + ], + "edges": [ + { + "source_id_col": "nid", + "dest_id_col": "domain", + "relation": ["paper", "has", "subject"], + "format": {"name": "parquet"}, + "files": ["paper_has_subject_edges.parquet"], + "labels": [ + { + "label_col": "cnt", + "task_type": "regression", + "custom_split_filenames": { + "train": "train_edges.json", + "valid": "val_edges.json", + }, + }, + ], + }, + { + "source_id_col": "nid", + "dest_id_col": "n_id", + "relation": ["paper", "written-by", "author"], + "format": {"name": "parquet"}, + "files": ["paper_written-by_author_edges.parquet"], + } + ] + } + +.. note:: For a real runnable example, please refer to the :ref:`input JSON file <input-config>` used in the :ref:`Use Your Own Graphs Tutorial <use-own-data>`. + +A full argument list of the ``gconstruct.construct_graph`` command +................................................................... + +* **-\-conf-file**: (**Required**) the path of the configuration JSON file. +* **-\-num-processes**: the number of processes to process the data simulteneously. Default is 1. Increase this number can speed up data processing, but will also increase the CPU memory consumption. +* **-\-num-processes-for-nodes**: the number of processes to process node data simulteneously. Increase this number can speed up node data processing. +* **-\-num-processes-for-edges**: the number of processes to process edge data simulteneously. Increase this number can speed up edge data processing. +* **-\-output-dir**: (**Required**) the path of the output data files. +* **-\-graph-name**: (**Required**) the name assigned for the graph. +* **-\-remap-node-id**: boolean value to decide whether to rename node IDs or not. Adding this argument will set it to be true, otherwise false. +* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Adding this argument sets it to true; otherwise, it defaults to false. It is **strongly** suggested to include this argument for graph construction, as some nodes in the original data may not have in-degrees, and thus cannot update their presentations by aggregating messages from their neighbors. Adding this arugment helps prevent this issue. +* **-\-output-format**: the format of constructed graph, options are ``DGL``, ``DistDGL``. Default is ``DistDGL``. It also accepts multiple graph formats at the same time separated by an space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <gcon-output-format>` section above. +* **-\-num-parts**: an integer value that specifies the number of graph partitions to produce. This is only valid if the output format is ``DistDGL``. +* **-\-skip-nonexist-edges**: boolean value to decide whether skip edges whose endpoint nodes don't exist. Default is true. +* **-\-ext-mem-workspace**: the directory where the tool can store intermediate data during graph construction. Suggest to use high-speed SSD as the external memory workspace. +* **-\-ext-mem-feat-size**: the minimal number of feature dimensions that features can be stored in external memory. Default is 64. +* **-\-output-conf-file**: The output file with the updated configurations that records the details of data transformation, e.g., convert to categorical value mappings, and max-min normalization ranges. If not specified, will save the updated configuration file in the **-\-output-dir** with name `data_transform_new.json`. + +.. _configurations-partition: + +Graph Partition for DGL Graphs +******************************** + +.. warning:: The two graph partition tools in this section were originally implemented for quick code debugging and are no longer maintained. It is **strongly** suggested to use the ``gconstruct.construct_graph`` command or the :ref:`Distributed Graph Construction <distributed-gconstruction>` guideline for graph construction. + +For users who are already familiar with DGL and know how to construct DGL graphs, GraphStorm provides two graph partition tools to split DGL graphs into the required input format for GraphStorm model training and inference. + +* `partition_graph.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph.py>`_: for Node/Edge Classification/Regress task graph partition. +* `partition_graph_lp.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph_lp.py>`_: for Link Prediction task graph partition. + +`partition_graph.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph.py>`_ arguments +........................................................................................................... + +- **-\-dataset**: (**Required**) the graph dataset name defined for the saved DGL graph file. +- **-\-filepath**: (**Required**) the file path of the saved DGL graph file. +- **-\-target-ntype**: the node type for making prediction, required for node classification/regression tasks. This argument is associated with the node type having labels. Current GraphStorm supports **one** prediction node type only. +- **-\-ntype-task**: the node type task to perform. Only support ``classification`` and ``regression`` so far. Default is ``classification``. +- **-\-nlabel-field**: the field that stores labels on the prediction node type, **required** if **target-ntype** is set. The format is ``nodetype:labelname``, e.g., `"paper:label"`. +- **-\-target-etype**: the canonical edge type for making prediction, **required** for edge classification/regression tasks. This argument is associated with the edge type having labels. Current GraphStorm supports **one** prediction edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`. +- **-\-etype-task**: the edge type task to perform. Only allow ``classification`` and ``regression`` so far. Default is ``classification``. +- **-\-elabel-field**: the field that stores labels on the prediction edge type, required if **target-etype** is set. The format is ``src_ntype,etype,dst_ntype:labelname``, e.g., `"author,write,paper:label"`. +- **-\-generate-new-node-split**: a boolean value, required if need the partition script to split nodes for training/validation/test sets. If this argument is set to ``true``, the **target-ntype** argument **must** also be set. +- **-\-generate-new-edge-split**: a boolean value, required if need the partition script to split edges for training/validation/test sets. If this argument is set to ``true``, the **target-etype** argument **must** also be set. +- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training. +- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation. + +.. Note:: + The sum of the **train-pct** and **val-pct** should be less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct). + +- **-\-add-reverse-edges**: if add this argument, will add reverse edges to the given graph. +- **-\-num-parts**: (**Required**) an integer value that specifies the number of graph partitions to produce. Remember this number because we will need to set it in the model training step. +- **-\-output**: (**Required**) the folder path that the partitioned DGL graphs will be saved. + +`partition_graph_lp.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph_lp.py>`_ arguments +.................................................................................................................. +- **-\-dataset**: (**Required**) the graph name defined for the saved DGL graph file. +- **-\-filepath**: (**Required**) the file path of the saved DGL graph file. +- **-\-target-etypes**: (**Required**) the canonical edge types for making prediction. GraphStorm supports multiple predict edge types that are separated by a white space. The format is ``src_ntype1,etype1,dst_ntype1 src_ntype2,etype2,dst_ntype2``, e.g., `"author,write,paper paper,citing,paper"`. +- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split edges for training/validation/test sets, you can set this value to control the percentage of edges for training. +- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of edges for validation. + +.. Note:: + The sum of the **train-pct** and **val-pct** should less than 1. And the percentage of test edges is the result of 1-(train_pct + val_pct). + +- **-\-add-reverse-edges**: if add this argument, will add reverse edges to the given graphs. +- **-\-num-parts**: (**Required**) an integer value that specifies the number of graph partitions to produce. Remember this number because we will need to set it in the model training step. +- **-\-output**: (**Required**) the folder path that the partitioned DGL graph will be saved. \ No newline at end of file diff --git a/docs/source/configuration/configuration-run.rst b/docs/source/cli/model-training-inference/configuration-run.rst similarity index 97% rename from docs/source/configuration/configuration-run.rst rename to docs/source/cli/model-training-inference/configuration-run.rst index 841b8d3354..b99be25708 100644 --- a/docs/source/configuration/configuration-run.rst +++ b/docs/source/cli/model-training-inference/configuration-run.rst @@ -1,7 +1,7 @@ .. _configurations-run: -Training and Inference -============================ +Model Training and Inference Configurations +=========================================== GraphStorm provides dozens of configurable parameters for users to control their training and inference tasks. This document provides detailed description of each configurable parameter. You can use YAML config file to define these parameters or you can use command line arguments to define and update these parameters. Specifically, GraphStorm parses yaml config file first. Then it parses arguments to overwrite parameters defined in the yaml file or add new parameters. @@ -11,7 +11,7 @@ GraphStorm's `graphstorm.run.launch <https://github.com/awslabs/graphstorm/blob/ - **workspace**: the folder where launch command assume all artifacts were saved. If the other parameters' file paths are relative paths, launch command will consider these files in the workspace. - **part-config**: (**Required**) Path to a file containing graph partition configuration. The graph partition is generated by GraphStorm Partition tools. **HINT**: Use absolute path to avoid any path related problems. Otherwise, the file should be in workspace. -- **ip-config**: (**Required**) Path to a file containing IPs of instances in a distributed training/inference cluster. In the ip config file, each line stores one IP. **HINT**: Use absolute path to avoid any path related problems. Otherwise, the file should be in workspace. +- **ip-config**: Path to a file containing IPs of instances in a distributed training/inference cluster. In the ip config file, each line stores one IP. This configuration is required only for model training and inference on distributed clusters. **HINT**: Use absolute path to avoid any path related problems. Otherwise, the file should be in workspace. - **num-trainers**: The number of trainer processes per machine. Should >0. - **num-servers**: The number of server processes per machine. Should >0. - **num-samplers**: The number of sampler processes per trainer process. Should >=0. @@ -30,7 +30,7 @@ GraphStorm's `graphstorm.run.launch <https://github.com/awslabs/graphstorm/blob/ .. note:: Below configurations can be set either in a YAML configuration file or be added as arguments of launch command. Environment Configurations -------------------------------------- +---------------------------- - **backend**: (**Required**) PyTorch distributed backend, the suggested backend is gloo. Support backends include gloo and nccl - Yaml: ``backend: gloo`` @@ -216,11 +216,6 @@ GraphStorm provides a set of parameters to control training hyper-parameters. - Yaml: ``num_ffn_layers_in_decoder: 1`` - Argument: ``--num-ffn-layers-in-decoder 1`` - Default value: ``0`` -- **decoder_norm**: Graphstorm provides this argument as an option to define the norm type for the task decoders. Please note, it only accepts ``batch`` and ``layer`` for batchnorm and layernorm respectively. By default, it is set to 'none'. Note: all the built-in classification and regression decoders accept ``norm`` as one of their input arguments to define the norm type, but only ``MLPEFeatEdgeDecoder`` implements layer/batch norm between its layers. - - - Yaml: ``decoder_norm: batch`` - - Argument: ``--decoder-norm batch`` - - Default value: ``none`` - **input_activate**: Graphstorm provides this argument as an option to change the activation function in the input layer. Please note, it only accepts 'relu' and 'none'. - Yaml: ``input_activate: relu`` @@ -371,10 +366,10 @@ Classification and Regression Task - Yaml: ``imbalance_class_weights: 0.1,0.2,0.3`` - Argument: ``--imbalance-class-weights 0.1,0.2,0.3`` - Default value: ``None`` -- **return_proba**: For classification task, this configuration determines whether to return probability estimates for each class or the maximum probable class. Set `true`` to return probability estimates and `false` to return the maximum probable class. +- **return_proba**: For classification task, this configuration determines whether to return probability estimates for each class or the maximum probable class. Set true to return probability estimates and false to return the maximum probable class. - Yaml: ``return_proba: true`` - - Argument: ``--return_proba true`` + - Argument: ``--return-proba true`` - Default value: ``true`` - **save_prediction_path**: Path to save prediction results. This is used in node/edge classification/regression inference. diff --git a/docs/source/scale/distributed.rst b/docs/source/cli/model-training-inference/distributed/cluster.rst similarity index 97% rename from docs/source/scale/distributed.rst rename to docs/source/cli/model-training-inference/distributed/cluster.rst index cc9066fd96..3af5eb1b62 100644 --- a/docs/source/scale/distributed.rst +++ b/docs/source/cli/model-training-inference/distributed/cluster.rst @@ -1,7 +1,7 @@ .. _distributed-cluster: -Use GraphStorm in a Distributed Cluster -======================================== +Model Training and Inference on a Distributed Cluster +====================================================== GraphStorm can scale to the enterprise-level graphs in the distributed mode by using a cluster of instances. To leverage this capacity, there are four steps to follow: * Create a cluster with instances each of which can run GraphStorm Docker container. @@ -53,7 +53,7 @@ Collect the IP list ...................... The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect all IP addresses of the three instances and put them into a text file, e.g., ``/data/ip_list.txt``, which is like: -.. figure:: ../../../tutorial/distributed_ips.png +.. figure:: ../../../../../tutorial/distributed_ips.png :align: center .. note:: If possible, use **private IP addresses**, insteand of public IP addresses. Public IP addresses may have additional port constraints, which cause communication issues. @@ -157,7 +157,7 @@ That's it! The command will initialize the training in all three GraphStorm cont Train a Large Graph (OGBN-Papers100M) -------------------------------------- -The previous sections demonstrates GraphStorm's distributed capability for a quick start. This section will use GraphStorm to train a large Graph data, i.e., `OGBN-Papers100M <https://ogb.stanford.edu/docs/nodeprop/#ogbn-papers100M>`_, that can hardly train an RGCN model on it in a single machine. The steps of training this large graph is nearly the same as the above section, and only need a few additional operations. +The previous sections demonstrates GraphStorm's distributed capability for a quick start. This section will use GraphStorm to train a large Graph data, i.e., `OGBN-Papers100M <https://ogb.stanford.edu/docs/nodeprop/#ogbn-papers100M>`_, that can hardly train an RGCN model on a single machine. The steps of training this large graph is nearly the same as the above section, and only need a few additional operations. Create a GraphStorm Cluster ............................ diff --git a/docs/source/scale/sagemaker.rst b/docs/source/cli/model-training-inference/distributed/sagemaker.rst similarity index 99% rename from docs/source/scale/sagemaker.rst rename to docs/source/cli/model-training-inference/distributed/sagemaker.rst index 937a26b900..5b111726cb 100644 --- a/docs/source/scale/sagemaker.rst +++ b/docs/source/cli/model-training-inference/distributed/sagemaker.rst @@ -1,7 +1,8 @@ .. _distributed-sagemaker: -Use GraphStorm on SageMaker -=================================== +Model Training and Inference on on SageMaker +============================================= + GraphStorm can run on Amazon Sagemaker to leverage SageMaker's ML DevOps capabilities. Prerequisites diff --git a/docs/source/cli/model-training-inference/index.rst b/docs/source/cli/model-training-inference/index.rst new file mode 100644 index 0000000000..e4aa6c0829 --- /dev/null +++ b/docs/source/cli/model-training-inference/index.rst @@ -0,0 +1,24 @@ +.. _model_training_inference: + +======================================== +GraphStorm Model Training and Inference +======================================== + +Once your raw data are converted into partitioned DGL distributed graphs by using the :ref:`GraphStorm Graph Construction <graph_construction>` user guide, you can use Graphstorm CLIs to train GML models and do inference on a signle machine if there is one partition only, or on a distributed environment, such as a Linux cluster, for multiple partition graphs. + +This section provides guidelines of GraphStorm model training and inference on :ref:`signle machine <single-machine-training-inference>`, :ref:`distributed clusters <distributed-cluster>`, and :ref:`Amazon SageMaker <distributed-sagemaker>`. + +GraphStorm CLIs require less- or no-code operations for users to perform Graph Machine Learning (GML) tasks. In most cases, users only need to configure the parameters or arguments provided by GraphStorm to fulfill their GML tasks. Users can find the details of these configurations in the :ref:`Model Training and Inference Configurations<configurations-run>`. + +In addition, there are two node ID mapping operations during the graph construction procedure, and these mapping results are saved in a certain folder by which GraphStorm inference pipelines will automatically use to remap prediction results' node IDs back to the original IDs. In case when such automatic remapping does not occur, you can do it mannually according to the :ref:`GraphStorm Output Node ID Remapping <output-remapping>` guideline. + +.. toctree:: + :maxdepth: 2 + :glob: + + single-machine-training-inference + distributed/cluster + distributed/sagemaker + configuration-run + output + output-remapping diff --git a/docs/source/cli/model-training-inference/output-remapping.rst b/docs/source/cli/model-training-inference/output-remapping.rst new file mode 100644 index 0000000000..88d1e6dd36 --- /dev/null +++ b/docs/source/cli/model-training-inference/output-remapping.rst @@ -0,0 +1,220 @@ +.. _gs-output-remapping: + +GraphStorm Output Node ID Remapping +==================================== +During :ref:`Graph Construction<graph-construction>`, GraphStorm converts +user provided node IDs into integer-based node IDs. Thus, the outputs of +GraphStorm training and inference jobs, i.e., :ref:`saved node +embeddings<gs-output-embs>` and :ref:`saved prediction results<gs-out-predictions>`, +are stored with their integer-based node IDs. GraphStorm provides a +``gconstruct.remap_result`` module to remap the integer-based node IDs +back to the original user provided node IDs according to the :ref:`node ID +mapping files<gs-id-mapping-files>`. + +.. note:: + + If the training or inference tasks are launched by GraphStorm CLIs, + the ``gconstruct.remap_result`` module is automatically triggered to + to remap the integer-based node IDs back to the original user provided + node IDs. + +Output Node Embeddings after Remapping +-------------------------------------- +By default, the output node embeddings after ``gconstruct.remap_result`` +are stored in the path specified by ``save_embed_path`` in parquet format. +The node embeddings for different node +types are stored in separate directories, each named after the +corresponding node type. The content of the output directory will look like following: + +.. code-block:: bash + + emb_dir/ + ntype0: + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + ntype1: + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + +For multi-task learning tasks, the output node embeddings may have +task specific versions. (Details can be found in :ref:`Multi-task +Learning Output<multi-task-learning-output>`). The task specific +node embeddings are also processed by the ``gconstruct.remap_result`` module. +The content of the output directory will look like following: + +.. code-block:: bash + + emb_dir/ + ntype0/ + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + ntype1: + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + link_prediction-paper_cite_paper/ + ntype0/ + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + ntype1: + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + edge_regression-paper_cite_paper-year/ + ntype0/ + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + ntype1: + embed-00000_00000.parquet + embed-00000_00001.parquet + ... + +The content of each parquet file will look like following: + ++-----+------------------------------------------------------------------+ +| nid | emb | ++=====+==================================================================+ +| n0 | [0.2964, 0.0779, 1.2763, 2.8971, ..., -0.2564, 0.9060, -0.8740] | ++-----+------------------------------------------------------------------+ +| n1 | [1.6941, -1.6765, 0.1862, -0.4449, ..., 0.6474, 0.2358, -0.5952] | ++-----+------------------------------------------------------------------+ +| n10 | [-0.8417, 2.5096, -0.0393, -0.8208, ..., 0.9894, 2.3389, 0.9778] | ++-----+------------------------------------------------------------------+ + +.. note:: + + ``gconstruct.remap_result`` uses ``nid`` as the default column name + for node IDs and ``emb`` as the default column name for embeddings + + +Output Prediction Results after Remapping +----------------------------------------- +By default, the output prediction results after ``gconstruct.remap_result`` +are stored in the path specified by ``save_prediction_path`` in parquet format. +The prediction results for different node +types are stored in separate directories, each named after the +corresponding node type. The prediction results for different edge +types are stored in separate directories, each named after the +corresponding edge type. The content of the directory of node prediction results will look like following: + +.. code-block:: bash + + predict_dir/ + ntype0: + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + ntype1: + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + +The content of the directory of edge prediction results will look like following: + +.. code-block:: bash + + predict_dir/ + etype0: + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + etype1: + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + +For multi-task learning tasks, there can be multiple prediction results +for different tasks. (Details can be found in :ref:`Multi-task +Learning Output<multi-task-learning-output>`). The task specific +prediction results are also processed by the ``gconstruct.remap_result`` module. +The content of the output directory will look like following: + +.. code-block:: bash + + prediction_dir/ + edge_regression-paper_cite_paper-year/ + paper_cite_paper/ + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + node_classification-paper-venue/ + paper/ + predict-00000_00000.parquet + predict-00000_00001.parquet + ... + +The content of a node prediction result file will look like following: + ++-----+------------------+ +| nid | pred | ++=====+==================+ +| n0 | [0.2964, 0.7036] | ++-----+------------------+ +| n1 | [0.1862, 0.8138] | ++-----+------------------+ +| n10 | [0.9778, 0.0222] | ++-----+------------------+ + +.. note:: + + ``gconstruct.remap_result`` uses ``nid``as the default column name + for node IDs and ``pred``as the default column name for prediction results. + +The content of an edge prediction result file will look like following: + ++---------+---------+------------------+ +| src_nid | dst_nid | pred | ++=========+=========+==================+ +| n0 | n32 | [0.2964, 0.7036] | ++---------+---------+------------------+ +| n1 | n21 | [0.1862, 0.8138] | ++---------+---------+------------------+ +| n10 | n2 | [0.9778, 0.0222] | ++---------+---------+------------------+ + +.. note:: + + ``gconstruct.remap_result`` uses ``src_nid``as the default column name + for source node IDs, ``dst_nid``as the default column name for + destination node IDs and ``pred``as the default column name for prediction results. + +Run remap_result Command +------------------------- +If users want to run remap_result by themselves, they can run the +``gconstruct.remap_result`` command by following the command example: + +.. code:: python + + python -m graphstorm.gconstruct.remap_result \ + --node-id-mapping PATH_TO/id_mapping \ + --pred-ntypes "n0" "n1" \ + --prediction-dir PATH_TO/pred/ \ + --node-emb-dir PATH_TO/emb/ \ + +This example provides the actual Python command. It will do node ID +remapping for prediction results of node type `n0` and `n1`` stored +under `PATH_TO/pred/`. It will also do node ID remapping for node +embeddings stored under `PATH_TO/emb/`. The remapped data will be saved +in the save directory as the input data and the input data will be +removed to save disk space. + +Below lists the full argument list of the ``gconstruct.remap_result`` command: + +* **-\-node-id-mapping**: (**Required**) the path storing the node ID mapping files. +* **-\-cf**: the path to the yaml configuration file of the corresponding training or inference task. By providing the configuration file, ``gconstruct.remap_result`` will automatically infer the necessary information for ID remappings for node embeddings and prediction results. +* **-\-num-processes**: The number of processes to process the data simultaneously. A larger number of processes will speedup the ID remapping progress but consumes more CPU memory. Default is 4. +* **-\-node-emb-dir**: The directory storing the node embeddings to be remapped. Default is None. +* **-\-prediction-dir**: The directory storing the graph prediction results to be remapped. Default is None. +* **-\-pred-etypes**: A list of canonical edge types which have prediction results to be remmaped. For example, ``--pred-etypes user,rate,movie user,watch,movie``. Must be used with ``--prediction-dir``. Default is None. +* **-\-pred-ntypes**: A list of node types which have prediction results to be remmaped. For example, ``--pred-ntypes user movie``. Must be used with ``--prediction-dir``. Default is None. +* **-\-output-format**: The output format. It can be ``parquet`` or ``csv``. Default is ``parquet``. +* **-\-output-delimiter**: The delimiter used when **-\-output-format** set to ``csv``. Default is ``,``. +* **-\-column-names**: Defines how to rename default column names to new names. For example, given ``--column-names nid,~id emb,embedding``, the column ``nid``will be renamed to ``~id`` and the column ``emb`` will be renamed to `embedding`. Default is None. +* **-\-logging-level**: The logging level. The possible values: `debug`, ``info``, ``warning``, ``error``. Default is ``info``. +* **-\-output-chunk-size**: Number of rows per output file. ``gconstruct.remap_result`` will automatically split output file into multiple files. By default, it is set to ``sys.maxsize`` +* **-\-preserve-input**: Whether we preserve the input data. This is only for debug purpose. Default is False. diff --git a/docs/source/cli/model-training-inference/output.rst b/docs/source/cli/model-training-inference/output.rst new file mode 100644 index 0000000000..f7e81571e4 --- /dev/null +++ b/docs/source/cli/model-training-inference/output.rst @@ -0,0 +1,215 @@ +.. _gs-output: + +GraphStorm Training and Inference Output +======================================== +GraphStorm training pipeline can save both trained model checkpoints and node embeddings +on disk. When ``save_model_path`` is provided in the training configuration, +the trained model checkpoints will be saved in the corresponding path. +The contents of the ``save_model_path`` will look like following: + +.. code-block:: bash + + model_dir/ + epoch-0-iter-1099/ + epoch-0-iter-2099/ + epoch-0/ + epoch-1-iter-1099/ + epoch-1-iter-2099/ + ... + +When ``save_embed_path`` is provided in the training configuration, +the node embeddings produced by the best model checkpoint will be saved +in the corresponding path. When the training task is launched by +GraphStorm CLIs, a node ID remapping process will be launched +automatically, after the training job, to process the saved node embeddings and the corresponding node IDs. The final output of node +embeddings will be in parquet format by default. Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` + +GraphStorm inference pipeline can save both node embeddings and prediction +results on disk. When ``save_embed_path`` is provided in the inference configurations, +the node embeddings will be saved in the same way as GraphStorm training pipeline. +When ``save_prediction_path`` is provided in the inference configurations, +GraphStorm will save the prediction results in the corresponding path. +When the inference task is launched by GraphStorm CLIs, a ndoe ID remapping +process will be launched automatically, after the inference job, to +process the saved prediction results and the corresponding node IDs. +The final output of prediction results will be in parquet format by default. +Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` + + +The following sections will introduce how the node embeddings and prediction +results are saved by the GraphStorm training and inference scripts. + +.. note:: + + In most of the end-to-end training and inference cases, the saved files, usually in ``.pt`` format, are not consumable by the downstream applications. The :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` must be invoked to process the output files. + + +.. _gs-output-embs: + +Saved Node Embeddings +--------------------- +When ``save_embed_path`` is provided in the training configuration or the inference configuration, +GraphStorm will save the node embeddings in the corresponding path. The node embeddings +of each node type are saved separately under different sub-directories named with +the corresponding node types. GraphStorm will also save an ``emb_info.json`` file, +which contains all the metadata for the saved node embeddings. +The contents of the ``save_embed_path`` will look like following: + +.. code-block:: bash + + emb_dir/ + ntype0/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + ntype1/ + embed_nids-00000.pt + embed_nids-00001.pt + ... + embed-00000.pt + embed-00001.pt + ... + ... + emb_info.json + +The ``embed_nids-*`` files store the integer node IDs of each node embedding and +the ``embed-*`` files store the corresponding node embeddings. +The contents of ``embed_nids-*`` files and ``embed-*`` files look like: + +.. code-block:: + + embed_nids-00000.pt | embed-00000.pt + | + Graph Node ID | embeddings + 10 | 0.112,0.123,-0.011,... + 1 | 0.872,0.321,-0.901,... + 23 | 0.472,0.432,-0.732,... + ... + +The ``emb_info.json`` stores three types of information: + * ``format``: The format of the saved embeddings. By default, it is ``pytorch``. + * ``emb_name``: A list of node types that have node embeddings saved. For example: ["ntype0", "ntype1"] + * ``world_size``: The number of chunks (files) into which the node embeddings of a particular node type are divided. For instance, if world_size is set to 8, there will be 8 files for each node type's node embeddings." + +**Note: The built-in GraphStorm training or inference pipeline +(launched by GraphStorm CLIs) will process the saved node embeddings +to convert the integer node IDs into the raw node IDs, which are usually +string node IDs. The final output will be in parquet format by default. +And the node embedding files, i.e.,``embed-*.pt`` files, and node ID +files, i.e.,``embed_nids-*.pt`` files, will be removed.** Details can be +found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` + +.. _gs-out-predictions: + +Saved Prediction Results +------------------------ +When ``save_prediction_path`` is provided in the inference configurations, +GraphStorm will save the prediction results in the corresponding path. +For node prediction tasks, the prediction results are saved per node type. +GraphStorm will also save an ``result_info.json`` file, which contains all +the metadata for the saved prediction results. The contents of the ``save_prediction_path`` +will look like following: + +.. code-block:: bash + + prediction_dir/ + ntype0/ + predict-00000.pt + predict-00001.pt + ... + predict_nids-00000.pt + predict_nids-00001.pt + ... + ntype1/ + predict-00000.pt + predict-00001.pt + ... + predict_nids-00000.pt + predict_nids-00001.pt + ... + ... + result_info.json + +The ``predict_nids-*`` files store the integer node IDs of each prediction result and +the ``predict-*`` files store the corresponding prediction results. +The content of ``predict_nids-*`` files and ``predict-*`` files looks like: + +.. code-block:: + + predict_nids-00000.pt | predict-00000.pt + | + Graph Node ID | Prediction results + 10 | 0.112 + 1 | 0.872 + 23 | 0.472 + ... + +The ``result_info.json`` stores three types of information: + * ``format``: The format of the saved prediction results. By default, it is ``pytorch``. + * ``emb_name``: A list of node types that have node prediction results saved. For example: ["ntype0", "ntype1"] + * ``world_size``: The number of chunks (files) into which the prediction results of a particular node type are divided. For instance, if world_size is set to 8, there will be 8 files for each node type's prediction results." + + +For edge prediction tasks, the prediction results are saved per edge type. +The sub-directory for an edge type is named as ``<src_ntype>_<relation_type>_<dst_ntype>``. +For instance, given an edge type ``("movie","rated-by","user")``, the corresponding +sub-directory is named as ``movie_rated-by_user``. +GraphStorm will also save an ``result_info.json`` file, which contains all +the metadata for the saved prediction results. The contents of the ``save_prediction_path`` +will look like following: + +.. code-block:: bash + + prediction_dir/ + etype0/ + predict-00000.pt + predict-00001.pt + ... + src_nids-00000.pt + src_nids-00001.pt + ... + dst_nids-00000.pt + dst_nids-00001.pt + ... + etype1/ + predict-00000.pt + predict-00001.pt + ... + src_nids-00000.pt + src_nids-00001.pt + ... + dst_nids-00000.pt + dst_nids-00001.pt + ... + ... + result_info.json + +The ``src_nids-*`` and ``dst_nids-*`` files contain the integer node IDs for +the source and destination nodes of each prediction, respectively. +The ``predict-*`` files store the corresponding prediction results. +The content of ``src_nids-*``, ``dst_nids-*`` and ``predict-*`` files looks like: + +.. code-block:: + + src_nids-00000.pt | dst_nids-00000.pt | predict-00000.pt + | + Source Node ID | Destination Node ID | Prediction results + 10 | 12 | 0.112 + 1 | 20 | 0.872 + 23 | 3 | 0.472 + ... + +The ``result_info.json`` stores three types of informations: + * ``format``: The format of the saved prediction results. By default, it is ``pytorch``. + * ``etypes``: A list of edge types that have edge prediction results saved. For example: [("movie","rated-by","user"), ("user","watched","movie")] + * ``world_size``: The number of chunks (files) into which the prediction results of a particular edge type are divided. For instance, if world_size is set to 8, there will be 8 files for each edge type's prediction results." + +**Note: The built-in GraphStorm inference pipeline +(launched by GraphStorm CLIs) will process the saved prediction results +to convert the integer node IDs into the raw node IDs, which are usually string node IDs. The final output will be in parquet format by default. +And the prediction files, i.e.,``predict-*.pt`` files, and node ID files, +i.e.,``predict_nids-*.pt``, ``src_nids-*.pt``, and ``dst_nids-*.pt`` files +will be removed.** Details can be found in :ref:`GraphStorm Output Node ID Remapping<gs-output-remapping>` diff --git a/docs/source/cli/model-training-inference/single-machine-training-inference.rst b/docs/source/cli/model-training-inference/single-machine-training-inference.rst new file mode 100644 index 0000000000..28e985bfe5 --- /dev/null +++ b/docs/source/cli/model-training-inference/single-machine-training-inference.rst @@ -0,0 +1,76 @@ +.. _single-machine-training-inference: + +Model Training and Inference on a Single Machine +------------------------------------------------- +While the :ref:`Standalone Mode Quick Start <quick-start-standalone>` tutorial introduces some basic concepts, commands, and steps of using GprahStorm CLIs on a single machine, this user guide provides more detailed description of the usage of GraphStorm CLIs in a single machine. In addition, the majority of the descriptions in this guide can be directly applied to :ref:`model training and inference on distributed clusters <distributed-cluster>`. + +GraphStorm can support graph machine learning (GML) model training and inference for common GML tasks, including **node classification**, **node regression**, **edge classification**, **edge regression**, and **link prediction**. Since the :ref:`multi-task learning <multi_task_learning>` feature released in v0.3 is in experimental stage, formal documentations about this feature will be released later when it is mature. + +For each task, GraphStorm provide a dedicated CLI for model training and inference. These CLIs share the same command template and some configurations, while each CLI has its unique task-specific configurations. GraphStorm also has a task-agnostic CLI for users to run your customized models. + +Task-specific CLI template for model training and inference +............................................................ +GraphStorm model training and inference CLIs like the commands below. + +.. code-block:: bash + + # Model training + python -m graphstorm.run.TASK_COMMAND \ + --num-trainers 1 \ + --part-config data.json \ + --cf config.yaml \ + --save-model-path model_path/ + + # Model inference + python -m graphstorm.run.TASK_COMMAND \ + --inference \ + --num-trainers 1 \ + --part-config data.json \ + --cf config.yaml \ + --restore-model-path model_path/ \ + --save-prediction-path pred_path/ + +In the above two templates, the ``TASK_COMMAND`` represents one of the five task-specific commands: + + * ``gs_node_classification`` for node classification tasks; + * ``gs_node_regression`` for node regression tasks; + * ``gs_edge_classification`` for edge classification tasks; + * ``gs_edge_regression`` for edge regression tasks; + * ``gs_link_prediction`` for link prediction tasks. + +These task-specific commands work for both model training and inference except that inference CLI needs to add the **-\-inference** argument to indicate this is an inference CLI, and the **-\-restore-model-path** argument that indicates the path of the saved model checkpoint. + +For a single machine, the argument **-\-num-trainers** can configure how many GPUs or CPU processes to be used. If using a GPU machine, the value of **-\-num-trainers** should be **equal or less** than the total number of available GPUs, while in a CPU-only machine, the value could be less than the total number of CPU processes to avoid errors. + +GraphStorm model training and inference CLIs use the **-\-part-config** argument to specify the partitioned graph data. Its value should be the path of the `*.json` file that is generated by the :ref:`GraphStorm Graph Construction <graph_construction>` step. + +While the CLIs could be very simple as the template demonstrated, users can leverage a YAML file to set a variaty of GraphStorm configurations that could make full use of the rich functions and features provided by GraphStorm. The YAML file will be specified to the **-\-cf** argument. GraphStorm has a set of `example YAML files <https://github.com/awslabs/graphstorm/tree/main/training_scripts>`_ available for reference. + +.. note:: + + * Users can set CLI configurations either in CLI arguments or the configuration YAML file. But values set in CLI arguments will overwrite the values of the same configuration set in the YAML file. + * This guide only explains a few commonly used configurations. For the detailed explanations of GraphStorm CLI configurations, please refer to the :ref:`Model Training and Inference Configurations <configurations-run>` section. + +Task-agnostic CLI for model training and inference +................................................... +While task-specific CLIs allow users to quickly perform GML tasks supported by GraphStorm, users may build their own GNN models as described in the :ref:`Use Your Own Models <use-own-models>` tutorial. To put these customized models into GraphStorm model training and inference pipeline, users can use the task-agnostic CLI as shown in the examples below. + +.. code-block:: bash + + # Model training + python -m graphstorm.run.launch \ + --num-trainers 1 \ + --part-config data.json \ + --save-model-path model_path/ \ + customized_model.py customized_arguments + + # Model inference + python -m graphstorm.run.launch \ + --inference \ + --num-trainers 1 \ + --part-config data.json \ + --restore-model-path model_path/ \ + --save-prediction-path pred_path/ + customized_model.py customized_arguments + +The task-agnostic CLI command (``launch``) has similar tempalte as the task-specific CLIs except that it takes the customized model, which is stored as a ``.py`` file, as an argument. And in case the customized model has its own arguments, they should be placed after the customized model python file. diff --git a/docs/source/configuration/configuration-gconstruction.rst b/docs/source/configuration/configuration-gconstruction.rst deleted file mode 100644 index 429830ab34..0000000000 --- a/docs/source/configuration/configuration-gconstruction.rst +++ /dev/null @@ -1,171 +0,0 @@ -.. _configurations-gconstruction: - -Graph Construction -============================ - -`construct_graph.py <https://github.com/zhjwy9343/graphstorm/blob/main/python/graphstorm/gconstruct/construct_graph.py>`_ arguments --------------------------------------------------------------------------------------------------------------------------------------- - -* **-\-conf-file**: (**Required**) the path of the configuration JSON file. -* **-\-num-processes**: the number of processes to process the data simulteneously. Default is 1. Increase this number can speed up data processing. -* **-\-num-processes-for-nodes**: the number of processes to process node data simulteneously. Increase this number can speed up node data processing. -* **-\-num-processes-for-edges**: the number of processes to process edge data simulteneously. Increase this number can speed up edge data processing. -* **-\-output-dir**: (**Required**) the path of the output data files. -* **-\-graph-name**: (**Required**) the name assigned for the graph. -* **-\-remap-node-id**: boolean value to decide whether to rename node IDs or not. Adding this argument will set it to be true, otherwise false. -* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Adding this argument will set it to be true, otherwise false. -* **-\-output-format**: the format of constructed graph, options are ``DGL``, ``DistDGL``. Default is ``DistDGL``. It also accepts multiple graph formats at the same time separated by an space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <output-format>` section below. -* **-\-num-parts**: the number of partitions of the constructed graph. This is only valid if the output format is ``DistDGL``. -* **-\-skip-nonexist-edges**: boolean value to decide whether skip edges whose endpoint nodes don't exist. Default is true. -* **-\-ext-mem-workspace**: the directory where the tool can store data during graph construction. Suggest to use high-speed SSD as the external memory workspace. -* **-\-ext-mem-feat-size**: the minimal number of feature dimensions that features can be stored in external memory. Default is 64. -* **-\-output-conf-file**: The output file with the updated configurations that records the details of data transformation, e.g., convert to categorical value mappings, and max-min normalization ranges. If not specified, will save the updated configuration file in the **-\-output-dir** with name `data_transform_new.json`. - -.. _gconstruction-json: - -Configuration JSON Explanations ---------------------------------- - -The JSON file that describes the graph data defines where to get node data and edge data to construct a graph. Below shows an example of such a JSON file. In the highest level, it contains three fields: ``version``, ``nodes`` and ``edges``. - -``version`` -........... -``version`` marks the version of the configuration file schema, allowing its identification to be self-contained for downstream applications. The current (and expected) version is ``gconstruct-v0.1``. - -``nodes`` -........... -``nodes`` contains a list of node types and the information of a node type is stored in a dictionary. A node dictionary contains multiple fields and most fields are optional. - -* ``node_type``: (**Required**) specifies the node type. Think this as a name given to one type of nodes, e.g. `author` and `paper`. -* ``files``: (**Required**) specifies the input files for the node data. There are multiple options to specify the input files. For a single input file, it contains the path of a single file. For multiple files, it contains the paths of files with a wildcard, or a list of file paths, e.g., `file_name*.parquet`. -* ``format``: (**Required**) specifies the input file format. Currently, the pipeline supports three formats: ``parquet``, ``HDF5``, and ``JSON``. The value of this field is a dictionary, where the key is ``name`` and the value is either ``parquet`` or ``JSON``, e.g., `{"name":"JSON"}`. The detailed format information is specified in the format section. -* ``node_id_col``: specifies the column name that contains the node IDs. This field is optional. If a node type contains multiple blocks to specify the node data, only one of the blocks require to specify the node ID column. -* ``features`` is a list of dictionaries that define how to get features and transform features. This is optional. The format of a feature dictionary is defined :ref:`below <feat-format>`. -* ``labels`` is a list of dictionaries that define where to get labels and how to split the data into training/validation/test set. This is optional. The format of a label dictionary is defined :ref:`below<label-format>`. - -``edges`` -........... -Similarly, ``edges`` contains a list of edge types and the information of an edge type is stored in a dictionary. An edge dictionary also contains the same fields of ``files``, ``format``, ``features`` and ``labels`` as ``nodes``. In addition, it contains the following fields: - -* ``source_id_col``: (**Required**) specifies the column name of the source node IDs. -* ``dest_id_col``: (**Required**) specifies the column name of the destination node IDs. -* ``relation``: (**Required**) is a list of three elements that contains the node type of the source nodes, the relation type of the edges and the node type of the destination nodes. Values of node types should be same as the corresponding values specified in the ``node_type`` fields in ``nodes`` objects, e.g., `["author", "write", "paper"]`. - -.. _feat-format: - -**A feature dictionary is defined:** - -* ``feature_col``: (**Required**) specifies the column name in the input file that contains the feature. The ``feature_col`` can accept either a string or a list. When ``feature_col`` is specified as a list with multiple columns, the same feature transformation operation will be applied to each column, and then the transformed feature will be concatenated to form the final feature. -* ``feature_name``: specifies the prefix of the column feature name. This is optional. If feature_name is not provided, ``feature_col`` is used as the feature name. If the feature transformation generates multiple tensors, ``feature_name`` becomes the prefix of the names of the generated tensors. If there are multiple columns defined in ``feature_col``, ``feature_name`` is required. -* ``out_dtype`` specifies the data type of the transformed feature. ``out_dtype`` is optional. If it is not set, no data type casting is applied to the transformed feature. If it is set, the output feature will be cast into the corresponding data type. Now only `float16`, `float32`, and `float64` are supported. -* ``transform``: specifies the actual feature transformation. This is a dictionary and its name field indicates the feature transformation. Each transformation has its own argument. The list of feature transformations supported by the pipeline are listed in the section of :ref:`Feature Transformation <feat-transform>` below. - -.. _label-format: - -**A label dictionary is defined:** - -* ``task_type``: (**Required**) specifies the task defined on the nodes or edges. Currently, its value can be ``classification``, ``regression`` and ``link_prediction``. -* ``label_col``: specifies the column name in the input file that contains the label. This has to be specified for ``classification`` and ``regression`` tasks. ``label_col`` is used as the label name. -* ``split_pct``: (Optional) specifies how to split the data into training/validation/test. If it's not specified, the data is split into 80% for training 10% for validation and 10% for testing. The pipeline constructs three additional vectors indicating the training/validation/test masks. For ``classification`` and ``regression`` tasks, the names of the mask tensors are ``train_mask``, ``val_mask`` and ``test_mask``. -* ``custom_split_filenames``: (Optional) specifies the customized training/validation/test mask. It has field named ``train``, ``valid``, and ``test`` to specify the path of the mask files. It is possible that one of the subfield here leaves empty and it will be treated as none. It will override the ``split_pct`` once provided. Refer to :ref:`Use Your Own Graphs Tutorial <use-own-data>` for an example. - -.. _input-format: - -Input formats -.............. -Currently, the graph construction pipeline supports three input formats: ``Parquet``, ``HDF5``, and ``JSON``. - -For the Parquet format, each column defines a node/edge feature, label or node/edge IDs. For multi-dimensional features, currently the pipeline requires the features to be stored as a list of vectors. The pipeline will reconstruct multi-dimensional features and store them in a matrix. - -The HDF5 format is similar as the parquet format, but have larger capacity. Therefore suggest to use HDF5 format if users' data is large. - -For JSON format, each line of the JSON file is a JSON object. The JSON object can only have one level. The value of each field can only be primitive values, such as integers, strings and floating points, or a list of integers or floating points. - -.. _feat-transform: - -Feature transformation -......................... -Currently, the graph construction pipeline supports the following feature transformation: - -* **HuggingFace tokenizer transformation** tokenizes text strings with a HuggingFace tokenizer. The ``name`` field in the feature transformation dictionary is ``tokenize_hf``. The dict should contain two additional fields. ``bert_model`` specifies the LM model used for tokenization. Users can choose any `HuggingFace LM models <https://huggingface.co/models>`_ from one of the following types: ``"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert"``. ``max_seq_length`` specifies the maximal sequence length. -* **HuggingFace LM transformation** encodes text strings with a HuggingFace LM model. The ``name`` field in the feature transformation dictionary is ``bert_hf``. The dict should contain two additional fields. ``bert_model`` specifies the LM model used for embedding text. Users can choose any `HuggingFace LM models <https://huggingface.co/models>`_ from one of the following types: ``"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert"``. ``max_seq_length`` specifies the maximal sequence length. -* **Numerical MAX_MIN transformation** normalizes numerical input features with `val = (val-min)/(max-min)`, where `val` is the feature value, `max` is the maximum number in the feature and `min` is the minimum number in the feature. The ``name`` field in the feature transformation dictionary is ``max_min_norm``. The dict can contain four **optional** fields: ``max_bound``, ``min_bound``, ``max_val`` and ``min_val``. ``max_bound`` specifies the maximum value allowed in the feature. Any number larger than ``max_bound`` will be set to ``max_bound``. Here, `max = min(np.amax(feats), ``max_bound``)`. ``min_bound`` specifies the minimum value allowed in the feature. Any number smaller than ``min_bound`` will be set to ``min_bound``. Here, `min` = max(np.amin(feats), ``min_bound``). ``max_val`` defines the `max` in the transformation formula. When ``max_val`` is provided, `max` is always equal to ``max_val``. ``min_val`` defines the `min` in the transformation formula. When ``min_val`` is provided, `min` is always equal to ``min_val``. ``max_val`` and ``min_val`` are mainly used in the inference stage, where we want to use the max & min values computed in the training stage to normalize inference data. -* **Numerical Rank Gauss transformation** normalizes numerical input features with rank gauss normalization. It maps the numeric feature values to gaussian distribution based on ranking. The method follows https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927. The ``name`` field in the feature transformation dictionary is ``rank_gauss``. The dict can contains one optional field, i.e., ``epsilon`` which is used to avoid INF float during computation and ``uniquify`` which controls whether deduplicating input features before computing rank gauss norm. -* **Convert to categorical values** converts text data to categorial values. The ``name`` field is ``to_categorical``, and ``separator`` specifies how to split the string into multiple categorical values (this is only used to define multiple categorical values). If ``separator`` is not specified, the entire string is a categorical value. ``mapping`` (**optional**) is a dict that specifies how to map a string to an integer value that defines a categorical value. -* **Numerical Bucket transformation** normalizes numerical input features with buckets. The input features are divided into one or multiple buckets. Each bucket stands for a range of floats. An input value can fall into one or more buckets depending on the transformation configuration. The ``name`` field in the feature transformation dictionary is ``bucket_numerical``. Users need to provide ``range`` and ``bucket_cnt`` field, which ``range`` defines a numerical range, and ``bucket_cnt`` defines number of buckets among the range. All buckets will have same length, and each of them is left included. e.g, bucket ``(a, b)`` will include a, but not b. All input feature column data are categorized into respective buckets using this method. Any input data lower than the minimum value will be assigned to the first bucket, and any input data exceeding the maximum value will be assigned to the last bucket. For example, with range=`[10,30]` and bucket_cnt=`2`, input data `1` will fall into the bucket `[10, 20]`, input data `11` will be mapped to `[10, 20]`, input data `21` will be mapped to `[20, 30]`, input data `31` will be mapped to `[20, 30]`. Finally we use one-hot-encoding to encode the feature for each numerical bucket. If a user wants to make numeric values fall into more than one bucket, it is preferred to use the `slide_window_size`: `"slide_window_size": s` , where `s` is a number. Then each value `v` will be transformed into a range from `v - s/2` through `v + s/2` , and assigns the value `v` to every bucket that the range covers. -* **No-op vector truncation** truncates feature vectors to the length requested. The ``name`` field can be empty, - and an integer ``truncate_dim`` value will determine the length of the output vector. - This can be useful when experimenting with input features that were trained using Matryoshka Representation Learning. - -.. _output-format: - -Output -.......... -Currently, the graph construction pipeline outputs two output formats: ``DistDGL`` and ``DGL``. If select ``DGL``, the output is a file, named `<graph_name>.dgl` under the folder specified by the **-\-output-dir** argument, where `<graph_name>` is the value of argument **-\-graph-name**. If select ``DistDGL``, the output is a JSON file, named `<graph_name>.json`, and a set of `part*` folders under the folder specified by the **-\-output-dir** argument, where the `*` is the number specified by the **-\-num-parts** argument. - -By Specifying the output_format as ``DGL``, the output will be an `DGLGraph <https://docs.dgl.ai/en/1.0.x/generated/dgl.save_graphs.html>`_. By Specifying the output_format as ``DistDGL``, the output will be a partitioned graph named `DistDGL graph <https://doc.dgl.ai/guide/distributed-preprocessing.html#partitioning-api>`_. It contains the partitioned graph, a JSON config describing the meta-information of the partitioned graph, the mappings for the edges and nodes after partition, and other files that contain related metadata information, e.g., the new construction configuration JSON file that records the details of feature transformation operations. - -**Node and Edge Mapping Files:** - -There are two node/edge id mapping stages during graph construction. The first mapping occurs when GraphStorm converts the original user provided node ids into integer-based node ids, and the second mapping happends when graph partition operation shuffles these integer-based node ids to each partition with new node ids. Meanwhile, graph construction also saves two sets of node id mapping files as parts of its outputs. - -Outputs of the first mapping stage are stored at the `raw_id_mappings` folder under the path specified by the **-\-output-dir** argument. For each node type, there is a dedicated folder named after the ``node_type`` filed, in which contains parquet format files named after `part-*****.parquet`, where `*****` represents five digit numbers starting from `00000`. - -Outputs of the second mapping stage are two PyTorch tensor files, i.e., ``node_mapping.pt`` and ``edge_mapping.pt``, each of which maps the node and edge in the partitoined graph into the integer original node and edge id space. The node ID mapping is stored as a dictionary of 1D tensors whose key is the node type and value is a 1D tensor mapping between shuffled node IDs and the original node IDs. The edge ID mapping is stored as a dictionary of 1D tensors whose key is the edge type and value is a 1D tensor mapping between shuffled edge IDs and the original edge IDs. - -.. note:: These mapping files are important for mapping the training and inference outputs. Therefore, DO NOT move or delete them. - -**New Construction Configuration JSON:** - -By default, GraphStorm will regenerate a construction configuration JSON file that copies the contents in the given JSON file specified by the **--conf-file** argument. In addition if there are transformations of features occurred, this newly generated JSON file will include some additional information. For example, if the original configuration JSON file requires to perform a **Convert to categorical values** transformation without giving the ``mapping`` dictionary, the newly generated configuration JSON file will add this ``mapping`` dictionary with the actual values and their mapping ids. This added information could help construct new graphs for fine-tunning saved models or doing inference with saved models. - -If users provide a value of the **-\-output-conf-file** argument, the newly generated configuration file will use this value as the file name. Otherwise GraphStorm will save the configuration JSON file in the **-\-output-dir** with name `data_transform_new.json`. - -An example -............ -Below shows an example that contains one node type and an edge type. For a real example, please refer to the :ref:`input JSON file <input-config>` used in the :ref:`Use Your Own Graphs Tutorial <use-own-data>`. - -.. code-block:: json - - { - "version": "gconstruct-v0.1", - "nodes": [ - { - "node_id_col": "paper_id", - "node_type": "paper", - "format": {"name": "parquet"}, - "files": "/tmp/dummy/paper_nodes*.parquet", - "features": [ - { - "feature_col": ["paper_title"], - "feature_name": "title", - "transform": {"name": "tokenize_hf", - "bert": "huggingface-basic", - "max_seq_length": 512} - }, - ], - "labels": [ - { - "label_col": "labels", - "task_type": "classification", - "split_pct": [0.8, 0.2, 0.0], - }, - ], - } - ], - "edges": [ - { - "source_id_col": "src_paper_id", - "dest_id_col": "dest_paper_id", - "relation": ["paper", "cite", "paer"], - "format": {"name": "parquet"}, - "files": ["/tmp/edge_feat.parquet"], - "features": [ - { - "feature_col": ["citation_time"], - "feature_name": "feat", - }, - ] - } - ] - } diff --git a/docs/source/configuration/configuration-partition.rst b/docs/source/configuration/configuration-partition.rst deleted file mode 100644 index 8949dc3452..0000000000 --- a/docs/source/configuration/configuration-partition.rst +++ /dev/null @@ -1,51 +0,0 @@ -.. _configurations-partition: - -Graph Partition -============================ - -For users who are already familiar with DGL and know how to construct DGL graph, GraphStorm provides two graph partition tools to partition DGL graphs into the required input format for GraphStorm launch tool for training and inference. - -* `partition_graph.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph.py>`_: for Node/Edge Classification/Regress task graph partition. -* `partition_graph_lp.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph_lp.py>`_: for Link Prediction task graph partition. - -`partition_graph.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph.py>`_ arguments ---------------------------------------------------------------------------------------------------------------- - -- **-\-dataset**: (**Required**) the graph dataset name defined for the saved DGL graph file. -- **-\-filepath**: (**Required**) the file path of the saved DGL graph file. -- **-\-target-ntype**: the node type for making prediction, required for node classification/regression tasks. This argument is associated with the node type having labels. Current GraphStorm supports **one** predict node type only. -- **-\-ntype-task**: the node type task to perform. Only support ``classification`` and ``regression`` so far. Default is ``classification``. -- **-\-nlabel-field**: the field that stores labels on the predict node type, **required** if set the **target-ntype**. The format is ``nodetype:labelname``, e.g., `"paper:label"`. -- **-\-target-etype**: the canonical edge type for making prediction, **required** for edge classification/regression tasks. This argument is associated with the edge type having labels. Current GraphStorm supports **one** predict edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`. -- **-\-etype-task**: the edge type task to perform. Only allow ``classification`` and ``regression`` so far. Default is ``classification``. -- **-\-elabel-field**: the field that stores labels on the predict edge type, required if set the **target-etype**. The format is ``src_ntype,etype,dst_ntype:labelname``, e.g., `"author,write,paper:label"`. -- **-\-generate-new-node-split**: a boolean value, required if need the partition script to split nodes for training/validation/test sets. If set this argument ``true``, **must** set the **target-ntype** argument too. -- **-\-generate-new-edge-split**: a boolean value, required if need the partition script to split edges for training/validation/test sets. If set this argument ``true``, you must set the **target-etype** argument too. -- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training. -- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation. - -.. Note:: - The sum of the **train-pct** and **val-pct** should be less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct). - -- **-\-add-reverse-edges**: if add this argument, will add reverse edges to the given graph. -- **-\-retain-original-features**: boolean value to control if use the original features generated by dataset, e.g., embeddings of paper abstracts. If set to ``true``, will keep the original features; otherwise we will use the tokenized text for using BERT models to generate embeddings. -- **-\-num-parts**: (**Required**) integer value that specifies partitions the DGL graph to be split. Remember this number because we will need to set it in the model training step. -- **-\-output**: (**Required**) the folder path that the partitioned DGL graph will be saved. - -`partition_graph_lp.py <https://github.com/awslabs/graphstorm/blob/main/tools/partition_graph_lp.py>`_ arguments ------------------------------------------------------------------------------------------------------------------------------------- -- **-\-dataset**: (**Required**) the graph name defined for the saved DGL graph file. -- **-\-filepath**: (**Required**) the file path of the saved DGL graph file. -- **-\-target-etypes**: (**Required**) the canonical edge type for making prediction. GraphStorm supports **one** predict edge type only. The format is ``src_ntype,etype,dst_ntype``, e.g., `"author,write,paper"`. -- **-\-train-pct**: a float value (\>0. and \<1.) with default value ``0.8``. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training. -- **-\-val-pct**: a float value (\>0. and \<1.) with default value ``0.1``. You can set this value to control the percentage of nodes/edges for validation. - -.. Note:: - The sum of the **train-pct** and **val-pct** should less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct). - -- **-\-add-reverse-edges**: if add this argument, will add reverse edges to the given graphs. -- **-\-train-graph-only**: boolean value to control if partition the training graph or not, default is ``true``. -- **-\-retain-original-features**: boolean value to control if use the original features generated by dataset, e.g., embeddings of paper abstracts. If set to ``true``, will keep the original features; otherwise we will use the tokenized text for using BERT models to generate embeddings. -- **-\-retain-etypes**: the list of canonical edge type that will be retained before partitioning the graph. This might be helpful to remove noise edges in this application. Format example: ``—-retain-etypes query,clicks,asin query,adds,asin query,purchases,asin asin,rev-clicks,query``. -- **-\-num-parts**: (**Required**) integer value that specifies partitions the DGL graph to be split. Remember this number because we will need to set it in the model training step. -- **-\-output**: (**Required**) the folder path that the partitioned DGL graph will be saved. \ No newline at end of file diff --git a/docs/source/configuration/index.rst b/docs/source/configuration/index.rst deleted file mode 100644 index d5038a5bdd..0000000000 --- a/docs/source/configuration/index.rst +++ /dev/null @@ -1,21 +0,0 @@ -.. _configurations: - -GraphStorm Configurations -============================ - -GraphStorm is designed for easy to use and requires less- or no-code operations for users to perform Graph Machine Learning (GML) tasks. In most cases, users only need to configure the parameters or arguments provided by GraphStorm to fulfill their GML tasks. - -These configurations and arguments include: - -- :ref:`GraphStorm graph construction configurations<configurations-gconstruction>`. -- :ref:`GraphStorm graph partition configurations<configurations-partition>`. -- :ref:`GraphStorm training and inference configurations<configurations-run>`. - -.. toctree:: - :maxdepth: 1 - :hidden: - :glob: - - configuration-gconstruction - configuration-partition - configuration-run diff --git a/docs/source/graph-construction/gs-processing/aws-infra/index.rst b/docs/source/graph-construction/gs-processing/aws-infra/index.rst deleted file mode 100644 index c9c42ae7a1..0000000000 --- a/docs/source/graph-construction/gs-processing/aws-infra/index.rst +++ /dev/null @@ -1,37 +0,0 @@ -================================================ -Running distributed processing jobs on AWS Infra -================================================ - -After successfully building the Docker image and pushing it to -`Amazon ECR <https://docs.aws.amazon.com/ecr/>`_, -you can now initiate GSProcessing jobs with AWS resources. - -We support running GSProcessing jobs on different AWS execution environments including: -`Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, -`EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and -`EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_. - - -Running distributed jobs on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_: - -.. toctree:: - :maxdepth: 1 - :titlesonly: - - amazon-sagemaker.rst - -Running distributed jobs on `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_: - -.. toctree:: - :maxdepth: 1 - :titlesonly: - - emr-serverless.rst - -Running distributed jobs on `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_: - -.. toctree:: - :maxdepth: 1 - :titlesonly: - - emr.rst \ No newline at end of file diff --git a/docs/source/graph-construction/gs-processing/gspartition/index.rst b/docs/source/graph-construction/gs-processing/gspartition/index.rst deleted file mode 100644 index 1e4032175c..0000000000 --- a/docs/source/graph-construction/gs-processing/gspartition/index.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. _gspartition_index: - -=================================== -Running partition jobs on AWS Infra -=================================== - -GraphStorm Distributed Graph Partition (GSPartition) allows users to do distributed partition on preprocessed graph data -prepared by :ref:`GSProcessing<gs-processing>`. To enable distributed training, the preprocessed input data must be converted to a partitioned graph representation. -GSPartition allows user to handle massive graph data in distributed clusters. GSPartition is built on top of the -dgl `distributed graph partitioning pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_. - -GSPartition consists of two steps: Graph Partitioning and Data Dispatching. Graph Partitioning step will assign each node to one partition -and save the results as a set of files called partition assignment. Data Dispatching step will physically partition the -graph data and dispatch them according to the partition assignment. It will generate the graph data in DGL format, ready for distributed training and inference. - -Tutorials for GSPartition are specifically prepared based on AWS infrastructure, -i.e., `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_ and `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. -But, users can create your own clusters easily by following the GSPartition tutorial on Amazon EC2. - -The first section includes instructions on how to run GSPartition on `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_. -The second section includes instructions on how to run GSPartition on `Amazon EC2 clusters <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html>`_. - -.. toctree:: - :maxdepth: 1 - :glob: - - sagemaker.rst - ec2-clusters.rst diff --git a/docs/source/graph-construction/gs-processing/index.rst b/docs/source/graph-construction/gs-processing/index.rst deleted file mode 100644 index 8635b00ffc..0000000000 --- a/docs/source/graph-construction/gs-processing/index.rst +++ /dev/null @@ -1,35 +0,0 @@ -============================== -Distributed Graph Construction -============================== - -Beyond single-machine graph construction, distributed graph construction offers enhanced scalability -and efficiency. This process involves two main steps: GraphStorm Distributed Data Processing (GSProcessing) -and GraphStorm Distributed Graph Partitioning (GSPartition). GSProcessing will preprocess the raw data into structured -datasets, and GSPartition will process these structured datasets to create multiple partitions in -`DGL format <https://docs.dgl.ai/en/latest/api/python/dgl.DGLGraph.html>`_. - -Here is an overview of the workflow for distributed graph construction: - -* **Prepare input data**: GraphStorm Distributed Construction accepts tabular files in parquet/CSV format. -* **Run GSProcessing**: Use GSProcessing to process the input data. This step prepares the data for partitioning including edge and node data, transformation details, and node id mappings. -* **Run GSPartition**: Use GSPartition to partition the processed data into graph files suitable for distributed training. - -.. figure:: ../../../../tutorial/distributed_construction.png - :align: center - -The following sections provide guidance on doing distributed graph construction. -The first section details the execution environment setup for GSProcessing. -The second section offers examples on drafting a configuration file for a GSProcessing job. -The third section explains how to deploy your GSProcessing job with AWS infrastructure. -The fourth section includes how to do GSPartition with the GSProcessing output. -The final section shows an example to quickly start GSProcessing and GSPartition. - -.. toctree:: - :maxdepth: 1 - :glob: - - prerequisites/index.rst - input-configuration.rst - aws-infra/index.rst - gspartition/index.rst - example.rst \ No newline at end of file diff --git a/docs/source/graph-construction/gs-processing/prerequisites/index.rst b/docs/source/graph-construction/gs-processing/prerequisites/index.rst deleted file mode 100644 index 7951c26994..0000000000 --- a/docs/source/graph-construction/gs-processing/prerequisites/index.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. _gsprocessing_prerequisites_index: - -=============================================== -Distributed GraphStorm Processing -=============================================== - -GraphStorm Distributed Data Processing (GSProcessing) allows you to process -and prepare massive graph data for training with GraphStorm. GSProcessing takes -care of generating unique ids for nodes, using them to encode edge structure files, -process individual features and prepare the data to be passed into the distributed -partitioning and training pipeline of GraphStorm. - -We use PySpark to achieve horizontal parallelism, allowing us to scale to graphs with billions of nodes and edges. - -.. warning:: - GraphStorm currently only supports running GSProcessing on AWS Infras including `Amazon SageMaker <https://docs.aws.amazon.com/sagemaker/>`_, `EMR Serverless <https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html>`_, and `EMR on EC2 <https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html>`_. - -The following sections outline essential prerequisites and provide a detailed guide to use -GSProcessing. -The first section provides an introduction to GSProcessing, how to install it locally and a quick example of its input configuration. -The second section demonstrates how to set up GSProcessing for distributed processing, enabling scalable and efficient processing using AWS resources. - -.. toctree:: - :maxdepth: 1 - :titlesonly: - - gs-processing-getting-started.rst - distributed-processing-setup.rst \ No newline at end of file diff --git a/docs/source/graph-construction/index.rst b/docs/source/graph-construction/index.rst deleted file mode 100644 index 4c5d8403f5..0000000000 --- a/docs/source/graph-construction/index.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. _graph_construction: - -================== -Graph Construction -================== - -Graphstorm offers various methods to build graphs on both a single machine and distributed clusters. - -.. toctree:: - :maxdepth: 2 - :glob: - - gs-processing/index.rst \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 5bfbb4d99e..f1a45a3a7e 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -19,16 +19,8 @@ Welcome to the GraphStorm Documentation and Tutorials :hidden: :glob: - graph-construction/index.rst - -.. toctree:: - :maxdepth: 1 - :caption: Distributed Training - :hidden: - :glob: - - scale/distributed - scale/sagemaker + cli/graph-construction/index.rst + cli/model-training-inference/index.rst .. toctree:: :maxdepth: 2 @@ -52,6 +44,7 @@ Welcome to the GraphStorm Documentation and Tutorials advanced/advanced-wholegraph advanced/multi-task-learning advanced/advanced-usages + single-machine-gconstruct GraphStorm is a graph machine learning (GML) framework designed for enterprise use cases. It simplifies the development, training and deployment of GML models on industry-scale graphs (measured in billons of nodes and edges) by providing scalable training and inference pipelines of GML models. GraphStorm comes with a collection of built-in GML models, allowing users to train a GML model with a single command, eliminating the need to write any code. Moreover, GraphStorm provides a wide range of configurations to customiz model implementations and training pipelines, enhancing model performance. In addition, GraphStorm offers a programming interface that enables users to train custom GML models in a distributed manner. Users can bring their own model implementations and leverage the GraphStorm training pipeline for scalability. @@ -63,23 +56,32 @@ For beginners, please first start with the :ref:`Setup GraphStorm with pip Packa Once successfully set up the GraphStorm running environment, - follow the :ref:`GraphStorm Standalone Mode Quick-Start Tutorial<quick-start-standalone>` to use GraphStorm Command Line Interfaces (CLIs) to run examples based on GraphStorm built-in data and models, hence getting familiar with GraphStorm CLIs for training and inference. -- follow the :ref:`Use Your Own Graph Data Tutorial<use-own-data>` to prepare your own graph data for using GraphStorm CLIs. -- read the :ref:`GraphStorm Training and Inference Configurations<configurations-run>` to learn the various configurations provided by GraphStorm for CLIs that can help to achieve the best performance. +- follow the :ref:`Use Your Own Graph Data Tutorial<use-own-data>` to prepare your own graph data for using GraphStorm model training and inference pipelines or APIs. +- read the :ref:`Model Training and Inference Configurations<configurations-run>` to learn the various configurations provided by GraphStorm for CLIs that can help to achieve the best performance. + +GraphStorm provides two types of interfaces, i.e., Command Line Interfaces (CLIs) and Application Programming Interfaces (APIs), for users to conduct GML tasks for different purposes. + +The CLIs abstract away the complexity of the GML pipeline for users to quickly build, train, and deploy models using common recipes. Meanwhile, the APIs reveal the major components by which GraphStorm constructs the GML pipelines. Users can levearge these APIs to customize GraphStorm for their specific needs. + +GraphStorm CLIs User Guide +--------------------------- + +GraphStorm CLIs include two major functions, i.e., Graph Construction, and Model Training and Inference. + +The :ref:`GraphStorm Graph Construction<graph_construction>` documentations explain how to construct distributed DGL graphs that can be use in GraphStorm training and inference pipelines. For relatively small data, users can :ref:`construct graphs on a single machine<single-machine-gconstruction>`. When dealing with very large data that can not be fit into memory of a single machine, users can refer to the :ref:`distributed graph construction <distributed-gconstruction>` documentations, knowing how to set up distributed environments and construct graphs using different infrastructures. + +While the :ref:`GraphStorm Standalone Mode Quick-Start Tutorial<quick-start-standalone>` provides some information of using GraphStorm CLIs on a single machine, the :ref:`Model Training and Inference on a Single Machine <single-machine-training-inference>` documentation provides more detailed guidance. -Scale to Giant Graphs ----------------------- +Similar as the documentations of distributed graph construction, the distributed model training and inference user guide explains how to set up distributed environments and run GraphStorm model training and inference using a :ref:`Distributed Cluster <distributed-cluster>` or :ref:`Amazon SageMaker <distributed-sagemaker>` to deal with enterprise-level graphs. -For users who wish to train and run infernece on very large graphs, +GraphStorm APIs User Guide +--------------------------- -- follow the :ref:`Setup GraphStorm Docker Environment<setup_docker>` tutorial to create GraphStorm dockers for distributed runtime environments. -- follow the :ref:`Use GraphStorm Distributed Data Processing<gs-processing>` tutorial to process and construction large graphs in the Distributed mode. -- follow the :ref:`Use GraphStorm in a Distributed Cluster<distributed-cluster>` tutorial to use GraphStorm in the Distributed mode. -- follow the :ref:`Use GraphStorm on SageMaker<distributed-sagemaker>` tutorial to use GraphStorm in the Distribute mode based on Amazon SageMaker. +The released GraphStorm APIs list the major components that can help users to develop GraphStorm-like GML pipelines, or customize components such as GNN models, training conctrolers for their specific needs. -Use GraphStorm APIs ---------------------- +To help users use these APIs, GraphStorm also released a set of Jupyter notebooks at :ref:`GraphStorm API Programming Example Notebooks<programming-examples>`. By running these notebooks, users can explore some APIs, learn how to use APIs to reproduce CLIs pipelines, and then customize GraphStorm components for specific requirements. -For users who wish to customize GraphStorm for their specific needs, follow the :ref:`GraphStorm API Programming Example Notebooks<programming-examples>` to explore GraphStorm APIs, learn how to use GraphStorm APIs to reproduce CLIs pipelines, and then customize GraphStorm components for specific requirements. Users can find the details of GraphStorm APIs in the :ref:`API Reference<api-reference>` documentations. +Users can find the comprehensive descriptions of these GraphStorm APIs in the :ref:`API Reference<api-reference>` documentations. For unrelease APIs, we encourage users to read their source code. If users want to have more APIs formally released, please raise issues at the `GraphStorm GitHub Repository <https://github.com/awslabs/graphstorm/issues>`_. Advanced Topics ---------------- diff --git a/docs/source/tutorials/own-data.rst b/docs/source/tutorials/own-data.rst index 681ed5347a..86ade89751 100644 --- a/docs/source/tutorials/own-data.rst +++ b/docs/source/tutorials/own-data.rst @@ -373,7 +373,7 @@ In terms of link prediction task, run the following command to partition the dat --target-etype paper,citing,paper \ --output /tmp/acm_lp -Please refer to :ref:`Graph Partition Configurations <configurations-partition>` to find more details of the arguments of the two partition tools. +Please refer to :ref:`Graph Partition for DGL Graphs <configurations-partition>` guideline for more details of the arguments of the two partition tools. Step 2: Modify the YAML configuration file to include your own data's information -----------------------------------------------------------------------------------