Merge branch 'main' into gconstruct_gsprocessing_config_validation
jalencato authored Aug 1, 2024
2 parents 38702cc + 1ca1424 commit ef462c7
Showing 28 changed files with 1,484 additions and 500 deletions.
4 changes: 3 additions & 1 deletion docs/source/_templates/dataloadertemplate.rst
@@ -7,4 +7,6 @@

.. autoclass:: {{ name }}
    :show-inheritance:
    :special-members: __iter__, __next__
    :members:
    :member-order: bysource
    :special-members: __iter__, __next__, __len__
3 changes: 2 additions & 1 deletion docs/source/_templates/datasettemplate.rst
@@ -7,4 +7,5 @@

.. autoclass:: {{ name }}
    :show-inheritance:
    :members: prepare_data, get_node_feats, get_edge_feats, get_labels, get_node_feat_size
    :members:
    :member-order: bysource
46 changes: 28 additions & 18 deletions docs/source/api/references/graphstorm.dataloading.rst
@@ -1,17 +1,34 @@
.. _apidataloading:

graphstorm.dataloading
==========================
graphstorm.dataloading.dataset
===============================

GraphStorm dataloading module includes a set of graph DataSets and DataLoaders for different
graph machine learning tasks.

If users would like to customize DataLoaders, please extend those classes in the
:ref:`Base DataLoaders <basedataloaders>` section and customize their abstract methods.
GraphStorm provides one unified dataset class, ``GSgnnData``, for all graph
machine learning tasks. Users can build a ``GSgnnData`` object by providing the path to
the JSON file created by the :ref:`GraphStorm Graph Construction<graph_construction>`
operations. ``GSgnnData`` loads the related graph artifacts specified in the JSON
file, and provides a set of APIs for users to extract information from the graph data for
model training and inference.
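
For example, a minimal sketch of building a dataset object; the path below is
hypothetical, and exact constructor arguments may differ across versions (see the
generated page below):

.. code-block:: python

    from graphstorm.dataloading import GSgnnData

    # Build the dataset from the JSON file produced by graph construction
    # (hypothetical path; replace with your own partition config).
    gs_data = GSgnnData("/data/my_graph/my_graph.json")

    # APIs documented below, such as get_node_feats(), get_edge_feats(),
    # and get_labels(), extract graph information for training/inference.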

.. currentmodule:: graphstorm.dataloading

.. _basedataloaders:
.. autosummary::
    :toctree: ../generated/
    :nosignatures:
    :template: datasettemplate.rst

    GSgnnData

graphstorm.dataloading.dataloading
===================================

The GraphStorm dataloading module includes a set of DataLoaders for
different graph machine learning tasks.

If users would like to customize DataLoaders, please extend the dataloader base
classes in the **Base DataLoaders** section and override their abstract methods, as in the sketch below.
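
As a schematic illustration of such an extension (assuming, per the documented
``__iter__``/``__next__`` special members, that dataloaders follow the Python
iterator protocol; the actual abstract methods and constructor arguments are
listed in the generated pages below):

.. code-block:: python

    from graphstorm.dataloading import GSgnnNodeDataLoaderBase

    class MyNodeDataLoader(GSgnnNodeDataLoaderBase):
        """A sketch only: see the base class for its actual abstract
        methods and required constructor arguments."""

        def __iter__(self):
            # Reset internal sampling state at the start of an epoch.
            return self

        def __next__(self):
            # Sample and return the next mini-batch here; raise
            # StopIteration once the epoch is exhausted.
            raise StopIteration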

.. currentmodule:: graphstorm.dataloading

Base DataLoaders
-------------------
@@ -25,16 +42,6 @@ Base DataLoaders

    GSgnnEdgeDataLoaderBase
    GSgnnLinkPredictionDataLoaderBase

DataSets
------------

.. autosummary::
    :toctree: ../generated/
    :nosignatures:
    :template: datasettemplate.rst

    GSgnnData

DataLoaders
------------

@@ -44,5 +51,8 @@

    :template: dataloadertemplate.rst

    GSgnnNodeDataLoader
    GSgnnNodeSemiSupDataLoader
    GSgnnEdgeDataLoader
    GSgnnLinkPredictionDataLoader
    GSgnnLinkPredictionTestDataLoader
    GSgnnLinkPredictionPredefinedTestDataLoader
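
As an illustration, a minimal node-classification usage sketch; ``gs_data`` is assumed
to be a ``GSgnnData`` object, ``train_idxs`` the training-node indices, and the keyword
arguments shown are illustrative rather than exact signatures:

.. code-block:: python

    from graphstorm.dataloading import GSgnnNodeDataLoader

    # Illustrative arguments: dataset, target node indices, neighbor-sampling
    # fanouts per layer, and mini-batch size.
    dataloader = GSgnnNodeDataLoader(gs_data, train_idxs,
                                     fanout=[15, 10], batch_size=1024)
    for mini_batch in dataloader:
        ...  # run one training step on the sampled mini-batch
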
111 changes: 85 additions & 26 deletions docs/source/graph-construction/gs-processing/example.rst
@@ -1,41 +1,54 @@
.. _distributed_construction_example:

GraphStorm Processing Example
=============================
A GraphStorm Distributed Graph Construction Example
===================================================

To demonstrate how to use the library locally we will
GraphStorm's distributed graph construction involves multiple steps.
To help users better understand them, we provide an example of distributed graph construction
that can run locally on a single instance.

To demonstrate how to use distributed graph construction locally, we will
use the same example data as we use in our
unit tests, which you can find in the project's repository,
under ``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph``.

Install example dependencies
----------------------------
Install dependencies
--------------------

To run the local example you will need to install the GSProcessing
To run the local example you will need to install the GSProcessing and GraphStorm
library to your Python environment, and you'll need to clone the
GraphStorm repository to get access to the data.
GraphStorm repository to get access to the data, and the DGL tools for GSPartition.

Follow the :ref:`gsp-installation-ref` guide to install the GSProcessing library.

You can clone the repository using
To run the GSPartition job, you can install the dependencies as follows:

.. code-block:: bash

    pip install graphstorm
    pip install pydantic
    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cpu
    pip install dgl==1.1.3 -f https://data.dgl.ai/wheels-internal/repo.html
    git clone https://github.com/awslabs/graphstorm.git
    cd graphstorm
    git clone --branch v1.1.3 https://github.com/dmlc/dgl.git

You can then navigate to the ``graphstorm-processing/`` directory
that contains the relevant data:

.. code-block:: bash

    cd ./graphstorm/graphstorm-processing/
    cd ./graphstorm-processing/

Expected file inputs and configuration
--------------------------------------

The example includes GSProcessing as the first step and GSPartition as the second step.

GSProcessing expects the input files to be in a specific format that will allow
us to perform the processing and prepare the data for partitioning and training.
GSPartition then takes the output of GSProcessing to produce graph data in ``DistDGLGraph`` format for training or inference.

The data files are expected to be:

@@ -202,8 +215,8 @@ For more details on the re-partitioning step see

.. _gsp-examining-output:

Examining the job output
------------------------
Examining the job output of GSProcessing
------------------------------------------

Once the processing and re-partitioning jobs are done,
we can examine the outputs they created. The output will be
@@ -281,28 +294,74 @@ in an ``edge_data`` directory.
for node id 1 etc.


At this point you can use the DGL distributed partitioning pipeline
to partition your data, as described in the
`DGL documentation <https://docs.dgl.ai/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_.
Run a GSPartition job locally
------------------------------
While :ref:`GSPartition<gspartition_index>` is designed to run on a multi-machine cluster,
you can run a GSPartition job locally for this example. Once you have completed the installation
and the GSProcessing example described in the previous section, you can proceed to run the GSPartition step.

Assuming your working directory is ``graphstorm``,
you can use the following command to run the partition job locally:

.. code:: bash

    echo 127.0.0.1 > ip_list.txt
    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path /tmp/gsprocessing-example/ \
        --metadata-filename updated_row_counts_metadata.json \
        --output-path /tmp/gspartition-example/ \
        --num-parts 2 \
        --dgl-tool-path ./dgl/tools \
        --partition-algorithm random \
        --ip-config ip_list.txt

To simplify the process of partitioning and training, without the need
to manage your own infrastructure, we recommend using GraphStorm's
`SageMaker wrappers <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
that do all the hard work for you and allow
you to focus on model development. In particular you can follow the GraphStorm documentation to run
`distributed partitioning on SageMaker <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_.
The command above will first run graph partitioning to compute the partition assignment for each node and save the results.
Then it will perform data dispatching to physically partition the graph data and dispatch the partitions to each machine.
Finally, it will generate the graph data ready for training/inference.

Examining the job output of GSPartition
---------------------------------------

To run GSProcessing jobs on Amazon SageMaker we'll need to follow
:ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>` to set up our environment
and :ref:`Running GSProcessing on SageMaker<gsprocessing_sagemaker>` to execute the job.
Once the partition job is done, you can examine the outputs.

.. code-block:: bash

    $ cd /tmp/gspartition-example
    $ ls -ltR

    dist_graph/
    |- metadata.json
    |- part0/
       |- edge_feat.dgl
       |- graph.dgl
       |- node_feat.dgl
       |- orig_eids.dgl
       |- orig_nids.dgl
    partition_assignment/
    |- director.txt
    |- genre.txt
    |- movie.txt
    |- partition_meta.json
    |- user.txt

The ``dist_graph`` folder contains the partitioned graph ready for training and inference.

* ``part0``: One ``partX`` folder is created for each partition specified by ``--num-parts``
  (only ``part0`` is shown above). There are five files for each partition:

  * ``edge_feat.dgl``: The edge features for part 0, stored in binary format.
  * ``graph.dgl``: The graph structure data for part 0, stored in binary format.
  * ``node_feat.dgl``: The node features data for part 0, stored in binary format.
  * ``orig_eids.dgl``: The mapping between the original edge IDs and the edge IDs of the partitioned graph.
  * ``orig_nids.dgl``: The mapping between the original node IDs and the node IDs of the partitioned graph.

* ``metadata.json``: This file contains metadata about the distributed DGL graph.
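
If you want to sanity-check a partition locally, you can load it with DGL's
``load_partition`` helper; a minimal sketch (the returned tuple also carries node/edge
features and the partition book, as described in the DGL docs):

.. code-block:: python

    import dgl

    # Load partition 0 via its partition config; the first element of the
    # returned tuple is the DGLGraph structure of that partition.
    part_data = dgl.distributed.load_partition(
        "/tmp/gspartition-example/dist_graph/metadata.json", 0)
    print(part_data[0])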

The ``partition_assignment`` directory contains the partition assignment results for each node type,
which can be reused by the `DGL dispatch pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_.

.. rubric:: Footnotes


.. [#f1] Note that this is just a hint to the Spark engine, and it's
not guaranteed that the number of output partitions will always match
the requested value.
.. [#f2] This doc will be extended in the future to include a partition example.
@@ -0,0 +1,119 @@
======================================
Running partition jobs on EC2 Clusters
======================================

Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed,
users can start the partition jobs. This tutorial will provide instructions on how to set up an EC2 cluster and
start GSPartition jobs on it.

Create a GraphStorm Cluster
----------------------------

Setup instances of a cluster
.............................
A cluster contains several instances, each of which runs a GraphStorm Docker container. Before creating a cluster, we recommend
following the :ref:`Environment Setup <setup_docker>` guide. The guide shows how to build GraphStorm Docker images and use a Docker container registry,
e.g., `AWS ECR <https://docs.aws.amazon.com/ecr/>`_, to upload the GraphStorm image to an ECR repository, pull it on the instances in the cluster,
and finally start the image as a container.

.. note::

    If you are planning to use the **parmetis** algorithm, please prepare your Docker image using the following instructions:

    .. code-block:: bash

        git clone https://github.com/awslabs/graphstorm.git
        cd /path-to-graphstorm/docker/
        bash /path-to-graphstorm/docker/build_docker_parmetis.sh /path-to-graphstorm/ image-name image-tag

There are three positional arguments for ``build_docker_parmetis.sh``:

1. **path-to-graphstorm** (**required**), is the absolute path of the "graphstorm" folder, where you cloned the GraphStorm source code. For example, the path could be ``/code/graphstorm``.
2. **image-name** (optional), is the assigned name of the Docker image to be built. Default is ``graphstorm``.
3. **image-tag** (optional), is the assigned tag prefix of the Docker image. Default is ``local``.

Setup a shared file system for the cluster
...........................................
A cluster requires a shared file system, such as NFS or `EFS <https://docs.aws.amazon.com/efs/>`_, mounted to each instance in the cluster, in which all GraphStorm containers can share data files, save model artifacts and prediction results.

`Here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist#step-0-setup-a-distributed-file-system>`_ are instructions for setting up an NFS for a cluster. As the steps for setting up an NFS vary across systems, we suggest users look for additional information about NFS setup. Here are some available resources: the `NFS tutorial <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-22-04>`_ by DigitalOcean and the `NFS document <https://ubuntu.com/server/docs/service-nfs>`_ for Ubuntu.

For an AWS EC2 cluster, users can also use EFS as the shared file system. Please follow 1) `the instruction of creating EFS <https://docs.aws.amazon.com/efs/latest/ug/gs-step-two-create-efs-resources.html>`_; 2) `the instruction of installing an EFS client <https://docs.aws.amazon.com/efs/latest/ug/installing-amazon-efs-utils.html>`_; and 3) `the instructions of mounting the EFS filesystem <https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html>`_ to set up EFS.

After setting up a shared file system, we can keep all graph data in a shared folder. Then mount the data folder to ``/path_to_data/`` on each instance in the cluster so that all GraphStorm containers can access the data.
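
For example, an illustrative mount command on one instance, assuming the NFS server
exports a ``/shared_export`` directory (the address and paths are placeholders):

.. code-block:: bash

    # Mount the shared NFS export onto the local data path.
    sudo mkdir -p /path_to_data
    sudo mount -t nfs <nfs-server-ip>:/shared_export /path_to_data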

Run a GraphStorm container
...........................
In each instance, use the following command to start a GraphStorm Docker container and run it as a daemon in the background on CPU.

.. code-block:: shell

    docker run -v /path_to_data/:/data \
               -v /dev/shm:/dev/shm \
               --network=host \
               -d --name test graphstorm:local-cpu service ssh restart

This command mounts the shared ``/path_to_data/`` folder to the container's ``/data/`` folder, through which GraphStorm code can access graph data and save the partition results.

Set Up the IP Address File and Check Port Status
----------------------------------------------------------

Collect the IP address list
...........................
The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect the IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which looks like:

.. figure:: ../../../../../tutorial/distributed_ips.png
:align: center

.. note:: We recommend using **private IP addresses** on an AWS EC2 cluster to avoid any possible port constraints.

Put the IP list file into the container's ``/data/`` folder.
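
For example, a hypothetical ``ip_list.txt`` for a two-instance cluster lists one
private IP address per line:

.. code-block:: text

    172.38.12.143
    172.38.12.144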

Check port
................
The GraphStorm Docker container uses port ``2222`` to **ssh** into containers running on other machines without a password. Please make sure the port is not used by other processes.

Users also need to make sure port ``2222`` is open for **ssh** commands.
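
For example, you can verify the port with standard tools (illustrative commands;
replace ``<ip-in-the-cluster>`` with an address from your IP list):

.. code-block:: bash

    # Confirm nothing else is listening on port 2222 on this machine ...
    ss -tlnp | grep 2222
    # ... and that the port is reachable from another instance.
    nc -zv <ip-in-the-cluster> 2222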

Pick one instance and run the following command to connect to the GraphStorm Docker container.

.. code-block:: bash

    docker container exec -it test /bin/bash

Users need to exchange the ssh keys among all GraphStorm Docker containers in the cluster:
copy the key in ``/root/.ssh/id_rsa.pub`` from each container into the ``/root/.ssh/authorized_keys`` file on all other containers.
In the container environment, users can check the connectivity with the command ``ssh <ip-in-the-cluster> -o StrictHostKeyChecking=no -p 2222``. Please replace ``<ip-in-the-cluster>`` with a real IP address from the ``ip_list.txt`` file above, e.g.,

.. code-block:: bash

    ssh 172.38.12.143 -o StrictHostKeyChecking=no -p 2222

If successful, you should be logged in to the container with IP 172.38.12.143.

If not, please make sure there is no restriction blocking port 2222.


Launch GSPartition Jobs
-----------------------

Now we can ssh into the **leader node** of the EC2 cluster, and start the GSPartition process with the following command:

.. code:: bash

    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path ${LOCAL_INPUT_DATAPATH} \
        --metadata-filename ${METADATA_FILE} \
        --output-path ${LOCAL_OUTPUT_DATAPATH} \
        --num-parts ${NUM_PARTITIONS} \
        --partition-algorithm ${ALGORITHM} \
        --ip-config ${IP_CONFIG}

.. warning::

    1. Please make sure both ``LOCAL_INPUT_DATAPATH`` and ``LOCAL_OUTPUT_DATAPATH`` are located on the shared filesystem.
    2. The number of instances in the cluster should be equal to ``NUM_PARTITIONS``.
    3. For users who only want to generate partition assignments instead of the partitioned DGL graph, please add the ``--partition-assignment-only`` flag.

Currently we support both ``random`` and ``parmetis`` as partitioning algorithms for EC2 clusters.
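
For example, an illustrative invocation for a two-instance cluster with the shared
filesystem mounted at ``/data`` (all paths and values are placeholders):

.. code:: bash

    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path /data/gsprocessing-output \
        --metadata-filename updated_row_counts_metadata.json \
        --output-path /data/gspartition-output \
        --num-parts 2 \
        --partition-algorithm random \
        --ip-config /data/ip_list.txt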