Merge branch 'main' into gconstruct_gsprocessing_config_validation
jalencato authored Aug 1, 2024
2 parents 38702cc + 1ca1424 commit ef462c7
Showing 28 changed files with 1,484 additions and 500 deletions.
4 changes: 3 additions & 1 deletion docs/source/_templates/dataloadertemplate.rst
@@ -7,4 +7,6 @@

.. autoclass:: {{ name }}
    :show-inheritance:
    :special-members: __iter__, __next__
    :members:
    :member-order: bysource
    :special-members: __iter__, __next__, __len__
3 changes: 2 additions & 1 deletion docs/source/_templates/datasettemplate.rst
@@ -7,4 +7,5 @@

.. autoclass:: {{ name }}
    :show-inheritance:
    :members: prepare_data, get_node_feats, get_edge_feats, get_labels, get_node_feat_size
    :members:
    :member-order: bysource
46 changes: 28 additions & 18 deletions docs/source/api/references/graphstorm.dataloading.rst
@@ -1,17 +1,34 @@
.. _apidataloading:

graphstorm.dataloading
==========================
graphstorm.dataloading.dataset
===============================

GraphStorm dataloading module includes a set of graph DataSets and DataLoaders for different
graph machine learning tasks.

If users would like to customize DataLoaders, please extend those classes in the
:ref:`Base DataLoaders <basedataloaders>` section and customize their abstract methods.
GraphStorm provides one unified dataset class, ``GSgnnData``, for all graph
machine learning tasks. Users can build a ``GSgnnData`` object by providing the path to
the JSON file created by the :ref:`GraphStorm Graph Construction<graph_construction>`
operations. ``GSgnnData`` loads the related graph artifacts specified in the JSON
file, and provides a set of APIs for users to extract information from the graph data for
model training and inference.
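
For example, a minimal sketch of building a dataset object; the path below is
hypothetical, and exact constructor arguments may differ across versions (see the
generated page below):

.. code-block:: python

    from graphstorm.dataloading import GSgnnData

    # Build the dataset from the JSON file produced by graph construction
    # (hypothetical path; replace with your own partition config).
    gs_data = GSgnnData("/data/my_graph/my_graph.json")

    # APIs documented below, such as get_node_feats(), get_edge_feats(),
    # and get_labels(), extract graph information for training/inference.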

.. currentmodule:: graphstorm.dataloading

.. _basedataloaders:
.. autosummary::
    :toctree: ../generated/
    :nosignatures:
    :template: datasettemplate.rst

    GSgnnData

graphstorm.dataloading.dataloading
===================================

The GraphStorm dataloading module includes a set of DataLoaders for
different graph machine learning tasks.

If users would like to customize DataLoaders, please extend the dataloader base
classes in the **Base DataLoaders** section and override their abstract methods, as in the sketch below.
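
As a schematic illustration of such an extension (assuming, per the documented
``__iter__``/``__next__`` special members, that dataloaders follow the Python
iterator protocol; the actual abstract methods and constructor arguments are
listed in the generated pages below):

.. code-block:: python

    from graphstorm.dataloading import GSgnnNodeDataLoaderBase

    class MyNodeDataLoader(GSgnnNodeDataLoaderBase):
        """A sketch only: see the base class for its actual abstract
        methods and required constructor arguments."""

        def __iter__(self):
            # Reset internal sampling state at the start of an epoch.
            return self

        def __next__(self):
            # Sample and return the next mini-batch here; raise
            # StopIteration once the epoch is exhausted.
            raise StopIteration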

.. currentmodule:: graphstorm.dataloading

Base DataLoaders
-------------------
@@ -25,16 +42,6 @@ Base DataLoaders

    GSgnnEdgeDataLoaderBase
    GSgnnLinkPredictionDataLoaderBase

DataSets
------------

.. autosummary::
    :toctree: ../generated/
    :nosignatures:
    :template: datasettemplate.rst

    GSgnnData

DataLoaders
------------

@@ -44,5 +51,8 @@

    :template: dataloadertemplate.rst

    GSgnnNodeDataLoader
    GSgnnNodeSemiSupDataLoader
    GSgnnEdgeDataLoader
    GSgnnLinkPredictionDataLoader
    GSgnnLinkPredictionTestDataLoader
    GSgnnLinkPredictionPredefinedTestDataLoader
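
As an illustration, a minimal node-classification usage sketch; ``gs_data`` is assumed
to be a ``GSgnnData`` object, ``train_idxs`` the training-node indices, and the keyword
arguments shown are illustrative rather than exact signatures:

.. code-block:: python

    from graphstorm.dataloading import GSgnnNodeDataLoader

    # Illustrative arguments: dataset, target node indices, neighbor-sampling
    # fanouts per layer, and mini-batch size.
    dataloader = GSgnnNodeDataLoader(gs_data, train_idxs,
                                     fanout=[15, 10], batch_size=1024)
    for mini_batch in dataloader:
        ...  # run one training step on the sampled mini-batch
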
111 changes: 85 additions & 26 deletions docs/source/graph-construction/gs-processing/example.rst
@@ -1,41 +1,54 @@
.. _distributed_construction_example:

GraphStorm Processing Example
=============================
A GraphStorm Distributed Graph Construction Example
===================================================

To demonstrate how to use the library locally we will
GraphStorm's distributed graph construction involves multiple steps.
To help users better understand them, we provide an example of distributed graph construction
that can run locally on a single instance.

To demonstrate how to use distributed graph construction locally, we will
use the same example data as we use in our
unit tests, which you can find in the project's repository,
under ``graphstorm/graphstorm-processing/tests/resources/small_heterogeneous_graph``.

Install example dependencies
----------------------------
Install dependencies
--------------------

To run the local example you will need to install the GSProcessing
To run the local example you will need to install the GSProcessing and GraphStorm
library to your Python environment, and you'll need to clone the
GraphStorm repository to get access to the data.
GraphStorm repository to get access to the data, and the DGL tools for GSPartition.

Follow the :ref:`gsp-installation-ref` guide to install the GSProcessing library.

You can clone the repository using
To run the GSPartition job, you can install the dependencies as follows:

.. code-block:: bash

    pip install graphstorm
    pip install pydantic
    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cpu
    pip install dgl==1.1.3 -f https://data.dgl.ai/wheels-internal/repo.html
    git clone https://github.com/awslabs/graphstorm.git
    cd graphstorm
    git clone --branch v1.1.3 https://github.com/dmlc/dgl.git

You can then navigate to the ``graphstorm-processing/`` directory
that contains the relevant data:

.. code-block:: bash

    cd ./graphstorm/graphstorm-processing/
    cd ./graphstorm-processing/

Expected file inputs and configuration
--------------------------------------

The example includes GSProcessing as the first step and GSPartition as the second step.

GSProcessing expects the input files to be in a specific format that will allow
us to perform the processing and prepare the data for partitioning and training.
GSPartition then takes the output of GSProcessing to produce graph data in ``DistDGLGraph`` format for training or inference.

The data files are expected to be:

@@ -202,8 +215,8 @@ For more details on the re-partitioning step see

.. _gsp-examining-output:

Examining the job output
------------------------
Examining the job output of GSProcessing
------------------------------------------

Once the processing and re-partitioning jobs are done,
we can examine the outputs they created. The output will be
@@ -281,28 +294,74 @@ in an ``edge_data`` directory.
for node id 1 etc.


At this point you can use the DGL distributed partitioning pipeline
to partition your data, as described in the
`DGL documentation <https://docs.dgl.ai/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_.
Run a GSPartition job locally
------------------------------
While :ref:`GSPartition<gspartition_index>` is designed to run on a multi-machine cluster,
you can run a GSPartition job locally for this example. Once you have completed the installation
and the GSProcessing example described in the previous section, you can proceed to run the GSPartition step.

Assuming your working directory is ``graphstorm``,
you can use the following command to run the partition job locally:

.. code:: bash

    echo 127.0.0.1 > ip_list.txt
    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path /tmp/gsprocessing-example/ \
        --metadata-filename updated_row_counts_metadata.json \
        --output-path /tmp/gspartition-example/ \
        --num-parts 2 \
        --dgl-tool-path ./dgl/tools \
        --partition-algorithm random \
        --ip-config ip_list.txt

To simplify the process of partitioning and training, without the need
to manage your own infrastructure, we recommend using GraphStorm's
`SageMaker wrappers <https://graphstorm.readthedocs.io/en/latest/scale/sagemaker.html>`_
that do all the hard work for you and allow
you to focus on model development. In particular you can follow the GraphStorm documentation to run
`distributed partitioning on SageMaker <https://github.com/awslabs/graphstorm/tree/main/sagemaker#launch-graph-partitioning-task>`_.
The command above will first run graph partitioning to compute the partition assignment for each node and save the results.
Then it will perform data dispatching to physically partition the graph data and dispatch the partitions to each machine.
Finally, it will generate the graph data ready for training/inference.

Examining the job output of GSPartition
---------------------------------------

To run GSProcessing jobs on Amazon SageMaker we'll need to follow
:ref:`GSProcessing distributed setup<gsprocessing_distributed_setup>` to set up our environment
and :ref:`Running GSProcessing on SageMaker<gsprocessing_sagemaker>` to execute the job.
Once the partition job is done, you can examine the outputs.

.. code-block:: bash

    $ cd /tmp/gspartition-example
    $ ls -ltR

    dist_graph/
    |- metadata.json
    |- part0/
       |- edge_feat.dgl
       |- graph.dgl
       |- node_feat.dgl
       |- orig_eids.dgl
       |- orig_nids.dgl
    partition_assignment/
    |- director.txt
    |- genre.txt
    |- movie.txt
    |- partition_meta.json
    |- user.txt

The ``dist_graph`` folder contains the partitioned graph ready for training and inference.

* ``part0``: One ``partX`` folder is created for each partition specified by ``--num-parts``
  (only ``part0`` is shown above). There are five files for each partition:

  * ``edge_feat.dgl``: The edge features for part 0, stored in binary format.
  * ``graph.dgl``: The graph structure data for part 0, stored in binary format.
  * ``node_feat.dgl``: The node features data for part 0, stored in binary format.
  * ``orig_eids.dgl``: The mapping between the original edge IDs and the edge IDs of the partitioned graph.
  * ``orig_nids.dgl``: The mapping between the original node IDs and the node IDs of the partitioned graph.

* ``metadata.json``: This file contains metadata about the distributed DGL graph.
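
If you want to sanity-check a partition locally, you can load it with DGL's
``load_partition`` helper; a minimal sketch (the returned tuple also carries node/edge
features and the partition book, as described in the DGL docs):

.. code-block:: python

    import dgl

    # Load partition 0 via its partition config; the first element of the
    # returned tuple is the DGLGraph structure of that partition.
    part_data = dgl.distributed.load_partition(
        "/tmp/gspartition-example/dist_graph/metadata.json", 0)
    print(part_data[0])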

The ``partition_assignment`` directory contains the partition assignment results for each node type,
which can be reused by the `DGL dispatch pipeline <https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#distributed-graph-partitioning-pipeline>`_.

.. rubric:: Footnotes


.. [#f1] Note that this is just a hint to the Spark engine, and it's
not guaranteed that the number of output partitions will always match
the requested value.
.. [#f2] This doc will be extended in the future to include a partition example.
@@ -0,0 +1,119 @@
======================================
Running partition jobs on EC2 Clusters
======================================

Once the :ref:`distributed processing<gsprocessing_distributed_setup>` is completed,
users can start the partition jobs. This tutorial will provide instructions on how to set up an EC2 cluster and
start GSPartition jobs on it.

Create a GraphStorm Cluster
----------------------------

Setup instances of a cluster
.............................
A cluster contains several instances, each of which runs a GraphStorm Docker container. Before creating a cluster, we recommend
following the :ref:`Environment Setup <setup_docker>` guide. The guide shows how to build GraphStorm Docker images and use a Docker container registry,
e.g., `AWS ECR <https://docs.aws.amazon.com/ecr/>`_, to upload the GraphStorm image to an ECR repository, pull it on the instances in the cluster,
and finally start the image as a container.

.. note::

    If you are planning to use the **parmetis** algorithm, please prepare your Docker image using the following instructions:

    .. code-block:: bash

        git clone https://github.com/awslabs/graphstorm.git
        cd /path-to-graphstorm/docker/
        bash /path-to-graphstorm/docker/build_docker_parmetis.sh /path-to-graphstorm/ image-name image-tag

There are three positional arguments for ``build_docker_parmetis.sh``:

1. **path-to-graphstorm** (**required**), is the absolute path of the "graphstorm" folder, where you cloned the GraphStorm source code. For example, the path could be ``/code/graphstorm``.
2. **image-name** (optional), is the assigned name of the Docker image to be built. Default is ``graphstorm``.
3. **image-tag** (optional), is the assigned tag prefix of the Docker image. Default is ``local``.

Setup a shared file system for the cluster
...........................................
A cluster requires a shared file system, such as NFS or `EFS <https://docs.aws.amazon.com/efs/>`_, mounted to each instance in the cluster, in which all GraphStorm containers can share data files, save model artifacts and prediction results.

`Here <https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist#step-0-setup-a-distributed-file-system>`_ are instructions for setting up an NFS for a cluster. As the steps for setting up an NFS vary across systems, we suggest users look for additional information about NFS setup. Here are some available resources: the `NFS tutorial <https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-22-04>`_ by DigitalOcean and the `NFS document <https://ubuntu.com/server/docs/service-nfs>`_ for Ubuntu.

For an AWS EC2 cluster, users can also use EFS as the shared file system. Please follow 1) `the instruction of creating EFS <https://docs.aws.amazon.com/efs/latest/ug/gs-step-two-create-efs-resources.html>`_; 2) `the instruction of installing an EFS client <https://docs.aws.amazon.com/efs/latest/ug/installing-amazon-efs-utils.html>`_; and 3) `the instructions of mounting the EFS filesystem <https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html>`_ to set up EFS.

After setting up a shared file system, we can keep all graph data in a shared folder. Then mount the data folder to ``/path_to_data/`` on each instance in the cluster so that all GraphStorm containers can access the data.
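
For example, an illustrative mount command on one instance, assuming the NFS server
exports a ``/shared_export`` directory (the address and paths are placeholders):

.. code-block:: bash

    # Mount the shared NFS export onto the local data path.
    sudo mkdir -p /path_to_data
    sudo mount -t nfs <nfs-server-ip>:/shared_export /path_to_data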

Run a GraphStorm container
...........................
In each instance, use the following command to start a GraphStorm Docker container and run it as a daemon in the background on CPU.

.. code-block:: shell

    docker run -v /path_to_data/:/data \
               -v /dev/shm:/dev/shm \
               --network=host \
               -d --name test graphstorm:local-cpu service ssh restart

This command mounts the shared ``/path_to_data/`` folder to the container's ``/data/`` folder, through which GraphStorm code can access graph data and save the partition results.

Set Up the IP Address File and Check Port Status
----------------------------------------------------------

Collect the IP address list
...........................
The GraphStorm Docker containers use SSH on port ``2222`` to communicate with each other. Users need to collect the IP addresses of all the instances and put them into a text file, e.g., ``/data/ip_list.txt``, which looks like:

.. figure:: ../../../../../tutorial/distributed_ips.png
:align: center

.. note:: We recommend using **private IP addresses** on an AWS EC2 cluster to avoid any possible port constraints.

Put the IP list file into the container's ``/data/`` folder.
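
For example, a hypothetical ``ip_list.txt`` for a two-instance cluster lists one
private IP address per line:

.. code-block:: text

    172.38.12.143
    172.38.12.144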

Check port
................
The GraphStorm Docker container uses port ``2222`` to **ssh** into containers running on other machines without a password. Please make sure the port is not used by other processes.

Users also need to make sure port ``2222`` is open for **ssh** commands.
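
For example, you can verify the port with standard tools (illustrative commands;
replace ``<ip-in-the-cluster>`` with an address from your IP list):

.. code-block:: bash

    # Confirm nothing else is listening on port 2222 on this machine ...
    ss -tlnp | grep 2222
    # ... and that the port is reachable from another instance.
    nc -zv <ip-in-the-cluster> 2222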

Pick one instance and run the following command to connect to the GraphStorm Docker container.

.. code-block:: bash

    docker container exec -it test /bin/bash

Users need to exchange the ssh keys among all GraphStorm Docker containers in the cluster:
copy the key in ``/root/.ssh/id_rsa.pub`` from each container into the ``/root/.ssh/authorized_keys`` file on all other containers.
In the container environment, users can check the connectivity with the command ``ssh <ip-in-the-cluster> -o StrictHostKeyChecking=no -p 2222``. Please replace ``<ip-in-the-cluster>`` with a real IP address from the ``ip_list.txt`` file above, e.g.,

.. code-block:: bash

    ssh 172.38.12.143 -o StrictHostKeyChecking=no -p 2222

If successful, you should be logged in to the container with IP 172.38.12.143.

If not, please make sure there is no restriction blocking port 2222.


Launch GSPartition Jobs
-----------------------

Now we can ssh into the **leader node** of the EC2 cluster, and start the GSPartition process with the following command:

.. code:: bash

    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path ${LOCAL_INPUT_DATAPATH} \
        --metadata-filename ${METADATA_FILE} \
        --output-path ${LOCAL_OUTPUT_DATAPATH} \
        --num-parts ${NUM_PARTITIONS} \
        --partition-algorithm ${ALGORITHM} \
        --ip-config ${IP_CONFIG}

.. warning::

    1. Please make sure both ``LOCAL_INPUT_DATAPATH`` and ``LOCAL_OUTPUT_DATAPATH`` are located on the shared filesystem.
    2. The number of instances in the cluster should be equal to ``NUM_PARTITIONS``.
    3. For users who only want to generate partition assignments instead of the partitioned DGL graph, please add the ``--partition-assignment-only`` flag.

Currently we support both ``random`` and ``parmetis`` as partitioning algorithms for EC2 clusters.
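
For example, an illustrative invocation for a two-instance cluster with the shared
filesystem mounted at ``/data`` (all paths and values are placeholders):

.. code:: bash

    python3 -m graphstorm.gpartition.dist_partition_graph \
        --input-path /data/gsprocessing-output \
        --metadata-filename updated_row_counts_metadata.json \
        --output-path /data/gspartition-output \
        --num-parts 2 \
        --partition-algorithm random \
        --ip-config /data/ip_list.txt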