Merge branch 'main' into num_multicol
jalencato authored Nov 1, 2023
2 parents 93198e2 + cee26e4 commit 0139d33
Showing 52 changed files with 2,367 additions and 265 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/continuous-integration.yml
@@ -12,13 +12,15 @@ on:
- '.github/workflow_scripts/gsprocessing_lint.sh'
- 'graphstorm-processing/**'
- '.github/workflows/gsprocessing-workflow.yml'
- 'docs/**'
pull_request_target:
types: [ labeled, opened, reopened, synchronize, ready_for_review ]
paths-ignore:
- '.github/workflow_scripts/gsprocessing_pytest.sh'
- '.github/workflow_scripts/gsprocessing_lint.sh'
- 'graphstorm-processing/**'
- '.github/workflows/gsprocessing-workflow.yml'
- '.github/workflows/gsprocessing-workflow.yml'
- 'docs/**'

concurrency:
group: ${{ github.workflow }}-${{ github.event.number || github.event.pull_request.head.sha }}
4 changes: 3 additions & 1 deletion README.md
@@ -107,7 +107,9 @@ python3 -m graphstorm.run.gs_link_prediction \
## Limitation
The GraphStorm framework now supports using CPUs or NVIDIA GPUs for model training and inference, but it only works with the PyTorch-gloo backend. It has only been tested on AWS CPU instances, or AWS GPU instances equipped with NVIDIA GPUs, including P4, V100, A10 and A100.

Multiple samplers are not supported for PyTorch versions greater than 1.12. Please use `--num-samplers 0` when your PyTorch version is above 1.12. You can find more details [here](https://github.com/awslabs/graphstorm/issues/199).
Multiple samplers are supported in PyTorch versions <= 1.12 and >= 2.1.0. Please use `--num-samplers 0` for other PyTorch versions. More details [here](https://github.com/awslabs/graphstorm/issues/199).

To use multiple samplers on SageMaker, please use PyTorch versions <= 1.12.
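
For example, a minimal launch command might look like the following sketch (the workspace, graph partition, and YAML paths are placeholders, and other launch arguments such as the IP config are omitted):

```bash
python3 -m graphstorm.run.gs_link_prediction \
    --workspace /tmp/link-prediction-workspace \
    --part-config /path/to/partitioned_graph.json \
    --cf /path/to/lp_config.yaml \
    --num-samplers 0
```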

## License
This project is licensed under the Apache-2.0 License.
4 changes: 2 additions & 2 deletions docker/Dockerfile.local
@@ -9,7 +9,7 @@ RUN apt-get install -y python3-pip git wget psmisc
RUN apt-get install -y cmake

# Install Pytorch
RUN pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip3 install torch==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

# Install DGL
RUN pip3 install dgl==1.0.4+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
@@ -49,4 +49,4 @@ RUN cp ${SSHDIR}/id_rsa.pub ${SSHDIR}/authorized_keys

EXPOSE 2222
RUN mkdir /run/sshd
CMD ["/usr/sbin/sshd", "-D"]
CMD ["/usr/sbin/sshd", "-D"]
44 changes: 39 additions & 5 deletions docs/source/advanced/advanced-wholegraph.rst
@@ -22,15 +22,49 @@ Prerequisite

2. **EFA-enabled security group**: Please follow the `steps <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html#nccl-start-base-setup>`_ to prepare an EFA-enabled security group for Amazon EC2 instances.

3. **Docker**: You need to install Docker in your environment as the `Docker documentation <https://docs.docker.com/get-docker/>`_ suggests, and the `Nvidia Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html>`_.
3. **NVIDIA-Docker**: You need to install NVIDIA Docker in your environment, along with the `Nvidia Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html>`_.

For example, in an Amazon EC2 instance without Docker preinstalled, you can run the following commands to install Docker.
For example, in an Amazon EC2 instance without Docker preinstalled, you can run the following commands to install NVIDIA Docker.

.. code:: bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt update
sudo apt install docker.io
sudo apt-get install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker
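
To verify the installation, you can run a CUDA base image with the NVIDIA runtime and check that ``nvidia-smi`` lists your GPUs. This is only a quick sanity check, and the image tag below is just an example:

.. code:: bash
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
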
Launch instance with EFA support
---------------------------------

While launching the EFA-supported EC2 instances, in the Network settings section, choose Edit, and then do the following:

1. For Subnet, choose the subnet in which to launch the instance. If you do not select a subnet, you can't enable the instance for EFA.

2. For Firewall (security groups), choose `Select existing security group` and then pick the EFA-enabled security group you previously created, as outlined in the prerequisites.

3. Expand the Advanced network configuration section, and for Elastic Fabric Adapter, select Enable.
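
If you prefer launching from the AWS CLI instead of the console, an EFA-enabled instance can be launched with a network interface of type ``efa``. The following is an illustrative sketch; the AMI, instance type, key pair, security group, and subnet IDs are placeholders you need to replace:

.. code:: bash
aws ec2 run-instances \
--image-id ami-0123456789abcdef0 \
--instance-type p4d.24xlarge \
--key-name my-key-pair \
--count 1 \
--network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=sg-0123456789abcdef0,SubnetId=subnet-0123456789abcdef0"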


Install aws-efa-installer on the EC2 instance
----------------------------------------------

Install aws-efa-installer on both the base instance and within the Docker container. This allows the instance to load the EFA kernel module instead of relying on Ubuntu's default network drivers. Run the EFA installer without `--skip-kmod` on the instance, and with `--skip-kmod` inside the container. The command below covers the base instance installation; the Dockerfile handles the container installation in the next step.

.. code:: bash
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& apt-get update \
&& apt-get install -y libhwloc-dev \
&& ./efa_installer.sh -y -g -d --skip-limit-conf --no-verify \
&& rm -rf /var/lib/apt/lists/*
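
Inside the container, the same installer is run with `--skip-kmod` so that no kernel module is built. The Dockerfile in the next step takes care of this; the following is only an illustrative sketch of the equivalent commands:

.. code:: bash
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
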
Build a GraphStorm-WholeGraph Docker image from source
--------------------------------------------------------
@@ -62,7 +96,7 @@ If the build succeeds, there should be a new Docker image, named `<docker-name>:
Create a GraphStorm-WholeGraph container
-----------------------------------------

You can launch a container based on the Docker image built in the previous step. Make sure to use ``--privileged`` and ``—network=host`` map your host network to the container:
You can launch a container based on the Docker image built in the previous step. Make sure to use ``--privileged`` and ``--network=host`` to map your host network to the container:

.. code:: bash
2 changes: 1 addition & 1 deletion docs/source/configuration/configuration-run.rst
@@ -45,7 +45,7 @@ Model Configurations
--------------------------------
GraphStorm provides a set of parameters to configure the GNN model structure (input layer, GNN layer, decoder layer, etc.)

- **model_encoder_type**: (**Required**) The Encoder module used to encode graph data. It can be a GNN encoder or a non-GNN encoder. A GNN encoder is composed of an input module, which encodes input node features, and a GNN module. A non-GNN encoder only contains an input module. GraphStorm supports two GNN encoders: `rgcn` which uses relational graph convolutional network as its GNN module and `rgat` which uses relational graph attention network as its GNN module. GraphStorm supports two non-GNN encoder: `lm` which requires each node type has and only has text features and uses language model, e.g., Bert, to encode these features and `mlp` which accepts various types of input node features (text feature, floating points and learnable embeddings) and finally uses an MLP to project these features into same dimension.
- **model_encoder_type**: (**Required**) The encoder module used to encode graph data. It can be a GNN encoder or a non-GNN encoder. A GNN encoder is composed of an input module, which encodes input node features, and a GNN module. A non-GNN encoder only contains an input module. GraphStorm supports five GNN encoders: `rgcn`, which uses a relational graph convolutional network as its GNN module; `rgat`, which uses a relational graph attention network as its GNN module; `sage`, which uses GraphSAGE as its GNN module (only works with homogeneous graphs); `gat`, which uses a graph attention network as its GNN module (only works with homogeneous graphs); and `hgt`, which uses a heterogeneous graph transformer as its GNN module. GraphStorm supports two non-GNN encoders: `lm`, which requires that each node type has text features and only text features, and uses a language model, e.g., BERT, to encode them; and `mlp`, which accepts various types of input node features (text features, floating-point features, and learnable embeddings) and uses an MLP to project these features into the same dimension.

- Yaml: ``model_encoder_type: rgcn``
- Argument: ``--model-encoder-type rgcn``
118 changes: 88 additions & 30 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, since users
can rely on the automatic conversion.
@@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of configuration file being used. We include
the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.
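
Putting the two together, the top level of the file is a skeleton like the following (the exact ``version`` string shown here is illustrative):

.. code:: json
{
    "version": "gsprocessing-v1.0",
    "graph": {
        "edges": [],
        "nodes": []
    }
}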

We describe the ``graph`` object next.

``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
Expand Down Expand Up @@ -71,7 +71,7 @@ objects:
},
"source": {"column": "String", "type": "String"},
"relation": {"type": "String"},
"destination": {"column": "String", "type": "String"},
"dest": {"column": "String", "type": "String"},
"labels" : [
{
"column": "String",
Expand All @@ -82,8 +82,8 @@ objects:
"test": "Float"
}
},
]
"features": [{}]
],
"features": [{}]
}
- ``data`` (JSON Object, required): Describes the physical files
@@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{"column": String, "type": String}``.
- ``relation``: (JSON object, required): Describes the relation
modeled by the edges. A relation can be common among all edges, or it
can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
``movie`` we can have a relation type ``interacted_with`` for an
edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
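
For illustration, a complete ``edges`` entry combining the keys above could look like the following sketch (file paths, column names, and the label configuration are hypothetical):

.. code:: json
{
    "data": {
        "format": "csv",
        "separator": ",",
        "files": ["edges/user-rated-movie.csv"]
    },
    "source": {"column": "src_id", "type": "user"},
    "relation": {"type": "rated"},
    "dest": {"column": "dst_id", "type": "movie"},
    "labels": [
        {
            "column": "rating",
            "type": "classification",
            "split_rate": {"train": 0.8, "val": 0.1, "test": 0.1}
        }
    ],
    "features": [{"column": "timestamp", "name": "rated_at"}]
}
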
@@ -194,13 +193,12 @@ following top-level keys:
"files": ["String"],
"separator": "String"
},
"column" : "String",
"type" : "String",
"column": "String",
"type": "String",
"labels" : [
{
"column": "String",
"type": "String",
"separator": "String",
"split_rate": {
"train": "Float",
"val": "Float",
Expand All @@ -215,8 +213,8 @@ following top-level keys:
the edges object, with one top-level key for the ``format`` that
takes a String value, and one for the ``files`` that takes an array
of String values.
- ``column``: (String, required): The column in the data that
corresponds to the column that stores the node ids.
- ``column``: (String, required): The name of the column in the data that
stores the node ids.
- ``type:`` (String, optional): A type name for the nodes described
in this object. If not provided the ``column`` value is used as the
node type.
@@ -248,12 +246,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the section :ref:`features-object`
for details.

--------------
@@ -272,10 +270,10 @@ can contain the following top-level keys:
"column": "String",
"name": "String",
"transformation": {
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
},
"data": {
"format": "String",
Expand All @@ -285,7 +283,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
@@ -309,7 +307,7 @@ can contain the following top-level keys:
# Example node config with multiple features
{
# This is where the node structure data exist just need an id col
# This is where the node structure data exist, just need an id col in these files
"data": {
"format": "parquet",
"files": ["path/to/node_ids"]
@@ -356,7 +354,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

@@ -373,7 +371,55 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

- Transforms a numerical column using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
values in CSV input. If the input data are in Parquet format, each value in the
column is assumed to be an array of floats.
- ``bucket-numerical``

- Transforms a numerical column to a one-hot or multi-hot bucket representation, using bucketization.
Also supports optional missing value imputation through the ``imputer`` kwarg.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``range`` (List[float], required): Defines the start and end points of the buckets as ``[a, b]``. It should be
a list of two floats. For example, ``[10, 30]`` defines a bucketing range between 10 and 30.
- ``bucket_cnt`` (Integer, required): The number of buckets used in the bucket feature transform. GSProcessing
calculates the size of each bucket as ``( b - a ) / c``, where ``c`` is the bucket count, and encodes each numeric value
by the index of the bucket it falls into. Any value less than ``a`` is considered to belong in the first bucket,
and any value greater than ``b`` is considered to belong in the last bucket.
- ``slide_window_size`` (Integer, optional): Can be used to make numeric values fall into more than one bucket,
by specifying a slide-window size ``s``, where ``s`` can be an integer or float. GSProcessing then transforms each
numeric value ``v`` of the property into a range from ``v - s/2`` through ``v + s/2``, and assigns the value ``v``
to every bucket that the range covers.
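
As an illustrative sketch (column names and parameter values are hypothetical), a ``features`` list that applies a ``numerical`` and a ``bucket-numerical`` transformation could look like:

.. code:: json
"features": [
    {
        "column": "price",
        "transformation": {
            "name": "numerical",
            "kwargs": {"imputer": "median", "normalizer": "min-max"}
        }
    },
    {
        "column": "age",
        "transformation": {
            "name": "bucket-numerical",
            "kwargs": {
                "imputer": "mean",
                "range": [0, 100],
                "bucket_cnt": 10,
                "slide_window_size": 5
            }
        }
    }
]
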
--------------

Examples
@@ -403,15 +449,27 @@ OAG-Paper dataset
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"type": "paper",
"column": "ID",
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
}
],
"labels": [
{
"column": "field",
2 changes: 2 additions & 0 deletions docs/source/gs-processing/gs-processing-getting-started.rst
@@ -1,3 +1,5 @@
.. _gs-processing:

GraphStorm Processing Getting Started
=====================================
