Merge branch 'main' into num_multicol
jalencato authored Nov 1, 2023
2 parents 93198e2 + cee26e4 commit 0139d33
Showing 52 changed files with 2,367 additions and 265 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/continuous-integration.yml
@@ -12,13 +12,15 @@ on:
- '.github/workflow_scripts/gsprocessing_lint.sh'
- 'graphstorm-processing/**'
- '.github/workflows/gsprocessing-workflow.yml'
- 'docs/**'
pull_request_target:
types: [ labeled, opened, reopened, synchronize, ready_for_review ]
paths-ignore:
- '.github/workflow_scripts/gsprocessing_pytest.sh'
- '.github/workflow_scripts/gsprocessing_lint.sh'
- 'graphstorm-processing/**'
- '.github/workflows/gsprocessing-workflow.yml'
- '.github/workflows/gsprocessing-workflow.yml'
- 'docs/**'

concurrency:
group: ${{ github.workflow }}-${{ github.event.number || github.event.pull_request.head.sha }}
4 changes: 3 additions & 1 deletion README.md
@@ -107,7 +107,9 @@ python3 -m graphstorm.run.gs_link_prediction \
## Limitation
The GraphStorm framework now supports using CPUs or NVIDIA GPUs for model training and inference, but it only works with the PyTorch-gloo backend. It has only been tested on AWS CPU instances, or AWS GPU instances equipped with NVIDIA GPUs, including P4, V100, A10 and A100.

Multiple samplers are not supported for PyTorch versions greater than 1.12. Please use `--num-samplers 0` when your PyTorch version is above 1.12. You can find more details [here](https://github.com/awslabs/graphstorm/issues/199).
Multiple samplers are supported in PyTorch versions <= 1.12 and >= 2.1.0. Please use `--num-samplers 0` for other PyTorch versions. More details [here](https://github.com/awslabs/graphstorm/issues/199).

To use multiple samplers on SageMaker, please use PyTorch versions <= 1.12.
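
For example, a minimal launch command might look like the following sketch (the workspace, graph partition, and YAML paths are placeholders, and other launch arguments such as the IP config are omitted):

```bash
python3 -m graphstorm.run.gs_link_prediction \
    --workspace /tmp/link-prediction-workspace \
    --part-config /path/to/partitioned_graph.json \
    --cf /path/to/lp_config.yaml \
    --num-samplers 0
```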

## License
This project is licensed under the Apache-2.0 License.
4 changes: 2 additions & 2 deletions docker/Dockerfile.local
@@ -9,7 +9,7 @@ RUN apt-get install -y python3-pip git wget psmisc
RUN apt-get install -y cmake

# Install Pytorch
RUN pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip3 install torch==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

# Install DGL
RUN pip3 install dgl==1.0.4+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
@@ -49,4 +49,4 @@ RUN cp ${SSHDIR}/id_rsa.pub ${SSHDIR}/authorized_keys

EXPOSE 2222
RUN mkdir /run/sshd
CMD ["/usr/sbin/sshd", "-D"]
CMD ["/usr/sbin/sshd", "-D"]
44 changes: 39 additions & 5 deletions docs/source/advanced/advanced-wholegraph.rst
@@ -22,15 +22,49 @@ Prerequisite

2. **EFA-enabled security group**: Please follow the `steps <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html#nccl-start-base-setup>`_ to prepare an EFA-enabled security group for Amazon EC2 instances.

3. **Docker**: You need to install Docker in your environment as the `Docker documentation <https://docs.docker.com/get-docker/>`_ suggests, and the `Nvidia Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html>`_.
3. **NVIDIA-Docker**: You need to install NVIDIA Docker in your environment, along with the `Nvidia Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html>`_.

For example, in an Amazon EC2 instance without Docker preinstalled, you can run the following commands to install Docker.
For example, in an Amazon EC2 instance without Docker preinstalled, you can run the following commands to install NVIDIA Docker.

.. code:: bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt update
sudo apt install docker.io
sudo apt-get install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker
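
To verify the installation, you can run a CUDA base image with the NVIDIA runtime and check that ``nvidia-smi`` lists your GPUs. This is only a quick sanity check, and the image tag below is just an example:

.. code:: bash
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
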
Launch instance with EFA support
---------------------------------

While launching the EFA-supported EC2 instances, in the Network settings section, choose Edit, and then do the following:

1. For Subnet, choose the subnet in which to launch the instance. If you do not select a subnet, you can't enable the instance for EFA.

2. For Firewall (security groups), choose `Select existing security group` and then pick the EFA-enabled security group you previously created, as outlined in the prerequisites.

3. Expand the Advanced network configuration section, and for Elastic Fabric Adapter, select Enable.
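
If you prefer launching from the AWS CLI instead of the console, an EFA-enabled instance can be launched with a network interface of type ``efa``. The following is an illustrative sketch; the AMI, instance type, key pair, security group, and subnet IDs are placeholders you need to replace:

.. code:: bash
aws ec2 run-instances \
--image-id ami-0123456789abcdef0 \
--instance-type p4d.24xlarge \
--key-name my-key-pair \
--count 1 \
--network-interfaces "DeviceIndex=0,InterfaceType=efa,Groups=sg-0123456789abcdef0,SubnetId=subnet-0123456789abcdef0"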


Install aws-efa-installer on the EC2 instance
----------------------------------------------

Install aws-efa-installer on both the base instance and within the Docker container. This allows the instance to load the EFA kernel module instead of relying on Ubuntu's default network drivers. Run the EFA installer without `--skip-kmod` on the instance, and with `--skip-kmod` inside the container. The command below covers the base instance installation; the Dockerfile handles the container installation in the next step.

.. code:: bash
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& apt-get update \
&& apt-get install -y libhwloc-dev \
&& ./efa_installer.sh -y -g -d --skip-limit-conf --no-verify \
&& rm -rf /var/lib/apt/lists/*
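
Inside the container, the same installer is run with `--skip-kmod` so that no kernel module is built. The Dockerfile in the next step takes care of this; the following is only an illustrative sketch of the equivalent commands:

.. code:: bash
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
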
Build a GraphStorm-WholeGraph Docker image from source
--------------------------------------------------------
@@ -62,7 +96,7 @@ If the build succeeds, there should be a new Docker image, named `<docker-name>:
Create a GraphStorm-WholeGraph container
-----------------------------------------

You can launch a container based on the Docker image built in the previous step. Make sure to use ``--privileged`` and ``—network=host`` map your host network to the container:
You can launch a container based on the Docker image built in the previous step. Make sure to use ``--privileged`` and ``--network=host`` to map your host network to the container:

.. code:: bash
2 changes: 1 addition & 1 deletion docs/source/configuration/configuration-run.rst
@@ -45,7 +45,7 @@ Model Configurations
--------------------------------
GraphStorm provides a set of parameters to configure the GNN model structure (input layer, GNN layer, decoder layer, etc.)

- **model_encoder_type**: (**Required**) The Encoder module used to encode graph data. It can be a GNN encoder or a non-GNN encoder. A GNN encoder is composed of an input module, which encodes input node features, and a GNN module. A non-GNN encoder only contains an input module. GraphStorm supports two GNN encoders: `rgcn` which uses relational graph convolutional network as its GNN module and `rgat` which uses relational graph attention network as its GNN module. GraphStorm supports two non-GNN encoder: `lm` which requires each node type has and only has text features and uses language model, e.g., Bert, to encode these features and `mlp` which accepts various types of input node features (text feature, floating points and learnable embeddings) and finally uses an MLP to project these features into same dimension.
- **model_encoder_type**: (**Required**) The encoder module used to encode graph data. It can be a GNN encoder or a non-GNN encoder. A GNN encoder is composed of an input module, which encodes input node features, and a GNN module. A non-GNN encoder only contains an input module. GraphStorm supports five GNN encoders: `rgcn`, which uses a relational graph convolutional network as its GNN module; `rgat`, which uses a relational graph attention network as its GNN module; `sage`, which uses GraphSAGE as its GNN module (only works with homogeneous graphs); `gat`, which uses a graph attention network as its GNN module (only works with homogeneous graphs); and `hgt`, which uses a heterogeneous graph transformer as its GNN module. GraphStorm supports two non-GNN encoders: `lm`, which requires that each node type has text features and only text features, and uses a language model, e.g., BERT, to encode them; and `mlp`, which accepts various types of input node features (text features, floating-point features, and learnable embeddings) and uses an MLP to project these features into the same dimension.

- Yaml: ``model_encoder_type: rgcn``
- Argument: ``--model-encoder-type rgcn``
118 changes: 88 additions & 30 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, since users
can rely on the automatic conversion.
@@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of configuration file being used. We include
the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.
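
Putting the two together, the top level of the file is a skeleton like the following (the exact ``version`` string shown here is illustrative):

.. code:: json
{
    "version": "gsprocessing-v1.0",
    "graph": {
        "edges": [],
        "nodes": []
    }
}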

We describe the ``graph`` object next.

``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
Expand Down Expand Up @@ -71,7 +71,7 @@ objects:
},
"source": {"column": "String", "type": "String"},
"relation": {"type": "String"},
"destination": {"column": "String", "type": "String"},
"dest": {"column": "String", "type": "String"},
"labels" : [
{
"column": "String",
Expand All @@ -82,8 +82,8 @@ objects:
"test": "Float"
}
},
]
"features": [{}]
],
"features": [{}]
}
- ``data`` (JSON Object, required): Describes the physical files
@@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{"column": String, "type": String}``.
- ``relation``: (JSON object, required): Describes the relation
modeled by the edges. A relation can be common among all edges, or it
can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
``movie`` we can have a relation type ``interacted_with`` for an
edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
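
For illustration, a complete ``edges`` entry combining the keys above could look like the following sketch (file paths, column names, and the label configuration are hypothetical):

.. code:: json
{
    "data": {
        "format": "csv",
        "separator": ",",
        "files": ["edges/user-rated-movie.csv"]
    },
    "source": {"column": "src_id", "type": "user"},
    "relation": {"type": "rated"},
    "dest": {"column": "dst_id", "type": "movie"},
    "labels": [
        {
            "column": "rating",
            "type": "classification",
            "split_rate": {"train": 0.8, "val": 0.1, "test": 0.1}
        }
    ],
    "features": [{"column": "timestamp", "name": "rated_at"}]
}
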
@@ -194,13 +193,12 @@ following top-level keys:
"files": ["String"],
"separator": "String"
},
"column" : "String",
"type" : "String",
"column": "String",
"type": "String",
"labels" : [
{
"column": "String",
"type": "String",
"separator": "String",
"split_rate": {
"train": "Float",
"val": "Float",
Expand All @@ -215,8 +213,8 @@ following top-level keys:
the edges object, with one top-level key for the ``format`` that
takes a String value, and one for the ``files`` that takes an array
of String values.
- ``column``: (String, required): The column in the data that
corresponds to the column that stores the node ids.
- ``column``: (String, required): The name of the column in the data that
stores the node ids.
- ``type:`` (String, optional): A type name for the nodes described
in this object. If not provided the ``column`` value is used as the
node type.
@@ -248,12 +246,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the section :ref:`features-object`
for details.

--------------
@@ -272,10 +270,10 @@ can contain the following top-level keys:
"column": "String",
"name": "String",
"transformation": {
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
},
"data": {
"format": "String",
Expand All @@ -285,7 +283,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
@@ -309,7 +307,7 @@ can contain the following top-level keys:
# Example node config with multiple features
{
# This is where the node structure data exist just need an id col
# This is where the node structure data exist, just need an id col in these files
"data": {
"format": "parquet",
"files": ["path/to/node_ids"]
@@ -356,7 +354,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

@@ -373,7 +371,55 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

- Transforms a numerical column using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
values in CSV input. If the input data are in Parquet format, each value in the
column is assumed to be an array of floats.
- ``bucket-numerical``

- Transforms a numerical column to a one-hot or multi-hot bucket representation, using bucketization.
Also supports optional missing value imputation through the ``imputer`` kwarg.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``range`` (List[float], required): Defines the start and end points of the buckets as ``[a, b]``. It should be
a list of two floats. For example, ``[10, 30]`` defines a bucketing range between 10 and 30.
- ``bucket_cnt`` (Integer, required): The number of buckets used in the bucket feature transform. GSProcessing
calculates the size of each bucket as ``( b - a ) / c``, where ``c`` is the bucket count, and encodes each numeric value
by the index of the bucket it falls into. Any value less than ``a`` is considered to belong in the first bucket,
and any value greater than ``b`` is considered to belong in the last bucket.
- ``slide_window_size`` (Integer, optional): Can be used to make numeric values fall into more than one bucket,
by specifying a slide-window size ``s``, where ``s`` can be an integer or float. GSProcessing then transforms each
numeric value ``v`` of the property into a range from ``v - s/2`` through ``v + s/2``, and assigns the value ``v``
to every bucket that the range covers.
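
As an illustrative sketch (column names and parameter values are hypothetical), a ``features`` list that applies a ``numerical`` and a ``bucket-numerical`` transformation could look like:

.. code:: json
"features": [
    {
        "column": "price",
        "transformation": {
            "name": "numerical",
            "kwargs": {"imputer": "median", "normalizer": "min-max"}
        }
    },
    {
        "column": "age",
        "transformation": {
            "name": "bucket-numerical",
            "kwargs": {
                "imputer": "mean",
                "range": [0, 100],
                "bucket_cnt": 10,
                "slide_window_size": 5
            }
        }
    }
]
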
--------------

Examples
@@ -403,15 +449,27 @@ OAG-Paper dataset
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"type": "paper",
"column": "ID",
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
}
],
"labels": [
{
"column": "field",
2 changes: 2 additions & 0 deletions docs/source/gs-processing/gs-processing-getting-started.rst
@@ -1,3 +1,5 @@
.. _gs-processing:

GraphStorm Processing Getting Started
=====================================
