Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSProcessing] Add support for numerical and multi-numerical transformations. #575

Merged
merged 7 commits into from
Oct 27, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
99 changes: 69 additions & 30 deletions docs/source/gs-processing/developer/input-configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, users are
can rely on the automatic conversion.
Expand All @@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of configuration file being used. We include
the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.

We describe the ``graph`` object next.

``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
Expand Down Expand Up @@ -71,7 +71,7 @@ objects:
},
"source": {"column": "String", "type": "String"},
"relation": {"type": "String"},
"destination": {"column": "String", "type": "String"},
"dest": {"column": "String", "type": "String"},
"labels" : [
{
"column": "String",
Expand All @@ -82,8 +82,8 @@ objects:
"test": "Float"
}
},
]
"features": [{}]
],
"features": [{}]
}

- ``data`` (JSON Object, required): Describes the physical files
Expand Down Expand Up @@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{“column: String, and ”type“: String}``.
- ``relation``: (JSON object, required): Describes the relation
modeled by the edges. A relation can be common among all edges, or it
can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
``movie`` we can have a relation type ``interacted_with`` for an
edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
Expand All @@ -194,13 +193,12 @@ following top-level keys:
"files": ["String"],
"separator": "String"
},
"column" : "String",
"type" : "String",
"column": "String",
"type": "String",
"labels" : [
{
"column": "String",
"type": "String",
"separator": "String",
"split_rate": {
"train": "Float",
"val": "Float",
Expand All @@ -215,8 +213,8 @@ following top-level keys:
the edges object, with one top-level key for the ``format`` that
takes a String value, and one for the ``files`` that takes an array
of String values.
- ``column``: (String, required): The column in the data that
corresponds to the column that stores the node ids.
- ``column``: (String, required): The name of the column in the data that
stores the node ids.
- ``type:`` (String, optional): A type name for the nodes described
in this object. If not provided the ``column`` value is used as the
node type.
Expand Down Expand Up @@ -248,12 +246,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the section :ref:`features-object`
for details.

--------------
Expand All @@ -272,10 +270,10 @@ can contain the following top-level keys:
"column": "String",
"name": "String",
"transformation": {
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
},
"data": {
"format": "String",
Expand All @@ -285,7 +283,7 @@ can contain the following top-level keys:
}

- ``column`` (String, required): The column that contains the raw
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
Expand All @@ -309,7 +307,7 @@ can contain the following top-level keys:

# Example node config with multiple features
{
# This is where the node structure data exist just need an id col
# This is where the node structure data exist, just need an id col in these files
"data": {
"format": "parquet",
"files": ["path/to/node_ids"]
Expand Down Expand Up @@ -356,7 +354,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

Expand All @@ -373,6 +371,35 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

- Transforms a numerical column using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none``, ``mean`` (Default), ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
values in CSV input. If the input data are in Parquet format, each value in the
column is assumed to be an array of floats.

--------------

Expand Down Expand Up @@ -403,15 +430,27 @@ OAG-Paper dataset
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"type": "paper",
"column": "ID",
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
}
],
"labels": [
{
"column": "field",
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ Welcome to the GraphStorm Documentation and Tutorials
gs-processing/usage/example
gs-processing/usage/distributed-processing-setup
gs-processing/usage/amazon-sagemaker
gs-processing/developer/input-configuration

.. toctree::
:maxdepth: 1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -89,8 +89,7 @@ def convert_to_gsprocessing(self, input_dictionary: dict) -> dict:

gsprocessing_dict: dict[str, Any] = {}

# hardcode the version number for the first version
gsprocessing_dict["version"] = "gsprocessing-1.0"
gsprocessing_dict["version"] = "gsprocessing-v1.0"
gsprocessing_dict["graph"] = {}

# deal with nodes
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any

from .converter_base import ConfigConverter
from .meta_configuration import NodeConfig, EdgeConfig

Expand Down Expand Up @@ -80,27 +82,40 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
list[dict]
The feature information in the GSProcessing format
"""
feats_list = []
gsp_feats_list = []
thvasilo marked this conversation as resolved.
Show resolved Hide resolved
if feats in [[], [{}]]:
return []
for ele in feats:
if "transform" in ele:
raise ValueError(
"Currently only support no-op operation, "
"we do not support any other no-op operation"
)
feat_dict = {}
kwargs = {"name": "no-op"}
for col in ele["feature_col"]:
feat_dict = {"column": col, "transform": kwargs}
feats_list.append(feat_dict)
if "out_dtype" in ele:
for gconstruct_feat_dict in feats:
gsp_feat_dict = {}
gsp_feat_dict["column"] = gconstruct_feat_dict["feature_col"][0]
if "feature_name" in gconstruct_feat_dict:
gsp_feat_dict["name"] = gconstruct_feat_dict["feature_name"]

gsp_transformation_dict: dict[str, Any] = {}
if "transform" in gconstruct_feat_dict:
gconstruct_transform_dict = gconstruct_feat_dict["transform"]

if gconstruct_transform_dict["name"] == "max_min_norm":
gsp_transformation_dict["name"] = "numerical"
gsp_transformation_dict["kwargs"] = {"normalizer": "min-max", "imputer": "mean"}
# TODO: Add support for other common transformations here
else:
raise ValueError(
"Unsupported GConstruct transformation name: "
f"{gconstruct_transform_dict['name']}"
)
else:
gsp_transformation_dict["name"] = "no-op"

if "out_dtype" in gconstruct_feat_dict:
assert (
ele["out_dtype"] == "float32"
gconstruct_feat_dict["out_dtype"] == "float32"
), "GSProcessing currently only supports float32 features"
if "feature_name" in ele:
feat_dict["name"] = ele["feature_name"]
return feats_list

gsp_feat_dict["transformation"] = gsp_transformation_dict
gsp_feats_list.append(gsp_feat_dict)

return gsp_feats_list

@staticmethod
def convert_nodes(nodes_entries):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,12 @@
Configuration parsing for edges and nodes
"""
from abc import ABC

from typing import Any, Dict, List, Optional, Sequence

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES
from .label_config_base import LabelConfig, EdgeLabelConfig, NodeLabelConfig
from .feature_config_base import FeatureConfig, NoopFeatureConfig
from .numerical_configs import MultiNumericalFeatureConfig, NumericalFeatureConfig
from .data_config_base import DataStorageConfig


Expand Down Expand Up @@ -52,6 +52,10 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:

if transformation_name == "no-op":
return NoopFeatureConfig(feature_dict)
elif transformation_name == "numerical":
return NumericalFeatureConfig(feature_dict)
elif transformation_name == "multi-numerical":
return MultiNumericalFeatureConfig(feature_dict)
else:
raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,8 +82,13 @@ def _sanity_check(self) -> None:


class NoopFeatureConfig(FeatureConfig):
"""
Feature configuration for features that do not need to be transformed.
"""Feature configuration for features that do not need to be transformed.

Supported kwargs
----------------
separator: str
When provided will treat the input as strings, split each value in the string using
the separator, and convert the resulting list of floats into a float-vector feature.
"""

def __init__(self, config: Mapping):
Expand All @@ -98,4 +103,4 @@ def __init__(self, config: Mapping):
def _sanity_check(self) -> None:
super()._sanity_check()
if self._data_config and self.value_separator and self._data_config.format != "csv":
raise RuntimeError("value_separator should only be provided for CSV data")
raise RuntimeError("separator should only be provided for CSV data")
Loading
Loading