[GSProcessing] Add support for numerical and multi-numerical transformations.
thvasilo committed Oct 17, 2023
1 parent fe8d317 commit 27d89b4
Showing 18 changed files with 1,061 additions and 70 deletions.
75 changes: 57 additions & 18 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
-directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
-that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, as users can
rely on the automatic conversion.
@@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of the configuration file being used. We include
  the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
-of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.
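For orientation, a top-level file might look like the following minimal sketch (the ``edges``/``nodes`` keys are taken from the OAG-Paper example at the end of this page, and the version string matches the one the converter emits below):

    {
        "version": "gsprocessing-v1.0",
        "graph": {
            "edges": [],
            "nodes": []
        }
    }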

We describe the ``graph`` object next.

-``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
@@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{"column": String, "type": String}``.
- ``relation``: (JSON object, required): Describes the relation
-modeled by the edges. A relation can be common among all edges, or it
-can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
-``movie`` we can have a relation type ``interacted_with`` for an
-edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
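Putting the edge-related keys above together, a single edge configuration entry could look like the sketch below. This is only an illustration: the ``dest`` key, the label ``type`` value, the ``split_rate`` field names, and the file path are assumptions based on the surrounding text, since parts of this page are collapsed in the diff.

    {
        "data": {
            "format": "csv",
            "separator": ",",
            "files": ["edges/user_rated_movie.csv"]
        },
        "source": {"column": "src_id", "type": "user"},
        "dest": {"column": "dst_id", "type": "movie"},
        "relation": {"type": "rated"},
        "labels": [
            {
                "column": "rating",
                "type": "classification",
                "split_rate": {"train": 0.8, "val": 0.1, "test": 0.1}
            }
        ]
    }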
@@ -248,12 +247,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
-the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the next section, :ref:`features-object`
for details.

--------------
@@ -285,7 +284,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
-feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
@@ -309,7 +308,7 @@ can contain the following top-level keys:
    # Example node config with multiple features
    {
-       # This is where the node structure data exist just need an id col
        # This is where the node structure data exist, just need an id col in these files
        "data": {
            "format": "parquet",
            "files": ["path/to/node_ids"]
@@ -356,7 +355,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
-in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

@@ -373,6 +372,35 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

  - Transforms a numerical column using a missing data imputer and an
    optional normalizer.
  - ``kwargs``:

    - ``imputer`` (String, optional): A method to fill in missing values in the data.
      Valid values are:
      ``mean`` (Default), ``median``, and ``most_frequent``. Missing values will be replaced
      with the respective value computed from the data.
    - ``normalizer`` (String, optional): Applies a normalization to the data, after
      imputation. Can take the following values:

      - ``none``: (Default) Don't normalize the numerical values during encoding.
      - ``min-max``: Normalize each value by subtracting the minimum value from it,
        and then dividing it by the difference between the maximum value and the minimum.
      - ``standard``: Normalize each value by dividing it by the sum of all the values.

- ``multi-numerical``

  - Column-wise transformation for vector-like numerical data using a missing data imputer and an
    optional normalizer.
  - ``kwargs``:

    - ``imputer`` (String, optional): Same as for the ``numerical`` transformation; applies
      ``mean`` imputation by default.
    - ``normalizer`` (String, optional): Same as for the ``numerical`` transformation; no
      normalization is applied by default.
    - ``separator`` (String, optional): Same as for the ``no-op`` transformation, used to separate numerical
      values in CSV input. If the input data are in Parquet format, each value in the
      column is assumed to be an array of floats.
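To illustrate the two new transformations, here is a sketch of a ``features`` list using them (the column names are invented for the example; the ``separator`` assumes CSV input):

    "features": [
        {
            "column": "age",
            "transformation": {
                "name": "numerical",
                "kwargs": {"imputer": "median", "normalizer": "min-max"}
            }
        },
        {
            "column": "ratings",
            "transformation": {
                "name": "multi-numerical",
                "kwargs": {"imputer": "mean", "normalizer": "none", "separator": "|"}
            }
        }
    ]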

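In plain Python, the imputer-then-normalizer pipeline described above behaves roughly like the sketch below; this is an illustration of the documented semantics, not the actual Spark implementation:

    import statistics

    def impute(values, imputer="mean"):
        """Fill None entries with the column statistic named by `imputer`."""
        present = [v for v in values if v is not None]
        fill = {
            "mean": statistics.mean,
            "median": statistics.median,
            "most_frequent": statistics.mode,
        }[imputer](present)
        return [fill if v is None else v for v in values]

    def normalize(values, normalizer="none"):
        """Apply the documented normalizers to an already-imputed column."""
        if normalizer == "none":
            return values
        if normalizer == "min-max":
            low, high = min(values), max(values)
            return [(v - low) / (high - low) for v in values]
        if normalizer == "standard":
            total = sum(values)
            return [v / total for v in values]
        raise ValueError(f"Unknown normalizer: {normalizer}")

    # Example: mean imputation followed by min-max normalization
    print(normalize(impute([1.0, None, 3.0], "mean"), "min-max"))  # [0.0, 0.5, 1.0]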
--------------

@@ -403,15 +431,26 @@ OAG-Paper dataset
        ],
        "nodes" : [
            {
-                "type": "paper",
-                "column": "ID",
                "data": {
                    "format": "csv",
                    "separator": ",",
                    "files": [
                        "node_feat.csv"
                    ]
                },
                "type": "paper",
                "column": "ID",
                "features": [
                    {
                        "column": "n_citation",
                        "transformation": {
                            "name": "numerical",
                            "kwargs": {
                                "imputer": "mean",
                                "normalizer": "min-max"
                            }
                        }
                    }
                ],
                "labels": [
                    {
                        "column": "field",
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -23,6 +23,7 @@ Welcome to the GraphStorm Documentation and Tutorials
gs-processing/usage/example
gs-processing/usage/distributed-processing-setup
gs-processing/usage/amazon-sagemaker
gs-processing/developer/input-configuration

.. toctree::
:maxdepth: 1
@@ -89,8 +89,7 @@ def convert_to_gsprocessing(self, input_dictionary: dict) -> dict:

        gsprocessing_dict: dict[str, Any] = {}

-        # hardcode the version number for the first version
-        gsprocessing_dict["version"] = "gsprocessing-1.0"
        gsprocessing_dict["version"] = "gsprocessing-v1.0"
        gsprocessing_dict["graph"] = {}

        # deal with nodes
@@ -13,6 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any

from .converter_base import ConfigConverter
from .meta_configuration import NodeConfig, EdgeConfig

@@ -80,27 +82,40 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
        list[dict]
            The feature information in the GSProcessing format
        """
-        feats_list = []
        gsp_feats_list = []
        if feats in [[], [{}]]:
            return []
-        for ele in feats:
-            if "transform" in ele:
-                raise ValueError(
-                    "Currently only support no-op operation, "
-                    "we do not support any other no-op operation"
-                )
-            feat_dict = {}
-            kwargs = {"name": "no-op"}
-            for col in ele["feature_col"]:
-                feat_dict = {"column": col, "transform": kwargs}
-                feats_list.append(feat_dict)
-            if "out_dtype" in ele:
        for gconstruct_feat_dict in feats:
            gsp_feat_dict = {}
            gsp_feat_dict["column"] = gconstruct_feat_dict["feature_col"][0]
            if "feature_name" in gconstruct_feat_dict:
                gsp_feat_dict["name"] = gconstruct_feat_dict["feature_name"]

            gsp_transformation_dict: dict[str, Any] = {}
            if "transform" in gconstruct_feat_dict:
                gconstruct_transform_dict = gconstruct_feat_dict["transform"]

                if gconstruct_transform_dict["name"] == "max_min_norm":
                    gsp_transformation_dict["name"] = "numerical"
                    gsp_transformation_dict["kwargs"] = {"normalizer": "min-max", "imputer": "mean"}
                # TODO: Add support for other common transformations here
                else:
                    raise ValueError(
                        "Unsupported GConstruct transformation name: "
                        f"{gconstruct_transform_dict['name']}"
                    )
            else:
                gsp_transformation_dict["name"] = "no-op"

            if "out_dtype" in gconstruct_feat_dict:
                assert (
-                    ele["out_dtype"] == "float32"
                    gconstruct_feat_dict["out_dtype"] == "float32"
                ), "GSProcessing currently only supports float32 features"
-            if "feature_name" in ele:
-                feat_dict["name"] = ele["feature_name"]
-        return feats_list

            gsp_feat_dict["transformation"] = gsp_transformation_dict
            gsp_feats_list.append(gsp_feat_dict)

        return gsp_feats_list

    @staticmethod
    def convert_nodes(nodes_entries):
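# --- Editor's sketch (not part of the commit): expected behavior of _convert_feature.
# Per the branches above, a GConstruct feature using "max_min_norm" maps to the
# GSProcessing "numerical" transformation, e.g.:
#
#   _convert_feature([{"feature_col": ["n_citation"], "transform": {"name": "max_min_norm"}}])
#   == [{"column": "n_citation",
#        "transformation": {"name": "numerical",
#                           "kwargs": {"normalizer": "min-max", "imputer": "mean"}}}]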
@@ -16,12 +16,13 @@
Configuration parsing for edges and nodes
"""
from abc import ABC

from typing import Any, Dict, List, Optional, Sequence
import logging

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES
from .label_config_base import LabelConfig, EdgeLabelConfig, NodeLabelConfig
from .feature_config_base import FeatureConfig, NoopFeatureConfig
from .numerical_configs import MultiNumericalFeatureConfig, NumericalFeatureConfig
from .data_config_base import DataStorageConfig


@@ -49,9 +50,14 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
        return NoopFeatureConfig(feature_dict)

    transformation_name = feature_dict["transformation"]["name"]
    logging.info("Transformation name: %s", transformation_name)

    if transformation_name == "no-op":
        return NoopFeatureConfig(feature_dict)
    elif transformation_name == "numerical":
        return NumericalFeatureConfig(feature_dict)
    elif transformation_name == "multi-numerical":
        return MultiNumericalFeatureConfig(feature_dict)
    else:
        raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

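A quick sketch of how the new dispatch could be exercised, assuming ``parse_feat_config`` is imported from this module and the feature dictionary follows the documented shape:

    feature_dict = {
        "column": "n_citation",
        "transformation": {
            "name": "numerical",
            "kwargs": {"imputer": "mean", "normalizer": "min-max"},
        },
    }
    feat_config = parse_feat_config(feature_dict)  # returns a NumericalFeatureConfig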
@@ -82,8 +82,13 @@ def _sanity_check(self) -> None:


class NoopFeatureConfig(FeatureConfig):
"""
Feature configuration for features that do not need to be transformed.
"""Feature configuration for features that do not need to be transformed.
Supported kwargs
----------------
separator: str
When provided will treat the input as strings, split each value in the string using
the separator, and convert the resulting list of floats into a float-vector feature.
"""

def __init__(self, config: Mapping):
@@ -98,4 +103,4 @@ def __init__(self, config: Mapping):
    def _sanity_check(self) -> None:
        super()._sanity_check()
        if self._data_config and self.value_separator and self._data_config.format != "csv":
-            raise RuntimeError("value_separator should only be provided for CSV data")
            raise RuntimeError("separator should only be provided for CSV data")
@@ -0,0 +1,92 @@
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Mapping

from .feature_config_base import FeatureConfig


class NumericalFeatureConfig(FeatureConfig):
    """Feature configuration for single-column numerical features.

    Supported kwargs
    ----------------
    imputer: str
        A method to fill in missing values in the data. Valid values are:
        "mean" (Default), "median", and "most_frequent". Missing values will be replaced
        with the respective value computed from the data.
    normalizer: str
        A normalization to apply to each column. Valid values are
        "none", "min-max", and "standard".

        The transformation applied will be:

        * "none": (Default) Don't normalize the numerical values during encoding.
        * "min-max": Normalize each value by subtracting the minimum value from it,
          and then dividing it by the difference between the maximum value and the minimum.
        * "standard": Normalize each value by dividing it by the sum of all the values.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.imputer = self._transformation_kwargs.get("imputer", "mean")
        self.norm = self._transformation_kwargs.get("normalizer", "none")

        self._sanity_check()

    def _sanity_check(self) -> None:
        super()._sanity_check()
        valid_imputers = ["mean", "median", "most_frequent"]
        assert (
            self.imputer in valid_imputers
        ), f"Unknown imputer requested, expected one of {valid_imputers}, got {self.imputer}"
        valid_normalizers = ["none", "min-max", "standard"]
        assert (
            self.norm in valid_normalizers
        ), f"Unknown normalizer requested, expected one of {valid_normalizers}, got {self.norm}"


class MultiNumericalFeatureConfig(NumericalFeatureConfig):
    """Feature configuration for multi-column numerical features.

    Supported kwargs
    ----------------
    imputer: str
        A method to fill in missing values in the data. Valid values are:
        "mean" (Default), "median", and "most_frequent". Missing values will be replaced
        with the respective value computed from the data.
    normalizer: str
        A normalization to apply to each column. Valid values are
        "none", "min-max", and "standard".

        The transformation applied will be:

        * "none": (Default) Don't normalize the numerical values during encoding.
        * "min-max": Normalize each value by subtracting the minimum value from it,
          and then dividing it by the difference between the maximum value and the minimum.
        * "standard": Normalize each value by dividing it by the sum of all the values.
    separator: str, optional
        A separator to use when splitting a delimited string into multiple numerical values
        as a vector. Only applicable to CSV input. Example: for a separator `'|'` the CSV
        value `1|2|3` would be transformed to a vector, `[1, 2, 3]`. When `None` the expected
        input format is an array of numerical values.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.separator = self._transformation_kwargs.get("separator", None)

        self._sanity_check()
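To close the loop, constructing one of the new config classes might look like this sketch; it assumes ``FeatureConfig`` reads ``column`` and the transformation ``kwargs`` from the mapping, as the attribute reads above suggest:

    config = MultiNumericalFeatureConfig(
        {
            "column": "ratings",
            "transformation": {
                "name": "multi-numerical",
                "kwargs": {"imputer": "median", "normalizer": "min-max", "separator": "|"},
            },
        }
    )
    print(config.imputer, config.norm, config.separator)  # median min-max |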
