Skip to content

Commit

Permalink
[GSProcessing] Add support for numerical and multi-numerical transfor…
Browse files Browse the repository at this point in the history
…mations.
  • Loading branch information
thvasilo committed Oct 17, 2023
1 parent fe8d317 commit 8551c37
Show file tree
Hide file tree
Showing 14 changed files with 183 additions and 70 deletions.
75 changes: 57 additions & 18 deletions docs/source/gs-processing/developer/input-configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, users are
can rely on the automatic conversion.
Expand All @@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of configuration file being used. We include
the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.

We describe the ``graph`` object next.

``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
Expand Down Expand Up @@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{“column: String, and ”type“: String}``.
- ``relation``: (JSON object, required): Describes the relation
modeled by the edges. A relation can be common among all edges, or it
can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
``movie`` we can have a relation type ``interacted_with`` for an
edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
Expand Down Expand Up @@ -248,12 +247,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current edge node. See the next section, :ref:`features-object`
for details.

--------------
Expand Down Expand Up @@ -285,7 +284,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
Expand All @@ -309,7 +308,7 @@ can contain the following top-level keys:
# Example node config with multiple features
{
# This is where the node structure data exist just need an id col
# This is where the node structure data exist, just need an id col in these files
"data": {
"format": "parquet",
"files": ["path/to/node_ids"]
Expand Down Expand Up @@ -356,7 +355,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

Expand All @@ -373,6 +372,35 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

- Transforms a numerical column using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``mean`` (Default), ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
values in CSV input. If the input data are in Parquet format, each value in the
column is assumed to be an array of floats.

--------------

Expand Down Expand Up @@ -403,15 +431,26 @@ OAG-Paper dataset
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"type": "paper",
"column": "ID",
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
]
"labels": [
{
"column": "field",
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ Welcome to the GraphStorm Documentation and Tutorials
gs-processing/usage/example
gs-processing/usage/distributed-processing-setup
gs-processing/usage/amazon-sagemaker
gs-processing/developer/input-configuration

.. toctree::
:maxdepth: 1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -89,8 +89,7 @@ def convert_to_gsprocessing(self, input_dictionary: dict) -> dict:

gsprocessing_dict: dict[str, Any] = {}

# hardcode the version number for the first version
gsprocessing_dict["version"] = "gsprocessing-1.0"
gsprocessing_dict["version"] = "gsprocessing-v1.0"
gsprocessing_dict["graph"] = {}

# deal with nodes
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any

from .converter_base import ConfigConverter
from .meta_configuration import NodeConfig, EdgeConfig

Expand Down Expand Up @@ -80,27 +82,40 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
list[dict]
The feature information in the GSProcessing format
"""
feats_list = []
gsp_feats_list = []
if feats in [[], [{}]]:
return []
for ele in feats:
if "transform" in ele:
raise ValueError(
"Currently only support no-op operation, "
"we do not support any other no-op operation"
)
feat_dict = {}
kwargs = {"name": "no-op"}
for col in ele["feature_col"]:
feat_dict = {"column": col, "transform": kwargs}
feats_list.append(feat_dict)
if "out_dtype" in ele:
for gconstruct_feat_dict in feats:
gsp_feat_dict = {}
gsp_feat_dict["column"] = gconstruct_feat_dict["feature_col"][0]
if "feature_name" in gconstruct_feat_dict:
gsp_feat_dict["name"] = gconstruct_feat_dict["feature_name"]

gsp_transformation_dict: dict[str, Any] = {}
if "transform" in gconstruct_feat_dict:
gconstruct_transform_dict = gconstruct_feat_dict["transform"]

if gconstruct_transform_dict["name"] == "max_min_norm":
gsp_transformation_dict["name"] = "numerical"
gsp_transformation_dict["kwargs"] = {"normalizer": "min-max", "imputer": "mean"}
# TODO: Add support for other common transformations here
else:
raise ValueError(
"Unsupported GConstruct transformation name: "
f"{gconstruct_transform_dict['name']}"
)
else:
gsp_transformation_dict["name"] = "no-op"

if "out_dtype" in gconstruct_feat_dict:
assert (
ele["out_dtype"] == "float32"
gconstruct_feat_dict["out_dtype"] == "float32"
), "GSProcessing currently only supports float32 features"
if "feature_name" in ele:
feat_dict["name"] = ele["feature_name"]
return feats_list

gsp_feat_dict["transformation"] = gsp_transformation_dict
gsp_feats_list.append(gsp_feat_dict)

return gsp_feats_list

@staticmethod
def convert_nodes(nodes_entries):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,13 @@
Configuration parsing for edges and nodes
"""
from abc import ABC

from typing import Any, Dict, List, Optional, Sequence
import logging

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES
from .label_config_base import LabelConfig, EdgeLabelConfig, NodeLabelConfig
from .feature_config_base import FeatureConfig, NoopFeatureConfig
from .numerical_configs import MultiNumericalFeatureConfig, NumericalFeatureConfig
from .data_config_base import DataStorageConfig


Expand Down Expand Up @@ -49,9 +50,14 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
return NoopFeatureConfig(feature_dict)

transformation_name = feature_dict["transformation"]["name"]
logging.info("Transformation name: %s", transformation_name)

if transformation_name == "no-op":
return NoopFeatureConfig(feature_dict)
elif transformation_name == "numerical":
return NumericalFeatureConfig(feature_dict)
elif transformation_name == "multi-numerical":
return MultiNumericalFeatureConfig(feature_dict)
else:
raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,8 +82,13 @@ def _sanity_check(self) -> None:


class NoopFeatureConfig(FeatureConfig):
"""
Feature configuration for features that do not need to be transformed.
"""Feature configuration for features that do not need to be transformed.
Supported kwargs
----------------
separator: str
When provided will treat the input as strings, split each value in the string using
the separator, and convert the resulting list of floats into a float-vector feature.
"""

def __init__(self, config: Mapping):
Expand All @@ -98,4 +103,4 @@ def __init__(self, config: Mapping):
def _sanity_check(self) -> None:
super()._sanity_check()
if self._data_config and self.value_separator and self._data_config.format != "csv":
raise RuntimeError("value_separator should only be provided for CSV data")
raise RuntimeError("separator should only be provided for CSV data")
3 changes: 0 additions & 3 deletions graphstorm-processing/graphstorm_processing/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,9 +19,6 @@
MISSING_CATEGORY = "GSP_CONSTANT_UNKNOWN"
SINGLE_CATEGORY_COL = "SINGLE_CATEGORY"

################ Multi-numerical Limits #####################
MAX_COLUMNS_TO_IMPUTE = 50

SUPPORTED_FILE_TYPES = ["csv", "parquet"]

################### Label Properties ########################
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,17 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
import logging

from pyspark.sql import DataFrame

from graphstorm_processing.config.feature_config_base import FeatureConfig
from .dist_transformations import DistributedTransformation, NoopTransformation
from .dist_transformations import (
DistributedTransformation,
NoopTransformation,
DistNumericalTransformation,
DistMultiNumericalTransformation,
)


class DistFeatureTransformer(object):
Expand All @@ -32,9 +39,15 @@ def __init__(self, feature_config: FeatureConfig):
self.transformation: DistributedTransformation

default_kwargs = {"cols": feature_config.cols}
logging.info("Feature name: %s", feat_name)
logging.info("Transformation type: %s", feat_type)

if feat_type == "no-op":
self.transformation = NoopTransformation(**default_kwargs, **args_dict)
elif feat_type == "numerical":
self.transformation = DistNumericalTransformation(**default_kwargs, **args_dict)
elif feat_type == "multi-numerical":
self.transformation = DistMultiNumericalTransformation(**default_kwargs, **args_dict)
else:
raise NotImplementedError(
f"Feature {feat_name} has type: {feat_type} that is not supported"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,7 @@
)
from .dist_noop_transformation import NoopTransformation
from .dist_label_transformation import DistSingleLabelTransformation, DistMultiLabelTransformation
from .dist_numerical_transformation import (
DistMultiNumericalTransformation,
DistNumericalTransformation,
)
Loading

0 comments on commit 8551c37

Please sign in to comment.