[GSProcessing] Add support for numerical and multi-numerical transformations.
thvasilo committed Oct 17, 2023
1 parent fe8d317 commit 27d89b4
Showing 18 changed files with 1,061 additions and 70 deletions.
75 changes: 57 additions & 18 deletions docs/source/gs-processing/developer/input-configuration.rst
@@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
-directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
-that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, as users can
rely on the automatic conversion.
@@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of the configuration file being used. We include
  the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
-of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.
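For orientation, a top-level file might look like the following minimal sketch (the ``edges``/``nodes`` keys are taken from the OAG-Paper example at the end of this page, and the version string matches the one the converter emits below):

    {
        "version": "gsprocessing-v1.0",
        "graph": {
            "edges": [],
            "nodes": []
        }
    }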

We describe the ``graph`` object next.

-``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
@@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{"column": String, "type": String}``.
- ``relation``: (JSON object, required): Describes the relation
-modeled by the edges. A relation can be common among all edges, or it
-can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
-``movie`` we can have a relation type ``interacted_with`` for an
-edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
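Putting the edge-related keys above together, a single edge configuration entry could look like the sketch below. This is only an illustration: the ``dest`` key, the label ``type`` value, the ``split_rate`` field names, and the file path are assumptions based on the surrounding text, since parts of this page are collapsed in the diff.

    {
        "data": {
            "format": "csv",
            "separator": ",",
            "files": ["edges/user_rated_movie.csv"]
        },
        "source": {"column": "src_id", "type": "user"},
        "dest": {"column": "dst_id", "type": "movie"},
        "relation": {"type": "rated"},
        "labels": [
            {
                "column": "rating",
                "type": "classification",
                "split_rate": {"train": 0.8, "val": 0.1, "test": 0.1}
            }
        ]
    }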
@@ -248,12 +247,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
-assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
-the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the next section, :ref:`features-object`
for details.

--------------
@@ -285,7 +284,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
-feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
@@ -309,7 +308,7 @@ can contain the following top-level keys:
    # Example node config with multiple features
    {
-       # This is where the node structure data exist just need an id col
        # This is where the node structure data exist, just need an id col in these files
        "data": {
            "format": "parquet",
            "files": ["path/to/node_ids"]
@@ -356,7 +355,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
-in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

@@ -373,6 +372,35 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

  - Transforms a numerical column using a missing data imputer and an
    optional normalizer.
  - ``kwargs``:

    - ``imputer`` (String, optional): A method to fill in missing values in the data.
      Valid values are:
      ``mean`` (Default), ``median``, and ``most_frequent``. Missing values will be replaced
      with the respective value computed from the data.
    - ``normalizer`` (String, optional): Applies a normalization to the data, after
      imputation. Can take the following values:

      - ``none``: (Default) Don't normalize the numerical values during encoding.
      - ``min-max``: Normalize each value by subtracting the minimum value from it,
        and then dividing it by the difference between the maximum value and the minimum.
      - ``standard``: Normalize each value by dividing it by the sum of all the values.

- ``multi-numerical``

  - Column-wise transformation for vector-like numerical data using a missing data imputer and an
    optional normalizer.
  - ``kwargs``:

    - ``imputer`` (String, optional): Same as for the ``numerical`` transformation; applies
      ``mean`` imputation by default.
    - ``normalizer`` (String, optional): Same as for the ``numerical`` transformation; no
      normalization is applied by default.
    - ``separator`` (String, optional): Same as for the ``no-op`` transformation, used to separate numerical
      values in CSV input. If the input data are in Parquet format, each value in the
      column is assumed to be an array of floats.
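To illustrate the two new transformations, here is a sketch of a ``features`` list using them (the column names are invented for the example; the ``separator`` assumes CSV input):

    "features": [
        {
            "column": "age",
            "transformation": {
                "name": "numerical",
                "kwargs": {"imputer": "median", "normalizer": "min-max"}
            }
        },
        {
            "column": "ratings",
            "transformation": {
                "name": "multi-numerical",
                "kwargs": {"imputer": "mean", "normalizer": "none", "separator": "|"}
            }
        }
    ]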

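In plain Python, the imputer-then-normalizer pipeline described above behaves roughly like the sketch below; this is an illustration of the documented semantics, not the actual Spark implementation:

    import statistics

    def impute(values, imputer="mean"):
        """Fill None entries with the column statistic named by `imputer`."""
        present = [v for v in values if v is not None]
        fill = {
            "mean": statistics.mean,
            "median": statistics.median,
            "most_frequent": statistics.mode,
        }[imputer](present)
        return [fill if v is None else v for v in values]

    def normalize(values, normalizer="none"):
        """Apply the documented normalizers to an already-imputed column."""
        if normalizer == "none":
            return values
        if normalizer == "min-max":
            low, high = min(values), max(values)
            return [(v - low) / (high - low) for v in values]
        if normalizer == "standard":
            total = sum(values)
            return [v / total for v in values]
        raise ValueError(f"Unknown normalizer: {normalizer}")

    # Example: mean imputation followed by min-max normalization
    print(normalize(impute([1.0, None, 3.0], "mean"), "min-max"))  # [0.0, 0.5, 1.0]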
--------------

@@ -403,15 +431,26 @@ OAG-Paper dataset
        ],
        "nodes" : [
            {
-                "type": "paper",
-                "column": "ID",
                "data": {
                    "format": "csv",
                    "separator": ",",
                    "files": [
                        "node_feat.csv"
                    ]
                },
                "type": "paper",
                "column": "ID",
                "features": [
                    {
                        "column": "n_citation",
                        "transformation": {
                            "name": "numerical",
                            "kwargs": {
                                "imputer": "mean",
                                "normalizer": "min-max"
                            }
                        }
                    }
                ],
                "labels": [
                    {
                        "column": "field",
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -23,6 +23,7 @@ Welcome to the GraphStorm Documentation and Tutorials
gs-processing/usage/example
gs-processing/usage/distributed-processing-setup
gs-processing/usage/amazon-sagemaker
gs-processing/developer/input-configuration

.. toctree::
:maxdepth: 1
@@ -89,8 +89,7 @@ def convert_to_gsprocessing(self, input_dictionary: dict) -> dict:

        gsprocessing_dict: dict[str, Any] = {}

-        # hardcode the version number for the first version
-        gsprocessing_dict["version"] = "gsprocessing-1.0"
        gsprocessing_dict["version"] = "gsprocessing-v1.0"
        gsprocessing_dict["graph"] = {}

        # deal with nodes
@@ -13,6 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any

from .converter_base import ConfigConverter
from .meta_configuration import NodeConfig, EdgeConfig

@@ -80,27 +82,40 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
        list[dict]
            The feature information in the GSProcessing format
        """
-        feats_list = []
        gsp_feats_list = []
        if feats in [[], [{}]]:
            return []
-        for ele in feats:
-            if "transform" in ele:
-                raise ValueError(
-                    "Currently only support no-op operation, "
-                    "we do not support any other no-op operation"
-                )
-            feat_dict = {}
-            kwargs = {"name": "no-op"}
-            for col in ele["feature_col"]:
-                feat_dict = {"column": col, "transform": kwargs}
-                feats_list.append(feat_dict)
-            if "out_dtype" in ele:
        for gconstruct_feat_dict in feats:
            gsp_feat_dict = {}
            gsp_feat_dict["column"] = gconstruct_feat_dict["feature_col"][0]
            if "feature_name" in gconstruct_feat_dict:
                gsp_feat_dict["name"] = gconstruct_feat_dict["feature_name"]

            gsp_transformation_dict: dict[str, Any] = {}
            if "transform" in gconstruct_feat_dict:
                gconstruct_transform_dict = gconstruct_feat_dict["transform"]

                if gconstruct_transform_dict["name"] == "max_min_norm":
                    gsp_transformation_dict["name"] = "numerical"
                    gsp_transformation_dict["kwargs"] = {"normalizer": "min-max", "imputer": "mean"}
                # TODO: Add support for other common transformations here
                else:
                    raise ValueError(
                        "Unsupported GConstruct transformation name: "
                        f"{gconstruct_transform_dict['name']}"
                    )
            else:
                gsp_transformation_dict["name"] = "no-op"

            if "out_dtype" in gconstruct_feat_dict:
                assert (
-                    ele["out_dtype"] == "float32"
                    gconstruct_feat_dict["out_dtype"] == "float32"
                ), "GSProcessing currently only supports float32 features"
-            if "feature_name" in ele:
-                feat_dict["name"] = ele["feature_name"]
-        return feats_list

            gsp_feat_dict["transformation"] = gsp_transformation_dict
            gsp_feats_list.append(gsp_feat_dict)

        return gsp_feats_list

    @staticmethod
    def convert_nodes(nodes_entries):
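# --- Editor's sketch (not part of the commit): expected behavior of _convert_feature.
# Per the branches above, a GConstruct feature using "max_min_norm" maps to the
# GSProcessing "numerical" transformation, e.g.:
#
#   _convert_feature([{"feature_col": ["n_citation"], "transform": {"name": "max_min_norm"}}])
#   == [{"column": "n_citation",
#        "transformation": {"name": "numerical",
#                           "kwargs": {"normalizer": "min-max", "imputer": "mean"}}}]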
@@ -16,12 +16,13 @@
Configuration parsing for edges and nodes
"""
from abc import ABC

from typing import Any, Dict, List, Optional, Sequence
import logging

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES
from .label_config_base import LabelConfig, EdgeLabelConfig, NodeLabelConfig
from .feature_config_base import FeatureConfig, NoopFeatureConfig
from .numerical_configs import MultiNumericalFeatureConfig, NumericalFeatureConfig
from .data_config_base import DataStorageConfig


@@ -49,9 +50,14 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:
        return NoopFeatureConfig(feature_dict)

    transformation_name = feature_dict["transformation"]["name"]
    logging.info("Transformation name: %s", transformation_name)

    if transformation_name == "no-op":
        return NoopFeatureConfig(feature_dict)
    elif transformation_name == "numerical":
        return NumericalFeatureConfig(feature_dict)
    elif transformation_name == "multi-numerical":
        return MultiNumericalFeatureConfig(feature_dict)
    else:
        raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

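A quick sketch of how the new dispatch could be exercised, assuming ``parse_feat_config`` is imported from this module and the feature dictionary follows the documented shape:

    feature_dict = {
        "column": "n_citation",
        "transformation": {
            "name": "numerical",
            "kwargs": {"imputer": "mean", "normalizer": "min-max"},
        },
    }
    feat_config = parse_feat_config(feature_dict)  # returns a NumericalFeatureConfig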
@@ -82,8 +82,13 @@ def _sanity_check(self) -> None:


class NoopFeatureConfig(FeatureConfig):
"""
Feature configuration for features that do not need to be transformed.
"""Feature configuration for features that do not need to be transformed.
Supported kwargs
----------------
separator: str
When provided will treat the input as strings, split each value in the string using
the separator, and convert the resulting list of floats into a float-vector feature.
"""

def __init__(self, config: Mapping):
@@ -98,4 +103,4 @@ def __init__(self, config: Mapping):
    def _sanity_check(self) -> None:
        super()._sanity_check()
        if self._data_config and self.value_separator and self._data_config.format != "csv":
-            raise RuntimeError("value_separator should only be provided for CSV data")
            raise RuntimeError("separator should only be provided for CSV data")
@@ -0,0 +1,92 @@
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Mapping

from .feature_config_base import FeatureConfig


class NumericalFeatureConfig(FeatureConfig):
    """Feature configuration for single-column numerical features.

    Supported kwargs
    ----------------
    imputer: str
        A method to fill in missing values in the data. Valid values are:
        "mean" (Default), "median", and "most_frequent". Missing values will be replaced
        with the respective value computed from the data.
    normalizer: str
        A normalization to apply to each column. Valid values are
        "none", "min-max", and "standard".

        The transformation applied will be:

        * "none": (Default) Don't normalize the numerical values during encoding.
        * "min-max": Normalize each value by subtracting the minimum value from it,
          and then dividing it by the difference between the maximum value and the minimum.
        * "standard": Normalize each value by dividing it by the sum of all the values.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.imputer = self._transformation_kwargs.get("imputer", "mean")
        self.norm = self._transformation_kwargs.get("normalizer", "none")

        self._sanity_check()

    def _sanity_check(self) -> None:
        super()._sanity_check()
        valid_imputers = ["mean", "median", "most_frequent"]
        assert (
            self.imputer in valid_imputers
        ), f"Unknown imputer requested, expected one of {valid_imputers}, got {self.imputer}"
        valid_normalizers = ["none", "min-max", "standard"]
        assert (
            self.norm in valid_normalizers
        ), f"Unknown normalizer requested, expected one of {valid_normalizers}, got {self.norm}"


class MultiNumericalFeatureConfig(NumericalFeatureConfig):
    """Feature configuration for multi-column numerical features.

    Supported kwargs
    ----------------
    imputer: str
        A method to fill in missing values in the data. Valid values are:
        "mean" (Default), "median", and "most_frequent". Missing values will be replaced
        with the respective value computed from the data.
    normalizer: str
        A normalization to apply to each column. Valid values are
        "none", "min-max", and "standard".

        The transformation applied will be:

        * "none": (Default) Don't normalize the numerical values during encoding.
        * "min-max": Normalize each value by subtracting the minimum value from it,
          and then dividing it by the difference between the maximum value and the minimum.
        * "standard": Normalize each value by dividing it by the sum of all the values.
    separator: str, optional
        A separator to use when splitting a delimited string into multiple numerical values
        as a vector. Only applicable to CSV input. Example: for a separator `'|'` the CSV
        value `1|2|3` would be transformed to a vector, `[1, 2, 3]`. When `None` the expected
        input format is an array of numerical values.
    """

    def __init__(self, config: Mapping):
        super().__init__(config)
        self.separator = self._transformation_kwargs.get("separator", None)

        self._sanity_check()
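To close the loop, constructing one of the new config classes might look like this sketch; it assumes ``FeatureConfig`` reads ``column`` and the transformation ``kwargs`` from the mapping, as the attribute reads above suggest:

    config = MultiNumericalFeatureConfig(
        {
            "column": "ratings",
            "transformation": {
                "name": "multi-numerical",
                "kwargs": {"imputer": "median", "normalizer": "min-max", "separator": "|"},
            },
        }
    )
    print(config.imputer, config.norm, config.separator)  # median min-max |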
