Skip to content

Commit

Permalink
[GSProcessing] Add saving and re-applying for numerical transforms. (#…
Browse files Browse the repository at this point in the history
…1085)

*Issue #, if available:*

Fixes #985 

*Description of changes:*

* We introduce saving and re-applying numerical transformations for all
transforms except rank-gauss, which by definition cannot be reapplied.
* For the more complex transformations we re-construct the original
PySpark transformer objects, by retaining the values needed for each
transformation (e.g. the min and max values), creating tiny DFs that
only contain those numbers and re-training the transformer on that tiny
dataset. Then we can apply the trained transformer to the desired data
and we get the same result.
* To reduce code duplication we pull out the core computations for
standard and min-max normalization into their own functions
(`_apply_standard_transform`, `_apply_minmax_transform`), which we can
call from both the original transformation and the re-applied one. The
presence or absence of pre-computed statistics in the function call
determines which code path we follow.
* We modify the `apply_imputation` and `apply_norm` functions to also
return the representation along with the transformed DF. We encapsulate
the return values in their own dataclass (`ImputationResult`,
`NormalizationResult`), to make future modifications easier (by not
requiring to change the function's return type).
* Introduce new tests to check all re-construction cases.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: jalencato <[email protected]>
  • Loading branch information
3 people authored Nov 14, 2024
1 parent ab9d143 commit d0873f1
Show file tree
Hide file tree
Showing 7 changed files with 626 additions and 85 deletions.
6 changes: 4 additions & 2 deletions docs/source/cli/graph-construction/distributed/example.rst
Original file line number Diff line number Diff line change
Expand Up @@ -259,7 +259,9 @@ the graph structure, features, and labels. In more detail:
GSProcessing will use the transformation values listed here
instead of creating new ones, ensuring that models trained with the original
data can still be used in the newly transformed data. Currently only
categorical transformations can be re-applied.
categorical and numerical transformations can be re-applied. Note that
the Rank-Gauss transformation does not support re-application, it may
only work for transductive tasks.
* ``updated_row_counts_metadata.json``:
This file is meant to be used as the input configuration for the
distributed partitioning pipeline. ``gs-repartition`` produces
Expand Down Expand Up @@ -313,7 +315,7 @@ you can use the following command to run the partition job locally:
--num-parts 2 \
--dgl-tool-path ./dgl/tools \
--partition-algorithm random \
--ip-config ip_list.txt
--ip-config ip_list.txt
The command above will first do graph partitioning to determine the ownership for each partition and save the results.
Then it will do data dispatching to physically assign the partitions to graph data and dispatch them to each machine.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@
)


class DistFeatureTransformer(object):
class DistFeatureTransformer:
"""
Given a feature configuration selects the correct transformation type,
which can then be be applied through a call to apply_transformation.
Expand All @@ -56,7 +56,9 @@ def __init__(
if feat_type == "no-op":
self.transformation = NoopTransformation(**default_kwargs, **args_dict)
elif feat_type == "numerical":
self.transformation = DistNumericalTransformation(**default_kwargs, **args_dict)
self.transformation = DistNumericalTransformation(
**default_kwargs, **args_dict, json_representation=json_representation
)
elif feat_type == "multi-numerical":
self.transformation = DistMultiNumericalTransformation(**default_kwargs, **args_dict)
elif feat_type == "bucket-numerical":
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ def get_transformation_name() -> str:
return "DistBucketNumericalTransformation"

def apply(self, input_df: DataFrame) -> DataFrame:
imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df)
imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df).imputed_df
# TODO: Make range optional by getting min/max from data.
min_val, max_val = self.range

Expand Down
Loading

0 comments on commit d0873f1

Please sign in to comment.