[GSProcessing] Add transformation saving and re-applying for numerical transforms.
awslabs · Nov 8, 2024 · 81ec25a
1 parent af7ae16
commit 81ec25a

Showing 5 changed files with 511 additions and 83 deletions.
diff --git a/docs/source/cli/graph-construction/distributed/example.rst b/docs/source/cli/graph-construction/distributed/example.rst
@@ -259,7 +259,9 @@ the graph structure, features, and labels. In more detail:
   GSProcessing will use the transformation values listed here
   instead of creating new ones, ensuring that models trained with the original
   data can still be used in the newly transformed data. Currently only
-  categorical transformations can be re-applied.
+  categorical and numerical transformations can be re-applied. Note that
+  the Rank-Gauss transformation cannot support re-application, it can
+  only work for transductive tasks.
 * ``updated_row_counts_metadata.json``:
   This file is meant to be used as the input configuration for the
   distributed partitioning pipeline. ``gs-repartition`` produces
@@ -313,7 +315,7 @@ you can use the following command to run the partition job locally:
         --num-parts 2 \
         --dgl-tool-path ./dgl/tools \
         --partition-algorithm random \
-        --ip-config ip_list.txt 
+        --ip-config ip_list.txt
 
 The command above will first do graph partitioning to determine the ownership for each partition and save the results.
 Then it will do data dispatching to physically assign the partitions to graph data and dispatch them to each machine.
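
The documentation change above says that saved numerical transformation values are re-applied rather than recomputed, so that a model trained on the original data still sees consistently scaled features. The following is a minimal, library-free sketch of that idea for a min-max scaling, not GSProcessing's actual implementation. The helper names (`fit_min_max`, `apply_min_max`), the feature name `age`, and the JSON layout are all hypothetical and chosen only for illustration.

```python
import json


def fit_min_max(values):
    """Compute the statistics a min-max scaling needs (hypothetical helper)."""
    return {"min": min(values), "max": max(values)}


def apply_min_max(values, stats):
    """Scale values with previously computed statistics (hypothetical helper)."""
    span = stats["max"] - stats["min"]
    if span == 0:
        # Degenerate case: every original value was identical.
        return [0.0 for _ in values]
    return [(v - stats["min"]) / span for v in values]


# First run: compute the statistics on the original data and save them.
original_values = [10.0, 20.0, 30.0]
saved_json = json.dumps({"age": fit_min_max(original_values)})  # assumed layout

# Later run: new data arrives; load the saved statistics instead of refitting.
new_values = [15.0, 40.0]
loaded_stats = json.loads(saved_json)["age"]
reapplied = apply_min_max(new_values, loaded_stats)

# For contrast: refitting on the new data maps the same inputs differently.
refit = apply_min_max(new_values, fit_min_max(new_values))

print(reapplied)  # [0.25, 1.5]
print(refit)      # [0.0, 1.0]
```

With the saved statistics, 15.0 maps to 0.25 on both the original and the new data. Refitting on the new data alone would map it to 0.0, which a model trained on the original scaling would misread. This sketch does not clip out-of-range values such as 40.0; whether to clip is a separate design choice.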

diff --git a/...cessing/data_transformations/dist_transformations/dist_bucket_numerical_transformation.py b/...cessing/data_transformations/dist_transformations/dist_bucket_numerical_transformation.py
@@ -67,7 +67,7 @@ def get_transformation_name() -> str:
         return "DistBucketNumericalTransformation"
 
     def apply(self, input_df: DataFrame) -> DataFrame:
-        imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df)
+        imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df).imputed_df
         # TODO: Make range optional by getting min/max from data.
         min_val, max_val = self.range