[GSProcessing] Add transformation saving and re-applying for numerical transforms.
thvasilo committed Nov 8, 2024
1 parent af7ae16 commit 81ec25a
Showing 5 changed files with 511 additions and 83 deletions.
6 changes: 4 additions & 2 deletions docs/source/cli/graph-construction/distributed/example.rst
@@ -259,7 +259,9 @@ the graph structure, features, and labels. In more detail:
GSProcessing will use the transformation values listed here
instead of creating new ones, ensuring that models trained with the original
data can still be used in the newly transformed data. Currently only
- categorical transformations can be re-applied.
+ categorical and numerical transformations can be re-applied. Note that
+ the Rank-Gauss transformation cannot support re-application, it can
+ only work for transductive tasks.
* ``updated_row_counts_metadata.json``:
This file is meant to be used as the input configuration for the
distributed partitioning pipeline. ``gs-repartition`` produces
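
The re-application behavior described in this hunk can be illustrated with a minimal sketch. It assumes a simple min-max transformation whose parameters are stored as JSON. The function names, the JSON layout, and the clipping of out-of-range values are all hypothetical choices made for illustration; none of them are taken from GSProcessing.

```python
import json


def fit_min_max(values):
    """Compute min-max parameters from the original (training) data."""
    return {"min": min(values), "max": max(values)}


def apply_min_max(values, params):
    """Scale values with previously computed parameters, clipping to [0, 1]."""
    lo, hi = params["min"], params["max"]
    span = hi - lo
    if span == 0:
        # Degenerate case: every fitted value was identical.
        return [0.0 for _ in values]
    return [min(max((v - lo) / span, 0.0), 1.0) for v in values]


# First run: fit on the original data and persist the parameters.
train = [10.0, 20.0, 30.0]
saved = json.dumps(fit_min_max(train))  # hypothetical layout, for illustration only

# Later run: re-apply the saved parameters to new data instead of refitting.
new = [15.0, 40.0]
reapplied = apply_min_max(new, json.loads(saved))

# Refitting on the new data would place the same inputs on a different scale.
refit = apply_min_max(new, fit_min_max(new))
```

With the saved parameters, `15.0` maps to the same position it would have had in the original data, which is what lets a model trained on the original features consume the new ones. A rank-based transformation such as Rank-Gauss presumably has no comparably compact set of parameters to save, since each output depends on a value's rank within the whole dataset.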
@@ -313,7 +315,7 @@ you can use the following command to run the partition job locally:
--num-parts 2 \
--dgl-tool-path ./dgl/tools \
--partition-algorithm random \
- --ip-config ip_list.txt
+ --ip-config ip_list.txt
The command above will first do graph partitioning to determine the ownership for each partition and save the results.
Then it will do data dispatching to physically assign the partitions to graph data and dispatch them to each machine.
@@ -67,7 +67,7 @@ def get_transformation_name() -> str:
return "DistBucketNumericalTransformation"

def apply(self, input_df: DataFrame) -> DataFrame:
- imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df)
+ imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df).imputed_df
# TODO: Make range optional by getting min/max from data.
min_val, max_val = self.range
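
The switch from `apply_imputation(...)` to `apply_imputation(...).imputed_df` suggests the function now returns a wrapper object rather than a bare DataFrame. What else that object carries is not visible in this hunk. The sketch below shows one plausible shape for such a pattern, using plain Python lists in place of Spark DataFrames. `ImputationResult`, `impute_mean`, and their fields are hypothetical names, not the GSProcessing API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImputationResult:
    """Hypothetical result wrapper: the imputed data plus the value used to fill gaps."""

    imputed_values: List[float]
    fill_value: float


def impute_mean(
    values: List[Optional[float]], fill_value: Optional[float] = None
) -> ImputationResult:
    """Fill missing entries with the mean, or with a precomputed fill value if given."""
    if fill_value is None:
        present = [v for v in values if v is not None]
        if not present:
            raise ValueError("cannot compute a mean from all-missing data")
        fill_value = sum(present) / len(present)
    imputed = [fill_value if v is None else v for v in values]
    return ImputationResult(imputed_values=imputed, fill_value=fill_value)


# First run: the fill value is computed from the data and returned with the result.
first = impute_mean([1.0, None, 3.0])

# Later run: the saved fill value is re-applied instead of being recomputed.
second = impute_mean([None, 10.0], fill_value=first.fill_value)
```

Returning the fitted value alongside the data lets a caller persist it after the first run, while callers that only need the data read a single attribute, as the changed line above does with `.imputed_df`.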
