[GSProcessing] Add saving and re-applying for numerical transforms. (#1085)

*Issue #, if available:* Fixes #985

*Description of changes:*

* We introduce saving and re-applying of numerical transformations for all transforms except rank-gauss, which by definition cannot be reapplied.
* For the more complex transformations we re-construct the original PySpark transformer objects by retaining the values needed for each transformation (e.g. the min and max values), creating tiny DFs that contain only those numbers, and re-training the transformer on that tiny dataset. Applying the re-trained transformer to the desired data then yields the same result as the original.
* To reduce code duplication we pull the core computations for standard and min-max normalization out into their own functions (`_apply_standard_transform`, `_apply_minmax_transform`), which both the original transformation and the re-applied one can call. The presence or absence of pre-computed statistics in the function call determines which code path we follow.
* We modify the `apply_imputation` and `apply_norm` functions to also return the transform's representation along with the transformed DF. We encapsulate the return values in their own dataclasses (`ImputationResult`, `NormalizationResult`) to make future modifications easier (new fields can be added without changing the functions' return types).
* Introduce new tests to check all re-construction cases.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: jalencato <[email protected]>
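The "re-train on a tiny dataset" trick described above can be sketched in plain Python. The real code rebuilds PySpark transformer objects; the hand-rolled `MinMaxScaler` below is a hypothetical stand-in so the example is self-contained, not GSProcessing's or PySpark's actual API.

```python
class MinMaxScaler:
    """Minimal min-max scaler: fit() learns the min/max, transform() rescales to [0, 1]."""

    def fit(self, rows):
        self.min_ = min(rows)
        self.max_ = max(rows)
        return self

    def transform(self, rows):
        span = self.max_ - self.min_
        return [(x - self.min_) / span for x in rows]


# Original run: fit on the full data and save only the learned statistics.
data = [3.0, 7.0, 5.0, 9.0]
original = MinMaxScaler().fit(data)
saved_representation = {"min": original.min_, "max": original.max_}

# Re-apply: build a tiny dataset containing exactly the saved min and max,
# re-fit a fresh scaler on it, and its transform matches the original's.
tiny_dataset = [saved_representation["min"], saved_representation["max"]]
reapplied = MinMaxScaler().fit(tiny_dataset)

assert reapplied.transform(data) == original.transform(data)
```

Because min-max scaling is fully determined by the two saved numbers, fitting on the two-row dataset recovers an identical transformer; the same idea extends to any transform whose fitted state can be reproduced from a few retained values.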
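The dataclass-return and dual-code-path design can likewise be sketched without Spark. The names mirror the PR (`NormalizationResult`, `_apply_standard_transform`), but the function below is a simplified illustration using z-score normalization on plain lists; the real functions operate on PySpark DataFrames and their exact formula may differ.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import Optional


@dataclass
class NormalizationResult:
    """Bundles the transformed data with the statistics needed to re-apply
    the transform later. Wrapping the return values in a dataclass means
    future fields can be added without changing the function's return type."""
    transformed: list
    representation: dict


def _apply_standard_transform(values, stats: Optional[dict] = None) -> NormalizationResult:
    # Pre-computed statistics mean we are *re-applying* a saved transform;
    # otherwise we compute fresh statistics from the data itself.
    if stats is None:
        stats = {"mean": fmean(values), "std": pstdev(values)}
    transformed = [(v - stats["mean"]) / stats["std"] for v in values]
    return NormalizationResult(transformed, stats)


# First pass computes and saves the statistics; the second pass re-uses
# them and produces identical output.
first = _apply_standard_transform([1.0, 2.0, 3.0, 4.0])
second = _apply_standard_transform([1.0, 2.0, 3.0, 4.0], stats=first.representation)
assert first.transformed == second.transformed
```

Keeping both code paths inside one helper is what lets the original transformation and the re-applied one share the same core computation.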