[GSProcessing] Add pre-computed categorical transformation loading #870

Merged
merged 4 commits into from
Jun 17, 2024


@thvasilo thvasilo commented Jun 10, 2024

Issue #, if available:

Description of changes:

To re-apply the categorical transformations created by the code in #857, we first read the saved JSON file and build a mapping from each original string to its one-hot representation, then use a UDF to apply the mapping(s) to the column(s).
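As a rough illustration of the mapping step, here is a minimal plain-Python sketch. It assumes the saved JSON stores an ordered list of category strings under a key named `labels`; that key name, the helper names, and the all-zero vector for unseen values are assumptions made for this example, not the actual GSProcessing layout. In the PR the per-value lookup runs inside a Spark UDF; here it is an ordinary function over a list so that the example stays self-contained.

```python
import json

# Hypothetical saved representation; the real JSON written by GSProcessing may differ.
SAVED_JSON = '{"labels": ["apple", "banana", "cherry"]}'


def build_one_hot_mapping(labels):
    """Map each original string to a one-hot vector, ordered as in `labels`."""
    width = len(labels)
    return {
        label: [1.0 if pos == idx else 0.0 for pos in range(width)]
        for idx, label in enumerate(labels)
    }


def encode_column(values, mapping):
    """Look up every value; unseen values become all zeros (an assumption)."""
    width = len(mapping)
    return [list(mapping.get(value, [0.0] * width)) for value in values]


mapping = build_one_hot_mapping(json.loads(SAVED_JSON)["labels"])
encoded = encode_column(["cherry", "apple", "durian"], mapping)
```

Because the mapping is rebuilt from the saved label order, the same string always lands in the same one-hot position across runs.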

The DistributedTransformation class, from which all transformation implementations inherit, gains a new function, apply_precomputed_transformation. When a pre-computed transformation JSON file exists in the input and the feature is one of those listed in that file, we call this function to re-apply the existing transformation instead of creating a new one.

The default implementation for apply_precomputed_transformation is to log a warning and apply a new transformation.
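The dispatch and the default fallback could look roughly like the sketch below. Only the names `DistributedTransformation`, `apply`, `apply_precomputed_transformation`, and `json_representation` come from the description above; the signatures, the list-of-dicts stand-in for a Spark DataFrame, `UppercaseTransformation`, and `transform_feature` are hypothetical.

```python
import logging


class DistributedTransformation:
    """Simplified stand-in for the GSProcessing base class (signatures assumed)."""

    def __init__(self, cols):
        self.cols = cols
        self.json_representation = {}

    def apply(self, rows):
        raise NotImplementedError

    def apply_precomputed_transformation(self, rows, precomputed):
        # Default behaviour: warn, ignore the saved dict, compute a new transformation.
        logging.warning(
            "%s has no pre-computed implementation, applying a new transformation",
            type(self).__name__,
        )
        return self.apply(rows)


class UppercaseTransformation(DistributedTransformation):
    """Hypothetical transformation that does not override the pre-computed hook."""

    def apply(self, rows):
        return [{**row, **{col: row[col].upper() for col in self.cols}} for row in rows]


def transform_feature(transformation, feature_name, rows, precomputed_by_feature):
    """Re-apply a saved transformation when the feature is listed, else create a new one."""
    if feature_name in precomputed_by_feature:
        return transformation.apply_precomputed_transformation(
            rows, precomputed_by_feature[feature_name]
        )
    return transformation.apply(rows)


rows = [{"name": "alice"}]
upper = UppercaseTransformation(["name"])
listed = transform_feature(upper, "name", rows, {"name": {}})
unlisted = transform_feature(upper, "name", rows, {})
```

Both calls return the same rows here, because a transformation that has not overridden the hook falls back to `apply()` after logging the warning.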

When we implement a pre-computed transform for a new transformation (e.g. numerical) we need to:

  • Ensure that the transformation's self.json_representation is populated during the call to apply(), so that the transformation info is saved in the output JSON.
  • Override the apply_precomputed_transformation function (as we did for DistCategoryTransformation here), so that it uses the dict loaded from the JSON file to re-apply the transformation to the new data.
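The two steps above can be sketched as follows. This is a simplified illustration, not the code of `DistCategoryTransformation`: `SketchCategoryTransformation`, the `labels` key, the method signatures, and the use of plain Python lists in place of Spark DataFrames are all assumptions.

```python
import json


class DistributedTransformation:
    """Simplified stand-in for the GSProcessing base class (signatures assumed)."""

    def __init__(self, cols):
        self.cols = cols
        self.json_representation = {}

    def apply(self, values):
        raise NotImplementedError

    def apply_precomputed_transformation(self, values, precomputed):
        return self.apply(values)


class SketchCategoryTransformation(DistributedTransformation):
    """Hypothetical one-hot transformation that supports re-application."""

    def apply(self, values):
        labels = sorted(set(values))
        # Step 1: populate json_representation so it is written to the output JSON.
        self.json_representation = {"labels": labels}
        return self._encode(values, labels)

    def apply_precomputed_transformation(self, values, precomputed):
        # Step 2: reuse the dict loaded from the JSON file instead of refitting.
        self.json_representation = precomputed
        return self._encode(values, precomputed["labels"])

    @staticmethod
    def _encode(values, labels):
        index = {label: pos for pos, label in enumerate(labels)}
        encoded = []
        for value in values:
            vector = [0.0] * len(labels)
            if value in index:  # unseen values stay all-zero (an assumption)
                vector[index[value]] = 1.0
            encoded.append(vector)
        return encoded


first = SketchCategoryTransformation(["fruit"])
first_output = first.apply(["banana", "apple"])

# Round-trip through JSON, as a second run would after loading the saved file.
saved = json.loads(json.dumps(first.json_representation))
second = SketchCategoryTransformation(["fruit"])
second_output = second.apply_precomputed_transformation(["apple", "kiwi"], saved)
```

In the second run, "apple" keeps the position it was assigned in the first run even though "banana" is absent from the new data, which is the point of re-applying rather than refitting.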

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo thvasilo added 0.3 gsprocessing For issues and PRs related to the GSProcessing library labels Jun 10, 2024
@thvasilo thvasilo added this to the 0.3 release milestone Jun 10, 2024
@thvasilo thvasilo self-assigned this Jun 10, 2024
@thvasilo thvasilo added the ready able to trigger the CI label Jun 10, 2024
@thvasilo
Contributor Author

Any concerns left for this PR?

@classicsong
Contributor

I am OK with the PR.
Please check with @jalencato.

Collaborator

@jalencato jalencato left a comment

LGTM

@thvasilo thvasilo merged commit 5199149 into awslabs:main Jun 17, 2024
3 checks passed
thvasilo added a commit that referenced this pull request Jul 11, 2024
…latent bugs (#915)

*Issue #, if available:*

*Description of changes:*
* During the refactor in #870 we made the loader a class variable of
DistributedExecutor, but because the S3 path is not unit tested we
missed one case: the output is on S3 and the user requests
repartitioning on the leader.
* This error was actually picked up by mypy. I also fixed some other
potential issues and type annotations here.

### Testing

Pre-commit, unit tests, and one test SageMaker job all succeed.

The S3 codepath can only be integration tested currently.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.