Handle different categories in training vs holdout data for Ordinal Encoder #3889

Open
tamargrey opened this issue Dec 13, 2022 · 0 comments

If the holdout data contains categories that weren't present in the training data, the OrdinalEncoder will fail unless handle_unknown and unknown_value are set appropriately. This is a problem for the initial integration of the OrdinalEncoder into AutoMLSearch, because the default value for handle_unknown is "error".
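To make the failure mode concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the EvalML or scikit-learn implementation: `encode_ordinal` is a hypothetical toy function, and the `"use_encoded_value"` option name is borrowed from scikit-learn's OrdinalEncoder on the assumption that the component follows the same convention.

```python
def encode_ordinal(values, categories, handle_unknown="error", unknown_value=None):
    """Toy ordinal encoder (hypothetical, for illustration only).

    Maps each value to its position in `categories`. Values outside
    `categories` either raise (handle_unknown="error") or are replaced
    with `unknown_value` (handle_unknown="use_encoded_value").
    """
    index = {category: position for position, category in enumerate(categories)}
    encoded = []
    for value in values:
        if value in index:
            encoded.append(float(index[value]))
        elif handle_unknown == "use_encoded_value":
            encoded.append(unknown_value)
        else:
            raise ValueError(f"Found unknown category {value!r} during transform")
    return encoded


# Categories learned from the training data.
train_categories = ["low", "medium", "high"]

# Holdout data containing a category the training data never had.
holdout = ["medium", "extreme"]

# Default behavior: the unseen category raises an error.
try:
    encode_ordinal(holdout, train_categories)
except ValueError as err:
    print(f"default settings fail: {err}")

# Graceful handling: the unseen category is encoded as NaN instead.
print(encode_ordinal(holdout, train_categories,
                     handle_unknown="use_encoded_value",
                     unknown_value=float("nan")))
```

With the defaults, the holdout transform fails outright. Setting both parameters together lets the transform complete, at the cost of introducing a sentinel value that downstream components must be able to handle.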

This can also be a problem for the Ordinal logical type, which sets its order from the categories that are present in the data. If we try to set the instantiated Ordinal logical type on holdout data with different categories, Woodwork may raise an error saying that the data contains values that are not present in the order values provided. We should investigate when this Woodwork error can be triggered. I've opened an issue in Woodwork to consider ways to handle this (alteryx/woodwork#1598).

We should look into how we can handle this. We have several options:

  1. Handle this as part of AutoML search, in the OrdinalEncoder instantiation, by setting the parameters so that unknowns are handled gracefully - I think this may make the most sense, and it could give users further control over how they want to handle those unknown values.
  2. Wait to set the encoder's categories until transform, or allow the categories to be updated at transform. I think deferring the categories entirely until transform puts too much logic into transform, and it could create the reverse problem of missing categories from the training data. More likely, we will want to consider allowing users to expand the categories if needed.
  3. Change the default value for handle_unknown so that it no longer errors - maybe encode unknown values as NaN instead?
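
As a rough illustration of the "expand the categories" idea in option 2, the sketch below merges unseen holdout values into the training categories. `expand_categories` is a hypothetical helper, not an existing EvalML or Woodwork function, and appending new values at the end of the order is an arbitrary assumption: for a truly ordinal feature, the user would likely need to say where each new category belongs.

```python
def expand_categories(train_categories, holdout_values):
    """Return the training categories plus any unseen holdout values.

    Hypothetical helper for illustration. Unseen values are appended
    in first-seen order, which keeps the encodings of the original
    training categories unchanged.
    """
    expanded = list(train_categories)
    seen = set(expanded)
    for value in holdout_values:
        if value not in seen:
            expanded.append(value)
            seen.add(value)
    return expanded


train_categories = ["low", "medium", "high"]
holdout = ["medium", "extreme", "low", "extreme"]

print(expand_categories(train_categories, holdout))
# ['low', 'medium', 'high', 'extreme']
```

Because the original categories keep their positions, values encoded before the expansion stay valid. The open question is whether a model trained without the new category can do anything meaningful with its encoding.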