Allow engineered features in pipeline with DFS Transformer to have partial dependence calculations done #3830
Conversation
@@ -318,8 +318,8 @@ def _partial_dependence_calculation(
    X_eval.ww[variable] = ww.init_series(
        part_dep_column,
        logical_type=X_eval.ww.logical_types[variable],
        origin=X_eval.ww.columns[variable].origin,
Ideally, we'd be able to init with the original column's schema (as we do in fast mode above), since that would make sure we don't lose any woodwork typing info as we replace this column. That's not something we can currently do with init_series, so for now we'll just explicitly set individual woodwork parameters. I opened a Woodwork issue to make schema-based init possible: alteryx/woodwork#1573
Should we also create a corresponding EvalML issue to update this once the WW issue is resolved so we don't forget about it?
Created: #3847
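For context, a minimal sketch of the workaround described above. The column name, logical type, origin value, and data are made up for illustration (only the init_series call mirrors the actual diff), and the column_origins argument to ww.init is an assumed way to seed the origin:

import pandas as pd
import woodwork as ww

# Illustrative frame standing in for X_eval; names and types are assumptions.
X_eval = pd.DataFrame({"ABSOLUTE(1)": [1.0, 2.0, 3.0]})
X_eval.ww.init(
    logical_types={"ABSOLUTE(1)": "Double"},
    column_origins={"ABSOLUTE(1)": "engineered"},  # assumed API for setting origin at init
)

variable = "ABSOLUTE(1)"
part_dep_column = pd.Series([4.0, 4.0, 4.0])

# init_series can't accept a full column schema yet (alteryx/woodwork#1573),
# so each typing attribute we care about is passed through explicitly.
X_eval.ww[variable] = ww.init_series(
    part_dep_column,
    logical_type=X_eval.ww.logical_types[variable],
    origin=X_eval.ww.columns[variable].origin,
)

# The origin survives the column replacement because it was passed explicitly.
assert X_eval.ww.columns[variable].origin == "engineered"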
Codecov Report
@@ Coverage Diff @@
## main #3830 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 344 344
Lines 36124 36175 +51
=======================================
+ Hits 35987 36038 +51
Misses 137 137
amazing!
changed_col_df.ww.init(
    logical_types={variable: X.ww.logical_types[variable]},
)
changed_col_df.ww.init(schema=X.ww.schema.get_subset_schema([variable]))
what's the difference here?
By passing in the schema, we get all of the woodwork information specified for that column. By passing in individual parameters, we only get the parameters we specify, which means we'd need to update this call whenever the woodwork type info we rely on changes.
So this will help us avoid a similar bug in the future if partial dependence ever comes to rely on some other woodwork property (for example, a semantic tag that's not tied to the logical type) - we'd miss it with the previous call because we weren't also passing along the semantic tags.
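As a rough illustration of that difference (the column name, logical type, and custom tag below are invented for the sketch, not taken from the PR):

import pandas as pd
import woodwork  # registers the .ww accessor on pandas objects

# Original frame with a custom semantic tag attached to the column.
X = pd.DataFrame({"age": [25, 40, 31]})
X.ww.init(
    logical_types={"age": "Integer"},
    semantic_tags={"age": {"my_custom_tag"}},
)

# Initializing a replacement column with only the logical type keeps the type
# but carries none of the other typing info, such as the custom tag.
changed_col_df = pd.DataFrame({"age": [26, 41, 32]})
changed_col_df.ww.init(logical_types={"age": X.ww.logical_types["age"]})
assert "my_custom_tag" not in changed_col_df.ww.semantic_tags["age"]

# Initializing from the subset schema carries everything along - semantic tags,
# origin, description, metadata - without listing each attribute by hand.
changed_col_df = pd.DataFrame({"age": [26, 41, 32]})
changed_col_df.ww.init(schema=X.ww.schema.get_subset_schema(["age"]))
assert "my_custom_tag" in changed_col_df.ww.semantic_tags["age"]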
engineered_feature = "ABSOLUTE(1)"
assert X_fm.ww.columns[engineered_feature].origin == "engineered"

pipeline = pipeline.clone()
don't think you need this clone call!
removed!
    pipeline,
    X_fm,
    features=engineered_feature,
    grid_resolution=5,
we could also try turning down the grid resolution so the test runs faster!
lowered the grid resolution to 2!
Converting to draft temporarily, as I may include other DFS Transformer-related bug fixes in this PR
Force-pushed from 25240b7 to d9b1f34 (compare)
New changes LGTM
    ignore_columns={"X": ["target"]},
    seed_features=seed_features,
)
assert any(f.get_name() == "target" for f in features)
probably not necessary but would we want to use get_feature_names() here as well for consistency?
there's not really a need, imo, because target is a base feature, so it's not possible for it to be a multi output feature (multi output features are definitionally engineered features built from multi output primitives). It would just complicate the assertion here.
Looks good, just a couple nitpicks but nothing blocking!
@@ -103,13 +103,14 @@ def default_parameters(cls):
    def _supported_by_list_API(cls):
        return not cls.modifies_target

    def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters):
    def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters, target):
Can we make target an optional parameter? I had assumed it was based on the usage in the fast mode utils, and I think it makes more sense since we don't need it in most cases.
Yes, and I can do the same for X, since it's also only needed for the DFS Transformer.
As more explanation: the reason I hadn't originally made the extra args optional is that we only call _handle_partial_dependence_fast_mode in one place, where we have to pass in all of the optional args. Something felt off about making the args optional when they will always be passed in and never be null. And because they're always passed in, we have to have them present in the component args even when they aren't used. I'd almost rather just use _ in those components for the unused args, but I get that having them be optional makes them more flexible if we did want to use them in some other way in the future and is more technically correct, so I'm happy to make the change! (A rough sketch of the pattern follows.)
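A minimal sketch of the optional-argument pattern being discussed. The class names, parameter dict layout, and target-filtering logic are simplified stand-ins, not the actual EvalML implementation:

class ComponentBase:
    # Most components don't need X or target, so the extra arguments default
    # to None and the base implementation passes the parameters through untouched.
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        return pipeline_parameters


class DFSTransformerSketch(ComponentBase):
    def __init__(self, features=None):
        self.features = features or []

    # The DFS Transformer is the one component that actually uses the extra
    # arguments, e.g. to drop the target if it shows up in its feature list.
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        kept = [f for f in self.features if f != target]
        pipeline_parameters["DFS Transformer"] = {"features": kept}
        return pipeline_parameters


# The single call site passes everything, even though most components ignore it.
params = DFSTransformerSketch(features=["age", "target"])._handle_partial_dependence_fast_mode(
    {}, X=None, target="target"
)
assert params["DFS Transformer"]["features"] == ["age"]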
target (str): The target whose values we are trying to predict. May be present in the
    list of features in the DFS Transformer's parameters, in which case we should ignore it.
It may be helpful to rephrase the second sentence here to make it more explicit why we need it - e.g. "This is used to know which column to ignore if the target column is present in the list of features in the DFS Transformer's parameters", or something slightly cleaner sounding.
I like this suggestion -- changing to it!
    [dfs_transformer, "Standard Scaler", "Random Forest Regressor"],
)
# Confirm that the LSA primitive was actually used
assert any(len(f.get_feature_names()) > 1 for f in features)
I think
any(isinstance(f.primitive, ft.primitives.LSA) for f in features)
would be a clearer check for the presence of LSA.
Good catch -- I probably should've updated the comment to say "Confirm that a multi output feature is present."
My thinking on why I check explicitly for the multi-output nature of the feature is that EvalML doesn't really care which primitive was used -- just that it's multi output.
Makes sense! Changing the comment would be nice.
Building off of Roy's comment, I think you could just check the number_output_features property of the Feature directly:
assert any(f.number_output_features > 1 for f in features)
awesome, forgot that property exists!
Force-pushed from 668f452 to 4cb94c5 (compare)
docs/source/release_notes.rst
@@ -10,6 +10,9 @@ Release Notes
* Updated demo dataset links to point to new endpoint :pr:`3826`
* Updated ``STLDecomposer`` to infer the time index frequency if it's not present :pr:`3829`
* Updated ``_drop_time_index`` to move the time index from X to both ``X.index`` and ``y.index`` :pr:`3829`
* Fixed bug where engineered features lost their origin attribute in partial dependence, causing it to fail :pr:`3830`
* Fixed bug where partial dependence's DFS Transformer fast mode handling wouldn't work with multi output features :pr:`3830`
nit: maybe clean up the wording here a little.
assert not part_dep.feature_values.isna().any()
assert not part_dep.partial_dependence.isna().any()
I think these assertions might be clearer if rewritten, if I'm following right and assuming we are checking there are no null values present.
assert part_dep.feature_values.notnull().all()
assert part_dep.partial_dependence.notnull().all()
Made this change (and also updated other tests to use this pattern - they were doing the wrong check before, anyway).
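For what it's worth, the two formulations are equivalent; a tiny sketch (the series below is made up) showing that both reject nulls:

import pandas as pd

s = pd.Series([1.0, 2.0, None])

# "no value is null" can be written either way; notnull().all() just reads
# more directly than negating isna().any().
assert (not s.isna().any()) == s.notnull().all()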
Force-pushed from 4cb94c5 to faf8d88 (compare)
Force-pushed from faf8d88 to 2f26b51 (compare)
There are several partial dependence fixes relating to the DFS Transformer in this PR:
* Keeps the origin property on Engineered Features intact during partial dependence, which previously caused the calculation to fail. Closes #3834
* Allows the target to be present in the DFS Transformer's list of features, ignoring it when it appears there. Closes Allow target to be present in list of features for DFS Transformer to be used in Partial Dependence fast mode #3833
* Uses f.get_feature_names instead of f.get_name to handle multi output features correctly (see the sketch after this list). Closes Partial Dependence Fast Mode will fail if DFS Transformer is used and multi-output primitive was used #3832
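A rough sketch of that multi-output distinction. The dataset and primitive below are illustrative (featuretools' bundled mock customer data and the N Most Common aggregation), not the PR's actual test setup:

import featuretools as ft

# Build feature definitions that include a multi-output aggregation primitive.
es = ft.demo.load_mock_customer(return_entityset=True)
features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["n_most_common"],  # produces multiple output columns
    trans_primitives=[],
    features_only=True,
)

for f in features:
    if f.number_output_features > 1:
        # get_name() returns a single base name, while get_feature_names()
        # expands a multi-output feature into one name per output column,
        # e.g. "N_MOST_COMMON(...)[0]", "[1]", ... -- which is why matching
        # column names against get_name() alone misses these columns.
        print(f.get_name(), "->", f.get_feature_names())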