
Allow engineered features in pipeline with DFS Transformer to have partial dependence calculations done #3830

Merged
merged 9 commits into from
Nov 18, 2022

Conversation

tamargrey
Contributor

@tamargrey tamargrey commented Nov 10, 2022

There are several partial dependence fixes relating to the DFS Transformer that happen in this PR:

  1. This PR updates partial dependence to keep the origin values the same. Closes Partial Dependence loses origin property on Engineered Features  #3834
  2. This PR excludes the target from consideration if it's present in the list of features. Closes Allow target to be present in list of features for DFS Transformer to be used in Partial Dependence fast mode #3833
  3. This PR uses f.get_feature_names instead of f.get_name to handle multi output features correctly. Closes Partial Dependence Fast Mode will fail if DFS Transformer is used and multi-output primitive was used  #3832

@@ -318,8 +318,8 @@ def _partial_dependence_calculation(
X_eval.ww[variable] = ww.init_series(
part_dep_column,
logical_type=X_eval.ww.logical_types[variable],
origin=X_eval.ww.columns[variable].origin,
Contributor Author
Ideally, we'd init with the original column's schema (as we do in fast mode above), since that would make sure we don't lose any Woodwork typing info when we replace this column. That's not currently possible with init_series, so for now we just set the individual Woodwork parameters explicitly. I opened a Woodwork issue to make schema-based init possible: alteryx/woodwork#1573
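The workaround described above can be illustrated with a minimal stand-in (plain Python, not the real Woodwork API; the function and field names here are hypothetical): an init_series-style call only preserves the typing attributes that are passed explicitly, so anything not copied over by hand is silently lost.

```python
# Stand-in mimicking ww.init_series: only explicitly passed typing
# parameters survive the re-initialization of a column.
def init_series_standin(values, logical_type=None, origin=None, semantic_tags=None):
    return {
        "values": values,
        "logical_type": logical_type,
        "origin": origin,
        "semantic_tags": semantic_tags,
    }

# Typing info on the column being replaced.
old_col = {"logical_type": "Double", "origin": "engineered", "semantic_tags": {"numeric"}}

# Copying only logical_type and origin (as in this PR) keeps those two
# fields, but anything else, e.g. custom semantic tags, is dropped.
new_col = init_series_standin(
    [1.0, 2.0],
    logical_type=old_col["logical_type"],
    origin=old_col["origin"],
)

assert new_col["origin"] == "engineered"
assert new_col["semantic_tags"] is None  # lost unless passed explicitly
```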

Contributor

Should we also create a corresponding EvalML issue to update this once the WW issue is resolved so we don't forget about it?

Contributor Author

Created: #3847

@codecov

codecov bot commented Nov 10, 2022

Codecov Report

Merging #3830 (2f26b51) into main (b9ca2b6) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3830     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        344     344             
  Lines      36124   36175     +51     
=======================================
+ Hits       35987   36038     +51     
  Misses       137     137             
Impacted Files Coverage Δ
...l/model_understanding/_partial_dependence_utils.py 99.4% <ø> (ø)
...derstanding/_partial_dependence_fast_mode_utils.py 100.0% <100.0%> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...transformers/feature_selection/feature_selector.py 100.0% <100.0%> (ø)
...ponents/transformers/preprocessing/featuretools.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)


Collaborator

@jeremyliweishih jeremyliweishih left a comment

amazing!

changed_col_df.ww.init(
logical_types={variable: X.ww.logical_types[variable]},
)
changed_col_df.ww.init(schema=X.ww.schema.get_subset_schema([variable]))
Collaborator
What's the difference here?

Contributor Author
By passing in the schema, we get any and all woodwork information specified for that column. By passing in parameters, we only get the parameters we specify, which means we need to update it if woodwork type info changes.

So this will help us avoid a similar problem in the future if we make partial dependence rely on some other Woodwork property (for example, a semantic tag that isn't tied to the logical type: we'd miss it with the parameter-based call, because we aren't also passing along the semantic tags).
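The distinction can be sketched with plain dictionaries (a stand-in, not the real Woodwork API; column and field names are invented for illustration): initializing from the full schema subset carries every typing field, while initializing from individual parameters keeps only what was named.

```python
# Full typing info that a Woodwork schema would carry for one column.
schema = {"age": {"logical_type": "Integer", "semantic_tags": {"numeric", "age"}}}

def init_from_params(logical_types):
    # Mimics ww.init(logical_types=...): only the named parameter survives.
    return {col: {"logical_type": lt} for col, lt in logical_types.items()}

def init_from_schema(subset):
    # Mimics ww.init(schema=...): every field of the schema is preserved.
    return {col: dict(info) for col, info in subset.items()}

by_params = init_from_params({"age": "Integer"})
by_schema = init_from_schema(schema)

assert "semantic_tags" not in by_params["age"]                   # tags lost
assert by_schema["age"]["semantic_tags"] == {"numeric", "age"}   # tags kept
```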

engineered_feature = "ABSOLUTE(1)"
assert X_fm.ww.columns[engineered_feature].origin == "engineered"

pipeline = pipeline.clone()
Collaborator
don't think you need this clone call!

Contributor Author
removed!

pipeline,
X_fm,
features=engineered_feature,
grid_resolution=5,
Collaborator
we could also try turning down the grid resolution so the test runs faster!

Contributor Author
lowered the grid resolution to 2!

@tamargrey tamargrey marked this pull request as draft November 10, 2022 22:13
@tamargrey
Contributor Author

Converting to draft temporarily, as I may include other DFS Transformer-related bug fixes in this PR

@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch 2 times, most recently from 25240b7 to d9b1f34 Compare November 15, 2022 17:59
@tamargrey tamargrey marked this pull request as ready for review November 15, 2022 18:21
Collaborator

@jeremyliweishih jeremyliweishih left a comment
New changes LGTM

ignore_columns={"X": ["target"]},
seed_features=seed_features,
)
assert any(f.get_name() == "target" for f in features)
Collaborator
probably not necessary but would we want to use get_feature_names() here as well for consistency?

Contributor Author
There's not really a need, imo, because target is a base feature, so it's not possible for it to be a multi-output feature (multi-output features are, by definition, engineered features built from multi-output primitives). It would just complicate the assertion here.

Contributor

@eccabay eccabay left a comment
Looks good, just a couple nitpicks but nothing blocking!

@@ -103,13 +103,14 @@ def default_parameters(cls):
def _supported_by_list_API(cls):
return not cls.modifies_target

def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters):
def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters, target):
Contributor
Can we make target an optional parameter? I had assumed it was based on the usage in the fast mode utils, and I think it makes more sense since we don't need it in most cases.

Contributor Author
Yes, and I can do the same for X, since it's also only needed for the DFS Transformer.

Contributor Author
As more explanation: the reason I hadn't originally made the extra args optional is that we only call _handle_partial_dependence_fast_mode in one place, and that call always passes all of them.

Something felt off about making the args optional when they will always be passed in and never be null. And because they're always passed in, they have to be present in every component's signature even when they aren't used. I'd almost rather use _ in those components for the unused args, but I get that making them optional is more flexible if we ever want to use them some other way, and is more technically correct, so I'm happy to make the change!
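A hedged sketch of the signature change under discussion (hypothetical class names and parameter contents, not actual EvalML code): the extra arguments default to None, the base implementation ignores them, and only the DFS Transformer override uses them, e.g. to drop the target from its features list.

```python
class ComponentBaseSketch:
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        # Most components don't need X or target, so they simply ignore them.
        return pipeline_parameters


class DFSTransformerSketch(ComponentBaseSketch):
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        # The DFS Transformer is the one component that uses target,
        # excluding it if it appears in the list of features.
        features = pipeline_parameters.get("features") or []
        pipeline_parameters["features"] = [f for f in features if f != target]
        return pipeline_parameters


params = {"features": ["age", "target"]}
result = DFSTransformerSketch()._handle_partial_dependence_fast_mode(params, target="target")
assert result["features"] == ["age"]
```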

Comment on lines 144 to 145
target (str): The target whose values we are trying to predict. May be present in the
list of features in the DFS Transformer's parameters, in which case we should ignore it.
Contributor
It may be helpful to rephrase the second sentence here to make why we need it more explicit - i.e. "This is used to know which column to ignore if the target column is present in the list of features in the DFS Transformer's parameters" or something slightly cleaner sounding.

Contributor Author
I like this suggestion -- changing to it!

@gsheni gsheni requested a review from rwedge November 15, 2022 19:35
[dfs_transformer, "Standard Scaler", "Random Forest Regressor"],
)
# Confirm that the LSA primitive was actually used
assert any(len(f.get_feature_names()) > 1 for f in features)
Contributor
I think

any(isinstance(f.primitive, ft.primitives.LSA) for f in features)

would be a clearer check criteria for the presence of LSA

Contributor Author
Good catch--I probably should've updated the comment to say "Confirm that a multi-output feature is present."

My reasoning for checking explicitly for the multi-output nature of the feature is that EvalML doesn't really care which primitive was used -- just that it's multi-output.

Contributor
Makes sense! Changing the comment would be nice.

Contributor
Building off of Roy's comment, I think you could just check the number_output_features property of the Feature directly:

assert any(f.number_output_features > 1 for f in features)

Contributor Author
awesome, forgot that property exists!
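The property being discussed can be illustrated with a minimal stand-in (not the real featuretools Feature class; names here are invented): number_output_features is the most direct multi-output check, and is equivalent in this context to counting the feature's output names.

```python
class SketchFeature:
    """Stand-in for a featuretools feature with one or more outputs."""

    def __init__(self, name, n_outputs=1):
        self.name = name
        self.number_output_features = n_outputs

    def get_feature_names(self):
        # Multi-output features expand to one name per output column.
        if self.number_output_features == 1:
            return [self.name]
        return [f"{self.name}[{i}]" for i in range(self.number_output_features)]


features = [SketchFeature("age"), SketchFeature("LSA(notes)", n_outputs=2)]

# The direct property check and the name-count check agree.
assert any(f.number_output_features > 1 for f in features)
assert any(len(f.get_feature_names()) > 1 for f in features)
```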

@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from 668f452 to 4cb94c5 Compare November 16, 2022 14:54
@@ -10,6 +10,9 @@ Release Notes
* Updated demo dataset links to point to new endpoint :pr:`3826`
* Updated ``STLDecomposer`` to infer the time index frequency if it's not present :pr:`3829`
* Updated ``_drop_time_index`` to move the time index from X to both ``X.index`` and ``y.index`` :pr:`3829`
* Fixed bug where engineered features lost their origin attribute in partial dependence, causing it to fail :pr:`3830`
* Fixed bug where partial dependence's DFS Transformer fast mode handling wouldn't work with multi-output features :pr:`3830`
Contributor
nit: maybe clean up the wording here a little.


Comment on lines 2842 to 2843
assert not part_dep.feature_values.isna().any()
assert not part_dep.partial_dependence.isna().any()
Contributor
Assuming I'm following right and we're checking that no null values are present, I think these assertions would be clearer rewritten as:

assert part_dep.feature_values.notnull().all()
assert part_dep.partial_dependence.notnull().all()

Contributor Author
Made this change (and also updated other tests to use this pattern; they were doing the wrong check before, anyway).
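For reference, the two pandas phrasings are logically equivalent on a Series; the positive form just reads more directly. A small self-contained check:

```python
import pandas as pd

clean = pd.Series([0.1, 0.2, 0.3])
with_nan = pd.Series([0.1, None, 0.3])

# "All values are non-null" and "no value is null" are equivalent checks.
assert clean.notnull().all()
assert not clean.isna().any()

# Both phrasings also agree when a null is present.
assert not with_nan.notnull().all()
assert with_nan.isna().any()
```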


@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from 4cb94c5 to faf8d88 Compare November 17, 2022 14:35
@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from faf8d88 to 2f26b51 Compare November 18, 2022 14:03
@tamargrey tamargrey merged commit 459ba58 into main Nov 18, 2022
@tamargrey tamargrey deleted the fix-engineered-dfs-feature-pd branch November 18, 2022 15:41
@chukarsten chukarsten mentioned this pull request Nov 23, 2022