
Allow engineered features in pipeline with DFS Transformer to have partial dependence calculations done #3830

Merged
merged 9 commits into from
Nov 18, 2022

Conversation

tamargrey
Contributor

@tamargrey tamargrey commented Nov 10, 2022

There are several partial dependence fixes relating to the DFS Transformer that happen in this PR:

  1. This PR updates partial dependence to keep the origin values the same. Closes Partial Dependence loses origin property on Engineered Features  #3834
  2. This PR excludes the target from consideration if it's present in the list of features. Closes Allow target to be present in list of features for DFS Transformer to be used in Partial Dependence fast mode #3833
  3. This PR uses f.get_feature_names instead of f.get_name to handle multi output features correctly. Closes Partial Dependence Fast Mode will fail if DFS Transformer is used and multi-output primitive was used  #3832

@@ -318,8 +318,8 @@ def _partial_dependence_calculation(
X_eval.ww[variable] = ww.init_series(
part_dep_column,
logical_type=X_eval.ww.logical_types[variable],
origin=X_eval.ww.columns[variable].origin,
Contributor Author
Ideally, we'd init with the original column's schema (as we do in fast mode above), since that would make sure we don't lose any Woodwork typing info when we replace this column. That's not currently possible with init_series, so for now we just set the individual Woodwork parameters explicitly. I opened a Woodwork issue to make schema-based init possible: alteryx/woodwork#1573
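The workaround described above can be illustrated with a minimal stand-in (plain Python, not the real Woodwork API; the function and field names here are hypothetical): an init_series-style call only preserves the typing attributes that are passed explicitly, so anything not copied over by hand is silently lost.

```python
# Stand-in mimicking ww.init_series: only explicitly passed typing
# parameters survive the re-initialization of a column.
def init_series_standin(values, logical_type=None, origin=None, semantic_tags=None):
    return {
        "values": values,
        "logical_type": logical_type,
        "origin": origin,
        "semantic_tags": semantic_tags,
    }

# Typing info on the column being replaced.
old_col = {"logical_type": "Double", "origin": "engineered", "semantic_tags": {"numeric"}}

# Copying only logical_type and origin (as in this PR) keeps those two
# fields, but anything else, e.g. custom semantic tags, is dropped.
new_col = init_series_standin(
    [1.0, 2.0],
    logical_type=old_col["logical_type"],
    origin=old_col["origin"],
)

assert new_col["origin"] == "engineered"
assert new_col["semantic_tags"] is None  # lost unless passed explicitly
```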

Contributor

Should we also create a corresponding EvalML issue to update this once the WW issue is resolved so we don't forget about it?

Contributor Author

Created: #3847

@codecov

codecov bot commented Nov 10, 2022

Codecov Report

Merging #3830 (2f26b51) into main (b9ca2b6) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3830     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        344     344             
  Lines      36124   36175     +51     
=======================================
+ Hits       35987   36038     +51     
  Misses       137     137             
Impacted Files Coverage Δ
...l/model_understanding/_partial_dependence_utils.py 99.4% <ø> (ø)
...derstanding/_partial_dependence_fast_mode_utils.py 100.0% <100.0%> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...transformers/feature_selection/feature_selector.py 100.0% <100.0%> (ø)
...ponents/transformers/preprocessing/featuretools.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)


Collaborator

@jeremyliweishih jeremyliweishih left a comment

amazing!

changed_col_df.ww.init(
logical_types={variable: X.ww.logical_types[variable]},
)
changed_col_df.ww.init(schema=X.ww.schema.get_subset_schema([variable]))
Collaborator
What's the difference here?

Contributor Author
By passing in the schema, we get any and all woodwork information specified for that column. By passing in parameters, we only get the parameters we specify, which means we need to update it if woodwork type info changes.

So this will help us avoid a similar problem in the future if we make partial dependence rely on some other Woodwork property (for example, a semantic tag that isn't tied to the logical type: we'd miss it with the parameter-based call, because we aren't also passing along the semantic tags).
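The distinction can be sketched with plain dictionaries (a stand-in, not the real Woodwork API; column and field names are invented for illustration): initializing from the full schema subset carries every typing field, while initializing from individual parameters keeps only what was named.

```python
# Full typing info that a Woodwork schema would carry for one column.
schema = {"age": {"logical_type": "Integer", "semantic_tags": {"numeric", "age"}}}

def init_from_params(logical_types):
    # Mimics ww.init(logical_types=...): only the named parameter survives.
    return {col: {"logical_type": lt} for col, lt in logical_types.items()}

def init_from_schema(subset):
    # Mimics ww.init(schema=...): every field of the schema is preserved.
    return {col: dict(info) for col, info in subset.items()}

by_params = init_from_params({"age": "Integer"})
by_schema = init_from_schema(schema)

assert "semantic_tags" not in by_params["age"]                   # tags lost
assert by_schema["age"]["semantic_tags"] == {"numeric", "age"}   # tags kept
```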

engineered_feature = "ABSOLUTE(1)"
assert X_fm.ww.columns[engineered_feature].origin == "engineered"

pipeline = pipeline.clone()
Collaborator
don't think you need this clone call!

Contributor Author
removed!

pipeline,
X_fm,
features=engineered_feature,
grid_resolution=5,
Collaborator
we could also try turning down the grid resolution so the test runs faster!

Contributor Author
lowered the grid resolution to 2!

@tamargrey tamargrey marked this pull request as draft November 10, 2022 22:13
@tamargrey
Contributor Author

Converting to draft temporarily, as I may include other DFS Transformer-related bug fixes in this PR

@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch 2 times, most recently from 25240b7 to d9b1f34 Compare November 15, 2022 17:59
@tamargrey tamargrey marked this pull request as ready for review November 15, 2022 18:21
Collaborator

@jeremyliweishih jeremyliweishih left a comment
New changes LGTM

ignore_columns={"X": ["target"]},
seed_features=seed_features,
)
assert any(f.get_name() == "target" for f in features)
Collaborator
probably not necessary but would we want to use get_feature_names() here as well for consistency?

Contributor Author
There's not really a need, imo, because target is a base feature, so it's not possible for it to be a multi-output feature (multi-output features are, by definition, engineered features built from multi-output primitives). It would just complicate the assertion here.

Contributor

@eccabay eccabay left a comment
Looks good, just a couple nitpicks but nothing blocking!

@@ -103,13 +103,14 @@ def default_parameters(cls):
def _supported_by_list_API(cls):
return not cls.modifies_target

def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters):
def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters, target):
Contributor
Can we make target an optional parameter? I had assumed it was based on the usage in the fast mode utils, and I think it makes more sense since we don't need it in most cases.

Contributor Author
Yes, and I can do the same for X, since it's also only needed for the DFS Transformer.

Contributor Author
As more explanation: the reason I hadn't originally made the extra args optional is that we only call _handle_partial_dependence_fast_mode in one place, and that call always passes all of them.

Something felt off about making the args optional when they will always be passed in and never be null. And because they're always passed in, they have to be present in every component's signature even when they aren't used. I'd almost rather use _ in those components for the unused args, but I get that making them optional is more flexible if we ever want to use them some other way, and is more technically correct, so I'm happy to make the change!
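A hedged sketch of the signature change under discussion (hypothetical class names and parameter contents, not actual EvalML code): the extra arguments default to None, the base implementation ignores them, and only the DFS Transformer override uses them, e.g. to drop the target from its features list.

```python
class ComponentBaseSketch:
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        # Most components don't need X or target, so they simply ignore them.
        return pipeline_parameters


class DFSTransformerSketch(ComponentBaseSketch):
    def _handle_partial_dependence_fast_mode(self, pipeline_parameters, X=None, target=None):
        # The DFS Transformer is the one component that uses target,
        # excluding it if it appears in the list of features.
        features = pipeline_parameters.get("features") or []
        pipeline_parameters["features"] = [f for f in features if f != target]
        return pipeline_parameters


params = {"features": ["age", "target"]}
result = DFSTransformerSketch()._handle_partial_dependence_fast_mode(params, target="target")
assert result["features"] == ["age"]
```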

Comment on lines 144 to 145
target (str): The target whose values we are trying to predict. May be present in the
list of features in the DFS Transformer's parameters, in which case we should ignore it.
Contributor
It may be helpful to rephrase the second sentence here to make why we need it more explicit - i.e. "This is used to know which column to ignore if the target column is present in the list of features in the DFS Transformer's parameters" or something slightly cleaner sounding.

Contributor Author
I like this suggestion -- changing to it!

@gsheni gsheni requested a review from rwedge November 15, 2022 19:35
[dfs_transformer, "Standard Scaler", "Random Forest Regressor"],
)
# Confirm that the LSA primitive was actually used
assert any(len(f.get_feature_names()) > 1 for f in features)
Contributor
I think

any(isinstance(f.primitive, ft.primitives.LSA) for f in features)

would be a clearer check criteria for the presence of LSA

Contributor Author
Good catch--I probably should've updated the comment to say "Confirm that a multi-output feature is present."

My reasoning for checking explicitly for the multi-output nature of the feature is that EvalML doesn't really care which primitive was used -- just that it's multi-output.

Contributor
Makes sense! Changing the comment would be nice.

Contributor
Building off of Roy's comment, I think you could just check the number_output_features property of the Feature directly:

assert any(f.number_output_features > 1 for f in features)

Contributor Author
awesome, forgot that property exists!
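The property being discussed can be illustrated with a minimal stand-in (not the real featuretools Feature class; names here are invented): number_output_features is the most direct multi-output check, and is equivalent in this context to counting the feature's output names.

```python
class SketchFeature:
    """Stand-in for a featuretools feature with one or more outputs."""

    def __init__(self, name, n_outputs=1):
        self.name = name
        self.number_output_features = n_outputs

    def get_feature_names(self):
        # Multi-output features expand to one name per output column.
        if self.number_output_features == 1:
            return [self.name]
        return [f"{self.name}[{i}]" for i in range(self.number_output_features)]


features = [SketchFeature("age"), SketchFeature("LSA(notes)", n_outputs=2)]

# The direct property check and the name-count check agree.
assert any(f.number_output_features > 1 for f in features)
assert any(len(f.get_feature_names()) > 1 for f in features)
```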

@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from 668f452 to 4cb94c5 Compare November 16, 2022 14:54
@@ -10,6 +10,9 @@ Release Notes
* Updated demo dataset links to point to new endpoint :pr:`3826`
* Updated ``STLDecomposer`` to infer the time index frequency if it's not present :pr:`3829`
* Updated ``_drop_time_index`` to move the time index from X to both ``X.index`` and ``y.index`` :pr:`3829`
* Fixed bug where engineered features lost their origin attribute in partial dependence, causing it to fail :pr:`3830`
* Fixed bug where partial dependence's DFS Transformer fast mode handling wouldn't work with multi-output features :pr:`3830`
Contributor
nit: maybe clean up the wording here a little.


Comment on lines 2842 to 2843
assert not part_dep.feature_values.isna().any()
assert not part_dep.partial_dependence.isna().any()
Contributor
Assuming I'm following right and we're checking that no null values are present, I think these assertions would be clearer rewritten as:

assert part_dep.feature_values.notnull().all()
assert part_dep.partial_dependence.notnull().all()

Contributor Author
Made this change (and also updated other tests to use this pattern; they were doing the wrong check before, anyway).
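For reference, the two pandas phrasings are logically equivalent on a Series; the positive form just reads more directly. A small self-contained check:

```python
import pandas as pd

clean = pd.Series([0.1, 0.2, 0.3])
with_nan = pd.Series([0.1, None, 0.3])

# "All values are non-null" and "no value is null" are equivalent checks.
assert clean.notnull().all()
assert not clean.isna().any()

# Both phrasings also agree when a null is present.
assert not with_nan.notnull().all()
assert with_nan.isna().any()
```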


@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from 4cb94c5 to faf8d88 Compare November 17, 2022 14:35
@tamargrey tamargrey force-pushed the fix-engineered-dfs-feature-pd branch from faf8d88 to 2f26b51 Compare November 18, 2022 14:03
@tamargrey tamargrey merged commit 459ba58 into main Nov 18, 2022
@tamargrey tamargrey deleted the fix-engineered-dfs-feature-pd branch November 18, 2022 15:41
@chukarsten chukarsten mentioned this pull request Nov 23, 2022