Refactor CPI #14

jpaillard · 2024-09-18T12:19:32Z

I suggest a refactoring of the CPI functionality. It is inspired by the current implementation of permutation importance in scikit-learn.
It aims to:

Separate the model fit / predict / selection from the variable importance part (inspection).
Allow several variable importance methods to be used on the same fitted model (facilitates benchmarking).
Expose the fitting of the covariate estimator. This should make clearer which features are used to measure importance and which data splits are used to fit / predict the covariate estimator.
Leave the parallelization out of the VIM method (I think it should be done at the cross-fitting level).
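
To make the goals above concrete, here is a minimal NumPy-only sketch of the fit / score separation. It is not the code of this PR: `MiniCPI`, `fit_ols` and `predict_ols` are hypothetical names, ordinary least squares stands in for both the predictor and the covariate estimators, and each feature is treated as its own group.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions of the least-squares model returned by fit_ols."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


class MiniCPI:
    """Hypothetical inspector that receives an already fitted predictor."""

    def __init__(self, predict_fn, n_permutations=20, random_state=0):
        self.predict_fn = predict_fn
        self.n_permutations = n_permutations
        self.random_state = random_state

    def fit(self, X):
        # One covariate estimator per feature: regress X_j on the other columns.
        self._imputation_coefs = [
            fit_ols(np.delete(X, j, axis=1), X[:, j]) for j in range(X.shape[1])
        ]
        return self

    def score(self, X, y):
        rng = np.random.default_rng(self.random_state)
        loss_reference = np.mean((y - self.predict_fn(X)) ** 2)
        importance = np.zeros(X.shape[1])
        for j, coef in enumerate(self._imputation_coefs):
            # Conditional permutation: predicted X_j plus shuffled residuals.
            x_hat = predict_ols(coef, np.delete(X, j, axis=1))
            residual = X[:, j] - x_hat
            losses = []
            for _ in range(self.n_permutations):
                X_perm = X.copy()
                X_perm[:, j] = x_hat + rng.permutation(residual)
                losses.append(np.mean((y - self.predict_fn(X_perm)) ** 2))
            importance[j] = np.mean(losses) - loss_reference
        return importance


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

# The predictor is fitted once, outside of the importance class.
coef = fit_ols(X_train, y_train)


def predictor(X_new):
    return predict_ols(coef, X_new)


# Covariate estimators are fitted on the training split, scored on the test split.
cpi = MiniCPI(predictor).fit(X_train)
importance = cpi.score(X_test, y_test)
```

Because `predictor` is built outside of `MiniCPI`, the same fitted model could be handed to another inspection class without refitting, which is the point of the second item above.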

I also updated the example using the diabetes dataset.

Linked to issues #12 and #13

codecov · 2024-09-18T19:16:11Z

Codecov Report

Attention: Patch coverage is 99.06250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.35%. Comparing base (b42572e) to head (3ffa1b5).
Report is 27 commits behind head on main.

Files with missing lines	Patch %	Lines
hidimstat/cpi.py	97.50%	2 Missing ⚠️
hidimstat/loco.py	98.36%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #14       +/-   ##
===========================================
- Coverage   91.79%   76.35%   -15.45%     
===========================================
  Files          44       46        +2     
  Lines        2926     2398      -528     
===========================================
- Hits         2686     1831      -855     
- Misses        240      567      +327


bthirion

Tests are needed for the added code.

bthirion · 2024-09-18T19:12:34Z

examples/plot_diabetes_variable_importance_example_v2.py

+from hidimstat.CPI import CPI
+
+# %%
+


Provide a title for each part

bthirion · 2024-09-18T19:12:51Z

examples/plot_diabetes_variable_importance_example_v2.py

+
+
+def compute_pval(vim):
+    mean_vim = np.mean(vim, axis=0)


Mini docstring welcome

bthirion · 2024-09-18T19:17:36Z

examples/plot_diabetes_variable_importance_example_v2.py

+    y_train, y_test = y[train_index], y[test_index]
+    cpi = CPI(
+        estimator=regressor_list[i],
+        # covariate_estimator=RidgeCV(alphas=np.logspace(-3, 3, 10)),


remove commented lines


bthirion · 2024-09-18T19:17:41Z

hidimstat/CPI.py

+                 random_state: int = None,
+                 n_jobs: int = 1
+                 ):
+


Please write complete docstrings


bthirion · 2024-09-18T19:25:05Z

examples/plot_diabetes_variable_importance_example_v2.py

@@ -0,0 +1,104 @@
+# %%


Please find a better name for the example.
I believe that this example is meant to replace the other one ?

jpaillard · 2024-09-19T12:32:06Z

I tried to address the different comments:

Improve the docstrings
Add tests
Rename and improve the example (adding a consistent implementation of LOCO and Permutation Importance, to also show how the proposed implementation saves time by reusing the main predictor across importance methods)

bthirion

Thx !
Maybe I misunderstand some parts of the code you're adding. We should discuss it.

bthirion · 2024-09-19T20:00:52Z

examples/plot_diabetes_variable_importance_example_legacy.py

@@ -0,0 +1,173 @@
+"""


What is the point of having legacy examples ?

examples/plot_diabetes_variable_importance_example.py

bthirion · 2024-09-19T20:12:29Z

hidimstat/CPI.py

+        output_dict["loss_reference"] = loss_reference
+        output_dict['loss_perm'] = dict()
+
+        def joblib_predict_one_gp(estimator, X, y, j):


Suggested change

-def joblib_predict_one_gp(estimator, X, y, j):
+def _joblib_predict_one_gp(estimator, X, y, j):

bthirion · 2024-09-19T20:13:07Z

hidimstat/LOCO.py

+        self.list_estimators = [clone(self.estimator)
+                                for _ in range(self.nb_groups)]
+
+        def joblib_fit_one_gp(estimator, X, y, j):


Suggested change

-def joblib_fit_one_gp(estimator, X, y, j):
+def _joblib_fit_one_gp(estimator, X, y, j):

bthirion · 2024-09-19T20:21:45Z

hidimstat/test/test_LOCO.py

@@ -0,0 +1,48 @@
+import numpy as np


please rename to test_loco.py

bthirion · 2024-09-19T20:22:02Z

hidimstat/test/test_CPI.py

@@ -0,0 +1,53 @@
+import numpy as np


please rename to test_cpi.py

bthirion · 2024-09-19T20:22:51Z

hidimstat/test/test_LOCO.py

+from hidimstat.LOCO import LOCO
+
+
+def test_LOCO(linear_scenario):


You have to use snake_case or CamelCase, but not both in the same string. -> test_loco

bthirion · 2024-09-19T20:23:18Z

hidimstat/test/test_PermutationImportance.py

@@ -0,0 +1,50 @@
+import numpy as np


test_permutation_importance

bthirion · 2024-09-19T20:23:34Z

hidimstat/test/test_PermutationImportance.py

+from hidimstat.PermutationImportance import PermutationImportance
+
+
+def test_CPI(linear_scenario):


Suggested change

-def test_CPI(linear_scenario):
+def test_cpi(linear_scenario):

jpaillard · 2024-10-02T13:11:30Z

Also linked to #17
After discussing with @AngelReyero, we agreed that the API suggested in the current PR would be more convenient, for the reasons mentioned above but also to facilitate the implementation of Angel's recent contribution.
Separating .predict and .score, for example, makes it possible to modify CPI to average either at the prediction level or at the loss level with minimal code changes.
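
As a rough illustration of the two averaging options, assuming a squared loss and random stand-in values rather than real model output (the names `y_pred_perm`, `loss_level` and `prediction_level` are only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_permutations, n_samples = 50, 100
y = rng.normal(size=n_samples)
# Stand-in for the predictions obtained from each permuted copy of X,
# i.e. what a .predict step would return: shape (n_permutations, n_samples).
y_pred_perm = y + rng.normal(size=(n_permutations, n_samples))

# Loss-level averaging: one loss per permutation, then the mean of the losses.
loss_level = np.mean(np.mean((y - y_pred_perm) ** 2, axis=1))

# Prediction-level averaging: average the predictions first, then one loss.
prediction_level = np.mean((y - y_pred_perm.mean(axis=0)) ** 2)
```

With a convex loss such as the squared error, the prediction-level value can never exceed the loss-level one, so the two choices generally give different importance scores. Keeping the permuted predictions available from .predict means only the last step changes.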

jpaillard · 2024-10-02T13:41:13Z

The coverage decrease is related to #18: we are no longer testing Dnn and ModifiedRF, which were integrated in the previous BBI. Not sure if I should address it in this PR or in a subsequent one.
Otherwise, this PR is ready for review @bthirion

bthirion · 2024-10-02T21:45:57Z

Leave me a bit of time, it is a big one...

bthirion

It seems like a great step forward !
I have some suggestions on the API. LMK what you think.


bthirion · 2024-10-11T08:28:19Z

hidimstat/cpi.py

+        the others.
+        """
+        if self.groups is None:
+            self.nb_groups = X.shape[1]


Suggested change

-self.nb_groups = X.shape[1]
+self.n_groups = X.shape[1]

And change accordingly nb_groups -> n_groups wherever needed

bthirion · 2024-10-11T08:31:04Z

hidimstat/cpi.py

+        else:
+            self.nb_groups = len(self.groups)
+        # create a list of covariate estimators for each group if not provided
+        if len(self.list_imputation_mod) == 0:


I don't like self.list_imputation_mod.
Maybe self.list_imputation_models?
Btw, the user is not supposed to manipulate these models? So this one could be an internal variable,
self._list_imputation_models?

bthirion · 2024-10-11T08:31:23Z

hidimstat/cpi.py

+                clone(self.imputation_model) for _ in range(self.nb_groups)
+            ]
+
+        def joblib_fit_one_gp(estimator, X, y, j):


Suggested change

-def joblib_fit_one_gp(estimator, X, y, j):
+def joblib_fit_one_group(estimator, X, y, j):

bthirion · 2024-10-11T08:33:20Z

hidimstat/cpi.py

+        list must match the number of covariates.
+    n_perm: int, default=50
+        Number of permutations to perform.
+    groups: dict, default=None


I think that groups are a data-dependent thing and should be provided at fit time ?

Good point, I will then move self._list_imputation_models to the .fit function.
Maybe it could also support clustering methods rather than predefined groups (in a future PR).
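
A small sketch of what resolving groups at fit time could look like. `resolve_groups` is a hypothetical helper, not a function of the library; it assumes, as in the docstring under review, that groups is either None or a dict, here mapping a group name to column indices.

```python
import numpy as np


def resolve_groups(X, groups=None):
    """Return a dict mapping each group name to a list of column indices."""
    if groups is None:
        # Default: every column of X is its own group.
        return {j: [j] for j in range(X.shape[1])}
    # User-provided groups may be non-contiguous, e.g. columns 0 and 3.
    return {name: list(columns) for name, columns in groups.items()}


X = np.zeros((10, 4))
default_groups = resolve_groups(X)
custom_groups = resolve_groups(X, {"a": [0, 3], "b": [1, 2]})
n_groups = len(custom_groups)
```

Since the default depends on X.shape[1], the resolution can only happen once the data is known, which is one argument for passing groups to fit rather than to the constructor.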


bthirion · 2024-10-11T08:36:26Z

hidimstat/cpi.py

+        for m in self.list_imputation_mod:
+            check_is_fitted(m)
+
+        def joblib_predict_one_gp(imputation_model, X, j):


Suggested change

-def joblib_predict_one_gp(imputation_model, X, j):
+def _joblib_predict_one_gp(imputation_model, X, j):

hidimstat/loco.py

bthirion · 2024-10-11T08:43:30Z

@achamma723 your opinion is welcome.

jpaillard · 2024-10-11T13:59:46Z

I tried to address all your comments @bthirion
For me it is ready to merge.

achamma723 · 2024-10-11T14:11:31Z

Hello @bthirion and @jpaillard, sorry I wasn't able to look at all the comments this past week. I'll try to have a look this weekend if you haven't decided to merge yet.

achamma723 · 2024-10-14T00:47:43Z

hidimstat/cpi.py

+        provided, it will be cloned for each covariate. Otherwise, a list of
+        potentially different estimators can be provided, the length of the
+        list must match the number of covariates.
+    n_perm: int, default=50


Maybe better to call it n_permutations?

achamma723 · 2024-10-14T00:52:21Z

hidimstat/cpi.py

+
+        Returns
+        -------
+        output_dict: dict


I think this is the return parameter for the score function (replicated below).

achamma723 · 2024-10-14T00:53:42Z

hidimstat/permutation_importance.py

+    def __init__(
+        self,
+        estimator,
+        n_perm: int = 50,


Also here, n_permutations?

achamma723 · 2024-10-14T00:54:11Z

hidimstat/permutation_importance.py

+        Returns
+        -------
+        output_dict: dict
+            A dictionary containing the following keys:


Same goes for the return parameter between the predict and the score functions

achamma723 · 2024-10-14T00:57:25Z

Hello @jpaillard and @bthirion, I left a few minor comments in passing; overall I think it is a great, stable step toward future benchmarks. As for the CPI, it is currently limited to reconstructing the variable or group of interest by means of the residuals, so maybe the sampling idea (which existed via the Modified RF) would also be worth pushing in the next steps?

jpaillard · 2024-10-14T07:30:33Z

Thx for the comments @achamma723. I made the modifications you suggested. Indeed it could be worth adding the sampling from nodes of the RF. Maybe we could tackle this in a following PR ?

jpaillard · 2024-10-14T07:54:50Z

Ready to merge on my side @bthirion

bthirion

We're converging. There are just a few missing docstrings and we should include tests for all lines.

hidimstat/cpi.py

hidimstat/loco.py

hidimstat/permutation_importance.py

jpaillard · 2024-10-15T11:01:45Z

I added the additional tests and docstring @bthirion

bthirion

LGTM. The last pending thing is the ValueError("fit must be called before predict") to be tested, but this could be handled in a forthcoming PR
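
A sketch of what such a test could look like, written against a hypothetical stand-in class (`MiniImportance`) rather than the classes of this PR, whose internals are not reproduced here. Only the error message is taken from the comment above.

```python
class MiniImportance:
    """Hypothetical stand-in for an importance class with fit / predict."""

    def __init__(self):
        self._fitted = False

    def fit(self, X, y=None):
        self._fitted = True
        return self

    def predict(self, X):
        if not self._fitted:
            raise ValueError("fit must be called before predict")
        return X


def test_predict_before_fit_raises():
    # predict on a fresh instance must fail with the documented message.
    try:
        MiniImportance().predict([[0.0]])
    except ValueError as err:
        assert "fit must be called before predict" in str(err)
    else:
        raise AssertionError("predict should raise before fit")


test_predict_before_fit_raises()
```

In a pytest suite the try / except block would usually be written as `with pytest.raises(ValueError, match="fit must be called before predict"):`.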


bthirion reviewed Sep 18, 2024


bthirion reviewed Sep 19, 2024


jpaillard mentioned this pull request Sep 30, 2024

VIM: choice of the API #17

Closed

bthirion reviewed Oct 11, 2024


achamma723 reviewed Oct 14, 2024


bthirion reviewed Oct 15, 2024



bthirion approved these changes Oct 15, 2024


jpaillard force-pushed the main branch from a47fa2b to f653bc3 Compare October 15, 2024 15:35

paillarj added 10 commits October 15, 2024 20:48

re-implement CPI

e80c96d

allow for non contiguous variable groups

94e2638

actually allow to paralleliza within the CPI class

3537afb

add docstring and tests

9d59d62

update example and add concsitent implementation for LOCO and PI

3b69d33

rename examples

79ce7eb

remove seaborn dependency

f1000da

black formating

b1655db

clean duplicated code

e49c1ce

edit init file

9b61ce0

paillarj and others added 16 commits October 15, 2024 20:52

module name should be lower case

d186a99

black formatting

3f31a86

add scoring method to VIM methods

710cdd0

fix test using score instead of predict

aff6133

Update examples/plot_diabetes_variable_importance_example.py

d075b68

nb_groups > n_groups

8456446

Place groups in fit method. Make list of imputation models internal. …

499cad6

…Adapt tests.

Rename internal joblib functions

d344559

cite LOCO

81074dc

Docstring for return. n_perm -> n_permutation

2228190

Fix doc example. Remove deprecated sklean metric

271f1a6

Add test for classification scenario with CPI

f30197b

Change deprecated loss

f615594

Improve docstring

b19e6e3

Add test for clf scenario for LOCO and PI

50ca014

deprecated loss

3ffa1b5

jpaillard force-pushed the main branch from f653bc3 to 3ffa1b5 Compare October 15, 2024 19:00

jpaillard merged commit 0299753 into mind-inria:main Oct 15, 2024
7 of 8 checks passed

This was referenced Oct 15, 2024

BBI: covariate estimator currently only supports sampling_RF and RF #13

Closed

BBI: Should the covariate estimator be fitted on the training or test set ? #12

Closed

Missing doc for CPI & LOCO #19

Closed
