Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CPI #14

Merged
merged 26 commits into from
Oct 15, 2024
Merged

Refactor CPI #14

merged 26 commits into from
Oct 15, 2024

Conversation

jpaillard
Copy link
Collaborator

@jpaillard jpaillard commented Sep 18, 2024

I suggest a refactoring of the CPI functionality. It is inspired by the current implementation of permutation importance in scikit-learn.
It aims at:

  • Sperate the model fit / predict / selection from the variable importance part (inspection)
  • Allow to use multiple variable importance method for a same fitted model (facilitates benchmarking)
  • Expose the fitting of covariate estimation. This should make clearer which features are used for the measure of importance and which data split are used to fit / predict the covariate estimator.
  • Leave the parallelization out of the VIM method (I think it should be done at the cross-fitting level)

I also updated the example using the diabetes dataset.

Linked to issues #12 and #13

Copy link

codecov bot commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 99.06250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.35%. Comparing base (b42572e) to head (3ffa1b5).
Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
hidimstat/cpi.py 97.50% 2 Missing ⚠️
hidimstat/loco.py 98.36% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #14       +/-   ##
===========================================
- Coverage   91.79%   76.35%   -15.45%     
===========================================
  Files          44       46        +2     
  Lines        2926     2398      -528     
===========================================
- Hits         2686     1831      -855     
- Misses        240      567      +327     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are needed for the added code.

from hidimstat.CPI import CPI

# %%

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide a title for each part



def compute_pval(vim):
mean_vim = np.mean(vim, axis=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mini docstring welcome

y_train, y_test = y[train_index], y[test_index]
cpi = CPI(
estimator=regressor_list[i],
# covariate_estimator=RidgeCV(alphas=np.logspace(-3, 3, 10)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented lines

hidimstat/CPI.py Outdated Show resolved Hide resolved
hidimstat/CPI.py Outdated
random_state: int = None,
n_jobs: int = 1
):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write complete doctrings

hidimstat/CPI.py Outdated Show resolved Hide resolved
@@ -0,0 +1,104 @@
# %%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please find a better name for the example.
I believe that this example is meant to replace the other one ?

@jpaillard
Copy link
Collaborator Author

I tried to address the different comments:

  • Improve the docstrings
  • Add tests
  • Rename and improve the example (adding a consistent implementation of LOCO and Permutation Importance to also evidence how the proposed implementation saves time by allowing to reuse the main predictor for various importance methods)

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx !
Maybe I misunderstand some parts of the code you're adding. We should discuss about it.

@@ -0,0 +1,173 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of having legacy examples ?

hidimstat/CPI.py Outdated
output_dict["loss_reference"] = loss_reference
output_dict['loss_perm'] = dict()

def joblib_predict_one_gp(estimator, X, y, j):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def joblib_predict_one_gp(estimator, X, y, j):
def _joblib_predict_one_gp(estimator, X, y, j):

self.list_estimators = [clone(self.estimator)
for _ in range(self.nb_groups)]

def joblib_fit_one_gp(estimator, X, y, j):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def joblib_fit_one_gp(estimator, X, y, j):
def _joblib_fit_one_gp(estimator, X, y, j):

@@ -0,0 +1,48 @@
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename to test_loco.py

@@ -0,0 +1,53 @@
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename to test_cpi.py

from hidimstat.LOCO import LOCO


def test_LOCO(linear_scenario):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to use snale_case or CamelCase, but not both in the same string. -> test_loco

@@ -0,0 +1,50 @@
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_permutation_importance

from hidimstat.PermutationImportance import PermutationImportance


def test_CPI(linear_scenario):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_CPI(linear_scenario):
def test_cpi(linear_scenario):

@jpaillard jpaillard mentioned this pull request Sep 30, 2024
@jpaillard
Copy link
Collaborator Author

Also linked to #17
After discussing with @AngelReyero we agreed that the API suggested in the current PR would be more convenient for the reasons mentioned above but also to facilitate the implementation of the recent contribution of Angel's work.
Separating the .predict and .score for example allows to modify CPI for averaging either at the prediction or loss level with minimal code changes.

@jpaillard
Copy link
Collaborator Author

The coverage decrease is related to #18, we are no longer testing Dnn and ModifiedRF which were part of the previous integrated in the previous BBI. Not sure if I should address it in this PR or a subsequent one
Otherwise, this PR is ready for review @bthirion

@bthirion
Copy link
Contributor

bthirion commented Oct 2, 2024

Leave me a bit of time, it is a big one...

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a great step forward !
I have some suggestions on the API. LMK what you think.

hidimstat/cpi.py Show resolved Hide resolved
hidimstat/cpi.py Outdated
the others.
"""
if self.groups is None:
self.nb_groups = X.shape[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.nb_groups = X.shape[1]
self.n_groups = X.shape[1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And change accordingly nb_groups -> n_groups wherever needed

hidimstat/cpi.py Outdated
else:
self.nb_groups = len(self.groups)
# create a list of covariate estimators for each group if not provided
if len(self.list_imputation_mod) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like self.list_imputation_mod
Maybe self.list_imputation_models
bzw, the user is not supposed to manipulate these models ? So this one could be an internal variable,
self._list_imputation_models ?

hidimstat/cpi.py Outdated
clone(self.imputation_model) for _ in range(self.nb_groups)
]

def joblib_fit_one_gp(estimator, X, y, j):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def joblib_fit_one_gp(estimator, X, y, j):
def joblib_fit_one_group(estimator, X, y, j):

hidimstat/cpi.py Outdated
list must match the number of covariates.
n_perm: int, default=50
Number of permutations to perform.
groups: dict, default=None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that groups are a data-dependent thing and should be provided at fit time ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point I will then move the self._list_imputation_models to the .fit function.
Maybe it could also support clustering methods rather than predefined groups (for a next PR).

hidimstat/cpi.py Show resolved Hide resolved
hidimstat/cpi.py Outdated
for m in self.list_imputation_mod:
check_is_fitted(m)

def joblib_predict_one_gp(imputation_model, X, j):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def joblib_predict_one_gp(imputation_model, X, j):
def _joblib_predict_one_gp(imputation_model, X, j):

hidimstat/loco.py Show resolved Hide resolved
hidimstat/loco.py Show resolved Hide resolved
@bthirion
Copy link
Contributor

@achamma723 your opinion is welcome.

@jpaillard
Copy link
Collaborator Author

I tried to address all your comments @bthirion
For me it is ready to merge.

@achamma723
Copy link
Contributor

Hello @bthirion and @jpaillard, sorry I weren't able to see all the comments in the past week. I'll try this weekend to have a look if you didn't decide to merge yet

hidimstat/cpi.py Outdated
provided, it will be cloned for each covariate. Otherwise, a list of
potentially different estimators can be provided, the length of the
list must match the number of covariates.
n_perm: int, default=50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to call it n_permutations?

hidimstat/cpi.py Outdated

Returns
-------
output_dict: dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the return parameter for the score function (replicated below).

def __init__(
self,
estimator,
n_perm: int = 50,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, n_permutations?

Returns
-------
output_dict: dict
A dictionary containing the following keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for the return parameter between the predict and the score functions

@achamma723
Copy link
Contributor

Hello @jpaillard and @bthirion, I highlighted minor comments when passing, overall I think it is a great stable step for the future benchmarks. As for the cpi, now it is limited to the idea of reconstructing the variable or group of interest by the mean of the residuals, thus maybe the idea of the sampling (that existed via the Modified RF) is also interesting to push in the next steps?

@jpaillard
Copy link
Collaborator Author

Thx for the comments @achamma723. I made the modifications you suggested. Indeed it could be worth adding the sampling from nodes of the RF. Maybe we could tackle this in a following PR ?

@jpaillard
Copy link
Collaborator Author

Ready to merge on my side @bthirion

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're converging. There are just a few missing docstrings and we should include tests for all lines.

hidimstat/cpi.py Show resolved Hide resolved
hidimstat/cpi.py Show resolved Hide resolved
hidimstat/cpi.py Show resolved Hide resolved
hidimstat/loco.py Show resolved Hide resolved
hidimstat/loco.py Show resolved Hide resolved
hidimstat/permutation_importance.py Show resolved Hide resolved
@jpaillard
Copy link
Collaborator Author

I added the additional tests and docstring @bthirion

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The last pending thing is the ValueError("fit must be called before predict") to be tested, but this could be handled in a forthcoming PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants