Add bias detection to preprocessing #690

Lilly-May · 2024-04-13T07:36:23Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked (Bias detection module #647)
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated

Description of changes

This is a very drafty PR to get the discussion on the bias module started. The general idea is to have one method that calls several submethods:

Pairwise correlations between features
Standardized mean differences for all features between groups in sensitive features
Feature importances (Is a sensitive feature an important predictor of another feature?)

I thought the usage to be something along the lines of:

# adata is expected to be encoded
ep.pp.bias_detection(adata, sensitive_features=["ethnicity", "sex"])

I also made some necessary adjustments on the feature importances method. Firstly, I changed the default model from regression to rf (Random Forest) as I found this to be a lot more reliable when testing it for the bias detection, so I believe this should be the default. I also added two parameters to include the options to not log during the calculation of feature importances and to return the prediction score (R2 or accuracy).

Discussion points

Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method since otherwise the submethods (e.g., the correlations one) are quite short, and also we would regenerate a pandas DataFrame from the AnnData several times.
It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?
How de we call the big method computing all different measurements? generate_bias_report? bias_detection?
How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?
How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) think if it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.

ToDos

Resolve ToDos in the code
Look into and adjust the code with respect to differences for continuous and categorical features (waiting on Add var type retrospectively #637 for that)
Add tests
Add examples
Save standardized mean differences in uns subdict
Raise error if data are not already encoded
Introduce copy parameter
Fix test warnings

Zethson · 2024-04-14T13:30:35Z

Thank you so much!

Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method since otherwise the submethods (e.g., the correlations one) are quite short, and also we would regenerate a pandas DataFrame from the AnnData several times.

If all of these are fast and don't have too many configurable parameters, it's easier to have a one-stop function. There are ways to circumvent the need to regenerate the Pandas DataFrame such as class variables, cached properties, and other ideas but let's assume that we don't need that for now.

It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

Yeah, that sounds fair to me. Ideally, the fewer parameters the better. That's a rule for every function that we implement. Plotting functions can be exceptions.

How de we call the big method computing all different measurements? generate_bias_report? bias_detection?

detect_bias? When I hear report I always think of an actual HTML/PDF report...

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them and not just the ones that we find potential biases for. We just calculate and the interpretation/usage is up to the user (we provide the tools to do so). So yes varm.

How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) think if it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.

I veto HTML reports because they're annoying to interpret and it'd be much cooler to generate whole reports for this including the cohort tracking etc in the future. This would be an entirely different subproject.
One learning of mine is that Rich tables are nice, but sometimes troublesome and cannot be resized. Therefore, I always prefer DataFrames over Rich tables and plots over no plots. Ideally, this function could return a summary table and store extensive results in the varm slots?

Zethson

Generally, does this function need to differentiate a bit more between the different feature types that we're also discussing?

Overall, I think that this is quite good. We'll see whether we can integrate more of the other functions into this, but I think that this function is going into the right direction for sure.

ehrapy/preprocessing/_bias.py

ehrapy/preprocessing/_imputation.py

ehrapy/preprocessing/_bias.py

Lilly-May · 2024-04-15T09:20:50Z

Thanks for the review @Zethson!

I'll summarize the main things I'm going to change here:

Put everything into one method
Introduce sensitive_features="all" option
Rename method to detect_bias
Save all calculations in the adata
Return a summary table with detected biases exceeding the respective threshold specified as parameter (I have to think a bit more about how I could summarize everything in ideally one pandas dataframe)

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

Zethson · 2024-04-15T09:46:16Z

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

I don't think that we need to offer more plots. I'd much rather consider integrating this with @eroell cohort tracking or TableOne down the road.

ehrapy/preprocessing/_bias.py

eroell · 2024-04-15T13:35:18Z

Cool!

Thoughts on the implementing side of things:

Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

This function will likely be very verbose with many arguments. It is probably easier to have argument defaults consistent, and not interacting with each other? :)

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them and not just the ones that we find potential biases for. We just calculate and the interpretation/usage is up to the user (we provide the tools to do so). So yes varm.

Hm. A bit more involved, no?
The correlation is a n_var x n_var matrix; but only if we compute a correlation also between (categorical, categorical) and (categorical, numerical), for which we have not paved the way yet right

The feature importance is part of .var if I got this right, since calling a "native" ehrapy function this would be naturally accounted for indeed :)

The standardized mean difference: this is quite a lot going into the direction of rank_features_groups - there, we'd even have things for categoricals/continuous stuff already, and cool test statistics, and so on. Do we want yet another one here? If smd is important, it maybe even should be in rank_features_groups, and from there be called here I think
But to not slow things here down too much, we can also go ahead keeping it here as you suggested - and take care of fancier things later :)

Zethson · 2024-04-15T13:38:15Z

@eroell you are correct :) Thanks!

ehrapy/preprocessing/_bias.py

# Conflicts: # ehrapy/preprocessing/__init__.py # ehrapy/tools/feature_ranking/_feature_importances.py

for more information, see https://pre-commit.ci

Zethson

Lovely! A few minor comments. Thank you very much!

ehrapy/preprocessing/_bias.py

ehrapy/tools/feature_ranking/_feature_importances.py

Co-authored-by: Lukas Heumos <[email protected]>

ehrapy/preprocessing/_bias.py

eroell

It does what it outlines, checked out a bit also with other data than the mimicII.
Neat, does a lot of summary statistics at once really!

Co-authored-by: Eljas Roellin <[email protected]>

for more information, see https://pre-commit.ci

Lilly-May added 4 commits April 11, 2024 14:45

Added correlation calculation

744216d

Standard. Mean Differences

a1e6b2a

Added feature importances

68b1104

Doc string improvements

0536586

github-actions bot added the enhancement New feature or request label Apr 13, 2024

Added correlations parameter

c41ad45

Lilly-May requested a review from Zethson April 13, 2024 07:45

Zethson requested a review from eroell April 14, 2024 13:03

Zethson reviewed Apr 14, 2024

View reviewed changes

Merge branch 'main' into feature/bias_detection

7233f96

eroell reviewed Apr 15, 2024

View reviewed changes

ehrapy/preprocessing/_bias.py Outdated Show resolved Hide resolved

ehrapy/preprocessing/_bias.py Outdated Show resolved Hide resolved

ehrapy/preprocessing/_bias.py Outdated Show resolved Hide resolved

Lilly-May added 5 commits April 15, 2024 15:47

PR Revisions

97b004b

Added categorical value count calculation

778c0c3

Added first test

7ad07ec

docs clarifications

7d483a3

Test improvements

138860b

Zethson reviewed Apr 17, 2024

View reviewed changes

ehrapy/preprocessing/_bias.py Outdated Show resolved Hide resolved

Lilly-May mentioned this pull request Apr 22, 2024

Unify feature type annotations #697

Merged

5 tasks

Lilly-May and others added 7 commits April 25, 2024 10:10

Merge branch 'main' into feature/bias_detection

22f45ef

# Conflicts: # ehrapy/preprocessing/__init__.py # ehrapy/tools/feature_ranking/_feature_importances.py

Incorporate feature type detection

c0bdcb1

Finished tests

031808d

SMD improvements

a863306

Merge branch 'main' into feature/bias_detection

cd44284

Merge branch 'main' into feature/bias_detection

efe6885

Test fixes

eea9772

Merge branch 'main' into feature/bias_detection

ed9d8be

github-actions bot added the chore label May 1, 2024

pre-commit-ci bot and others added 9 commits May 1, 2024 08:43

[pre-commit.ci] auto fixes from pre-commit.com hooks

5347c35

for more information, see https://pre-commit.ci

Save SMD in uns subdict

f1f4b4d

Fix tests and silence test warnings

6688bf6

Introduced copy parameter

381b8b1

Added encoding check

3ff2c65

Fixed sensitive_features dtype

bcfe3a4

Feature importances return docstring

b11f5ea

Improved docs explanations

e1aaaae

Sort feature importances results

c1d3916

Lilly-May removed the chore label May 2, 2024

Lilly-May marked this pull request as ready for review May 2, 2024 10:24

Lilly-May requested review from Zethson and eroell May 2, 2024 10:24

Zethson approved these changes May 2, 2024

View reviewed changes

Apply suggestions from code review

f2d11f8

Co-authored-by: Lukas Heumos <[email protected]>

github-actions bot added the chore label May 2, 2024

Lilly-May mentioned this pull request May 2, 2024

Distinguish between ordinal and nominal categorical feature types #713

Open

Review comments

2e8d630

eroell reviewed May 3, 2024

View reviewed changes

ehrapy/preprocessing/_bias.py Outdated Show resolved Hide resolved

eroell reviewed May 3, 2024

View reviewed changes

ehrapy/preprocessing/_bias.py Show resolved Hide resolved

doc formating

9d8b74e

eroell reviewed May 3, 2024

View reviewed changes

Lilly-May and others added 3 commits May 4, 2024 11:43

Apply suggestions from code review

daef606

Co-authored-by: Eljas Roellin <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c22ee85

for more information, see https://pre-commit.ci

Fixed error raising

5ec7f8a

Lilly-May merged commit 1b5165d into main May 4, 2024
17 checks passed

Lilly-May mentioned this pull request May 4, 2024

Bias detection module #647

Closed

11 tasks

Zethson deleted the feature/bias_detection branch June 3, 2024 20:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add bias detection to preprocessing #690

Add bias detection to preprocessing #690

Lilly-May commented Apr 13, 2024 •

edited

Loading

Zethson commented Apr 14, 2024

Zethson left a comment

Lilly-May commented Apr 15, 2024 •

edited

Loading

Zethson commented Apr 15, 2024

eroell commented Apr 15, 2024

Zethson commented Apr 15, 2024

Zethson left a comment

eroell left a comment

Add bias detection to preprocessing #690

Add bias detection to preprocessing #690

Conversation

Lilly-May commented Apr 13, 2024 • edited Loading

Zethson commented Apr 14, 2024

Zethson left a comment

Choose a reason for hiding this comment

Lilly-May commented Apr 15, 2024 • edited Loading

Zethson commented Apr 15, 2024

eroell commented Apr 15, 2024

Zethson commented Apr 15, 2024

Zethson left a comment

Choose a reason for hiding this comment

eroell left a comment

Choose a reason for hiding this comment

Lilly-May commented Apr 13, 2024 •

edited

Loading

Lilly-May commented Apr 15, 2024 •

edited

Loading