
Add bias detection to preprocessing #690

Merged: 35 commits into main from feature/bias_detection on May 4, 2024

Conversation

Lilly-May
Collaborator

@Lilly-May Lilly-May commented Apr 13, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked (Bias detection module #647)
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

This is a rough draft PR to get the discussion on the bias module started. The general idea is to have one method that calls several submethods:

  • Pairwise correlations between features
  • Standardized mean differences for all features between groups in sensitive features
  • Feature importances (Is a sensitive feature an important predictor of another feature?)

I imagined the usage being something along the lines of:

# adata is expected to be encoded
ep.pp.bias_detection(adata, sensitive_features=["ethnicity", "sex"])

I also made some necessary adjustments to the feature importances method. Firstly, I changed the default model from regression to rf (random forest), as I found it to be a lot more reliable when testing it for bias detection, so I believe it should be the default. I also added two parameters: one to disable logging during the calculation of feature importances, and one to return the prediction score (R2 or accuracy).
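
For illustration, a hypothetical call with the adjusted defaults might look like this (the function name follows the module path, and the parameter names logging and return_score are placeholders, not necessarily the final API):

# Hypothetical usage sketch; names are illustrative only
score = ep.tl.feature_importances(
    adata,
    predicted_feature="glucose",  # placeholder feature
    model="rf",                   # new default: random forest instead of regression
    logging=False,                # new option: silence logging during the calculation
    return_score=True,            # new option: return R2 (regression) or accuracy (classification)
)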

Discussion points

  1. Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method? Otherwise the submethods (e.g., the correlations one) are quite short, and we would also regenerate a pandas DataFrame from the AnnData several times.
  2. It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?
  3. How do we call the big method computing all the different measurements? generate_bias_report? bias_detection?
  4. How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature (see the sketch after this list). Do we want that? Or do we only save those for which we find potential biases?
  5. How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) consider whether it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.
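
To make point 4 above concrete, here is a minimal sketch of one common definition of the standardized mean difference (Cohen's d with pooled variance, comparing each group against the rest); the actual implementation in this PR may differ:

import numpy as np
import pandas as pd

def standardized_mean_differences(df: pd.DataFrame, sensitive_feature: str) -> pd.DataFrame:
    # Returns an n_groups_in_feature x n_vars DataFrame:
    # one row per group, one column per numeric feature
    numeric = df.select_dtypes("number")
    rows = {}
    for group, idx in df.groupby(sensitive_feature).groups.items():
        in_group = numeric.loc[idx]
        rest = numeric.drop(index=idx)
        pooled_std = np.sqrt((in_group.var() + rest.var()) / 2)
        rows[group] = (in_group.mean() - rest.mean()) / pooled_std
    return pd.DataFrame(rows).T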

ToDos

  • Resolve ToDos in the code
  • Look into and adjust the code with respect to differences for continuous and categorical features (waiting on Add var type retrospectively #637 for that)
  • Add tests
  • Add examples
  • Save standardized mean differences in uns subdict
  • Raise error if data are not already encoded
  • Introduce copy parameter
  • Fix test warnings

@github-actions github-actions bot added the enhancement New feature or request label Apr 13, 2024
@Lilly-May Lilly-May requested a review from Zethson April 13, 2024 07:45
@Zethson Zethson requested a review from eroell April 14, 2024 13:03
@Zethson
Member

Zethson commented Apr 14, 2024

Thank you so much!

Do we make the submethods (_feature_correlations and _standardized_mean_differences) public as well? If not, should I move everything into one big method? Otherwise the submethods (e.g., the correlations one) are quite short, and we would also regenerate a pandas DataFrame from the AnnData several times.

If all of these are fast and don't have too many configurable parameters, it's easier to have a one-stop function. There are ways to circumvent the need to regenerate the pandas DataFrame, such as class variables, cached properties, and other ideas, but let's assume that we don't need that for now.
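
As an illustration of the cached-property idea (a minimal sketch; it assumes ehrapy's anndata_to_df as the AnnData-to-DataFrame converter):

from functools import cached_property

import ehrapy as ep

class _BiasDetection:
    def __init__(self, adata):
        self._adata = adata

    @cached_property
    def df(self):
        # Built once on first access, then reused by every submethod
        return ep.ad.anndata_to_df(self._adata)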

It would be great to offer a more explorative way to detect potentially unknown sensitive features and biases in the data. Basically, we would have the option sensitive_features="all". However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

Yeah, that sounds fair to me. Ideally, the fewer parameters the better. That's a rule for every function that we implement. Plotting functions can be exceptions.

How do we call the big method computing all the different measurements? generate_bias_report? bias_detection?

detect_bias? When I hear report I always think of an actual HTML/PDF report...

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them, not just the ones that we find potential biases for. We just calculate, and the interpretation/usage is up to the user (we provide the tools to do so). So yes, varm.

How do we show the output? Should we print out a table (similar to scCODA in pertpy)? We could also (but this might be a weird idea) consider whether it would be possible to generate an (additional) HTML report which already includes plots. But that would be quite different from the rest of the API, so I don't know if we want to go for that.

I veto HTML reports because they're annoying to interpret, and it'd be much cooler to generate whole reports for this, including the cohort tracking etc., in the future. This would be an entirely different subproject.
One learning of mine is that Rich tables are nice but sometimes troublesome and cannot be resized. Therefore, I always prefer DataFrames over Rich tables, and plots over no plots. Ideally, this function could return a summary table and store extensive results in the varm slots?
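
A minimal sketch of that convention, continuing the SMD example from above (the varm key and the 0.1 threshold are assumptions for illustration):

# smd is the n_groups x n_vars DataFrame from the earlier sketch
adata.varm["smd_ethnicity"] = smd.T.to_numpy()  # varm expects n_vars rows

summary = (
    smd.abs().stack()
    .loc[lambda s: s > 0.1]  # flag |SMD| above an (assumed) threshold
    .rename("smd")
    .rename_axis(["group", "feature"])
    .reset_index()
)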

Member

@Zethson Zethson left a comment


Generally, does this function need to differentiate a bit more between the different feature types that we're also discussing?

Overall, I think that this is quite good. We'll see whether we can integrate more of the other functions into this, but I think this function is definitely heading in the right direction.

(Nine inline review comments on ehrapy/preprocessing/_bias.py and ehrapy/preprocessing/_imputation.py, all resolved.)
@Lilly-May
Collaborator Author

Lilly-May commented Apr 15, 2024

Thanks for the review @Zethson!

I'll summarize the main things I'm going to change here:

  • Put everything into one method
  • Introduce sensitive_features="all" option
  • Rename method to detect_bias
  • Save all calculations in the adata
  • Return a summary table with detected biases exceeding the respective thresholds specified as parameters (I have to think a bit more about how to summarize everything, ideally in one pandas DataFrame)

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

@Zethson
Member

Zethson commented Apr 15, 2024

One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input?

I don't think that we need to offer more plots. I'd much rather consider integrating this with @eroell's cohort tracking or TableOne down the road.

(Three inline review comments on ehrapy/preprocessing/_bias.py, resolved.)
@eroell
Collaborator

eroell commented Apr 15, 2024

Cool!

Thoughts on the implementation side of things:

Maybe we should add a parameter run_feature_importances that is set to False by default when sensitive_features="all" and set to True by default when sensitive_features is a list of features?

This function will likely end up quite verbose, with many arguments. It is probably easier to keep argument defaults consistent and not have them interact with each other? :)

How do we save the results in the AnnData? We will likely need them for downstream plot generation. But for example, for the standardized mean difference, we have one n_groups_in_feature x var_n matrix per investigated sensitive feature, so we would have one entry in varm per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?

I'd save all of them, not just the ones that we find potential biases for. We just calculate, and the interpretation/usage is up to the user (we provide the tools to do so). So yes, varm.

Hm. A bit more involved, no?
The correlation is an n_var x n_var matrix; but only if we also compute correlations between (categorical, categorical) and (categorical, numerical) pairs, for which we have not paved the way yet, right? (One option is sketched at the end of this comment.)

The feature importance goes into .var if I got this right; since it calls a "native" ehrapy function, this is naturally accounted for indeed :)

The standardized mean difference: this goes quite far in the direction of rank_features_groups. There, we'd even have handling for categorical/continuous features already, cool test statistics, and so on. Do we want yet another one here? If SMD is important, it should maybe even live in rank_features_groups and be called from there, I think.
But to not slow things down too much here, we can also go ahead and keep it here as you suggested, and take care of fancier things later :)
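
For the (categorical, categorical) case mentioned above, one standard option would be Cramér's V; a minimal sketch (not part of this PR):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    # Association strength between two categorical variables, in [0, 1]
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.to_numpy().sum()
    min_dim = min(contingency.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))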

@Zethson
Member

Zethson commented Apr 15, 2024

@eroell you are correct :) Thanks!

@Lilly-May Lilly-May mentioned this pull request Apr 22, 2024
@github-actions github-actions bot added the chore label May 1, 2024
@Lilly-May Lilly-May removed the chore label May 2, 2024
@Lilly-May Lilly-May marked this pull request as ready for review May 2, 2024 10:24
@Lilly-May Lilly-May requested review from Zethson and eroell May 2, 2024 10:24
Member

@Zethson Zethson left a comment


Lovely! A few minor comments. Thank you very much!

(Six inline review comments on ehrapy/preprocessing/_bias.py and ehrapy/tools/feature_ranking/_feature_importances.py, resolved.)
Collaborator

@eroell eroell left a comment


It does what it outlines; I also checked it out a bit with data other than MIMIC-II.
Neat, it really computes a lot of summary statistics at once!

@Lilly-May Lilly-May merged commit 1b5165d into main May 4, 2024
17 checks passed
@Lilly-May Lilly-May mentioned this pull request May 4, 2024
@Zethson Zethson deleted the feature/bias_detection branch June 3, 2024 20:48