Unify feature type annotations #697

Lilly-May · 2024-04-18T14:26:55Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked (contributes to Harmonize feature type detection #701)
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated

Description of changes
Currently, we're facing a situation where information about feature types (categorical vs. numerical vs. date) is being retrieved and stored in multiple ways within ehrapy. We want to unify this and store the information once in adata.var.

What I did so far:

Added the method ep.ad.infer_feature_types that guesses the type of each feature and prompts the user to check and fix these annotations.
Added a decorator check_feature_types that checks if the feature annotation is present in adata.var. If not, it raises an error prompting the user to run ep.ad.infer_feature_types.
The ep.tl.rank_features_supervised and several imputation methods now use this decorator, meaning they won't run unless the feature types are specified beforehand.
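For readers skimming the thread, here is a minimal sketch of how such a guard decorator could look. It is not the code from this PR: `FakeAnnData`, `count_numeric_features`, and the `"feature_type"` column name are assumptions made up for illustration, and `var` is modelled as a plain dict rather than a pandas DataFrame so that the example runs without dependencies.

```python
from dataclasses import dataclass, field
from functools import wraps

FEATURE_TYPE_KEY = "feature_type"  # assumed column name, not confirmed in this thread


@dataclass
class FakeAnnData:
    """Stand-in for AnnData: `var` maps a column name to {feature: value}."""

    var: dict = field(default_factory=dict)


def check_feature_types(func):
    """Refuse to run `func` unless feature types are annotated in `adata.var`."""

    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        # `in` also checks column names when `var` is a pandas DataFrame.
        if FEATURE_TYPE_KEY not in adata.var:
            raise ValueError(
                "Feature types are not specified in adata.var. "
                "Please run ep.ad.infer_feature_types(adata) first."
            )
        return func(adata, *args, **kwargs)

    return wrapper


@check_feature_types
def count_numeric_features(adata):
    # Hypothetical downstream method that relies on the annotation.
    types = adata.var[FEATURE_TYPE_KEY]
    return sum(1 for t in types.values() if t == "numeric")
```

Calling `count_numeric_features` on an object without the annotation raises the error immediately, which is the "won't run unless the feature types are specified" behaviour described above.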

Discussion points

1. The main discussion point is whether we want to force users to specify the feature types before encoding the AnnData. So far, I added a code section to the encoding method that uses the feature type annotation from ep.ad.infer_feature_types if present, but it doesn't require the user to have run that before, i.e. it doesn't use the check_feature_types decorator.
2. Related to 1.: essentially, we need to decide how drastically we want to change ehrapy. Do we completely omit the EHRAPY_TYPE_KEY annotation we currently use and solely save the new annotation from ep.ad.infer_feature_types, which would also be used by all downstream methods? This would simplify the encoding method a lot, and make downstream methods more reliable, as they wouldn't rely on the feature type guesses made by the encoding method. However, it's also quite a big change, as feature types might be detected differently from what was detected before.
3. How do we deal with dates? E.g. when reading data from a csv, dates are saved in obs instead of X. Is that something we want to default to, i.e. should we test that X doesn't contain dates? If not, how do we deal with them e.g. during imputation?

ToDos

Improve printing: ep.ad.type_overview is hard-coded for the previous feature type annotations, so I'll have to adapt it before it can be used with the new ep.ad.infer_feature_types method.

Zethson · 2024-04-18T14:44:24Z

Thank you very much!

The main discussion point is whether we want to force users to specify the feature types before encoding the AnnData. So far, I added a code section to the encoding method that uses the feature type annotation from ep.ad.infer_feature_types if present, but it doesn't require the user to have run that before, i.e. it doesn't use the check_feature_types decorator.

I would probably try to go for a consistent experience. In other words, all methods that require this information check whether it's present and tell the user to fix it as discussed. I don't see why the encoding should be an exception here or handle it differently.

Related to 1.: essentially, we need to decide how drastically we want to change ehrapy. Do we completely omit the EHRAPY_TYPE_KEY annotation we currently use and solely save the new annotation from ep.ad.infer_feature_types, which would also be used by all downstream methods? This would simplify the encoding method a lot, and make downstream methods more reliable, as they wouldn't rely on the feature type guesses made by the encoding method. However, it's also quite a big change, as feature types might be detected differently from what was detected before.

Yeah so:

We could try to access the new locations and if they're not there, access the old locations, and only if both are missing ask users to add them by running the new function. This would force us to keep quite some boilerplate around.
YOLO it and just work with the new solution only. This would make our code simpler. As long as ehrapy isn't fully published yet and we're still making lots and lots of big changes anyways, I'd opt for that. But I can tell you that this can annoy users and I had people complain to me in the past that pertpy made too many breaking changes between versions.

If you think that a backwards compatible version wouldn't introduce too much boilerplate code, I'd be happy to hear more.
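To give a feel for how much boilerplate the backward-compatible option would involve, here is a rough sketch of the lookup order described in the first option. Both key strings are placeholders (the actual value of EHRAPY_TYPE_KEY is not stated in this thread), `get_feature_types` is a hypothetical helper, and `var` is again modelled as a plain dict.

```python
NEW_KEY = "feature_type"        # placeholder for the new annotation column
OLD_KEY = "ehrapy_column_type"  # placeholder for the value of EHRAPY_TYPE_KEY


def get_feature_types(var: dict) -> dict:
    """Return {feature: type}, preferring the new annotation over the old one."""
    if NEW_KEY in var:
        return var[NEW_KEY]
    if OLD_KEY in var:
        # Backward-compatible path: this branch is what option 1 would keep around.
        return var[OLD_KEY]
    raise ValueError(
        "No feature type annotation found. "
        "Please run ep.ad.infer_feature_types(adata) first."
    )
```

The "new solution only" option corresponds to deleting the middle branch; the real cost of option 1 is more likely in translating old type labels into the new vocabulary, which this sketch does not attempt.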

How do we deal with dates? E.g. when reading data from a csv, dates are saved in obs instead of X. Is that something we want to default to, i.e. should we test that X doesn't contain dates? If not, how do we deal with them e.g. during imputation?

In practice, I don't reaaaaaaaally see a reason to ever have dates in X but I don't think that we should disallow this completely. The csv reader was just a bit opinionated here. During imputation, they could probably just be treated as categoricals? You could argue that dates are a special case of categoricals. We're treating them slightly differently because Pandas and some other libraries treat them differently.
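One possible reading of "treated as categoricals" during imputation is most-frequent (mode) imputation, which works on dates exactly as it does on labels. The sketch below only illustrates that idea in plain Python; `impute_most_frequent` and the sample column are made up and do not reflect how ehrapy's imputation methods are implemented.

```python
from collections import Counter
from datetime import date


def impute_most_frequent(values):
    """Fill None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("Cannot impute a column without observed values.")
    mode, _ = Counter(observed).most_common(1)[0]
    return [mode if v is None else v for v in values]


# Hypothetical date column with one missing entry.
admission_dates = [date(2024, 4, 18), None, date(2024, 4, 18), date(2024, 4, 22)]
imputed = impute_most_frequent(admission_dates)
```

Nothing here depends on the values being dates, which is the sense in which dates can be seen as a special case of categoricals.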

Hope my answers help a bit. Please annoy me if not!

Lilly-May · 2024-04-19T11:01:50Z

Another question to discuss: Do we want to keep the infer_feature_types method in the anndata module or should I move it to preprocessing? Personally, I would move it to ep.preprocessing, especially if we eventually want this to be a standard step of the preprocessing pipeline.

Zethson · 2024-04-19T14:25:09Z

Another question to discuss: Do we want to keep the infer_feature_types method in the anndata module or should I move it to preprocessing? Personally, I would move it to ep.preprocessing, especially if we eventually want this to be a standard step of the preprocessing pipeline.

I'd probably keep it in ad but do not have a strong opinion. We can move it to preprocessing if you think it fits better there.

ehrapy/anndata/_constants.py

eroell · 2024-04-22T07:57:26Z

one more thing, while we are (@Lilly-May is) doing this right here:
so far, we consider continuous, and categorical variables; the categoricals, we consider in a very "nominal" fashion, that is one-hot encoding them etc.
Should we have explicit annotations for ordinal categorical data?

So far I think not, as treating this in analysis seems to boil down to either choosing a continuous integer-scaled perspective, or nominal class perspective, depending on analysis.

But wanted to throw that in here quickly
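The two perspectives mentioned above can be made concrete with a small plain-Python sketch. The `severity` feature, its level order, and both helper functions are hypothetical and only illustrate the difference between an integer-scaled view and a nominal (one-hot) view of the same column.

```python
SEVERITY_ORDER = ["mild", "moderate", "severe"]  # hypothetical ordinal feature


def encode_ordinal(values, order):
    """Integer-scaled view: each level becomes its rank in `order`."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def encode_one_hot(values, levels):
    """Nominal view: one indicator column per level, with no ordering implied."""
    return [[int(v == level) for level in levels] for v in values]


severity = ["mild", "severe", "moderate", "mild"]
as_ordinal = encode_ordinal(severity, SEVERITY_ORDER)
as_nominal = encode_one_hot(severity, SEVERITY_ORDER)
```

An explicit ordinal annotation would mainly serve to record the level order needed for the first encoding; the second one works without it.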

Lilly-May · 2024-04-22T09:11:02Z

Thanks for the review @eroell!

Should we have explicit annotations for ordinal categorical data?

That's a really good point. I think it comes down to whether we have downstream analyses that make use of this differentiation. I'll incorporate it into issue #701 but not deal with it in this PR.

Also, a general note: @Zethson and I agreed that this PR will merely introduce the feature type detection method so that I can move forward with the bias module (#690). After that is done, I'll come back to issue #701 and work on making the feature type detection and usage consistent for all of ehrapy.

Zethson

10/10

The emoticons are cute and I think it works well.

I thought that we had also discussed that we print warnings if things could not be inferred confidently, especially concerning the:

        elif np.all(i.is_integer() for i in col) and (
            (col.min() == 0 and np.all(np.sort(col.unique()) == np.arange(col.nunique())))
            or (col.min() == 1 and np.all(np.sort(col.unique()) == np.arange(1, col.nunique() + 1)))
        ):

part. WDYT?
I'd like to hear @eroell's final opinion on whether we should automatically call the ep.ad.infer_feature_types(adata, output=None) function when we're checking anyways, or whether our current design where people have to do it manually is better. It's safer, but just not quite as magical.
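For context, the quoted condition seems to ask whether a column holds whole numbers that run 0..k-1 or 1..k without gaps, i.e. whether it looks label-encoded. A standalone sketch of that check in plain Python follows; `looks_label_encoded` is a hypothetical helper, not a function from this PR. It uses the built-in `all`, because as far as I know `np.all` applied to a generator expression does not evaluate the individual elements.

```python
def looks_label_encoded(values):
    """True if all values are whole numbers forming 0..k-1 or 1..k without gaps."""
    # Built-in all() consumes the generator element by element.
    if not all(float(v).is_integer() for v in values):
        return False
    unique = sorted({int(v) for v in values})
    if not unique:
        return False
    start = unique[0]
    return start in (0, 1) and unique == list(range(start, start + len(unique)))
```

Under this check a 0/1 flag column counts as label-encoded, which is why warning on every such case would be noisy for datasets with many flag features.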

Lilly-May · 2024-04-22T13:46:50Z

Thank you for the review @Zethson!

I thought that we had also discussed that we print warnings if things could not be inferred confidently

Personally, I would only print the warnings if we call the function on the fly (i.e., without the user explicitly calling it). If the user specifically calls ep.ad.infer_feature_types, the tree showing the feature types and a message prompting the user to check these are printed anyway. Also, I think there would be a lot of warning messages if we print one for every 'uncertain' case. For example, this would be the case for all flag features (0/1) in the MIMIC-II dataset.

Zethson · 2024-04-22T13:48:34Z

With 'uncertain' I mostly meant cases where the values look like integers but are not ordered, so the column is probably not label-encoded, or something of that sort. But yes, let's not warn.

eroell · 2024-04-23T08:33:30Z

I'd like to hear @eroell's final opinion on whether we should automatically call the ep.ad.infer_feature_types(adata, output=None)

I also vote for users having to call that themselves; otherwise this would come as quite a surprise to many new users.
Having people call it, and widely use it in tutorials, seems more explicit.

One exception: not sure if we want our plug-and-play dataset-loaders to call that within them.

Another thing:
Do we want to take care of df_to_anndata here? Quite a lot of type inference going on there. Maybe missed your take on how to go about that thing :)

Lilly-May · 2024-04-23T16:05:07Z

One exception: not sure if we want our plug-and-play dataset-loaders to call that within them.

Do we want to take care of df_to_anndata here? Quite a lot of type inference going on there. Maybe missed your take on how to go about that thing :)

I've added both comments to #701 so that I'll look into and tackle these things in the next PR.

Lilly-May added 5 commits April 16, 2024 16:24

Added infer and check feature types methods

99b4e67

Added and tested decorator and adapted feature importances

a56c0ae

Added test cases and updated imputation

af92a20

Adapted encoding

d8ad3ff

Feature specifications output

b9977c2

github-actions bot added the enhancement New feature or request label Apr 18, 2024

Fix HVF test

39128fd

Added tree printing for inferred feature types

193286d

Lilly-May added 4 commits April 19, 2024 16:41

Notebook fixes

0628e1f

Merge branch 'main' into feature/feature_type_specification

e29f2c6

Fix feature importance test

817594b

Beautify tree

2fed039

Lilly-May mentioned this pull request Apr 20, 2024

Update notebooks to incorporate ep.ad.infer_feature_types theislab/ehrapy-tutorials#25

Merged

Base encoding on original feature types

b873b3b

Lilly-May mentioned this pull request Apr 20, 2024

Harmonize feature type detection #701

Closed

12 tasks

eroell reviewed Apr 22, 2024

Added to usage

47e9ad6

github-actions bot added the chore label Apr 22, 2024

Update logging message

23ac82b

Improved method description

46499a6

Lilly-May marked this pull request as ready for review April 22, 2024 10:06

Lilly-May removed the chore label Apr 22, 2024

Submodule update

c4c2464

github-actions bot added the chore label Apr 22, 2024

Zethson approved these changes Apr 22, 2024

Lilly-May and others added 2 commits April 22, 2024 16:06

PR Revisions

bee8124

[pre-commit.ci] auto fixes from pre-commit.com hooks

655b98a

Update submodule

ddb3b95

Extended method docs description

d0e59d2

Lilly-May removed the chore label Apr 23, 2024

Lilly-May merged commit 169a5bb into main Apr 23, 2024
17 checks passed

Zethson deleted the feature/feature_type_specification branch May 5, 2024 11:52
