Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify feature type annotations #697

Merged
merged 20 commits into from
Apr 23, 2024
Merged

Conversation

Lilly-May
Copy link
Collaborator

@Lilly-May Lilly-May commented Apr 18, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked (contributes to Harmonize feature type detection #701)
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
Currently, we're facing a situation where information about feature types (categorical vs. numerical vs. date) is being retrieved and stored in multiple ways within ehrapy. We want to unify this and store the information once in adata.var.

What I did so far:

  • Added the method ep.ad.infer_feature_types that guesses the type of each feature and prompts the user to check and fix these annotations.
  • Added a decorator check_feature_types that checks if the feature annotation is present in adata.var. If not, it raises an error prompting the user to run ep.ad.infer_feature_types.
  • The ep.tl.rank_features_supervised and several imputation methods now use this decorator, meaning they won't run unless the feature types are specified beforehand.

Discussion points

  1. The main discussion point is whether we want to force users to specify the feature types before encoding the AnnData. So far, I added a code section to the encoding method that uses the feature type annotation from ep.ad.infer_feature_types if present, but it doesn't require the user to have run that before, i.e. is doesn't use the check_feature_types decorator.
  2. Related to 1. but essentially, we need to decide how drastic we want to change ehrapy. Do we completely omit the EHRAPY_TYPE_KEY annotation we currently use and solely save the new annotation from ep.ad.infer_feature_types, which would also be used by all downstream methods? This would simplify the encoding method a lot, and make downstream methods more reliable, as they wouldn't rely on the feature type guesses made by the encoding method. However, it's also quite a bit change as feature types might be detected differently from what was detected before.
  3. How do we deal with dates? E.g. when reading data from a csv, dates are saved in obs instead of X. Is that something we want to default to, i.e. should we test that X doesn't contain dates? If not, how do we deal with them e.g. during imputation?

ToDos

  • Improve printing (ep.ad.type_overview) is hard-coded for the previous feature type annotations, so I'll have to adapt that one in order to use it for the new ep.ad.infer_feature_types method.

@github-actions github-actions bot added the enhancement New feature or request label Apr 18, 2024
@Zethson
Copy link
Member

Zethson commented Apr 18, 2024

Thank you very much!

The main discussion point is whether we want to force users to specify the feature types before encoding the AnnData. So far, I added a code section to the encoding method that uses the feature type annotation from ep.ad.infer_feature_types if present, but it doesn't require the user to have run that before, i.e. is doesn't use the check_feature_types decorator.

I would probably try to go for a consistent experience. In other words, all methods that require this information check whether it's present and tell the user to fix it as discussed. I don't see why the encoding should be an exception here or handle it differently.

Related to 1. but essentially, we need to decide how drastic we want to change ehrapy. Do we completely omit the EHRAPY_TYPE_KEY annotation we currently use and solely save the new annotation from ep.ad.infer_feature_types, which would also be used by all downstream methods? This would simplify the encoding method a lot, and make downstream methods more reliable, as they wouldn't rely on the feature type guesses made by the encoding method. However, it's also quite a bit change as feature types might be detected differently from what was detected before.

Yeah so:

  1. We could try to access the new locations and if they're not there, access the old locations, and only if both are missing ask users to add them by running the new function. This would force us to keep quite some boilerplate around.
  2. YOLO it and just work with the new solution only. This would make our code simpler. As long as ehrapy isn't fully published yet and we're still making lots and lots of big changes anyways, I'd opt for that. But I can tell you that this can annoy users and I had people complain to me in the past that pertpy made too many breaking changes between versions.

If you think that a backwards compatible version wouldn't introduce too much boilerplate code, I'd be happy to hear more.

How do we deal with dates? E.g. when reading data from a csv, dates are saved in obs instead of X. Is that something we want to default to, i.e. should we test that X doesn't contain dates? If not, how do we deal with them e.g. during imputation?

In practice, I don't reaaaaaaaally see a reason to ever have dates in X but I don't think that we should disallow this completely. The csv reader was just a bit opinionated here. During imputation, they could probably just be treated as categoricals? You could argue that dates are a special case of categoricals. We're treating them slightly differently because Pandas and some other libraries treat them differently.

Hope my answers help a bit. Please annoy me if not!

@Lilly-May
Copy link
Collaborator Author

Another question to discuss: Do we want to keep the infer_feature_types method in the anndata module or should I move it to preprocessing? Personally, I would move it to ep.perprocessing - especially if we eventually want this to be a standard step of the preprocessing pipeline.

@Zethson
Copy link
Member

Zethson commented Apr 19, 2024

Another question to discuss: Do we want to keep the infer_feature_types method in the anndata module or should I move it to preprocessing? Personally, I would move it to ep.perprocessing - especially if we eventually want this to be a standard step of the preprocessing pipeline.

I'd probably keep it in ad but do not have a strong opinion. We can move it to preprocessing if you think it fits there better

@Lilly-May Lilly-May mentioned this pull request Apr 20, 2024
12 tasks
@eroell
Copy link
Collaborator

eroell commented Apr 22, 2024

one more thing, while we are (@Lilly-May is) doing this right here:
so far, we consider continuous, and categorical variables; the categoricals, we consider in a very "nominal" fashion, that is one-hot encoding them etc.
Should we have explicit annotations for ordinal categorical data?

So far I think not, as treating this in analysis seems to boil down to either choosing a continuous integer-scaled perspective, or nominal class perspective, depending on analysis.

But wanted to throw that in here quickly

@github-actions github-actions bot added the chore label Apr 22, 2024
@Lilly-May
Copy link
Collaborator Author

Thanks for the review @eroell!

Should we have explicit annotations for ordinal categorical data?

That's a really good point. I think it comes down to whether we have downstream analyses that make use of this differentiation. I'll incorporate it into issue #701 but not deal with it in this PR.

Also, a general note: @Zethson and I agreed that this PR will merely introduce the feature type detection method so that I can move forward with the bias module (#690). After that is done, I'll come back to issue #701 and work on making the feature type detection and usage consistent for all of ehrapy.

@Lilly-May Lilly-May marked this pull request as ready for review April 22, 2024 10:06
@Lilly-May Lilly-May removed the chore label Apr 22, 2024
Signed-off-by: zethson <[email protected]>
@github-actions github-actions bot added the chore label Apr 22, 2024
Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10/10

The emoticons are cute and I think it works well.

I thought that we had also discussed that we print warnings if things could not be inferred confidently, especially concerning the:

        elif np.all(i.is_integer() for i in col) and (
            (col.min() == 0 and np.all(np.sort(col.unique()) == np.arange(col.nunique())))
            or (col.min() == 1 and np.all(np.sort(col.unique()) == np.arange(1, col.nunique() + 1)))
        ):

part. WDYT?
I'd like to hear @eroell final opinion on whether we should automatically call the ep.ad.infer_feature_types(adata, output=None) function when we're checking anyways, or whether our current design where people have to do it manually is better. It's safer, but just not quite as magical.

ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
ehrapy/anndata/_feature_specifications.py Outdated Show resolved Hide resolved
tests/preprocessing/test_imputation.py Outdated Show resolved Hide resolved
@Lilly-May
Copy link
Collaborator Author

Thank you for the review @Zethson!

I thought that we had also discussed that we print warnings if things could not be inferred confidently

Personally, I would only print the warnings if we call the function on the fly (i.e., without the user explicitly calling it). If the user specifically calls ep.ad.infer_feature_types, the tree showing the feature types and a message prompting the user to check these are printed anyway. Also, I think there would be a lot of warning messages if we print one for all 'uncertain' cases, for example, this would be the case for all flag features (0/1) in the MIMIC-II dataset.

@Zethson
Copy link
Member

Zethson commented Apr 22, 2024

With uncertain I mostly meant cases where it looks like integers that are not ordered and is therefore probably not label encoded or something of that sort. But yes, let's not warn

@eroell
Copy link
Collaborator

eroell commented Apr 23, 2024

I'd like to hear @eroell final opinion on whether we should automatically call the ep.ad.infer_feature_types(adata, output=None)

Also vote on users having to call that - else this must seem like quite a surprise to many new users.
Having people call it, and widely use it in tutorials, seems more explicit.

One exception: not sure if we want our plug-and-play dataset-loaders to call that within them.

Another thing:
Do we want to take care of df_to_anndata here? Quite a lot of type inference going on there. Maybe missed your take on how to go about that thing :)

@Lilly-May
Copy link
Collaborator Author

One exception: not sure if we want our plug-and-play dataset-loaders to call that within them.

Do we want to take care of df_to_anndata here? Quite a lot of type inference going on there. Maybe missed your take on how to go about that thing :)

I've added both comments to #701 so that I'll look into and tackle these things in the next PR.

@Lilly-May Lilly-May removed the chore label Apr 23, 2024
@Lilly-May Lilly-May merged commit 169a5bb into main Apr 23, 2024
17 checks passed
Zethson added a commit that referenced this pull request Apr 24, 2024
* [pre-commit.ci] pre-commit autoupdate (#702)

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.3.7 → v0.4.1](astral-sh/ruff-pre-commit@v0.3.7...v0.4.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Unify feature type annotations (#697)

* Added infer and check feature types methods

* Added and tested decorator and adapted feature importances

* Added test cases and updated imputation

* Adapted encoding

* Feature specifications output

* Fix HVF test

* Added tree printing for inferred feature types

* Notebook fixes

* Fix feature importance test

* Beautify tree

* Base encoding on original feature types

* Added to usage

* Update logging message

* Improved method description

* Submodule update

Signed-off-by: zethson <[email protected]>

* PR Revisions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update submodule

* Extended method docs description

---------

Signed-off-by: zethson <[email protected]>
Co-authored-by: zethson <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add faiss backend for KNN imputation

Signed-off-by: zethson <[email protected]>

* Fix MIMIC-II notebook

Signed-off-by: zethson <[email protected]>

* Fix MIMIC-II notebook

Signed-off-by: zethson <[email protected]>

* Refactoring

Signed-off-by: zethson <[email protected]>

* Refactoring

Signed-off-by: zethson <[email protected]>

* Submodule

Signed-off-by: zethson <[email protected]>

---------

Signed-off-by: zethson <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lilly May <[email protected]>
@Zethson Zethson deleted the feature/feature_type_specification branch May 5, 2024 11:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants