-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify feature type annotations #697
Conversation
Thank you very much!
I would probably try to go for a consistent experience. In other words, all methods that require this information check whether it's present and tell the user to fix it as discussed. I don't see why the encoding should be an exception here or handle it differently.
Yeah so:
If you think that a backwards compatible version wouldn't introduce too much boilerplate code, I'd be happy to hear more.
In practice, I don't reaaaaaaaally see a reason to ever have dates in Hope my answers help a bit. Please annoy me if not! |
Another question to discuss: Do we want to keep the |
I'd probably keep it in |
one more thing, while we are (@Lilly-May is) doing this right here: So far I think not, as treating this in analysis seems to boil down to either choosing a continuous integer-scaled perspective, or nominal class perspective, depending on analysis. But wanted to throw that in here quickly |
Thanks for the review @eroell!
That's a really good point. I think it comes down to whether we have downstream analyses that make use of this differentiation. I'll incorporate it into issue #701 but not deal with it in this PR. Also, a general note: @Zethson and I agreed that this PR will merely introduce the feature type detection method so that I can move forward with the bias module (#690). After that is done, I'll come back to issue #701 and work on making the feature type detection and usage consistent for all of ehrapy. |
Signed-off-by: zethson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10/10
The emoticons are cute and I think it works well.
I thought that we had also discussed that we print warnings if things could not be inferred confidently, especially concerning the:
elif np.all(i.is_integer() for i in col) and (
(col.min() == 0 and np.all(np.sort(col.unique()) == np.arange(col.nunique())))
or (col.min() == 1 and np.all(np.sort(col.unique()) == np.arange(1, col.nunique() + 1)))
):
part. WDYT?
I'd like to hear @eroell final opinion on whether we should automatically call the ep.ad.infer_feature_types(adata, output=None)
function when we're checking anyways, or whether our current design where people have to do it manually is better. It's safer, but just not quite as magical.
Thank you for the review @Zethson!
Personally, I would only print the warnings if we call the function on the fly (i.e., without the user explicitly calling it). If the user specifically calls |
With uncertain I mostly meant cases where it looks like integers that are not ordered and is therefore probably not label encoded or something of that sort. But yes, let's not warn |
for more information, see https://pre-commit.ci
Also vote on users having to call that - else this must seem like quite a surprise to many new users. One exception: not sure if we want our plug-and-play dataset-loaders to call that within them. Another thing: |
I've added both comments to #701 so that I'll look into and tackle these things in the next PR. |
* [pre-commit.ci] pre-commit autoupdate (#702) updates: - [github.com/astral-sh/ruff-pre-commit: v0.3.7 → v0.4.1](astral-sh/ruff-pre-commit@v0.3.7...v0.4.1) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Unify feature type annotations (#697) * Added infer and check feature types methods * Added and tested decorator and adapted feature importances * Added test cases and updated imputation * Adapted encoding * Feature specifications output * Fix HVF test * Added tree printing for inferred feature types * Notebook fixes * Fix feature importance test * Beautify tree * Base encoding on original feature types * Added to usage * Update logging message * Improved method description * Submodule update Signed-off-by: zethson <[email protected]> * PR Revisions * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update submodule * Extended method docs description --------- Signed-off-by: zethson <[email protected]> Co-authored-by: zethson <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add faiss backend for KNN imputation Signed-off-by: zethson <[email protected]> * Fix MIMIC-II notebook Signed-off-by: zethson <[email protected]> * Fix MIMIC-II notebook Signed-off-by: zethson <[email protected]> * Refactoring Signed-off-by: zethson <[email protected]> * Refactoring Signed-off-by: zethson <[email protected]> * Submodule Signed-off-by: zethson <[email protected]> --------- Signed-off-by: zethson <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Lilly May <[email protected]>
PR Checklist
docs
is updatedDescription of changes
Currently, we're facing a situation where information about feature types (
categorical
vs.numerical
vs.date
) is being retrieved and stored in multiple ways within ehrapy. We want to unify this and store the information once inadata.var
.What I did so far:
ep.ad.infer_feature_types
that guesses the type of each feature and prompts the user to check and fix these annotations.check_feature_types
that checks if the feature annotation is present inadata.var
. If not, it raises an error prompting the user to runep.ad.infer_feature_types
.ep.tl.rank_features_supervised
and severalimputation
methods now use this decorator, meaning they won't run unless the feature types are specified beforehand.Discussion points
ep.ad.infer_feature_types
if present, but it doesn't require the user to have run that before, i.e. is doesn't use thecheck_feature_types
decorator.EHRAPY_TYPE_KEY
annotation we currently use and solely save the new annotation fromep.ad.infer_feature_types
, which would also be used by all downstream methods? This would simplify the encoding method a lot, and make downstream methods more reliable, as they wouldn't rely on the feature type guesses made by the encoding method. However, it's also quite a bit change as feature types might be detected differently from what was detected before.obs
instead of X. Is that something we want to default to, i.e. should we test that X doesn't contain dates? If not, how do we deal with them e.g. during imputation?ToDos
ep.ad.type_overview
) is hard-coded for the previous feature type annotations, so I'll have to adapt that one in order to use it for the newep.ad.infer_feature_types
method.