Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rank features groups obs #622

Merged
merged 19 commits into from
Dec 7, 2023
Merged

Conversation

eroell
Copy link
Collaborator

@eroell eroell commented Dec 6, 2023

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated (via docstrings)

Description of changes
Resolves #593.
As outlined in this issue, this PR adds an additional functionality for the user to use rank_features_groups: for the .obs field only (as opposed to using .X/.layers).
It does so by allowing the user to specify an additional argument rank_obs_columns in rank_features_groups. It is by default None, and takes the values "all" or a list of keys if only a selection of observation columns is wished/meaningful to be ranked.

Technical details
Uses move_to_x and encode (autodetect=True, encodings="label"). Makes no changes to the passed adata except for adding the respective entries in adata.uns.

Questions

  • What do you think in general?
  • Remarks on code?
  • What do you think of this new option for the user? More arguments? Better argument name suggestions?
  • Should we we want to take care of strange stuff (e.g. dates/times) or expect user to only specify columns that make sense & have it blow up if this is not done properly
  • More tests?

Additional context
Example:

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in .obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151, 158, 159, 160, 161],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80, 55, 56, 57, 58]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90, 80, 90, 120, 130],
                        [77, 76, 60, 90, 110, 78, 56, 76, 67, 82, 59, 81]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)

    adata_dummy.uns['numerical_columns'] = ['glucose', 'weight']
    adata_dummy.uns['non_numerical_columns'] = []
    adata_dummy.uns['encoded_non_numerical_columns'] = []
    
    return adata_dummy

adata_dummy = create_dummy_dataset_numerical_in_obs()

# classical way of using .X
ep.tl.rank_features_groups(adata_dummy, groupby='disease',)
ep.pl.rank_features_groups(adata_dummy)

image

# added way of using .obs
ep.tl.rank_features_groups(adata_dummy, groupby='disease', rank_obs_columns="all", key_added="rank_obs")
ep.pl.rank_features_groups(adata_dummy, key="rank_obs")

image

@eroell eroell requested a review from Zethson December 6, 2023 11:00
@eroell
Copy link
Collaborator Author

eroell commented Dec 6, 2023

@juzb tagging you here to check out the example described above - is this example matching the idea you described in #622? Or is there something else/more to the usecase you have in mind? :)

@juzb
Copy link

juzb commented Dec 6, 2023

@eroell this is exactly what I had in mind! Really like the None / all / some_list options too :)

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! A few minor comments.

Generally, I have the impression that you could leave out a few code comments. I myself comment a lot and am actually learning to reduce it a bit. Hence, I gave a few suggestions.

ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved
@Zethson
Copy link
Member

Zethson commented Dec 6, 2023

What do you think of this new option for the user? More arguments? Better argument name suggestions?

Perfectly fine.

Should we we want to take care of strange stuff (e.g. dates/times) or expect user to only specify columns that make sense & have it blow up if this is not done properly

Maybe we should check the dtype of the obs columns and shout at least if we saw dates/times?

More tests?

Only if we're handling date and time. Otherwise LGTM

@eroell
Copy link
Collaborator Author

eroell commented Dec 7, 2023

New version now can consider highly variable features in .X/.layers[layer], .obs, or both.

This introduces two new arguments:

  • field_to_rank: available options are 'layer' (default), 'obs' (what has been possible above), 'layer_and_obs' (new here)
  • columns_to_rank: default 'all', but can be dict with var_names and obs_names keys, e.g. {"var_names": ["glucose"], "obs_names": ["syst_bp_entry", "station"]}. The ability to select columns from .obs is probably important due to many columns likely not of interest (such as dates); Since var and obs is less strictly separated here than in single-cell though, having the column selection option for both cases probably makes sense, especially if combining them in an analysis this way...

But it might be that this has become overengineered, with more pitfalls of how it could raise Errors in a confusing way

@Zethson what do you think?
@juzb once more - what do you think of the examples below? More confusing to use than the example above?

Example

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151, 158, 159, 160, 161],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80, 55, 56, 57, 58]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90, 80, 90, 120, 130],
                        [77, 76, 60, 90, 110, 78, 56, 76, 67, 82, 59, 81]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)

    adata_dummy.uns['numerical_columns'] = ['glucose', 'weight']
    adata_dummy.uns['non_numerical_columns'] = []
    adata_dummy.uns['encoded_non_numerical_columns'] = []
    
    return adata_dummy

# classical way of using .X
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease')
ep.pl.rank_features_groups(adata_dummy)

image

# added way of using .obs
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease', field_to_rank="obs")
ep.pl.rank_features_groups(adata_dummy)

image

# added way of using both
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease', field_to_rank="layer_and_obs")
ep.pl.rank_features_groups(adata_dummy)

image

# added way of using e.g. both, with selecting only a some variables
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease',
    field_to_rank="layer_and_obs",
    columns_to_rank={"var_names": ["glucose"],
                                   "obs_names": ["syst_bp_entry", "station"]
                                   }
)

image

@eroell eroell requested a review from Zethson December 7, 2023 11:55
@eroell
Copy link
Collaborator Author

eroell commented Dec 7, 2023

Bonus - pre-commit now starts to report ruff errors it so far did only locally for me - but did not yesterday, and enforcing a style I think is not kept in other files, too? Don't mind adapting to what ruff requests for this file, but might get a bit out of sync with style in other files I think

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Let's ignore ruff for now. Maybe it sorts itself out. If not, I'll have a proper look

@eroell eroell marked this pull request as ready for review December 7, 2023 12:30
@eroell eroell marked this pull request as draft December 7, 2023 14:02
@eroell eroell marked this pull request as ready for review December 7, 2023 17:51
@Zethson Zethson merged commit 5447087 into theislab:main Dec 7, 2023
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rank feature groups should also be possible to be applied to obs
3 participants