Rank features groups obs #622

eroell · 2023-12-06T11:00:16Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated (via docstrings)

Description of changes
Resolves #593.
As outlined in this issue, this PR adds an additional functionality for the user to use rank_features_groups: for the .obs field only (as opposed to using .X/.layers).
It does so by allowing the user to specify an additional argument rank_obs_columns in rank_features_groups. It is by default None, and takes the values "all" or a list of keys if only a selection of observation columns is wished/meaningful to be ranked.

Technical details
Uses move_to_x and encode (autodetect=True, encodings="label"). Makes no changes to the passed adata except for adding the respective entries in adata.uns.

Questions

What do you think in general?
Remarks on code?
What do you think of this new option for the user? More arguments? Better argument name suggestions?
Should we we want to take care of strange stuff (e.g. dates/times) or expect user to only specify columns that make sense & have it blow up if this is not done properly
More tests?

Additional context
Example:

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in .obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151, 158, 159, 160, 161],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80, 55, 56, 57, 58]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90, 80, 90, 120, 130],
                        [77, 76, 60, 90, 110, 78, 56, 76, 67, 82, 59, 81]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)

    adata_dummy.uns['numerical_columns'] = ['glucose', 'weight']
    adata_dummy.uns['non_numerical_columns'] = []
    adata_dummy.uns['encoded_non_numerical_columns'] = []
    
    return adata_dummy

adata_dummy = create_dummy_dataset_numerical_in_obs()

# classical way of using .X
ep.tl.rank_features_groups(adata_dummy, groupby='disease',)
ep.pl.rank_features_groups(adata_dummy)

# added way of using .obs
ep.tl.rank_features_groups(adata_dummy, groupby='disease', rank_obs_columns="all", key_added="rank_obs")
ep.pl.rank_features_groups(adata_dummy, key="rank_obs")

eroell · 2023-12-06T11:06:16Z

@juzb tagging you here to check out the example described above - is this example matching the idea you described in #622? Or is there something else/more to the usecase you have in mind? :)

juzb · 2023-12-06T11:16:48Z

@eroell this is exactly what I had in mind! Really like the None / all / some_list options too :)

Zethson

Awesome! A few minor comments.

Generally, I have the impression that you could leave out a few code comments. I myself comment a lot and am actually learning to reduce it a bit. Hence, I gave a few suggestions.

ehrapy/tools/feature_ranking/_rank_features_groups.py

Zethson · 2023-12-06T11:32:23Z

What do you think of this new option for the user? More arguments? Better argument name suggestions?

Perfectly fine.

Should we we want to take care of strange stuff (e.g. dates/times) or expect user to only specify columns that make sense & have it blow up if this is not done properly

Maybe we should check the dtype of the obs columns and shout at least if we saw dates/times?

More tests?

Only if we're handling date and time. Otherwise LGTM

Co-authored-by: Lukas Heumos <[email protected]>

eroell · 2023-12-07T11:55:09Z

New version now can consider highly variable features in .X/.layers[layer], .obs, or both.

This introduces two new arguments:

field_to_rank: available options are 'layer' (default), 'obs' (what has been possible above), 'layer_and_obs' (new here)
columns_to_rank: default 'all', but can be dict with var_names and obs_names keys, e.g. {"var_names": ["glucose"], "obs_names": ["syst_bp_entry", "station"]}. The ability to select columns from .obs is probably important due to many columns likely not of interest (such as dates); Since var and obs is less strictly separated here than in single-cell though, having the column selection option for both cases probably makes sense, especially if combining them in an analysis this way...

But it might be that this has become overengineered, with more pitfalls of how it could raise Errors in a confusing way

@Zethson what do you think?
@juzb once more - what do you think of the examples below? More confusing to use than the example above?

Example

import ehrapy as ep
import numpy as np
import pandas as pd
import scanpy as sc

def create_dummy_dataset_numerical_in_obs():
    """
    Create a dummy dataset with numerical and non-numerical variables in obs.
    Also, has numerical variables in .X.
    """
    dummy_obs = {"disease":['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                "station": ['ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU', 'ICU', 'ICU', 'MICU', 'MICU'],
                "syst_bp_entry": [138, 139, 140, 141, 148, 149, 150, 151, 158, 159, 160, 161],
                "diast_bp_entry": [78, 79, 80, 81, 77, 78, 79, 80, 55, 56, 57, 58]}

    dummy_var = pd.DataFrame({"Unit": ["mg/dl", "kg"]})
    dummy_var.index = ['glucose', "weight"]
    dummy_X = np.array([[80, 90, 120, 130, 80, 130, 120, 90, 80, 90, 120, 130],
                        [77, 76, 60, 90, 110, 78, 56, 76, 67, 82, 59, 81]]).T

    adata_dummy = sc.AnnData(X=dummy_X, obs=dummy_obs, var=dummy_var)

    adata_dummy.uns['numerical_columns'] = ['glucose', 'weight']
    adata_dummy.uns['non_numerical_columns'] = []
    adata_dummy.uns['encoded_non_numerical_columns'] = []
    
    return adata_dummy

# classical way of using .X
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease')
ep.pl.rank_features_groups(adata_dummy)

# added way of using .obs
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease', field_to_rank="obs")
ep.pl.rank_features_groups(adata_dummy)

# added way of using both
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease', field_to_rank="layer_and_obs")
ep.pl.rank_features_groups(adata_dummy)

# added way of using e.g. both, with selecting only a some variables
adata_dummy = create_dummy_dataset_numerical_in_obs()
ep.tl.rank_features_groups(adata_dummy, groupby='disease',
    field_to_rank="layer_and_obs",
    columns_to_rank={"var_names": ["glucose"],
                                   "obs_names": ["syst_bp_entry", "station"]
                                   }
)

eroell · 2023-12-07T11:59:15Z

Bonus - pre-commit now starts to report ruff errors it so far did only locally for me - but did not yesterday, and enforcing a style I think is not kept in other files, too? Don't mind adapting to what ruff requests for this file, but might get a bit out of sync with style in other files I think

Zethson

Great! Let's ignore ruff for now. Maybe it sorts itself out. If not, I'll have a proper look

ehrapy/tools/feature_ranking/_rank_features_groups.py

Co-authored-by: Lukas Heumos <[email protected]>

eroell added 5 commits December 6, 2023 10:11

tests for rank features groups with obs

0c52f13

first drafted feature ranking using obs

dd022ec

fixed encoding names

a818728

Merge branch 'main' into rank-features-groups-obs

0790a80

remove comment

adaa53b

eroell requested a review from Zethson December 6, 2023 11:00

Zethson requested changes Dec 6, 2023

View reviewed changes

eroell and others added 11 commits December 6, 2023 12:46

Remove comment

37c6a9a

Co-authored-by: Lukas Heumos <[email protected]>

Remove comment

06c2a00

Co-authored-by: Lukas Heumos <[email protected]>

Remove comment

6fa3de1

Co-authored-by: Lukas Heumos <[email protected]>

Update ehrapy/tools/feature_ranking/_rank_features_groups.py

458520e

Co-authored-by: Lukas Heumos <[email protected]>

Remove comment

210bac6

Co-authored-by: Lukas Heumos <[email protected]>

Remove comment

ffadf71

Co-authored-by: Lukas Heumos <[email protected]>

Update ehrapy/tools/feature_ranking/_rank_features_groups.py

f0f4867

Co-authored-by: Lukas Heumos <[email protected]>

Iterable to list and import from future

02148f5

no expensive copy

a6f5606

upated to use layer, obs, or both

bf02fff

this test data should be more stable

213d8d5

eroell requested a review from Zethson December 7, 2023 11:55

Zethson approved these changes Dec 7, 2023

View reviewed changes

ehrapy/tools/feature_ranking/_rank_features_groups.py Outdated Show resolved Hide resolved

ehrapy/tools/feature_ranking/_rank_features_groups.py Show resolved Hide resolved

eroell and others added 2 commits December 7, 2023 13:20

Update ehrapy/tools/feature_ranking/_rank_features_groups.py

034f820

Co-authored-by: Lukas Heumos <[email protected]>

correct indent of previous commit and added comment on dummy X

6458265

eroell marked this pull request as ready for review December 7, 2023 12:30

eroell marked this pull request as draft December 7, 2023 14:02

bug fixes, more tests and (fixed) examples in docstring

f444a59

eroell marked this pull request as ready for review December 7, 2023 17:51

Zethson merged commit 5447087 into theislab:main Dec 7, 2023
11 of 12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rank features groups obs #622

Rank features groups obs #622

eroell commented Dec 6, 2023 •

edited

Loading

eroell commented Dec 6, 2023

juzb commented Dec 6, 2023

Zethson left a comment

Zethson commented Dec 6, 2023

eroell commented Dec 7, 2023

eroell commented Dec 7, 2023

Zethson left a comment

Rank features groups obs #622

Rank features groups obs #622

Conversation

eroell commented Dec 6, 2023 • edited Loading

eroell commented Dec 6, 2023

juzb commented Dec 6, 2023

Zethson left a comment

Choose a reason for hiding this comment

Zethson commented Dec 6, 2023

eroell commented Dec 7, 2023

eroell commented Dec 7, 2023

Zethson left a comment

Choose a reason for hiding this comment

eroell commented Dec 6, 2023 •

edited

Loading