Master Issue: Function Compatibility with the Different Datatypes of AnnData Fields #829

Open
eroell opened this issue Nov 26, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@eroell
Collaborator

eroell commented Nov 26, 2024

Description of feature

Right now, AnnData fields such as .X can hold different datatypes, e.g. dense NumPy arrays, SciPy sparse matrices, or Dask arrays.

Some ehrapy functions are compatible with several of these types, while others are not.

We can roll this out in a two-step process:

  1. Introduce a consistent response for functions that fail for a given datatype.
    One way to implement this could be consistent use of single-dispatch. This should also include dedicated tests (a small sketch follows after this list).
  2. Address functions, or batches of functions, step by step and extend the datatypes they support.
    The order could be determined by estimated usage volume, required effort, and typical data load (e.g. operations happening on the full feature set vs. a lower-dimensional space).
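
A minimal sketch of what such a dedicated test could look like (pytest is assumed; _placeholder_func and the parametrization are purely illustrative, not existing ehrapy code):

import numpy as np
import pytest
from scipy import sparse


def _placeholder_func(X):
    """Stand-in for an ehrapy function that currently supports only dense numpy arrays (hypothetical)."""
    if not isinstance(X, np.ndarray):
        raise TypeError(f"Unsupported data type: {type(X)}")
    return X.sum()


@pytest.mark.parametrize(
    "array_type,supported",
    [(np.asarray, True), (sparse.csr_matrix, False), (sparse.csc_matrix, False)],
)
def test_consistent_response_per_array_type(array_type, supported):
    X = array_type(np.array([[1.0, 2.0], [3.0, 4.0]]))
    if supported:
        assert _placeholder_func(X) == 10.0
    else:
        # the agreed-upon, consistent failure mode from step 1
        with pytest.raises(TypeError):
            _placeholder_func(X)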

List of affected functions

Module Function What to do?
.\anndata\anndata_ext.py anndata_to_df
.\anndata\anndata_ext.py _assert_encoded
.\anndata\anndata_ext.py set_numeric_vars
.\anndata\anndata_ext.py _to_dense_matrix
.\io\_read.py _write_cache
.\io\_read.py _decode_cached_adata
.\io\_write.py write
.\preprocessing\_balanced_sampling.py balanced_sample
.\preprocessing\_encoding.py encode
.\preprocessing\_encoding.py _one_hot_encoding
.\preprocessing\_encoding.py _label_encoding
.\preprocessing\_encoding.py _delete_all_encodings
.\preprocessing\_encoding.py _update_obs
.\preprocessing\_imputation.py explicit_impute
.\preprocessing\_imputation.py _simple_impute
.\preprocessing\_imputation.py knn_impute
.\preprocessing\_imputation.py _knn_impute
.\preprocessing\_imputation.py miss_forest_impute
.\preprocessing\_imputation.py mice_forest_impute
.\preprocessing\_normalization.py _scale_func_group
.\preprocessing\_normalization.py scale_norm
.\preprocessing\_normalization.py minmax_norm
.\preprocessing\_normalization.py maxabs_norm
.\preprocessing\_normalization.py robust_scale_norm
.\preprocessing\_normalization.py quantile_norm
.\preprocessing\_normalization.py power_norm
.\preprocessing\_normalization.py log_norm
.\preprocessing\_normalization.py _prep_adata_norm
.\preprocessing\_normalization.py offset_negative_values
.\preprocessing\_quality_control.py _obs_qc_metrics
.\preprocessing\_quality_control.py _var_qc_metrics
.\tools\_sa.py ols
.\tools\_sa.py glm
.\tools\feature_ranking\_rank_features_groups.py rank_features_groups
eroell added the enhancement (New feature or request) label on Nov 26, 2024
@nicolassidoux
Collaborator

In addition to what @eroell mentioned, here are my initial—perhaps naive—thoughts on a potential strategy we could adopt:

  1. Assess and Implement Full Support for All Datatypes

    • We could evaluate each function that manipulates layer data and ensure it supports every possible datatype.
  2. Adopt a Standard Datatype with Conversion Extensions

    • Alternatively, we could standardize all functions to operate on a single datatype (for example, NumPy arrays 😉).
    • Extensions could then handle the conversion of the actual underlying datatype to NumPy for processing, and back to the original datatype afterward (a sketch follows at the end of this comment).

From a design perspective, strategy 2 is quite appealing:

  • It requires no changes to existing functions or tests.
  • Adding support for new datatypes would be straightforward if AnnData introduces them in the future.
  • This approach aligns partially with the purpose of anndata_ext.py functions like df_to_anndata and anndata_to_df.

However, there are downsides to consider:

  • Conversion Costs: These could negatively impact performance.
  • Optimization Loss: We may miss opportunities to apply specific optimizations for certain datatypes.
  • Design Fit: This conversion feature might be better implemented directly within the AnnData package itself.
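
To make strategy 2 concrete, here is a minimal sketch assuming a hypothetical numpy_fallback decorator that densifies adata.X before the wrapped function runs and restores the original container afterwards (all names are illustrative; none of this exists in ehrapy):

import functools

import numpy as np
from anndata import AnnData
from scipy import sparse


def numpy_fallback(func):
    """Convert adata.X to a dense numpy array for the wrapped function, then convert back."""

    @functools.wraps(func)
    def wrapper(adata: AnnData, *args, **kwargs):
        original_type = type(adata.X)
        if sparse.issparse(adata.X):
            adata.X = adata.X.toarray()  # conversion cost is paid here
        result = func(adata, *args, **kwargs)
        if issubclass(original_type, sparse.spmatrix):
            adata.X = original_type(adata.X)  # back to the original datatype
        return result

    return wrapper


@numpy_fallback
def _shift_to_non_negative(adata: AnnData) -> AnnData:  # hypothetical example function
    adata.X = adata.X - np.minimum(adata.X.min(axis=0), 0)
    return adata

Existing numpy-only implementations would then keep working unchanged, at the conversion cost mentioned above.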

@Zethson
Member

Zethson commented Nov 26, 2024

Strategy 2 is more or less what we've been doing now because in almost all cases we densify. Speaking from experience, we are leaving a loooot of performance behind if we don't support sparse matrices. Moreover, enabling all algorithms with dask arrays is also important so that we can really scale and enable out-of-core computation.
I'd not worry about cupy arrays and GPU for now.

I would:

  1. Ensure that every function converts to a numpy array as a fallback if necessary.
  2. Try to implement all functions with sparse arrays (without densifying!) where possible. This alone will be substantial work, but with substantial payoffs (a small illustration follows below).
  3. Now do it again with Dask. This is again a massive, massive pile of work, but it will be super important to scale long-term.
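
As a small illustration of the difference between the numpy fallback and a sparse-aware implementation, per-feature means on a CSR matrix can be computed without ever materialising a dense array (a generic numpy/scipy sketch, not an existing ehrapy function):

import numpy as np
from scipy import sparse


def column_means(X) -> np.ndarray:
    """Per-feature means, sparse-aware instead of densifying."""
    if sparse.issparse(X):
        # works on the stored values only - no dense intermediate is created
        return np.asarray(X.mean(axis=0)).ravel()
    return np.asarray(X).mean(axis=0)  # dense numpy fallback


X = sparse.random(100_000, 50, density=0.01, format="csr", random_state=0)
means = column_means(X)  # touches ~50k stored values instead of 5 million cells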

We can break it up into pieces and also @aGuyLearning can potentially help here.

WDYT?

@nicolassidoux
Collaborator

nicolassidoux commented Nov 26, 2024

I like the style of this hybrid approach.

I could see a classic dispatch in the function under consideration (knn_impute, for example) and two singledispatched conversion functions (to_numpy, from_numpy).

Let's try a quick example. knn_impute could be (just an example, the syntax may be wrong):

from anndata import AnnData
from scipy import sparse


def knn_impute(
    adata: AnnData,
    ...
) -> AnnData:
    if isinstance(adata.X, sparse.csr_matrix):  # this datatype has a special algorithm available
        return _knn_impute_on_csr(adata, ...)
    if isinstance(adata.X, sparse.csc_matrix):  # this datatype too
        return _knn_impute_on_csc(adata, ...)
    # note: Dask is not supported here

    # fallback to numpy
    adata.X = from_numpy(_knn_impute_on_numpy(to_numpy(adata.X)))
    return adata

from_numpy will have to know somehow what the original datatype was (one option is sketched after the to_numpy example below).

Then one conversion function could be:

from functools import singledispatch
from typing import Any

import numpy as np
from dask.array import Array as DaskArray
from scipy import sparse


@singledispatch
def to_numpy(layer: Any) -> np.ndarray:
    raise ValueError(f"Unsupported type: {type(layer)}")


@to_numpy.register
def _(layer: np.ndarray) -> np.ndarray:
    return layer  # already a numpy array, return it as-is


@to_numpy.register
def _(layer: sparse.csr_matrix) -> np.ndarray:
    return layer.toarray()  # densifies the sparse matrix


@to_numpy.register
def _(layer: DaskArray) -> np.ndarray:
    return layer.compute()  # materialise in memory; assumes dense (numpy-backed) chunks
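
One hypothetical way for from_numpy to remember the original datatype is to pass the original layer along with the dense result and dispatch on it (again just a sketch; none of this exists in ehrapy):

from functools import singledispatch

import numpy as np
from scipy import sparse


@singledispatch
def from_numpy(original, dense: np.ndarray):
    raise ValueError(f"Unsupported type: {type(original)}")


@from_numpy.register
def _(original: np.ndarray, dense: np.ndarray) -> np.ndarray:
    return dense  # nothing to convert back


@from_numpy.register
def _(original: sparse.csr_matrix, dense: np.ndarray) -> sparse.csr_matrix:
    return sparse.csr_matrix(dense)  # re-sparsify the result

The numpy fallback in knn_impute above would then be called as from_numpy(adata.X, _knn_impute_on_numpy(to_numpy(adata.X))).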

So to sum up, if the user calls knn_impute with a layer datatype that:

  • the function wants to handle itself, it dispatches to a private function and does the job there
  • the function cannot handle itself, it tries to call a conversion function
    • if the datatype cannot be converted, an exception is thrown
    • if the datatype can be converted, the function proceeds with the numpy algorithm and converts back to the original datatype

@eroell
Collaborator Author

eroell commented Nov 26, 2024

You've spotted this right, @nicolassidoux - however, the optimization from sparse structures when the data is sparse is huge; it can easily reach orders of magnitude depending on data sparsity. Also, out-of-core operations will not be possible at all without following Strategy 1 :)

Agree on all points with @Zethson.

We can break it up into pieces

Absolutely, the high-level steps 1 and 2 are both huge. They should rather guide a rollout of this behavior than be "a PR" ;) Step 2 is in fact open-ended.

I'll try to put together a first set of manageable portions for step 1 as sub-issues soon.

@eroell
Collaborator Author

eroell commented Nov 26, 2024

Rather than having singledispatched to/from numpy functions and many names like _knn_impute_on_csr and _knn_impute_on_csc, I currently have this pattern in mind:

(Imagine sum_entries to be _knn_impute, where the datatype-specific operations are performed)

from functools import singledispatch

import numpy as np
from scipy import sparse


@singledispatch
def sum_entries(data):
    """Sum entries.

    Arguments
    ---------
    data
        Data to sum the entries of.

    Returns
    -------
    result
        Sum of entries.
    """
    raise TypeError(f"Unsupported data type: {type(data)}")


@sum_entries.register
def _(data: np.ndarray):
    print("Summing entries of a NumPy array.")
    return np.sum(data)


@sum_entries.register
def _(data: sparse.spmatrix):
    print("Summing entries of a SciPy sparse matrix.")
    return data.sum()  # not an exciting example, it just illustrates that this can look different per datatype

Implementing this step by step for (batches of) our functions, together with tests, would form the implementation of high-level step 1.

@eroell
Collaborator Author

eroell commented Nov 26, 2024

Raising a TypeError for everything that is not explicitly supported comes very naturally here.

@nicolassidoux
Collaborator

nicolassidoux commented Nov 26, 2024

Looks good to me also, but beware of the limitation of singledispatch: it overloads the function based on the first argument only.
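
A tiny illustration of that limitation (plain functools behaviour, not ehrapy code):

from functools import singledispatch

import numpy as np


@singledispatch
def scale(layer, factor):
    raise TypeError(f"Unsupported data type: {type(layer)}")


@scale.register
def _(layer: np.ndarray, factor):
    return layer * factor


scale(np.ones(3), 2.0)  # dispatches on np.ndarray -> works
scale(2.0, np.ones(3))  # dispatches on float, the *first* argument -> TypeError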

I have a suggestion for a first step, if I may: we need to list every function in ehrapy that can manipulate data so we have a clear idea of what needs to be done.

@nicolassidoux
Collaborator

nicolassidoux commented Dec 1, 2024

Ok, so I created this list by checking whether a function manipulates AnnData.X or AnnData.layers. Feel free to tell me if I missed something.

-> LIST MOVED TO ISSUE DESCRIPTION

@eroell
Collaborator Author

eroell commented Dec 1, 2024

Is e.g. ols in .\tools\_sa.py on the list because it converts adata.X to a pandas DataFrame?

This would not manipulate adata.X, but it could create an expensive pd.DataFrame object, right? That is more than enough reason to track it on this list; I just want to double-check that I understand it correctly :)

@nicolassidoux
Collaborator

nicolassidoux commented Dec 2, 2024

Yes, you're right: this function doesn't manipulate X itself but passes it to the DataFrame constructor, which accepts as data:

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame

So if X is something other than an ndarray or DataFrame, I guess data = pd.DataFrame(adata.X, columns=adata.var_names) will throw. Hence we would have to take care of the conversion.

Btw, maybe it would be good to implement all kinds of conversions (to_ndarray, to_dataframe, etc.). Of course, once again, these features would be better placed in AnnData...
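
A hedged sketch of what such a to_dataframe conversion could look like (hypothetical, not existing ehrapy or AnnData API; pd.DataFrame.sparse.from_spmatrix is real pandas API):

from functools import singledispatch

import numpy as np
import pandas as pd
from scipy import sparse


@singledispatch
def to_dataframe(layer, columns=None) -> pd.DataFrame:
    raise TypeError(f"Unsupported data type: {type(layer)}")


@to_dataframe.register
def _(layer: np.ndarray, columns=None) -> pd.DataFrame:
    return pd.DataFrame(layer, columns=columns)


@to_dataframe.register
def _(layer: sparse.spmatrix, columns=None) -> pd.DataFrame:
    # keeps the data sparse inside pandas instead of densifying it
    return pd.DataFrame.sparse.from_spmatrix(layer, columns=columns)

ols could then do df = to_dataframe(adata.X, columns=adata.var_names) regardless of the underlying datatype.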

@nicolassidoux
Collaborator

I moved the list up to the issue description so everyone can amend it easily.

@eroell
Collaborator Author

eroell commented Dec 6, 2024

Another question I have about the list - e.g. ep.tl.kaplan_meier calls _univariate_model, which does:

df = anndata_to_df(adata)

This function and its friends should also be on the list, right, since they call anndata_to_df?

On a side note, avoiding this full conversion and picking only the relevant features beforehand (survival analysis often requires just 2 or so) could be the way to have these functions "dask-ified", even if we still just call lifelines - but only on a tiny slice of the data.
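
A minimal sketch of that side note, assuming hypothetical column names and a hand-rolled wrapper (not the actual ep.tl.kaplan_meier implementation):

from anndata import AnnData
from lifelines import KaplanMeierFitter


def kaplan_meier_on_slice(adata: AnnData, duration_col: str, event_col: str) -> KaplanMeierFitter:
    # slice first: only the two required features are ever pulled into memory,
    # which keeps this cheap even for a sparse-backed X (dask may need a compute first)
    small = adata[:, [duration_col, event_col]].copy()
    df = small.to_df()  # or anndata_to_df(small); densifies only the tiny slice
    kmf = KaplanMeierFitter()
    kmf.fit(durations=df[duration_col], event_observed=df[event_col])
    return kmf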
