-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add obsm, etc to ExperimentAxisQuery #179
Conversation
joinids = getattr(self._joinids, axis.value) | ||
return axism[layer].read((joinids, col_joinids)) | ||
|
||
def _axism_inner_csr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method exists purely to be passed to _read
, which loads everything in memory. This works for now but I am not sure this is the right format we want for obsm
. In principle, for obsm
it would be better to use a csc matrix since it has way more rows than column. Most embeddings though are dense, so maybe we could just use a numpy array here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest looking at the AnnData documentation. obsm/varm/obsp/varp are ndarray in typical use, and I don't see any reason they would not be here... Using a sparse matrix is likely to create incompatibility with downstream user assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, for some reason I thought AnnData didn't enforce the typing here, but it looks like it does. I'll convert to ndarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its more than "enforce typing" - it is conceptually a dense array in the AnnData data model (but not in SOMA)
def to_anndata( | ||
self, | ||
X_name: str, | ||
*, | ||
column_names: Optional[AxisColumnNames] = None, | ||
X_layers: Sequence[str] = (), | ||
obsm_keys: Sequence[str] = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to use consistent names for X and obsm:
X_layers
obsm_keys
Suggest simply using _layers
Also, use a tuple, not a list, as the default, a la X_layers
. I'm surprised this passed mypy linting...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, missing the rest of the layers (varm, etc). Any reason we should not add all of it in one pass?
It is probably worth thinking about how to do this, e.g.,
- simply add all four to the function as optional args,
- or, add a dict-like structure a la
column_names
.
@thetorpedodog - style suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before I add the other layers, I'd like the approach to be validated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
several concerns:
- this doesn't produce a "legal" AnnData, which specifies that obsm/obsp/varm/varp are ndarray
- the purpose/signature of query._read seems to morphed over time (not just from this PR), such that the docstrings and sig don't match the original "produces pure Arrow objects" intent. We should make an intentional decision to either a) dedicate this code path to the to_anndata use case exclusively, or b) refactor it back to generating Arrow objects, with the conversion to AnnData wrapped around it. I'm OK with either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m liking what I see in general. I have a few points of feedback, both API design–related and otherwise.
obsm = dict() | ||
for key in obsm_keys: | ||
obsm[key] = self._axism_inner_ndarray(_Axis.OBS, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non–API style note: this can be made into a comprehension:
obsm = {
key: self._axism...(..., key)
}
joinids = getattr(self._joinids, axis.value) | ||
return axism[layer].read((joinids, slice(None))) | ||
|
||
def _convert_to_ndarray(self, is_obs: bool, T: pa.Table, n_row: int, n_col: int) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More non–API design feedback: to avoid boolean parameters, I would make this either the _Axis
you’re using, or pass in the indexer function (looking like self._convert_to_ndarray(self.indexer.by_obs, ...)
). Also, to avoid confusion with type names or generic parameters, prefer lowercase for parameter / variable names.
is_obs = axis is _Axis.OBS | ||
n_row = n_col = len(self._joinids.obs) if is_obs else len(self._joinids.var) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not something for you to do; just thinking aloud here:
it might make sense to add a function to _Axis
that gets the given attribute, so that this could look something like:
n_row = n_col = axis.getattr_from(self._joinids)
so it does self._joinids.obs
and self._joinids.var
by itself without you having to specify (and thus avoids potential obs/var switcheroos)
(maybe also have it take a suffix, so you could say axis.getattr_from(something, "m")
to get something.obsm
/something.varm
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decided to see what this would look like here #183
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be great - I approved #183. If you want to merge it, I can accommodate the changes here, otherwise we can do it later.
This aims to eliminate the annoyance (and potential error) of writing thing = whatever.obs if axis is _Axis.OBS else whatever.var by letting you say thing = axis.getattr_from(whatever) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor style suggestions; nothing critical. Looks good!
obsm = obsm_ft.result() | ||
obsp = obsp_ft.result() | ||
varm = varm_ft.result() | ||
varp = varp_ft.result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like these could be passed as the named arguments to _AxisQueryResult
without the need for a temporary variable.
if key not in self._ms: | ||
raise ValueError(f"Measurement does not contain {key} data") | ||
|
||
axism = axis.getitem_from(self._ms, suf="m") | ||
if not (layer and layer in axism): | ||
raise ValueError(f"Must specify '{key}' layer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both of these cases, you can use EAFP style:
try:
axism = axis.getitem_from(...)
except KeyError as ke:
raise ValueError(...) from ke # or maybe `from None`
try:
axism_layer = axism[layer]
except KeyError as ke:
...
if not isinstance(axism[layer], data.SparseNDArray): | ||
raise TypeError(f"Unexpected SOMA type stored in '{key}' layer") | ||
|
||
joinids = getattr(self._joinids, axis.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
axis.getitem_from(self._joinids)
) -> np.ndarray: | ||
indexer: pd.Index = axis.getattr_from(self.indexer, pre="by_") | ||
idx = indexer(table["soma_dim_0"]) | ||
Z = np.zeros(n_row * n_col, dtype=np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: recommend making z
lowercase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
_read_axis_mappings, self._axism_inner_ndarray, _Axis.OBS, obsm_keys | ||
) | ||
obsp_ft = self._threadpool.submit( | ||
_read_axis_mappings, self._axisp_inner_ndarray, _Axis.OBS, obsp_keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkmartinjr and @ebezzi I believe you had a conversation about this offline (or maybe here but I don't see it). We know that the most common uses case of axisp arrays are numerically sparse, and even though anndata's schema says they should be numpy dense arrays, scanpy's methods fill it in with scipy sparse matrices.
Ultimately we would like to move to a world where either AnnData schema takes both (dense and sparse) or only sparse. Given that the ecosystem already violates the schema in favor of better numerical representation I lean towards tiledbsoma also violating the schema, and then we request AnnData's schema to be relaxed.
What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle I have no issues. You might want to ping Isaac V. and see what he thinks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I wanted to make sure I didn't miss anything important. I don't see it as a blocker for now, I will file a few issues (here and in AnnData) to move towards the support of sparse matrices in the axisp arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also look at this page, where they claim that obsm etc can be sparse. This is relative to the on-disk format, but I don't believe there is anything that converts them when loading in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from an API signature viewpoint!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏🏻
Adds
obsm
,obsp
,varm
,varp
toExperimentAxisQuery._read
and.to_anndata
.Unit tests can be found here: single-cell-data/TileDB-SOMA#1934