Replies: 5 comments 15 replies
-
A few thoughts:
This is basically how zarrita works. I had put some thought into that design. We also have implemented these concepts (albeit not async) on a higher level in the …
-
**Prior discussions + other prior art**

Something similar has previously been discussed here: …

The xarray team had some thoughts here, especially around their internal …

I personally quite like the design of Julia's array views, which accomplish something quite like this in a generic way. HDF5 also provides some functionality like this with their "HyperSlabs", but I don't believe …
-
This is misleading if you read to something that is not a numpy array (e.g. sparse), or might even error if you read directly to a …
-
Following this logic to its ultimate conclusion leads to a "…"

In zarr-developers/zarr-specs#288 @jbms, @rabernat and I have proposed a "Virtual Concatenation ZEP", imagining a general implementation of lazy indexing/concatenation/stacking of zarr objects, and how the record of such operations could still be serialized into the store on-disk. Awkwardly, this class would then have much of the functionality of a conventional "duck array", but not all of it (because it wouldn't support computations like arithmetic or reductions).

To summarize some of the above discussion, there are two models for array behaviour we could follow: "duck arrays" and "disk arrays". "Duck arrays" are normally things you can compute on, and anything duck-array-like should endeavour to follow the patterns agreed upon by the python community in the python array API standard. This includes explicitly requiring that …

But I personally think that zarr-python v3 should expose an array type like …

I also don't find backwards-compatibility arguments super compelling here... This is the first breaking change of Zarr in how many years? We should improve everything we can whilst we have the chance! We might also imagine softening the v2->v3 transition for user libraries by providing convenience adapter classes, e.g. an …
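The "duck array" vs "disk array" distinction above can be sketched in a few lines. This is a hypothetical illustration (the class name and methods are made up, not a proposed zarr API): a "disk array" exposes shape/dtype and lazy indexing, but deliberately refuses computation like arithmetic or reductions.

```python
import numpy as np

class DiskArray:
    """Hypothetical 'disk array': indexable, with shape and dtype,
    but with no arithmetic or reductions (unlike a 'duck array')."""

    def __init__(self, backing: np.ndarray):
        self._backing = backing  # stands in for data living in a store

    @property
    def shape(self):
        return self._backing.shape

    @property
    def dtype(self):
        return self._backing.dtype

    def __getitem__(self, idx) -> "DiskArray":
        # lazy indexing returns another DiskArray, not raw data
        return DiskArray(self._backing[idx])

    # deliberately no __add__, no sum(), etc.: computation is out of scope

arr = DiskArray(np.arange(16, dtype="f8").reshape(4, 4))
print(arr[1:3].shape)   # indexing works: (2, 4)
try:
    arr + 1             # arithmetic is not part of the interface
except TypeError:
    print("no arithmetic: TypeError")
```

The point is that such a class has *much* of a duck array's surface (indexing, `shape`, `dtype`) while opting out of the computational parts of the array API standard.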
-
I just want to provide some feedback as a scientist and user of zarr, h5py/hdf5, pandas, and xarray that deals with very large N-dimensional data. I need an efficient store when reading and writing large datasets that is reasonably straightforward to use and is careful not to use an excessive amount of memory.

I first used pandas and xarray, as xarray was built to mimic pandas slicing, processing, and analysis tools for N-dimensional arrays. Gradually, xarray became more frustrating due to the lazy loading/caching built into the package, which is quite opaque to the user. Lots of care needed to be taken to ensure that the memory wouldn't top out, because xarray gradually loads data into the cache of the original object. There was also no way to create a very large netcdf file by iteratively writing parts of an array to the file. All of my colleagues have just come to expect that they have to create tons of small netcdf files because of this lack of appropriate tooling (which exists for appending to files everywhere else in programming).

That's when I switched to using h5py. It was a breath of fresh air to be able to easily and iteratively write data to an existing array/dataset in an hdf5 file. And reading data was ideal as well: I could slice the dataset like in numpy and it returns a numpy array. No concerns about a gradually increasing hidden cache that I have to worry about. If I want to use another python package for doing my processing and analysis, it's super easy to convert a numpy array to pretty much anything.

I greatly appreciate that zarr currently returns a numpy array when slicing, as h5py does, and consequently it is much more straightforward to know how I should handle the inputs and outputs. If this causes certain use-cases to be less efficient, I'd still prefer the simplicity.
-
# Slicing is weird in `zarr-python`

In `zarr-python`, slicing a zarr array returns a numpy array (or some other ndarray-flavored thing, depending on `meta_array`). From a type perspective, we have `zarr.Array.__getitem__(slice) -> np.ndarray`. But slicing a collection should have the type signature `T.__getitem__(slice) -> T`, not `T.__getitem__(slice) -> X` (i'm omitting `self` from the function signature here).

I think "slicing a collection should return an instance of the collection" is a pretty simple rule, but `zarr-python` fails to follow this rule today, which sets `zarr-python` apart from base python collections (`list`, `tuple`), or array libraries like `numpy`, `dask`, `tensorstore`, among others.

I understand that `zarr-python`'s slicing behavior is consistent with `h5py` (which probably wanted to be consistent with numpy), but I'm pretty sure that slicing this way is incorrect, and we should fix it.

## Why this matters
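To make the type-signature inconsistency concrete, here is a quick demonstration of how base python collections and numpy behave when sliced (a plain demonstration; nothing zarr-specific is assumed):

```python
import numpy as np

# Base python collections: slicing returns an instance of the collection
assert type([1, 2, 3][:2]) is list    # list.__getitem__(slice) -> list
assert type((1, 2, 3)[:2]) is tuple   # tuple.__getitem__(slice) -> tuple

# numpy follows the same rule: ndarray.__getitem__(slice) -> ndarray
a = np.arange(12).reshape(3, 4)
assert type(a[:2]) is np.ndarray

# zarr-python today is the odd one out:
#   zarr.Array.__getitem__(slice) -> np.ndarray   (T -> X, not T -> T)
print("all of these container types round-trip under slicing")
```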
- … `dask`, but dask has well documented performance limitations for arrays with many small chunks. This forces zarr users to learn how to use (and debug) dask if they want to do basic IO at scale, which means that `zarr-python` is failing those users.
- … (`zarr-python` today does two full chunk reads + slicing for each chunk).
- … `__getitem__` or `__setitem__` methods are used, because these methods don't take keyword arguments. To work around this problem, we have cluttered the array API with things like `synchronizer`, `write_empty_chunks`, `meta_array`, etc. These attributes are important for tuning how zarr arrays do IO, but making them array attributes means that these parameters cannot be changed cleanly during the lifetime of a single array. It would be better if users could supply these parameters when they actually need to do IO, which is not necessarily when they are slicing the array.

## what we should do
I think we should deprecate this behavior of zarr arrays in version 3 of `zarr-python`. In concrete terms, I'm currently looking at `tensorstore` for inspiration for how to move `zarr-python` toward a cleaner slicing story. It would look something like this:
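A minimal sketch of the idea, in the spirit of tensorstore's lazy indexing. All names here are hypothetical (this is not the actual zarr-python v3 API): `__getitem__` composes indices and returns another lazy object, and data is only materialized by an explicit `read()`.

```python
import numpy as np

class LazySlice:
    """Hypothetical sketch: slicing returns LazySlice (T -> T);
    IO only happens on an explicit read()."""

    def __init__(self, source: np.ndarray, index: tuple = ()):
        self._source = source       # stands in for a chunked array in a store
        self._index = tuple(index)  # the accumulated slicing operations

    def __getitem__(self, idx) -> "LazySlice":
        # slicing composes indices instead of reading data
        return LazySlice(self._source, self._index + (idx,))

    @property
    def shape(self) -> tuple:
        # shape reflects the accumulated slices, not just the
        # `shape` field stored in the array metadata
        view = self._source
        for idx in self._index:
            view = view[idx]  # basic indexing yields numpy views, so this is cheap
        return view.shape

    def read(self) -> np.ndarray:
        # IO happens only here; per-call options (like the parameters
        # currently stored as array attributes) could be accepted as
        # keyword arguments of read()/write()
        out = self._source
        for idx in self._index:
            out = out[idx]
        return np.array(out)

arr = LazySlice(np.arange(24).reshape(4, 6))
sub = arr[1:3][0]               # two slicing steps, composed lazily
print(type(sub).__name__)       # LazySlice -- slicing returns the collection type
print(sub.shape)                # (6,)
print(sub.read())               # [ 6  7  8  9 10 11]
```

Note that this sketch already hints at the composition problem discussed below: `arr[1:3][0]` has to combine both indexing operations before any data is touched.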
## what would break

- If `zarr.Array[0,0,:]` returns a new, smaller instance of `zarr.Array`, then the `.shape` attribute of `zarr.Array` is no longer just a copy of the `shape` field in `.zarray`/`zarr.json`. Instead, the `.shape` attribute must take into account the sequence of slicing operations that generated the instance of `zarr.Array`.
- `zarr.Array[slice1][slice2]` must compose `slice1` with `slice2`. This means that `zarr.Array` will end up carrying around its slice, perhaps materialized as something other than a python slice object, to accommodate boolean array indexing and the like. This means that, like `tensorstore` arrays, `zarr.Array` will be necessarily endowed with something like coordinates. This has implications I have not fully thought through.

## discussion points
- … `zarrita`).

very curious to hear everyone's thoughts!
cc @jni (because I think this direction would have implications for napari)
cc @jbms (because tensorstore)