Replies: 5 comments 15 replies
-
A few thoughts:
This is basically how zarrita works. I had put some thought into that design. We also have implemented these concepts (albeit not async) on a higher level in the …
-
**Prior discussions + other prior art**

Something similar has previously been discussed here: …

The xarray team had some thoughts here, especially around their internal …

I personally quite like the design of Julia's array views, which accomplish something quite like this in a generic way. HDF5 also provides some functionality like this with their "HyperSlabs", but I don't believe …
-
This is misleading if you read to something that is not a numpy array (e.g. sparse), or might even error if you read directly to a …
-
Following this logic to its ultimate conclusion leads to a "…"

In zarr-developers/zarr-specs#288 @jbms, @rabernat and I have proposed a "Virtual Concatenation ZEP", imagining a general implementation of lazy indexing/concatenation/stacking of zarr objects, and how the record of such operations could still be serialized into the store on-disk. Awkwardly, this class would then have much of the functionality of a conventional "duck array", but not all of it (because it wouldn't support computations like arithmetic or reductions).

To summarize some of the above discussion, there are two models for array behaviour we could follow: "duck arrays" and "disk arrays". "Duck arrays" are normally things you can compute on, and anything duck-array-like should endeavour to follow the patterns agreed upon by the python community in the python array API standard. This includes explicitly requiring that …

But I personally think that zarr-python v3 should expose an array type like …

I also don't find backwards-compatibility arguments super compelling here... This is the first breaking change of Zarr in how many years? We should improve everything we can whilst we have the chance! We might also imagine softening the v2->v3 transition for user libraries by providing convenience adapter classes, e.g. an …
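The "duck array" vs "disk array" distinction above can be sketched in a few lines. This is a hypothetical illustration (the class name and methods are made up, not a proposed zarr API): a "disk array" exposes shape/dtype and lazy indexing, but deliberately refuses computation like arithmetic or reductions.

```python
import numpy as np

class DiskArray:
    """Hypothetical 'disk array': indexable, with shape and dtype,
    but with no arithmetic or reductions (unlike a 'duck array')."""

    def __init__(self, backing: np.ndarray):
        self._backing = backing  # stands in for data living in a store

    @property
    def shape(self):
        return self._backing.shape

    @property
    def dtype(self):
        return self._backing.dtype

    def __getitem__(self, idx) -> "DiskArray":
        # lazy indexing returns another DiskArray, not raw data
        return DiskArray(self._backing[idx])

    # deliberately no __add__, no sum(), etc.: computation is out of scope

arr = DiskArray(np.arange(16, dtype="f8").reshape(4, 4))
print(arr[1:3].shape)   # indexing works: (2, 4)
try:
    arr + 1             # arithmetic is not part of the interface
except TypeError:
    print("no arithmetic: TypeError")
```

The point is that such a class has *much* of a duck array's surface (indexing, `shape`, `dtype`) while opting out of the computational parts of the array API standard.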
-
I just want to provide some feedback as a scientist and user of zarr, h5py/hdf5, pandas, and xarray that deals with very large N-dimensional data. I need an efficient store when reading and writing large datasets that is reasonably straightforward to use and is careful not to use an excessive amount of memory.

I first used pandas and xarray, as xarray was built to mimic pandas slicing, processing, and analysis tools for N-dimensional arrays. Gradually, xarray became more frustrating due to the lazy loading/caching built into the package, which is quite opaque to the user. Lots of care needed to be taken to ensure that the memory wouldn't top out, because xarray gradually loads data into the cache of the original object. There was also no way to create a very large netcdf file by iteratively writing parts of an array to the file. All of my colleagues have just come to expect that they have to create tons of small netcdf files because of this lack of appropriate tooling (which exists for appending to files everywhere else in programming).

That's when I switched to using h5py. It was a breath of fresh air to be able to easily and iteratively write data to an existing array/dataset in an hdf5 file. And reading data was ideal as well: I could slice the dataset like in numpy and it returns a numpy array. No concerns about a gradually increasing hidden cache that I have to worry about. If I want to use another python package for doing my processing and analysis, it's super easy to convert a numpy array to pretty much anything.

I greatly appreciate that zarr currently returns a numpy array when slicing, as h5py does, and consequently it is much more straightforward to know how I should handle the inputs and outputs. If this causes certain use-cases to be less efficient, I'd still prefer the simplicity.
-
# Slicing is weird in `zarr-python`

In `zarr-python`, slicing a zarr array returns a numpy array (or some other ndarray-flavored thing, depending on `meta_array`). From a type perspective, we have `zarr.Array.__getitem__(slice) -> np.ndarray`. But slicing a collection should have the type signature `T.__getitem__(slice) -> T`, not `T.__getitem__(slice) -> X` (i'm omitting `self` from the function signature here).

I think "slicing a collection should return an instance of the collection" is a pretty simple rule, but `zarr-python` fails to follow this rule today, which sets `zarr-python` apart from base python collections (`list`, `tuple`), or array libraries like `numpy`, `dask`, `tensorstore`, among others.

I understand that `zarr-python`'s slicing behavior is consistent with `h5py` (which probably wanted to be consistent with numpy), but I'm pretty sure that slicing this way is incorrect, and we should fix it.

## Why this matters
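To make the type-signature inconsistency concrete, here is a quick demonstration of how base python collections and numpy behave when sliced (a plain demonstration; nothing zarr-specific is assumed):

```python
import numpy as np

# Base python collections: slicing returns an instance of the collection
assert type([1, 2, 3][:2]) is list    # list.__getitem__(slice) -> list
assert type((1, 2, 3)[:2]) is tuple   # tuple.__getitem__(slice) -> tuple

# numpy follows the same rule: ndarray.__getitem__(slice) -> ndarray
a = np.arange(12).reshape(3, 4)
assert type(a[:2]) is np.ndarray

# zarr-python today is the odd one out:
#   zarr.Array.__getitem__(slice) -> np.ndarray   (T -> X, not T -> T)
print("all of these container types round-trip under slicing")
```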
- … `dask`, but dask has well documented performance limitations for arrays with many small chunks. This forces zarr users to learn how to use (and debug) dask if they want to do basic IO at scale, which means that `zarr-python` is failing those users.
- … (`zarr-python` today does two full chunk reads + slicing for each chunk).
- … `__getitem__` or `__setitem__` methods are used, because these methods don't take keyword arguments. To work around this problem, we have cluttered the array API with things like `synchronizer`, `write_empty_chunks`, `meta_array`, etc. These attributes are important for tuning how zarr arrays do IO, but making them array attributes means that these parameters cannot be changed cleanly during the lifetime of a single array. It would be better if users could supply these parameters when they actually need to do IO, which is not necessarily when they are slicing the array.

## what we should do
I think we should deprecate this behavior of zarr arrays in version 3 of `zarr-python`. In concrete terms, I'm currently looking at `tensorstore` for inspiration for how to move `zarr-python` toward a cleaner slicing story. It would look something like this:
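A minimal sketch of the idea, in the spirit of tensorstore's lazy indexing. All names here are hypothetical (this is not the actual zarr-python v3 API): `__getitem__` composes indices and returns another lazy object, and data is only materialized by an explicit `read()`.

```python
import numpy as np

class LazySlice:
    """Hypothetical sketch: slicing returns LazySlice (T -> T);
    IO only happens on an explicit read()."""

    def __init__(self, source: np.ndarray, index: tuple = ()):
        self._source = source       # stands in for a chunked array in a store
        self._index = tuple(index)  # the accumulated slicing operations

    def __getitem__(self, idx) -> "LazySlice":
        # slicing composes indices instead of reading data
        return LazySlice(self._source, self._index + (idx,))

    @property
    def shape(self) -> tuple:
        # shape reflects the accumulated slices, not just the
        # `shape` field stored in the array metadata
        view = self._source
        for idx in self._index:
            view = view[idx]  # basic indexing yields numpy views, so this is cheap
        return view.shape

    def read(self) -> np.ndarray:
        # IO happens only here; per-call options (like the parameters
        # currently stored as array attributes) could be accepted as
        # keyword arguments of read()/write()
        out = self._source
        for idx in self._index:
            out = out[idx]
        return np.array(out)

arr = LazySlice(np.arange(24).reshape(4, 6))
sub = arr[1:3][0]               # two slicing steps, composed lazily
print(type(sub).__name__)       # LazySlice -- slicing returns the collection type
print(sub.shape)                # (6,)
print(sub.read())               # [ 6  7  8  9 10 11]
```

Note that this sketch already hints at the composition problem discussed below: `arr[1:3][0]` has to combine both indexing operations before any data is touched.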
## what would break

- If `zarr.Array[0,0,:]` returns a new, smaller instance of `zarr.Array`, then the `.shape` attribute of `zarr.Array` is no longer just a copy of the `shape` field in `.zarray`/`zarr.json`. Instead, the `.shape` attribute must take into account the sequence of slicing operations that generated the instance of `zarr.Array`.
- `zarr.Array[slice1][slice2]` must compose `slice1` with `slice2`. This means that `zarr.Array` will end up carrying around its slice, perhaps materialized as something other than a python slice object, to accommodate boolean array indexing and the like. This means that, like `tensorstore` arrays, `zarr.Array` will be necessarily endowed with something like coordinates. This has implications I have not fully thought through.

## discussion points
- … `zarrita`).

very curious to hear everyone's thoughts!
cc @jni (because I think this direction would have implications for napari)
cc @jbms (because tensorstore)