Refactor MultiZarrToZarr into multiple functions #377
Comments
I've been thinking of something similar. I'm not sure we can get all the flexibility we want out of just two functions, but I think a couple of classes may do the trick. I've been thinking of an API like:
g = kerchunk.KerchunkGroup()
# issubclass(KerchunkGroup, MutableMapping[str, KerchunkGroup | KerchunkArray])
# Concatenation + assignment, following array-api
g["x"]: kerchunk.KerchunkArray = kerchunk.concat([...], axis=0)
# Write a manifest
g.to_json(...)
# Getting a zarr group:
g.to_zarr()
# Reading from a manifest
g2 = kerchunk.KerchunkGroup.from_json(...)
# Assignment
g2["new_group"] = g
# Single level merging
g.update(g2)
# Add an array from a different store:
g.create_group("my").create_group("new")
g["my"]["new"]["array"] = zarr.open("matrix.zarr")  # zarr.Array
Then leave anything more complicated to something like This suggestion has a lot of the functionality already in
This idea may be more of a Also, I think you may be missing a reference to #375 under the "Suggestion" heading. |
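To make the proposed mapping-style API a bit more tangible, here is a minimal sketch. The names `RefGroup` and `RefArray` are hypothetical stand-ins for the suggested `KerchunkGroup` / `KerchunkArray`, the `[url, offset, length]` reference layout and the `s3://bucket/...` paths are assumptions for illustration, and nothing here is part of kerchunk today.

```python
import json
from collections.abc import MutableMapping


class RefArray:
    """Hypothetical stand-in for the proposed KerchunkArray.

    Holds a mapping of chunk key -> [url, offset, length].
    """

    def __init__(self, refs):
        self.refs = dict(refs)


class RefGroup(MutableMapping):
    """Hypothetical stand-in for the proposed KerchunkGroup."""

    def __init__(self):
        self._members = {}

    def __getitem__(self, key):
        return self._members[key]

    def __setitem__(self, key, value):
        if not isinstance(value, (RefGroup, RefArray)):
            raise TypeError("members must be RefGroup or RefArray")
        self._members[key] = value

    def __delitem__(self, key):
        del self._members[key]

    def __iter__(self):
        return iter(self._members)

    def __len__(self):
        return len(self._members)

    def create_group(self, name):
        self[name] = RefGroup()
        return self[name]

    def flatten(self, prefix=""):
        """Return a flat {"path/to/array/chunk_key": reference} dict."""
        out = {}
        for name, member in self._members.items():
            path = f"{prefix}{name}"
            if isinstance(member, RefGroup):
                out.update(member.flatten(prefix=path + "/"))
            else:
                for chunk_key, ref in member.refs.items():
                    out[f"{path}/{chunk_key}"] = ref
        return out

    def to_json(self):
        return json.dumps(self.flatten(), sort_keys=True)


g = RefGroup()
g.create_group("my").create_group("new")
g["my"]["new"]["array"] = RefArray({"0": ["s3://bucket/foo.nc", 100, 100]})

g2 = RefGroup()
g2["x"] = RefArray({"0": ["s3://bucket/bar.nc", 0, 50]})
g.update(g2)  # single-level merge, inherited from MutableMapping
```

Subclassing `MutableMapping` means `update`, `items`, `pop` and friends come for free once the five abstract methods exist, which is most of what the "assignment" and "single level merging" lines above rely on.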
Right now we are using Kerchunk as a workaround for things that Zarr itself does not support, by implementing various types of indirection on chunk keys at the Store level. I think this is a great way to prototype new possibilities. The downside is that only kerchunk-aware clients can ever hope to read these datasets. Also, from a design point of view, stores are technically ignorant of the Zarr data model (arrays, groups, etc.); they are formally just low-level key value stores. So when it comes time to refactor those prototypes into more stable interfaces, I think we should seriously consider upstreaming some of these features into Zarr itself, possibly via new Zarr extensions. The advantage of that approach is that
I believe that all 5 of the features above could be implemented in this way. In particular, the concept of a "chunk manifest": a document which stores information about where to find the chunks for an array (rather than assuming they follow the Zarr chunk encoding scheme) would get us very far. The outcome of this would be a Zarr array with much more powerful virtualization capabilities, something between Xarray's duck array types and the current Kerchunk situation. We would be eager to collaborate on this and have some work in this direction planned for this quarter. |
To be more concrete about what I mean, we could imagine writing code like this:
import zarr
a1 = zarr.open("array1.zarr") # zarr.Array
a2 = zarr.open("array2.zarr") # zarr.Array
# this would create a Zarr array with the correct extensions to support indirection to a1 and a2
# (need to figure out the right API)
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")
a3[0] = 1 # write some data...automatically goes into the correct underlying array |
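For what it's worth, the "writes go into the correct underlying array" part is mostly index bookkeeping. Below is a rough 1-D sketch using plain Python lists in place of zarr arrays; `VirtualConcat1D` is a hypothetical name, and a real version would have to handle slices, multiple dimensions and chunk boundaries.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualConcat1D:
    """Hypothetical 1-D view over several sequences, concatenated end to end."""

    def __init__(self, parts):
        self.parts = list(parts)
        # starts[i] is the global index of the first element of parts[i];
        # the final entry is the total length.
        self.starts = [0] + list(accumulate(len(p) for p in self.parts))

    def __len__(self):
        return self.starts[-1]

    def _locate(self, index):
        """Map a global index to (which part, index inside that part)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        part = bisect_right(self.starts, index) - 1
        return part, index - self.starts[part]

    def __getitem__(self, index):
        part, local = self._locate(index)
        return self.parts[part][local]

    def __setitem__(self, index, value):
        part, local = self._locate(index)
        self.parts[part][local] = value


a1 = [0, 0, 0]
a2 = [0, 0]
a3 = VirtualConcat1D([a1, a2])
a3[0] = 1  # lands in a1
a3[4] = 7  # lands in a2
```

Nothing is copied: `a3` only stores the offsets, so reads and writes both resolve to the original containers.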
We definitely want to see multi language support (cc: @manzt) Do you have thoughts on what extensions to zarr would need to be made? Definitely some sort of specification for being able to pass around manifests. But maybe this means needing to define some set of stores as well? Does mapping an HDF5 file to zarr fit as a zarr extension? As high priority of one?
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")
My idea currently would be that you don't need to assign to a store on concatenation. I think this could be a purely in-memory object and you resolve everything at access or storage time. I'd also even say I'm not sure a "reference" based array should be an instance or subclass of
a3[0] = 1 # write some data...automatically goes into the correct underlying array
I've mostly been thinking about read access at the moment. Not sure how this complicates things.
Me too! |
read-only is of course easier, but also just a special case of read-write. Our arraylake friends have a well-defined model of where and how writes should go regardless of where the original mapper/chunks were defined. |
Awesome. Yah, something that should be very possible with zarrita.js. |
Repeating https://observablehq.com/@manzt/ome-tiff-as-filesystemreference using some of these newer ideas would be awesome :) |
Sorry I've been out of the kerchunk loop for a bit but I've been trying to stay abreast of changes. zarrita.js has an HTTP-based reference store (and will try to map references with |
(sorry I haven't yet had the chance to review this) |
Thanks @rabernat - I think the proposal of defining how arrays should be concatenated at the Zarr level is really cool, and I would be interested to help. I also think that working this out consistently at the Zarr level would help expose the ways in which the current kerchunk API brushes subtleties under the rug. One of these subtleties actually just came up for me in practice in #386 - I got confused by whether concatenation in Kerchunk was supposed to follow xarray rules or lower-level array-like rules instead. We therefore have to be really clear about our type systems, as we now have at least 3 different data structures for which concatenation would be defined differently:
If we could get this straight it would be really powerful though. Also then ultimately |
Aren't dimension names optional? In which case these should just behave like numpy arrays? From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#dimension-names
|
Oh I guess so!
But obviously propagating the
z1 = zarr.Array(data_1d, dim_names=['t'])
z2 = zarr.Array(data_1d, dim_names=['x'])
zarr.concat([z1, z2], axis=0) # well-defined numpy-like operation, should return a 1D array, but what dim_name should the result have?
I'm just saying that having Zarr array type(s) that explicitly encode these propagation rules would be far better than implicitly doing this stuff in the depths of kerchunk. EDIT: It reminds me of this suggestion data-apis/array-api#698 (comment) |
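One way to make such a propagation rule explicit is a small, separately testable function. This is only a sketch: the function name and the three rule names (`"identical"`, `"first"`, `"drop_conflicts"`) are made up for illustration and are not taken from zarr, xarray or kerchunk.

```python
def combine_dim_names(names_per_array, rule="identical"):
    """Decide the dimension names of a concatenation result.

    names_per_array: one tuple of names (str or None) per input array.
    rule (hypothetical):
      "identical"      - raise unless every input agrees on every axis
      "first"          - take the names of the first input
      "drop_conflicts" - keep agreed names, use None where inputs disagree
    """
    if rule not in {"identical", "first", "drop_conflicts"}:
        raise ValueError(f"unknown rule {rule!r}")
    if len({len(names) for names in names_per_array}) != 1:
        raise ValueError("all inputs must have the same number of dimensions")
    if rule == "first":
        return tuple(names_per_array[0])

    combined = []
    for axis_names in zip(*names_per_array):
        if len(set(axis_names)) == 1:
            combined.append(axis_names[0])
        elif rule == "drop_conflicts":
            combined.append(None)
        else:  # rule == "identical"
            raise ValueError(f"conflicting dimension names: {axis_names}")
    return tuple(combined)
```

With the inputs from the example above, `"first"` would give `('t',)`, `"drop_conflicts"` would give `(None,)`, and `"identical"` would raise, which is exactly the ambiguity that currently gets decided implicitly.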
I wonder how we can get together kerchunk-interested people? Do we need a dedicated slack or discord or something? |
I think it's reasonable to let people do concatenation like numpy arrays regardless of what the dimension names are. array API as much as possible would be a good idea, just to reduce the amount of new things one needs to learn. Could something like the merge logic in xarray be used for handling dimension names and other metadata (i.e. anything in attrs)? E.g. rules like "unique", "first"? I've been thinking about operations other than concatenation that are achievable with kerchunk:
It would be nice to have these operations exposed in a way that doesn't require rewriting actual chunk data.
If there was a zarr zulip/ discord/ slack maybe kerchunk could fit in there? |
I second this, especially given that not all HDF or other files will be netCDF compliant anyway, and it would be good to have an opinion on what to do with them. |
Xarray already has an array wrapper type that does all this except concatenation and metadata handling: pydata/xarray#5081. |
For a long time I've felt like there's code we can reuse from dask for the concatenation piece. A |
From kerchunk's point of view, though, we want to figure out where references belong, rather than building a dask graph. The code would probably be quite different? I was wondering whether dask concatenation could get around the "different chunking" problem beyond var-chunks; i.e., each input dataset is first rechunked to a common grid and then concatenated, making an aggregate of aggregate datasets. That would be the only thing you could do in some circumstances. If the dask chunks are bigger than the native chunks, it may even be reasonably performant. |
This is a really useful observation, and I agree we could probably re-use code from xarray & dask. I think we should avoid a dependency on either though. EDIT: Avoiding a dependency on xarray would be good motivation to create the standalone lazy arrays package. |
To satisfy both of these concerns at once we would need to make a 3rd repository just for |
Why not make that thing here in kerchunk? |
Maybe it could, but I'm imagining there might be other uses for the VirtualZarr idea that aren't just in the kerchunk package? e.g. the ArrayLake stuff? I do also think now that it would be conceptually neat to separate a concatenateable VirtualZarr class (/ "chunk manifest") from the kerchunk code that actually reads specific legacy file formats. I don't really have a strong opinion on this though. |
Since the kerchunk package doesn't depend on other things, I think it might be a good place, perhaps just to start. I am happy to include more repo collaborators, if that's a blocker. |
@dcherian I think the indexing capabilities of LazilyIndexedArray may be too powerful here. My understanding was that it can handle indexing statements that kerchunk can't, since Though I would really love to see a split-out LazilyIndexedArray. May be able to solve a big class of issues we have over on anndata.
I would have some interest in the in-memory class being something that can be written to a chunk manifest, but not necessarily containing a chunk manifest. Some things I think this could help with:
I think it would make sense to have some prototyping before adding these to an established library. |
If we have a We could make a kerchunk backend for xarray that returns a
ds = xr.open_mfdataset(
'/my/files*.nc',
engine='kerchunk', # kerchunk registers an xarray IO backend that returns zarr.Array objects
combine='nested', # 'by_coords' would require actually reading coordinate data
parallel=True, # would use dask.delayed to generate reference dicts for each file in parallel
)
ds # now wraps a bunch of zarr.Array / kerchunk.Array objects directly
ds.kerchunk.to_zarr(store='out.zarr') # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)
Pros:
Cons:
|
Additional con: that would require opening and reading potentially thousands of reference sets at open_dataset time. If they all formed a regular grid such that you know where everything goes without opening them (from some naming convention or other manifest), well then you've done MultiZarrToZarr already. |
Doesn't that step have to happen however we do this?
The various keyword arguments to
In what sense? You still need to actually create the correct json reference file, or are you saying that should be trivial if you are 100% confident in the layout of the files a priori? |
In the kerchunk model, it happens once; in the concat model, it happens every time.
Both: if you need a manifest, someone has to make this, and kerchunk already can for a range of cases. xarray can do it for a set of already-open datasets but it cannot persist the overall dataset state. Do we need a third way? If we can infer position from names (or something else that is simple), then making the manifest/reference set is indeed trivial. https://nbviewer.org/github/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb was an example of the latter (4D and 5D data), no sub-file coord decode required (but the parameter space is truly massive, and a pre-pass to see which parts are populated was really helpful). |
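As an illustration of "infer position from names", here is a small stdlib-only sketch. The function name, the filename layout and the regex are all invented for the example (this is not what the earthbigdata notebook does); coordinate values are kept as strings and sorted lexically, which only works for zero-padded fields.

```python
import re


def positions_from_names(filenames, pattern):
    """Place files on an N-D grid using named groups in a regex.

    Returns (coords, grid): coords maps each dimension to its sorted unique
    values, grid maps an index tuple to the filename at that position.
    Cells with no file are simply absent from grid.
    """
    regex = re.compile(pattern)
    dims = list(regex.groupindex)  # named groups, in order of appearance
    parsed = {}
    for name in filenames:
        match = regex.search(name)
        if match is None:
            raise ValueError(f"{name!r} does not match {pattern!r}")
        parsed[name] = match.groupdict()

    coords = {d: sorted({p[d] for p in parsed.values()}) for d in dims}
    grid = {}
    for name, p in parsed.items():
        position = tuple(coords[d].index(p[d]) for d in dims)
        if position in grid:
            raise ValueError(f"{name!r} and {grid[position]!r} map to the same cell")
        grid[position] = name
    return coords, grid


files = ["flux_2002_01.nc", "flux_2001_02.nc", "flux_2001_01.nc"]
coords, grid = positions_from_names(files, r"(?P<year>\d{4})_(?P<month>\d{2})")
```

The resulting `grid` is sparse on purpose: (2002, 02) has no file here, which is the kind of thing the pre-pass over a huge parameter space would reveal.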
There is a lot to look at in the notebook, and I probably don't understand xarray well enough to make too much sense of it - I hope others can have a look. People have asked for ways to auto-kerchunk a dataset on open with xr, and you've certainly covered that (note that engine="kerchunk" already is used for passing in references sets, but overloading it might be OK).
I think this is the nub. You use the xarray API, but the concat actually happens here, no? So I'm not seeing what we're getting from xarray's concat logic at all, it's just another place to put code that already works in kerchunk. Actually getting coordinates out of a given reference set (pre-scanned) is fast, so how does xarray "check" and sort those for us? |
The main point is that it's possible to abstract away everything
This is a bit different from what Andrew suggested in https://discourse.pangeo.io/t/making-kerchunk-as-simple-as-a-toggle/369. His
Yes
Concatenation of reference sets does already work in kerchunk, but I originally raised this issue because the user experience could be greatly improved. Refactoring to use xarray's UI instead gets us:
I'm not totally sure I understand what you mean (because it's not super clear to me what checks kerchunk performs and when), but we could allow xarray to read coordinates, build indexes, and use those to automatically sort datasets. That's what
It's specific to assuming you have represented your zarr references as xarray-wrapped I mean of course this is a big suggestion, and we don't have to do any of this, but I just think it's a pretty good idea 🤷♂️ |
This would be the killer motivation for me (gets around many places where the non-existence of variable chunking in zarr is fatal). How complex would it be to demonstrate? |
To be clear: you're asking if we could use this approach to concatenate datasets of inconsistent chunk length, and then access them via fsspec, all before the variable-length chunks ZEP is accepted?
If you can give me a specific example I can have a go. |
Specifically, where each input is a legal zarr, but perhaps the last chunk is not the same size as the others, or the sizes of chunks from one input to the next are not the same. |
Okay so I think you could make general variable-length chunking work just by generalizing the This would obviously not be a valid set of zarr attributes (until a variable-length chunks ZEP was accepted), but you could make writing these variable-length chunks |
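To illustrate the problem case described above: each input can have a perfectly legal regular chunk grid (last chunk allowed to be short), yet the concatenation is irregular. A stdlib-only sketch with hypothetical helper names, tracking chunk lengths along the concatenation axis only:

```python
from itertools import accumulate


def is_regular(chunks):
    """True if all chunks have equal length, except that the last may be shorter."""
    if not chunks:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[:-1]) and 0 < chunks[-1] <= first


def concat_chunk_lengths(chunks_per_input):
    """Chunk lengths along the concatenation axis after joining the inputs."""
    return tuple(c for chunks in chunks_per_input for c in chunks)


def chunk_slices(chunks):
    """Element range covered by each chunk, as (start, stop) pairs."""
    stops = list(accumulate(chunks))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


a = (10, 10, 4)  # regular: short final chunk
b = (8, 8)       # regular, but a different chunk length
joined = concat_chunk_lengths([a, b])
```

Here `is_regular(a)` and `is_regular(b)` both hold while `is_regular(joined)` does not, so the joined array can only be described by storing the explicit chunk lengths (or the derived slices) rather than a single chunk size.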
I did this within zarr as POC in zarr-developers/zarr-python#1483 (see also zarr-developers/zarr-python#1595 ) This all brings us to the same conversation: this general logic of which chunk goes where could be implemented in the storage layer (kerchunk/fsspec), zarr or xarray. xarray has its complex indexers, zarr has code simplicity and a natural place to store the chunks, and the storage layer is the only one where we can iterate very quickly and control everything. What we definitely don't want, though, is to repeat the work in multiple places in different ways :| |
There is lots of prior art here that would be worth studying. Specifically: NCML. |
Coming back to the OP, and ignoring the virtual arrays conversation: it occurs to me that the first 4 bullets are essentially special cases of what MZZ does. We could make them separate, simpler functions, or we could provide simple wrapper function APIs that pass the right info to MZZ. |
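A thin wrapper of that kind might look roughly like the sketch below. The helper names `concat_kwargs` and `concat_references` are hypothetical; `concat_dims`, `coo_map`, `identical_dims` and `translate()` are existing `MultiZarrToZarr` names, but the exact behaviour of a list-valued `coo_map` entry is an assumption to verify against the installed kerchunk version.

```python
def concat_kwargs(dim, values=None, identical_dims=()):
    """Keyword arguments for a plain concatenation along one dimension.

    values: optional ordered list of coordinate values, one per input
    (assumed to be accepted as a coo_map entry; check your kerchunk version).
    """
    kwargs = {"concat_dims": [dim], "identical_dims": list(identical_dims)}
    if values is not None:
        kwargs["coo_map"] = {dim: list(values)}
    return kwargs


def concat_references(refs, dim, values=None, identical_dims=(), **mzz_options):
    """Hypothetical xr.concat-like wrapper that delegates all the work to MZZ."""
    # Imported lazily so the kwargs helper can be used without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        refs, **concat_kwargs(dim, values, identical_dims), **mzz_options
    )
    return mzz.translate()
```

The point of the split is that the user-facing function has one job and a small signature, while everything MZZ-specific stays in one place.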
(Copying Martin's responses to some Q's of mine over on #386)
|
(My responses)
I disagree - I think that what MZZ does can and should be decomposed into a series of steps just like that, you just also need to include the determination of the total ordering in the series of steps. Xarray can either be told this ordering (via the order of inputs to
The workflow would be different if the user needs to go down to the level of |
I would say that this is almost entirely what MZZ does right now, and I would be very interested to see how it could be done independently for a second stage of tree-like merge/combine/concat. Would the decomposition you are after be better described as
|
Yes - this is exactly what xarray already does - you can see it clearly by how in the combine module there are two internal functions: EDIT: And all of this nesting is conveniently hidden from the user when they just call |
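To show the "infer the ordering from coordinates" step in isolation, here is a simplified 1-D sketch. It is not xarray's implementation; it assumes each input exposes its coordinate values along the concatenation dimension as a plain list, and the file names are placeholders.

```python
def infer_concat_order(coords_by_name):
    """Order inputs along one dimension using their coordinate values.

    coords_by_name: {input name: list of coordinate values along the dimension}.
    Raises if an input is empty or not strictly increasing, or if two inputs
    overlap, since no unambiguous ordering exists in those cases.
    """
    for name, values in coords_by_name.items():
        if not values:
            raise ValueError(f"{name!r} has no coordinate values")
        if any(b <= a for a, b in zip(values, values[1:])):
            raise ValueError(f"{name!r} is not strictly increasing")

    # Sort by the first coordinate value of each input.
    order = sorted(coords_by_name, key=lambda n: coords_by_name[n][0])

    # Neighbouring inputs must not overlap.
    for prev, nxt in zip(order, order[1:]):
        if coords_by_name[nxt][0] <= coords_by_name[prev][-1]:
            raise ValueError(f"{prev!r} and {nxt!r} overlap")
    return order


order = infer_concat_order({"b.nc": [3, 4], "a.nc": [0, 1, 2], "c.nc": [5]})
```

Only the small coordinate arrays are read here; the main data variables are never touched, and the returned order is exactly what would be passed on to a nested-style combine.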
OK, so all this time I have been asking about how one might use xarray's machinery to serialise the set of positions of the input data, and now I know that _infer_concat_order_from_coords exists... The remaining differences that would need to be figured out:
It sounds like you already have ideas for all of this? |
This is just a special case of specifying the total ordering. The simplest way is just to ask the user to deduce the ordering and pass a list of datasets of the correct order to EDIT: Apparently we can use
import re

import cftime
import xarray as xr

def add_time_dim_coord_from_filename(ds: xr.Dataset) -> xr.Dataset:
"""Adds a new dimension coordinate containing the time value read from the datasets' filename."""
filename = ds.encoding["source"]
# actually derive the time value from the filename string
subst = re.search(r"\d{4}-\d{2}", filename)[0]
time_val = cftime.DatetimeNoLeap.strptime(subst, '%Y-%m', calendar='noleap', has_year_zero=True)
time_dim_coord = xr.DataArray(name='time', data=[time_val], dims='time')
return ds.assign_coords(time=time_dim_coord)
ds = xr.open_mfdataset(
'/my/files*.nc',
engine='kerchunk', # kerchunk registers an xarray IO backend that returns zarr.Array objects
combine='by_coords', # 'by_coords' would require actually reading coordinate data
preprocess=add_time_dim_coord_from_filename,
)
ds # now wraps a bunch of zarr.Array / kerchunk.Array objects directly
ds.kerchunk.to_zarr(store='out.zarr')
Keeping track of the locations of the chunks is a separate requirement, which should be the responsibility of the |
Right, we definitely need to store that manifest (or whatever it is called/however structured), we cannot be opening all of the input files each time we want access to the dataset. So open_mf is not realistic - even listing files can be expensive - we need to use those low level functions to figure out where things go and save that, ideally (I think) in the existing kerchunk format. |
I am not suggesting that I think of what kerchunk is doing as effectively caching the result of all the ordering, checks, and concatenation that |
That's a separate question from the whole "split everything into array operations that can be wrapped and applied with xarray API" suggestion. It's the same thing as asking do we combine |
Sorry, misunderstood you |
I wonder if it would be better to do this at the reference extraction step (in |
Will some of these cases trigger computation? Or do you mean it will generally be fine in the netcdf-esque situation? |
For mergers where you need the coordinate values, you will always need to compute something at some point, but the main data arrays should not need to be touched. |
General reference to: https://github.com/TomNicholas/VirtualiZarr (which I have not yet had chance to look through) |
Thanks Martin...the above was a GitHub rec this week, so that was good. I have only looked at it briefly. |
Yep - I was going to wait until I had the MVP fully working before commenting here, but basically I've been working on a new package "VirtualiZarr" that aims to implement all the ideas detailed above. The rationale is laid out in the readme, but basically, instead of manipulating references as nested dicts that represent entire files, we just use kerchunk to read the byte ranges of chunks from files, but then immediately hide the details inside array-like The
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
The The concatenation itself is defined at the level of the Manifest, where it can be performed by just renaming keys and merging dicts. Once you've combined / concatenated / merged the virtual representations of your files using xarray, you write the complete reference sets to disk using an xarray accessor, as
ds.virtualize.to_kerchunk(outpath, format='json')
Now you should be able to open these using fsspec as you normally would open the results of using
Because this design now separates the reading of byte ranges from the representation of them in memory from the concatenation of them from the on-disk format we save them to, it's easier to see how upstream changes in zarr (e.g. chunk manifests in zarr, virtual array concatenation in zarr, and variable chunks in zarr) could plug into |
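The "renaming keys and merging dicts" step can be sketched in a few lines. This is an illustration rather than VirtualiZarr's actual code: `concat_manifests` is a hypothetical name, the second file path is made up, and it assumes all inputs share the same chunk shape and that each input's extent along the concatenation axis is a whole number of chunks.

```python
def concat_manifests(manifests, grid_shapes, axis):
    """Concatenate chunk manifests along one axis of the chunk grid.

    manifests:   list of {"i.j.k": {"path": ..., "offset": ..., "length": ...}}
    grid_shapes: chunk-grid shape of each manifest, e.g. (1, 2, 2)
    """
    if len(manifests) != len(grid_shapes):
        raise ValueError("need exactly one grid shape per manifest")
    if not manifests:
        return {}

    reference = grid_shapes[0]
    combined = {}
    offset = 0  # number of chunks already placed along `axis`
    for manifest, shape in zip(manifests, grid_shapes):
        same_rank = len(shape) == len(reference)
        others_match = all(
            s == r for i, (s, r) in enumerate(zip(shape, reference)) if i != axis
        )
        if not (same_rank and others_match):
            raise ValueError("chunk grids must agree on every axis except `axis`")
        for key, entry in manifest.items():
            index = [int(part) for part in key.split(".")]
            index[axis] += offset  # the "renaming" step
            combined[".".join(str(i) for i in index)] = entry
        offset += shape[axis]
    return combined


foo = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
# Second file with the same layout; the path is a placeholder.
bar = {key: {**entry, "path": "s3://bucket/bar.nc"} for key, entry in foo.items()}

combined = concat_manifests([foo, bar], [(1, 2, 2), (1, 2, 2)], axis=0)
```

No byte ranges are read or rewritten; only the chunk keys of the second manifest shift from `0.*.*` to `1.*.*`.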
Problem
MultiZarrToZarr is extremely powerful but rather hard to use.
This is important - kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.
Part of the problem is that this one MultiZarrToZarr function can do many different things. Contrast with xarray - when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary):
xr.concat where dim is a str
xr.concat where dim is a set of values
xr.merge
xr.combine_nested
xr.combine_by_coords
In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through MultiZarrToZarr. It also cannot currently easily handle certain types of multi-dimensional concatenation.
Suggestion
Break up MultiZarrToZarr by defining a set of functions similar to xarray's merge/concat/combine/unify_chunks that consume and produce VirtualZarrStore objects (EDIT: see #375).
Advantages
coo_map kwarg (it has 10 possible input types!). Perhaps giving simply an ordered list of coordinate values would be sufficient, and just make it easier for the user to extract the values they want from the VirtualZarrStore objects they want to concatenate.
kerchunk.combine.auto_dask does).
Questions
merge/concat/combine? And what can we learn from the design decisions in pangeo-forge-recipes FilePattern? (@cisaacstern @rabernat)
combine.merge_vars and combine.concatenate_arrays functions to providing this functionality? If the answer is "pretty close", then how much of this issue could be solved via documentation?