[BUG] MultiZarrToZarr fails to concatenate dimension coordinate variables #386
I'm trying to find the cause of the bug, but I'm struggling to follow the control flow of the code. I'm suspicious of these lines, though: line 415 at commit cdd106b.
> Is it fair to say that *coordinate variables* are essentially the same as *variables*?

Basically - they are literally variables which are also designated as being "coordinates". You can see how this works in `xarray.Dataset` here: https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L656

An extra piece of information is needed to distinguish coordinate variables from variables (xarray calls the latter "data variables"). I do not actually know how this extra piece of information is stored in Zarr, because it must be more than just the `_ARRAY_DIMENSIONS`. I will look into that.
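For illustration, a minimal sketch (not from the thread) showing that in xarray the designation is the only difference; `set_coords` moves a variable between the two groups without touching its data:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("time",), np.arange(3)),          # will stay a data variable
        "t_offset": (("time",), [10, 20, 30]),   # same kind of object, currently a data variable
    }
)
print(list(ds.data_vars))    # ['a', 't_offset']

# designate 't_offset' as a coordinate; the underlying variable is unchanged
ds2 = ds.set_coords("t_offset")
print(list(ds2.coords))      # ['t_offset']
print(list(ds2.data_vars))   # ['a']
```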
> If you include `time` in the `concat_dims`, you do get all 6 values out

You do? I thought I tried that...

> I assume this is not what you want.

No - unfortunately that's not the correct result: all the variables that aren't in `identical_dims` should acquire the `id` dimension.
> I assume your original situation was thus, but I would explicitly try also having a variable that depends on `[id, time]` to see if that matters.

This might be the critical thing. Essentially, we want to treat …
It's in the group-level `.zattrs`:

```json
{
  "metadata": {
    ".zattrs": {
      "coordinates": "a"
    },
    ".zgroup": {
      "zarr_format": 2
    },
    ...
```
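As a hedged illustration of how that designation can be recovered (the dict below just mirrors the snippet above; the exact lookup xarray's zarr backend performs may differ in detail), the group-level `coordinates` attribute is a space-separated list of variable names:

```python
# a consolidated-metadata-style dict shaped like the snippet above (illustrative)
zmeta = {
    "metadata": {
        ".zattrs": {"coordinates": "a"},
        ".zgroup": {"zarr_format": 2},
    }
}

# names listed in the group-level "coordinates" attribute are treated as
# coordinate variables when the store is opened; everything else becomes a data variable
coord_names = zmeta["metadata"][".zattrs"].get("coordinates", "").split()
print(coord_names)  # ['a']
```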
So I was going to reply saying this: "I think that whether or not a variable is a coordinate should be irrelevant for how it is concatenated." But actually I've just realised that xarray wouldn't concatenate `time` along the new dimension either.
Here's what happens if you try this in xarray:

```python
import numpy as np
import xarray as xr

a_data1 = (np.arange(6) + 10).reshape(2, 3)
a_data2 = (np.arange(6) + 20).reshape(2, 3)
time1 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data1)},
    coords={'time': ('time', [1, 2, 3])},
)
time2 = xr.Dataset(
    data_vars={'a': (('id', 'time'), a_data2)},
    coords={'time': ('time', [4, 5, 6])},
)
```

```python
In [50]: xr.concat([time1, time2], dim='id')
Out[50]:
<xarray.Dataset>
Dimensions:  (time: 6, id: 4)
Coordinates:
  * time     (time) int64 1 2 3 4 5 6
Dimensions without coordinates: id
Data variables:
    a        (id, time) float64 10.0 11.0 12.0 nan nan ... nan 23.0 24.0 25.0
```

That also didn't concatenate `time` along the new `id` dimension. So we have a situation in which kerchunk is (possibly unintentionally?) behaving the same way as xarray. My intuition about what should happen here was wrong, at least for xarray, but the reason why xarray has that behaviour does not actually apply to kerchunk! Zarr arrays don't have indexes, so this concatenation could be done perfectly easily by kerchunk.

What's the desired behaviour? I'm not sure now, but it's a question of data models, and this relates to #377 in that clearer abstractions/type systems would help expose these kinds of differences more clearly.
> unless you wanted `time` to be stacked along the new `id` dimension too. In which case, yes, that's the Xarray data model for indexed variables, we'd need to destroy the index to do what you want and we want the user to do that explicitly. And yes I agree, Kerchunk should not be following Xarray here
This is exactly what I am saying! For my use case of Kerchunk I did want `time` to be stacked along the new `id` dimension. I only remembered that was different to what xarray would do after trying it with xarray.
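For what it's worth, here is a hedged sketch of how that stacked behaviour can be reproduced in xarray today, by storing the time values as a plain (non-indexed) coordinate so that there is no index to destroy; the name `time_vals` is just illustrative:

```python
import numpy as np
import xarray as xr

a_data1 = (np.arange(6) + 10).reshape(2, 3)
a_data2 = (np.arange(6) + 20).reshape(2, 3)

# same data as above, but the time values are a plain coordinate on the 'time'
# dimension rather than an indexed dimension coordinate
time1 = xr.Dataset(
    data_vars={"a": (("id", "time"), a_data1)},
    coords={"time_vals": ("time", [1, 2, 3])},
)
time2 = xr.Dataset(
    data_vars={"a": (("id", "time"), a_data2)},
    coords={"time_vals": ("time", [4, 5, 6])},
)

stacked = xr.concat([time1, time2], dim="id")
print(stacked["a"].dims, stacked["a"].shape)                  # ('id', 'time') (4, 3)
print(stacked["time_vals"].dims, stacked["time_vals"].shape)  # ('id', 'time') (4, 3), no nan padding
```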
> Kerchunk should not be following Xarray here
In which case there is a real bug in Kerchunk to be fixed!
…On Fri, Nov 3, 2023, 1:24 PM Deepak Cherian wrote:

> I'm so confused. These both seem sensible to me:
> [image](https://user-images.githubusercontent.com/2448579/280370127-9148912d-d239-4b8f-9e28-95aceda6694e.png)
I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow-moving compared to the needs of kerchunk, and I feel like we probably will have a wider set of cases we want to cover than zarr wants. I am aware that to some extent we end up replicating stuff in xarray (like the encoding stuff), but I don't mind that for the sake of being explicit. What we really need is a good set of test cases to describe expected behaviour. I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting-edge zarr features (like var-chunks) too.
Thanks for your perspective @martindurant. I feel like your comments should be posted on #377 too?
Er yes, having multiple conversations in motion is part of the problem!
Coming back to this, three thoughts:
```python
import functools
from typing import Literal, NewType, Union

from kerchunk.combine import MultiZarrToZarr

# lower-level dict containing just the information for one zarr array
T_ArrRefs = NewType("T_ArrRefs", dict[Union[Literal['.zattrs'], Literal['.zarray'], str], dict])


def rename_var(refs: T_ArrRefs, old_name: str, new_name: str) -> T_ArrRefs:
    """Rename a variable in a kerchunk array-level reference dict without renaming any corresponding dimension."""
    for old_key in list(refs):
        if old_key.startswith(old_name):
            new_key = old_key.replace(old_name, new_name)
            refs[new_key] = refs.pop(old_key)
    return refs


# combine along new dimension 'id' ('single_jsons' is the list of per-file
# reference sets from the example above)
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
    preprocess=functools.partial(rename_var, old_name='time', new_name='dummy'),
    postprocess=functools.partial(rename_var, old_name='dummy', new_name='time'),
)
combined_test_json = mzz.translate()
```

which gives the combined reference dict.
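A quick way to sanity-check the combined references, sketched here as an assumption-laden example rather than part of the workaround itself, is to open them through fsspec's reference filesystem:

```python
import fsspec
import xarray as xr

# 'combined_test_json' is the dict returned by mzz.translate() above
fs = fsspec.filesystem("reference", fo=combined_test_json)
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
print(ds["time"].dims)  # after the rename round-trip this should include the new 'id' dimension
```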
Quick answers:
The first pass is to find all the coordinate values in all of the chunks, so that in the second pass we know, for a particular value, where in the chunk grid it fits. When using `concat([...])`, this ordering (on exactly one dimension) is implicit in the order of the provided list.
Those are much more specific and inflexible - useful in some limited circumstances. What MZZ does cannot be cleanly decomposed into a set of steps like that.
To the larger point: MZZ tries to form a sensible aggregate dataset out of all the inputs, by finding the set of coordinates in each chunk of each input (where "coordinate" can include "variable name"). This includes the various combine options of `open_mfdataset`, which was the inspiration. If you wanted a decomposition of what MZZ does into separate functions, you would need the user to figure out which files belong together for each sub-concat along x, which then get concatenated in a separate step along y. The user would need to know the groupings and the order all the way down. That's essentially how dask's stack and concat methods work. The idea, instead, is to be able to infer those values and figure out the whole grid in one operation, whether the origin of the coordinates is variables within each dataset, file names or something else.

For the specific issue here, it sounds like the logic of "is this a coordinate" isn't quite right, and given the right reproducer for testing, we should be able to address it easily.
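For illustration, a hedged sketch of those two styles of supplying the coordinate values (the file names and variable choices here are hypothetical, not from this issue):

```python
from kerchunk.combine import MultiZarrToZarr

# coordinate values inferred from a variable inside each input
mzz_inferred = MultiZarrToZarr(
    ["store_2000.json", "store_2001.json"],   # hypothetical per-file reference sets
    concat_dims=["time"],
    coo_map={"time": "cf:time"},              # read and CF-decode the 'time' variable in each input
)

# coordinate values supplied explicitly, e.g. known from the file naming scheme
mzz_explicit = MultiZarrToZarr(
    ["member_a.json", "member_b.json"],
    concat_dims=["member"],
    coo_map={"member": ["a", "b"]},           # one value per input, in list order
    identical_dims=["time", "lat", "lon"],
)
combined = mzz_explicit.translate()
```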
(None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise, is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller.)
Thanks for those thoughts @martindurant. I think they would be useful over on #377, so I will copy them over and reply to them there. Coming back to this issue - my workaround exposed other bugs, which I've reported in #434.
When combining along a new dimension using `coo_map`, `MultiZarrToZarr` fails to concatenate dimension coordinate variables, despite concatenating coordinate variables just fine.

Minimal example:
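A rough sketch of such a reproducer, with illustrative names and values (it assumes netCDF4/HDF5 inputs converted via kerchunk's `SingleHdf5ToZarr`; the original example may have differed):

```python
import fsspec
import numpy as np
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# two inputs that share a 'time' dimension coordinate but belong to different ids
paths = []
for i in range(2):
    ds = xr.Dataset(
        data_vars={"a": (("time",), np.arange(3) + 10 * (i + 1))},
        coords={"time": ("time", np.arange(3) + 3 * i)},
    )
    path = f"input_{i}.nc"
    ds.to_netcdf(path)  # netCDF4/HDF5 format by default
    paths.append(path)

single_jsons = []
for path in paths:
    with fsspec.open(path, "rb") as f:
        single_jsons.append(SingleHdf5ToZarr(f, path).translate())

mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={"id": [10, 20]},
)
combined = mzz.translate()
# expected: 'time', like any variable not listed in identical_dims, acquires the
# new 'id' dimension, so all six time values should be present in the output
```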
This is not what I expected - the `time` variable should have dimensions `(time, id)` - we've lost half the time values. The variable `time` should have been concatenated along `id` because I did not specify it in `identical_dims`.

What's weird is that this works as expected for coordinate variables, just not for dimension coordinates. In other words, if I rename the `time` variable to `time_renamed`, but have it still be a function of a dimension named `time`, then the concatenation happens as expected: