Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous behavior with coordinates when appending to Zarr store with append_dim #8427

Closed
rabernat opened this issue Nov 8, 2023 · 4 comments · Fixed by #8428
Closed

Ambiguous behavior with coordinates when appending to Zarr store with append_dim #8427

rabernat opened this issue Nov 8, 2023 · 4 comments · Fixed by #8428

Comments

@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2023

What happened?

There are two quite different scenarios covered by "append" with Zarr

  • Adding new variables to a dataset
  • Extending arrays along a dimensions (via append_dim)

This issue is about what should happen when using append_dim with variables that do not contain append_dim.

Here's the current behavior.

import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name="foo"
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x':[-1], 'y': [-2]},
    name="foo"
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)
# we do not check that the coordinates are aligned--just that they have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinates data have been overwritten 
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}
# ...with the latest values
assert ds_append.x.data[0] == -1

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with append_dim)

  1. [current behavior] Do not attempt to align coordinates
    a. [current behavior] Overwrite coordinates with new data
    b. Keep original coordinates
    c. Force the user to explicitly drop the coordinates, as we do for region operations.
  2. Attempt to align coordinates
    a. Fail if coordinates don't match
    b. Extend the arrays to replicate the behavior of concat

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.176-157.645.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.2
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: 0.6.0
fsspec: 2023.10.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.16.1
sphinx: None

@rabernat rabernat added bug needs triage Issue that has not been reviewed by xarray team member labels Nov 8, 2023
@rabernat rabernat added design question needs discussion and removed needs triage Issue that has not been reviewed by xarray team member labels Nov 8, 2023
@max-sixty
Copy link
Collaborator

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.

+1

@dcherian
Copy link
Contributor

dcherian commented Nov 15, 2023

We could have

  1. mode='a' for (1a) which is append for all variables with append_dim + overwrite all others. This is the current behaviour
  2. mode='a-' for (1b) which is only write variables with append_dim, no alignment is checked.

@dcherian dcherian removed the bug label Nov 15, 2023
@rabernat
Copy link
Contributor Author

That sounds good Deepak! Feel free to pick up #8428 and extend it with those options.

@dcherian
Copy link
Contributor

dcherian commented Nov 16, 2023

#8428 now allow mode="a-" but the default is the existing mode="a".

We could transition the default to mode="a-" with a deprecation cycle, what do people think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants