Ensure that chunking is respected (#379)
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #xyz
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features).
- [ ] (If applicable) Tests have been added.
- [x] This PR does not seem to break the templates.
- [x] CHANGES.rst has been updated (with summary of main changes).
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added.

### What kind of change does this PR introduce?

* The `original_shape` and `chunksizes` encodings can conflict when a
dataset is saved. This PR makes sure that `original_shape` is always
removed from each variable's encoding before saving.
* In addition (possibly new in the latest version of `xarray` and the
`netcdf4` engine?), dropping `chunksizes` appears to lead to unexpected
behaviour, such as bloated file sizes and incorrect chunking on disk.
The `chunksizes` encoding is therefore now set explicitly, with one
value per dimension.
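
As a rough illustration of the two points above, here is a minimal sketch that uses plain dictionaries and tuples instead of an `xarray` dataset. The helper names `explicit_chunksizes` and `clean_encoding` are hypothetical and do not exist in `xscen`; they only mirror the logic this PR applies to each variable's encoding.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def explicit_chunksizes(
    dims: Sequence[str],
    sizes: Mapping[str, int],
    rechunk_dims: Mapping[str, int],
) -> tuple[int, ...]:
    """Return one chunk length per dimension, in dimension order.

    Dimensions that are not listed in ``rechunk_dims`` fall back to their
    full length, i.e. a single chunk along that dimension.
    """
    return tuple(rechunk_dims.get(d, sizes[d]) for d in dims)


def clean_encoding(encoding: Mapping, chunksizes: tuple[int, ...]) -> dict:
    """Return a copy of ``encoding`` with stale chunking hints replaced."""
    cleaned = dict(encoding)
    # Drop the keys that this PR removes before saving.
    for key in ("original_shape", "chunks", "preferred_chunks"):
        cleaned.pop(key, None)
    # Set the requested chunking explicitly instead of dropping it.
    cleaned["chunksizes"] = chunksizes
    return cleaned


# Hypothetical variable with dimensions (time, lat, lon).
dims = ("time", "lat", "lon")
sizes = {"time": 365, "lat": 50, "lon": 60}
old_encoding = {
    "original_shape": (365, 50, 60),
    "chunksizes": (1, 50, 60),
    "zlib": True,
}

chunksizes = explicit_chunksizes(dims, sizes, {"time": 100, "lat": 25})
encoding = clean_encoding(old_encoding, chunksizes)
print(encoding)
```

Unrelated keys such as `zlib` are left untouched; only the chunking-related hints are rewritten.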

### Does this PR introduce a breaking change?

* No.


### Other information:

Related Issues:
pydata/xarray#8385
pydata/xarray#8062
RondeauG authored Apr 11, 2024
2 parents 7f59fdc + 2aa893c commit 65faf07
Showing 5 changed files with 13 additions and 4 deletions.
1 change: 1 addition & 0 deletions CHANGES.rst
@@ -31,6 +31,7 @@ Bug fixes
* Fixed a bug to accept `group = False` in `adjust` function. (:pull:`366`).
* `creep_weights` now correctly handles the case where the grid is small, `n` is large, and `mode=wrap`. (:issue:`367`).
* Fixed a bug in ``tasmin_from_dtr`` and ``tasmax_from_dtr``, when `dtr` units differed from tasmin/max. (:pull:`372`).
* Fixed a bug where the requested chunking would be ignored when saving a dataset (:pull:`379`).

v0.8.3 (2024-02-28)
-------------------
2 changes: 1 addition & 1 deletion environment-dev.yml
@@ -35,7 +35,7 @@ dependencies:
- zarr
# Opt
- nc-time-axis >=1.3.1
- pyarrow >=1.0.0
- pyarrow >=10.0.1
# Dev
- babel
- black ==24.2.0
2 changes: 1 addition & 1 deletion environment.yml
@@ -37,5 +37,5 @@ dependencies:
- babel
# Opt
- nc-time-axis >=1.3.1
- pyarrow >=1.0.0
- pyarrow >=10.0.1
- pip
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -55,7 +55,7 @@ dependencies = [
"pandas >=2.2",
"parse",
# Used when opening catalogs.
"pyarrow",
"pyarrow>=10.0.1",
"pyyaml",
"rechunker",
"scipy",
10 changes: 9 additions & 1 deletion xscen/io.py
@@ -401,6 +401,8 @@ def save_to_netcdf(
for var in list(ds.data_vars.keys()):
if keepbits := _get_keepbits(bitround, var, ds[var].dtype):
ds = ds.assign({var: round_bits(ds[var], keepbits)})
# Remove original_shape from encoding, since it can cause issues with some engines.
ds[var].encoding.pop("original_shape", None)

_coerce_attrs(ds.attrs)
for var in ds.variables.values():
@@ -519,6 +521,8 @@ def _skip(var):
encoding.pop(var)
if keepbits := _get_keepbits(bitround, var, ds[var].dtype):
ds = ds.assign({var: round_bits(ds[var], keepbits)})
# Remove original_shape from encoding, since it can cause issues with some engines.
ds[var].encoding.pop("original_shape", None)

if len(ds.data_vars) == 0:
return None
@@ -904,8 +908,12 @@ def rechunk_for_saving(ds: xr.Dataset, rechunk: dict):
ds[rechunk_var] = ds[rechunk_var].chunk(
{d: chnks for d, chnks in rechunk_dims.items() if d in ds[rechunk_var].dims}
)
ds[rechunk_var].encoding.pop("chunksizes", None)
ds[rechunk_var].encoding["chunksizes"] = tuple(
rechunk_dims[d] if d in rechunk_dims else ds[d].shape[0]
for d in ds[rechunk_var].dims
)
ds[rechunk_var].encoding.pop("chunks", None)
ds[rechunk_var].encoding.pop("preferred_chunks", None)

return ds
