Investigate xarray chunking #30

Open
dfulu opened this issue Aug 15, 2024 · 3 comments
Labels
bug Something isn't working

Comments

dfulu commented Aug 15, 2024

When loading multiple zarr files with xarray, I have noticed that it often changes the chunk sizes, even though the zarrs had identical chunking when saved to disk. Often it will double the chunk size. This is wasteful, since it means we load twice the data off disk just to get access to a small piece of it, which slows down our sampling significantly.

We should investigate this further and see if there is a better way to load multiple zarr files with xarray.

For example, in the snippet below xarray makes the chunks 27 times larger!

Note that I haven't printed the time dimension here. When we open the two files individually, the time chunk size is 12; when we open them together, it becomes 36.

import xarray as xr

# Load two zarr files independently
path1 = "/mnt/disks/nwp_rechunk/sat/2020_nonhrv.zarr"
path2 = "/mnt/disks/nwp_rechunk/sat/2021_nonhrv.zarr"

ds1 = xr.open_zarr(path1)
ds2 = xr.open_zarr(path2)

# Check all coords except time have same values
assert (ds1.variable==ds2.variable).all()
assert (ds1.x_geostationary==ds2.x_geostationary).all()
assert (ds1.y_geostationary==ds2.y_geostationary).all()

# Check the chunk sizes are the same and print them
for dim in ["variable", "x_geostationary", "y_geostationary"]:
    assert ds1.chunks[dim] == ds2.chunks[dim]
    print(f"{dim}: {ds1.chunks[dim]}")
variable: (11,)
x_geostationary: (100, 100, 100, 100, 100, 100, 14)
y_geostationary: (100, 100, 100, 72)
# Open the two files with our default settings
ds = xr.open_mfdataset(
    [path1, path2],
    engine="zarr",
    concat_dim="time",
    combine="nested",
    chunks="auto",
    join="override",
)

# Check the chunk sizes are the same and print them
for dim in ["variable", "x_geostationary", "y_geostationary"]:
    print(f"{dim}: {ds.chunks[dim]}")
variable: (11,)
x_geostationary: (300, 300, 14)
y_geostationary: (300, 72)
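The multiplication above is consistent with dask's "auto" chunk normalisation, which merges small chunks toward its default target chunk size (around 128 MiB). A minimal dask-only sketch of that mechanism (the array shape here is made up for illustration, not our real data):

```python
import dask.array as da

# Small on-disk-style chunks, similar in shape to the zarr chunking above
x = da.zeros((300, 372), chunks=(100, 100))
print(x.chunks)  # ((100, 100, 100), (100, 100, 100, 72))

# Re-normalising with "auto" merges the small chunks toward dask's
# default target chunk size; an array this small (under 1 MB) ends
# up as a single chunk
y = x.rechunk("auto")
print(y.npartitions)  # 1
```

So any path through xarray that hands the on-disk chunks to dask's auto-chunking can end up multiplying them like this.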
dfulu added the bug label Aug 15, 2024
dfulu commented Aug 15, 2024

This may have been the fault of some of the coordinates not being the same between the different zarr files. I am now unable to recreate the issue, so closing until it pops up again.

dfulu closed this as not planned Aug 15, 2024
dfulu reopened this Aug 15, 2024
dfulu commented Aug 15, 2024

Absolute whirlwind. I have recreated the issue and added an example to the description above.

Sukh-P commented Aug 28, 2024

Great catch! And thanks for the example. I looked at the xarray docs again, which have some detail on what the different values of chunks do; I guess the chunk size with "auto" is at the whim of dask's auto-chunking and whatever it deems ideal.

Not sure if this is helpful, but I recreated your example with a smaller amount of fake data, and setting chunks=None in open_mfdataset seemed to preserve the original chunk sizes. It would be good to double-check that, but in that case a rule of thumb could be: if you have already rechunked your data to optimise for your use case, avoid chunks="auto" and go with None instead; if you haven't rechunked for some reason, then "auto" may still be a sensible choice.

Or if you were thinking of a different way altogether of opening/loading multiple zarr files, I would be interested to see what that could look like!
