
Confusing error when use_cftime=True and chunks='auto' in xr.open_dataset() #9834

Open · ks905383 opened this issue Nov 27, 2024 · 1 comment

ks905383 (Contributor) commented Nov 27, 2024

What is your issue?

Opening a dataset with use_cftime=True turns the time coordinate's dtype from datetime64 into object. This means that chunks='auto' will fail in dask, since dask can't estimate the size of variables with object dtype.

However, the error is a bit confusing: it comes from the underlying dask call and doesn't tell the user what caused it.

import xarray as xr

# fn: path to a netCDF file with a CF-encoded time coordinate
# Generally succeeds
xr.open_dataset(fn, chunks='auto')

# Definitely fails
xr.open_dataset(fn, chunks='auto', use_cftime=True)

The error is:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[46], line 1
----> 1 xr.open_dataset(fn,use_cftime=True,chunks='auto')

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:617, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    610 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
    611 backend_ds = backend.open_dataset(
    612     filename_or_obj,
    613     drop_variables=drop_variables,
    614     **decoders,
    615     **kwargs,
    616 )
--> 617 ds = _dataset_from_backend_dataset(
    618     backend_ds,
    619     filename_or_obj,
    620     engine,
    621     chunks,
    622     cache,
    623     overwrite_encoded_chunks,
    624     inline_array,
    625     chunked_array_type,
    626     from_array_kwargs,
    627     drop_variables=drop_variables,
    628     **decoders,
    629     **kwargs,
    630 )
    631 return ds

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:393, in _dataset_from_backend_dataset(backend_ds, filename_or_obj, engine, chunks, cache, overwrite_encoded_chunks, inline_array, chunked_array_type, from_array_kwargs, **extra_tokens)
    391     ds = backend_ds
    392 else:
--> 393     ds = _chunk_ds(
    394         backend_ds,
    395         filename_or_obj,
    396         engine,
    397         chunks,
    398         overwrite_encoded_chunks,
    399         inline_array,
    400         chunked_array_type,
    401         from_array_kwargs,
    402         **extra_tokens,
    403     )
    405 ds.set_close(backend_ds._close)
    407 # Ensure source filename always stored in dataset object

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/backends/api.py:357, in _chunk_ds(backend_ds, filename_or_obj, engine, chunks, overwrite_encoded_chunks, inline_array, chunked_array_type, from_array_kwargs, **extra_tokens)
    355 variables = {}
    356 for name, var in backend_ds.variables.items():
--> 357     var_chunks = _get_chunk(var, chunks, chunkmanager)
    358     variables[name] = _maybe_chunk(
    359         name,
    360         var,
   (...)
    367         from_array_kwargs=from_array_kwargs.copy(),
    368     )
    369 return backend_ds._replace(variables)

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/core/dataset.py:255, in _get_chunk(var, chunks, chunkmanager)
    249     chunks = dict.fromkeys(dims, chunks)
    250 chunk_shape = tuple(
    251     chunks.get(dim, None) or preferred_chunk_sizes
    252     for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
    253 )
--> 255 chunk_shape = chunkmanager.normalize_chunks(
    256     chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
    257 )
    259 # Warn where requested chunks break preferred chunks, provided that the variable
    260 # contains data.
    261 if var.size:

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py:58, in DaskManager.normalize_chunks(self, chunks, shape, limit, dtype, previous_chunks)
     55 """Called by open_dataset"""
     56 from dask.array.core import normalize_chunks
---> 58 return normalize_chunks(
     59     chunks,
     60     shape=shape,
     61     limit=limit,
     62     dtype=dtype,
     63     previous_chunks=previous_chunks,
     64 )

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/dask/array/core.py:3132, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
   3129 chunks = tuple("auto" if isinstance(c, str) and c != "auto" else c for c in chunks)
   3131 if any(c == "auto" for c in chunks):
-> 3132     chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   3134 if shape is not None:
   3135     chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))

File ~/opt/anaconda3/envs/hle_iv/lib/python3.12/site-packages/dask/array/core.py:3237, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   3234     raise TypeError("dtype must be known for auto-chunking")
   3236 if dtype.hasobject:
-> 3237     raise NotImplementedError(
   3238         "Can not use auto rechunking with object dtype. "
   3239         "We are unable to estimate the size in bytes of object data"
   3240     )
   3242 for x in tuple(chunks) + tuple(shape):
   3243     if (
   3244         isinstance(x, Number)
   3245         and np.isnan(x)
   3246         or isinstance(x, tuple)
   3247         and np.isnan(x).any()
   3248     ):

NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data

Suggestion for now: raise an exception when chunks='auto' and use_cftime=True are passed together. I think this should be implementable in backends.open_dataset() (rather than in any specific engine's open_dataset), since it's likely common to any opening procedure regardless of backend?
Something like

if chunks == 'auto' and use_cftime:
    raise NotImplementedError(
        "`use_cftime=True` changes the dtype of time variables to object, "
        "but dask cannot yet auto-chunk variables of object dtype. "
        "Specifying chunk sizes manually (instead of `chunks='auto'`) "
        "avoids this error."
    )
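
In the meantime, a possible workaround (a sketch, assuming the time dimension is named 'time'; fn as in the example above) is to specify chunk sizes explicitly, or to auto-chunk while times are still numeric and decode to cftime afterwards:

import xarray as xr

# Option 1: explicit chunk sizes sidestep dask's auto-chunking entirely
ds = xr.open_dataset(fn, use_cftime=True, chunks={'time': 1000})

# Option 2: auto-chunk with the raw numeric time values, then decode
# the CF conventions (lazily) to cftime objects afterwards
ds = xr.decode_cf(
    xr.open_dataset(fn, chunks='auto', decode_times=False),
    use_cftime=True,
)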

Suggestion for later: if it's possible to estimate the size of an array when the time coordinate holds datetime64 values, it should be possible to estimate it with cftime objects as well, since whether the coordinate is stored as one or the other is unlikely to change how the other variables should be chunked. Could conventions.decode_cf_variable() also return the original datetime values, to be used for chunking in place of the converted cftime objects? Or could chunking simply apply to a 1D object coordinate the same chunks that its dimension gets in the non-object-dtype arrays in the same dataset? (This could theoretically be unstable if the object coordinate for some reason takes up much more space than its numeric equivalent.)
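
For illustration, a minimal sketch of the size-estimation idea (hypothetical, not existing xarray code; hooked in where DaskManager.normalize_chunks fails in the traceback above): substitute a fixed-width stand-in dtype for object arrays so dask can estimate bytes per element. The 64-byte per-object cost is a guess, not a measured value.

import numpy as np

def normalize_chunks(self, chunks, shape=None, limit=None, dtype=None, previous_chunks=None):
    # Sketch only: special-case object dtype (e.g. cftime arrays)
    from dask.array.core import normalize_chunks as dask_normalize_chunks

    if dtype is not None and dtype.hasobject:
        # dask's auto-chunking only needs dtype.itemsize, so a fixed-width
        # void dtype works as a size proxy for cftime objects
        dtype = np.dtype('V64')  # assumed ~64 bytes per cftime object
    return dask_normalize_chunks(
        chunks, shape=shape, limit=limit, dtype=dtype, previous_chunks=previous_chunks
    )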

(I'm working on putting together a PR for at least the exception; please let me know if there's anything I should keep in mind, especially where the exception would be most appropriate to put, or whether this is a bad idea, etc.)

ks905383 added the needs triage label Nov 27, 2024
dcherian (Contributor) commented
I agree with the need for a better message.

Longer term, we know exactly what the size is, but I don't know how to send that info to dask. Perhaps we can add some logic to the .rechunk method in daskmanager.py that handles cftime arrays specifically.

dcherian added the topic-dask label and removed the needs triage label Nov 27, 2024