add open_datatree to xarray #8697

Merged
merged 31 commits into from
Feb 14, 2024
Changes from 22 commits
Commits
0266b63
Merge remote-tracking branch 'prepared-datatree/main' into mhs/import…
flamingbear Jan 29, 2024
3899b06
DAS-2060: Skips datatree_ CI
flamingbear Jan 29, 2024
d5b80f9
DAS-2070: Migrate open_datatree into xarray.
flamingbear Jan 29, 2024
0c62960
DAS-2060: replace relative import of datatree to library
flamingbear Jan 30, 2024
a523d50
DAS-2060: revert the exporting of NodePath from datatree
flamingbear Jan 30, 2024
1e5e433
Merge branch 'main' into mhs/DAS-2060/open_datatree
flamingbear Feb 1, 2024
e687e4a
Don't expose open_datatree at top level
flamingbear Feb 1, 2024
4e05d5c
Point datatree imports to xarray.datatree_.datatree
flamingbear Feb 2, 2024
77405d9
Updates function signatures for mypy.
flamingbear Feb 2, 2024
81b425f
Move io tests, remove undefined reference to documentation.
flamingbear Feb 2, 2024
3c5bcda
Pass bare-minimum tests.
flamingbear Feb 5, 2024
9f89256
Update pyproject.toml to exclude imported datatree_ modules.
flamingbear Feb 5, 2024
a4bad61
Adding back type ignores
flamingbear Feb 5, 2024
e4f0374
Refactor open_datatree back together.
flamingbear Feb 6, 2024
3b1224c
Removes TODO comment
flamingbear Feb 6, 2024
e447900
Merge branch 'main' into mhs/open_datatree
flamingbear Feb 6, 2024
352222d
Merge branch 'main' into mhs/open_datatree
flamingbear Feb 7, 2024
9745864
Merge branch 'main' into mhs/open_datatree
flamingbear Feb 8, 2024
20d8691
typo fix
flamingbear Feb 8, 2024
221bc8c
typo 2
flamingbear Feb 8, 2024
b74764e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2024
6498acc
Merge branch 'main' into mhs/open_datatree
flamingbear Feb 12, 2024
4280d30
Call raised exception
flamingbear Feb 8, 2024
8c54465
Add unpacking notation to kwargs
flamingbear Feb 12, 2024
afba7ba
Use final location for DataTree doc strings
flamingbear Feb 12, 2024
aab1744
fix comment from open_dataset to open_datatree
flamingbear Feb 12, 2024
5b48973
Revert "fix comment from open_dataset to open_datatree"
flamingbear Feb 12, 2024
c6bb18a
Change sphynx link from meth to func
flamingbear Feb 13, 2024
4d306c0
Merge branch 'main' into mhs/open_datatree
flamingbear Feb 13, 2024
d386ed3
Update whats-new.rst
flamingbear Feb 14, 2024
e291587
Fix what-new.rst formatting.
flamingbear Feb 14, 2024
2 changes: 1 addition & 1 deletion doc/roadmap.rst
Expand Up @@ -156,7 +156,7 @@ types would also be highly useful for xarray users.
By pursuing these improvements in NumPy we hope to extend the benefits
to the full scientific Python community, and avoid tight coupling
between xarray and specific third-party libraries (e.g., for
-implementing untis). This will allow xarray to maintain its domain
+implementing units). This will allow xarray to maintain its domain
agnostic strengths.

We expect that we may eventually add some minimal interfaces in xarray
5 changes: 5 additions & 0 deletions pyproject.toml
Expand Up @@ -96,6 +96,11 @@ warn_redundant_casts = true
warn_unused_configs = true
warn_unused_ignores = true

# Ignore mypy errors for modules imported from datatree_.
[[tool.mypy.overrides]]
module = "xarray.datatree_.*"
ignore_errors = true

# Much of the numerical computing stack doesn't have type annotations yet.
[[tool.mypy.overrides]]
ignore_missing_imports = true
29 changes: 29 additions & 0 deletions xarray/backends/api.py
Expand Up @@ -69,6 +69,7 @@
T_NetcdfTypes = Literal[
"NETCDF4", "NETCDF4_CLASSIC", "NETCDF3_64BIT", "NETCDF3_CLASSIC"
]
from xarray.datatree_.datatree import DataTree

DATAARRAY_NAME = "__xarray_dataarray_name__"
DATAARRAY_VARIABLE = "__xarray_dataarray_variable__"
Expand Down Expand Up @@ -788,6 +789,34 @@ def open_dataarray(
return data_array


def open_datatree(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    **kwargs,
) -> DataTree:
    """
    Open and decode a DataTree from a file or file-like object, creating one tree node for each group in the file.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
    kwargs :
Collaborator

Suggested change
-kwargs :
+**kwargs

        Additional keyword arguments passed to :py:meth:`~xarray.open_dataset` for each group.
Member Author

This method doesn't exist yet at that location.

Collaborator

@keewis keewis Feb 11, 2024

Could you explain why? This should exist if you use :py:func: instead of :py:meth: (you can use sphobjinv to find the right role).

Member Author

I think it doesn't exist at the top level because we wanted to migrate the code over before allowing a direct import:

from xarray import open_datatree

Until all of the code was merged, we said it would be imported from xarray.core.datatree. From the first comment of #8572: "EDIT: We decided it should return an xarray.DataTree object, or even xarray.core.datatree.DataTree object. So we can start by just copying the basic version in datatree/io.py right now which just calls open_dataset many times."

So I had thought xarray.core.datatree.DataTree made more sense for now.

Member Author

But the second part of your comment makes me think I might not understand what you were asking.

Collaborator

@keewis keewis Feb 12, 2024

I guess what I'm asking is: open_dataset definitely exists (and it should be possible to link to), so I suppose you meant to write xarray.open_datatree?

Suggested change
-Additional keyword arguments passed to :py:meth:`~xarray.open_dataset` for each group.
+Additional keyword arguments passed to :py:func:`~xarray.open_datatree` for each group.

Also, once that's fixed, does this cause the docs build to fail? If not I believe it would be fine to leave as-is.

Member Author

@flamingbear flamingbear Feb 12, 2024

On reflection, I think this should still be open_dataset. That is what is in the original docs, and this docstring describes open_datatree, which calls open_dataset with these **kwargs each time. So I think this commit should be reverted. As for :py:meth: vs :py:func:, I don't know; I will look at the difference unless you tell me before I figure it out.

    Returns
    -------
    xarray.core.datatree.DataTree
    """
    if engine is None:
        engine = plugins.guess_engine(filename_or_obj)

    backend = plugins.get_backend(engine)

    return backend.open_datatree(filename_or_obj, **kwargs)


def open_mfdataset(
paths: str | NestedSequence[str | os.PathLike],
chunks: T_Chunks | None = None,
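
The new `open_datatree` in `api.py` is a thin dispatcher: guess the engine when none is given, look up the backend, and delegate. The sketch below imitates that flow with the standard library only. `HypotheticalBackend`, `BACKENDS`, `SUFFIX_TO_ENGINE`, and `open_tree` are invented for illustration and are not xarray's plugin API; in particular, the suffix lookup is an assumption standing in for `plugins.guess_engine`.

```python
from pathlib import PurePosixPath


class HypotheticalBackend:
    """Stand-in for a backend entrypoint; not xarray's real class."""

    def __init__(self, name):
        self.name = name

    def open_datatree(self, filename_or_obj, **kwargs):
        # A real backend would read the file; this one only reports the call.
        return {"engine": self.name, "source": str(filename_or_obj), "kwargs": kwargs}


# Hypothetical registry mapping engine names to backend instances.
BACKENDS = {
    "netcdf4": HypotheticalBackend("netcdf4"),
    "zarr": HypotheticalBackend("zarr"),
}

# Hypothetical suffix table used only by this sketch.
SUFFIX_TO_ENGINE = {".nc": "netcdf4", ".zarr": "zarr"}


def guess_engine(filename_or_obj):
    """Pick an engine name from the file suffix (illustrative heuristic)."""
    suffix = PurePosixPath(str(filename_or_obj)).suffix
    try:
        return SUFFIX_TO_ENGINE[suffix]
    except KeyError:
        raise ValueError(f"cannot guess an engine for {filename_or_obj!r}") from None


def open_tree(filename_or_obj, engine=None, **kwargs):
    """Mirror the dispatch shape: guess the engine, find the backend, delegate."""
    if engine is None:
        engine = guess_engine(filename_or_obj)
    backend = BACKENDS[engine]
    return backend.open_datatree(filename_or_obj, **kwargs)
```

Passing `engine=` explicitly skips the guess, which matches the `if engine is None` branch in the diff.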
57 changes: 57 additions & 0 deletions xarray/backends/common.py
Expand Up @@ -19,8 +19,12 @@
if TYPE_CHECKING:
    from io import BufferedIOBase

    from h5netcdf.legacyapi import Dataset as ncDatasetLegacyH5
    from netCDF4 import Dataset as ncDataset

    from xarray.core.dataset import Dataset
    from xarray.core.types import NestedSequence
    from xarray.datatree_.datatree import DataTree

# Create a logger object, but don't add any handlers. Leave that to user code.
logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -127,6 +131,43 @@ def _decode_variable_name(name):
return name


def _open_datatree_netcdf(
    ncDataset: ncDataset | ncDatasetLegacyH5,
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    **kwargs,
) -> DataTree:
    from xarray.backends.api import open_dataset
    from xarray.datatree_.datatree import DataTree
    from xarray.datatree_.datatree.treenode import NodePath

    ds = open_dataset(filename_or_obj, **kwargs)
    tree_root = DataTree.from_dict({"/": ds})
    with ncDataset(filename_or_obj, mode="r") as ncds:
        for path in _iter_nc_groups(ncds):
            subgroup_ds = open_dataset(filename_or_obj, group=path, **kwargs)

            # TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
            node_name = NodePath(path).name
            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
            tree_root._set_item(
                path,
                new_node,
                allow_overwrite=False,
                new_nodes_along_path=True,
            )
    return tree_root


def _iter_nc_groups(root, parent="/"):
    from xarray.datatree_.datatree.treenode import NodePath

    parent = NodePath(parent)
    for path, group in root.groups.items():
        gpath = parent / path
        yield str(gpath)
        yield from _iter_nc_groups(group, parent=gpath)
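
To see what `_iter_nc_groups` yields, here is a minimal stand-alone sketch of the same depth-first traversal. It assumes `NodePath` joins segments like `pathlib.PurePosixPath`, and `FakeGroup` is a hypothetical stand-in for a netCDF group object that exposes a `groups` mapping.

```python
from pathlib import PurePosixPath


class FakeGroup:
    """Hypothetical stand-in exposing a ``groups`` mapping like a netCDF group."""

    def __init__(self, **groups):
        self.groups = groups


def iter_groups(root, parent="/"):
    """Yield the absolute path of every nested group, parents before children."""
    parent = PurePosixPath(parent)
    for name, group in root.groups.items():
        gpath = parent / name
        yield str(gpath)
        # Recurse so that grandchildren appear right after their parent.
        yield from iter_groups(group, parent=gpath)


root = FakeGroup(a=FakeGroup(b=FakeGroup()), c=FakeGroup())
paths = list(iter_groups(root))
print(paths)  # ['/a', '/a/b', '/c']
```

The root path `/` is never yielded, which is consistent with the diff opening the root dataset separately before looping over the groups.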


def find_root_and_group(ds):
"""Find the root and group name of a netCDF4/h5netcdf dataset."""
hierarchy = ()
Expand Down Expand Up @@ -458,6 +499,11 @@ class BackendEntrypoint:
- ``guess_can_open`` method: it shall return ``True`` if the backend is able to open
``filename_or_obj``, ``False`` otherwise. The implementation of this
method is not mandatory.
- ``open_datatree`` method: it shall implement reading from file, variables
decoding and it returns an instance of :py:class:`~datatree.DataTree`.
It shall take in input at least ``filename_or_obj`` argument. The
implementation of this method is not mandatory. For more details see
<reference to open_datatree documentation>.

Attributes
----------
Expand Down Expand Up @@ -508,6 +554,17 @@ def guess_can_open(

return False

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs: Any,
    ) -> DataTree:
        """
        Backend open_datatree method used by Xarray in :py:func:`~xarray.open_datatree`.
        """

        raise NotImplementedError


# mapping of engine name to (module name, BackendEntrypoint Class)
BACKEND_ENTRYPOINTS: dict[str, tuple[str | None, type[BackendEntrypoint]]] = {}
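
The `BackendEntrypoint` change makes `open_datatree` an optional method: the base class raises `NotImplementedError`, and a backend opts in by overriding it. The sketch below shows that contract in isolation. All class names and the `supports_datatree` helper are hypothetical and do not exist in xarray.

```python
class BaseEntrypoint:
    """Hypothetical base class with an optional ``open_datatree`` hook."""

    def open_datatree(self, filename_or_obj, **kwargs):
        # Default behaviour: the backend does not support hierarchical files.
        raise NotImplementedError


class FlatOnlyEntrypoint(BaseEntrypoint):
    """Backend that leaves the optional hook untouched."""


class GroupAwareEntrypoint(BaseEntrypoint):
    """Backend that opts in by overriding the hook."""

    def open_datatree(self, filename_or_obj, **kwargs):
        return {"/": f"root of {filename_or_obj}"}


def supports_datatree(backend):
    """Return True when the backend class overrides the base implementation."""
    return type(backend).open_datatree is not BaseEntrypoint.open_datatree
```

Comparing the method against the base implementation avoids calling it just to find out whether it is supported; whether xarray needs such a check is not something the diff states.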
11 changes: 11 additions & 0 deletions xarray/backends/h5netcdf_.py
Expand Up @@ -11,6 +11,7 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
)
from xarray.backends.file_manager import CachingFileManager, DummyFileManager
Expand Down Expand Up @@ -38,6 +39,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree


class H5NetCDFArrayWrapper(BaseNetCDF4Array):
Expand Down Expand Up @@ -423,5 +425,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        from h5netcdf.legacyapi import Dataset as ncDataset

        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)


BACKEND_ENTRYPOINTS["h5netcdf"] = ("h5netcdf", H5netcdfBackendEntrypoint)
11 changes: 11 additions & 0 deletions xarray/backends/netCDF4_.py
Expand Up @@ -16,6 +16,7 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
robust_getitem,
)
Expand Down Expand Up @@ -44,6 +45,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree

# This lookup table maps from dtype.byteorder to a readable endian
# string used by netCDF4.
Expand Down Expand Up @@ -667,5 +669,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        from netCDF4 import Dataset as ncDataset

        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)


BACKEND_ENTRYPOINTS["netcdf4"] = ("netCDF4", NetCDF4BackendEntrypoint)
44 changes: 44 additions & 0 deletions xarray/backends/zarr.py
Expand Up @@ -34,6 +34,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree


# need some special secret attributes to tell us the dimensions
Expand Down Expand Up @@ -1039,5 +1040,48 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        import zarr

        from xarray.backends.api import open_dataset
        from xarray.datatree_.datatree import DataTree
        from xarray.datatree_.datatree.treenode import NodePath

        zds = zarr.open_group(filename_or_obj, mode="r")
        ds = open_dataset(filename_or_obj, engine="zarr", **kwargs)
        tree_root = DataTree.from_dict({"/": ds})
        for path in _iter_zarr_groups(zds):
            try:
                subgroup_ds = open_dataset(
                    filename_or_obj, engine="zarr", group=path, **kwargs
                )
            except zarr.errors.PathNotFoundError:
                subgroup_ds = Dataset()

            # TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
            node_name = NodePath(path).name
            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
            tree_root._set_item(
                path,
                new_node,
                allow_overwrite=False,
                new_nodes_along_path=True,
            )
        return tree_root


def _iter_zarr_groups(root, parent="/"):
    from xarray.datatree_.datatree.treenode import NodePath

    parent = NodePath(parent)
    for path, group in root.groups():
        gpath = parent / path
        yield str(gpath)
        yield from _iter_zarr_groups(group, parent=gpath)


BACKEND_ENTRYPOINTS["zarr"] = ("zarr", ZarrBackendEntrypoint)
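
Both the netCDF helper and the Zarr backend attach each group with `_set_item(path, new_node, allow_overwrite=False, new_nodes_along_path=True)`. The following is a rough sketch of what those two flags could mean, written against a hypothetical `Node` class rather than `DataTree`. The behaviour shown (placeholder parents, refusing to replace an existing node) is an assumption inferred from the argument names, not taken from the datatree source.

```python
from pathlib import PurePosixPath


class Node:
    """Hypothetical minimal tree node; not the DataTree class from the PR."""

    def __init__(self, name=None, data=None):
        self.name = name
        self.data = data
        self.children = {}


def set_item(root, path, new_node, allow_overwrite=False, new_nodes_along_path=True):
    """Attach ``new_node`` under ``root`` at the absolute ``path``."""
    # PurePosixPath("/a/b").parts is ("/", "a", "b"); drop the root marker.
    names = [part for part in PurePosixPath(path).parts if part != "/"]
    if not names:
        raise ValueError("path must name at least one node below the root")

    current = root
    for name in names[:-1]:
        if name not in current.children:
            if not new_nodes_along_path:
                raise KeyError(f"missing intermediate node {name!r}")
            # Create an empty placeholder so deeper nodes have a parent.
            current.children[name] = Node(name=name)
        current = current.children[name]

    leaf = names[-1]
    if leaf in current.children and not allow_overwrite:
        raise KeyError(f"node {leaf!r} already exists")
    new_node.name = leaf
    current.children[leaf] = new_node


root = Node(name="/", data="root dataset")
set_item(root, "/a/b", Node(data="leaf dataset"))
```

Because the group iterators in the diff yield parents before children, the placeholder branch would rarely be needed there; it matters when paths arrive in arbitrary order.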
3 changes: 3 additions & 0 deletions xarray/core/options.py
Expand Up @@ -20,6 +20,7 @@
"display_expand_coords",
"display_expand_data_vars",
"display_expand_data",
"display_expand_groups",
"display_expand_indexes",
"display_default_indexes",
"enable_cftimeindex",
Expand All @@ -44,6 +45,7 @@ class T_Options(TypedDict):
display_expand_coords: Literal["default", True, False]
display_expand_data_vars: Literal["default", True, False]
display_expand_data: Literal["default", True, False]
display_expand_groups: Literal["default", True, False]
display_expand_indexes: Literal["default", True, False]
display_default_indexes: Literal["default", True, False]
enable_cftimeindex: bool
Expand All @@ -68,6 +70,7 @@ class T_Options(TypedDict):
"display_expand_coords": "default",
"display_expand_data_vars": "default",
"display_expand_data": "default",
"display_expand_groups": "default",
"display_expand_indexes": "default",
"display_default_indexes": False,
"enable_cftimeindex": True,
10 changes: 0 additions & 10 deletions xarray/datatree_/datatree/__init__.py
@@ -1,25 +1,15 @@
# import public API
from .datatree import DataTree
from .extensions import register_datatree_accessor
-from .io import open_datatree
from .mapping import TreeIsomorphismError, map_over_subtree
from .treenode import InvalidTreeError, NotFoundInTreeError

-try:
-    # NOTE: the `_version.py` file must not be present in the git repository
-    #   as it is generated by setuptools at install time
-    from ._version import __version__
-except ImportError:  # pragma: no cover
-    # Local copy or not installed with setuptools
-    __version__ = "999"

__all__ = (
    "DataTree",
-    "open_datatree",
    "TreeIsomorphismError",
    "InvalidTreeError",
    "NotFoundInTreeError",
    "map_over_subtree",
    "register_datatree_accessor",
-    "__version__",
)
3 changes: 2 additions & 1 deletion xarray/datatree_/datatree/datatree.py
Expand Up @@ -16,6 +16,7 @@
List,
Mapping,
MutableMapping,
NoReturn,
Optional,
Set,
Tuple,
Expand Down Expand Up @@ -160,7 +161,7 @@ def __setitem__(self, key, val) -> None:
"use `.copy()` first to get a mutable version of the input dataset."
)

-    def update(self, other) -> None:
+    def update(self, other) -> NoReturn:
raise AttributeError(
"Mutation of the DatasetView is not allowed, please use `.update` on the wrapping DataTree node, "
"or use `dt.to_dataset()` if you want a mutable dataset. If calling this from within `map_over_subtree`,"
Expand Down
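
The `datatree.py` change swaps `-> None` for `-> NoReturn` on a method that always raises. As a small illustration of that annotation, the sketch below uses a hypothetical `ReadOnlyView` class (not the `DatasetView` from the diff): `NoReturn` tells a type checker that control never comes back from the call, whereas `None` would claim the method returns normally.

```python
from typing import NoReturn


class ReadOnlyView:
    """Hypothetical immutable wrapper used only to illustrate ``NoReturn``."""

    def __init__(self, values):
        self._values = dict(values)

    def __getitem__(self, key):
        return self._values[key]

    def update(self, other) -> NoReturn:
        # Always raises, so the annotation is NoReturn rather than None.
        raise AttributeError("this view is read-only; copy it before mutating")
```

With this annotation a checker such as mypy can flag code placed after `view.update(...)` as unreachable, depending on its configuration.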
3 changes: 0 additions & 3 deletions xarray/datatree_/datatree/formatting_html.py
Expand Up @@ -10,9 +10,6 @@
datavar_section,
dim_section,
)
-from xarray.core.options import OPTIONS
-
-OPTIONS["display_expand_groups"] = "default"


def summarize_children(children: Mapping[str, Any]) -> str: