diff --git a/doc/howdoi.rst b/doc/howdoi.rst index b6374cc5100..8cc4e9939f2 100644 --- a/doc/howdoi.rst +++ b/doc/howdoi.rst @@ -42,7 +42,7 @@ How do I ... * - extract the underlying array (e.g. NumPy or Dask arrays) - :py:attr:`DataArray.data` * - convert to and extract the underlying NumPy array - - :py:attr:`DataArray.values` + - :py:attr:`DataArray.to_numpy` * - convert to a pandas DataFrame - :py:attr:`Dataset.to_dataframe` * - sort values diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst index d403328aa2f..3b6313dbf2f 100644 --- a/doc/internals/duck-arrays-integration.rst +++ b/doc/internals/duck-arrays-integration.rst @@ -1,23 +1,55 @@ -.. _internals.duck_arrays: +.. _internals.duckarrays: Integrating with duck arrays ============================= .. warning:: - This is a experimental feature. + This is an experimental feature. Please report any bugs or other difficulties on `xarray's issue tracker `_. -Xarray can wrap custom :term:`duck array` objects as long as they define numpy's -``shape``, ``dtype`` and ``ndim`` properties and the ``__array__``, -``__array_ufunc__`` and ``__array_function__`` methods. +Xarray can wrap custom numpy-like arrays (":term:`duck array`\s") - see the :ref:`user guide documentation `. +This page is intended for developers who are interested in wrapping a new custom array type with xarray. + +Duck array requirements +~~~~~~~~~~~~~~~~~~~~~~~ + +Xarray does not explicitly check that required methods are defined by the underlying duck array object before +attempting to wrap the given array. However, a wrapped array type should at a minimum define these attributes: + +* ``shape`` property, +* ``dtype`` property, +* ``ndim`` property, +* ``__array__`` method, +* ``__array_ufunc__`` method, +* ``__array_function__`` method. 
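As a rough illustration of the attribute list above, the following sketch defines a wrapper class that exposes all six members by delegating to a plain numpy array. The class name ``MinimalDuckArray`` and its small whitelist of supported functions are hypothetical and exist only for this example; they are not part of xarray or numpy, and a real duck array library would cover far more of the numpy API.

```python
import numpy as np


class MinimalDuckArray:
    """Hypothetical wrapper exposing the members listed above (illustration only)."""

    # numpy functions this sketch chooses to support via __array_function__
    _HANDLED_FUNCTIONS = (np.sum, np.mean)

    def __init__(self, array):
        self._array = np.asarray(array)

    @property
    def shape(self):
        return self._array.shape

    @property
    def dtype(self):
        return self._array.dtype

    @property
    def ndim(self):
        return self._array.ndim

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(duck) to coerce to a plain numpy array.
        if dtype is None:
            return self._array
        return self._array.astype(dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.multiply(duck, 2) are handled here.
        if method != "__call__" or "out" in kwargs:
            return NotImplemented
        unwrapped = [
            x._array if isinstance(x, MinimalDuckArray) else x for x in inputs
        ]
        result = ufunc(*unwrapped, **kwargs)
        if isinstance(result, tuple):  # ufuncs with several outputs
            return tuple(MinimalDuckArray(r) for r in result)
        return MinimalDuckArray(result)

    def __array_function__(self, func, types, args, kwargs):
        # Decline anything outside the whitelist so numpy raises TypeError.
        if not any(func is handled for handled in self._HANDLED_FUNCTIONS):
            return NotImplemented
        unwrapped = [
            a._array if isinstance(a, MinimalDuckArray) else a for a in args
        ]
        result = func(*unwrapped, **kwargs)
        if isinstance(result, np.ndarray):
            return MinimalDuckArray(result)
        return result


duck = MinimalDuckArray([[1, 2], [3, 4]])
doubled = np.multiply(duck, 2)  # dispatched through __array_ufunc__
row_sums = np.sum(duck, axis=1)  # dispatched through __array_function__
```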
+ +These need to be defined consistently with :py:class:`numpy.ndarray`; for example, the array ``shape`` +property needs to obey `numpy's broadcasting rules `_ +(see also the `Python Array API standard's explanation `_ +of these same rules). + +Python Array API standard support +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As an integration library xarray benefits greatly from the standardization of duck-array libraries' APIs, and so is a +big supporter of the `Python Array API Standard `_. + +We aim to support any array libraries that follow the Array API standard out-of-the-box. However, xarray does occasionally +call some numpy functions which are not (yet) part of the standard (e.g. :py:meth:`xarray.DataArray.pad` calls :py:func:`numpy.pad`). +See `xarray issue #7848 `_ for a list of such functions. We can still support dispatching on these functions through +the array protocols above; it just means that if you exclusively implement the methods in the Python Array API standard +then some features in xarray will not work. + +Custom inline reprs +~~~~~~~~~~~~~~~~~~~ In certain situations (e.g. when printing the collapsed preview of variables of a ``Dataset``), xarray will display the repr of a :term:`duck array` in a single line, truncating it to a certain number of characters. If that would drop too much information, the :term:`duck array` may define a ``_repr_inline_`` method that takes ``max_width`` (number of characters) as an -argument: +argument .. code:: python diff --git a/doc/internals/extending-xarray.rst b/doc/internals/extending-xarray.rst index 56aeb8fa462..a180b85044f 100644 --- a/doc/internals/extending-xarray.rst +++ b/doc/internals/extending-xarray.rst @@ -1,4 +1,6 @@ +.. 
_internals.accessors: + Extending xarray using accessors ================================ diff --git a/doc/internals/index.rst b/doc/internals/index.rst index e4ca9779dd7..132f6c40ede 100644 --- a/doc/internals/index.rst +++ b/doc/internals/index.rst @@ -8,6 +8,12 @@ stack, NumPy and pandas. It is written in pure Python (no C or Cython extensions), which makes it easy to develop and extend. Instead, we push compiled code to :ref:`optional dependencies`. +The pages in this section are intended for: + +* Contributors to xarray who wish to better understand some of the internals, +* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users, +* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type. + .. toctree:: :maxdepth: 2 diff --git a/doc/user-guide/data-structures.rst b/doc/user-guide/data-structures.rst index e0fd4bd0d25..64e7b3625ac 100644 --- a/doc/user-guide/data-structures.rst +++ b/doc/user-guide/data-structures.rst @@ -19,7 +19,8 @@ DataArray :py:class:`xarray.DataArray` is xarray's implementation of a labeled, multi-dimensional array. 
It has several key properties: -- ``values``: a :py:class:`numpy.ndarray` holding the array's values +- ``values``: a :py:class:`numpy.ndarray` or + :ref:`numpy-like array ` holding the array's values - ``dims``: dimension names for each axis (e.g., ``('x', 'y', 'z')``) - ``coords``: a dict-like container of arrays (*coordinates*) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or @@ -46,7 +47,8 @@ Creating a DataArray The :py:class:`~xarray.DataArray` constructor takes: - ``data``: a multi-dimensional array of values (e.g., a numpy ndarray, - :py:class:`~pandas.Series`, :py:class:`~pandas.DataFrame` or ``pandas.Panel``) + a :ref:`numpy-like array `, :py:class:`~pandas.Series`, + :py:class:`~pandas.DataFrame` or ``pandas.Panel``) - ``coords``: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array_like object. diff --git a/doc/user-guide/duckarrays.rst b/doc/user-guide/duckarrays.rst index 78c7d1e572a..dc1d2d1cb8a 100644 --- a/doc/user-guide/duckarrays.rst +++ b/doc/user-guide/duckarrays.rst @@ -1,30 +1,183 @@ .. currentmodule:: xarray +.. _userguide.duckarrays: + Working with numpy-like arrays ============================== +NumPy-like arrays (often known as :term:`duck array`\s) are drop-in replacements for the :py:class:`numpy.ndarray` +class but with different features, such as propagating physical units or a different layout in memory. +Xarray can often wrap these array types, allowing you to use labelled dimensions and indexes whilst benefiting from the +additional features of these array libraries. 
+ +Some numpy-like array types that xarray already has some support for: + +* `Cupy `_ - GPU support (see `cupy-xarray `_), +* `Sparse `_ - for performant arrays with many zero elements, +* `Pint `_ - for tracking the physical units of your data (see `pint-xarray `_), +* `Dask `_ - parallel computing on larger-than-memory arrays (see :ref:`using dask with xarray `), +* `Cubed `_ - another parallel computing framework that emphasises reliability (see `cubed-xarray `_). + .. warning:: - This feature should be considered experimental. Please report any bug you may find on - xarray’s github repository. + This feature should be considered somewhat experimental. Please report any bugs you find on + `xarray’s issue tracker `_. + +.. note:: + + For information on wrapping dask arrays see :ref:`dask`. Whilst xarray wraps dask arrays in a similar way to that + described on this page, chunked array types like :py:class:`dask.array.Array` implement additional methods that require + slightly different user code (e.g. calling ``.chunk`` or ``.compute``). + +Why "duck"? +----------- + +Why is it also called a "duck" array? This comes from a common statement of object-oriented programming - +"If it walks like a duck, and quacks like a duck, treat it like a duck". In other words, a library like xarray that +is capable of using multiple different types of arrays does not have to explicitly check that each one it encounters is +permitted (e.g. ``if dask``, ``if numpy``, ``if sparse`` etc.). Instead xarray can take the more permissive approach of simply +treating the wrapped array as valid, attempting to call the relevant methods (e.g. ``.mean()``) and only raising an +error if a problem occurs (e.g. the method is not found on the wrapped class). This is much more flexible, and allows +objects and classes from different libraries to work together more easily. + +What is a numpy-like array? 
+--------------------------- + +A "numpy-like array" (also known as a "duck array") is a class that contains array-like data, and implements key +numpy-like functionality such as indexing, broadcasting, and computation methods. + +For example, the `sparse `_ library provides a sparse array type which is useful for representing nD array objects like sparse matrices +in a memory-efficient manner. We can create a sparse array object (of the :py:class:`sparse.COO` type) from a numpy array like this: + +.. ipython:: python + + from sparse import COO + + x = np.eye(4, dtype=np.uint8) # create diagonal identity matrix + s = COO.from_numpy(x) + s -NumPy-like arrays (:term:`duck array`) extend the :py:class:`numpy.ndarray` with -additional features, like propagating physical units or a different layout in memory. +This sparse object does not attempt to explicitly store every element in the array, only the non-zero elements. +This approach is much more efficient for large arrays with only a few non-zero elements (such as tri-diagonal matrices). +Sparse array objects can be converted back to a "dense" numpy array by calling :py:meth:`sparse.COO.todense`. -:py:class:`DataArray` and :py:class:`Dataset` objects can wrap these duck arrays, as -long as they satisfy certain conditions (see :ref:`internals.duck_arrays`). +Just like :py:class:`numpy.ndarray` objects, :py:class:`sparse.COO` arrays support indexing + +.. ipython:: python + + s[1, 1] # diagonal elements should be ones + s[2, 3] # off-diagonal elements should be zero + +broadcasting, + +.. ipython:: python + + x2 = np.zeros( + (4, 1), dtype=np.uint8 + ) # create second sparse array of different shape + s2 = COO.from_numpy(x2) + (s * s2) # multiplication requires broadcasting + +and various computation methods + +.. ipython:: python + + s.sum(axis=1) + +This numpy-like array also supports calling so-called `numpy ufuncs `_ +("universal functions") on it directly: + +.. 
ipython:: python + + np.sum(s, axis=1) + + +Notice that in each case the API for calling the operation on the sparse array is identical to that of calling it on the +equivalent numpy array - this is the sense in which the sparse array is "numpy-like". .. note:: - For ``dask`` support see :ref:`dask`. + For discussion on exactly which methods a class needs to implement to be considered "numpy-like", see :ref:`internals.duckarrays`. + +Wrapping numpy-like arrays in xarray +------------------------------------ + +:py:class:`DataArray`, :py:class:`Dataset`, and :py:class:`Variable` objects can wrap these numpy-like arrays. +Constructing xarray objects which wrap numpy-like arrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Missing features ----------------- -Most of the API does support :term:`duck array` objects, but there are a few areas where -the code will still cast to ``numpy`` arrays: +The primary way to create an xarray object which wraps a numpy-like array is to pass that numpy-like array instance directly +to the constructor of the xarray class. The :ref:`page on xarray data structures ` shows how :py:class:`DataArray` and :py:class:`Dataset` +both accept data in various forms through their ``data`` argument, but in fact this data can also be any wrappable numpy-like array. -- dimension coordinates, and thus all indexing operations: +For example, we can wrap the sparse array we created earlier inside a new DataArray object: + +.. ipython:: python + + s_da = xr.DataArray(s, dims=["i", "j"]) + s_da + +We can see what's inside - the printable representation of our xarray object (the repr) automatically uses the printable +representation of the underlying wrapped array. + +Of course our sparse array object is still there underneath - it's stored under the ``.data`` attribute of the dataarray: + +.. ipython:: python + + s_da.data + +Array methods +~~~~~~~~~~~~~ + +We saw above that numpy-like arrays provide numpy methods. 
Xarray automatically uses these when you call the corresponding xarray method: + +.. ipython:: python + + s_da.sum(dim="j") + +Converting wrapped types +~~~~~~~~~~~~~~~~~~~~~~~~ + +If you want to change the type inside your xarray object you can use :py:meth:`DataArray.as_numpy`: + +.. ipython:: python + + s_da.as_numpy() + +This returns a new :py:class:`DataArray` object, but now wrapping a normal numpy array. + +If instead you want to convert to numpy and return that numpy array you can use either :py:meth:`DataArray.to_numpy` or +:py:attr:`DataArray.values`, where the former is strongly preferred. The difference is in the way they coerce to numpy - :py:attr:`~DataArray.values` +always uses :py:func:`numpy.asarray` which will fail for some array types (e.g. ``cupy``), whereas :py:meth:`~DataArray.to_numpy` +uses the correct method depending on the array type. + +.. ipython:: python + + s_da.to_numpy() + +.. ipython:: python + :okexcept: + + s_da.values + +This illustrates the difference between :py:attr:`~DataArray.data` and :py:attr:`~DataArray.values`, +which is sometimes a point of confusion for new xarray users. +Explicitly: :py:attr:`DataArray.data` returns the underlying numpy-like array, regardless of type, whereas +:py:attr:`DataArray.values` converts the underlying array to a numpy array before returning it. +(This is another reason to use :py:meth:`~DataArray.to_numpy` over :py:attr:`~DataArray.values` - the intention is clearer.) + +Conversion to numpy as a fallback +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a wrapped array does not implement the corresponding array method then xarray will often attempt to convert the +underlying array to a numpy array so that the operation can be performed. You may want to watch out for this behavior, +and report any instances in which it causes problems. 
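One way to watch for such silent conversions is to compare the type of the wrapped array before and after an operation. The helper below is a minimal sketch of that idea; ``wrapped_type_preserved`` is a hypothetical name used only here and is not an xarray function. It is demonstrated on a numpy-backed ``DataArray``, where the type is trivially preserved; the same check could be applied to an object wrapping a sparse, pint or cupy array.

```python
import numpy as np
import xarray as xr


def wrapped_type_preserved(obj, operation):
    """Return True if ``operation`` keeps the type of the wrapped array.

    Hypothetical helper for illustration; not part of xarray.
    """
    before = type(obj.data)
    after = type(operation(obj).data)
    return after is before


da = xr.DataArray(np.arange(6).reshape(2, 3), dims=["i", "j"])

# For a duck-array-backed object, a False result would indicate that the
# operation fell back to a different array type (for example numpy).
preserved = wrapped_type_preserved(da, lambda x: x.sum(dim="j"))
```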
+ +Most of xarray's API does support using :term:`duck array` objects, but there are a few areas where +the code will still convert to ``numpy`` arrays: + +- Dimension coordinates, and thus all indexing operations: * :py:meth:`Dataset.sel` and :py:meth:`DataArray.sel` * :py:meth:`Dataset.loc` and :py:meth:`DataArray.loc` @@ -33,7 +186,7 @@ the code will still cast to ``numpy`` arrays: :py:meth:`DataArray.reindex` and :py:meth:`DataArray.reindex_like`: duck arrays in data variables and non-dimension coordinates won't be casted -- functions and methods that depend on external libraries or features of ``numpy`` not +- Functions and methods that depend on external libraries or features of ``numpy`` not covered by ``__array_function__`` / ``__array_ufunc__``: * :py:meth:`Dataset.ffill` and :py:meth:`DataArray.ffill` (uses ``bottleneck``) @@ -49,17 +202,25 @@ the code will still cast to ``numpy`` arrays: :py:class:`numpy.vectorize`) * :py:func:`apply_ufunc` with ``vectorize=True`` (uses :py:class:`numpy.vectorize`) -- incompatibilities between different :term:`duck array` libraries: +- Incompatibilities between different :term:`duck array` libraries: * :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`: this fails if the data was not already chunked and the :term:`duck array` (e.g. a ``pint`` quantity) should - wrap the new ``dask`` array; changing the chunk sizes works. - + wrap the new ``dask`` array; changing the chunk sizes works however. Extensions using duck arrays ---------------------------- -Here's a list of libraries extending ``xarray`` to make working with wrapped duck arrays -easier: + +Whilst the features above allow many numpy-like array libraries to be used pretty seamlessly with xarray, it often also +makes sense to use an interfacing package to make certain tasks easier. 
+ +For example, the `pint-xarray package `_ offers a custom ``.pint`` accessor (see :ref:`internals.accessors`) which provides +convenient access to information stored within the wrapped array (e.g. ``.units`` and ``.magnitude``), and makes +creating wrapped pint arrays (and especially xarray-wrapping-pint-wrapping-dask arrays) simpler for the user. + +We maintain a list of libraries extending ``xarray`` to make working with particular wrapped duck arrays +easier. If you know of more that aren't on this list please raise an issue to add them! - `pint-xarray `_ - `cupy-xarray `_ +- `cubed-xarray `_ diff --git a/doc/whats-new.rst b/doc/whats-new.rst index e88337ba946..ce2c0a698ac 100644 --- a/doc/whats-new.rst +++ b/doc/whats-new.rst @@ -38,6 +38,8 @@ Bug fixes Documentation ~~~~~~~~~~~~~ +- Expanded the page on wrapping numpy-like "duck" arrays. + (:pull:`7911`) By `Tom Nicholas `_. Internal Changes ~~~~~~~~~~~~~~~~ @@ -98,7 +100,6 @@ Bug fixes Documentation ~~~~~~~~~~~~~ - Internal Changes ~~~~~~~~~~~~~~~~