Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hypothesis strategies in xarray.testing.strategies #6908

Open
wants to merge 111 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Aug 11, 2022

Adds a whole suite of hypothesis strategies for generating xarray objects, inspired by and separated out from the new hypothesis strategies in #4972. They are placed into the namespace xarray.testing.strategies, and publicly mentioned in the API docs, but with a big warning message. There is also a new testing page in the user guide documenting how to use these strategies.

EDIT: A variables strategy and user-facing documentation were shipped in #8404

@TomNicholas
Copy link
Member Author

I also added my chunking strategy from HypothesisWorks/hypothesis#3433

@github-actions github-actions bot added topic-testing documentation and removed topic-hypothesis Strategies or tests using the hypothesis library labels Aug 13, 2022
@github-actions github-actions bot added CI Continuous Integration tools dependencies Pull requests that update a dependency file labels Aug 13, 2022
@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 19, 2022

@Zac-HD if I could request one more review please! The two remaining problems for me are:

  1. How should we alter the API of datasets to make it easier to construct Dataset objects containing duck-typed arrays (see Hypothesis strategies in xarray.testing.strategies #6908 (comment))
  2. Why does the example generation performance seem to have gotten worse? 😕 I added what I thought were small refactors (e.g. _sizes_from_dim_names) which may somehow be related, but I'm not sure.

@Zac-HD
Copy link
Contributor

Zac-HD commented Sep 20, 2022

@Zac-HD if I could request one more review please! The two remaining problems for me are:

Absolutely! Some quick comments this evening; I would also like to do a full review again before merge but that might be next week or weekend - I'm out for a conference from early Thursday.

  1. How should we alter the API of datasets to make it easier to construct Dataset objects containing duck-typed arrays (see Hypothesis strategies in xarray.testing.strategies #6908 (comment))

Replied in the thread above.

  1. Why does the example generation performance seem to have gotten worse? 😕 I added what I thought were small refactors (e.g. _sizes_from_dim_names) which may somehow be related, but I'm not sure.

I'd be pretty surprised if that was related, st.fixed_dictionaries() is internally basically just that zip() trick anyway. I'd guess that this is mostly "as you implement the last few complicated data-gen options, they start taking nonzero time", but not confident in that without reading closely and probably measuring some perf things.

xarray/testing/strategies.py Outdated Show resolved Hide resolved
Comment on lines 262 to 272
but building a dataset from scratch (i.e. method (2)) requires building the dataset object in such as way that all of
the data variables have compatible dimensions. You can build up a dictionary of the form ``{var_name: data_variable}``
yourself, or you can use the ``data_vars`` argument to the ``data_variables`` strategy (TODO):

.. ipython:: python
:okexcept:

sparse_data_vars = xrst.data_variables(data=sparse_arrays())
sparse_datasets = xrst.datasets(data_vars=sparse_data_vars)

sparse_datasets.example()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had intended to push .pin in some form upstream, but I of course forgot about the other types of strategies so I can see why that would not be desirable.

Putting the code into the definition of the composite strategy is much better than what I had before (constructing the examples using data.draw directly in the test), so that would be fine with me.

Do you know if it is possible to use make_strategies_namespace with additional parameters to the array's constructor, like units for pint or chunks for dask? I guess if we use the pint_arrays function from above we could use partial for this (and anyway, pint does not implement __array_namespace__ at the moment).

@dcherian
Copy link
Contributor

dcherian commented Apr 1, 2024

How do we move this forward? Even Xarray objects with just numpy arrays would be quite useful

@Zac-HD
Copy link
Contributor

Zac-HD commented Apr 1, 2024

I think #8404 made a lot of progress on this, including shipping the user-facing documentation. If you wanted to open a PR rebasing this set of changes on main, I think that might be most of the remaining work.

@TomNicholas
Copy link
Member Author

So I just did a monster merge of main into this branch (probably should still rebase). It won't work yet because we still need to propagate all the array_strategy_fn stuff that went through with #8404 into the signatures of the new strategies in this PR.

How do we move this forward?

It's mostly just dealing with the above and also making sure we can generate sets of variables with alignable dimensions efficiently. We also probably should think about what we want the signatures of the more complicated strategies to be: e.g. are we wanting to pass variables to datasets? or array_strategy_fn to datasets?

Even Xarray objects with just numpy arrays would be quite useful

A lot of the work that went into #8404 was working out how to make it general enough to handle non-numpy arrays.

doc/whats-new.rst Outdated Show resolved Hide resolved
dcherian and others added 4 commits August 22, 2024 07:55
* main: (214 commits)
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  ...
@dcherian
Copy link
Contributor

We also probably should think about what we want the signatures of the more complicated strategies to be: e.g. are we wanting to pass variables to datasets? or array_strategy_fn to datasets?

These seem like O(ε) improvements to a really great PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration tools dependencies Pull requests that update a dependency file enhancement topic-testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Public hypothesis strategies for generating xarray data
4 participants