Refactor tests around expected properties #394

Open
TomNicholas opened this issue Jan 28, 2025 · 1 comment
TomNicholas commented Jan 28, 2025

Discussions with @maxrjones made me realise that we're thinking about testing wrong. In particular, with #124 in place we will have well-defined expected properties that we can and should test against. Let me explain.

@maxrjones asked whether implementing loading via a ManifestStore/FunctionalStore as described by @ayushnag in #124 would be a problematic mixing of concerns between creating virtual references and also loading data. But actually it's the opposite: this is exactly what we should be doing.

The reason is that if we cannot correctly load data via a ManifestStore, we cannot correctly load it from an IcechunkStore either. Loading from in-memory ManifestArrays via zarr-python reading a ManifestStore and loading from an in-memory/on-disk/S3 IcechunkStore via zarr-python perform exactly the same set of decoding steps. Contrast that with how we currently load variables via a separate xarray backend (e.g. h5netcdf), which does its own decoding logic that isn't encoded in the zarr model.

To check that we are able to correctly load variables, the property to test is

import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

loaded_ds_from_manifests = open_virtual_dataset('file.nc', backend='netcdf', loadable_variables='all')

loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_manifests, loaded_ds_from_backend)

i.e., we check our creation of manifests by loading from them and comparing against loading via a pre-existing xarray backend for that filetype. @dcherian told me once that it's a good thing to have multiple different ways to load data from the same file format, as it allows cross-checking. This property just encodes that. We could write a similar property for any other virtualizable file format that has an xarray backend, and each such test should live in virtualizarr/tests/test_readers/test_hdf.py etc.
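To make the cross-checking pattern concrete, here is a hedged, in-memory illustration using only xarray and numpy. The two "loaders" here are hypothetical stand-ins, not virtualizarr code; in the real test they would be open_virtual_dataset(..., loadable_variables='all') and xr.open_dataset(..., engine='h5netcdf').

```python
import numpy as np
import xarray as xr
import xarray.testing as xrt

def assert_loaders_agree(load_a, load_b, source):
    """Check that two independent loaders decode the same source identically."""
    xrt.assert_identical(load_a(source), load_b(source))

# Two hypothetical "loaders" that decode the same source by different code paths.
source = {"air": [0.0, 1.0, 2.0]}

def loader_a(src):
    return xr.Dataset({k: ("time", np.asarray(v)) for k, v in src.items()})

def loader_b(src):
    return xr.Dataset({"air": ("time", np.array(src["air"]))})

assert_loaders_agree(loader_a, loader_b, source)
```

The point of the helper is that it is agnostic to where the two datasets came from, which is exactly what lets one property cover both the ManifestStore and the icechunk round-trip cases.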

Similarly we could also load from an in-memory icechunk store instead, which would make the property:

import xarray as xr
import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('file.nc', backend='netcdf')
vds.virtualize.to_icechunk(icechunkstore)
loaded_ds_from_icechunk = xr.open_zarr(icechunkstore).load()

loaded_ds_from_backend = xr.open_dataset('file.nc', engine='h5netcdf').load()
xrt.assert_identical(loaded_ds_from_icechunk, loaded_ds_from_backend)

This now looks a lot like our existing round-trip tests, inclusive of #376 (FYI @jsignell).

But there is no fundamental difference between how the data is loaded in these two cases: it's still zarr-python getting and decoding the data according to the ManifestArrays our readers created. So I now think we should use the ManifestStore idea (the first property) as much as possible internally, as it's the most lightweight approach and doesn't depend on the optional icechunk dependency. Then we additionally test that the latter property holds for round-tripping via icechunk.

TomNicholas (Member Author) commented:

The other reason to do this is that then it's obvious where to parameterize these tests with hypothesis strategies - i.e. for all netCDF files created by a strategy, the above property should hold. This is #347.
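A hedged sketch of what that parameterization could look like with hypothesis. The to_dict/from_dict round-trip below is only a stand-in for the real "write a netCDF file, open it via a ManifestStore" round-trip, which depends on the API proposed in #124; the strategy and dataset construction are illustrative, not virtualizarr code.

```python
import numpy as np
import xarray as xr
import xarray.testing as xrt
import hypothesis.extra.numpy as npst
from hypothesis import given, settings

@given(arr=npst.arrays(dtype=np.float64, shape=npst.array_shapes(max_dims=2, max_side=3)))
@settings(max_examples=10, deadline=None)
def test_roundtrip_property(arr):
    # Build a dataset from the generated array, round-trip it, and assert the
    # loading property holds for every generated input.
    dims = tuple(f"dim_{i}" for i in range(arr.ndim))
    original = xr.Dataset({"var": (dims, arr)})
    roundtripped = xr.Dataset.from_dict(original.to_dict())
    xrt.assert_identical(roundtripped, original)

# Calling the decorated function runs the property over all generated examples.
test_roundtrip_property()
```

Once helpers exist for writing strategy-generated datasets to actual netCDF files, the body of the property becomes exactly the first snippet above.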
