Test fsspec roundtrip #42

Merged 22 commits from test_fsspec_roundtrip into main on May 16, 2024.
Changes shown below are from 11 of the 22 commits.

Commits:
c1c3919  move kerchunk backend imports to be specific to each backend filetype  (TomNicholas, Mar 18, 2024)
a26902d  test roundtrip to json file then reading using fsspec  (TomNicholas, Mar 18, 2024)
305b850  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Mar 18, 2024)
db49d03  add test env dependencies  (TomNicholas, Mar 18, 2024)
97e7806  more test env deps  (TomNicholas, Mar 18, 2024)
0963fd3  more  (TomNicholas, Mar 18, 2024)
1be6e61  add pip install of xarray PR  (TomNicholas, Mar 18, 2024)
43f5870  correct pip url  (TomNicholas, Mar 18, 2024)
d086127  roundtrip test involving concatenation  (TomNicholas, Mar 20, 2024)
91776a9  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, Mar 22, 2024)
3f75850  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot], Mar 22, 2024)
f2aad9f  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, Mar 22, 2024)
ad94bf4  Merge branch 'test_fsspec_roundtrip' of https://github.com/TomNichola…  (TomNicholas, Mar 22, 2024)
aaa371c  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, Mar 25, 2024)
41f0a2b  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, Mar 27, 2024)
c4127c5  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, Apr 5, 2024)
c2a20ce  Merge branch 'test_fsspec_roundtrip' of https://github.com/TomNichola…  (TomNicholas, May 15, 2024)
0460888  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, May 15, 2024)
6e8c372  remove duplication of pooch  (TomNicholas, May 15, 2024)
89a7018  correct formatting  (TomNicholas, May 15, 2024)
e527f71  Merge branch 'main' into test_fsspec_roundtrip  (TomNicholas, May 15, 2024)
5dbdfcd  try removing netcdf4-python from the environment  (TomNicholas, May 15, 2024)
ci/environment.yml (4 changes: 3 additions & 1 deletion)
@@ -14,5 +14,7 @@ dependencies:
   - kerchunk
   - ujson
   - pydantic
+  - pooch
+  - fsspec
   - pip:
-    - xarray>=2024.02.0.dev0
+    - git+https://github.com/TomNicholas/xarray.git@opt-out-auto-create-index-variables#egg=xarray
pyproject.toml (3 changes: 3 additions & 0 deletions)
@@ -32,6 +32,9 @@ test = [
     "pre-commit",
     "pytest-mypy",
     "pytest",
+    "fsspec",
+    "pooch",
+    "netcdf4-python",
 ]


virtualizarr/tests/test_integration.py (new file, 64 additions)
@@ -0,0 +1,64 @@
import fsspec
import xarray as xr
import xarray.testing as xrt

from virtualizarr import open_dataset_via_kerchunk


def test_kerchunk_roundtrip_no_concat(tmpdir):
# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False)

# save it to disk as netCDF (in temporary directory)
ds.to_netcdf(f"{tmpdir}/air.nc")

# use open_dataset_via_kerchunk to read it as references
vds = open_dataset_via_kerchunk(f"{tmpdir}/air.nc", filetype="netCDF4")

# write those references to disk as kerchunk json
vds.virtualize.to_kerchunk(f"{tmpdir}/refs.json", format="json")

# use fsspec to read the dataset from disk via the zarr store
fs = fsspec.filesystem("reference", fo=f"{tmpdir}/refs.json")
m = fs.get_mapper("")

roundtrip = xr.open_dataset(m, engine="kerchunk")

# assert equal to original dataset
xrt.assert_equal(roundtrip, ds)


def test_kerchunk_roundtrip_concat(tmpdir):
# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
time=slice(None, 2000)
)
Comment on lines +34 to +36

Suggested change:

-    ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
-        time=slice(None, 2000)
-    )
+    # set up example xarray dataset
+    ds = xr.tutorial.open_dataset("air_temperature", decode_times=True).isel(time=slice(None, 2000))
+    del ds.time.encoding["units"]

This is the hard case: the two time variables will be encoded differently, and the concat no longer makes sense.

IIRC kerchunk supports this by decoding time and inlining bytes in the JSON. I think you'll want to at least raise an error here.

TomNicholas (Member, Author) replied:
Why will they be encoded differently?

In what case should I be raising an error? When the encoding attributes are different?

(Sorry I'm not very familiar with the encoding step in general)

Reviewer replied:

> Why will they be encoded differently?

The encoding chooses units separately for each dataset since they are written separately.

This happens in the wild, and you'll want to catch it.

> When the encoding attributes are different?

I think so. How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?

TomNicholas (Member, Author) replied:

Okay thanks.

> How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?

At the moment I'm not doing anything with encodings explicitly, and my tests have been testing with ManifestArrays created in-memory. So I'm not actually deliberately using them anywhere...

TomNicholas (Member, Author), Apr 5, 2024:

PR #69 added the ability to load selected variables in as normal lazy numpy arrays, so they will get decoded by xarray automatically.

EDIT: But we can't use it to make this roundtripping test work until I implement writing those numpy arrays out "inlined" into the kerchunk reference files, see #62.

I also opened issue #68 to discuss the encoding problem more generally.
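The units mismatch the reviewer describes can be sketched in plain Python (the reference timestamps below are hypothetical illustrations, not values from this PR): when two files are written separately, CF-style time encoding typically picks each file's reference from its own first timestamp, so identical raw on-disk integers in the two files denote different instants and cannot be concatenated byte-for-byte.

```python
from datetime import datetime, timedelta


def cf_encode(times, reference):
    # CF-style time encoding: numeric offsets plus a "hours since <ref>" unit string
    values = [(t - reference) / timedelta(hours=1) for t in times]
    return values, f"hours since {reference.isoformat()}"


times = [datetime(2013, 1, 1) + timedelta(hours=6 * i) for i in range(6)]

# Written as two separate files, each half gets its own reference timestamp
vals1, units1 = cf_encode(times[:3], times[0])
vals2, units2 = cf_encode(times[3:], times[3])

assert units1 != units2  # different "hours since ..." reference per file
assert vals1 == vals2    # identical raw values encode *different* instants
```

This is why concatenating the raw chunk references without checking `encoding` can silently produce a wrong time axis; the suggested `decode_times=True` plus `del ds.time.encoding["units"]` forces the round-trip through decoded values instead.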


# split into two datasets
ds1, ds2 = ds.isel(time=slice(None, 1000)), ds.isel(time=slice(1000, None))

# save it to disk as netCDF (in temporary directory)
ds1.to_netcdf(f"{tmpdir}/air1.nc")
ds2.to_netcdf(f"{tmpdir}/air2.nc")

# use open_dataset_via_kerchunk to read it as references
vds1 = open_dataset_via_kerchunk(f"{tmpdir}/air1.nc", filetype="netCDF4")
vds2 = open_dataset_via_kerchunk(f"{tmpdir}/air2.nc", filetype="netCDF4")

# concatenate virtually along time
vds = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# write those references to disk as kerchunk json
vds.virtualize.to_kerchunk(f"{tmpdir}/refs.json", format="json")

# use fsspec to read the dataset from disk via the zarr store
fs = fsspec.filesystem("reference", fo=f"{tmpdir}/refs.json")
m = fs.get_mapper("")

roundtrip = xr.open_dataset(m, engine="kerchunk")

# user does analysis here

# assert equal to original dataset
xrt.assert_equal(roundtrip, ds)
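For context on what `to_kerchunk(..., format="json")` writes and what `fsspec.filesystem("reference", fo=...)` consumes: a kerchunk reference file is a small JSON mapping from zarr store keys to either inline metadata strings or `[path, offset, length]` pointers into the original netCDF bytes. The sketch below uses made-up paths, offsets, and chunk metadata for illustration; none of these values come from the PR.

```python
import json

# Hypothetical kerchunk "version 1" reference set for a single variable "air"
refs = {
    "version": 1,
    "refs": {
        # inline zarr group metadata, stored directly as a JSON string
        ".zgroup": '{"zarr_format": 2}',
        "air/.zarray": json.dumps(
            {"shape": [2920, 25, 53], "chunks": [2920, 25, 53],
             "dtype": "<i2", "compressor": None, "fill_value": None,
             "order": "C", "filters": None, "zarr_format": 2}
        ),
        # chunk key -> [file path, byte offset, byte length] into the netCDF file
        "air/0.0.0": ["/tmp/air.nc", 15419, 7738000],
    },
}

# The "reference" filesystem reads this mapping (as a dict or a JSON file) and
# serves each key as if it were an object in a real zarr store
serialized = json.dumps(refs)
assert json.loads(serialized)["refs"]["air/0.0.0"][0] == "/tmp/air.nc"
```

Concatenation in the virtual dataset then amounts to merging two such mappings under one chunk grid, which is why only byte ranges (not array values) ever need to be rewritten.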