
Opening virtual datasets (dmr-adapter) #606

Merged (18 commits into nsidc:main on Dec 14, 2024)

Conversation

@ayushnag (Collaborator) commented Jun 18, 2024

  • Closes #605: Opening virtual datasets with NASA dmrpp files
  • Add docs
  • Unit tests: update the current test to check specific portions of the virtual dataset
  • Check indirect dmrpp reading support
  • Use updated virtualizarr version (with numpy 2.0 manifest)
  • Update CHANGELOG.md
  • Add tutorial notebook [TODO in later PR]

📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/
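For context, here is a minimal sketch of the workflow this PR enables. The keyword names are taken from the test snippets quoted later in this thread; the collection short name is only an illustrative placeholder, and the merged API may differ slightly.

```python
# Hedged sketch of the new virtual-dataset workflow; not the exact merged API.
import earthaccess

earthaccess.login()

# Any collection that publishes DMR++ (.dmrpp) sidecar files; placeholder short name.
granules = earthaccess.search_data(short_name="MUR-JPL-L4-GLOB-v4.1", count=2)

# Build a lazy "virtual" xarray.Dataset from each granule's DMR++ metadata
# instead of downloading and scanning the underlying NetCDF/HDF5 files.
vds = earthaccess.open_virtual_mfdataset(
    granules=granules,
    access="indirect",   # HTTPS access from outside AWS; "direct" for in-region S3
    concat_dim="time",
    parallel=True,
    preprocess=None,
)
print(vds)
```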

@ayushnag ayushnag marked this pull request as draft June 18, 2024 23:33
@betolink (Member) commented:

This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag

@ayushnag (Collaborator, Author) commented Jun 20, 2024

https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

If there's more you want me to add, or if any step is unclear, I can update the notebook.

@betolink betolink self-assigned this Jun 25, 2024
github-actions bot commented Nov 13, 2024

Binder 👈 Launch a binder notebook on this branch (link posted for each new commit; latest: 61afb95)

I will automatically update this comment whenever this PR is modified.

@chuckwondo (Collaborator) commented:
@ayushnag, perhaps I'm missing something here, but why have you copied code from the virtualizarr library into earthaccess? I don't see anything gained by this. We can simply use virtualizarr directly. Can you clarify?

@ayushnag (Collaborator, Author) replied:

Yes, I can clarify. At some point the dmrpp parser was going to be part of earthaccess instead of virtualizarr. However, that is no longer the case, and this PR will be updated to just call virtualizarr directly.
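For readers following along, "calling virtualizarr directly" for a single granule looks roughly like the sketch below. The `filetype="dmrpp"` and `indexes={}` arguments follow virtualizarr ~1.x, the helper name is hypothetical, and authentication/`reader_options` are omitted:

```python
# Rough sketch only; `granule` is an earthaccess.DataGranule from a search.
from virtualizarr import open_virtual_dataset


def open_one_dmrpp(granule, access="indirect"):
    """Open a single granule's DMR++ sidecar as a virtual xarray.Dataset."""
    dmrpp_url = granule.data_links(access=access)[0] + ".dmrpp"
    return open_virtual_dataset(
        dmrpp_url,
        filetype="dmrpp",  # use virtualizarr's dmrpp parser
        indexes={},        # do not load coordinate values or build indexes
    )
```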

@ayushnag ayushnag marked this pull request as ready for review November 14, 2024 19:03
@ayushnag (Collaborator, Author) commented:
This PR is ready to review now. I have updated the pyproject.toml with an added dependency, but let me know if the uv.lock or other files need to be modified.

@betolink (Member) commented:
I can take a look this evening. Great work @ayushnag!!

@betolink (Member) commented:

I think this PR is ready to be merged. There are many considerations for the future of virtual datasets with NASA data, but this PR gets us closer to a workflow where we won't have to generate metadata when a format is already available. We need to produce a few examples and put them in the documentation, maybe using SWOT, ICESat-2, and TEMPO datasets. cc @danielfromearth @DeanHenze

Is there anything we are missing @TomNicholas @ayushnag?

@betolink betolink requested review from TomNicholas and betolink and removed request for TomNicholas December 10, 2024 17:32
betolink previously approved these changes Dec 10, 2024

@betolink (Member) left a comment:

Great work, Ayush! Thank you for all the work you put into this PR and what you did on virtualizarr. I hope we can talk with @jgallagher59701 and Miguel Jimenez today or tomorrow about what's next. Maybe we can have people hacking on this access pattern at the Pangeo event!

earthaccess/virtualizarr.py (outdated review thread, resolved)
open_ = _parse_dmr
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules]
if preprocess is not None:
    vdatasets = [preprocess(ds) for ds in vdatasets]
Review comment (Member):

amazing!

# Open directly with `earthaccess.open`
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts")

result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dim="time", parallel=True, preprocess=None)
Review comment (Member):

do we have Dask in the test dependencies?

from __future__ import annotations

import fsspec
import xarray as xr
Review comment (Member):

xarray is not a core dependency. I think we need to add it as an optional dependency, the same way consolidate_metadata uses Dask (see pyproject.toml), and then the failing tests should pass!
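For illustration, the optional-dependency pattern described here usually looks something like the guarded import below. This is a generic sketch, not the code in this PR; the extra name matches the `virtualizarr` group in pyproject.toml shown later in the diff:

```python
# Generic sketch of an optional-dependency guard; earthaccess may implement it differently.
def open_virtual_mfdataset(granules, **kwargs):
    try:
        import virtualizarr  # noqa: F401
        import xarray as xr  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "open_virtual_mfdataset requires the optional dependencies xarray and "
            "virtualizarr; install them with `pip install earthaccess[virtualizarr]`."
        ) from err
    ...  # real implementation goes here
```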

earthaccess/virtualizarr.py (outdated review thread, resolved)
def open_virtual_mfdataset(
    granules: list[earthaccess.DataGranule],
    group: str | None = None,
    access: str = "indirect",
Review comment (Member):

Great that you're explicitly exposing this; I think we are going to deprecate the "magic" of detecting the runtime environment (in-region vs. out-of-region).

pyproject.toml (outdated)
@@ -64,6 +64,9 @@ kerchunk = [
    "h5netcdf",
    "xarray",
]
virtualizarr = [
    "virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr.git"
Review comment (Member):

Is this because we want to be up to date for now? Are we targeting an upcoming release?

Reply:

I released like 2 days ago, so you probably want to just pin to >=1.2.0

@ayushnag (Collaborator, Author) replied:

Great, will update that!


# TODO: replace with xr.testing when virtualizarr fill_val is fixed (https://github.com/zarr-developers/VirtualiZarr/issues/287)
# and dmrpp deflateLevel (zlib compression level) is always present (https://github.com/OPENDAP/bes/issues/954)
for var in result.variables:
Review comment (Member):

Is this the replacement for "almost equal"?
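For reference, a per-variable "almost equal" check along these lines is one way to stand in for `xr.testing.assert_allclose` until the linked upstream issues are fixed. A sketch, not the exact test body; it assumes `result` and `expected` are the loaded datasets from the snippet above and that the compared variables are numeric:

```python
# Sketch: compare each variable's values and attributes individually instead of
# asserting equality on the whole dataset.
import numpy.testing as npt

for var in result.variables:
    npt.assert_allclose(result[var].values, expected[var].values, rtol=1e-5)
    assert result[var].attrs == expected[var].attrs
```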

for g in granules:
    vdatasets.append(
        open_(
            filepath=g.data_links(access=access)[0] + ".dmrpp",
Review comment (Member):

The linter tells me that filepath is not a thing, same with indexes, since this uses the open_virtual_dataset() defined further down (at line 146). Maybe we need to refactor it?

@ayushnag (Collaborator, Author) commented:
@betolink One quick comment before we merge: this function doesn't appear to show up in the API reference in the docs build. I think if the new functions are imported in .api, that should fix it?

@ayushnag (Collaborator, Author) added:
Although then the function has to be part of api.py. Maybe not, in that case.

@betolink (Member) commented Dec 10, 2024

> @betolink One quick comment before we merge: this function doesn't appear to show up in the API reference in the docs build. I think if the new functions are imported in .api, that should fix it?

I think you're correct. I don't think the function needs to be part of api.py, but it definitely needs to be imported if we want it to show up in the docs. Since we are also including it in __init__.py, it's already part of the earthaccess namespace. Good catch! Do you want to push one commit and see how the docs render?

@danielfromearth (Collaborator) commented:
> I think this PR is ready to be merged. There are many considerations for the future of virtual datasets with NASA data, but this PR gets us closer to a workflow where we won't have to generate metadata when a format is already available. We need to produce a few examples and put them in the documentation, maybe using SWOT, ICESat-2, and TEMPO datasets. cc @danielfromearth @DeanHenze
>
> Is there anything we are missing @TomNicholas @ayushnag?

@betolink, do you think we should create a new issue for examples with open_virtual_dataset?

@@ -58,6 +59,9 @@
    "Store",
    # kerchunk
    "consolidate_metadata",
    # virtualizarr
    "open_virtual_dataset",
    "open_virtual_mfdataset",

Review comment:

We should probably add a virtualizarr.open_virtual_mfdataset upstream in virtualizarr. Though I would like to be more confident about the best way to parallelize reference generation first

zarr-developers/VirtualiZarr#123

zarr-developers/VirtualiZarr#7
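For context, one common way to parallelize per-granule reference generation (mirroring what xarray's `open_mfdataset(parallel=True)` does) is `dask.delayed`. A sketch under that assumption, not necessarily how this PR or a future virtualizarr helper will do it; `open_one` is a stand-in for whatever opens a single granule's DMR++ metadata:

```python
# Sketch: generate per-granule virtual datasets in parallel with dask.delayed,
# then combine them along the requested dimension.
import dask
import xarray as xr


def open_virtual_datasets(granules, open_one, parallel=True, **combine_kwargs):
    if parallel:
        delayed = [dask.delayed(open_one)(g) for g in granules]
        vdatasets = list(dask.compute(*delayed))
    else:
        vdatasets = [open_one(g) for g in granules]
    # e.g. combine_kwargs = {"concat_dim": "time", "combine_attrs": "drop_conflicts"}
    return xr.combine_nested(vdatasets, **combine_kwargs)
```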

@TomNicholas left a comment:


Amazing work @ayushnag.

    load: bool = False,
    preprocess: callable | None = None,  # type: ignore
    parallel: bool = True,
    **xr_combine_nested_kwargs: Any,


Technically if you are able to load coordinates into memory (and therefore create pandas indexes) you could also support xr.combine_by_coords here too.
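A sketch of what that could look like, assuming virtualizarr's `loadable_variables` option is used to materialize the 1-D `time` coordinate so that pandas indexes exist; `combine_by_time` and its `dmrpp_urls` argument are hypothetical, and this is not part of this PR:

```python
# Sketch: load the concatenation coordinate eagerly so xr.combine_by_coords can
# order and align the datasets automatically.
import xarray as xr
from virtualizarr import open_virtual_dataset


def combine_by_time(dmrpp_urls):
    """Open each .dmrpp sidecar with its time coordinate loaded, then combine."""
    vdatasets = [
        open_virtual_dataset(url, filetype="dmrpp", loadable_variables=["time"])
        for url in dmrpp_urls
    ]
    return xr.combine_by_coords(vdatasets, combine_attrs="drop_conflicts")
```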

Comment on lines +129 to +130
refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
return xr.open_dataset(


So if we implemented @ayushnag's idea in zarr-developers/VirtualiZarr#124, then we could just call that instead of using xarray.open_dataset here.
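Until then, the pattern in the quoted lines amounts to roughly the following kerchunk-reference roundtrip. This is a sketch: `vds` is a virtual dataset opened earlier, and the PR's actual `xr.open_dataset` arguments and any storage/auth options are not shown here:

```python
# Sketch: serialize the virtual dataset's chunk manifest to in-memory kerchunk
# references, then open them lazily through fsspec's reference filesystem.
import xarray as xr

refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "https"},
    },
)
```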

data_path = granule.data_links(access="indirect")[0]
dmrpp_path = data_path + ".dmrpp"

result = open_virtual_dataset(


So the test of correctness here is assuming that virtualizarr has correct behaviour? You might want to add more tests because virtualizarr definitely still has bugs in it (e.g. around CF coordinate decoding).

@ayushnag (Collaborator, Author) replied:

Yes, we are assuming that virtualizarr has the correct behavior. However, this is mostly an integration test checking that the function operates as expected; the actual tests for correctness are in virtualizarr (for the dmrpp backend).

@ayushnag (Collaborator, Author) commented:
@TomNicholas do you think after changing the virtualizarr version in pyproject.toml we can merge? The other changes seem more like long term updates to me. Or which changes do you think are required now?

@betolink betolink self-requested a review December 13, 2024 15:09
betolink previously approved these changes Dec 13, 2024

@betolink (Member) left a comment:

Docs look good, we're merging now!

@betolink (Member) left a comment:

We'll address @TomNicholas's comments about pushing some functionality to VirtualiZarr; other than that, we need the example notebooks!

@betolink betolink merged commit b9dec8b into nsidc:main Dec 14, 2024
13 checks passed
@Mikejmnez commented:
Just catching up here. Awesome work @ayushnag 🥇 !

Successfully merging this pull request may close these issues: Opening virtual datasets with NASA dmrpp files (#605).

6 participants