
Opening virtual datasets (dmr-adapter) #606

Merged (18 commits into nsidc:main on Dec 14, 2024)

Conversation

@ayushnag (Collaborator) commented Jun 18, 2024

  • Closes #605: Opening virtual datasets with NASA dmrpp files
  • Add docs
  • Unit tests: update the current test to check specific portions of the virtual dataset
  • Check indirect dmrpp reading support
  • Use updated virtualizarr version (with numpy 2.0 manifest)
  • Update CHANGELOG.md
  • Add tutorial notebook [TODO in later PR]

📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/
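For context, here is a minimal sketch of the workflow this PR enables. The keyword names are taken from the test snippets quoted later in this thread; the collection short name is only an illustrative placeholder, and the merged API may differ slightly.

```python
# Hedged sketch of the new virtual-dataset workflow; not the exact merged API.
import earthaccess

earthaccess.login()

# Any collection that publishes DMR++ (.dmrpp) sidecar files; placeholder short name.
granules = earthaccess.search_data(short_name="MUR-JPL-L4-GLOB-v4.1", count=2)

# Build a lazy "virtual" xarray.Dataset from each granule's DMR++ metadata
# instead of downloading and scanning the underlying NetCDF/HDF5 files.
vds = earthaccess.open_virtual_mfdataset(
    granules=granules,
    access="indirect",   # HTTPS access from outside AWS; "direct" for in-region S3
    concat_dim="time",
    parallel=True,
    preprocess=None,
)
print(vds)
```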

@ayushnag ayushnag marked this pull request as draft June 18, 2024 23:33
@betolink (Member) commented:

This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag

@ayushnag (Collaborator, Author) commented Jun 20, 2024

https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b

If there's more you want me to add, or if any step is unclear, I can update the notebook.

@betolink betolink self-assigned this Jun 25, 2024
github-actions bot commented Nov 13, 2024

Binder 👈 Launch a binder notebook on this branch (link posted for each new commit; latest: 61afb95)

I will automatically update this comment whenever this PR is modified.

@chuckwondo (Collaborator) commented:
@ayushnag, perhaps I'm missing something here, but why have you copied code from the virtualizarr library into earthaccess? I don't see anything gained by this. We can simply use virtualizarr directly. Can you clarify?

@ayushnag (Collaborator, Author) replied:

Yes, I can clarify. At some point the dmrpp parser was going to be part of earthaccess instead of virtualizarr. However, that is no longer the case, and this PR will be updated to just call virtualizarr directly.
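For readers following along, "calling virtualizarr directly" for a single granule looks roughly like the sketch below. The `filetype="dmrpp"` and `indexes={}` arguments follow virtualizarr ~1.x, the helper name is hypothetical, and authentication/`reader_options` are omitted:

```python
# Rough sketch only; `granule` is an earthaccess.DataGranule from a search.
from virtualizarr import open_virtual_dataset


def open_one_dmrpp(granule, access="indirect"):
    """Open a single granule's DMR++ sidecar as a virtual xarray.Dataset."""
    dmrpp_url = granule.data_links(access=access)[0] + ".dmrpp"
    return open_virtual_dataset(
        dmrpp_url,
        filetype="dmrpp",  # use virtualizarr's dmrpp parser
        indexes={},        # do not load coordinate values or build indexes
    )
```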

@ayushnag ayushnag marked this pull request as ready for review November 14, 2024 19:03
@ayushnag (Collaborator, Author) commented:
This PR is ready to review now. I have updated the pyproject.toml with an added dependency, but let me know if the uv.lock or other files need to be modified.

@betolink (Member) commented:
I can take a look this evening. Great work @ayushnag!!

@betolink (Member) commented:

I think this PR is ready to be merged. There are many considerations for the future of virtual datasets with NASA data, but this PR gets us closer to a workflow where we won't have to generate metadata when a format is already available. We need to produce a few examples and put them in the documentation, maybe using SWOT, ICESat-2, and TEMPO datasets. cc @danielfromearth @DeanHenze

Is there anything we are missing @TomNicholas @ayushnag?

@betolink betolink requested review from TomNicholas and betolink and removed request for TomNicholas December 10, 2024 17:32
betolink previously approved these changes Dec 10, 2024

@betolink (Member) left a comment:

Great work, Ayush! Thank you for all the work you put into this PR and what you did on virtualizarr. I hope we can talk with @jgallagher59701 and Miguel Jimenez today or tomorrow about what's next. Maybe we can have people hacking on this access pattern at the Pangeo event!

earthaccess/virtualizarr.py (outdated review thread, resolved)
open_ = _parse_dmr
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules]
if preprocess is not None:
    vdatasets = [preprocess(ds) for ds in vdatasets]
Review comment (Member):

amazing!

# Open directly with `earthaccess.open`
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts")

result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dim="time", parallel=True, preprocess=None)
Review comment (Member):

do we have Dask in the test dependencies?

from __future__ import annotations

import fsspec
import xarray as xr
Review comment (Member):

xarray is not a core dependency. I think we need to add it as an optional dependency, the same way consolidate_metadata uses Dask (see pyproject.toml), and then the failing tests should pass!
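For illustration, the optional-dependency pattern described here usually looks something like the guarded import below. This is a generic sketch, not the code in this PR; the extra name matches the `virtualizarr` group in pyproject.toml shown later in the diff:

```python
# Generic sketch of an optional-dependency guard; earthaccess may implement it differently.
def open_virtual_mfdataset(granules, **kwargs):
    try:
        import virtualizarr  # noqa: F401
        import xarray as xr  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "open_virtual_mfdataset requires the optional dependencies xarray and "
            "virtualizarr; install them with `pip install earthaccess[virtualizarr]`."
        ) from err
    ...  # real implementation goes here
```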

earthaccess/virtualizarr.py (outdated review thread, resolved)
def open_virtual_mfdataset(
    granules: list[earthaccess.DataGranule],
    group: str | None = None,
    access: str = "indirect",
Review comment (Member):

Great that you're explicitly exposing this; I think we are going to deprecate the "magic" of detecting the runtime environment (in-region vs. out-of-region).

pyproject.toml (outdated)
@@ -64,6 +64,9 @@ kerchunk = [
    "h5netcdf",
    "xarray",
]
virtualizarr = [
    "virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr.git"
Review comment (Member):

Is this because we want to be up to date for now? Are we targeting an upcoming release?

Reply:

I released like 2 days ago, so you probably want to just pin to >=1.2.0

@ayushnag (Collaborator, Author) replied:

Great, will update that!


# TODO: replace with xr.testing when virtualizarr fill_val is fixed (https://github.com/zarr-developers/VirtualiZarr/issues/287)
# and dmrpp deflateLevel (zlib compression level) is always present (https://github.com/OPENDAP/bes/issues/954)
for var in result.variables:
Review comment (Member):

Is this the replacement for "almost equal"?
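For reference, a per-variable "almost equal" check along these lines is one way to stand in for `xr.testing.assert_allclose` until the linked upstream issues are fixed. A sketch, not the exact test body; it assumes `result` and `expected` are the loaded datasets from the snippet above and that the compared variables are numeric:

```python
# Sketch: compare each variable's values and attributes individually instead of
# asserting equality on the whole dataset.
import numpy.testing as npt

for var in result.variables:
    npt.assert_allclose(result[var].values, expected[var].values, rtol=1e-5)
    assert result[var].attrs == expected[var].attrs
```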

for g in granules:
    vdatasets.append(
        open_(
            filepath=g.data_links(access=access)[0] + ".dmrpp",
Review comment (Member):

The linter tells me that filepath is not a thing, same with indexes, since this uses the open_virtual_dataset() defined further down (at line 146). Maybe we need to refactor it?

@ayushnag (Collaborator, Author) commented:
@betolink One quick comment before we merge: this function doesn't appear to show up in the API reference in the docs build. I think if the new functions are imported in .api, that should fix it?

@ayushnag (Collaborator, Author) added:
Although then the function has to be part of api.py. Maybe not, in that case.

@betolink (Member) commented Dec 10, 2024

> @betolink One quick comment before we merge: this function doesn't appear to show up in the API reference in the docs build. I think if the new functions are imported in .api, that should fix it?

I think you're correct. I don't think the function needs to be part of api.py, but it definitely needs to be imported if we want it to show up in the docs. Since we are also including it in __init__.py, it's already part of the earthaccess namespace. Good catch! Do you want to push one commit and see how the docs render?

@danielfromearth (Collaborator) commented:
> I think this PR is ready to be merged. There are many considerations for the future of virtual datasets with NASA data, but this PR gets us closer to a workflow where we won't have to generate metadata when a format is already available. We need to produce a few examples and put them in the documentation, maybe using SWOT, ICESat-2, and TEMPO datasets. cc @danielfromearth @DeanHenze
>
> Is there anything we are missing @TomNicholas @ayushnag?

@betolink, do you think we should create a new issue for examples with open_virtual_dataset?

@@ -58,6 +59,9 @@
    "Store",
    # kerchunk
    "consolidate_metadata",
    # virtualizarr
    "open_virtual_dataset",
    "open_virtual_mfdataset",

Review comment:

We should probably add a virtualizarr.open_virtual_mfdataset upstream in virtualizarr. Though I would like to be more confident about the best way to parallelize reference generation first

zarr-developers/VirtualiZarr#123

zarr-developers/VirtualiZarr#7
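For context, one common way to parallelize per-granule reference generation (mirroring what xarray's `open_mfdataset(parallel=True)` does) is `dask.delayed`. A sketch under that assumption, not necessarily how this PR or a future virtualizarr helper will do it; `open_one` is a stand-in for whatever opens a single granule's DMR++ metadata:

```python
# Sketch: generate per-granule virtual datasets in parallel with dask.delayed,
# then combine them along the requested dimension.
import dask
import xarray as xr


def open_virtual_datasets(granules, open_one, parallel=True, **combine_kwargs):
    if parallel:
        delayed = [dask.delayed(open_one)(g) for g in granules]
        vdatasets = list(dask.compute(*delayed))
    else:
        vdatasets = [open_one(g) for g in granules]
    # e.g. combine_kwargs = {"concat_dim": "time", "combine_attrs": "drop_conflicts"}
    return xr.combine_nested(vdatasets, **combine_kwargs)
```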

@TomNicholas left a comment:


Amazing work @ayushnag.

    load: bool = False,
    preprocess: callable | None = None,  # type: ignore
    parallel: bool = True,
    **xr_combine_nested_kwargs: Any,


Technically if you are able to load coordinates into memory (and therefore create pandas indexes) you could also support xr.combine_by_coords here too.
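A sketch of what that could look like, assuming virtualizarr's `loadable_variables` option is used to materialize the 1-D `time` coordinate so that pandas indexes exist; `combine_by_time` and its `dmrpp_urls` argument are hypothetical, and this is not part of this PR:

```python
# Sketch: load the concatenation coordinate eagerly so xr.combine_by_coords can
# order and align the datasets automatically.
import xarray as xr
from virtualizarr import open_virtual_dataset


def combine_by_time(dmrpp_urls):
    """Open each .dmrpp sidecar with its time coordinate loaded, then combine."""
    vdatasets = [
        open_virtual_dataset(url, filetype="dmrpp", loadable_variables=["time"])
        for url in dmrpp_urls
    ]
    return xr.combine_by_coords(vdatasets, combine_attrs="drop_conflicts")
```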

Comment on lines +129 to +130
refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
return xr.open_dataset(


So if we implemented @ayushnag's idea in zarr-developers/VirtualiZarr#124, then we could just call that instead of using xarray.open_dataset here.
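Until then, the pattern in the quoted lines amounts to roughly the following kerchunk-reference roundtrip. This is a sketch: `vds` is a virtual dataset opened earlier, and the PR's actual `xr.open_dataset` arguments and any storage/auth options are not shown here:

```python
# Sketch: serialize the virtual dataset's chunk manifest to in-memory kerchunk
# references, then open them lazily through fsspec's reference filesystem.
import xarray as xr

refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "https"},
    },
)
```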

data_path = granule.data_links(access="indirect")[0]
dmrpp_path = data_path + ".dmrpp"

result = open_virtual_dataset(


So the test of correctness here is assuming that virtualizarr has correct behaviour? You might want to add more tests because virtualizarr definitely still has bugs in it (e.g. around CF coordinate decoding).

@ayushnag (Collaborator, Author) replied:

Yes, we are assuming that virtualizarr has the correct behavior. However, this is mostly an integration test checking that the function operates as expected; the actual tests for correctness are in virtualizarr (for the dmrpp backend).

@ayushnag (Collaborator, Author) commented:
@TomNicholas do you think after changing the virtualizarr version in pyproject.toml we can merge? The other changes seem more like long term updates to me. Or which changes do you think are required now?

@betolink betolink self-requested a review December 13, 2024 15:09
betolink previously approved these changes Dec 13, 2024

@betolink (Member) left a comment:

Docs look good, we're merging now!

@betolink (Member) left a comment:

We'll address @TomNicholas's comments about pushing some functionality to VirtualiZarr; other than that, we need the example notebooks!

@betolink betolink merged commit b9dec8b into nsidc:main Dec 14, 2024
13 checks passed
@Mikejmnez commented:
Just catching up here. Awesome work @ayushnag 🥇 !

Successfully merging this pull request may close these issues: Opening virtual datasets with NASA dmrpp files (#605).

6 participants