-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Opening virtual datasets (dmr-adapter) #606
Conversation
This PR looks good to me (we need to fix some minor formatting issues with Ruff). Maybe the only missing thing would be a notebook demonstrating how to use this feature? @ayushnag |
https://gist.github.com/ayushnag/bcf946a71122f5e7a54bc72b581bd31b Better viewing experience: https://nbviewer.org/gist/ayushnag/bcf946a71122f5e7a54bc72b581bd31b If there's more you want me to add or if any step is unclear I can update the notebook |
👈 Launch a binder notebook on this branch for commit b09c3f0 I will automatically update this comment whenever this PR is modified 👈 Launch a binder notebook on this branch for commit ed4a98b 👈 Launch a binder notebook on this branch for commit abe2c28 👈 Launch a binder notebook on this branch for commit dde9c57 👈 Launch a binder notebook on this branch for commit a9a9234 👈 Launch a binder notebook on this branch for commit 8fb0140 👈 Launch a binder notebook on this branch for commit 2c28278 👈 Launch a binder notebook on this branch for commit c3e43ac 👈 Launch a binder notebook on this branch for commit d3d6e7c 👈 Launch a binder notebook on this branch for commit 512e89c 👈 Launch a binder notebook on this branch for commit 67dbbe7 👈 Launch a binder notebook on this branch for commit 61afb95 |
@ayushnag, perhaps I'm missing something here, but why have you copied code from the virtualizarr library into earthaccess? I don't see anything gained by this. We can simply use virtualizarr directly. Can you clarify? |
Yes I can clarify. At some point the dmrpp parser was going to be a part of earthaccess instead of virtualizarr. However that is not the case anymore and this PR will be updated to just call virtualizarr directly |
This PR is ready to review now. I have updated the |
I can take a look this evening, great work @ayushnag !! |
I think this PR is ready to be merged, there are many considerations to the future of virtual datasets with NASA data but this PR get us closer to a workflow where we won't have to generate metadata when there is a format already available. We need to produce a few examples and put it in the documentation, maybe using SWOT, ICESat-2 and TEMPO datasets cc @danielfromearth @DeanHenze Is there anything we are missing @TomNicholas @ayushnag? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Ayush! thank you for all the work you put on this PR and what you did on virtualizarr
. I hope we can talk with @jgallagher59701 and Miguel Jimenez today/tomorrow about what's next. Maybe we can have people hacking on this access pattern at the Pangeo event!
earthaccess/virtualizarr.py
Outdated
open_ = _parse_dmr | ||
vdatasets = [open_(fs=fs, data_path=g.data_links(access=access)[0]) for g in granules] | ||
if preprocess is not None: | ||
vdatasets = [preprocess(ds) for ds in vdatasets] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing!
# Open directly with `earthaccess.open` | ||
expected = xr.open_mfdataset(earthaccess.open(granules), concat_dim="time", combine="nested", combine_attrs="drop_conflicts") | ||
|
||
result = earthaccess.open_virtual_mfdataset(granules=granules, access="indirect", concat_dime="time", parallel=True, preprocess=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have Dask in the test dependencies?
earthaccess/virtualizarr.py
Outdated
from __future__ import annotations | ||
|
||
import fsspec | ||
import xarray as xr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xarray is not one a core dependency, I think we need to add it as an optional dependency the same way the consolidate_metadata uses Dask (see the pyproject.yaml) and then the tests that are failing should pass!
earthaccess/virtualizarr.py
Outdated
def open_virtual_mfdataset( | ||
granules: list[earthaccess.DataGranule], | ||
group: str | None = None, | ||
access: str = "indirect", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that you're explicitly exposing this, I think we are going to deprecate the "magic" of detecting the runtime environment in-region
vs out-of-region
pyproject.toml
Outdated
@@ -64,6 +64,9 @@ kerchunk = [ | |||
"h5netcdf", | |||
"xarray", | |||
] | |||
virtualizarr = [ | |||
"virtualizarr @ git+https://github.com/zarr-developers/VirtualiZarr.git" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because we want to be up to date for now? are we targeting an upcoming release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I released like 2 days ago, so you probably want to just pin to >=1.2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, will update that!
|
||
# TODO: replace with xr.testing when virtualizarr fill_val is fixed (https://github.com/zarr-developers/VirtualiZarr/issues/287) | ||
# and dmrpp deflateLevel (zlib compression level) is always present (https://github.com/OPENDAP/bes/issues/954) | ||
for var in result.variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the replacement for "almost equal"?
earthaccess/virtualizarr.py
Outdated
for g in granules: | ||
vdatasets.append( | ||
open_( | ||
filepath=g.data_links(access=access)[0] + ".dmrpp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the linter tells me that filepath
is not a thing, same with indexes
as this uses open_virtual_dataset()
from down below in line 146. Maybe we need to refactor it?
@betolink One quick comment before we merge. This function doesn't appear to show up in the API reference with the docs build. I think if the new functions are imported with .api that should fix it? |
Although then the function has to be part of |
i think you're correct, I don't think the function needs to be part of API but it definitely needs to be imported, if we want the docs to show there. I think since we are including it also in the |
@betolink, do you think we should create a new issue for examples with open virtual dataset? |
@@ -58,6 +59,9 @@ | |||
"Store", | |||
# kerchunk | |||
"consolidate_metadata", | |||
# virtualizarr | |||
"open_virtual_dataset", | |||
"open_virtual_mfdataset", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a virtualizarr.open_virtual_mfdataset
upstream in virtualizarr. Though I would like to be more confident about the best way to parallelize reference generation first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tracked in zarr-developers/VirtualiZarr#345
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @ayushnag.
load: bool = False, | ||
preprocess: callable | None = None, # type: ignore | ||
parallel: bool = True, | ||
**xr_combine_nested_kwargs: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically if you are able to load coordinates into memory (and therefore create pandas indexes) you could also support xr.combine_by_coords
here too.
refs = vds.virtualize.to_kerchunk(filepath=None, format="dict") | ||
return xr.open_dataset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we implemented @ayushnag 's idea in zarr-developers/VirtualiZarr#124 then we could just call that instead of using xarray.open_dataset
here.
data_path = granule.data_links(access="indirect")[0] | ||
dmrpp_path = data_path + ".dmrpp" | ||
|
||
result = open_virtual_dataset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the test of correctness here is assuming that virtualizarr has correct behaviour? You might want to add more tests because virtualizarr definitely still has bugs in it (e.g. around CF coordinate decoding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are assuming that virtualizarr has the correct behavior. However this is mostly an integration test checking that the function operates as expected. The actual tests for correctness are in virtualizarr (for the dmrpp backend)
@TomNicholas do you think after changing the virtualizarr version in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look good, we're merging now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll address @TomNicholas comments about pushing some functionality to VirtualiZarr, other than that we need the example notebooks!
Just catching up here. Awesome work @ayushnag 🥇 ! |
virtualizarr
version (with numpy 2.0 manifest)Add tutorial notebook [TODO in later PR]📚 Documentation preview 📚: https://earthaccess--606.org.readthedocs.build/en/606/