Add metadata consolidation utility #278
Conversation
pyproject.toml (Outdated)
@@ -37,6 +37,8 @@ s3fs = ">=2021.11, <2024"
fsspec = ">=2022.1"
tinynetrc = "^1.3.1"
multimethod = ">=1.8"
kerchunk = ">=0.1.2"
dask = {extras = ["complete"], version = ">=2023.5.0"}
What do you think of making these extras (optional dependencies)? I don't use poetry very much, but maybe it looks something like this?
[tool.poetry.extras]
distributed = ["kerchunk >=0.1.2", "dask[complete] >=2023.5.0"]
We'll need to think a little bit about user messaging if they attempt to use `consolidate_metadata` but don't have them installed.
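A rough sketch of what that messaging could look like, assuming an `earthaccess[kerchunk]` extra (the extra name and the helper itself are hypothetical):

```python
def _require_consolidate_deps():
    """Raise a helpful error if the optional dependencies are missing."""
    try:
        import dask  # noqa: F401
        import kerchunk  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "earthaccess.consolidate_metadata requires the optional dependencies "
            "'kerchunk' and 'dask'. Install them with, for example, "
            "`pip install earthaccess[kerchunk]` or `pip install kerchunk 'dask[complete]'`."
        ) from err
```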
Great point, I'd be happy to move these to optional dependencies, only use `dask`-specific code if `dask` is installed, etc. I was just adding them as hard dependencies to start as I wasn't sure what you and @betolink would prefer.
Good thinking! @betolink is far more suited to speak to overall direction than me, so I'll quiet down and let him take it from here :)
Thanks for another wonderful contribution! ❤️
This is awesome @jrbourbeau! I think what you and @MattF-NSIDC are describing makes sense: optional dependencies, and only using Dask-specific code if Dask is installed (if this is not too much of a hassle).
FWIW my general recommendation in cases like these is to depend on core `dask` (it's trivially lightweight) and then use `dask.delayed` and `dask.compute`. This will pick up a `dask.distributed` cluster if one is available, and if not use a local thread pool.
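A sketch of that pattern in this context, assuming kerchunk's `SingleHdf5ToZarr` for the per-file step (the URLs and storage options below are placeholders):

```python
import dask
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

@dask.delayed
def make_reference(url, storage_options):
    # One metadata pass over a single file, returning a kerchunk reference dict.
    with fsspec.open(url, "rb", **storage_options) as f:
        return SingleHdf5ToZarr(f, url).translate()

# Placeholder inputs, just to show the shape of the workflow.
urls = ["s3://example-bucket/granule-001.h5", "s3://example-bucket/granule-002.h5"]
tasks = [make_reference(u, {"anon": False}) for u in urls]

# Uses a dask.distributed cluster if one is active, otherwise a local thread pool.
references = dask.compute(*tasks)
```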
I should have mentioned originally that, while this works, I'm sure there are additional improvements we can iterate on (e.g. making …)
Hey James, looking for docs on …
Yeah, no problem. From https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html, …
I certainly failed to find it on my own :) Thanks! I haven't made time yet to dig, but I'm specifically curious about …
Ah, I see.
Thanks for this PR @jrbourbeau, this is an incredible addition to `earthaccess`. Regarding the dependencies, I like what @mrocklin said about having dask core as a dependency …
For those interested, this was given a demo at the monthly Dask meeting; recording to come shortly. Thanks for putting it together, @jrbourbeau! The main function seems to be a specialised version of https://github.com/fsspec/kerchunk/blob/main/kerchunk/combine.py#L649, which also supports tree reduction of the single-file reference sets. It would be nice if there were an obvious way to point to an existing reference file if someone else has already made it and stashed it somewhere. Of course, users can always save their own copy, but it would be one less barrier when available. Here is one dataset on the Planetary Computer which has a kerchunk reference file alongside the data: https://planetarycomputer.microsoft.com/dataset/deltares-water-availability
Earlier today @martindurant mentioned that …
Yeah, I agree this would be nice. I'm not sure how easy uploading to Earthdata is -- my guess is @betolink has the most context there.
I don't know how practical it is, but of course the reference files can be at any other accessible location, so it's deciding that location and getting the metadata to users which is the hard part. In kerchunk workflows we typically use intake (of course) to hide away the many arguments it takes to get xarray to load such data.
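For context, the "many arguments" in question look roughly like this when opening a kerchunk reference set directly with xarray (the reference file name and S3 options here are hypothetical):

```python
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "consolidated-metadata.json",  # the kerchunk reference file
            "remote_protocol": "s3",             # protocol where the actual data bytes live
            "remote_options": {"anon": True},    # credentials/options for that protocol
        },
    },
)
```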
@jrbourbeau I just started to test this and I wonder if we can include 2 features in the …
I started to consolidate big ATL03 files, the same ones we used at the IS2 hackweek, and ran into: …
These files are very complex, with nested groups and hundreds of variables. We managed to kerchunk them at the hackweek, but each one took like 20 minutes and for some reason we had to run kerchunk sequentially; when I used Dask it never finished. Here are the params I used for this test, if that helps:

```python
import earthaccess

# we use the file pattern to grab the same orbit
short_name = "ATL03"
track = "0811"
region = "12"

params = {
    "short_name": short_name,
    "version": "006",
    "cloud_hosted": True,
    "temporal": ("2018-11-01", "2023-08-01"),
    "granule_name": f"ATL*_{track}??{region}*.h5",
    "count": 4,
}

# Retrieve data files for the dataset I'm interested in
granules = earthaccess.search_data(**params)

# each granule is ~5GB
outfile = earthaccess.consolidate_metadata(
    granules,
    outfile=f"./direct-{short_name}-metadata.json",  # Writing to a local file for demo purposes
    # outfile=f"s3://my-bucket/{short_name}-metadata.json",  # We could also write to a remote file
    access="direct",
    # kerchunk_options={},  # not concatenating because this is swath ~LIDAR data
)
```

In the case of the SST dataset it worked like a charm! I can see some of these global gridded datasets being processed at scale with this approach, it's like magic!
Can we quote you? :)
assertions.assertTrue("EARTHDATA_USERNAME" in os.environ)
assertions.assertTrue("EARTHDATA_PASSWORD" in os.environ)

logger.info(f"Current username: {os.environ['EARTHDATA_USERNAME']}")
Nice! Way better than just printing all this info!
This is more into the Kerchunk side of things, but now that we have @martindurant here... Currently there's no way to tell `SingleHdf5ToZarr` to translate only a given group; being able to do something like `ref = SingleHdf5ToZarr(file, group='/gt1l/heights', ...)` would potentially generate something that xarray can read more easily. This is just me thinking out loud, and a little related to what @jrbourbeau and @martindurant talked about on where to store the reference files: …
I'm not certain how you can persuade the h5py tree visitor to only scan part of the file, but in `SingleHdf5ToZarr._translator` we could certainly prevent storing keys outside of some given group, or apply key renamings to make some group appear to be the root of the generated zarr. However, there may be easier ways:

```python
def keep_group(refs, group):
    # Keep only keys under `group` and strip the prefix, so that group
    # becomes the root of the generated zarr.
    return {k[len(group) + 1:]: v for k, v in refs.items() if k.startswith(group)}
```

(where `group` would be a "/"-delimited string)
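For illustration, a rough sketch of how that helper might be applied to a single file's references (the file path and group here are hypothetical, and `translate()` is called with kerchunk's default spec so the keys live under `"refs"`):

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "ATL03_example.h5"  # hypothetical local granule
with fsspec.open(url, "rb") as f:
    references = SingleHdf5ToZarr(f, url).translate()["refs"]

# Keep only the gt1l/heights group and re-root it.
heights_refs = keep_group(references, "gt1l/heights")
```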
@betolink pinged me on slack to take a look at this. Something I wanted to mention (for either this PR or future kerchunk integration) is that for larger datasets, doing a tree reduction from individual references to a combined reference can drastically improve performance: https://fsspec.github.io/kerchunk/advanced.html#tree-reduction. Kerchunk provides …
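As a rough sketch (not kerchunk's built-in helper), a two-stage combine with plain `MultiZarrToZarr` might look like this; the batch size and combine kwargs are illustrative, and in practice the second stage may need its own coordinate mapping:

```python
from kerchunk.combine import MultiZarrToZarr

def combine_in_batches(single_refs, batch_size=10, **mzz_kwargs):
    # Stage 1: combine small batches of single-file reference sets.
    partials = [
        MultiZarrToZarr(single_refs[i : i + batch_size], **mzz_kwargs).translate()
        for i in range(0, len(single_refs), batch_size)
    ]
    # Stage 2: combine the partial results into one reference set.
    if len(partials) == 1:
        return partials[0]
    return MultiZarrToZarr(partials, **mzz_kwargs).translate()
```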
What is the status here?
Chatted with @betolink about where we're at, and he identified the following needs: …

Luis and I agree the work in this PR shouldn't depend on these, and that we should break these concerns off into separate issues/discussions/PRs. @jrbourbeau what do you think?

OT: We want to avoid NSIDC being a blocker to this library moving forward as much as possible. We still don't have a very robust governance plan for this library, but it's something we're thinking about. I started a discussion thread: #321
I've shifted my attention to other things for a bit, but this PR just needs a little cleanup and possibly offloading some logic to …
Not from me, I'm not too familiar with earthaccess at all, just trying to help out.
I wonder, is there a foolproof way to know that a particular earthaccess granule is a netCDF4 type, as is assumed here, or some other particular type? As far as I can see, the granule objects have a …
@martindurant I think the URL will be the way, as the metadata is sometimes incomplete. I know kerchunk has been improved since we last tested it with some very nested datasets; I don't think we need to cover all use cases to start experimenting with this. @jrbourbeau is there anything we can do to help merge this one?
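A minimal sketch of that URL-based check, assuming `DataGranule.data_links()` as the source of URLs (the suffix list is an assumption):

```python
def is_probably_hdf5_or_netcdf4(granule) -> bool:
    # Fall back to the file suffix, since the granule metadata is sometimes incomplete.
    suffixes = (".nc", ".nc4", ".h5", ".hdf5", ".he5")
    return any(
        url.lower().endswith(suffixes)
        for url in granule.data_links(access="direct")
    )
```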
(As a heads-up, James is on PTO this week. I suspect he'll respond on Monday.)
I think this is ready to be merged and tested. We would probably need an example notebook and a warning that Kerchunk is experimental, etc., but I'm definitely looking forward to starting the Kerchunking.
def consolidate_metadata(
    granuales: list[earthaccess.results.DataGranule],
If we try to kerchunk and consolidate non-uniform datasets, kerchunk will fail. If we try to kerchunk 1 granule, kerchunk succeeds but `MultiZarrToZarr` requires a variable mapping; if we don't pass one, it fails. For one granule we should use `kerchunk_options={"coo_map": {}}` and then kerchunk stays happy.
Is that the same as not doing any combination at all?
I guess? Or maybe we can just skip the `MultiZarrToZarr` step when the input is only 1 file. I wanted to do a quick test on a granule from https://podaac.jpl.nasa.gov/MEaSUREs-MUR and ran into this issue.
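For illustration, that single-granule short-circuit could be sketched like this (names are illustrative, not the actual implementation in this PR):

```python
from kerchunk.combine import MultiZarrToZarr

def combine_references(single_refs, **kerchunk_options):
    # With a single granule there is nothing to combine, so return its
    # reference set directly instead of calling MultiZarrToZarr.
    if len(single_refs) == 1:
        return single_refs[0]
    return MultiZarrToZarr(single_refs, **kerchunk_options).translate()
```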
Thanks for reviewing @betolink. Like I mentioned offline, I think there are a few additional things we can add here (e.g. …
👍
It can be prohibitively slow to read metadata for datasets from earthaccess for a few reasons (e.g. datasets are netcdf instead of zarr, https access is slow). @betolink has a nice description here https://discourse.pangeo.io/t/avoid-metadata-reads-when-loading-many-similar-netcdf-files/3594/5.
Kerchunk helps the situation by doing a single pass over the dataset and creating a single, consolidated metadata file that can be used to quickly access the metadata for the full dataset in subsequent reads. Additionally, the initial pass over the dataset metadata can be parallelized, which helps further speed things up.
This PR adds a new `earthaccess.consolidate_metadata(...)` utility for integrating `earthaccess` and `kerchunk` together, to make it straightforward for users to go through this metadata consolidation process for the datasets they're interested in. Here's an example of what this looks like in practice:
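A minimal sketch of that usage, based on the `consolidate_metadata` parameters exercised in the review discussion above (the collection, dates, and kerchunk options are illustrative):

```python
import earthaccess

earthaccess.login()

# Example SST collection; any uniform gridded dataset should work similarly.
granules = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    cloud_hosted=True,
    temporal=("2021-01-01", "2021-01-10"),
    count=10,
)

outfile = earthaccess.consolidate_metadata(
    granules,
    outfile="./MUR-metadata.json",              # or an s3:// URL to store it remotely
    access="direct",                            # direct S3 access; assumes running in AWS us-west-2
    kerchunk_options={"concat_dims": "time"},   # assumption: passed through to MultiZarrToZarr
)
```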
cc @betolink for thoughts