
add GRIB2ReferenceRecipe #387

Closed
rsignell-usgs opened this issue Jul 19, 2022 · 9 comments
Comments

@rsignell-usgs

We have HDFReferenceRecipe, but kerchunk also handles GRIB2. Perhaps GRIB2ReferenceRecipe?

@rabernat
Contributor

Or what about a generic ReferenceRecipe? We already track the file type in the input pattern. So we could just figure out the correct class to use on the fly.
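Since the input pattern already records the file type, a generic ReferenceRecipe could pick the right kerchunk scanner on the fly. Here's a hypothetical sketch of that dispatch; the function and mapping names are illustrative, not the actual pangeo-forge API.

```python
# Hypothetical sketch: choose a kerchunk reference scanner based on the
# file type recorded in the input pattern. The mapping is illustrative,
# not the actual pangeo-forge implementation.

def select_scanner(file_type: str) -> str:
    """Return the dotted path of the kerchunk scanner for a file type."""
    scanners = {
        "netcdf4": "kerchunk.hdf.SingleHdf5ToZarr",
        "grib": "kerchunk.grib2.scan_grib",
    }
    try:
        return scanners[file_type]
    except KeyError:
        raise ValueError(f"No reference scanner for file type {file_type!r}")
```

A single recipe class could then stay format-agnostic, with only this lookup varying per input.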

@darothen

This seems pretty useful; I've been playing around with this during the evening, and it looks like fsspec/kerchunk#198 would unblock writing recipes that read from remotely hosted GRIB files. Directly translating the example HRRR concatenation into a recipe here would be a really cool addition. Happy to play around with this and propose a refactor for HDFReferenceRecipe once that upstream PR in kerchunk lands.

@darothen

A quick update - I was able to hack out a minimal working example of using the existing HDFReferenceRecipe class to read GRIB2 files. The demo code is in #390, but note that it requires a very recent build of kerchunk - at least one that includes the significant GRIB utility refactor in fsspec/kerchunk#198, and, for the demo code below, fsspec/kerchunk#204.

In a nutshell, it was pretty straightforward to modify the existing codepaths for scanning GRIB files. A major departure from the HDF formulation is that each GRIB message gets its own reference set, so we need a more elegant way to combine them so that a single GRIB file maps to a single Zarr store. I opted simply to flatten all the messages, assuming the coordinate metadata remains constant, but there are other places this could be fixed, notably in the MultiZarrToZarr utility. I think a simple dataclass that mediates between the fsspec reference specs produced by kerchunk would help here, since there will probably be a variety of reference lists that need to be handled.
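The flattening idea can be illustrated with a simplified sketch. This is not the actual pangeo-forge code, and it assumes coordinate metadata is identical across messages, so the first occurrence of any shared reference key wins.

```python
# Simplified illustration (not the actual pangeo-forge code): kerchunk's
# GRIB2 scanner returns one reference set per message, so a single GRIB
# file must be flattened into one mapping before it can map to a single
# Zarr store. Assumes coordinate metadata is constant across messages,
# so duplicate keys (e.g. shared coordinate arrays) keep their first value.

def flatten_message_refs(message_refs: list) -> dict:
    combined = {"version": 1, "refs": {}}
    for msg in message_refs:
        for key, value in msg.get("refs", {}).items():
            combined["refs"].setdefault(key, value)
    return combined
```

In the real codepath this role is played by kerchunk's combining utilities rather than a hand-rolled merge.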

The other big change would be the actual interface for HDFReferenceRecipe - or whatever it becomes. Given the small inconsistencies between the NetCDF4 reader and the GRIB reader, a good case could be made for separate reference classes, but I'll explore how much overlap there is and see whether a sensible refactoring would allow easy extension for each specialized filetype.

For a quick demo, here's a simple gist which subsets from a HRRR forecast and concatenates the output. This workflow is vastly superior to most I've seen for extracting useful data from the online HRRR archive - it's fast, easy to modify, and very concise thanks to the great mechanics inside pangeo-forge.

@darothen

darothen commented Aug 5, 2022

Another quick update - I started refactoring toward a generic ReferenceRecipe so that the custom code for each file format can be plugged in. In general things "just work", but I'm debugging an issue with s3 paths not being read properly. I'll consolidate the refactor and update tests before asking for reviews on the PR.

@rabernat
Contributor

rabernat commented Aug 7, 2022

Thanks for working on this @darothen!

For reference, we are in the process of completely refactoring the internals of pangeo forge recipes. (See #376 and #391 for latest progress.) That is happening on the beam_refactor branch. So just keep that in mind before choosing to spend a lot of time on refactoring the current code base.

That said, I'm 👍 on doing whatever is necessary to make GRIB work with the current code.

@darothen

darothen commented Aug 8, 2022

No worries @rabernat. This project is really just an excuse for me to get much more comfortable with the internals of fsspec / kerchunk. I don't think this work will result in much more than a simple refactor - but happy to contribute something bigger when the larger library refactors are complete!

@cisaacstern
Member

cisaacstern commented Apr 28, 2023

Just a quick note to say #486 will resolve this issue. 🎉

This integration test, which is directly modeled on @darothen's gist linked in #387 (comment), is passing in CI and seems to confirm that at least a minimally-functional version of the GRIB2 case will soon be supported in our beam-refactor branch.

> A major departure from the HDF formulation is that each GRIB message gets its own reference, so we need a more elegant way to combine this so that a single GRIB file maps to a single Zarr file.

This issue, identified by Daniel above, was indeed the crux of the matter.

We've addressed it in #486 with the precombine_inputs option, demonstrated in the following outline of a GRIB2 pipeline:

```python
with beam.Pipeline(options=options) as p:
    (
        p
        | beam.Create(pattern.items())
        | OpenWithKerchunk(...)
        | CombineReferences(..., precombine_inputs=True)
        | WriteCombinedReference(...)
    )
```

When precombine_inputs=True, the CombineReferences transform first combines the list of per-message references for each individual GRIB2 file into a single reference set, using Kerchunk's MultiZarrToZarr, before combining it with the reference sets from the other files.
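Conceptually, the two-stage combine looks something like the sketch below. The combiner here is a trivial stand-in; the real CombineReferences transform delegates to kerchunk's MultiZarrToZarr, and the function names are illustrative only.

```python
# Conceptual sketch of precombine_inputs (stand-in combiner; the real
# transform uses kerchunk's MultiZarrToZarr). Each input GRIB2 file
# yields a list of per-message reference sets; with
# precombine_inputs=True each list is first collapsed into a single
# reference set before the per-file sets are combined with each other.

def combine(refsets: list) -> dict:
    # stand-in for MultiZarrToZarr(refsets, ...).translate()
    merged = {}
    for refs in refsets:
        merged.update(refs)
    return merged

def combine_references(inputs: list, precombine_inputs: bool = False) -> dict:
    if precombine_inputs:
        # stage 1: self-combine, one reference set per input file
        refsets = [combine(per_file) for per_file in inputs]
    else:
        refsets = [refs for per_file in inputs for refs in per_file]
    # stage 2: combine across files
    return combine(refsets)
```

The point of the first stage is that, by the time references are merged across files, each file already looks like a single Zarr-store-shaped reference set.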

I am certainly a GRIB (and kerchunk) neophyte, so very much welcome feedback on this from Daniel, @rsignell-usgs, or others with deeper domain knowledge. Just now cleaning up #486 with hopes to merge it later today. Assuming you may not get a chance to take a look today, I'd love to follow up post-merge to get both of you to test drive 🚗 💨 these new transforms.

@cisaacstern
Member

Resolved by #486.

See this file for the merged version of the integration test mentioned in previous comment.

Here's an excerpt demonstrating the GRIB2 -> kerchunk pipeline there:

```python
| beam.Create(pattern.items())
| OpenWithKerchunk(
    file_type=pattern.file_type,
    remote_protocol=remote_protocol,
    storage_options=storage_options,
    kerchunk_open_kwargs={"filter": grib_filters},
)
| CombineReferences(
    concat_dims=pattern.concat_dims,
    identical_dims=identical_dims,
    precombine_inputs=True,
)
| WriteCombinedReference(
    target_root=td.name,
    store_name=store_name,
)
```
As mentioned elsewhere, would love feedback on this from anyone working with GRIB2, in the form of issues, bug reports, etc.

@darothen

darothen commented May 1, 2023

Hey @cisaacstern thanks for the awesome update here! It sounds like the solution you landed on is a great path forward.

I'll try to test-drive this with one of my personal data processing pipelines. Not sure when I'll find the time, but I'll try to ping back here with results if I can squeeze in the work some evening this week.

4 participants