Add hypergrib as a grib reader #238
Sounds awesome! My intention is definitely for
Just to flag that, after @emfdavid pointed out that some GRIB datasets contain billions of GRIB messages (chunks), I've changed the design for hypergrib. A kerchunk JSON manifest of a trillion chunks would be about 50 TB! Instead, the plan is that hypergrib will compute the location of each GRIB message on the fly, rather than storing an explicit manifest.

We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels. And maybe some additional metadata for when NWPs are upgraded and, for example, more ensemble members are generated per run.

I'm guessing the lack of a manifest makes it harder for hypergrib to plug into virtualizarr.

The discussion that triggered this design change is here: JackKelly/hypergrib#14 (comment)

The new design is sketched out here: https://github.com/JackKelly/hypergrib/blob/main/design.md
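To make the algorithmic approach concrete, here is a minimal Python sketch. Hypergrib itself is written in Rust, and the path template and index layout below are illustrative assumptions, not hypergrib's actual scheme: the idea is that the file path is a pure function of the coordinate labels, and the byte range comes from the tiny .idx sidecar text file that accompanies many GRIB archives.

```python
from dataclasses import dataclass

@dataclass
class GribChunkLocator:
    """Derive GRIB message locations from coordinate labels alone.

    Illustrative sketch only: the path template and .idx layout here are
    hypothetical, not hypergrib's real scheme.
    """
    path_template: str  # e.g. "s3://bucket/{init:%Y%m%d/%H}/f{step:03d}.grib2"

    def file_for(self, init, step_hours: int) -> str:
        # The file path is a pure function of the coordinate labels,
        # so nothing needs to be stored in a manifest for it.
        return self.path_template.format(init=init, step=step_hours)

    def byte_range_for(self, idx_text: str, variable: str, level: str):
        # An .idx sidecar is a small text file, one line per GRIB message:
        #   "<n>:<start_byte>:<date>:<var>:<level>:..."
        # Scan it to find the requested message's byte range.
        starts = []
        found = None
        for line in idx_text.strip().splitlines():
            parts = line.split(":")
            starts.append(int(parts[1]))
            if parts[3] == variable and parts[4] == level:
                found = len(starts) - 1
        if found is None:
            raise KeyError(f"{variable}/{level} not in index")
        start = starts[found]
        end = starts[found + 1] - 1 if found + 1 < len(starts) else None
        return start, end  # end=None means "read to end of file"
```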
Yikes! It's not impossible to imagine handling this with the virtualizarr stack - it just becomes territory where you need out-of-core computation just to concatenate / save the manifests (i.e. dask/cubed). It's the same scale of problem as processing TBs of numerical array data now.
I wonder if this could be handled by having a
If grib interleaving was predictable that would be interesting. I thought it was pretty random given what we'd seen in AMPS. Why trillions of chunks? Is each time chunk tiny? It seems like it must be a poor format choice (for a specific case) if the index must be so big. Certainly there's a lot of array data out there that should just be a table.
I realize the xarray dataset is required to have a finite number of chunks and finite dimensions (or maybe we can work around this?) but with the ability to compute all the chunks that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate? I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data.
Finite yes, but if you can cheaply compute the chunk references then how is finite a limitation?
You personally might not need to concatenate further for these particular datasets, but I bet you someone is immediately going to be like "I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..." 😄
I know, I'm not suggesting we load data. I'm saying that if you're going to build optimized readers and representations of grib datasets, then, since you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework. The alternative would be creating a 3rd API for creating, joining, and writing virtual references.
Well, I'll admit that I'm not aware of any NWP datasets today that have trillions of GRIB messages 🙂. But there are datasets today that have tens of billions of GRIB messages (see the calculations here), and I could well imagine seeing datasets with trillions of chunks in the next few years.

In general, each "chunk" in GRIB is a 2-dimensional image, where the two dimensions are the horizontal spatial dimensions. For example, a single GRIB message might contain a 2D image which represents a prediction for temperature at the Earth's surface at a particular time.
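Some back-of-the-envelope arithmetic shows how quickly the message count grows when every 2D field is its own chunk. The numbers below are purely illustrative, not taken from any particular product or from the linked calculations:

```python
# Purely illustrative numbers for a hypothetical global ensemble NWP archive;
# every GRIB message holds one 2D field (one variable at one level, for one
# forecast step, one ensemble member, one model run).
runs_per_day     = 4
years_of_archive = 10
forecast_steps   = 85    # e.g. steps out to ~10 days
ensemble_members = 30
fields_per_step  = 100   # variables x vertical levels

messages = (runs_per_day * 365 * years_of_archive
            * forecast_steps * ensemble_members * fields_per_step)
print(f"{messages:,}")   # 3,723,000,000 -> already billions of chunks
```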
I like the sound of this!
Yup, sounds good to me! Not least as a nice way to generate

For @emfdavid's use-case (which is the same as my use-case!), the plan is that
heh - I have been that guy more often than not!
When you open the dataset you need to pick an end date; that defines the number of chunks and the dimensions of the dataset. Could we just pick something like a year in the future from when you call open?

I hacked xreds/xpublish to serve the database-backed grib manifest we build in BigQuery. Reloading it every few minutes to get updates is a major drag. Reloading this new lazy dataset would be much less onerous... but we still need to pick an end date for something we know goes into the future to varying degrees.
Sounds good, but it's worth pointing out here how similar the problems of creating a new xarray backend to read data and writing a virtualizarr "reader" to generate references are. In both cases you look at files and create an array-like, lazy, metadata-only representation of the actual data and its location (either xarray's lazy backend arrays or virtualizarr's ManifestArrays).

In fact, because it's also useful for virtualizarr readers to also be able to actually load variables into memory (see the
Thinking about this more it's perhaps not trivial... The reason concatenation is possible in virtualizarr is because
So IIUC you have datasets that are appended to frequently? And by reloading you mean that you don't want to reindex the entire thing just to append the new chunks? Cheaply appending the virtual references for the new chunks to an existing store is the same problem as discussed in #21 (and something the Earthmover folks should have a better solution for soon...)

I don't see why this requires picking an arbitrary end date in the distant future that hopefully includes all the data you plan to write - instead the data schema should be updated only when you actually do write (/create virtual references to) new data.
If you are creating your own dataset with thousands of chunks, having an explicit manifest is great, and working to make it easy to append/concat is a critical feature for the xarray/zarr toolset. Manifest building and appending datasets are problems that I worked really hard to solve for Camus over the past year. Last week @JackKelly turned the problem on its head, pointing out that the manifest could be computed algorithmically, and I look at the problem completely differently now.

Our goal is working with enormous datasets we don't control, with millions of grib files and billions of chunks put in the cloud by NODD and ECMWF. They add 100,000 chunks a day to some of these datasets, and for operational use I want the latest data as it becomes available. Directly building native zarr datasets seems out of reach for these organizations at the moment, so we are looking for solutions to work with what we have got.

I was planning to advocate for building a public database-backed manifest that could be used with virtualizarr or with kerchunk, as I have been doing privately at Camus for over 6 months now. Based on operational experience, this architecture is difficult and costly to manage. I am sure my work leaves much that could be improved, but I am talking about datasets whose variables have more than 100,000 chunks, with more to append each hour as they land in the cloud bucket. I am using the dataset across dozens of distributed workers running big ML jobs.

Lazily computing the manifest itself is just a very different way of thinking about these big NWP archival datasets that could be much easier. I fully support and appreciate virtualizarr, and we should definitely build bridges between these solutions (materializing a manifest where needed), but I suspect working with an xarray or zarr dataset free of the materialized manifest could be a very powerful solution in its own right. There are lots of pitfalls that will make this paradigm shift difficult working with the existing APIs, so having insight from the folks that built the existing stack would be fantastic. Can we do this - how?
One alternative to chunk-level dataset aggregation is file-level dataset aggregation. This is actually the traditional approach that has been in use for decades (see e.g. ncml). Chunk-level aggregation emerged in the context of cloud computing, where the cost of peeking at each file is high due to the latency of HTTP requests. But if you somehow know a priori exactly where the chunks are in each file, you can avoid that.
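A toy contrast between the two aggregation levels (all paths, keys and sizes below are hypothetical):

```python
# File-level aggregation (ncml-style): the catalogue only maps coordinates to
# whole files; the byte layout is discovered when each file is actually opened.
file_level = {
    ("2024-01-01T00",): "s3://bucket/run_2024010100.grib2",
    ("2024-01-01T06",): "s3://bucket/run_2024010106.grib2",
}

# Chunk-level aggregation (kerchunk/VirtualiZarr-style): the manifest maps every
# chunk key to an exact (file, offset, length), so no per-file peeking is needed
# at read time -- but the manifest itself must be built and stored up front.
chunk_level = {
    "temperature/0.0.0": {"path": "s3://bucket/run_2024010100.grib2",
                          "offset": 0, "length": 524_288},
    "temperature/1.0.0": {"path": "s3://bucket/run_2024010106.grib2",
                          "offset": 0, "length": 524_288},
}
```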
I'm not really sure what you mean in this context @rabernat. The client still needs to know where inside the file to read from, so we're back to chunks aren't we?

If anyone wants to chat about this, we do have a bi-weekly virtualizarr meeting starting in a couple of minutes!

EDIT: In the unlikely event that you are trying to join today please say so - having some zoom issues.
Can't make it this week - but I will try to join next time.
I was trying to join and couldn't get in and just saw this. So if you happen to see this, I am willing to join!
Yes, but there are many ways you can provide that information:
If you knew this pattern for certain, couldn't you just make an implementation of the zarr Store API which implements chunk reads by computing each chunk's location from that pattern on the fly? Then you open that Zarr store with whatever tool you like to do the processing.

A similar idea would be to have "functionally-derived virtual chunk manifests" in icechunk.
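A rough sketch of that idea, written against zarr-python v2's MutableMapping-style store interface (the zarr v3 Store ABC differs); the locate callable is a stand-in for whatever pattern the dataset actually follows, not a real API:

```python
from collections.abc import Mapping
import fsspec

class ComputedGribStore(Mapping):
    """Read-only, zarr-v2-style store: chunk bytes are located by a pure
    function of the chunk key instead of a stored manifest.

    Sketch only -- `locate` stands in for whatever pattern the dataset follows.
    """

    def __init__(self, locate, metadata: dict):
        self._locate = locate      # chunk key -> (url, offset, length)
        self._meta = metadata      # .zgroup/.zarray/.zattrs documents as bytes
        self._fs = fsspec.filesystem("s3", anon=True)

    def __getitem__(self, key: str) -> bytes:
        if key in self._meta:      # metadata keys are served from memory
            return self._meta[key]
        url, offset, length = self._locate(key)
        # One ranged GET per chunk; raising KeyError here would make zarr
        # substitute the array's fill_value instead of crashing.
        return self._fs.cat_file(url, start=offset, end=offset + length)

    def __iter__(self):
        return iter(self._meta)    # only metadata keys are enumerable

    def __len__(self):
        return len(self._meta)
```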
Actually on second thoughts I think this:
is effectively just one possible implementation of the original idea of @JackKelly's above:
tl;dr: Can we just abstract the indirection as a single mapper function, and associate that with a dedicated Zarr Store implementation?

Taxonomy of datasets

To follow that last thought to its logical conclusion, and build on @rabernat's taxonomy, let me first step back a bit. We're all starting from a hypercube of array data stored across many files/objects. We know that in general zarr-like access to each chunk of data can be expressed in terms of byte range requests to regions of specific files (ignoring limitations of the current zarr model such as uniform codecs/chunk shapes etc. for now). I think there are only 3 possibilities for how the details of these byte range requests relate to the hypercube of files, depending on how structured the files are:

1. Fully structured: the filename, byte range and offset can all be deduced cheaply from the chunk key alone.
2. Semi-structured, cheaply discoverable: the filenames follow a known pattern, and the byte ranges can be determined quickly on demand.
3. Unstructured: neither can be computed cheaply, so every file must be scanned up front.
(3) is the general problem for messy or composite data. It's what VirtualiZarr and Icechunk's concept of a static manifest assumes. You have no choice but to pay the cost of scanning all the files up front and saving all the information for all of them. Of course you can still exploit any remaining patterns to optimise the storage of that manifest through compression, which @paraseba has been thinking about for Icechunk (see earth-mover/icechunk#430).

(2) is less common but clearly does exist - it's the case hypergrib is intended for. Here you know the pattern in the file names a priori, so storing those explicitly would be redundant. You also have some special way of determining the byte ranges for any file quickly, but (as in the hypergrib case) it's more complicated than just computing them analytically given only the chunk key.

(1) is a rare case but is technically possible. In this case you literally only need to be told the chunk key and you then have enough information to immediately deduce the filename, byte range and offset. An example would be an organised set of files each having the same format and schema, with no data compression - the only difference is their values, not the locations of those values within each file. Maybe some GRIB datasets fit this? Some netCDF3 datasets might. Actually I guess (uncompressed?) native Zarr would be an example of this, where the mapping is just the identity function and the byte range is all the bytes in the chunk.

There's also technically a 4th case - semi-structured, but not cheaply discoverable. The [C]Worthy OAE dataset (see #132) is an example of this, because the filenames have a predictable pattern, but the byte ranges and offsets have to be discovered. (Aside: the original version of that dataset was in netCDF3, so maybe even an example of (1)!) But this case isn't worth treating separately, because you'll still need an explicit manifest to hold all the byte ranges and offsets like you do for (3); it's just that the filenames in your manifest have some pattern - and will therefore probably compress well anyway.

Commonality: Functional Mapping

Accessing (1) and (2) can both be thought of as reading from a special Zarr store whose chunk reads call a cheap mapping function instead of consulting a stored manifest. (3) is different because in that case you can't even write such a cheap function in general; the best you can do is create a lookup table. In other words you have no choice but to pre-compute that (expensive) function for every file, i.e. scan all the files with kerchunk, arrange them in the hypercube, and record the results in a manifest. This is what VirtualiZarr does. Then at read time you still need to re-execute that function, but it's now just a lookup of your stored manifest (which is what Icechunk / kerchunk's xarray backend does), so now it's way faster.
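In code, that "single mapper function" abstraction might be nothing more than this signature, with the three cases differing only in how it is implemented. All names below are illustrative, not an existing API:

```python
from typing import Callable, NamedTuple

class ByteRange(NamedTuple):
    url: str
    offset: int
    length: int

# The shared abstraction: chunk key in, byte range out.
ChunkMapper = Callable[[str], ByteRange]

def fully_structured_mapper(key: str) -> ByteRange:
    """Case (1): filename, offset and length are all analytic functions of the
    chunk key (think fixed-size uncompressed records)."""
    t, y, x = map(int, key.rsplit("/", 1)[-1].split("."))
    record = 1_000_000  # hypothetical fixed record size in bytes
    return ByteRange(f"s3://bucket/{t:06d}.bin", offset=(y * 10 + x) * record, length=record)

def make_semi_structured_mapper(read_idx: Callable[[str], dict]) -> ChunkMapper:
    """Case (2), the hypergrib case: the filename is analytic, but the byte range
    needs a cheap on-demand lookup (e.g. parsing a GRIB .idx sidecar, supplied
    here as the `read_idx` callable)."""
    def mapper(key: str) -> ByteRange:
        t = key.rsplit("/", 1)[-1].split(".")[0]
        url = f"s3://bucket/{t}.grib2"     # analytic part
        return read_idx(url)[key]          # cheaply discovered part
    return mapper

def make_unstructured_mapper(manifest: dict) -> ChunkMapper:
    """Case (3): no cheap function exists; the 'function' degenerates into a
    lookup of a pre-computed manifest (what VirtualiZarr / Icechunk store)."""
    return manifest.__getitem__
```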
Oh and also @ayushnag then your
Wow! That's a very impressive write-up! And I love that even your "pseudo-code" is rigorously annotated with type hints! I love the idea of trying to create a "useful abstraction common to both cases (1) and (2)", and of all the advantages that you list.

An issue to be aware of: Non-atomic updates of datasets
One issue that @emfdavid brought up is that weather forecast providers don't update their datasets atomically. So it's common to find missing / incomplete files when reading the folder containing the latest weather forecasts. So we'll need to handle this edge case. I haven't yet given this much thought, other than the (obvious) realisation that crashing is not acceptable when trying to read the latest weather forecasts! Although I'm guessing that wasn't really what you had in mind when you mentioned "rogue files"?

Next steps
Related
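One possible way to keep reads from crashing on a half-written forecast, sketched against zarr-python v2's convention that a KeyError for a chunk key makes zarr return the array's fill_value. The function below is illustrative, not part of any existing library:

```python
import fsspec

def fetch_chunk_or_missing(fs: fsspec.AbstractFileSystem,
                           url: str, offset: int, length: int) -> bytes:
    """Fetch one chunk's bytes, treating not-yet-uploaded or partially written
    GRIB files as missing chunks rather than hard errors."""
    try:
        data = fs.cat_file(url, start=offset, end=offset + length)
    except FileNotFoundError:
        # Forecast file not uploaded yet -> surface as a missing chunk, which
        # zarr-python v2 turns into fill_value (e.g. NaN) instead of crashing.
        raise KeyError(url)
    if len(data) < length:
        # File exists but is still being written -> also treat as missing.
        raise KeyError(url)
    return data
```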
One other quick thought:
I realise that, in the past, I thought that we'd host a continually-updated metadata description of each NWP dataset. However, more recently, it's become fairly clear that that isn't necessary. The only dimension that changes regularly is the reference datetime. And NWPs usually use a "folder" per reference datetime. And, in Rust, it only takes a fraction of a second to list all those folders directly from the cloud storage bucket. So my current plan is that hypergrib will just list those folders from the bucket when the dataset is opened.
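For example, in Python with fsspec (the bucket name and prefix layout below are only illustrative of a NODD-style layout, not a guaranteed path):

```python
import fsspec

# Illustrative only: list the per-reference-datetime "folders" straight from
# the bucket at open time, instead of maintaining a hosted metadata file.
fs = fsspec.filesystem("s3", anon=True)
prefixes = fs.ls("noaa-gefs-pds/", detail=False)  # e.g. [..., "noaa-gefs-pds/gefs.20241001", ...]
init_dates = sorted(p.rsplit(".", 1)[-1] for p in prefixes if "gefs." in p)
print(init_dates[-3:])  # the newest reference datetimes define the time axis
```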
I told @joshmoore about the
I just learned about hypergrib (via this blog post).
I don't know much about how GRIB files are structured, but if @JackKelly makes a very performant GRIB reader that can create ChunkManifest objects and associated metadata (ideally using ChunkManifest.from_arrays() to avoid any intermediate representation), then we could add it as a new reader to virtualizarr.readers.

In fact FYI my intention now is to probably make an entrypoint system for virtualizarr, in addition to plugging in to xarray's backend entrypoint system. That would allow this syntax for creating virtual xr.Dataset objects containing ManifestArrays:
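A rough guess at what that syntax might look like; the filetype value and the dispatch to a hypergrib-backed reader are placeholders, not a settled API:

```python
import virtualizarr as vz

# Hypothetical sketch of entrypoint-style usage: the path is made up, and
# "grib" dispatching to a hypergrib-backed reader is an assumption, not an
# agreed interface.
vds = vz.open_virtual_dataset(
    "s3://bucket/gefs.20241001/00/atmos/pgrb2ap5/gep01.t00z.pgrb2a.0p50.f003",
    filetype="grib",
)

# The resulting virtual dataset could then be persisted like any other,
# e.g. as kerchunk references.
vds.virtualize.to_kerchunk("combined.json", format="json")
```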