
Add hypergrib as a grib reader #238

Open
TomNicholas opened this issue Sep 13, 2024 · 23 comments
Labels: enhancement · references generation · upstream issue · usage example

Comments

@TomNicholas
Member

I just learned about hypergrib (via this blog post).

I don't know much about how GRIB files are structured, but if @JackKelly makes a very performant GRIB reader that can:

  1. be called from python,
  2. return ChunkManifest objects and associated metadata (ideally using ChunkManifest.from_arrays() to avoid any intermediate representation),

then we could add it as a new reader to virtualizarr.readers.

In fact, FYI, my intention now is probably to make an entrypoint system for virtualizarr, in addition to plugging into xarray's backend entrypoint system. That would allow this syntax for creating virtual xr.Dataset objects containing ManifestArrays:

vds = xr.open_dataset(grib_paths, engine='virtualizarr', reader='hypergrib')
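
For point (2), here's a minimal sketch of the kind of thing such a reader could hand back, assuming ChunkManifest.from_arrays accepts per-chunk paths/offsets/lengths arrays of equal shape (the filenames, byte ranges, and grid layout below are made up):

import numpy as np

from virtualizarr.manifests import ChunkManifest

# hypothetical output of scanning one GRIB variable laid out on a 2x2 grid of
# (init_time, step) chunks - all paths, offsets, and lengths are invented
paths = np.array(
    [["s3://bucket/gfs/t00z.grib2", "s3://bucket/gfs/t00z.grib2"],
     ["s3://bucket/gfs/t06z.grib2", "s3://bucket/gfs/t06z.grib2"]]
)
offsets = np.array([[0, 1_000_000], [0, 1_000_000]], dtype="uint64")
lengths = np.array([[1_000_000, 998_244], [1_000_000, 998_244]], dtype="uint64")

# avoids building any intermediate dict-of-dicts representation
manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
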
@JackKelly

Sounds awesome! My intention is definitely for hypergrib to play nicely with the existing ecosystem, including xarray and virtualizarr.

But hypergrib won't do anything vaguely useful for at least a few months. When I'm a bit further down the line, I'll get back in touch, so we can discuss the optimal way to plug hypergrib into virtualizarr 🙂

@JackKelly

JackKelly commented Oct 1, 2024

Just to flag that, after @emfdavid pointed out that some GRIB datasets contain billions of GRIB messages (chunks), I've changed the design for hypergrib so that it no longer stores a manifest of chunks. After doing some more research, it seems likely that we'll soon see NWP GRIB datasets with trillions of chunks.

A kerchunk JSON manifest of a trillion chunks would be about 50 TB!

Instead, the plan is that hypergrib will generate GRIB filenames by exploiting the fact that most public GRIB datasets have well-defined file naming conventions. So hypergrib will load the .idx files on-the-fly. If that turns out to be too slow then we'll explore ways to store the .idx metadata in a more compact and cloud-friendly form.

We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels. And maybe some additional metadata for when NWPs are upgraded and, for example, more ensemble members are generated per run.
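
To make the on-the-fly .idx idea concrete, here's a rough sketch (not hypergrib's actual code) of turning a NOAA-style index file into byte ranges, assuming the usual colon-separated layout where each line looks like "4:112233:d=2024010100:TMP:2 m above ground:anl:" and the length of each message is the gap to the next offset:

def parse_idx(idx_text: str) -> dict[tuple[str, str, str], tuple[int, int | None]]:
    """Map (variable, level, step) -> (byte offset, byte length) for one GRIB file."""
    rows = []
    for line in idx_text.strip().splitlines():
        _msg_num, offset, _date, var, level, step, *_ = line.split(":")
        rows.append((int(offset), var, level, step))

    byte_ranges = {}
    for i, (offset, var, level, step) in enumerate(rows):
        # length of message i is the gap to the next message's offset;
        # the final message just runs to the end of the GRIB file
        length = rows[i + 1][0] - offset if i + 1 < len(rows) else None
        byte_ranges[(var, level, step)] = (offset, length)
    return byte_ranges
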

I'm guessing the lack of a manifest makes hypergrib less interesting to VirtualiZarr?? (Sorry!)

The discussion that triggered this design change is here: JackKelly/hypergrib#14 (comment)

The new design is sketched out here: https://github.com/JackKelly/hypergrib/blob/main/design.md

@TomNicholas
Member Author

datasets with trillions of chunks

Yikes! It's not impossible to imagine handling this with the virtualizarr stack - it just becomes territory where you need out-of-core-computation just to concatenate / save the manifests (i.e. dask/cubed). It's the same scale of problem as processing TB's of numerical array data now.

Instead, the plan is that hypergrib will generate GRIB filenames by exploiting the fact that most public GRIB datasets have well-defined file naming conventions.

We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels.

I wonder if this could be handled by having a ChunkManifest backed by numpy-like arrays which are actually lazy generators. Similar in spirit to the "functional/analytic coordinates" being discussed in xarray. I think this is the same idea as your linked discussion about "algorithmically" calculating the paths/offsets/lengths, I'm just suggesting wrapping that algorithm inside a numpy-like array / ChunkManifest object, such that you can still take advantage of virtualizarr & xarray for their IO & concatenation features.
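
To illustrate what I mean by a lazy generator (everything here is invented, not an existing VirtualiZarr class), the path "array" could look something like:

import numpy as np

class LazyPathArray:
    """Toy duck array whose elements are generated on demand from a filename
    convention, rather than being stored (not an existing VirtualiZarr class)."""

    def __init__(self, shape: tuple[int, ...], template: str):
        self.shape = shape
        self.dtype = np.dtype("object")  # a real implementation might use a string dtype
        self.template = template

    def __getitem__(self, idx: tuple[int, int]) -> str:
        init_idx, step_idx = idx
        # compute the GRIB filename from the chunk indices (naming convention invented)
        return self.template.format(init=init_idx, step=step_idx)


paths = LazyPathArray(
    shape=(365, 48),
    template="s3://bucket/nwp/init{init:03d}/step{step:02d}.grib2",
)
print(paths[4, 7])  # -> "s3://bucket/nwp/init004/step07.grib2"
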

@mdsumner
Contributor

mdsumner commented Oct 1, 2024

If grib interleaving was predictable that would be interesting. I thought it was pretty random given what we'd seen in AMPS.

Why trillions of chunks? Is each time chunk tiny? It seems like it must be a poor format choice (for a specific case) if the index must be so big. Certainly there's a lot of array data out there that should just be a table.

@emfdavid

emfdavid commented Oct 3, 2024

I'm just suggesting wrapping that algorithm inside a numpy-like array / ChunkManifest object, such that you can still take advantage of virtualizarr & xarray for their IO & concatenation features.

I realize the xarray dataset is required to have a finite number of chunks and finite dimensions (or maybe we can work around this?), but with the ability to compute all the chunks that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate?

I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data.

@TomNicholas
Member Author

TomNicholas commented Oct 3, 2024

finite number of chunks and finite dimensions

Finite yes, but if you can cheaply compute the chunk references then how is finite a limitation?

with the ability to compute all the chunks that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate?

You personally might not need to concatenate further for these particular datasets, but I bet you someone is immediately going to be like "I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..." 😄

I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data.

I know, I'm not suggesting we load data. I'm saying that if you're going to build optimized readers and representations of grib datasets, then, since you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework. The alternative would be creating a 3rd API for creating, joining, and writing virtual references.

@JackKelly

Why trillions of chunks?

Well, I'll admit that I'm not aware of any NWP datasets today that have trillions of GRIB messages 🙂. But there are datasets today that have tens of billions of GRIB messages (see the calculations here), and I could well imagine seeing datasets with trillions of chunks in the next few years.

In general, each "chunk" in GRIB is a 2-dimensional image, where the two dimensions are the horizontal spatial dimensions. For example, a single GRIB message might contain a 2D image which represents a prediction for temperature at the Earth's surface at a particular time.
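
As a rough, made-up illustration of the scale (not the figures from the linked calculation), the per-message chunking multiplies out quickly for an ensemble archive:

# Rough, made-up numbers for a GEFS-like ensemble archive, just to show the scale:
init_times = 4 * 365 * 20      # 4 runs per day for 20 years
ensemble_members = 31
forecast_steps = 85
variables_x_levels = 300       # each (variable, level) pair is its own 2D GRIB message

grib_messages = init_times * ensemble_members * forecast_steps * variables_x_levels
print(f"{grib_messages:,}")    # ~23 billion messages, i.e. ~23 billion chunks
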

@JackKelly

I wonder if this could be handled by having a ChunkManifest backed by numpy-like arrays which are actually lazy generators.

I like the sound of this!

if you're going to build optimized readers and representations of grib datasets, then, since you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework.

Yup, sounds good to me! Not least as a nice way to generate VirtualiZarr / kerchunk manifests from hypergrib!

For @emfdavid's use-case (which is the same as my use-case!), the plan is that hypergrib will plug directly into xarray as a new backend (and will never "materialise" a manifest... instead hypergrib will always generate GRIB filenames on-the-fly). But I do see @TomNicholas' point that some folks will want to merge/concat/drop virtual variables and output kerchunk/VirtualiZarr manifests; and I'm very keen for hypergrib to plug into the existing Pangeo stack as easily as possible 🙂.

@emfdavid

emfdavid commented Oct 3, 2024

"I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..."

heh - I have been that guy more often than not!

how is finite a limitation?

When you open the dataset you need to pick an end date, which defines the number of chunks and the dimensions of the dataset. Could we just pick something like a year in the future from when you call open? I hacked xreds/xpublish to serve the database-backed grib manifest we build in BigQuery. Reloading it every few minutes to get updates is a major drag. Reloading this new lazy dataset would be much less onerous... but we still need to pick an end date for something we know goes into the future to varying degrees.

@TomNicholas
Member Author

TomNicholas commented Oct 3, 2024

hypergrib will plug directly into xarray as a new backend (and will never "materialise" a manifest... instead hypergrib will always generate GRIB filenames on-the-fly)

Sounds good, but it's worth pointing out here how similar the problems of creating a new xarray backend to read data and writing a virtualizarr "reader" to generate references are. In both cases you look at files, create an array-like lazy metadata-only representation of the actual data and its location (either xarray's BackendArray or VZ's ManifestArray), and have a mechanism for materializing that data when you actually want it (for virtualizarr this mechanism usually involves writing those virtual references to some persistent kerchunk/zarr format first).

In fact, because it's also useful for virtualizarr readers to be able to actually load variables into memory (see the loadable_variables kwarg), the virtualizarr reader for a given filetype ends up being kind of a superset of the corresponding xarray backend for that filetype. As mentioned above, I plan to make this correspondence more explicit in the future by actually exposing an entrypoint system in virtualizarr - see #245.

merge/concat/drop virtual variables

Thinking about this more it's perhaps not trivial... The reason concatenation is possible in virtualizarr is because np.concat(list[ManifestArray, ...]) -> ManifestArray, which is ultimately possible because you can concatenate two ChunkManifest objects into a single new ChunkManifest object. But it's not clear to me that your analytic inflatable representation will be capable of that. That doesn't mean that this approach isn't useful, just that it's plausible that you end up with a representation where virtual variables can be opened, merged, dropped and serialized, but not concatenated. Note that xarray does not have a general solution to this problem either (pydata/xarray#4628). See also #246 for a generalized discussion of Generic Manifests.
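
For reference, the manifest-backed concatenation that does work today looks roughly like this (assuming a GRIB-capable virtualizarr reader existed; filenames are illustrative):

import xarray as xr
from virtualizarr import open_virtual_dataset

# each file becomes a virtual dataset wrapping ManifestArrays
# (this assumes a GRIB-capable reader exists, which is what this issue is about)
vds1 = open_virtual_dataset("day1.grib")
vds2 = open_virtual_dataset("day2.grib")

# works because two ChunkManifests can be concatenated into one new ChunkManifest
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# persist the combined references, e.g. as kerchunk JSON
combined.virtualize.to_kerchunk("combined.json", format="json")
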

but we still need to pick an end date for something we know goes into the future to varying degrees.

So IIUC you have datasets that are appended to frequently? And by reloading you mean that you don't want to reindex the entire thing just to append the new chunks? Cheaply appending the virtual references for the new chunks to an existing store is the same problem as discussed in #21 (and something the Earthmover folks should have a better solution for soon...) I don't see why this requires picking an arbitrary end date in the distant future that hopefully includes all the data you plan to write - instead the data schema should be updated only when you actually do write (/create virtual references to) new data.

@emfdavid

emfdavid commented Oct 4, 2024

If you are creating your own dataset with thousands of chunks, having an explicit manifest is great, and working to make it easy to append/concat is a critical feature for the xarray/zarr toolset. Manifest building and appending datasets are problems that I worked really hard to solve for Camus over the past year. Last week @JackKelly turned the problem on its head, pointing out that the manifest could be computed algorithmically, and I look at the problem completely differently now.

Our goal is working with enormous datasets we don't control, with millions of grib files and billions of chunks put in the cloud by NODD and ECMWF. They add 100,000 chunks a day to some of these datasets, and for operational use I want the latest data as it becomes available. Directly building native zarr datasets seems out of reach for these organizations at the moment, so we are looking for solutions to work with what we have got. I was planning to advocate for building a public database-backed manifest that could be used with virtualizarr or with kerchunk, as I have been doing privately at Camus for over 6 months now. Based on operational experience, this architecture is difficult and costly to manage. I am sure my work leaves much that could be improved, but I am talking about datasets whose variables have more than 100,000 chunks, with more to append each hour as they land in the cloud bucket. I am using the dataset across dozens of distributed workers running big ML jobs.

Lazily computing the manifest itself is just a very different way of thinking about these big NWP archival datasets that could be much easier. I fully support and appreciate virtualizarr, and we should definitely build bridges between these solutions (materializing a manifest where needed), but I suspect working with an xarray or zarr dataset free of the materialized manifest could be a very powerful solution in its own right. There are lots of pitfalls that will make this paradigm shift difficult working with the existing APIs, so having insight from the folks that built the existing stack would be fantastic. Can we do this - how?

@rabernat
Collaborator

rabernat commented Oct 4, 2024

One alternative to chunk-level dataset aggregation is file-level dataset aggregation. This is actually the traditional approach that has been in use for decades (see e.g. ncml). Chunk-level aggregation emerged in the context of cloud computing, where the cost of peeking at each file is high due to the latency of HTTP requests. But if you somehow know a-priori exactly where the chunks are in each file, you can avoid that.

@TomNicholas
Member Author

TomNicholas commented Oct 4, 2024

file-level dataset aggregation.

I'm not really sure what you mean in this context @rabernat . The client still needs to know where inside the file to read from, so we're back to chunks aren't we?

If anyone wants to chat about this we do have a bi-weekly virtualizarr meeting starting in a couple of minutes!

EDIT: In the unlikely event that you are trying to join today please say so - having some zoom issues.

@emfdavid

emfdavid commented Oct 4, 2024

Can't make it this week - but I will try to join next time.

@mpiannucci
Contributor

mpiannucci commented Oct 4, 2024

I was trying to join and couldn't get in, and just saw this. So if you happen to see this, I am willing to join!

@rabernat
Collaborator

rabernat commented Oct 4, 2024

The client still needs to know where inside the file to read from, so we're back to chunks aren't we?

Yes, but there are many ways you can provide that information:

  • Via explicit chunk references (how Kerchunk / VirtualiZarr work)
  • By opening the file and looking at the metadata at read time rather than as a preprocessing step. This is what happens if you do open_mfdataset(list_of_hdf_files, engine='h5netcdf'). It's slow, but could be optimized further (see e.g. hidefix).
  • By somehow knowing a-priori where the chunks are without having to open the file first (as proposed by hypergrib).

@TomNicholas
Member Author

By somehow knowing a-priori where the chunks are without having to open the file first (as proposed by hypergrib).

If you knew this pattern for certain, couldn't you just make an implementation of the zarr Store API which implements Store.get by translating chunk key requests into the grib-dataset-specific paths and byte ranges for that chunk, following some functional mapping known a priori? It's a similar idea to what @ayushnag has proposed in #124, but specific to one dataset.

Then you open that Zarr store with whatever tool you like to do the processing.

A similar idea would be to have "functionally-derived virtual chunk manifests" in icechunk.

@TomNicholas
Member Author

Actually on second thoughts I think this:

zarr Store API which implements Store.get by translating chunk key requests into the grib-dataset-specific paths and byte ranges for that chunk, following some functional mapping known a priori?

is effectively just one possible implementation of the original idea of @JackKelly 's above:

the plan is that hypergrib will plug directly into xarray as a new backend (and will never "materialise" a manifest... instead hypergrib will always generate GRIB filenames on-the-fly)

@TomNicholas added the usage example label Jan 12, 2025
@TomNicholas
Member Author

TomNicholas commented Jan 12, 2025

tl;dr: Can we just abstract the indirection as a single mapper function, and associate that with a dedicated Zarr Store implementation?

Taxonomy of datasets

To follow that last thought to its logical conclusion, and build on @rabernat's taxonomy, let me first step back a bit.

We're all starting from a hypercube of array data stored across many files/objects. We know that in general zarr-like access to each chunk of data can be expressed in terms of byte range requests to regions of specific files (ignoring limitations of the current zarr model such as uniform codecs/chunk shapes etc. for now).

I think there are only 3 possibilities for how the details of these byte range requests relate to the hypercube of files, depending on how structured the files are:

  1. Structured: There's a known pattern to the filenames and to the byte offsets / lengths (e.g. because the filenames follow a convention and every file has identical internal structure).

  2. Semi-structured, but cheaply discoverable: There's a known pattern to the filenames, but the byte offsets/lengths could potentially be different for every file. However, given a filename there is some quick way to discover what the byte ranges/offsets are for that file. This might be a simple analytic function or it might involve touching a resource (e.g. quickly reading a .idx file at a known or deducible location).

  3. Unstructured: There's no pattern in the filenames or byte offsets / lengths at all, so all 3 could potentially be different for every Zarr chunk in the hypercube.

(3) is the general problem for messy or composite data. It's what VirtualiZarr and Icechunk's concept of a static manifest assumes. You have no choice but to pay the cost of scanning all the files up front, and saving all the information for all of them. Of course you can still exploit any remaining patterns to optimise the storage of that manifest through compression, which @paraseba has been thinking about for Icechunk (see earth-mover/icechunk#430).

(2) is less common but clearly does exist - it's the case hypergrib is intended for. Here you know the pattern in the file names a priori, so storing those explicitly would be redundant. You also have some special way of determining the byte ranges for any file quickly, but (as in the hypergrib case) it's more complicated than just computing them analytically given only the chunk key.

(1) is a rare case but is technically possible. In this case you literally only need to be told the chunk key and you then have enough information to immediately deduce the filename, byte range and offset. An example would be an organised set of files each having the same format, schema, and with no data compression - the only difference is their values, not the locations of those values within each file. Maybe some GRIB datasets fit this? Some netCDF3 datasets might. Actually I guess (uncompressed?) native Zarr would be an example of this, where the mapping is just the identity function, and the byte range is all the bytes in the chunk.

There's also technically a 4th case - Semi-structured, but not cheaply discoverable. The [C]Worthy OAE dataset (see #132) is an example of this, because the filenames have a predictable pattern, but the byte ranges and offsets have to be discovered. (Aside: the original version of that dataset was in netCDF3, so maybe even an example of (1)!) But this case isn't worth treating separately because you'll still need an explicit manifest to hold all byte ranges and offsets like you do for (3); it's just that the filenames in your manifest have some pattern - and will therefore probably compress well anyway.

Commonality: Functional Mapping

Accessing (1) and (2) can both be thought of as reading from a special Zarr store whose implementation of .get calls some cheap function to determine the filename and byte ranges from the requested chunk key. (1) is the case where this function is totally analytic, whereas (2) is the case where the function itself has to do some more work, e.g. by reaching out and touching a file.

(3) is different because in that case you can't even write such a cheap function in general; the best you can do is create a lookup table. In other words you have no choice but to pre-compute that (expensive) function for every file, i.e. scan all the files with kerchunk, arrange them in the hypercube, and record the results in a manifest. This is what VirtualiZarr does. Then at read time you still need to re-execute that function, but it's now just a lookup of your stored manifest (what Icechunk / kerchunk's xarray backend does), so now it's way faster.
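
In other words, for case (3) the "function" degenerates into a lookup over a precomputed table - something like this (illustrative only):

# Illustrative only: for case (3) the best "mapper function" you can write is a lookup
# over a precomputed manifest, one entry per chunk - which is why it grows with the dataset.
manifest: dict[str, tuple[str, int, int]] = {
    "temperature/c/0/0": ("s3://bucket/file_0001.grib", 0, 1_000_000),
    "temperature/c/0/1": ("s3://bucket/file_0002.grib", 512, 998_244),
    # ...billions more entries for a large NWP archive
}

def lookup_mapper_fn(root: str, key: str) -> tuple[str, tuple[int, int]]:
    path, offset, length = manifest[key]
    return path, (offset, length)  # (offset, length) standing in for a zarr ByteRequest
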

FunctionalStore abstraction

Knowing this, can we provide a useful abstraction common to both cases (1) and (2)? I think so!

Let's call it a FunctionalStore for now (as it's similar to the proposal to create "functional indexes" for xarray - see pydata/xarray#3620 and pydata/xarray#9543):

Pseudocode:

from typing import TYPE_CHECKING, Callable, TypeAlias, Any

from zarr.abc.store import ByteRequest, Store
from zarr.storage import FsspecStore
from zarr.core.buffer import Buffer, BufferPrototype


if TYPE_CHECKING:
    # only needed for type hints
    from fsspec.asyn import AsyncFileSystem
    from zarr.core.chunk_grids import ChunkGrid

    from xarray import Dataset


# zarr chunk keys are just strings of the form 'foo/c/1/2/3'
ChunkKey: TypeAlias = str

MapperFn: TypeAlias = Callable[[str, ChunkKey], tuple[str, ByteRequest]]


# could it possibly inherit from FsspecStore? But it needs to be read-only...
class FunctionalStore(Store):
    """
    Wraps a read-only FsspecStore with an additional layer of indirection, stored as a function mapping a chunk key to an arbitrary byte range, which will be evaluated at read-time.
    
    Instances are intended for pointing at a specific dataset in persistent storage, which can be mapped to the zarr model in some relatively cheap but dataset-specific way, represented by the ``mapper_fn``.

    Parameters
    ----------
    mapper_fn: Callable[[str, ChunkKey], tuple[str, ByteRequest]]
        Function mapping a zarr key (of the form 'foo/c/1/2/3') and the top-level directory of an fsspec filesystem to the full path to the corresponding chunk, as well as the byte range that needs to be read.
    fs: fsspec.AsyncFileSystem
    path: str
        Root of the filesystem
    """

    _mapper_fn: MapperFn
    _fsspec_store: FsspecStore

    supports_writes: bool = False
    supports_deletes: bool = False
    supports_partial_writes: bool = False
    supports_listing: bool = True

    def __init__(
        self,
        mapper_fn: MapperFn,
        fs: AsyncFileSystem,
        path: str = "/",   
    ) -> None:
        self._fsspec_store = FsspecStore(
            fs=fs,
            read_only=True,
            path=path,
        )
        self._mapper_fn = mapper_fn

    @property
    def mapper_fn(self) -> MapperFn:
        return self._mapper_fn

    async def get(
        self, 
        key: str,
        prototype: BufferPrototype,
        byte_range: ByteRequest | None = None,
    ) -> Buffer | None:

        if byte_range is not None:
            # I don't think this would work unless you're grabbing the whole chunk, 
            # but I'm not sure exactly what ByteRequest does yet tbh
            raise ValueError

        # compute relative path to the file containing this chunk, and the byte range info for it
        redirected_path, redirected_byte_range = self._mapper_fn(self._fsspec_store.path, key)

        # now we can just get the data the normal way.
        # here the implementation would basically just call the innards of the existing `async FsspecStore.get` method - everything after _dereference_path should be the same.
        # alternatively we could wrap AsyncFilesystem instead of a whole FsspecStore, but I think you get the idea by now.
        return await self._fsspec_store._get(
            path=redirected_path, 
            prototype=prototype,
            byte_range=redirected_byte_range,
        )

    # other read-only Store methods implemented similarly

    @classmethod
    def from_url(
        cls,
        mapper_fn: MapperFn,
        url: str,
        storage_options: dict[str, Any] | None = None,
    ) -> "FunctionalStore":
        """Create a FunctionalStore from a URL."""

        fsspec_store = FsspecStore.from_url(
            url=url,
            storage_options=storage_options,
            read_only=True,
        )

        obj = cls.__new__(cls)
        obj._fsspec_store = fsspec_store
        obj._mapper_fn = mapper_fn
        return obj

    def to_virtual_dataset(self) -> "Dataset":
        """
        Constructs a virtual dataset containing explicit chunk manifests for all arrays in the store.

        Requires computing the mapper function for all keys in the store.
        """
        from virtualizarr.manifests import ChunkManifest, ManifestArray

        # implementation would just evaluate `mapper_fn` once for every chunk in the store (asynchronously) and enter the results into virtualizarr.ChunkManifest objects
        ...

    @classmethod
    def from_virtual_dataset(
        cls, 
        vds: "Dataset", 
        filename_mapper_fn: Callable[[str], str],
        file_layout: tuple[int, ...],
    ) -> "FunctionalStore":
        """
        Create a FunctionalStore for a set of sub-datasets with identical byte ranges (e.g. each representing one of a set of identically-structured files).

        For each array, each virtual dataset becomes one or more chunks in the resulting FunctionalStore.

        Parameters
        ----------
        vds: xr.Dataset
            A Virtual Dataset (no loaded variables allowed) representing one file.

            The byte ranges and offsets contained will be used for all chunks in the resulting store, so they need to be the same in all files. The filenames are ignored, and instead described by `filename_mapper_fn`.
        file_layout : tuple[int, ...]
            Number of identically-structured files along each dimension of the hypercube.

            For a virtual dataset with only one chunk per array this will effectively map onto the zarr.core.ChunkGrid of the resulting store.
        """
        ...
        # implementation would create and set a mapper_fn which calls `filename_mapper_fn` for the filenames,
        # but uses the corresponding byte ranges in the virtual dataset for every chunk.

Usage

Hypergrib

To use this for case (2) - for example with hypergrib - you would need to define a mapper_fn that reached out and touched the .idx files. This function is also the layer where you could do caching to prevent reading the .idx files more than once, and also where you could put fallback or error-handling mechanisms if you wanted to.

You (@JackKelly @emfdavid) would need to provide optimized mapper functions (a rough signature stub and sketch are shown below) which include:

  1. your knowledge of the layout of a specific dataset,
  2. your ideas for deducing byte ranges on the fly by quickly scanning .idx files (which could be async rust or whatever),
  3. some kind of global caching mechanism so that you only look up .idx files that you haven't looked up before,
  4. any error-handling or fallback code you feel you need.

def hypergrib_mapper_fn(root: str, key: ChunkKey) -> tuple[str, ByteRequest]:
    # dataset-specific logic for deducing the filename and byte range goes here
    ...
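
Fleshing out that stub a little (in Python, and definitely not hypergrib's API), a mapper with naive global caching of the .idx lookups could look roughly like this - the bucket layout, key scheme, and parse_idx helper (sketched in an earlier comment) are all invented:

from functools import lru_cache

import fsspec


@lru_cache(maxsize=None)  # global cache: each .idx file is fetched and parsed only once
def _byte_ranges(grib_url: str) -> dict[tuple[str, str, str], tuple[int, int | None]]:
    with fsspec.open(grib_url + ".idx", "rt", anon=True) as f:
        return parse_idx(f.read())  # a parser like the parse_idx sketch in the earlier comment


def hypergrib_mapper_fn(root: str, key: str) -> tuple[str, tuple[int, int | None]]:
    # invented key scheme: "<variable>/c/<init>/<step>/<level>"
    variable, _, init, step, level = key.split("/")

    # 1. deduce the GRIB filename from the dataset's naming convention (made up here)
    grib_url = f"{root}/gfs.{init}/00/atmos/gfs.t00z.pgrb2.0p25.f{int(step):03d}"

    # 2. deduce the byte range from the (cached) .idx contents
    offset, length = _byte_ranges(grib_url)[(variable, level, f"{int(step)} hour fcst")]
    return grib_url, (offset, length)  # (offset, length) standing in for a zarr ByteRequest
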

Once you have that (and are able to distribute it to data consumers, as a package or otherwise), then to access the data users just create and open the specific store instance:

# be nicer if this was loadable from some persisted (pickled?) state
from hypergrib.datasets import ecmwf_dataset_mapper

ecmwf_hypergrib_functionalstore = FunctionalStore.from_url(
    url='s3://path/to/ecmwf/bucket/',
    mapper_fn=ecmwf_dataset_mapper,
)
    
ds = xr.open_zarr(ecmwf_hypergrib_functionalstore)

File-level aggregation

To use this for case (1) you could just write the necessary analytical mapper function and pass it in. I think this is the same thing as "file-level dataset aggregation" as @rabernat was saying above.
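
For instance, if every file had identical internal structure with no compression, the mapper could be purely analytic (all names and numbers below are invented for illustration):

# Illustrative analytic mapper for case (1): the filenames follow a convention and every
# file has an identical internal layout, so no I/O is needed to resolve a chunk key.
CHUNK_NBYTES = 720 * 1440 * 4   # one uncompressed float32 2D field (made-up grid)
HEADER_NBYTES = 1024            # fixed-size per-file header (made up)

def analytic_mapper_fn(root: str, key: str) -> tuple[str, tuple[int, int]]:
    # invented key scheme: "<variable>/c/<time>/<level>"
    variable, _, time_idx, level_idx = key.split("/")
    path = f"{root}/{variable}/{int(time_idx):06d}.bin"
    offset = HEADER_NBYTES + int(level_idx) * CHUNK_NBYTES
    return path, (offset, CHUNK_NBYTES)  # (offset, length) standing in for a zarr ByteRequest
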

Note the FunctionalStore.from_virtual_dataset method, which is for aggregating files that have identical internal chunk structure. You would create it with virtualizarr:

from virtualizarr import open_virtual_dataset

vds_single_file = open_virtual_dataset("one/of/the/files.nc")

# create FunctionalStore for 10,000 files with identical internal structure
identically_structured_files_functional_store = FunctionalStore.from_virtual_dataset(
    vds=vds_single_file,
    filename_mapper_fn=chunk_key_to_filename_mapper_fn,
    file_layout=(100, 100),
)

ds = xr.open_zarr(identically_structured_files_functional_store)

Testing

If you want to test that your mapper function is correct, you could test against VirtualiZarr for each file, without even loading the actual data values:

import xarray.testing as xrt
from virtualizarr import open_virtual_dataset

# create your FunctionalStore instance for the whole hypercube
functional_store = ...

for file in filepaths:
    vds_from_store = functional_store.to_virtual_dataset()
    expected_vds = open_virtual_dataset(file)

    # this will check manifest contents are identical, plus all codecs and zarr metadata
    xrt.assert_identical(vds_from_store, expected_vds)

(Of course to test a hypergrib-generated FunctionalStore this would require someone to separately write a GRIB reader for virtualizarr, but it would probably have a lot of code in common with hypergrib...)

Serialization and distribution of FunctionalStores

There's no obvious serialisation format for FunctionalStore because it involves arbitrary dataset-specific executable code, but it could at least potentially be pickled (so long as the functional map can be pickled). Then you could do basic version control in the sense of just keeping track of all the pickled objects.

Pros and cons of this approach

Advantages compared to creating an xarray backend are:

  • Complete separation of the general zarr-like part of the approach from the dataset- and format-specific part (which all lives in the implementation of the mapper_fn).
  • Can read from the store using xr.open_zarr, immediately benefitting from all the async power of zarr-python v3, and requiring much less code than writing an xarray backend from scratch.
  • Can be read by (python) libraries other than xarray, they just have to understand zarr-python's Store interface.
  • Simple enough that it could be easily copied by zarr implementations written in other languages, as the general version has no more dependencies than the Fsspec store does.
  • Instances could conceivably be persisted (though I'm not sure how best to do this), and therefore be both shared and version-controlled.
  • Contains all the metadata of a normal zarr store, which is great for creating catalogs.

Disadvantages compared to creating an xarray backend:

  • Your hypercube has to actually fit the zarr data model, so no variable-length chunks etc. (for now at least).
  • Implementing fallbacks for some rogue files that don't fit the pattern would be a pain.
  • You can do caching, but it might be messy because you probably want to cache file reads rather than chunk fetches, so you're potentially reaching across layers in the stack. (But perhaps this is a good thing - a clear framework to cache at multiple levels?)

Advantages compared to general virtualizarr + icechunk approach:

  • Scalability, because you don't have a manifest that grows in size with the size of the dataset.
  • You don't actually need to even look at all the files to make the FunctionalStore (although you probably should).

Disadvantages compared to general virtualizarr + icechunk approach:

  • Because you're not tracking chunks individually you can't do per-chunk version control. You also can't choose for arbitrary chunks to be native zarr chunks instead (i.e. there's no concept of "loadable" or "inlined" variables).
    • Could we solve that by persisting a FunctionalStore that used native zarr chunks for some arrays? Or by integrating it with Icechunk somehow? Or a Zarr store that's defined as a concatenation of a FunctionalStore and an IcechunkStore?
  • You can't concatenate arrays to create composite virtual datasets like you can do with virtualizarr's virtual datasets
  • You also can't merge variables into one dataset very easily, you would have to just add match statements inside the mapper for each variable...
  • It's more fragile because there's nothing here to double-check that the files actually still exist (though that could potentially be added, see VirtualiZarr.to_icechunk's new last_updated_at arg).
  • It's harder to publicly distribute without the security risk of arbitrary code execution upon unpickling (the same worry that motivated the creation of HuggingFace's safetensors format...)
    • Though maybe it would be enough to just distribute the small bits of python code that are executed by the mapper_fn, and that's exactly as dangerous as running any other piece of python code you found on github.

Anyway, if this seems useful, maybe something like this could live upstream in zarr-python? (cc @d-v-b, @jhamman)

@TomNicholas
Member Author

Oh, and also @ayushnag - your ManifestStore idea could then just be implemented as an instance of a FunctionalStore, whose mapping simply looks up the contents of the virtual dataset we already have in memory.

@JackKelly

Wow! That's a very impressive write-up! And I love that even your "pseudo-code" is rigorously annotated with type hints!

I love the idea of trying to create a "useful abstraction common to both cases (1) and (2)" and of all the advantages that you list.

An issue to be aware of: Non-atomic updates of datasets

Disadvantages compared to creating an xarray backend:
...

  • Implementing fallbacks for some rogue files that don't fit the pattern would be a pain.

One issue that @emfdavid brought up is that weather forecast providers don't update their datasets atomically, so it's common to find missing / incomplete files when reading the folder containing the latest weather forecasts. We'll need to handle this edge case. I haven't yet given this much thought, other than the (obvious) realisation that crashing is not acceptable when trying to read the latest weather forecasts! Although I'm guessing that wasn't really what you had in mind when you mentioned "rogue files"?

Next steps

  • I'm definitely keen to support this work!
  • However, until FunctionalStores is implemented in zarr-python (or elsewhere), I'll continue chipping away at hypergrib to start getting empirical evidence for what's important for high performance. I'm 100% happy with the idea that, over time, a lot of (or all of) hypergrib might be moved upstream. That's a win for users, and a win for us developers. That said, my main concern is performance, so I'd be keen that any abstraction allows for high performance - which might, for example, mean that it's Rust's responsibility to keep hundreds of network requests in flight at any given moment, and to schedule the decompression of those chunks. My understanding is that Python's asyncio isn't great at handling huge numbers of async tasks (although I'll admit I haven't tried!)

Related

  • Ultimately, I'm curious whether we should aim for a world in which all dataset-specific information for datasets of type (1) and (2) is contained in json metadata. This metadata would contain information like the dimension names, coordinate labels, how to map from key to filename, and which parts of the hypercube are consistently missing (e.g. GFS doesn't include irradiance at forecast step 0). I've made a very early start on this metadata schema here. Although it has lots of problems right now!

@JackKelly

One other quick thought:

# be nicer if this was loadable from some persisted (pickled?) state
from hypergrib.datasets import ecmwf_dataset_mapper

I realise that, in the past, I thought that we'd host a continually-updated metadata description of each NWP dataset. However, more recently, it's become fairly clear that that isn't necessary. The only dimension that changes regularly is the reference datetime. And NWPs usually use a "folder" per reference datetime. And, in Rust, it only takes a fraction of a second to list all those folders directly from the cloud storage bucket. So my current plan is that hypergrib would ship with descriptions of each NWP dataset, including all the dimensions and coordinate labels that only change once every year or so. As soon as the user opens a dataset, hypergrib would check the dataset's bucket for the latest reference datetime folders. This should be more robust, and simpler to maintain!
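
In Python terms (hypergrib would do the equivalent in async Rust), discovering the latest reference datetimes could be as simple as listing the per-run prefixes - the bucket name and layout here are assumed for illustration:

import fsspec

# list the per-reference-datetime "folders" straight from the bucket
# (bucket name and layout assumed; hypergrib would do the equivalent in async Rust)
fs = fsspec.filesystem("s3", anon=True)
run_folders = fs.ls("noaa-gefs-pds/", detail=False)
latest_run = sorted(run_folders)[-1]   # e.g. "noaa-gefs-pds/gefs.20250112"
print(latest_run)
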

@TomNicholas
Member Author

I told @joshmoore about the FunctionalStore thing and he said that:

  1. This sounds very much like something the bio community often ends up doing,
  2. The str->str function could also potentially just be a regex (which would be nice as it's cross-language - see the toy sketch below),
  3. I've kind of basically re-invented zarr's concept of a Storage Transformer 😅
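
A toy example of point 2, expressing the key -> relative-path part of the mapping as a single regex substitution (pattern and layout invented):

import re

# Toy example of expressing the key -> relative-path mapping as a single regex
# substitution, which could be shared across languages (pattern and layout invented).
PATTERN = re.compile(r"^(?P<var>\w+)/c/(?P<init>\d+)/(?P<step>\d+)$")
TEMPLATE = r"init\g<init>/step\g<step>/\g<var>.grib2"

def key_to_relative_path(key: str) -> str:
    return PATTERN.sub(TEMPLATE, key)

print(key_to_relative_path("temperature/c/0/12"))  # -> "init0/step12/temperature.grib2"
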
