Listing every format that could be represented as virtual zarr #218

Open
6 of 15 tasks
TomNicholas opened this issue Aug 8, 2024 · 9 comments
Labels: help wanted (Extra attention is needed); Kerchunk (Relating to the kerchunk library / specification itself); references generation (Reading byte ranges from archival files); usage example (Real world use case examples)

Comments

@TomNicholas
Member

TomNicholas commented Aug 8, 2024

Let's list all the file formats that could potentially be represented efficiently as "virtual zarr" - i.e. zarr + chunk manifests.

The important criterion here is that the format must store data in a small number of contiguous chunks, such that access using HTTP range requests to object storage is efficient. This rules out some formats; for example, I don't think we can efficiently access this format that @kmuehlbauer mentioned over in openradar/xradar#187 (comment):

file formats where variables are written interleaved within one chunk of data (eg: 100 bytes v1, 100 bytes v2, 100 bytes v3, 100 bytes v1, 100 bytes v2, 100 bytes v3, ...)? Is there something like strides available?
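To make the contiguity criterion concrete, here is a minimal sketch that counts how many byte ranges one variable needs under each layout. The function names and the sizes (3 variables, 100-byte records, 1000 records) are illustrative assumptions based on the quoted example, not part of any library.

```python
def contiguous_ranges(n_vars, record_bytes, n_records):
    """One (offset, length) range per variable: each variable is stored in one block."""
    var_bytes = record_bytes * n_records
    return {f"v{i + 1}": [(i * var_bytes, var_bytes)] for i in range(n_vars)}


def interleaved_ranges(n_vars, record_bytes, n_records):
    """Records alternate v1, v2, v3, v1, v2, v3, ... so each variable is scattered."""
    stride = n_vars * record_bytes
    return {
        f"v{i + 1}": [
            (r * stride + i * record_bytes, record_bytes) for r in range(n_records)
        ]
        for i in range(n_vars)
    }


contig = contiguous_ranges(n_vars=3, record_bytes=100, n_records=1000)
inter = interleaved_ranges(n_vars=3, record_bytes=100, n_records=1000)

# Number of range requests needed to read all of v2 under each layout.
print(len(contig["v2"]), len(inter["v2"]))
```

With the contiguous layout a single range request fetches the whole variable, while the interleaved layout needs one tiny request per record (or one oversized request that also downloads the other variables).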

If we start thinking of Zarr as a "SuperFormat" (super as in superset, not as in super-duper), then this is the list of existing formats in that set, i.e. the formats that can be referenced using chunk manifests (see zarr-developers/zarr-specs#287).
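As a rough illustration of what a chunk manifest amounts to, the sketch below maps chunk keys to a location (path, offset, length) inside an existing file and reads one chunk with a seek plus a bounded read, which stands in for an HTTP range request. The toy file layout, the field names, and `read_chunk` are assumptions made for this example.

```python
import os
import struct
import tempfile

# Toy "archival" file: a 16-byte header followed by two chunks of four float32 values.
chunk0 = struct.pack("<4f", 0.0, 1.0, 2.0, 3.0)
chunk1 = struct.pack("<4f", 4.0, 5.0, 6.0, 7.0)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"H" * 16 + chunk0 + chunk1)
    path = f.name

# Hypothetical manifest: chunk key -> where that chunk's bytes live.
manifest = {
    "0": {"path": path, "offset": 16, "length": 16},
    "1": {"path": path, "offset": 32, "length": 16},
}


def read_chunk(entry):
    """Read exactly one chunk's bytes (a local stand-in for a range request)."""
    with open(entry["path"], "rb") as fh:
        fh.seek(entry["offset"])
        return fh.read(entry["length"])


values = struct.unpack("<4f", read_chunk(manifest["1"]))
print(values)

os.remove(path)
```

The original file is never rewritten; only the small manifest is new.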


Definitely can support:

Probably can support:

Maybe can support?

Probably can't support:

(The checkboxes indicate whether or not a working implementation already exists, either going through kerchunk's in-memory format as an intermediate or creating a ManifestArray directly.)

cc @jhamman @d-v-b

@TomNicholas added the help wanted, references generation, and usage example labels Aug 8, 2024
@TomNicholas changed the title from "Listing every format that might conceivably be represented as virtual zarr" to "Listing every format that could be represented as virtual zarr" Aug 8, 2024
@maxrjones
Member

Unfortunately, based on https://gdal.org/user/virtual_file_systems.html#jpeg2000, JPEG2000 is likely in the 'probably can't support' category. I would have liked these datasets to be virtualizable, but they're all JPEG2000, to optimize for the download-to-disk model :(

Another way to phrase this question, which may help the search, is which of the formats supported by GDAL's raster drivers can be virtualized?

@martindurant
Member

I like this issue! It's worth saying that anything kerchunk can chunk can be v-zarrred, right? In that repo, there are suggestions of other worthwhile formats; DICOM and NIfTI (medical imaging) spring to mind. The latter is nice, but often whole-file-gzipped; the former is evil in the way that other 90s standards are evil, but extremely widespread.

@norlandrhagen
Collaborator

... the former is evil in the way that other 90s standards are evil, but extremely widespread.

❤️

@TomNicholas
Member Author

anything kerchunk can chunk can be v-zarrred, right?

Yes, that's the idea. This function does kerchunk refs -> virtual dataset, and this function does virtual dataset -> kerchunk refs. Any additional kerchunk file readers can be called as another if...else... in here.
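For readers unfamiliar with the two representations, here is a minimal sketch of the mapping, assuming kerchunk's version 1 reference layout in which a byte-range reference is a `[url, offset, length]` list and inlined metadata is a string. `refs_to_manifests`, the bucket URL, and the manifest field names are hypothetical and are not the functions linked above.

```python
import json

# A small kerchunk-style reference set (the URL and offsets are made up).
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps({"shape": [2, 4], "chunks": [1, 4], "dtype": "<f4"}),
        "temp/0.0": ["s3://example-bucket/file.nc", 8000, 16],
        "temp/1.0": ["s3://example-bucket/file.nc", 8016, 16],
    },
}


def refs_to_manifests(kerchunk_refs):
    """Group byte-range references by variable; skip inlined (string) entries."""
    manifests = {}
    for key, value in kerchunk_refs["refs"].items():
        if isinstance(value, list) and len(value) == 3:
            var, chunk_key = key.rsplit("/", 1)
            path, offset, length = value
            manifests.setdefault(var, {})[chunk_key] = {
                "path": path,
                "offset": offset,
                "length": length,
            }
    return manifests


manifests = refs_to_manifests(refs)
print(manifests["temp"]["1.0"])
```

A real conversion also has to carry the array metadata (`.zarray`, `.zattrs`) across; this sketch only covers the chunk references.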

@TomNicholas added the Kerchunk label Aug 21, 2024
@TomNicholas pinned this issue Nov 15, 2024
@TomNicholas
Member Author

Hugging Face safetensors is an interesting example - it's uncompressed so basically just like reading netCDF3, having no internal chunking. But it also puts all the metadata at the start of the file, making it a bit like cloud-optimized HDF5. See also huggingface/safetensors#527 (comment)
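To show why the metadata-first layout is convenient, here is a sketch that builds a tiny safetensors-style byte string by hand and derives one byte range per tensor from the header alone. It assumes the documented layout (an 8-byte little-endian header length, a JSON header, then the raw data buffer, with `data_offsets` relative to the start of that buffer). The helper names are hypothetical and the safetensors library itself is not used.

```python
import json
import struct


def build_safetensors_bytes(tensors):
    """Assemble a minimal safetensors-style blob from {name: (dtype, shape, raw_bytes)}."""
    header, buffer = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(buffer), len(buffer) + len(raw)],
        }
        buffer += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + buffer


def chunk_references(blob):
    """Derive absolute (offset, length) per tensor using only the header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    data_start = 8 + header_len
    refs = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        begin, end = info["data_offsets"]
        refs[name] = {"offset": data_start + begin, "length": end - begin}
    return refs


blob = build_safetensors_bytes(
    {
        "weight": ("F32", [2, 2], struct.pack("<4f", 1, 2, 3, 4)),
        "bias": ("F32", [2], struct.pack("<2f", 5, 6)),
    }
)
refs = chunk_references(blob)
bias = refs["bias"]
bias_values = struct.unpack("<2f", blob[bias["offset"] : bias["offset"] + bias["length"]])
print(bias_values)
```

Against object storage, the same idea would need two small requests for the header (8 bytes, then the JSON), after which every tensor is a single range.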

@martindurant
Member

If the format is simple and common, I say it should be included immediately, especially when there is a straightforward way to check correctness.

having no internal chunking

but you can assign internal chunking. Is partial reading available in upstream at all yet?
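A sketch of what assigning internal chunking could look like for an uncompressed, C-ordered array: splitting along the first axis gives chunks that are each a single contiguous byte range. `virtual_chunks` and its parameters are hypothetical, and the sketch insists on an even split because Zarr expects every stored chunk to have the full chunk shape.

```python
from math import prod


def virtual_chunks(array_offset, shape, itemsize, rows_per_chunk):
    """Split a C-ordered, uncompressed array into byte ranges along its first axis."""
    if shape[0] % rows_per_chunk != 0:
        raise ValueError("rows_per_chunk must divide the first dimension evenly")
    row_bytes = prod(shape[1:]) * itemsize  # bytes in one slice along axis 0
    chunks = {}
    for i, start in enumerate(range(0, shape[0], rows_per_chunk)):
        # Only axis 0 is split, so every other chunk index is 0.
        key = ".".join([str(i)] + ["0"] * (len(shape) - 1))
        chunks[key] = (array_offset + start * row_bytes, rows_per_chunk * row_bytes)
    return chunks


chunks = virtual_chunks(array_offset=1024, shape=(12, 4, 5), itemsize=4, rows_per_chunk=4)
for key, (offset, length) in chunks.items():
    print(key, offset, length)
```

Splitting along any other axis would make each chunk non-contiguous in the file, which brings back the interleaving problem from the top of the thread.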

@TomNicholas
Member Author

TomNicholas commented Jan 3, 2025

If the format is simple and common, I say it should be included immediately, especially when there is a straightforward way to check correctness.

I raised #367 to track adding it.

but you can assign internal chunking. Is partial reading available in upstream at all yet?

This issue seems to suggest it is: zarr-developers/zarr-python#1106. But I think to take advantage of this with virtualizarr would require #199 to be merged.

@martindurant
Member

No, zarr's PR1106 only implemented it for blosc compression, something I've been arguing about for a very very long time!

If you can dynamically re-imagine the chunking at runtime (which is what I think #119 does), then that would be good enough for most practical uses - but still annoying. Zarr should just do this! I.e., the chunk IO function shouldn't just be passed "I need chunk X", but "I need section (:, s:t, i:j) of chunk X" and a way to characterise what the decompression pipeline looks like (this is OK for uncompressed, some blosc, zstd maybe..., but not zlib). This was my suggestion in passing Contexts around in zarr v2.
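A small sketch of the distinction being drawn here, using only the standard library. For an uncompressed C-ordered chunk, a row section maps directly to a byte range, whereas a zlib stream cannot be decoded starting partway in. `row_slice_byte_range` is a hypothetical helper, not a Zarr API, and skipping a single byte stands in for any attempt to start reading mid-stream.

```python
import struct
import zlib


def row_slice_byte_range(chunk_offset, chunk_shape, itemsize, s, t):
    """Byte range covering rows s:t of an uncompressed, C-ordered 2D chunk."""
    _, n_cols = chunk_shape
    row_bytes = n_cols * itemsize
    return chunk_offset + s * row_bytes, (t - s) * row_bytes


raw = struct.pack("<12i", *range(12))  # one 3x4 int32 chunk

# Uncompressed: fetch only the middle row.
offset, length = row_slice_byte_range(0, (3, 4), 4, 1, 2)
middle_row = struct.unpack("<4i", raw[offset : offset + length])
print(middle_row)

# zlib: the stream must be decoded from its first byte onwards.
compressed = zlib.compress(raw)
try:
    zlib.decompress(compressed[1:])
    partial_ok = True
except zlib.error:
    partial_ok = False
print(partial_ok)
```

Sections that span only some columns (the `i:j` part) are not contiguous even when uncompressed, so they would need one range per row or a single over-read.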

@TomNicholas
Member Author

I don't disagree, but if we want to discuss this further we should do it on a new issue (on this repo or upstream on zarr).


4 participants