[MISC] Update guidelines on file formats and multidimensional arrays - for derivatives #1614
Conversation
@Lestropie do you know if Zarr has MATLAB support? Also adding @mikecroucher here, just in case.
Sorry, I can't provide any insight on Zarr support; I had flagged it as a prospect at one point, but don't have any hands-on experience. I'm not entirely familiar with all the historical discussions (e.g. #197), but I wonder if it's premature to be adding "suggestions" for ND data storage given the contention?
Yes, this has been discussed a lot - see for instance bids-standard/bep021#1.
Just an FYI - ome.zarr is already adopted in the BIDS standard (see microscopy). The actual container is just half of the puzzle; having a good metadata layer for that container is the second and very important component. Regarding Zarr and MATLAB, we have talked to people at MathWorks about either taking the lead on it or working with people to integrate it.
This one has been extensively discussed on the Google doc... even @robertoostenveld agreed... ready to merge.
clarifies when to keep format and add multidimensional arrays
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##           master    #1614   +/-   ##
=======================================
  Coverage   87.92%   87.92%
=======================================
  Files          16       16
  Lines        1375     1375
=======================================
  Hits         1209     1209
  Misses        166      166
```
HDF5 and Zarr container format files (note that `.zarr` is typically a directory) SHOULD contain the data only (within the field `data`).
This `data` field SHOULD be treated as a "virtual directory tree" with a depth of one level,
containing BIDS paths at the level of the multidimensional file
(that is, the `.zarr` directory root or the `.h5` file).
BIDS path rules MUST be applied as though these paths existed within the dataset.
Metadata about the multidimensional array SHOULD be documented in the associated JSON sidecar file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for taking so long to review this. I think this is roughly what's being proposed here (using raw data as an example):
```
dataset/
  sub-01/
    anat.zarr/
      .zgroup
      sub-01_T1w/
        .zarray
        .zattrs
        ...
  sub-02/
    anat.zarr/
      .zgroup
      sub-02_T1w/
        .zarray
        .zattrs
        ...
```
This repackaging of BIDS data inside a hierarchical data format feels very radical and will require tools to be rewritten to understand entire datasets, as opposed to specific derivative files. I suspect that this is not what was actually intended, so I think it would be very helpful to see examples of the intent.
I see basically two cases that should be addressed:
- Existing BIDS-supported formats are built on HDF5 (`.nwb`, `.snirf`) or Zarr (`.ome.zarr`). When considering options for new formats, these should be prioritized to reduce the expansion of necessary tooling.
- For generic multidimensional array outputs, HDF5 and Zarr can be treated as extensions of `.tsv` files. Where TSV files with a header row represent a collection of named 1D arrays, an HDF5/Zarr container contains named N-D arrays that are not constrained to have a common shape. For simplicity, it is encouraged to use a collection of names at the root, which are to be described in a sidecar JSON. For example, to output raw model parameters for an undefined model, one might use:
```
sub-<label>/<datatype>/<entities>_<suffix>.zarr/
  .zgroup
  alpha/
    .zarray
    ...
  beta/
    .zarray
    ...
sub-<label>/<datatype>/<entities>_params.json
```
And the JSON file would contain:
```json
{
  "alpha": {
    "Description": "alpha parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  },
  "beta": {
    "Description": "beta parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  }
}
```
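For concreteness, a minimal sketch of how such an output could be produced with the `zarr` and `numpy` Python packages (zarr v2 API assumed; the paths, array names, and shapes below are hypothetical placeholders, not part of the proposal):

```python
import json
import numpy as np
import zarr

# Hypothetical derivative path; real outputs would use valid BIDS entities.
root = zarr.open_group("sub-01/anat/sub-01_params.zarr", mode="w")

# Named N-D arrays at the root, not constrained to a common shape.
root.create_dataset("alpha", data=np.zeros((64, 64, 32)))
root.create_dataset("beta", data=np.zeros((64, 64, 32, 3)))

# Sidecar JSON describing each named array.
sidecar = {
    "alpha": {"Description": "alpha parameter for XYZ model", "Units": "arbitrary"},
    "beta": {"Description": "beta parameter for XYZ model", "Units": "arbitrary"},
}
with open("sub-01/anat/sub-01_params.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```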
If this was the intent, I'm happy to propose alternative text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand your example; those arrays were meant for 'stuff' that does not fit the current formats. Why would one start allowing the packing of current data? Maybe in BIDS 2.0, but it seems too radical at this stage.
The initial example is the best I can make of the current text. I don't know what is being described here.
The PR is about the use of multidimensional arrays in derivatives only - as discussed at the meeting - and we'll use that in the other BEPs to make examples.
Do I understand correctly that you are asking for an example?
Yes, an example would be useful to understand what is being specified. At the bottom of my previous comment, I gave an example that made sense to me.
Quick note (not necessarily specific to this PR): several maintainers have noticed that we have introduced new features in the past months/years (e.g. CITATION.cff files) that have not necessarily been accompanied by updates to the bids-examples, which makes it harder:
I suspect that an accompanying PR in the bids-examples repo would be welcomed for the present PR.
Just a note: the "in derivatives only" info might be worth mentioning in the title of the PR, so that the content and aim of the PR are more explicit for newcomers ;)
At the heart of Zarr is the blosc1 meta-compressor. However, blosc2 has been out for a number of years and shows significant speed benefits over its predecessor, with more options for data compression/slicing, but it is not backward compatible with blosc1. If the goal of this thread is to improve IO performance, why not leap-frog to blosc2 directly?
What container format is using blosc2? The goal is wide support, not the fastest compression algorithm.
I will let @FrancescAlted, the upstream author of blosc2, comment on the current availability of the codec. In general, any container format that supports "filters" can use blosc2 in place of other common compressors such as zlib or gzip - for example, HDF5 filters, or my JSON/JData ND-array annotations. From this document, a blosc2 filter for HDF5 seems to have been available since 2022: https://www.hdfgroup.org/wp-content/uploads/2022/05/Blosc2-and-HDF5-European-HUG2022.pdf
I also added blosc2 support to my NIfTI JSON wrapper, JNIfTI, in 2022, for both MATLAB/Octave and Python. If you want to try it:
- For Python: download this benchmark script at https://github.com/NeuroJSON/pyjdata/blob/master/test/benchcodecs.py
- For MATLAB/Octave: install zmat and jsonlab, then run https://github.com/NeuroJSON/jnifti/blob/master/samples/headct/create_headct_jnii.m
The outputs on my laptop (i7-12700H) are shown below.
Just a note here: blosc2 is actually backward compatible with blosc1 (i.e. blosc2 tools can read blosc1 data without problems), but it is not forward compatible (i.e. blosc2 data cannot be read using blosc1 tools).
[Upstream developer speaking here] The format for Blosc2 is documented in a series of (relatively short) documents:
Regarding adoption, and besides the NIfTI wrapper by @fangq, there are other well-established formats adopting it. For example, HDF5 has supported Blosc2 via PyTables since 2022, and more recently in h5py via hdf5plugin. There is also b2h5py, a package that monkey-patches h5py for optimized reading of n-dimensional Blosc2 slices in HDF5 files. Another package that recently adopted Blosc2 to replace blosc1 for lossless compression is ADIOS2. Besides that, there are wrappers for languages like Julia, Rust, and probably others as well.
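For instance, a minimal sketch of writing a Blosc2-compressed HDF5 dataset from Python with h5py and hdf5plugin (assuming hdf5plugin >= 4.0; the file name, shape, chunking, and codec settings are placeholder choices, not recommendations):

```python
import h5py
import hdf5plugin  # registers the Blosc2 filter with the HDF5 library
import numpy as np

data = np.random.rand(256, 256, 256).astype("float32")

with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        chunks=(64, 64, 64),
        # Blosc2 with the zstd backend; cname/clevel are tunable per application.
        **hdf5plugin.Blosc2(cname="zstd", clevel=5,
                            filters=hdf5plugin.Blosc2.SHUFFLE),
    )

# Reading back only requires hdf5plugin to be imported so the filter is found.
with h5py.File("example.h5", "r") as f:
    roi = f["data"][0:64, 0:64, 0:64]  # decompresses only the needed chunks
```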
Sorry, what I meant was that blosc2 is not forward-compatible with blosc1 - the stream format contains breaking changes. Zarr also has filter support - is there a plan to support blosc2 as a filter? Ideally, data formats used in BIDS should prioritize "terminal formats" that are either standardized (like NIFTI-1, TSV/CSV/JSON) or committed to both forward and backward compatibility for long-term reusability (LTS). @FrancescAlted, is the blosc2 team more or less committed to keeping the forward compatibility of the blosc2 format, or is there still a good chance of breaking changes in the future?
Definitely - Blosc2 now supports all the features we wanted, so our plan is to support it without breaking changes for the years to come.
Hi all, thanks for contributing! Should we add to the PR some specific info about what is supported/preferred as the Zarr compressor: blosc1 or blosc2? Note to self: once this is done, we should also update the resources (e.g. the starter kit and other areas) with @fangq's resources for reading - the only MATLAB/Octave resource I am aware of.
@CPernet - zarr uses blosc1 by default, but most people can and do choose optimized compression settings using numcodecs and/or imagecodecs depending on the application needs (read/write, datatype, etc.). However, technically speaking, blosc2 can be used through registered numcodecs (see someone's example here: https://github.com/openclimatefix/ocf_blosc2). It won't be as efficient, as the zarr API does not have sub-chunk capabilities at the moment, which is where blosc2's benefit generally comes in. I know there were conversations back in 2021/22 about caterva/blosc2 and zarr. For the moment, I think the biggest change coming to zarr is the implementation of sharding (i.e. storing multiple chunks in a binary blob to optimize for compute, storage, and transport) through storage transformers. @martindurant and others may know more about the state of zarr/blosc2, or can point people to the conversations.
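To make the default-versus-configured distinction concrete, a minimal sketch of setting an explicit Blosc (blosc1) codec when writing a Zarr array via numcodecs (zarr v2 API assumed; the shapes, chunking, and codec settings are placeholders, not recommendations):

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Explicit blosc1 codec: zstd backend, moderate level, byte shuffle.
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

z = zarr.open_array(
    "example.zarr",
    mode="w",
    shape=(1024, 1024),
    chunks=(256, 256),
    dtype="float32",
    compressor=compressor,
)
z[:] = np.random.rand(1024, 1024).astype("float32")

print(z.info)  # reports the configured codec alongside the chunk layout
```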
I have zero knowledge about those; just asking if you want to change the PR to add some details about what is expected/supported (@satra - blosc1, I'm guessing).
@CPernet - personally, I would leave that to the downstream user and stay away from a specific recommendation. For example, for light sheet microscopy we can recommend blosc1 (zstd, level 5) - slower to write but optimized for storage + reading. That's a very specific subtype, though.
Thanks @satra and @CPernet for your comments. Because blosc1/blosc2 natively supports data chunking and fast reading of data slices/hypercubes from a large data volume (or sparse frame splitting into distributed chunk files), which sort of overlaps with the goal of zarr, for some reason I had the impression that zarr was a distributed storage interface built on top of blosc1. After reading the docs more carefully, it appears that blosc1 is only used in zarr as a regular compressor, such as zlib/gzip, and is only applied to the chunk-level data instead of the global array. Is this correct? If so, does zarr benefit from the blosc1 data format at all (aside from faster multi-threaded compression, SIMD, shuffle filters, etc.)?

My only reason for bringing up blosc2 here was to highlight the need to consider "forward compatibility" when adding new formats to BIDS - even for derivatives. When I wrote a BIDS-to-JSON converter to convert 1000+ OpenNeuro BIDS datasets in order to host them in my NoSQL database at https://neurojson.io, I had to handle every supported data format. For raw data, the number of formats is still manageable, and most of these formats are "terminal formats" that are perpetually unchanged. If we allow more data formats in derivatives, especially formats that are still evolving (for example, I see the zarr v3 spec has breaking changes relative to v2, as v2 did to v1), this will complicate ecosystems dependent on BIDS, such as my project, in terms of adding support for additional parsers, additional versions of those parsers, and additional codecs for the new versions in order to fully handle the files.

If the zarr team can afford to specify a subset of the features (metadata keys, organization schemes, codecs) that are production-stable and we only add those to BIDS, that would make the lives of BIDS-dependent projects easier. But I understand it is a fast-evolving format and promising a stable interface may not yet be feasible (the same applies to HDF5).
Hardly fast-evolving! Yes, there is a new version coming with breaking changes, but v2 is stable and will be supported far into the future. It even has partial-buffer read support for blosc1. Exactly how sharding will interact with upcoming blosc2 support, I don't know - I suspect the two are unconnected.
the issue being that HDF5 and Zarr are planned for future derivatives, so the example would use something not currently supported
clarifies when to keep the same format and possibly to use h5 and zarr (although matlab support for zarr is not clear)
based on https://docs.google.com/document/d/1JtTu5u7XTkWxxnCIH6sxGajGn1qG_syJ-p14aejpk3E/edit?usp=sharing