Introduction to LINDI, NWB Developer Hackathon
Jeremy Magland, Ryan Ly, Oliver Ruebel
NWB Developer Hackathon, DataJoint Headquarters, April 2024
HDF5 as Zarr as JSON for NWB
LINDI provides a JSON representation of NWB (Neurodata Without Borders) data where the large data chunks are stored separately from the main metadata. This enables efficient storage, composition, and sharing of NWB files on cloud systems such as DANDI without duplicating the large data blobs.
LINDI provides:
- A specification for representing arbitrary HDF5 files as Zarr stores. This handles scalar datasets, references, soft links, and compound data types for datasets.
- A Zarr wrapper for remote or local HDF5 files (LindiH5ZarrStore).
- A mechanism for creating .lindi.json (or .nwb.lindi.json) files that reference data chunks in external files, inspired by kerchunk.
- An h5py-like interface for reading from and writing to these data sources that can be used with pynwb.
- A mechanism for uploading and downloading these data sources to and from cloud storage, including DANDI.
This project was inspired by kerchunk and hdmf-zarr.
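To make the kerchunk-inspired layout concrete, here is a minimal sketch of inspecting a reference file system. It assumes the kerchunk convention in which a top-level refs dictionary maps Zarr keys either to inline content or to [url, offset, size] triples pointing into an external file. The sample dictionary, its URL, and the summarize_refs helper are hypothetical illustrations and are not part of LINDI.

```python
def summarize_refs(rfs):
    """Count inline entries and external chunk references in a reference file system."""
    inline = 0
    external = 0
    for value in rfs["refs"].values():
        # Assumption: external references are [url, offset, size] lists (kerchunk style);
        # everything else is treated as inline content.
        if isinstance(value, list):
            external += 1
        else:
            inline += 1
    return {"inline": inline, "external": external}


# Hypothetical, hand-written example (not generated by LINDI)
example_rfs = {
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "acquisition/data/.zattrs": '{"unit": "volts"}',
        "acquisition/data/0": ["https://example.org/file.nwb", 4096, 2000],
        "acquisition/data/1": ["https://example.org/file.nwb", 6096, 2000],
    }
}

summary = summarize_refs(example_rfs)
print(summary)
```

Because the large chunks appear only as references, the JSON file itself stays small even when the referenced data is large.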
- Represent a remote NWB/HDF5 file as a .nwb.lindi.json file.
- Read a local or remote .nwb.lindi.json file using pynwb or other tools.
- Edit a .nwb.lindi.json file using pynwb or other tools.
- Add datasets to a .nwb.lindi.json file using a local staging area.
- Upload a .nwb.lindi.json file to a cloud storage service such as DANDI.
https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1
- Efficient remote access: Lazy reading from remote HDF5 is inherently inefficient because many serial requests are needed to load the metadata. LINDI is as efficient as Zarr for remote access.
- Flexible data composition without duplication.
- Ability to edit files without rewriting / re-uploading
- Scalability
- Accessibility and interoperability: JSON is more widely supported than HDF5.
- Custom compression codecs
Cons: requires more than one file to represent the data.
- More flexible in terms of where data chunks are stored.
- Supports HDF5 features such as scalar datasets, references, and compound data types.
import json
import pynwb
import lindi
# URL of the remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"
# Create a read-only Zarr store as a wrapper for the h5 file
store = lindi.LindiH5ZarrStore.from_file(h5_url)
# Generate a reference file system
rfs = store.to_reference_file_system()
# Save it to a file for later use
with open("example.lindi.json", "w") as f:
    json.dump(rfs, f, indent=2)
# Create an h5py-like client from the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(rfs)
# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)
import pynwb
import lindi
# URL of the remote .nwb.lindi.json file
url = 'https://kerchunk.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'
# Load the h5py-like client for the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(url)
# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)
import json
import lindi
# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'
# Load the h5py-like client for the reference file system
# in read-write mode
client = lindi.LindiH5pyFile.from_reference_file_system(url, mode="r+")
# Edit an attribute
client.attrs['new_attribute'] = 'new_value'
# Save the changes to a new .nwb.lindi.json file
rfs_new = client.to_reference_file_system()
with open('new.nwb.lindi.json', 'w') as f:
    f.write(json.dumps(rfs_new, indent=2, sort_keys=True))
import lindi
# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'
# Load the h5py-like client for the reference file system
# in read-write mode with a staging area
with lindi.StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_reference_file_system(
        url,
        mode="r+",
        staging_area=staging_area
    )
    # add datasets to client using pynwb or other tools
    # upload the changes to the remote .nwb.lindi.json file
LINDI defines a set of special Zarr annotations to correspond with HDF5 features that are not natively supported in Zarr.
In HDF5, datasets can be scalar, but Zarr does not natively support scalar arrays. To overcome this limitation, LINDI represents scalar datasets as Zarr arrays with a shape of (1,) and marks them with the _SCALAR = True attribute.
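As a rough illustration of the scalar convention, the sketch below unwraps a shape-(1,) array when its attributes carry _SCALAR = True. The decode_scalar helper and the sample values are hypothetical; this is not how LINDI itself implements the lookup.

```python
def decode_scalar(values, attrs):
    """Return the bare value for a _SCALAR array, otherwise the values unchanged."""
    if attrs.get("_SCALAR", False):
        # A scalar dataset is stored as an array of shape (1,)
        if len(values) != 1:
            raise ValueError("a _SCALAR array is expected to have shape (1,)")
        return values[0]
    return values


# Hypothetical sample data
scalar_value = decode_scalar([30000.0], {"_SCALAR": True})
array_value = decode_scalar([1, 2, 3], {})
print(scalar_value, array_value)
```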
Soft links in HDF5 are pointers to other groups within the file. LINDI utilizes the _SOFT_LINK attribute on a Zarr group to represent this relationship. The path key within the attribute specifies the target group within the Zarr structure. Soft link groups in Zarr should be otherwise empty, serving only as a reference to another location in the dataset.
Note that we do not currently support external links.
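The soft link convention can be sketched as a small resolver that follows the path key of each _SOFT_LINK attribute until it reaches an ordinary group. The group_attrs dictionary, the group paths in it, and the resolve_soft_link helper are assumptions made for illustration only.

```python
def resolve_soft_link(group_attrs, path, max_hops=10):
    """Follow _SOFT_LINK attributes until a group that is not a link is reached."""
    for _ in range(max_hops):
        link = group_attrs.get(path, {}).get("_SOFT_LINK")
        if link is None:
            return path
        # The "path" key names the target group within the same Zarr structure
        path = link["path"]
    raise RuntimeError("too many soft link hops (possible cycle)")


# Hypothetical mapping of group path -> group attributes
group_attrs = {
    "/general/devices/probe": {},
    "/processing/ecephys/probe_link": {
        "_SOFT_LINK": {"path": "/general/devices/probe"}
    },
}

target = resolve_soft_link(group_attrs, "/processing/ecephys/probe_link")
print(target)
```

The hop limit is a defensive choice in this sketch, so that a malformed file with circular links fails loudly instead of looping forever.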
{
  "_REFERENCE": {
    "source": ".",
    "path": "...",
    "object_id": "...",
    "source_object_id": "..."
  }
}
- source: Always "." for now, indicating that the reference is to an object within the same Zarr.
- path: Path to the target object within the Zarr.
- object_id: The object_id attribute of the target object (for validation).
- source_object_id: The object_id attribute of the source object (for validation).
This largely follows the convention used by hdmf-zarr.
HDF5 references can appear within both attributes and datasets. For attributes, the value of the attribute is a dict in the above form. For datasets, the value of an item within the dataset is a dict in the above form.
Note: Region references are not supported at this time and are planned for future implementation.
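A reader of such a file might detect and validate a reference roughly as follows. This is a sketch under stated assumptions: the check_reference helper, the object_ids lookup table, and the sample paths and identifiers are hypothetical, and only the target object_id is checked here (source_object_id is left unchecked for brevity).

```python
def is_reference(value):
    """True if a decoded attribute value or dataset item is a LINDI-style reference dict."""
    return isinstance(value, dict) and "_REFERENCE" in value


def check_reference(value, object_ids):
    """Validate a reference against a path -> object_id table and return the target path."""
    ref = value["_REFERENCE"]
    if ref["source"] != ".":
        raise ValueError("only references within the same Zarr are supported")
    target_path = ref["path"]
    if target_path not in object_ids:
        raise KeyError(f"reference target not found: {target_path}")
    if object_ids[target_path] != ref["object_id"]:
        raise ValueError("object_id of the target does not match the reference")
    return target_path


# Hypothetical sample data
object_ids = {"/acquisition/ts": "id-target"}
value = {
    "_REFERENCE": {
        "source": ".",
        "path": "/acquisition/ts",
        "object_id": "id-target",
        "source_object_id": "id-source",
    }
}

resolved = check_reference(value, object_ids)
print(is_reference(value), resolved)
```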
Zarr arrays can represent compound data types from HDF5 datasets. The _COMPOUND_DTYPE attribute on a Zarr array indicates this, listing each field's name and data type. The array data should be JSON encoded, aligning with the specified compound structure. The h5py.Reference type is also supported within these structures (represented by the type string '<REFERENCE>').
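The following sketch shows one way JSON-encoded compound rows could be decoded into named records. It assumes that _COMPOUND_DTYPE is a list of [name, type] pairs and that each row is a JSON list in field order; both the layout and the decode_compound helper are assumptions for illustration, not a statement of LINDI's exact encoding.

```python
import json


def decode_compound(encoded, dtype_fields):
    """Decode JSON-encoded rows into dicts keyed by the compound field names."""
    names = [name for name, _type in dtype_fields]
    records = []
    for row in json.loads(encoded):
        if len(row) != len(names):
            raise ValueError("row length does not match the compound dtype")
        records.append(dict(zip(names, row)))
    return records


# Hypothetical attribute value and array data
compound_dtype = [["x", "int32"], ["y", "float64"]]
encoded_rows = json.dumps([[1, 3.5], [2, 4.5]])

records = decode_compound(encoded_rows, compound_dtype)
print(records)
```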
For datasets with an extensive number of chunks such that inclusion in the Zarr or reference file system is impractical, LINDI uses the _EXTERNAL_ARRAY_LINK attribute on a Zarr array. This attribute points to an external HDF5 file, specifying the url for remote access (or local path) and the name of the target dataset within that file. When slicing that dataset, the LindiH5pyClient will handle data retrieval, leveraging h5py and remfile for remote access.
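The external array link can be pictured with the sketch below, which reads the url and name keys and then opens the target dataset with h5py, using remfile for remote URLs. The two helpers and the sample attribute values are hypothetical, and the code is only an approximation of the idea, not LINDI's internal implementation.

```python
def parse_external_array_link(attrs):
    """Return (url, name) from an _EXTERNAL_ARRAY_LINK attribute, or None if absent."""
    link = attrs.get("_EXTERNAL_ARRAY_LINK")
    if link is None:
        return None
    return link["url"], link["name"]


def open_external_dataset(url, name):
    """Open the linked HDF5 dataset; slicing the result reads only the needed data."""
    # Third-party imports are kept local so the parsing helper works without them
    import h5py
    import remfile

    if url.startswith(("http://", "https://")):
        file_obj = remfile.File(url)  # lazy, range-request based remote file
    else:
        file_obj = url  # local path
    h5_file = h5py.File(file_obj, "r")
    return h5_file[name]


# Hypothetical attribute value
attrs = {
    "_EXTERNAL_ARRAY_LINK": {
        "url": "https://example.org/file.nwb",
        "name": "/acquisition/ElectricalSeries/data",
    }
}

link = parse_external_array_link(attrs)
print(link)
```

With a real file, open_external_dataset(*link)[0:10] would fetch only the first ten rows rather than the whole dataset.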