
flexible backend prototype #451

Closed
wants to merge 19 commits into from

Conversation

@bendichter
Contributor
commented May 21, 2023

#441

Here's the start of an implementation for separating the backend out. This refactoring gives us the following advantages:

  1. It allows for flexible backends: HDF5 and Zarr are shown, and it can be extended to future backends as well.
  2. It separates the methods that determine the content of the nwbfile from the methods that determine the dataset configurations.
  3. It simplifies the code in individual non-abstract DataInterface classes. Individual data interfaces no longer need to handle setting DataIO, and do not need to worry about reading an NWBFile from disk or using a make_or_load_nwbfile context.
  4. It allows for setting different configurations for different datasets.

There's still a lot that is needed:

  1. We will need to go into each data interface and simplify it by removing the make_or_load_nwbfile context calls, similar to this PR: move make_or_load_nwbfile logic to BaseDataInterface. #398. Individual interfaces should not implement their own .run_conversion methods, but instead should implement .add_to_nwbfile, which must take an NWBFile object.
  2. The new .add_to_nwbfile method should populate a list in self.datasets with the object ID and field name of each dataset we might want to chunk and compress (see the sketch below).
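
A minimal sketch of that contract, with illustrative details only (the exact signature and the way entries are recorded in self.datasets are placeholders, not the PR's final API):

from abc import ABC, abstractmethod
from pynwb import NWBFile


class BaseDataInterface(ABC):
    """Sketch only; the real class also carries metadata handling and more."""

    def __init__(self):
        # (object_id, field_name) pairs for datasets eligible for chunking/compression
        self.datasets = []

    @abstractmethod
    def add_to_nwbfile(self, nwbfile: NWBFile) -> None:
        """Add this interface's data to an in-memory NWBFile.

        Implementations should append an (object_id, field_name) entry to
        self.datasets for every dataset that may later be chunked/compressed.
        """
        ...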

@bendichter bendichter marked this pull request as draft May 22, 2023 13:44
@bendichter bendichter requested a review from alejoe91 May 22, 2023 13:49
@bendichter bendichter requested review from h-mayorquin, luiztauffer and CodyCBakerPhD and removed request for luiztauffer May 22, 2023 13:52
@alejoe91
Contributor

Hi @bendichter

Thanks for drafting this out. I think this is a very good start and it seems to be in line with what @CodyCBakerPhD, @h-mayorquin and I discussed last time.

IMO, the self.datasets could be optional and auto-inferred by traversing the NWBFile fields and checking which fields are DataChunkIterator, which will need to be wrapped by an IO.
We can add some helper functions for that.

@CodyCBakerPhD
Member

I was thinking more about the problem of tracking structure and dataset provenance, especially with respect to how the GUIDE would go about representing this, and arrived at the same conclusion that @alejoe91 suggests with:

IMO, the self.datasets could be optional and auto-inferred by traversing the NWBFile fields and checking which fields are DataChunkIterator, which will need to be wrapped by an IO.

That is, we can simplify this strategy greatly (and it would serve as more generic functionality beyond NeuroConv anyway) if we can decouple the problem of tracking which interfaces write which datasets from the problem of which datasets still need an IO set.

So all we should need in the end is

a) the existing NeuroConv write tools (SI/ROIExtractors, etc.) make wrapping in an IO optional; the interfaces that use those tools always set it to 'do not wrap an IO'

b) make a function that searches an in-memory NWBFile object, finds all the unbound fields that will end up as h5py.Datasets but have their IO info unset, and returns some basic information about each dataset, such as where it is in the file and maybe basic shape and data type information (very similar to how an NWB Inspector message works); see the sketch after this list

This basic information is in essence the dataset_config expected by configure_datasets, but I would unbind it from the BaseDataInterface and make it something that corresponds to the context manager responsible for writing the file, so we can use it any time we use the context, with or without interfaces/converters

c) the configure_datasets can then be used as-is
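
As a rough illustration, the per-dataset record that b) returns could be as small as this (the names DatasetInfo and find_unwrapped_datasets are hypothetical, not part of this PR):

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DatasetInfo:
    location: str                               # 'where is it in the file', e.g. "acquisition/ElectricalSeries/data"
    object_id: str                              # UUID of the containing neurodata object
    field: str                                  # field name, e.g. "data" or "timestamps"
    dtype: Optional[str] = None                 # basic data type information, if known
    maxshape: Optional[Tuple[int, ...]] = None  # basic shape information, if known


# hypothetical standalone helper, decoupled from BaseDataInterface:
# def find_unwrapped_datasets(nwbfile): yield one DatasetInfo per dataset that
# will become an h5py.Dataset but has no IO configured yet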

@bendichter
Contributor Author

I like the idea of separating configure_datasets out into a function instead of a method so it can be used outside of DataInterfaces for any PyNWB conversion.

Auto-inferring datasets would be possible. The thing is, it might be tricky to automatically determine which datasets you want to be eligible for chunking and compression (C&C), and the question becomes: how do we identify such datasets? Creating a list of eligible datasets was an attempt to get around this, but I see that it would require a lot of work to modify every DataInterface to enumerate all eligible datasets.

@alejoe91 's suggestion is to use all datasets that have a DataChunkIterator (DCI). That's not a bad idea, but you might want to apply chunking and compression even to datasets that fit in memory. For example, TimeSeries.timestamps is rarely created with a DCI, but we may still want to use C&C.

@CodyCBakerPhD 's suggestion is to include all fields that will become datasets (and don't already have an IO), which would include data like NWBFile.identifier, NWBFile.session_start_time, NWBFile.timestamps_reference_time, VoltageClampSeries.capacitance_slow, and a lot of other very small datasets that you would never want to bother with for C&C. Perhaps a better modification of @CodyCBakerPhD 's approach would be to include all non-scalar datasets. This might still catch a few small datasets that you would never want to apply C&C to, but maybe not very many. I think it's worth trying out on a few files to see what this would look like.
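
A sketch of that non-scalar filter (the helper name and the way shape is probed are assumptions; a real version would also need to handle iterators whose shape is not yet known):

def is_nonscalar_candidate(value) -> bool:
    # skip unset fields and plain strings, which serialize as scalar or attribute-like data
    if value is None or isinstance(value, (str, bytes)):
        return False
    shape = getattr(value, "shape", None)  # numpy arrays, h5py.Datasets, and similar expose .shape
    if shape is not None:
        return len(shape) > 0
    # fall back to treating non-empty sequences as non-scalar
    return isinstance(value, (list, tuple)) and len(value) > 0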

If we can agree upon an automated solution, that sounds like it would be fine. Maybe, as @alejoe91 pointed out, we can try to automate where we can, and then have DataInterfaces just handle the exceptions somehow, maybe by providing a list of datasets to exclude.

@CodyCBakerPhD
Member

which would include data like NWBFile.identifier, NWBFile.session_start_time, NWBFile.timestamps_reference_time, VoltageClampSeries.capacitance_slow, and a lot of other very small datasets that you would never want to bother with for C&C.

Yeah, forgot about those...

Would it work to restrict the inclusion rule to h5py.Datasets contained within either a TimeSeries or a VectorData object?

Brief sketch in Pythonic pseudo-code

for object_ in nwbfile.objects.values():  # search for time series or columns of tables
    if isinstance(object_, (TimeSeries, VectorData)):  # the VectorData one probably wouldn't work...
        for field in object_.fields:  # search for datasets
            if will_become_hdf5_dataset and not_wrapped_in_dataio:
                ...  # grab info, return in dataset_config

then later

# Specify common back-end to use
# Specify choices of compression methods and options (otherwise let them default based on back-end) for each dataset in the config

for dataset in dataset_config:
    object_in_nwbfile = nwbfile[dataset.path_to_dataset]
    if not wrapped_in_iterator(object_in_nwbfile):  # that is, ALWAYS wrap in a generic data chunk iterator regardless
        wrap_in_iterator(object_in_nwbfile)
    wrap_in_dataio(object_in_nwbfile)
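
For the HDF5 back-end, that final wrap_in_dataio step could map onto hdmf's H5DataIO roughly like this (a sketch; the compression choices are illustrative, not what this PR settles on):

import numpy as np
from hdmf.backends.hdf5 import H5DataIO

data = np.random.rand(100_000, 32)  # stand-in for whatever the dataset_config entry points at
wrapped = H5DataIO(
    data=data,
    compression="gzip",   # back-end-specific compression method
    compression_opts=4,   # method-specific option (gzip level)
    chunks=True,          # let HDF5 pick a chunk shape unless an iterator supplies one
)
# 'wrapped' would then be set back onto the corresponding field of the in-memory NWBFile before write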

@CodyCBakerPhD
Member

I guess that might exclude extended data types that don't inherit from either of those bases, but I'm OK with support for this only applying to core types for now

@bendichter
Contributor Author

bendichter commented May 22, 2023

@CodyCBakerPhD yes, the approach of only selecting VectorData.data, TimeSeries.data, and TimeSeries.timestamps might satisfy most of our needs right now.

What is the advantage of always wrapping in a DCI?

@CodyCBakerPhD
Member

What is the advantage of always wrapping in a DCI?

Personally it's my preferred way of communicating chunk information, since that is usually where the io.write procedure in HDMF requests that info from, as I recall.

@CodyCBakerPhD
Member

Personally it's my preferred way of communicating chunk information

More specifically, using the GenericDataChunkIterator to generate a good default chunk_shape
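
To make that concrete, here is a minimal (illustrative) wrapper around an in-memory array using hdmf's GenericDataChunkIterator; real interfaces would read lazily from their source rather than hold the full array:

import numpy as np
from hdmf.data_utils import GenericDataChunkIterator


class InMemoryArrayChunkIterator(GenericDataChunkIterator):
    """Illustrative subclass; only these three methods need to be provided."""

    def __init__(self, array, **kwargs):
        self._array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self._array[selection]

    def _get_maxshape(self):
        return self._array.shape

    def _get_dtype(self):
        return self._array.dtype


iterator = InMemoryArrayChunkIterator(np.random.rand(100_000, 64))
print(iterator.recommended_chunk_shape())  # automatically computed default chunk shape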

bendichter and others added 3 commits May 22, 2023 14:05
* configure_datasets moved to function
* automatically include data in TimeSeries and VectorData
@bendichter
Contributor Author

@alejoe91 and @CodyCBakerPhD what do you think about this?

for object_ in nwbfile.objects.values():
    if isinstance(object_, TimeSeries) or isinstance(object_, VectorData):
        for field in object_.__fields__:  # search for datasets
            if field in ("data", "timestamps") and not isinstance(getattr(object_, field), DataIO):
@CodyCBakerPhD
Member
May 22, 2023

I was wondering about that, but can we for sure guarantee that the main dataset of every child of TimeSeries will be under the data field?

@bendichter
Contributor Author

Every descendant of TimeSeries should have a data field; that is a required argument. They may also have other datasets, and those other datasets would be missed. That could potentially be an issue if those other datasets are big, but usually they are just small metadata that you wouldn't want to compress anyway.

@h-mayorquin
Collaborator

Another possible option for filtering which datasets are extracted for wrapping for chunking and compression is to just set a minimal object size (in memory). Don't we have the types and shapes of the objects available to determine this?

            if getattr(object_, field) is not None and not isinstance(getattr(object_, field), DataIO):
                yield object_.id, field
    if isinstance(object_, DynamicTable):
        for field in getattr(object_, "colnames"):
@h-mayorquin
Collaborator

Out of curiosity, what is this condition for?
isinstance(getattr(object_, field), DataIO)

Is this so we don't overwrite previous configuration?

Also, what happens in this function if the object was previously bound, as in we read it from a file already and we are appending new datasets? Is it related to this?

@CodyCBakerPhD
Member

The major ideas from this have either been merged from [Backend Configuration I-VI] or remain TODO items tracked in #475.
