
[Feature]: separation of NWBFile formation and dataset configuration #441

Closed · 2 tasks done
bendichter opened this issue May 9, 2023 · 11 comments

bendichter (Contributor) commented May 9, 2023

What would you like to see added to NeuroConv?

We are in the process of generalizing NeuroConv to support HDF5 and Zarr backends and found that there are many places in the code that use H5DataIO, which is causing problems for the Zarr backend. Another problem is that you cannot set conversion options individually for different datasets, which has become a problem in some applications. The proposed solution is to remove all of these H5DataIO calls and instead provide a separate interface to alter these settings on a per-dataset basis in a later step. This provides a separation between setting the values of the data and metadata and setting the storage options.

Here is a proposal based on our conversation and some follow-up conversations with the NWB team:

  1. Remove all H5DataIO and any other backend-specific code from the underlying methods of run_conversion, and move all other code into a new method, named something like add_to_nwb(). For each large dataset (where an H5DataIO was previously used) we would instead record an entry in a dict, something like: data_interface.dataset_configs[(container_id, field)] = dict(**default_io_kwargs)

  2. The BaseDataInterface would have an additional optional backend arg. This can be called during init or can be set later with data_interface.set_backend("hdf5|zarr") with a default "hdf5". NWBConverter would similarly have a backend arg that propagates to inner interfaces.

  3. BaseDataInterface.set_dataset_configs() would take a dict of the same form as dataset_configs: {(container_id, field): dict(**io_kwargs)}.

  4. BaseDataInterface.write() would first apply all the dataset_configs and then call the NWB**IO.write() command based on the backend chosen.

  5. BaseDataInterface.run_conversion() would stick around for backwards compatibility, and would call .add_to_nwb() and then .write(). We could allow for some basic compression options as before, but this method would not allow for dataset-specific conversion options.

Thoughts on this approach?

One thing I'm not entirely happy with is that it will be difficult to tell which container you are adjusting from the tuple (container_id, field), because a user won't be able to see the name, location, or neurodata type of that container. It might be better for this dict to be {readable_key: dict(container_id=uuid, field=str, io_kwargs=dict)}.
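
For concreteness, here is a minimal sketch of the two dict shapes being weighed here (all keys, values, and the readable-key format are illustrative placeholders, not a settled API):

# Illustrative only; keys, values, and the readable-key format are placeholders.

# Option A: keyed by (container_id, field) -- compact, but opaque to the user.
dataset_configs = {
    ("example-container-uuid", "data"): dict(compression="gzip", compression_opts=4),
}

# Option B: keyed by a human-readable name, with the identifying info moved into the value.
dataset_configs = {
    "acquisition/ElectricalSeries/data": dict(
        container_id="example-container-uuid",
        field="data",
        io_kwargs=dict(compression="gzip", compression_opts=4),
    ),
}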

Is your feature request related to a problem?

No response

Do you have any interest in helping implement the feature?

No.


bendichter (Author) commented May 9, 2023

This is closely related to and supersedes #398 and #436, and would depend on hdmf-dev/hdmf#856.

CodyCBakerPhD (Member) commented May 10, 2023

One quick thing to check on the make_or_load_nwbfile context: how does everyone feel about using joint contexts?

An example of this in practice is

with open(file="C:/Users/Raven/Downloads/test_manager_raw.txt") as file, NWBHDF5IO(path="...", mode="r+") as io:
    lines = file.readlines()
    nwbfile = io.read()
    # put lines in it somewhere
    io.write(...)

and I would consider using it to break up the enter/exit logic similar to

with make_or_load_nwbfile(...) as nwbfile, write_nwbfile(backend="zarr") as neuroconv_io:
    converter.run_conversion(nwbfile)  # or anything else that 'assembles' the file structure
    neuroconv_io.prepare_for_write(nwbfile, backend="...", dataset_configurations=...)  # where all the compression and other backend settings get evaluated and applied
    neuroconv_io.write()

Otherwise, right now the write step silently occurs during exit of make_or_load_nwbfile, which may not be obvious from the function name.
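
To illustrate the concern, a rough sketch only (not the actual make_or_load_nwbfile implementation): the write happens inside the context manager's exit path, so nothing at the call site says "write".

from contextlib import contextmanager

from pynwb import NWBHDF5IO

@contextmanager
def make_or_load_nwbfile_sketch(nwbfile_path, nwbfile):
    io = NWBHDF5IO(nwbfile_path, mode="w")
    try:
        yield nwbfile      # the caller populates the in-memory NWBFile here
        io.write(nwbfile)  # the write happens implicitly on clean exit of the 'with' block
    finally:
        io.close()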

bendichter (Author) commented May 11, 2023

@CodyCBakerPhD This is a good start. I see you are making the prepare_for_write call on the write_nwbfile context, which already knows the backend, so you should be able to omit the backend argument within the context. I think we can combine make_or_load_nwbfile into write_nwbfile and combine prepare_for_write with write. Something like:

from typing import Optional

from pynwb import NWBHDF5IO, H5DataIO
from hdmf_zarr import NWBZarrIO, ZarrDataIO

DataIOMap = {"hdf5": H5DataIO, "zarr": ZarrDataIO}
writers = {"hdf5": NWBHDF5IO, "zarr": NWBZarrIO}
dset_config_defaults = {"hdf5": dict(compression="gzip", compression_opts=4), "zarr": dict(...)}

class BaseDataInterface:
    ...
    def run_conversion(self, ..., backend="hdf5"):
        with make_or_load_nwbfile(...) as nwbfile:
            # call signature unchanged except for the optional backend arg
            nwbfile = self.add_to_nwbfile(nwbfile)  # or anything else that 'assembles' the file structure
            self.write(nwbfile, filepath, backend=backend, dataset_configurations=...)

    def configure_datasets(self, nwbfile, dataset_configurations: dict, backend="hdf5"):
        for container_id, field in self.get_datasets():
            dset_config = dataset_configurations.get((container_id, field), dset_config_defaults[backend])
            container = nwbfile.get_object(container_id)
            setattr(container, field, DataIOMap[backend](data=getattr(container, field), **dset_config))

    def write(self, nwbfile, filepath, backend="hdf5", dataset_configurations: Optional[dict] = None, mode="r+"):
        if dataset_configurations:
            self.configure_datasets(nwbfile, dataset_configurations, backend=backend)
        with writers[backend](filepath, mode=mode) as io:
            io.write(nwbfile)

bendichter (Author) commented May 23, 2023

After talking with @CodyCBakerPhD a bit, I'm realizing this will probably need to be broken up into a few phases.

Phase 1: Create an .add_to_nwbfile method for BaseDataInterface and its descendants. This method takes an NWBFile (required) and does not accept an nwbfile_path. This will require changing e.g. write_recording and write_sorting to add_recording and add_sorting. This separates the methods for assembling the NWBFile from the methods for writing it, allowing for different types of data specifications and writing methods. .add_to_nwbfile will be called within the .run_conversion method. Technically, .run_conversion will still work on DataInterfaces, but it will not be recommended; when called, it will simply call .add_to_nwbfile under the hood.

Draft PR: #455

Phase 2: Move all H5DataIO calls out of .add_to_nwbfile methods and into a separate method: configure_datasets. This allows us to move forward without making any changes to the API or the resulting files.

This step requires resolution of hdmf-dev/hdmf#868

Phase 3: Add in support for multiple backends. Each backend will be defined using a dataclass, which specifies the NWB IO class, the data IO class, and the default IO settings. .run_conversion will optionally take a backend as an argument, and the appropriate classes will be used when reading, configuring datasets, and writing.
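
A hedged sketch of what such a backend dataclass could look like (class and attribute names are placeholders, not the final neuroconv API; assumes hdmf_zarr exposes NWBZarrIO and ZarrDataIO):

from dataclasses import dataclass, field

from hdmf_zarr import NWBZarrIO, ZarrDataIO
from pynwb import NWBHDF5IO, H5DataIO

@dataclass
class BackendConfiguration:  # hypothetical name
    nwb_io: type                         # IO class used to open and write the file
    data_io: type                        # per-dataset wrapper (compression, chunking, ...)
    default_io_kwargs: dict = field(default_factory=dict)

BACKENDS = {
    "hdf5": BackendConfiguration(
        nwb_io=NWBHDF5IO, data_io=H5DataIO, default_io_kwargs=dict(compression="gzip", compression_opts=4)
    ),
    "zarr": BackendConfiguration(nwb_io=NWBZarrIO, data_io=ZarrDataIO),
}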

Phase 4: Add support for external functions get_configurable_datasets and configure_datasets which can be used to adjust the data IO of individual datasets. The .configure_datasets method will call these functions.

Phase 5: Expose these features in GUIDE

h-mayorquin (Collaborator) commented May 24, 2023

I like the idea of adding separate interface methods for appending (add_to_nwb) and writing. In my view, having separate methods instead of a single run_conversion that writes or appends depending on the input is much cleaner and more transparent.

Looking at #455, it seems that this run_conversion will still work for NWBConverter and the ConverterPipe. I think it would be a good idea to add a write method to those classes as well. I like the idea of having consistent methods between stand-alone interfaces and the converters, such as in:

https://neuroconv.readthedocs.io/en/main/conversion_examples_gallery/combinations/spikeglx_and_phy.html

h-mayorquin (Collaborator) commented Jul 5, 2023

I am thinking more about this:

Where is the metadata supposed to be added?

Discussing with @CodyCBakerPhD in another issue, it seems the idea there is that the write method will take metadata as an argument, which makes me realize I am confused: I thought metadata would be passed to the add_to_nwb method.

Currently, the abstract method does not have the keyword:

@abstractmethod
def add_to_nwbfile(self, nwbfile: NWBFile, **conversion_options) -> None:
    raise NotImplementedError()

But the instance method in BaseRecordingExtractorInterface does:

def add_to_nwbfile(
    self,
    nwbfile: NWBFile,
    metadata: Optional[dict] = None,
    stub_test: bool = False,
    starting_time: Optional[float] = None,
    write_as: Literal["raw", "lfp", "processed"] = "raw",
    write_electrical_series: bool = True,
    compression: Optional[str] = "gzip",
    compression_opts: Optional[int] = None,
    iterator_type: str = "v2",
    iterator_opts: Optional[dict] = None,
):

And that seems correct to me. One use case of metadata is to add the names of the devices, the properties of the recording device channels, or missing keyword arguments for pynwb objects. It seems appropriate that those things should be passed when you are building the in-memory NWB file.

Are there other use cases of metadata that are used in the context of writing?

If not, that's nice, because then we can simply add the metadata merging when we call add in the NWBConverter:

def add_to_nwbfile(self, nwbfile: NWBFile, metadata, conversion_options: Optional[dict] = None) -> None:
    conversion_options = conversion_options or dict()
    for interface_name, data_interface in self.data_interface_objects.items():
        data_interface.add_to_nwbfile(
            nwbfile=nwbfile, metadata=metadata, **conversion_options.get(interface_name, dict())
        )

Maybe validation could also happen there.

How are you guys thinking about it?

bendichter (Author) commented Jul 6, 2023

@h-mayorquin Yes, that's the way I am thinking about it: metadata is added via data_interface.add_to_nwbfile, and then data_interface.write handles data packing settings but does not change any of the data values.

h-mayorquin (Collaborator):

All right, then it is just missing from the signature in the base class.

bendichter (Author):

Well, actually, that was done intentionally. Note that with the **kwargs, you don't really need to put metadata in that signature; it will work even if not explicitly defined in the abstract class. There are edge cases where interfaces don't need a metadata arg at all, and if you put metadata in the abstract class then code inspectors complain that the downstream class methods do not match the abstract method signature.
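
A minimal illustration of that point (hypothetical classes, not neuroconv code): because the abstract method only declares **conversion_options, subclasses can accept a metadata argument or not without mismatching the base signature.

from abc import ABC, abstractmethod

class BaseInterfaceSketch(ABC):
    @abstractmethod
    def add_to_nwbfile(self, nwbfile, **conversion_options) -> None:
        raise NotImplementedError()

class RecordingInterfaceSketch(BaseInterfaceSketch):
    # accepts metadata explicitly; still compatible with the abstract signature
    def add_to_nwbfile(self, nwbfile, metadata=None, stub_test=False, **conversion_options) -> None:
        ...

class EventsInterfaceSketch(BaseInterfaceSketch):
    # genuinely has no use for a metadata argument
    def add_to_nwbfile(self, nwbfile, **conversion_options) -> None:
        ...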

h-mayorquin (Collaborator):

I see.
I was misled about the purpose because the kwargs here are named conversion_options. But this is probably temporary until we fully separate adding from writing.

CodyCBakerPhD (Member):

This is mostly done; only more thorough testing of Zarr output per interface remains.

Good to close?
