
[Feature]: separation of NWBFile formation and dataset configuration #441

Closed · 2 tasks done
bendichter opened this issue May 9, 2023 · 11 comments

bendichter (Contributor) commented May 9, 2023

What would you like to see added to NeuroConv?

We are in the process of generalizing NeuroConv to support HDF5 and Zarr backends and found that there are many places in the code that use H5DataIO, which is causing problems for the Zarr backend. Another problem is that you cannot set conversion options individually for different datasets, which has become a problem in some applications. The proposed solution is to remove all of these H5DataIO calls and instead provide a separate interface to alter these settings on a per-dataset basis in a later step. This provides a separation between setting the values of the data and metadata and setting the storage options.

Here is a proposal based on our conversation and some follow-up conversations with the NWB team:

  1. Remove all H5DataIO and any other backend-specific code from the underlying methods of run_conversion, and move all other code into a new method, named something like add_to_nwb(). For each large dataset (where an H5DataIO was previously used) we would instead record an entry in a dict, something like: data_interface.dataset_configs[(container_id, field)] = dict(**default_io_kwargs)

  2. The BaseDataInterface would have an additional optional backend arg. This can be called during init or can be set later with data_interface.set_backend("hdf5|zarr") with a default "hdf5". NWBConverter would similarly have a backend arg that propagates to inner interfaces.

  3. BaseDataInterface.set_dataset_configs() would take a dict of the same form as dataset_configs: {(container_id, field): dict(**io_kwargs)}.

  4. BaseDataInterface.write() would first apply all the dataset_configs and then call the NWB**IO.write() command based on the backend chosen.

  5. BaseDataInterface.run_conversion() would stick around for backwards compatibility, and would call .add_to_nwb() and then .write(). We could allow for some basic compression options as before, but this method would not allow for dataset-specific conversion options.

Thoughts on this approach?

One thing I'm not entirely happy with is that it will be difficult to tell which container you are adjusting from the tuple (container_id, field), because a user won't be able to see the name, location, or neurodata type of that container. It might be better for this dict to be {readable_key: dict(container_id=uuid, field=str, io_kwargs=dict)}.
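
For concreteness, here is a minimal sketch of the two dict shapes being weighed here (all keys, values, and the readable-key format are illustrative placeholders, not a settled API):

# Illustrative only; keys, values, and the readable-key format are placeholders.

# Option A: keyed by (container_id, field) -- compact, but opaque to the user.
dataset_configs = {
    ("example-container-uuid", "data"): dict(compression="gzip", compression_opts=4),
}

# Option B: keyed by a human-readable name, with the identifying info moved into the value.
dataset_configs = {
    "acquisition/ElectricalSeries/data": dict(
        container_id="example-container-uuid",
        field="data",
        io_kwargs=dict(compression="gzip", compression_opts=4),
    ),
}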

Is your feature request related to a problem?

No response

Do you have any interest in helping implement the feature?

No.


bendichter (Author) commented May 9, 2023

This is closely related to and supersedes #398 and #436, and would depend on hdmf-dev/hdmf#856.

CodyCBakerPhD (Member) commented May 10, 2023

One quick thing to check on the make_or_load_nwbfile context: how does everyone feel about using joint contexts?

An example of this in practice is

with open(file="C:/Users/Raven/Downloads/test_manager_raw.txt") as file, NWBHDF5IO(path="...", mode="r+") as io:
    lines = file.readlines()
    nwbfile = io.read()
    # put lines in it somewhere
    io.write(...)

and I would consider using it to break up the enter/exit logic similar to

with make_or_load_nwbfile(...) as nwbfile, write_nwbfile(backend="zarr") as neuroconv_io:
    converter.run_conversion(nwbfile)  # or anything else that 'assembles' the file structure
    neuroconv_io.prepare_for_write(nwbfile, backend="...", dataset_configurations=...)  # where all the compression and other backend settings get evaluated and applied
    neuroconv_io.write()

Otherwise, right now the write step silently occurs during exit of make_or_load_nwbfile, which may not be obvious from the function name.
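
To illustrate the concern, a rough sketch only (not the actual make_or_load_nwbfile implementation): the write happens inside the context manager's exit path, so nothing at the call site says "write".

from contextlib import contextmanager

from pynwb import NWBHDF5IO

@contextmanager
def make_or_load_nwbfile_sketch(nwbfile_path, nwbfile):
    io = NWBHDF5IO(nwbfile_path, mode="w")
    try:
        yield nwbfile      # the caller populates the in-memory NWBFile here
        io.write(nwbfile)  # the write happens implicitly on clean exit of the 'with' block
    finally:
        io.close()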

bendichter (Author) commented May 11, 2023

@CodyCBakerPhD This is a good start. I see you are making the prepare_for_write call on the write_nwbfile context, which already knows the backend, so you should be able to omit the backend argument within the context. I think we can combine make_or_load_nwbfile into write_nwbfile and combine prepare_for_write with write. Something like:

from typing import Optional

from pynwb import NWBHDF5IO, H5DataIO
from hdmf_zarr import NWBZarrIO, ZarrDataIO

DataIOMap = {"hdf5": H5DataIO, "zarr": ZarrDataIO}
writers = {"hdf5": NWBHDF5IO, "zarr": NWBZarrIO}
dset_config_defaults = {"hdf5": dict(compression="gzip", compression_opts=4), "zarr": dict(...)}

class BaseDataInterface:
    ...
    def run_conversion(self, ..., backend="hdf5"):
        with make_or_load_nwbfile(...) as nwbfile:
            # call signature unchanged except for the optional backend arg
            nwbfile = self.add_to_nwbfile(nwbfile)  # or anything else that 'assembles' the file structure
            self.write(nwbfile, filepath, backend=backend, dataset_configurations=...)

    def configure_datasets(self, nwbfile, dataset_configurations: dict, backend="hdf5"):
        for container_id, field in self.get_datasets():
            dset_config = dataset_configurations.get((container_id, field), dset_config_defaults[backend])
            container = nwbfile.get_object(container_id)
            setattr(container, field, DataIOMap[backend](data=getattr(container, field), **dset_config))

    def write(self, nwbfile, filepath, backend="hdf5", dataset_configurations: Optional[dict] = None, mode="r+"):
        if dataset_configurations:
            self.configure_datasets(nwbfile, dataset_configurations, backend=backend)
        with writers[backend](filepath, mode=mode) as io:
            io.write(nwbfile)

bendichter (Author) commented May 23, 2023

After talking with @CodyCBakerPhD a bit, I'm realizing this will probably need to be broken up into a few phases.

Phase 1: Create an .add_to_nwbfile method for BaseDataInterface and its descendants. This method takes an NWBFile (required) and does not accept an nwbfile_path. This will require changing e.g. write_recording and write_sorting to add_recording and add_sorting. This separates the methods for assembling the NWBFile from the methods for writing it, allowing for different types of data specifications and writing methods. .add_to_nwbfile will be called within the .run_conversion method. Technically, .run_conversion will still work on DataInterfaces, but it will not be recommended; when called, it will simply call .add_to_nwbfile under the hood.

Draft PR: #455

Phase 2: Move all H5DataIO calls out of .add_to_nwbfile methods and into a separate method: configure_datasets. This allows us to move forward without making any changes to the API or the resulting files.

This step requires resolution of hdmf-dev/hdmf#868

Phase 3: Add in support for multiple backends. Each backend will be defined using a dataclass, which specifies the NWB IO class, the data IO class, and the default IO settings. .run_conversion will optionally take a backend as an argument, and the appropriate classes will be used when reading, configuring datasets, and writing.
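
A hedged sketch of what such a backend dataclass could look like (class and attribute names are placeholders, not the final neuroconv API; assumes hdmf_zarr exposes NWBZarrIO and ZarrDataIO):

from dataclasses import dataclass, field

from hdmf_zarr import NWBZarrIO, ZarrDataIO
from pynwb import NWBHDF5IO, H5DataIO

@dataclass
class BackendConfiguration:  # hypothetical name
    nwb_io: type                         # IO class used to open and write the file
    data_io: type                        # per-dataset wrapper (compression, chunking, ...)
    default_io_kwargs: dict = field(default_factory=dict)

BACKENDS = {
    "hdf5": BackendConfiguration(
        nwb_io=NWBHDF5IO, data_io=H5DataIO, default_io_kwargs=dict(compression="gzip", compression_opts=4)
    ),
    "zarr": BackendConfiguration(nwb_io=NWBZarrIO, data_io=ZarrDataIO),
}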

Phase 4: Add support for external functions get_configurable_datasets and configure_datasets which can be used to adjust the data IO of individual datasets. The .configure_datasets method will call these functions.

Phase 5: Expose these features in GUIDE

h-mayorquin (Collaborator) commented May 24, 2023

I like the idea of adding separate interface methods for appending (add_to_nwb) and writing. In my view, having separate methods instead of a single run_conversion that writes or appends depending on the input is much cleaner and more transparent.

Looking at #455, it seems that this run_conversion will still work for NWBConverter and the ConverterPipe. I think it would be a good idea to add a write method to those classes as well. I like the idea of having consistent methods between stand-alone interfaces and the converters, such as in:

https://neuroconv.readthedocs.io/en/main/conversion_examples_gallery/combinations/spikeglx_and_phy.html

h-mayorquin (Collaborator) commented Jul 5, 2023

I am thinking more about this:

Where is the metadata supposed to be added?

Discussing with @CodyCBakerPhD in another issue, it seems the idea there is that the write method will take metadata as an argument, which makes me realize I am confused: I thought metadata would be passed to the add_to_nwb method.

Currently, the abstract method does not have the keyword:

@abstractmethod
def add_to_nwbfile(self, nwbfile: NWBFile, **conversion_options) -> None:
    raise NotImplementedError()

But the instance method in BaseRecordingExtractorInterface does:

def add_to_nwbfile(
    self,
    nwbfile: NWBFile,
    metadata: Optional[dict] = None,
    stub_test: bool = False,
    starting_time: Optional[float] = None,
    write_as: Literal["raw", "lfp", "processed"] = "raw",
    write_electrical_series: bool = True,
    compression: Optional[str] = "gzip",
    compression_opts: Optional[int] = None,
    iterator_type: str = "v2",
    iterator_opts: Optional[dict] = None,
):

And that seems correct to me. One use case of metadata is to add the names of the devices, the properties of the recording device channels, or missing keyword arguments for pynwb objects. It seems appropriate that those things should be passed when you are building the in-memory NWB file.

Are there other use cases of metadata that are used in the context of writing?

If not, that's nice, because then we can simply add the metadata merging when we call add in the NWBConverter:

def add_to_nwbfile(self, nwbfile: NWBFile, metadata, conversion_options: Optional[dict] = None) -> None:
    conversion_options = conversion_options or dict()
    for interface_name, data_interface in self.data_interface_objects.items():
        data_interface.add_to_nwbfile(
            nwbfile=nwbfile, metadata=metadata, **conversion_options.get(interface_name, dict())
        )

Maybe validation could also happen there.

How are you guys thinking about it?

bendichter (Author) commented Jul 6, 2023

@h-mayorquin Yes, that's the way I am thinking about it: metadata is added via data_interface.add_to_nwbfile, and then data_interface.write handles data packing settings but does not change any of the data values.

h-mayorquin (Collaborator):

All right, then it is just missing from the signature in the base class.

bendichter (Author):

Well, actually, that was done intentionally. Note that with the **kwargs, you don't really need to put metadata in that signature; it will work even if not explicitly defined in the abstract class. There are edge cases where interfaces don't need a metadata arg at all, and if you put metadata in the abstract class then code inspectors complain that the downstream class methods do not match the abstract method signature.
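
A minimal illustration of that point (hypothetical classes, not neuroconv code): because the abstract method only declares **conversion_options, subclasses can accept a metadata argument or not without mismatching the base signature.

from abc import ABC, abstractmethod

class BaseInterfaceSketch(ABC):
    @abstractmethod
    def add_to_nwbfile(self, nwbfile, **conversion_options) -> None:
        raise NotImplementedError()

class RecordingInterfaceSketch(BaseInterfaceSketch):
    # accepts metadata explicitly; still compatible with the abstract signature
    def add_to_nwbfile(self, nwbfile, metadata=None, stub_test=False, **conversion_options) -> None:
        ...

class EventsInterfaceSketch(BaseInterfaceSketch):
    # genuinely has no use for a metadata argument
    def add_to_nwbfile(self, nwbfile, **conversion_options) -> None:
        ...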

h-mayorquin (Collaborator):

I see.
I was misled about the purpose because the kwargs here are named conversion_options. But this is probably temporary until we fully separate adding from writing.

CodyCBakerPhD (Member):

This is mostly done; only more thorough testing of Zarr output per interface remains.

Good to close?
