[Feature]: separation of NWBFile formation and dataset configuration #441
Comments
This is closely related to and supersedes #398 and #436 and would depend on hdmf-dev/hdmf#856 |
One quick thing to check on the ... An example of this in practice is:

```python
with open(file="C:/Users/Raven/Downloads/test_manager_raw.txt") as file, NWBHDF5IO(path="...", mode="r+") as io:
    lines = file.readlines()
    nwbfile = io.read()
    # put lines in it somewhere
    io.write(...)
```

and I would consider using it to break up the enter/exit logic, similar to:

```python
with make_or_load_nwbfile(...) as nwbfile, write_nwbfile(backend="zarr") as neuroconv_io:
    converter.run_conversion(nwbfile)  # or anything else that 'assembles' the file structure
    neuroconv_io.prepare_for_write(nwbfile, backend="...", dataset_configurations=...)  # where all the compression and other backend settings get evaluated and set
    neuroconv_io.write()
```

otherwise right now the ... |
@CodyCBakerPhD This is a good start. I see you are making the ...

```python
# Pseudocode sketch; assumes H5DataIO, ZarrDataIO, NWBHDF5IO, ZarrIO, make_or_load_nwbfile, and Optional are imported.
DataIOMap = {"hdf5": H5DataIO, "zarr": ZarrDataIO}
writers = {"hdf5": NWBHDF5IO, "zarr": ZarrIO}
dset_config_defaults = {"hdf5": dict(compression="gzip", compression_opts=4), "zarr": dict(...)}


class BaseDataInterface:
    ...

    def run_conversion(self, ..., backend="hdf5"):
        # call signature unchanged except for optional backend arg
        with make_or_load_nwbfile(...) as nwbfile:
            nwbfile = self.add_to_nwbfile(nwbfile)  # or anything else that 'assembles' the file structure
            self.write(nwbfile, filepath, backend=backend, dset_configurations=...)

    def configure_datasets(self, dset_configurations: dict, backend="hdf5"):
        for container_id, field in self.get_datasets():
            dset_config = dset_configurations.get((container_id, field), dset_config_defaults[backend])
            setattr(nwbfile.get_object(container_id), field, DataIOMap[backend](data=data, **dset_config))

    def write(self, nwbfile, filepath, backend="hdf5", dset_configurations: Optional[dict] = None, mode="r+"):
        if dset_configurations:
            self.configure_datasets(dset_configurations, backend=backend)
        with writers[backend](filepath, mode=mode) as io:
            io.write(nwbfile)
```
|
After talking with @CodyCBakerPhD a bit, I'm realizing this will probably need to be broken up into a few phases.

Phase 1: Create ... Draft PR: #455
Phase 2: Move all ... This step requires resolution of hdmf-dev/hdmf#868
Phase 3: Add in support for multiple backends. Each backend will be defined using a dataclass, which specifies the NWBIO, the data IO, and the default IO settings (see the sketch after this list).
Phase 4: Add support for external functions
Phase 5: Expose these features in GUIDE
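Phase 3 could look roughly like the following dataclass sketch; every name here (`BackendConfiguration` and its fields) is a hypothetical placeholder rather than the eventual NeuroConv API:

```python
from dataclasses import dataclass, field


# Hypothetical sketch of a per-backend definition; class and attribute names are illustrative only.
@dataclass
class BackendConfiguration:
    name: str  # e.g. "hdf5" or "zarr"
    nwb_io_class: type  # the NWB IO class, e.g. pynwb.NWBHDF5IO or hdmf_zarr.NWBZarrIO
    data_io_class: type  # the DataIO wrapper, e.g. H5DataIO or ZarrDataIO
    default_dataset_io_kwargs: dict = field(default_factory=dict)  # e.g. dict(compression="gzip", compression_opts=4)


# A registry keyed by backend name could then replace scattered backend-specific branches
# (the IO classes named in the comments above would need to be imported for a real registry):
# BACKENDS = {"hdf5": BackendConfiguration("hdf5", NWBHDF5IO, H5DataIO, dict(compression="gzip", compression_opts=4))}
```
|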
I like the idea of adding separate interface methods for both appending (...) ... Looking at #455, it seems that this ... |
I am thinking more about this: where is the metadata supposed to be added? Discussing with @CodyCBakerPhD in another issue, it seems the idea there is that the write method will have it as an argument, which makes me realize I am confused, as I thought it would be in the ...

Currently the abstract class does not have the keyword:

neuroconv/src/neuroconv/basedatainterface.py, lines 54 to 57 in 25165e7

But the instance method in

neuroconv/src/neuroconv/datainterfaces/ecephys/baserecordingextractorinterface.py, lines 264 to 276 in 25165e7

does, and that seems correct to me. One use case of metadata is to add the names of the devices, the properties of the recording device channels, or missing keyword arguments for pynwb objects. It seems appropriate that those things should be passed when you are building the in-memory NWB file. Are there other use cases of metadata in the context of writing? If not, that's nice, because then we can simply add the metadata merging when we call add in the NWBConverter:

neuroconv/src/neuroconv/nwbconverter.py, lines 111 to 116 in 25165e7

Maybe validation could happen there as well. How are you guys thinking about it?
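As a rough sketch of the kind of merging being discussed (not the actual implementation; see the referenced lines for the real code, and note that `dict_deep_update` and the converter attributes are assumed from the NeuroConv codebase), merging user-supplied metadata over the interface defaults at add-time could look something like:

```python
from neuroconv.utils import dict_deep_update  # assumed helper for recursive dict merging


class ConverterSketch:  # illustrative stand-in for NWBConverter, not the real class
    def add_to_nwbfile(self, nwbfile, metadata=None, conversion_options=None):
        conversion_options = conversion_options or dict()
        merged_metadata = self.get_metadata()  # defaults gathered from all interfaces
        if metadata is not None:
            merged_metadata = dict_deep_update(merged_metadata, metadata)  # user-supplied values win
        self.validate_metadata(metadata=merged_metadata)
        # Each interface then receives the merged metadata while building the in-memory NWBFile.
        for interface_name, interface in self.data_interface_objects.items():
            interface.add_to_nwbfile(
                nwbfile=nwbfile,
                metadata=merged_metadata,
                **conversion_options.get(interface_name, dict()),
            )
```
|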
@h-mayorquin yes, that's the way I am thinking about it. `metadata` is added via ... |
All right, then it is just missing from the signature in the base class. |
Well, actually that was done intentionally. Note that with the `**kwargs`, you don't really need to put `metadata` in that signature; it will work even if not explicitly defined in the abstract class. There are edge cases where interfaces actually don't need a ...
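A minimal, hypothetical illustration of why this works (the class and argument names below are placeholders, not NeuroConv code):

```python
class AbstractInterface:
    def run_conversion(self, nwbfile_path, **kwargs):
        # The abstract signature never names `metadata`; it simply forwards whatever it receives.
        self.add_to_nwbfile(**kwargs)

    def add_to_nwbfile(self, **kwargs):
        raise NotImplementedError


class ConcreteInterface(AbstractInterface):
    def add_to_nwbfile(self, metadata=None, **kwargs):
        # A concrete interface can declare `metadata` explicitly; another could omit it entirely.
        print(f"adding data with metadata={metadata}")


# `metadata` still reaches the subclass even though the base signature never mentions it:
ConcreteInterface().run_conversion("test.nwb", metadata={"NWBFile": {"session_description": "demo"}})
```
|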
I see. |
This is mostly done; only more thorough testing of Zarr output per interface remains. Good to close? |
What would you like to see added to NeuroConv?
We are in the process of generalizing NeuroConv to support HDF5 and Zarr backends and found that there are many places in the code that use `H5DataIO`, which is causing problems for the Zarr backend. Another problem is that you cannot set conversion options individually for different datasets, which has become a problem in some applications. The proposed solution is to remove all of these `H5DataIO` calls and instead provide a separate interface to alter these settings on a per-dataset basis in a later step. This provides a separation between setting the values of the data and metadata and setting the storage options.

Here is a proposal based on our conversation and some follow-up conversations with the NWB team:
- Remove all `H5DataIO` and any other backend-specific code from the underlying methods of `run_conversion`, and move all other code into a new method, named something like `add_to_nwb()`. For each large dataset (where an `H5DataIO` was previously used) we would instead add each container to a dict. Something like: `data_interface.dataset_configs[(container_id, field)] = dict(**default_io_kwargs)`
- The `BaseDataInterface` would have an additional optional `backend` arg. This can be set during init or later with `data_interface.set_backend("hdf5|zarr")`, with a default of `"hdf5"`. `NWBConverter` would similarly have a `backend` arg that propagates to inner interfaces.
- `BaseDataInterface.set_dataset_configs()` would take args of the same form as `dataset_configs`: `**{(container_id, field): dict(**io_kwargs)}`.
- `BaseDataInterface.write()` would first apply all the `dataset_configs` and then call the `NWB**IO.write()` command based on the backend chosen.
- `BaseDataInterface.run_conversion()` would stick around for backwards compatibility, and would call `.add_to_nwb()` and then `.write()`. We could allow for some basic compression options as before, but this method would not allow for dataset-specific conversion options.

Thoughts on this approach? (See the rough usage sketch below.)
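As a rough illustration of how this proposed flow might read from the user's side (everything here is hypothetical: `ExampleDataInterface`, `container_id`, `nwbfile`, `metadata`, and the IO settings are placeholders, and none of these methods exist yet):

```python
# Hypothetical usage sketch of the proposed API, not existing NeuroConv code.
interface = ExampleDataInterface(file_path="path/to/raw_data.dat")  # placeholder interface class

# Choose the storage backend (could also be passed at init); "hdf5" would remain the default.
interface.set_backend("zarr")

# Assemble the in-memory NWBFile with data and metadata, with no backend-specific wrapping yet.
interface.add_to_nwb(nwbfile=nwbfile, metadata=metadata)

# Configure storage options per dataset, keyed by (container_id, field) as described above.
interface.set_dataset_configs({(container_id, "data"): dict(compression="gzip", compression_opts=4)})

# Apply the dataset configs, then write with the NWB IO class that matches the chosen backend.
interface.write(nwbfile, filepath="path/to/output.nwb")
```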
One thing I'm not entirely happy with is that it will be difficult to see which container you are adjusting with the tuple `(container_id, field)`, because a user won't be able to see the name, location, or neurodata type of that container. It might be better for this dict to be `{readable_key: dict(container_id=uuid, field=str, io_kwargs=dict)}`.
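For comparison, the two candidate shapes for this configuration dict might look like (values are illustrative only):

```python
# Tuple-keyed form from the proposal above: compact, but hard to tell which container is meant.
dataset_configs = {
    ("1f4a...-container-uuid", "data"): dict(compression="gzip", compression_opts=4),
}

# Readable-key alternative: a human-friendly key, with the container details moved into the value.
dataset_configs = {
    "acquisition/ElectricalSeries/data": dict(
        container_id="1f4a...-container-uuid",
        field="data",
        io_kwargs=dict(compression="gzip", compression_opts=4),
    ),
}
```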
Is your feature request related to a problem?
No response
Do you have any interest in helping implement the feature?
No.