
flexible backend prototype #451

Closed
wants to merge 19 commits into from

Conversation

@bendichter
Contributor
commented May 21, 2023

#441

Here's the start of an implementation for separating the backend out. This refactoring gives us the following advantages:

  1. It allows for flexible backends: HDF5 and Zarr are shown, and it can be extended to future backends as well.
  2. It separates the methods that determine the content of the nwbfile from the methods that determine the dataset configurations.
  3. It simplifies the code in individual non-abstract DataInterface classes. Individual data interfaces no longer need to handle setting DataIO, and do not need to worry about reading an NWBFile from disk or using a make_or_load_nwbfile context.
  4. It allows for setting different configurations for different datasets.

There's still a lot that is needed:

  1. We will need to go into each data interface and simplify it by removing the make_or_load_nwbfile context calls, similar to this PR: move make_or_load_nwbfile logic to BaseDataInterface. #398. Individual interfaces should not implement their own .run_conversion methods, but instead should implement .add_to_nwbfile, which must take an NWBFile object.
  2. The new .add_to_nwbfile method should populate a list in self.datasets with the object ID and field name of each dataset we might want to chunk and compress (see the sketch below).
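
A minimal sketch of that contract, with illustrative details only (the exact signature and the way entries are recorded in self.datasets are placeholders, not the PR's final API):

from abc import ABC, abstractmethod
from pynwb import NWBFile


class BaseDataInterface(ABC):
    """Sketch only; the real class also carries metadata handling and more."""

    def __init__(self):
        # (object_id, field_name) pairs for datasets eligible for chunking/compression
        self.datasets = []

    @abstractmethod
    def add_to_nwbfile(self, nwbfile: NWBFile) -> None:
        """Add this interface's data to an in-memory NWBFile.

        Implementations should append an (object_id, field_name) entry to
        self.datasets for every dataset that may later be chunked/compressed.
        """
        ...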

@bendichter bendichter marked this pull request as draft May 22, 2023 13:44
@bendichter bendichter requested a review from alejoe91 May 22, 2023 13:49
@bendichter bendichter requested review from h-mayorquin, luiztauffer and CodyCBakerPhD and removed request for luiztauffer May 22, 2023 13:52
@alejoe91
Contributor

Hi @bendichter

Thanks for drafting this out. I think this is a very good start and it seems to be in line with what @CodyCBakerPhD, @h-mayorquin and I discussed last time.

IMO, the self.datasets could be optional and auto-inferred by traversing the NWBFile fields and checking which fields are DataChunkIterator, which will need to be wrapped by an IO.
We can add some helper functions for that.

@CodyCBakerPhD
Member

I was thinking more about the problem of tracking structure and dataset provenance, especially with respect to how the GUIDE would go about representing this, and arrived at the same conclusion that @alejoe91 suggests with:

IMO, the self.datasets could be optional and auto-inferred by traversing the NWBFile fields and checking which fields are DataChunkIterator, which will need to be wrapped by an IO.

That is, we can simplify this strategy greatly (and it would serve as more generic functionality beyond NeuroConv anyway) if we can decouple the problem of tracking which interfaces write which datasets from the problem of which datasets still need an IO set.

So all we should need in the end is

a) the existing NeuroConv write tools (SI/ROIExtractors, etc.) make wrapping in an IO optional; the interfaces that use those tools always set it to 'do not wrap an IO'

b) make a function that searches an in-memory NWBFile object, finds all the unbound fields that will end up as h5py.Datasets but have their IO info unset, and returns some basic information about each dataset, such as where it is in the file and maybe basic shape and data type information (very similar to how an NWB Inspector message works); see the sketch after this list

This basic information is in essence the dataset_config expected by configure_datasets, but I would unbind it from the BaseDataInterface and make it something that corresponds to the context manager responsible for writing the file, so we can use it any time we use the context, with or without interfaces/converters

c) the configure_datasets can then be used as-is
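
As a rough illustration, the per-dataset record that b) returns could be as small as this (the names DatasetInfo and find_unwrapped_datasets are hypothetical, not part of this PR):

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DatasetInfo:
    location: str                               # 'where is it in the file', e.g. "acquisition/ElectricalSeries/data"
    object_id: str                              # UUID of the containing neurodata object
    field: str                                  # field name, e.g. "data" or "timestamps"
    dtype: Optional[str] = None                 # basic data type information, if known
    maxshape: Optional[Tuple[int, ...]] = None  # basic shape information, if known


# hypothetical standalone helper, decoupled from BaseDataInterface:
# def find_unwrapped_datasets(nwbfile): yield one DatasetInfo per dataset that
# will become an h5py.Dataset but has no IO configured yet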

@bendichter
Contributor Author

I like the idea of separating configure_datasets out into a function instead of a method so it can be used outside of DataInterfaces for any PyNWB conversion.

Auto-inferring datasets would be possible. The thing is, it might be tricky to automatically determine which datasets you want to be eligible for chunking and compression (C&C), and the question becomes: how do we identify such datasets? Creating a list of eligible datasets was an attempt to get around this, but I see that it would require a lot of work to modify every DataInterface to enumerate all eligible datasets.

@alejoe91 's suggestion is to use all datasets that have a DataChunkIterator (DCI). That's not a bad idea, but you might want to apply chunking and compression even to datasets that fit in memory. For example, TimeSeries.timestamps is rarely created with a DCI, but we may still want to use C&C.

@CodyCBakerPhD 's suggestion is to include all fields that will become datasets (and don't already have an IO), which would include data like NWBFile.identifier, NWBFile.session_start_time, NWBFile.timestamps_reference_time, VoltageClampSeries.capacitance_slow, and a lot of other very small datasets that you would never want to bother with for C&C. Perhaps a better modification of @CodyCBakerPhD 's approach would be to include all non-scalar datasets. This might still catch a few small datasets that you would never want to apply C&C to, but maybe not very many. I think it's worth trying out on a few files to see what this would look like.
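
A sketch of that non-scalar filter (the helper name and the way shape is probed are assumptions; a real version would also need to handle iterators whose shape is not yet known):

def is_nonscalar_candidate(value) -> bool:
    # skip unset fields and plain strings, which serialize as scalar or attribute-like data
    if value is None or isinstance(value, (str, bytes)):
        return False
    shape = getattr(value, "shape", None)  # numpy arrays, h5py.Datasets, and similar expose .shape
    if shape is not None:
        return len(shape) > 0
    # fall back to treating non-empty sequences as non-scalar
    return isinstance(value, (list, tuple)) and len(value) > 0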

If we can agree upon an automated solution, that sounds like it would be fine. Maybe, as @alejoe91 pointed out, we can try to automate where we can, and then have DataInterfaces just handle the exceptions somehow, maybe by providing a list of datasets to exclude.

@CodyCBakerPhD
Member

which would include data like NWBFile.identifier, NWBFile.session_start_time, NWBFile.timestamps_reference_time, VoltageClampSeries.capacitance_slow, and a lot of other very small datasets that you would never want to bother with for C&C.

Yeah, forgot about those...

Would it work to restrict the inclusion rule to h5py.Datasets contained within either a TimeSeries or a VectorData object?

Brief sketch in Pythonic pseudo-code

for object_ in nwbfile.objects.values():  # search for time series or columns of tables
    if isinstance(object_, (TimeSeries, VectorData)):  # the VectorData one probably wouldn't work...
        for field in object_.fields:  # search for datasets
            if will_become_hdf5_dataset and not_wrapped_in_dataio:
                ...  # grab info, return in dataset_config

then later

# Specify common back-end to use
# Specify choices of compression methods and options (otherwise let them default based on back-end) for each dataset in the config

for dataset in dataset_config:
    object_in_nwbfile = nwbfile[dataset.path_to_dataset]
    if not wrapped_in_iterator(object_in_nwbfile):  # that is, ALWAYS wrap in a generic data chunk iterator regardless
        wrap_in_iterator(object_in_nwbfile)
    wrap_in_dataio(object_in_nwbfile)
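
For the HDF5 back-end, that final wrap_in_dataio step could map onto hdmf's H5DataIO roughly like this (a sketch; the compression choices are illustrative, not what this PR settles on):

import numpy as np
from hdmf.backends.hdf5 import H5DataIO

data = np.random.rand(100_000, 32)  # stand-in for whatever the dataset_config entry points at
wrapped = H5DataIO(
    data=data,
    compression="gzip",   # back-end-specific compression method
    compression_opts=4,   # method-specific option (gzip level)
    chunks=True,          # let HDF5 pick a chunk shape unless an iterator supplies one
)
# 'wrapped' would then be set back onto the corresponding field of the in-memory NWBFile before write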

@CodyCBakerPhD
Member

I guess that might exclude extended data types that don't inherit from either of those bases, but I'm OK with support for this only applying to core types for now

@bendichter
Contributor Author

bendichter commented May 22, 2023

@CodyCBakerPhD yes, the approach of only selecting VectorData.data, TimeSeries.data, and TimeSeries.timestamps might satisfy most of our needs right now.

What is the advantage of always wrapping in a DCI?

@CodyCBakerPhD
Member

What is the advantage of always wrapping in a DCI?

Personally it's my preferred way of communicating chunk information, since that is usually where the io.write procedure in HDMF requests that info from, as I recall.

@CodyCBakerPhD
Member

Personally it's my preferred way of communicating chunk information

More specifically, using the GenericDataChunkIterator to generate a good default chunk_shape
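
To make that concrete, here is a minimal (illustrative) wrapper around an in-memory array using hdmf's GenericDataChunkIterator; real interfaces would read lazily from their source rather than hold the full array:

import numpy as np
from hdmf.data_utils import GenericDataChunkIterator


class InMemoryArrayChunkIterator(GenericDataChunkIterator):
    """Illustrative subclass; only these three methods need to be provided."""

    def __init__(self, array, **kwargs):
        self._array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self._array[selection]

    def _get_maxshape(self):
        return self._array.shape

    def _get_dtype(self):
        return self._array.dtype


iterator = InMemoryArrayChunkIterator(np.random.rand(100_000, 64))
print(iterator.recommended_chunk_shape())  # automatically computed default chunk shape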

bendichter and others added 3 commits May 22, 2023 14:05
* configure_datasets moved to function
* automatically include data in TimeSeries and VectorData
@bendichter
Contributor Author

@alejoe91 and @CodyCBakerPhD what do you think about this?

for object_ in nwbfile.objects.values():
    if isinstance(object_, TimeSeries) or isinstance(object_, VectorData):
        for field in object_.__fields__:  # search for datasets
            if field in ("data", "timestamps") and not isinstance(getattr(object_, field), DataIO):
@CodyCBakerPhD
Member
May 22, 2023

I was wondering about that, but can we for sure guarantee that the main dataset of every child of TimeSeries will be under the data field?

@bendichter
Contributor Author

Every descendant of TimeSeries should have a data field; that is a required argument. They may also have other datasets, and those other datasets would be missed. That could potentially be an issue if those other datasets are big, but usually they are just small metadata that you wouldn't want to compress anyway.

@h-mayorquin
Collaborator

Another possible option for filtering which datasets are extracted for wrapping for chunking and compression is to just set a minimal object size (in memory). Don't we have the types and shapes of the objects available to determine this?

            if getattr(object_, field) is not None and not isinstance(getattr(object_, field), DataIO):
                yield object_.id, field
    if isinstance(object_, DynamicTable):
        for field in getattr(object_, "colnames"):
@h-mayorquin
Collaborator

Out of curiosity, what is this condition for?
isinstance(getattr(object_, field), DataIO)

Is this so we don't overwrite previous configuration?

Also, what happens in this function if the object was previously bound, as in we read it from a file already and we are appending new datasets? Is it related to this?

@CodyCBakerPhD
Member

The major ideas from this have either been merged from [Backend Configuration I-VI] or remain TODO items tracked in #475.
