flexible backend prototype #451
Conversation
Hi @bendichter Thanks for drafting this out. I think this is a very good start and it seems to be in line with what @CodyCBakerPhD, @h-mayorquin and I discussed last time. IMO, the
I was thinking more about the problem of tracking structure and dataset provenance, especially with respect to how the GUIDE would go about representing this, and arrived at the same conclusion that @alejoe91 suggests with
That is, we can simplify this strategy greatly (and it would serve as more generic functionality beyond NeuroConv anyway) if we can decouple the problem of tracking which interfaces write which datasets from which datasets still need an

So all we should need in the end is:

a) the existing NeuroConv write tools (SI/ROIExtractors, etc.) make wrapping of an

b) make a function that searches an in-memory  This basic information is in essence the

c) the
I like the idea of separating out the

Auto-inferring datasets would be possible. The thing is, it might be tricky to automatically determine which datasets you want to be eligible for chunking and compression (C&C), and the question becomes: how do we identify such datasets? Creating a list of eligible datasets was an attempt to get around this, but I see that it would require a lot of work to modify every

@alejoe91's suggestion is to use all datasets that have a

@CodyCBakerPhD's suggestion is to include all fields that will become datasets (and don't already have an IO), which would include data like

If we can agree upon an automated solution, that sounds like it would be fine. Maybe, as @alejoe91 pointed out, we can try to automate where we can, and then have DataInterfaces just handle the exceptions somehow, maybe by providing a list of datasets to exclude.
Yeah, forgot about those... Would it work to restrict the inclusion rule to be a

Brief sketch in Pythonic pseudo-code:

    for object in nwbfile.objects():  # search for time series or columns of tables
        if isinstance(object, TimeSeries) or isinstance(object, VectorData):  # VectorData one probably wouldn't work...
            for field in object:  # search for datasets
                if will_become_hdf5_dataset and not_wrapped_in_dataio:
                    # grab info, return in dataset_config

    # then later
    # Specify common back-end to use
    # Specify choices of compression methods and options for each dataset in the config
    # (otherwise let them default based on back-end)
    for dataset in dataset_config:
        object_in_nwbfile = nwbfile[dataset.path_to_dataset]
        if not wrapped_in_iterator(object_in_nwbfile):  # that is, ALWAYS wrap in generic data chunk iterator regardless
            wrap_in_iterator(object_in_nwbfile)
        wrap_in_dataio(object_in_nwbfile)
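To make the sketch above concrete, here is a hedged, runnable rendering under the assumptions that candidates are limited to TimeSeries.data, TimeSeries.timestamps, and VectorData.data, and that the HDF5 back-end is used; the helper name iter_candidate_datasets is illustrative and not part of NeuroConv's API:

```python
from hdmf.backends.hdf5 import H5DataIO
from hdmf.common import VectorData
from pynwb import NWBFile, TimeSeries


def iter_candidate_datasets(nwbfile: NWBFile):
    """Yield (object_id, field_name) for core datasets not yet wrapped in a DataIO."""
    for obj in nwbfile.objects.values():  # nwbfile.objects maps object_id -> container
        if isinstance(obj, TimeSeries):
            fields = ("data", "timestamps")
        elif isinstance(obj, VectorData):
            fields = ("data",)
        else:
            continue
        for field in fields:
            value = getattr(obj, field, None)
            if value is not None and not isinstance(value, H5DataIO):
                yield obj.object_id, field


# Later, once a back-end and per-dataset compression options are chosen, each
# candidate would be wrapped in a data chunk iterator and a DataIO before writing.
```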
I guess that might exclude extended data types that don't inherit from either of those bases, but I'm OK with support for this only applying to core types for now
@CodyCBakerPhD yes, the approach of only selecting VectorData.data, TimeSeries.data, and TimeSeries.timestamps might satisfy most of our needs right now. What is the advantage of always wrapping in a DCI?
Personally it's my preferred way of communicating chunk information, since that is usually what the
More specifically, using the
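As an illustration of that preference, here is a sketch of wrapping an in-memory array in hdmf's GenericDataChunkIterator before handing it to H5DataIO; the subclass name ArrayChunkIterator and the chunk_mb value are assumptions, not NeuroConv code:

```python
import numpy as np
from hdmf.backends.hdf5 import H5DataIO
from hdmf.data_utils import GenericDataChunkIterator


class ArrayChunkIterator(GenericDataChunkIterator):
    """Illustrative iterator over an array-like object, chunk by chunk."""

    def __init__(self, array, **kwargs):
        self._array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self._array[selection]

    def _get_maxshape(self):
        return self._array.shape

    def _get_dtype(self):
        return self._array.dtype


data = np.random.rand(10_000, 64)
iterator = ArrayChunkIterator(data, chunk_mb=1.0)  # chunk shape is derived from the target chunk size
wrapped = H5DataIO(data=iterator, compression="gzip")  # the writer can take the chunk layout from the iterator
```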
* configure_datasets moved to function
* automatically include data in TimeSeries and VectorData
@alejoe91 and @CodyCBakerPhD what do you think about this?
src/neuroconv/tools/nwb_helpers.py
    for object_ in nwbfile.objects():
        if isinstance(object_, TimeSeries) or isinstance(object_, VectorData):
            for field in object_.__fields__:  # search for datasets
                if field in ("data", "timestamps") and not isinstance(getattr(object_, field), DataIO):
I was wondering about that, but can we for sure guarantee that the main dataset of every child of TimeSeries will be under the data field?
Every descendant of TimeSeries should have a data field. That is a required argument. They may also have other datasets, and those other datasets would be missed. That could potentially be an issue if those other datasets are big, but usually they are just small metadata that you wouldn't want to compress anyway.
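As a minimal illustration of that point (not from the PR), data is a required constructor argument on TimeSeries, so it is present on every descendant type as well:

```python
import numpy as np
from pynwb import TimeSeries

# data is required for TimeSeries (and therefore for every descendant type),
# so selecting the "data" field always reaches the primary dataset.
ts = TimeSeries(name="example", data=np.arange(10.0), unit="volts", rate=30.0)
print(ts.data.shape)   # the main, potentially large dataset
print(ts.timestamps)   # None here, because rate was supplied instead of timestamps
```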
Another possible option for filtering which datasets are extracted for wrapping for chunking and compression is to just set a minimum object size (in memory). Don't we have the types and shapes of the objects available to determine this?
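A rough sketch of that kind of size filter, assuming the candidate values expose shape and dtype (the threshold and helper names are made up for illustration):

```python
import numpy as np

MIN_NBYTES = 1_000_000  # illustrative threshold: skip anything under ~1 MB


def estimate_nbytes(value) -> int:
    """Rough in-memory size estimate from shape and dtype when available."""
    shape = getattr(value, "shape", None)
    dtype = getattr(value, "dtype", None)
    if shape is not None and dtype is not None:  # numpy arrays, h5py datasets, etc.
        return int(np.prod(shape)) * np.dtype(dtype).itemsize
    return np.asarray(value).nbytes  # fallback for plain lists (loads them into memory)


def is_worth_wrapping(value) -> bool:
    return estimate_nbytes(value) >= MIN_NBYTES
```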
    if getattr(object_, field) is not None and not isinstance(getattr(object_, field), DataIO):
        yield object_.id, field
    if isinstance(object_, DynamicTable):
        for field in getattr(object_, "colnames"):
Out of curiosity, what is this condition for: isinstance(getattr(object_, field), DataIO)? Is this so we don't overwrite a previous configuration?
Also, what happens in this function if the object was previously bound, as in we read it from a file already and we are appending new datasets? Is it related to this?
The major ideas from this have either been merged from [Backend Configuration I-VI] or are still TODO items in the reference #475.
#441
Here's the start of an implementation for separating the backend out. This refactoring gives us the following advantages:

* DataInterface classes: individual data interfaces no longer need to handle setting DataIO, and do not need to worry about reading an NWBFile from disk or using a make_or_load_nwbfile context.

There's still a lot that is needed:

* make_or_load_nwbfile context calls, similarly to this PR: move make_or_load_nwbfile logic to BaseDataInterface. #398
* Individual interfaces should not implement their own .run_conversion methods, but instead should implement .add_to_nwbfile, which must take an NWBFile object.
* The .add_to_nwbfile method should populate a list in self.datasets with the object ID and field name of each dataset we might want to chunk and compress (a minimal sketch of this pattern follows below).
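A minimal sketch of the proposed pattern, with an invented ExampleDataInterface and a simple (object_id, field_name) layout for self.datasets; neither detail is fixed by this PR:

```python
import numpy as np
from pynwb import NWBFile, TimeSeries


class ExampleDataInterface:  # hypothetical stand-in for a BaseDataInterface subclass
    """Interfaces only add objects to an in-memory NWBFile and record their datasets."""

    def __init__(self):
        # (object_id, field_name) pairs for datasets that may later be chunked/compressed
        self.datasets = []

    def add_to_nwbfile(self, nwbfile: NWBFile, metadata=None):
        series = TimeSeries(name="raw", data=np.zeros((1000, 4)), unit="V", rate=30_000.0)
        nwbfile.add_acquisition(series)
        self.datasets.append((series.object_id, "data"))


# Back-end selection, DataIO wrapping, and compression choices then happen in one place,
# outside the interfaces, by iterating over interface.datasets after add_to_nwbfile runs.
```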