Improve serialization concept : memory/json/pickle #2027
Conversation
… engine like pickle.
@alejoe91: I think I would not over-complicate this PR too much. Tell me what you think.
src/spikeinterface/core/base.py (outdated)

        return self._is_dumpable

    def check_serializablility(self, type="json"):
I really dislike this idea of using a dictionary as a carrier for arbitrary serialization methods. I think that the previous flag method is simple, explicit and does its job well.
Json and pickle are separate concepts with very different purposes. I think it would be better to have some code duplication in separate methods instead of coupling these concepts together, which I feel will bring pain in the future.
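For what it's worth, the duplicated-but-explicit alternative could look something like this (a hypothetical sketch, not code from this PR; _is_pickle_serializable is an assumed attribute name):
class BaseExtractorSketch:
    # Hypothetical sketch of the "two explicit methods" alternative.
    _is_json_serializable = True
    _is_pickle_serializable = True  # assumed attribute name, not from this PR

    def check_if_json_serializable(self) -> bool:
        # explicit, single-purpose flag: can this extractor be dumped to a json file?
        return self._is_json_serializable

    def check_if_pickle_serializable(self) -> bool:
        # explicit, single-purpose flag: can this extractor be dumped to a pickle file?
        return self._is_pickle_serializable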
It can help to write a loop like:
for type in ('json', 'pickle'):
    if rec.check_serializablility(type):
        rec.dump(f'file.{type}')
which should be convenient in various places.
We need 3 flags: "dumpable", "json_serializable", "pickle_to_disk_serializable".
We could also remove self._is_dumpable and put it in the dict as well.
This avoids duplicating the (actually buggy) function that recursively finds a non-dumpable parent.
@h-mayorquin @alejoe91: what do you think?
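In BaseExtractor.__init__ that could look roughly like this (a sketch of the proposal above; the default values are only illustrative):
# one dict instead of three separate boolean attributes;
# the "memory" entry would replace the old self._is_dumpable flag
self._serializablility = dict(
    memory=True,   # can be pickled in RAM, e.g. to send to worker processes
    json=True,     # can be dumped to a .json file
    pickle=True,   # can be dumped to a .pickle file
)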
Yes, so I was saying that I preferred the code duplication (so you can't write your helpful loop) instead of merging these flags in a common dict and processing them in a single function. I don't like nesting the check_if_dumpable
function, which is already complicated, even further. I think it is easier for you, who have a good idea of what everything there means and where it is located, and you want to save yourself a couple of lines of typing, but I think the costs are: 1) it is horrendous to read as a beginner, 2) it couples these two concepts (can be saved to json, can be saved to pickle) into one function and I don't think they should be mixed. Code should have low coupling.
I really think that the corner case of Numpy should be addressed as what it is, a corner case. If what you want to avoid is having stuff dumped into pickle and json when saving in the waveform extractor, why not just filter it out right there in the waveform extractor instead of propagating these concepts to the whole library by introducing a new flag?
can_be_pickled
can_be_jsonfied
are simple flags that I think should transparently mean what they say in their name. If we believe that some specific functionality in the library should not rely on those mechanisms, then we should throw a warning or an assertion in the specific place where that happens.
It seems conceptually onerous and hard to document to create a whole new category for "things that will be sub-optimal in a specific function if done this way". We will soon forget why we introduced them in the first place, whereas if we put a filter where things go wrong then the use case is kind of self-documented.
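In other words, something like this at the single point of use (a hypothetical sketch; the helper name and the exact place inside the waveform extractor saving code are only assumptions):
import warnings

def _save_recording_provenance(recording, folder):
    # hypothetical guard at the point of use, instead of a new library-wide flag
    if not recording.check_if_json_serializable():
        warnings.warn("Recording is not json serializable; provenance will not be saved.")
        return
    recording.dump_to_json(folder / "recording.json")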
@@ -517,7 +522,8 @@ def __init__(self, snippets_list, spikesframes_list, sampling_frequency, nbefore
    )

    self._is_dumpable = False
I thought that "dumpable" was meant to mean picklable? Can you remind me how they are different?
Yes and no.
dumpable is pickle for multiprocessing, but we absolutely want to avoid silently dumping a NumpyRecording when extracting waveforms.
check_serializablility('pickle')
is clearly for the file version.
NumpyRecording is really the corner case we need to keep in mind.
@@ -127,7 +128,9 @@ def __init__(self, spikes, sampling_frequency, unit_ids):
    BaseSorting.__init__(self, sampling_frequency, unit_ids)

    self._is_dumpable = True
    self._is_json_serializable = False
    self._serializablility["json"] = False
    # theorically this should be False but for simplicity make generators simples we still need this.
Yeah, you guys were talking about this today. I am not sure I got it.
You don't want stuff to be picklable because it is heavy. You put all of the spike_vector onto disk. Is that correct?
I wanted to avoid NumpySorting also being pickled into a file. But in the end it was too hard!!
So let's keep self._serializablility["pickle"] = True
for NumpySorting, so the generated recording can be saved into pickle, which is the most urgent need.
But right now the generator recording can be pickled, can't it?
This works for me:
import spikeinterface.core as si
rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                   num_channels=32,
                                                   num_units=64,
                                                   )
from spikeinterface.core import load_extractor
rec.dump_to_pickle('generated_recording.pickle')
load_extractor('generated_recording.pickle')
sorting.dump_to_pickle('generated_sorting.pickle')
load_extractor('generated_sorting.pickle')
Is there any case where saving and loading from json is better than pickling?
We use json for human readability and so for debugging.
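For instance (assuming the rec generated above; the keys shown in the comment are only illustrative):
import json

rec.dump_to_json('generated_recording.json')
with open('generated_recording.json') as f:
    d = json.load(f)
# the dump is plain text, so it can be opened in an editor or inspected directly
print(list(d.keys()))   # e.g. the class name and the kwargs needed to rebuild the extractor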
@h-mayorquin I made the change you proposed
OK, so I had a discussion with @alejoe91. Some points that I remember:
This seems to be the status quo. Something important is that, for the ground truth recording, I am not sure that pickling looks bad at all. Look at this, it is not that bad for an hour and 20 minutes of recording generated using @samuelgarcia's ground truth recording:
rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                   num_channels=32,
                                                   num_units=64,
                                                   )
Times:
Note that the recording is large because it contains the spike vector and the templates as arrays. Otherwise it would be super light. Maybe there is a way of getting around that. This was produced with the following code:
import spikeinterface.core as si
from pathlib import Path
import time
rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                   num_channels=32,
                                                   num_units=64,
                                                   )
from spikeinterface.core import load_extractor
# Timing for dumping recording to pickle
start_time = time.time()
rec.dump_to_pickle('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to dump recording to pickle: {end_time - start_time:.2f} seconds")
# Timing for loading recording from pickle
start_time = time.time()
load_extractor('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to load recording from pickle: {end_time - start_time:.2f} seconds")
# Print the size of the file in MiB
size_pickle = Path('ground_truth_recording.pickle').stat().st_size / 1024**2
print(f"Size of the recording pickle: {size_pickle:.2f} MiB")
# Timing for dumping recording to json
start_time = time.time()
rec.dump_to_json('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to dump recording to json: {end_time - start_time:.2f} seconds")
# Timing for loading recording from json
start_time = time.time()
load_extractor('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to load recording from json: {end_time - start_time:.2f} seconds")
size_json = Path('ground_truth_recording.json').stat().st_size / 1024**2
print(f"Size of the recording json: {size_json:.2f} MiB")
# Timing for dumping sorting to pickle
start_time = time.time()
sorting.dump_to_pickle('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to dump sorting to pickle: {end_time - start_time:.2f} seconds")
# Timing for loading sorting from pickle
start_time = time.time()
load_extractor('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to load sorting from pickle: {end_time - start_time:.2f} seconds")
size_pickle = Path('ground_truth_sorting.pickle').stat().st_size / 1024**2
print(f"Size of the sorting pickle: {size_pickle:.2f} MiB")
# Timing for dumping sorting to json
start_time = time.time()
sorting.dump_to_json('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to dump sorting to json: {end_time - start_time:.2f} seconds")
# Timing for loading sorting from json
start_time = time.time()
load_extractor('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to load sorting from json: {end_time - start_time:.2f} seconds")
size_json = Path('ground_truth_sorting.json').stat().st_size / 1024**2
print(f"Size of the sorting json: {size_json:.2f} MiB") |
Some recordings are not json serializable but they can be dumped with pickle (the new generator, for instance).
This prevents the waveform extractor from being loaded back, and also prevents running sorters.
Here is a proposal to extend, in some places, the engine used to serialize an object.
This will be super convenient for running sorters on simulated data which are not file based.
This also renames is_dumpable() to a more explicit is_memory_serializable().
This also makes a better separation between:
is_memory_serializable() / check_serializablility("memory"): an object can be dumped to memory for multiprocessing
check_if_json_serializable() / check_serializablility("json"): an object can be serialized to a file using json
check_if_pickle_serializable() / check_serializablility("pickle"): an object can be serialized to a file using pickle
In this PR: check_serializablility(type) with an internal self._serializablility dict
Note:
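To make the concept concrete, here is a minimal sketch of how the dict-based check could work as a method on BaseExtractor (illustrative only, not the exact code in this PR; it assumes parent extractors live in self._kwargs, as they do in BaseExtractor):
def check_serializablility(self, type="json"):
    # look up this object's own flag first
    if not self._serializablility.get(type, False):
        return False
    # then recurse into parent extractors stored in the kwargs, so a
    # non-serializable parent makes the whole chain non-serializable
    for value in self._kwargs.values():
        if isinstance(value, BaseExtractor):
            if not value.check_serializablility(type):
                return False
    return True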