
Improve serialization concept : memory/json/pickle #2027

Merged
merged 11 commits into SpikeInterface:main on Sep 27, 2023

Conversation


@samuelgarcia samuelgarcia commented Sep 20, 2023

Some recordings are not JSON serializable but can be dumped with pickle (the new generators, for instance).
This prevents the waveform extractor from being loaded back, and also prevents running sorters.

Here is a proposal to extend the machinery in a few places to serialize an object.
This will be super convenient for running sorters on simulated data that are not file based.

This also renames is_dumpable() to the more explicit is_memory_serializable().

This also makes a better separation between (see the sketch after this list):

  • is_memory_serializable() / check_serializablility("memory"): the object can be dumped to memory for multiprocessing
  • check_if_json_serializable() / check_serializablility("json"): the object can be serialized to a file using json
  • check_if_pickle_serializable() / check_serializablility("pickle"): the object can be serialized to a file using pickle
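
A minimal sketch of the distinction (hypothetical usage; the method name is the one introduced in this PR):

import spikeinterface.core as si

# Any extractor works here; generate_recording gives a small in-memory one.
rec = si.generate_recording(durations=[10.0])

rec.check_serializablility("memory")  # can the object cross process boundaries?
rec.check_serializablility("json")    # can it be written to a .json file?
rec.check_serializablility("pickle")  # can it be written to a .pickle file?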

In this PR:

  • Improve the check_serializablility(type) concept with an internal self._serializablility dict
  • enable pickle dump to file in the waveform extractor
  • enable pickle dump to file in the base sorter
  • debug the nested check

Note:

  • NumpyRecording is neither json nor pickle serializable because it is too big: we do not want a pickle/json copy of these objects.

@samuelgarcia samuelgarcia added the core Changes to core module label Sep 20, 2023
@samuelgarcia samuelgarcia marked this pull request as ready for review September 20, 2023 12:22
@samuelgarcia (Member Author)

@alejoe91: I think I would not make this PR too complex.
I finally left the pickle for NumpySorting; otherwise it is too much of a cascade of changes for now.
This is not perfect, but at least we can play with generator recordings.

Tell me what you think.

return self._is_dumpable

def check_serializablility(self, type="json"):
Collaborator:

I really dislike this idea of using a dictionary as a carrier for arbitrary serialization methods. I think that the previous flag method is simple, explicit, and does its job well.

Json and pickle are separate concepts with very different purposes. I think it would be better to have some code duplication in the methods instead of coupling these concepts together, which I feel will bring pain in the future.

Member Author:

It can help to write loops like:

for type in ('json', 'pickle'):
    if rec.check_serializablility(type):
        rec.dump(f'file.{type}')

This should be convenient in various places.

Member Author:

We need 3 flags: "dumpable", "json_serializable", "pickle_to_disk_serializable".
We could also remove self._is_dumpable and put it in the dict as well.
This avoids duplicating the (actually buggy) function that recursively finds a non-dumpable parent.
@h-mayorquin @alejoe91: what do you think? (A sketch of the proposed dict follows.)
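
A hypothetical sketch of what the merged dict could look like (keys as in this PR; default values and placement are illustrative):

# Hypothetical sketch: one internal dict instead of separate boolean
# attributes; keys follow this PR, defaults shown are illustrative.
class BaseExtractor:
    def __init__(self):
        self._serializablility = {
            "memory": True,   # can be dumped to memory for multiprocessing
            "json": True,     # can be serialized to a file using json
            "pickle": True,   # can be serialized to a file using pickle
        }

    def check_serializablility(self, type="json"):
        return self._serializablility[type]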

Collaborator:

Yes, so I was saying that I preferred the code duplication (so you can't write your helpful loop) over merging these flags into a common dict and processing them in a single function. I don't like nesting the check_if_dumpable function, which is already complicated, even further. I think it is easier for you, who have a good idea of what everything there means and where it is located, and it saves you a couple of lines of typing, but I think the costs are: 1) it is horrendous to read as a beginner, 2) it couples these two concepts (can be saved to json, can be saved to pickle) into one function, and I don't think they should be mixed. Code should have low coupling.

I really think that the corner case of Numpy should be addressed as what it is: a corner case. If what you want to avoid is having stuff dumped into pickle and json when saving in the waveform extractor, why not just filter it out right there in the waveform extractor, instead of propagating these concepts to the whole library by introducing a new flag?

can_be_pickled and can_be_jsonfied are simple flags that I think should transparently mean what their names say. If we believe that some specific functionality in the library should not rely on those mechanisms, then we should throw a warning or an assertion in the specific place where that happens. (A sketch of this flag-based alternative follows.)

It seems conceptually onerous and hard to document to create a whole new category for "things that will be sub-optimal in a specific function if done this way". We will soon forget why we introduced them in the first place, whereas if we put a filter where things work wrongly, the use case is kind of self-documenting.
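
A hypothetical sketch of the flag-based alternative argued for here (attribute names taken from this comment; the implementation is illustrative, not from the codebase):

# Hypothetical sketch: independent, transparent flags instead of one dict.
class BaseExtractor:
    def __init__(self):
        self._can_be_pickled = True   # file serialization with pickle
        self._can_be_jsonfied = True  # file serialization with json

    def check_if_pickle_serializable(self):
        return self._can_be_pickled

    def check_if_json_serializable(self):
        return self._can_be_jsonfied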

@@ -517,7 +522,8 @@ def __init__(self, snippets_list, spikesframes_list, sampling_frequency, nbefore
)

self._is_dumpable = False
Collaborator:

I thought that "dumpable" was meant to mean picklable? Can you remind me how they are different?

Member Author:

Yes and no.
Dumpable means pickle for multiprocessing, but we absolutely want to avoid silently dumping a NumpyRecording when extracting waveforms.
check_serializablility('pickle') is clearly for the file version.

NumpyRecording is really the corner case we need to keep in mind, as the sketch below illustrates.
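
A hypothetical illustration of the corner case (assumes the NumpyRecording constructor from spikeinterface.core and the check method introduced in this PR):

import numpy as np
from spikeinterface.core import NumpyRecording

# A fully in-memory recording: the traces live in this array.
traces = np.zeros((30_000, 4), dtype="float32")
rec = NumpyRecording(traces, sampling_frequency=30_000.0)

# Per the PR note: not serializable to a file, so a copy of the traces
# is never silently written to disk.
rec.check_serializablility("json")    # expected False
rec.check_serializablility("pickle")  # expected False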

@@ -127,7 +128,9 @@ def __init__(self, spikes, sampling_frequency, unit_ids):
BaseSorting.__init__(self, sampling_frequency, unit_ids)

self._is_dumpable = True
self._is_json_serializable = False
self._serializablility["json"] = False
# theoretically this should be False, but to keep generators simple we still need it.
Collaborator:

Yeah, you guys were talking about this today. I am not sure I got it.

You don't want stuff to be picklable because it is heavy. You put the whole spike_vector onto disk. Is that correct?

Member Author:

I wanted to avoid NumpySorting also being pickled into a file. But finally it is too hard!!
So let's keep self._serializablility["pickle"] = True for NumpySorting, so the generated recording can be saved into pickle, which is the most urgent need.

Collaborator:

But right now the generator recording can be pickled, can't it?

This works for me:

import spikeinterface.core as si
from spikeinterface.core import load_extractor

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

# Round-trip both objects through pickle files.
rec.dump_to_pickle('generated_recording.pickle')
load_extractor('generated_recording.pickle')

sorting.dump_to_pickle('generated_sorting.pickle')
load_extractor('generated_sorting.pickle')

@h-mayorquin (Collaborator)

Is there any case where saving and loading from json is better than pickling?

@samuelgarcia (Member Author)

> Is there any case where saving and loading from json is better than pickling?

We use JSON for human readability and thus for debugging.
Also, pickle no longer supports the relative_to option (and this is a big pain), which is super convenient.
So JSON is still a good option, I think; see the example below.
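
A hypothetical example of the relative_to option mentioned above (assuming dump_to_json accepts a relative_to argument and rec is a recording as in the snippets above):

# Hypothetical sketch: store file paths relative to a base folder, so the
# JSON stays valid when the whole folder is moved to another machine.
rec.dump_to_json("my_folder/recording.json", relative_to="my_folder")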

@samuelgarcia (Member Author)

@h-mayorquin I made the change you proposed

@h-mayorquin (Collaborator) commented Sep 25, 2023

OK, so I had a discussion with @alejoe91.

Some points that I remember:

  • The numpy extractors (recording and sorting) should not be turned into json (as we have right now) because it is bad practice: they are too large.
  • Moreover, because even the pickle might be large, we don't want NumpyRecording to participate in any sort of multiprocessing where it has to be pickled (this is the is_dumpable of Sam). We will only allow the memory mode for this.

This seems to be the status quo; a sketch of the resulting settings is below.
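
A hypothetical summary of that status quo as per-class settings (values follow the points above and the diffs in this PR; not verbatim code):

# Hypothetical sketch, not verbatim from the codebase.
# NumpyRecording: too large to copy anywhere, including for multiprocessing.
numpy_recording_serializability = {"memory": False, "json": False, "pickle": False}
# NumpySorting: pickle-to-file kept so generated recordings can be saved.
numpy_sorting_serializability = {"memory": True, "json": False, "pickle": True}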

Something important: for the ground-truth recording, I am not sure that pickling looks bad at all. Look at this; it is not bad for an hour-and-20-minute recording generated with @samuelgarcia's ground-truth generator:

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000., 
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

Times:

Time taken to dump recording to pickle: 0.20 seconds
Time taken to load recording from pickle: 0.07 seconds
Size of the recording pickle: 107.64 MiB
Time taken to dump recording to json: 10.81 seconds
Time taken to load recording from json: 4.55 seconds
Size of the recording json: 591.58 MiB
Time taken to dump sorting to pickle: 0.21 seconds
Time taken to load sorting from pickle: 0.04 seconds
Size of the sorting pickle: 105.47 MiB
Time taken to dump sorting to json: 9.39 seconds
Time taken to load sorting from json: 3.81 seconds
Size of the sorting json: 407.57 MiB

Note that the recording is large because it contains the spike vector and the templates as arrays. Otherwise it would be super light. Maybe there is a way of getting around that (see the sketch after the script below).

This was produced with the following code:

import time
from pathlib import Path

import spikeinterface.core as si
from spikeinterface.core import load_extractor

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

# Timing for dumping recording to pickle
start_time = time.time()
rec.dump_to_pickle('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to dump recording to pickle: {end_time - start_time:.2f} seconds")

# Timing for loading recording from pickle
start_time = time.time()
load_extractor('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to load recording from pickle: {end_time - start_time:.2f} seconds")

# Print the size of the file in MiB
size_pickle = Path('ground_truth_recording.pickle').stat().st_size / 1024**2
print(f"Size of the recording pickle: {size_pickle:.2f} MiB")

# Timing for dumping recording to json
start_time = time.time()
rec.dump_to_json('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to dump recording to json: {end_time - start_time:.2f} seconds")

# Timing for loading recording from json
start_time = time.time()
load_extractor('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to load recording from json: {end_time - start_time:.2f} seconds")

size_json = Path('ground_truth_recording.json').stat().st_size / 1024**2
print(f"Size of the recording json: {size_json:.2f} MiB")

# Timing for dumping sorting to pickle
start_time = time.time()
sorting.dump_to_pickle('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to dump sorting to pickle: {end_time - start_time:.2f} seconds")

# Timing for loading sorting from pickle
start_time = time.time()
load_extractor('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to load sorting from pickle: {end_time - start_time:.2f} seconds")

size_pickle = Path('ground_truth_sorting.pickle').stat().st_size / 1024**2
print(f"Size of the sorting pickle: {size_pickle:.2f} MiB")

# Timing for dumping sorting to json
start_time = time.time()
sorting.dump_to_json('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to dump sorting to json: {end_time - start_time:.2f} seconds")

# Timing for loading sorting from json
start_time = time.time()
load_extractor('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to load sorting from json: {end_time - start_time:.2f} seconds")

size_json = Path('ground_truth_sorting.json').stat().st_size / 1024**2
print(f"Size of the sorting json: {size_json:.2f} MiB")

@samuelgarcia samuelgarcia changed the title Add pickle dump in some places Improve serialization concept : memory/json/pickle Sep 27, 2023
@samuelgarcia samuelgarcia merged commit c5bafd1 into SpikeInterface:main Sep 27, 2023
@alejoe91 alejoe91 mentioned this pull request Nov 6, 2023