
Improve serialization concept : memory/json/pickle #2027

Merged
merged 11 commits into SpikeInterface:main on Sep 27, 2023

Conversation


@samuelgarcia samuelgarcia commented Sep 20, 2023

Some recordings are not JSON serializable but can be dumped with pickle (the new generators, for instance).
This prevents the waveform extractor from being loaded back, and also prevents running sorters.

Here is a proposal to extend the machinery in a few places to serialize an object.
This will be super convenient for running sorters on simulated data that are not file based.

This also renames is_dumpable() to the more explicit is_memory_serializable().

This also makes a better separation between (see the sketch after this list):

  • is_memory_serializable() / check_serializablility("memory"): the object can be dumped to memory for multiprocessing
  • check_if_json_serializable() / check_serializablility("json"): the object can be serialized to a file using json
  • check_if_pickle_serializable() / check_serializablility("pickle"): the object can be serialized to a file using pickle
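
A minimal sketch of the distinction (hypothetical usage; the method name is the one introduced in this PR):

import spikeinterface.core as si

# Any extractor works here; generate_recording gives a small in-memory one.
rec = si.generate_recording(durations=[10.0])

rec.check_serializablility("memory")  # can the object cross process boundaries?
rec.check_serializablility("json")    # can it be written to a .json file?
rec.check_serializablility("pickle")  # can it be written to a .pickle file?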

In this PR:

  • Improve the check_serializablility(type) concept with an internal self._serializablility dict
  • enable pickle dump to file in the waveform extractor
  • enable pickle dump to file in the base sorter
  • debug the nested check

Note:

  • NumpyRecording is neither json nor pickle serializable because it is too big: we do not want a pickle/json copy of these objects.

@samuelgarcia samuelgarcia added the core Changes to core module label Sep 20, 2023
@samuelgarcia samuelgarcia marked this pull request as ready for review September 20, 2023 12:22
@samuelgarcia (Member Author)

@alejoe91: I think I would not make this PR too complex.
I finally left the pickle for NumpySorting; otherwise it is too much of a cascade of changes for now.
This is not perfect, but at least we can play with generator recordings.

Tell me what you think.

return self._is_dumpable

def check_serializablility(self, type="json"):
Collaborator:

I really dislike this idea of using a dictionary as a carrier for arbitrary serialization methods. I think that the previous flag method is simple, explicit, and does its job well.

Json and pickle are separate concepts with very different purposes. I think it would be better to have some code duplication in the methods instead of coupling these concepts together, which I feel will bring pain in the future.

Member Author:

It can help to write loops like:

for type in ('json', 'pickle'):
    if rec.check_serializablility(type):
        rec.dump(f'file.{type}')

This should be convenient in various places.

Member Author:

We need 3 flags: "dumpable", "json_serializable", "pickle_to_disk_serializable".
We could also remove self._is_dumpable and put it in the dict as well.
This avoids duplicating the (actually buggy) function that recursively finds a non-dumpable parent.
@h-mayorquin @alejoe91: what do you think? (A sketch of the proposed dict follows.)
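
A hypothetical sketch of what the merged dict could look like (keys as in this PR; default values and placement are illustrative):

# Hypothetical sketch: one internal dict instead of separate boolean
# attributes; keys follow this PR, defaults shown are illustrative.
class BaseExtractor:
    def __init__(self):
        self._serializablility = {
            "memory": True,   # can be dumped to memory for multiprocessing
            "json": True,     # can be serialized to a file using json
            "pickle": True,   # can be serialized to a file using pickle
        }

    def check_serializablility(self, type="json"):
        return self._serializablility[type]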

Collaborator:

Yes, so I was saying that I preferred the code duplication (so you can't write your helpful loop) over merging these flags into a common dict and processing them in a single function. I don't like nesting the check_if_dumpable function, which is already complicated, even further. I think it is easier for you, who have a good idea of what everything there means and where it is located, and it saves you a couple of lines of typing, but I think the costs are: 1) it is horrendous to read as a beginner, 2) it couples these two concepts (can be saved to json, can be saved to pickle) into one function, and I don't think they should be mixed. Code should have low coupling.

I really think that the corner case of Numpy should be addressed as what it is: a corner case. If what you want to avoid is having stuff dumped into pickle and json when saving in the waveform extractor, why not just filter it out right there in the waveform extractor, instead of propagating these concepts to the whole library by introducing a new flag?

can_be_pickled and can_be_jsonfied are simple flags that I think should transparently mean what their names say. If we believe that some specific functionality in the library should not rely on those mechanisms, then we should throw a warning or an assertion in the specific place where that happens. (A sketch of this flag-based alternative follows.)

It seems conceptually onerous and hard to document to create a whole new category for "things that will be sub-optimal in a specific function if done this way". We will soon forget why we introduced them in the first place, whereas if we put a filter where things work wrongly, the use case is kind of self-documenting.
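
A hypothetical sketch of the flag-based alternative argued for here (attribute names taken from this comment; the implementation is illustrative, not from the codebase):

# Hypothetical sketch: independent, transparent flags instead of one dict.
class BaseExtractor:
    def __init__(self):
        self._can_be_pickled = True   # file serialization with pickle
        self._can_be_jsonfied = True  # file serialization with json

    def check_if_pickle_serializable(self):
        return self._can_be_pickled

    def check_if_json_serializable(self):
        return self._can_be_jsonfied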

@@ -517,7 +522,8 @@ def __init__(self, snippets_list, spikesframes_list, sampling_frequency, nbefore
)

self._is_dumpable = False
Collaborator:

I thought that "dumpable" was meant to mean picklable? Can you remind me how they are different?

Member Author:

Yes and no.
Dumpable means pickle for multiprocessing, but we absolutely want to avoid silently dumping a NumpyRecording when extracting waveforms.
check_serializablility('pickle') is clearly for the file version.

NumpyRecording is really the corner case we need to keep in mind, as the sketch below illustrates.
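
A hypothetical illustration of the corner case (assumes the NumpyRecording constructor from spikeinterface.core and the check method introduced in this PR):

import numpy as np
from spikeinterface.core import NumpyRecording

# A fully in-memory recording: the traces live in this array.
traces = np.zeros((30_000, 4), dtype="float32")
rec = NumpyRecording(traces, sampling_frequency=30_000.0)

# Per the PR note: not serializable to a file, so a copy of the traces
# is never silently written to disk.
rec.check_serializablility("json")    # expected False
rec.check_serializablility("pickle")  # expected False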

@@ -127,7 +128,9 @@ def __init__(self, spikes, sampling_frequency, unit_ids):
BaseSorting.__init__(self, sampling_frequency, unit_ids)

self._is_dumpable = True
self._is_json_serializable = False
self._serializablility["json"] = False
# theoretically this should be False, but to keep generators simple we still need it.
Collaborator:

Yeah, you guys were talking about this today. I am not sure I got it.

You don't want stuff to be picklable because it is heavy. You put the whole spike_vector onto disk. Is that correct?

Member Author:

I wanted to avoid NumpySorting also being pickled into a file. But finally it is too hard!!
So let's keep self._serializablility["pickle"] = True for NumpySorting, so the generated recording can be saved into pickle, which is the most urgent need.

Collaborator:

But right now the generator recording can be pickled, can't it?

This works for me:

import spikeinterface.core as si
from spikeinterface.core import load_extractor

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

# Round-trip both objects through pickle files.
rec.dump_to_pickle('generated_recording.pickle')
load_extractor('generated_recording.pickle')

sorting.dump_to_pickle('generated_sorting.pickle')
load_extractor('generated_sorting.pickle')

@h-mayorquin (Collaborator)

Is there any case where saving and loading from json is better than pickling?

@samuelgarcia (Member Author)

> Is there any case where saving and loading from json is better than pickling?

We use JSON for human readability and thus for debugging.
Also, pickle no longer supports the relative_to option (and this is a big pain), which is super convenient.
So JSON is still a good option, I think; see the example below.
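
A hypothetical example of the relative_to option mentioned above (assuming dump_to_json accepts a relative_to argument and rec is a recording as in the snippets above):

# Hypothetical sketch: store file paths relative to a base folder, so the
# JSON stays valid when the whole folder is moved to another machine.
rec.dump_to_json("my_folder/recording.json", relative_to="my_folder")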

@samuelgarcia (Member Author)

@h-mayorquin I made the change you proposed

@h-mayorquin (Collaborator) commented Sep 25, 2023

OK, so I had a discussion with @alejoe91.

Some points that I remember:

  • The numpy extractors (recording and sorting) should not be turned into json (as we have right now) because it is bad practice: they are too large.
  • Moreover, because even the pickle might be large, we don't want NumpyRecording to participate in any sort of multiprocessing where it has to be pickled (this is the is_dumpable of Sam). We will only allow the memory mode for this.

This seems to be the status quo; a sketch of the resulting settings is below.
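
A hypothetical summary of that status quo as per-class settings (values follow the points above and the diffs in this PR; not verbatim code):

# Hypothetical sketch, not verbatim from the codebase.
# NumpyRecording: too large to copy anywhere, including for multiprocessing.
numpy_recording_serializability = {"memory": False, "json": False, "pickle": False}
# NumpySorting: pickle-to-file kept so generated recordings can be saved.
numpy_sorting_serializability = {"memory": True, "json": False, "pickle": True}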

Something important: for the ground-truth recording, I am not sure that pickling looks bad at all. Look at this; it is not bad for an hour-and-20-minute recording generated with @samuelgarcia's ground-truth generator:

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000., 
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

Times:

Time taken to dump recording to pickle: 0.20 seconds
Time taken to load recording from pickle: 0.07 seconds
Size of the recording pickle: 107.64 MiB
Time taken to dump recording to json: 10.81 seconds
Time taken to load recording from json: 4.55 seconds
Size of the recording json: 591.58 MiB
Time taken to dump sorting to pickle: 0.21 seconds
Time taken to load sorting from pickle: 0.04 seconds
Size of the sorting pickle: 105.47 MiB
Time taken to dump sorting to json: 9.39 seconds
Time taken to load sorting from json: 3.81 seconds
Size of the sorting json: 407.57 MiB

Note that the recording is large because it contains the spike vector and the templates as arrays. Otherwise it would be super light. Maybe there is a way of getting around that (see the sketch after the script below).

This was produced with the following code:

import time
from pathlib import Path

import spikeinterface.core as si
from spikeinterface.core import load_extractor

rec, sorting = si.generate_ground_truth_recording(durations=[3600., 1200.], sampling_frequency=30000.,
                                                  num_channels=32,
                                                  num_units=64,
                                                  )

# Timing for dumping recording to pickle
start_time = time.time()
rec.dump_to_pickle('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to dump recording to pickle: {end_time - start_time:.2f} seconds")

# Timing for loading recording from pickle
start_time = time.time()
load_extractor('ground_truth_recording.pickle')
end_time = time.time()
print(f"Time taken to load recording from pickle: {end_time - start_time:.2f} seconds")

# Print the size of the file in MiB
size_pickle = Path('ground_truth_recording.pickle').stat().st_size / 1024**2
print(f"Size of the recording pickle: {size_pickle:.2f} MiB")

# Timing for dumping recording to json
start_time = time.time()
rec.dump_to_json('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to dump recording to json: {end_time - start_time:.2f} seconds")

# Timing for loading recording from json
start_time = time.time()
load_extractor('ground_truth_recording.json')
end_time = time.time()
print(f"Time taken to load recording from json: {end_time - start_time:.2f} seconds")

size_json = Path('ground_truth_recording.json').stat().st_size / 1024**2
print(f"Size of the recording json: {size_json:.2f} MiB")

# Timing for dumping sorting to pickle
start_time = time.time()
sorting.dump_to_pickle('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to dump sorting to pickle: {end_time - start_time:.2f} seconds")

# Timing for loading sorting from pickle
start_time = time.time()
load_extractor('ground_truth_sorting.pickle')
end_time = time.time()
print(f"Time taken to load sorting from pickle: {end_time - start_time:.2f} seconds")

size_pickle = Path('ground_truth_sorting.pickle').stat().st_size / 1024**2
print(f"Size of the sorting pickle: {size_pickle:.2f} MiB")

# Timing for dumping sorting to json
start_time = time.time()
sorting.dump_to_json('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to dump sorting to json: {end_time - start_time:.2f} seconds")

# Timing for loading sorting from json
start_time = time.time()
load_extractor('ground_truth_sorting.json')
end_time = time.time()
print(f"Time taken to load sorting from json: {end_time - start_time:.2f} seconds")

size_json = Path('ground_truth_sorting.json').stat().st_size / 1024**2
print(f"Size of the sorting json: {size_json:.2f} MiB")

@samuelgarcia samuelgarcia changed the title Add pickle dump in some places Improve serialization concept : memory/json/pickle Sep 27, 2023
@samuelgarcia samuelgarcia merged commit c5bafd1 into SpikeInterface:main Sep 27, 2023
@alejoe91 alejoe91 mentioned this pull request Nov 6, 2023