Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_filtered to annotations in binary.json #3245

Merged
merged 8 commits into from
Sep 11, 2024

Conversation

zm711
Copy link
Collaborator

@zm711 zm711 commented Jul 23, 2024

Fixes #3244

When we save as binary it seems like we are not saving the is_filtered param. This fixes that. The issue linked however, is for the waveform extractor where this mattered more.

@zm711 zm711 added bug Something isn't working core Changes to core module labels Jul 23, 2024
@JoeZiminski
Copy link
Collaborator

JoeZiminski commented Jul 23, 2024

Nice!! I read the discussion and didn't think to acutally check if it was propagated 🤦‍♂️.

Getting outside of the scope of this PR, is it worth adding a test to ensure all parameters are propagated going forward? This is similar to #2515, I wonder if worth reviewing all exporters and setting up some tests to ensure everything required is always propagated. If a config is not propagated between save / load rounds, depending on the config, I guess there is the propensity for some very nasty bugs, in this case we are lucky it was just is_filtered.

@zm711
Copy link
Collaborator Author

zm711 commented Jul 23, 2024

Yeah, some sort of test would probably be a good idea. I guess there is a difference between user annotations vs fundamental annotations, but if you have something in mind I think it would be a useful PR.

@JoeZiminski
Copy link
Collaborator

Yes that makes sense, I got confused between _annotations and _properties, I guess _properties are much more important and are probably tested somewhere else. The diff looks good to me, I am testing locally and am confused why this is not failing on main, I must be misunderstanding something:

import spikeinterface.full as si
import numpy as np

recording, _ = si.generate_ground_truth_recording(durations=[10])
recording = si.filter(recording)

recording.save(folder="hello_world", format="binary", overwrite=True)

load_recording = si.load_extractor("hello_world")

# for key in recording._properties.keys():
#    assert np.all(recording._properties[key] == load_recording._properties[key])

for key in recording._annotations.keys():
    # How is the `is_filtered` being propagated?
    assert np.all(recording._annotations[key] == load_recording._annotations[key])

@zm711
Copy link
Collaborator Author

zm711 commented Jul 24, 2024

Actually you raise a really good point. Looking at the results:

This is from Main.

image

@zm711
Copy link
Collaborator Author

zm711 commented Jul 24, 2024

This might actually be a problem with the generator.

If I run

>>> recording = generate_recording(100)
>>> recording.is_filtered()
True

@JoeZiminski
Copy link
Collaborator

JoeZiminski commented Jul 24, 2024

Interesting! I guess that makes sense for generate_recording as it is essentially filtered, so the filter step can be removed from the example. But it is still suprising the information is propagated to the loaded file (i.e. it is not default False or None in the loaded recording). One think of note is that the information is properly saved in si_folder.json but is wrong in binary.json, so maybe the load_extractor is just ignoring binary.json and using si_folder.json? but then maybe the waveform extractor is relying on binary.json resulting in #3244?

@JoeZiminski
Copy link
Collaborator

Just a quick update, it seems the binary.json is only used when a file is loaded by BinaryFolderRecording but could not see anywhere that uses BinaryFolderRecording as an extractor in the codebase. Maybe, BinaryFolderRecording can be deprecated @samuelgarcia in favour of whatever si.load_extractor() is doing, working from the si_folder.json?

ATM I can't figure out how to replicate the annotations not being propagated, when I try sorting on a test dataset I see the spikeinterface_recording.pickle (rather than spikeinterface_recording.json), but loading it has the annotations propagated. I tried saving waveforms, and the recording_info/recording_attributes.json also has the propagated annotations. So I am confused 😄

@zm711
Copy link
Collaborator Author

zm711 commented Jul 25, 2024

I think you digging into this is super important. I too didn't realize that we had is_filtered everyone. You've convinced me that fixing this in one place may be the wrong place. I think we leave this open as a discussion point for trying to understand if the duplication is necessary (or is because of backward compatibility) and if it possible we clean up the duplication everywhere.

):
# This assigns num_channels if num_channels is not None, otherwise num_chan is assigned
num_channels = num_channels or num_chan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zm711 unfortunately, I think we need to keep this. This is not only a deprecation for a current version, but it allows to load recordings that were saved with old versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Le'ts get rid of the deprecation comment then and switch it to "needed for backward compatiblity"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

@zm711 zm711 changed the title add is_filtered to annotations [discussion-no merge] add is_filtered to annotations Aug 22, 2024
@JoeZiminski
Copy link
Collaborator

Hey @samuelgarcia @alejoe91, do you think the writing of the binary.json can be removed? It seems to be missing certain properties and I can't see anywhere it is used in the codebase. In this case, I think the only outstanding mystery why they were not propagated in the original issue #3244, maybe this was a red herring. If no one else can replicate the bug from that issue, and @samuelgarcia @alejoe91 are happy with removal of binary.json, then removing binary.json may be the best way forward?

@alejoe91
Copy link
Member

I think we need the binary.json since it's used to read the binary folder, no? @samuelgarcia ?

@JoeZiminski
Copy link
Collaborator

JoeZiminski commented Sep 10, 2024

Ah yes sorry I had to look into this a bit more, I was looking for BinaryFolderRecording but it is cast to read_binary_folder which is a public function, so it is used somewhere (i.e. as a public function). And it is used internally in the BaseRecording._save() function.

However, I think it's internal use there is not necessary and can be removed, and read_binary_folder will load binary.json that has missing metadata, and it would be better to use si_folder.json like load_extractor does, as discussed below.

In the below code,

import spikeinterface.full as si
import numpy as np

recording, _ = si.generate_ground_truth_recording(durations=[10])
recording = si.filter(recording)

recording.set_property("group", np.arange(recording.get_num_channels()))
print(recording._properties["group"])

recording.save(folder="hello_world", format="binary", overwrite=True)

recording.save() call chain goes:

  1. Base.save() ->
  2. Base.save_to_folder() ->
  3. BaseRecording._save()

This writes the binary, creates a BinaryRecordingExtractor around the binary file, dumps it to binary.json and reloads it as BinaryFolderRecording. However, a number of properties in binary.json / BinaryFolderRecording are not updated, for example is_filtered (but now there is Zach's fix above), but also other entries, for example the group property is not propagated (you can test with above code). This may be a design choice though, as BinaryFolderRecording is returned as cached and we go back up to Base.save().

  1. Base.save() now calls it's copy_metadata() method which overwrites all metadata on the BinaryFolderRecording with the correct metadata from self (i.e. the recording object on which .save() was initially called). This is all then dumped to si_folder.json which has a complete and correct metadata set.

So, I think that the intermediate step of writing to binary.json in BaseRecording._save() is not needed, the BinaryRecordingExtractor can be returned as cached instead. I tried this and it works. I think if a user were to use read_binary_recording() directly on the output of recording.save(..., format="binary") then they would have missing metadata on the recording, and it would be better to use load_extractor. Otherwise, I do not think binary.json is used anywhere. Therefore, one approach is to:

  1. Remove the BinaryFolderRecording and read_binary_recording. Pass BinaryRecordingExtractor as cached instead in BaseRecording.save() and deprecate the public read_binary_recording function in favour of load_extractor.

This would be the main thing as it would remove from public view the binary.json and potentially broken read_binary_folder().

Additionally, some refactorings to BinaryRecordingExtractor might be useful, but there are things here I might be missing and it is much more work. At the moment BinaryRecordingExtractor is storing some metadata but not others, and seems to need rescuing with copy_metadata to ensure it has correct metadata. Therefore it might be nice to remove all metadata variables from its constructor, to make it explicitly that it does not contain metadata until overwrite with copy_metadata. Then it's use can be deprecated elsewhere in the code base, in favour of load_extractor, and just used as a light wrapper in BaseRecording._save().

@JoeZiminski
Copy link
Collaborator

JoeZiminski commented Sep 10, 2024

BTW @zm711 maybe this discussion is getting a bit out of scope, for sure your fix will help propagate is_filtered to binary.json and any loading done by load_binary_recording for now. Maybe this can be merged and further discussion on possibly removing binary.json can be done in an issue?

@zm711
Copy link
Collaborator Author

zm711 commented Sep 10, 2024

yeah that's possible. We can merge this just to get the propagation. Let me add the comment about not removing num_chans and then this is technically ready.

@zm711 zm711 changed the title [discussion-no merge] add is_filtered to annotations Add is_filtered to annotations in binary.json Sep 10, 2024
@zm711
Copy link
Collaborator Author

zm711 commented Sep 10, 2024

Okay @JoeZiminski and @alejoe91, this is ready to be merged. Like Joe said maybe we can have a deeper discussion about how different properties and metadata are propagated in the library (since multiple jsons are used). So this can be merged to make sure binary.json has the annotation moving forward. Then we could think about better ways to deal with this library wide.

@alejoe91 alejoe91 added this to the 0.101.1 milestone Sep 11, 2024
@alejoe91 alejoe91 merged commit 36ccbd4 into SpikeInterface:main Sep 11, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core Changes to core module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

is is_filtered a manual param?
3 participants