A couple updates to mountainsort5 sorter #2225
Conversation
Co-authored-by: Zach McKenzie <[email protected]>
sorting = ms5.sorting_scheme2(recording=recording, sorting_parameters=scheme2_sorting_parameters)
elif p["scheme"] == "3":
    sorting = ms5.sorting_scheme3(recording=recording, sorting_parameters=scheme3_sorting_parameters)
assert isinstance(recording, BaseRecording)
What is this assert doing? Is there a concern that users wouldn't supply a recording? Should we tell the user what they supplied incorrectly?
This is to keep the linter happy. The output of load_recording_from_folder is BaseExtractor | None, whereas mountainsort5 expects a BaseRecording. I will update this to make it clearer.
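The type-narrowing point can be sketched as below. Note that `BaseExtractor`, `BaseRecording`, and `load_recording_from_folder` here are simplified stand-ins for the real SpikeInterface classes, not their actual implementations:

```python
from typing import Optional


class BaseExtractor:
    """Stand-in for the SpikeInterface base extractor class (illustrative only)."""


class BaseRecording(BaseExtractor):
    """Stand-in for the recording subclass that mountainsort5 expects."""


def load_recording_from_folder(folder: str) -> Optional[BaseExtractor]:
    # Hypothetical loader: annotated as possibly returning None or a
    # non-recording extractor, which is what the linter complains about.
    return BaseRecording()


def get_recording(folder: str) -> BaseRecording:
    recording = load_recording_from_folder(folder)
    # Narrow Optional[BaseExtractor] to BaseRecording so a static checker
    # accepts passing `recording` to code that requires a BaseRecording,
    # and give the user an informative error instead of a bare assert.
    if not isinstance(recording, BaseRecording):
        raise TypeError(
            f"Expected a BaseRecording from {folder!r}, got {type(recording).__name__}"
        )
    return recording
```

An explicit `isinstance` check with a `TypeError` serves both purposes raised above: it narrows the type for the linter and tells the user what went wrong.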
@zm711 I made the type checking more clear in that last commit.
Co-authored-by: Zach McKenzie <[email protected]>
for more information, see https://pre-commit.ci
This really depends on the IO speed and how many times you need to access each chunk.
Thanks @samuelgarcia. Is that something that would block this PR from being merged?
sorting = ms5.sorting_scheme3(recording=recording, sorting_parameters=scheme3_sorting_parameters)
with TemporaryDirectory() as tmpdir:
    # cache the recording to a temporary directory for efficient reading (so we don't have to re-filter)
    recording_cached = create_cached_recording(recording=recording, folder=tmpdir)
n_jobs=1 here will make this extremely slow for long recordings.
Of course. I am just happy to chat.
I think @samuelgarcia means "of course NOT, it's not blocking the PR" :) In my opinion forcing
That's a good point. Right now, remfile is not thread safe. But maybe it is okay with multiprocessing, not sure. I can make the necessary changes if needed. But what about bandpass_filter and whiten? I know whiten requires an initial step of computing the whitening matrix. That should be done just once, right? In general, since an arbitrary recording extractor could be passed in, how do we know whether it can be parallelized for reading?

EDIT: In the case of either (a) reading from a remote file, (b) doing heavy multi-threaded preprocessing, or (c) both, the bottleneck may not be writing to disk, and so in those cases maybe n_jobs=1 is most appropriate. @alejoe91 @samuelgarcia what would you recommend here?
@magland for whitening we have a trick to only compute the whitening matrix once (at initialization): I don't understand your comment:
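The "compute the whitening matrix once at initialization" trick can be sketched as follows. This is an illustrative ZCA-style whitening in plain NumPy, not SpikeInterface's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_samples = 4, 10_000
# Correlated multi-channel traces (mixing matrix makes channels covary).
traces = rng.normal(size=(n_samples, n_channels)) @ rng.normal(size=(n_channels, n_channels))

# Compute the whitening matrix ONCE, at "initialization", from the
# channel covariance (small epsilon for numerical stability).
cov = np.cov(traces, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-10)) @ eigvecs.T


def whiten_chunk(chunk: np.ndarray) -> np.ndarray:
    # Every chunk reuses the same precomputed W; no per-chunk recomputation.
    return chunk @ W


whitened = whiten_chunk(traces)
```

After this, the channel covariance of `whitened` is (approximately) the identity, and processing chunks in parallel is safe because `W` is read-only shared state.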
Even if writing is not the bottleneck, parallel processing should speed things up dramatically. Say you stream the file, apply the bandpass filter, and whiten in parallel. Each process will stream smaller portions of the file, process them, and write them to disk, so effectively streaming should be faster! You could actually test this in the
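The stream-then-process-in-parallel idea can be sketched like this. This is a toy illustration using threads and an in-memory array (the real pipeline uses SpikeInterface's job system and writes chunks to disk); `preprocess` is a hypothetical stand-in for the filter + whiten stage:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def preprocess(chunk: np.ndarray) -> np.ndarray:
    # Stand-in for the per-chunk work (bandpass filter + whitening).
    return chunk - chunk.mean(axis=0)


def process_in_chunks(traces: np.ndarray, chunk_size: int, n_jobs: int) -> np.ndarray:
    # Split the recording into chunks; each worker handles its own slice,
    # mimicking workers that each stream and process a portion of the file.
    chunks = [traces[i : i + chunk_size] for i in range(0, len(traces), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves chunk order, so results concatenate correctly.
        return np.concatenate(list(pool.map(preprocess, chunks)))


traces = np.arange(20, dtype=np.float64).reshape(10, 2)
out = process_in_chunks(traces, chunk_size=4, n_jobs=2)
```

The point of the discussion above is that each worker's read, compute, and write overlap with the others', so wall-clock time drops even when no single stage is the bottleneck.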
Okay great.

Good idea, I'll test it out. What I'm saying is that since multi-threaded preprocessing is already parallelized (e.g., by numpy), there may not be further gains... similar for downloading... the streaming pipe may already be full (and remfile already uses some parallelization).
Oh for preprocessing we see a massive increase in performance :)
Thinking more about remfile, I don't think it is suited for parallel use unless the different processes are processing entirely different parts of the file. That's because it needs to do smart chunk retrieval, and if multiple processes load from generally the same part of the file, the download will be very redundant and inefficient. I expect you'd encounter similar problems with fsspec. Do you think we should address this by creating a (perhaps mandatory) n_jobs_for_preprocessing sorting parameter?
At least having the parameter would give some flexibility. Also, I wouldn't use an MS5 function to save the recording with SpikeInterface. You can save it in the wrapper directly in the
@alejoe91 thanks so much. I made a few adjustments in response to your comments. Now using tempfile.TemporaryDirectory(). There are two new optional parameters for this sorter:
recording_cached = create_cached_recording(
    recording=recording, folder=tmpdir, n_jobs=p["n_jobs_for_preprocessing"]
)
@magland you should use the recording.save() function here!
I understand, but I want to have control of exactly how it is saved, and also be consistent with the mountainsort5 docs. Here's my implementation.

The SI implementation is doing a lot more indirect stuff. I see save_metadata_to_folder, ._save, copy_metadata, dump. I realize that is all needed in the more general context. But I don't need all that, and I want to do only what is needed for mountainsort5 in a very straightforward way. That's why I created this as a utility of ms5.
Sam and Pierre made a new convenient function that allows you to "cache" a recording with different modes. See here: https://github.com/SpikeInterface/spikeinterface/pull/2276/files
Maybe worth using it here too?
maybe later ;)
I'm just writing myself a note so that I'll be able to find this later. After this update I tried running mountainsort5 on one of my datasets and it errored out. Mountainsort5 was working fine before; the error seems to be with tempfile use on Windows. I'll open issues here and on Mountainsort5 tomorrow with the trace.
Here are a couple of important updates for mountainsort5 that I would like to merge in.
First, it's important to use dtype=float for the bandpass filter, because otherwise it seems to preserve the int type of the input. I saw one example where this made a big difference for this algorithm.
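Why the output dtype matters can be illustrated with plain NumPy. The "filter" below is a toy moving average, not the actual SpikeInterface bandpass filter; the point is only that casting a filter's output back to an integer dtype truncates sub-integer structure:

```python
import numpy as np

rng = np.random.default_rng(1)
# A small-amplitude signal stored as int16, as raw ephys data often is.
raw = rng.normal(scale=3.0, size=1000).astype(np.int16)

# Toy "filter": a 5-sample moving average (stand-in for a bandpass filter).
kernel = np.ones(5) / 5.0
filtered_float = np.convolve(raw.astype(np.float64), kernel, mode="same")

# If the filter output is forced back to the input int dtype, the
# fractional part of every sample is truncated away -- for low-amplitude
# signals this discards real structure the sorter could have used.
filtered_int = filtered_float.astype(np.int16)

max_truncation_error = np.abs(filtered_float - filtered_int).max()
```

Keeping the filtered traces as float (e.g., via a dtype=float argument to the filter) avoids this loss entirely.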
Second, for efficiency, it makes a big difference to cache the preprocessed recording to a temporary directory as a binary recording, because otherwise the data will be re-filtered throughout the process, including potentially expensive whitening.
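The cache-once pattern can be sketched in plain NumPy as below. This is an illustration of the idea, not the create_cached_recording implementation; `expensive_preprocess` is a hypothetical stand-in for the filter + whiten chain:

```python
import numpy as np
from pathlib import Path
from tempfile import TemporaryDirectory


def expensive_preprocess(traces: np.ndarray) -> np.ndarray:
    # Stand-in for bandpass filtering + whitening, which lazy recording
    # objects would otherwise recompute on every read.
    return traces - traces.mean(axis=0)


traces = np.random.default_rng(2).normal(size=(1000, 4)).astype(np.float32)

with TemporaryDirectory() as tmpdir:
    # Run the expensive preprocessing ONCE and cache the result to a flat
    # binary file in a temporary directory.
    cached_path = Path(tmpdir) / "recording_cached.dat"
    processed = expensive_preprocess(traces)
    processed.tofile(cached_path)

    # Later reads hit the cached binary file, not the preprocessing chain.
    cached = np.fromfile(cached_path, dtype=np.float32).reshape(processed.shape)
    roundtrip_ok = np.array_equal(cached, processed)
```

Because spike sorting reads the same traces many times (detection, feature extraction, etc.), paying the preprocessing cost once up front dominates the cost of one extra disk write.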