A couple updates to mountainsort5 sorter #2225

Merged · 11 commits merged into SpikeInterface:main on Jan 22, 2024
Conversation

magland (Member) commented Nov 17, 2023

Here are a couple of important updates for mountainsort5 that I would like to merge in.

First, it's important to use dtype=float for the bandpass filter, because otherwise the filter appears to preserve the recording's int dtype. I saw one example where this made a big difference for this algorithm.
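
For illustration, a minimal sketch of this fix using the standard SpikeInterface preprocessing API (the cutoff frequencies and the `recording` variable are placeholders):

```python
import spikeinterface.preprocessing as spre

# Filtering an int16 recording without an explicit dtype can keep the
# integer dtype and truncate the filtered signal; forcing float avoids that.
recording_filtered = spre.bandpass_filter(
    recording, freq_min=300, freq_max=6000, dtype="float32"
)
```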

Second, for efficiency, it makes a big difference to cache the recording after preprocessing to a temporary directory as a binary recording, because otherwise the data will be re-filtered throughout the process, including the potentially expensive whitening step.
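
A sketch of that caching step, assuming create_cached_recording is exported from mountainsort5.util (the file is linked later in this thread; recording_preprocessed stands for the filtered/whitened recording):

```python
from tempfile import TemporaryDirectory

import mountainsort5 as ms5
from mountainsort5.util import create_cached_recording

with TemporaryDirectory() as tmpdir:
    # write the preprocessed traces once as a binary recording, so the sorter
    # reads from disk instead of re-running filtering/whitening on every pass
    recording_cached = create_cached_recording(recording=recording_preprocessed, folder=tmpdir)
    sorting = ms5.sorting_scheme2(
        recording=recording_cached,
        sorting_parameters=scheme2_sorting_parameters,
    )
```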

```python
    sorting = ms5.sorting_scheme2(recording=recording, sorting_parameters=scheme2_sorting_parameters)
elif p["scheme"] == "3":
    sorting = ms5.sorting_scheme3(recording=recording, sorting_parameters=scheme3_sorting_parameters)

assert isinstance(recording, BaseRecording)
```
zm711 (Collaborator):

What is this assert doing? Is there a concern that users wouldn't supply a recording? Should we tell the user what they put in wrong?

magland (Member, Author):

This is to keep the linter happy. The output of load_recording_from_folder is BaseExtractor | None, whereas mountainsort5 expects a BaseRecording. I will update this to make it more clear.
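
For illustration, the narrowing could look something like this (a sketch, not the exact commit; `folder` is a placeholder argument):

```python
from spikeinterface.core import BaseRecording

recording = load_recording_from_folder(folder)  # typed as BaseExtractor | None
if not isinstance(recording, BaseRecording):
    raise TypeError(f"expected a BaseRecording, got {type(recording)}")
# past this point, static type checkers narrow `recording` to BaseRecording
```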

magland (Member, Author) commented Nov 17, 2023

@zm711 I made the type checking more clear in that last commit.

zm711 added the sorters (Related to sorters module) label Nov 17, 2023
samuelgarcia (Member) commented:

> Second, for efficiency, it makes a big difference to cache the recording after preprocessing to a temporary directory as a binary recording, because otherwise the data will be re-filtered throughout the process, including the potentially expensive whitening step.

This really depends on the I/O speed and on how many times you need to access each chunk.
Writing to a network drive is sometimes slow.
My plan is to add an in-memory (shared) way to cache the recording soon.

magland (Member, Author) commented Nov 17, 2023

> This really depends on the I/O speed and on how many times you need to access each chunk. Writing to a network drive is sometimes slow. My plan is to add an in-memory (shared) way to cache the recording soon.

Thanks @samuelgarcia. Is that something that would block this PR from being merged?

```python
    sorting = ms5.sorting_scheme3(recording=recording, sorting_parameters=scheme3_sorting_parameters)

with TemporaryDirectory() as tmpdir:
    # cache the recording to a temporary directory for efficient reading (so we don't have to re-filter)
    recording_cached = create_cached_recording(recording=recording, folder=tmpdir)
```
Member:

n_jobs=1 here will make this extremely slow for long recordings.

samuelgarcia (Member) commented:

> Thanks @samuelgarcia. Is that something that would block this PR from being merged?

Of course. I am just happy to chat.

alejoe91 (Member) commented:

> > Thanks @samuelgarcia. Is that something that would block this PR from being merged?
>
> Of course. I am just happy to chat.

I think @samuelgarcia means "of course NOT, it's not blocking the PR" :)

In my opinion, forcing n_jobs=1 will be a bottleneck, especially for high channel counts. Have you encountered any errors reading with remfile in parallel? IMO streaming is an edge case, so it should be handled separately.

magland (Member, Author) commented Nov 20, 2023

@alejoe91

> In my opinion, forcing n_jobs=1 will be a bottleneck, especially for high channel counts. Have you encountered any errors reading with remfile in parallel? IMO streaming is an edge case, so it should be handled separately.

That's a good point. Right now, remfile is not thread-safe, but it may be okay with multiprocessing; I'm not sure. I can make the necessary changes if needed.

But what about bandpass_filter and whiten? I know whiten requires an initial step of computing the whitening matrix; that should be done just once, right? In general, since an arbitrary recording extractor could be passed in, how do we know whether it can be parallelized for reading?

EDIT: In the case of either (a) reading from a remote file, (b) doing heavy multi-threaded preprocessing, or (c) both, the bottleneck may not be writing to disk, and so in those cases maybe n_jobs=1 is most appropriate.

@alejoe91 @samuelgarcia what would you recommend here?

alejoe91 (Member) commented:

@magland for whitening we have a trick to only compute the whitening matrix once (at initialization):
basically, we compute W and M and then add them to the self._kwargs, so that subprocesses just load them (see here).
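
A minimal sketch of that pattern, for illustration (compute_whitening_matrix is a placeholder for however W and M are actually estimated; the real WhitenRecording has more options):

```python
class WhitenSketch:
    def __init__(self, recording, W=None, M=None):
        if W is None:
            # expensive step: estimate whitening matrix W and channel means M
            # from random data chunks; this runs only once, in the parent process
            W, M = compute_whitening_matrix(recording)
        self.W, self.M = W, M
        # keep the computed matrices in the kwargs used to re-instantiate the
        # object, so worker subprocesses just load them instead of recomputing
        self._kwargs = dict(recording=recording, W=W, M=M)
```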

I don't understand your comment:

> EDIT: In the case of either (a) reading from a remote file, (b) doing heavy multi-threaded preprocessing, or (c) both, the bottleneck may not be writing to disk, and so in those cases maybe n_jobs=1 is most appropriate.

Even if writing is not the bottleneck, parallel processing should speed things up dramatically. Say you stream the file, apply the bandpass filter, and whiten in parallel. Each process will stream smaller portions of the file, process them, and write them to disk, so effectively streaming should be faster!

You could actually test this in the remfile PR!

magland (Member, Author) commented Nov 20, 2023

> @magland for whitening we have a trick to only compute the whitening matrix once (at initialization): basically, we compute W and M and then add them to the self._kwargs, so that subprocesses just load them (see here).

Okay great.

> I don't understand your comment:
>
> > EDIT: In the case of either (a) reading from a remote file, (b) doing heavy multi-threaded preprocessing, or (c) both, the bottleneck may not be writing to disk, and so in those cases maybe n_jobs=1 is most appropriate.
>
> Even if writing is not the bottleneck, parallel processing should speed things up dramatically. Say you stream the file, apply the bandpass filter, and whiten in parallel. Each process will stream smaller portions of the file, process them, and write them to disk, so effectively streaming should be faster!
>
> You could actually test this in the remfile PR!

Good idea, I'll test it out. What I'm saying is that since multi-threaded preprocessing is already parallelized (e.g., by numpy), there may not be further gains... similar for downloading... the streaming pipe may already be full (and remfile already uses some parallelization).

alejoe91 (Member) commented:

Oh for preprocessing we see a massive increase in performance :)

magland (Member, Author) commented Nov 20, 2023

Thinking more about remfile, I don't think it is suited for parallel use unless the different processes are processing entirely different parts of the file. That's because it needs to do smart chunk retrieval, and if there are multiple processes loading from generally the same part of the file, then the download will be very redundant and inefficient. I expect you'd encounter similar problems with fsspec.

Do you think we should address this by creating a (perhaps mandatory) n_jobs_for_preprocessing sorting parameter?

alejoe91 (Member) commented:

At least having the parameter would give some flexibility. Also, I wouldn't use an MS5 function to save the recording from within SpikeInterface. You can save it in the wrapper directly to the output_folder (which is exactly what the Matlab-based sorters do).
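
For comparison, a sketch of the suggested alternative using the standard recording.save() API (assuming output_folder is a pathlib.Path; the job kwargs are placeholders):

```python
# save the preprocessed recording as binary inside the sorter output folder,
# letting SpikeInterface handle chunked, parallel writing
recording_cached = recording.save(
    folder=output_folder / "recording_cached",
    n_jobs=4,
    chunk_duration="1s",
)
```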

magland (Member, Author) commented Nov 20, 2023

@alejoe91 thanks so much. I made a few adjustments in response to your comments:

- Now using tempfile.TemporaryDirectory()
- There are two new optional parameters for this sorter: n_jobs_for_preprocessing and temporary_base_dir (usage sketch below)
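
For context, a hypothetical invocation with the two new parameters (the values shown are placeholders, not defaults):

```python
import spikeinterface.sorters as ss

sorting = ss.run_sorter(
    "mountainsort5",
    recording,
    n_jobs_for_preprocessing=4,         # workers used when caching the preprocessed recording
    temporary_base_dir="/tmp/scratch",  # base directory for the TemporaryDirectory
)
```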

Comment on lines +189 to +191
```python
recording_cached = create_cached_recording(
    recording=recording, folder=tmpdir, n_jobs=p["n_jobs_for_preprocessing"]
)
```
Member:

@magland you should use the recording.save() function here!

magland (Member, Author):

I understand, but I want to have control of exactly how it is saved, and also be consistent with the mountainsort5 docs. Here's my implementation

https://github.com/flatironinstitute/mountainsort5/blob/main/mountainsort5/util/create_cached_recording.py

The SI implementation is doing a lot more indirect stuff. I see save_metadata_to_folder, _save, copy_metadata, and dump. I realize that is all needed in the more general context, but I don't need all of that, and I want to do only what is needed for mountainsort5 in a very straightforward way. That's why I created this as a utility of ms5.

Member:

@magland

Sam and Pierre made a new convenient function that allows you to "cache" a recording with different modes. See here: https://github.com/SpikeInterface/spikeinterface/pull/2276/files

Maybe worth using it here too?

Member:

maybe later ;)

alejoe91 added this to the 0.100.0 milestone Jan 9, 2024
alejoe91 merged commit fb3c9c9 into SpikeInterface:main Jan 22, 2024
11 checks passed
zm711 (Collaborator) commented Feb 5, 2024

I'm just writing myself a note so that I'll be able to find this later. After this update I tried running mountainsort5 on one of my datasets and it errored out. Mountainsort5 was working fine before; the error seems to be with the tempfile use on Windows. I'll open issues here and on the mountainsort5 repo tomorrow with the trace.
