
Add Poisson statistics to generate_sorting and optimize memory profile #2226

Merged
merged 23 commits into SpikeInterface:main from improve_generate_sorting
Jan 22, 2024

Conversation

h-mayorquin
Collaborator

@h-mayorquin h-mayorquin commented Nov 18, 2023

I added a function that generates spikes as a Poisson process, which, to my understanding, is the most common statistical model of firing.

It is also about 3 times faster: it generates a ten-hour sorting with 1000 units in 7 seconds instead of 20 with the current implementation, a ~3x speedup.

--------------------------------------------------
`synthesize_poisson_spike_vector`
Mean time over 3 iterations: 6.94 seconds
Std over 3 iterations: 0.01 seconds
times=['6.92', '6.96', '6.95']
--------------------------------------------------
`synthesize_random_firings`
Mean time over 3 iterations: 21.12 seconds
Std over 3 iterations: 0.09 seconds
times=['21.05', '21.05', '21.25']

Speedup: 3.04

Plus, it is more memory efficient: it needs around 70 % of the memory required by the current function (I did try to make it less memory hungry, but that is HARD). It also seems that the current implementation has some kind of leak; see the temporal profile of memory utilization in the image below.

[image: memory profile over time for both implementations]

The first three runs are the PR implementation; the last three are the current implementation.

[EDIT]:
Something important that I forgot to say: when the durations are too short and the firing rates too low, this will sometimes (randomly) produce empty spike trains, but I think that's fine. It seems like a relevant thing to discuss anyway:

from spikeinterface.core.generate import generate_sorting


seed = 4
sorting = generate_sorting(num_units=2, durations=[1.0], sampling_frequency=30000.0, seed=seed)
sorting.get_unit_spike_train(0, return_times=True)
array([], dtype=float64)
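
For reference, here is a minimal sketch of the idea (illustrative only; the actual `synthesize_poisson_spike_vector` added by this PR differs in detail): spike frames can be generated directly in integer samples by drawing geometric inter-spike intervals, the discrete analogue of exponential ISIs. It also shows why short durations with low rates can yield empty trains: every drawn frame may fall past the end of the recording.

import numpy as np

def poisson_spike_frames(firing_rate_hz, duration_s, sampling_frequency, seed=0):
    # Illustrative helper, not the PR's implementation
    rng = np.random.default_rng(seed)
    num_frames = int(duration_s * sampling_frequency)
    p = firing_rate_hz / sampling_frequency  # probability of a spike per sample (valid while p << 1)
    expected_spikes = int(firing_rate_hz * duration_s)
    # Draw more ISIs than needed, then keep only the frames inside the duration
    isi_samples = rng.geometric(p=p, size=2 * expected_spikes + 10)
    frames = np.cumsum(isi_samples)
    return frames[frames < num_frames]

frames = poisson_spike_frames(firing_rate_hz=5.0, duration_s=10.0, sampling_frequency=30_000.0)
print(frames.size)  # ~50 spikes expected; can be 0 for very short durations or low rates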

@h-mayorquin h-mayorquin added the core (Changes to core module) label Nov 18, 2023
@h-mayorquin h-mayorquin marked this pull request as ready for review November 18, 2023 14:47
@h-mayorquin h-mayorquin changed the title from "Add posion statistics to generate_sorting" to "Add Poisson statistics to generate_sorting and optimize memory profile" Nov 18, 2023
@h-mayorquin h-mayorquin self-assigned this Nov 18, 2023
@CodyCBakerPhD
Collaborator

@h-mayorquin Curious what speeds you observe with

import numpy

seed = 0
random_number_generator = numpy.random.default_rng(seed=seed)

number_of_units = 1_000

firing_rates = 10.0  # Hz, scalar usage
# firing_rates = [10.0 for unit_index in range(number_of_units)]  # Hz, vector usage

duration = 60.0  # seconds

# sampling_frequency = None
sampling_frequency = 30_000  # Hz, if specified

# refractory_period = None
refractory_period = 4.0  # milliseconds


def _clean_refractory_period(original_spike_times: numpy.ndarray, refractory_period_seconds: float) -> numpy.ndarray:
    inter_spike_intervals = numpy.diff(original_spike_times, prepend=refractory_period_seconds)
    violations = inter_spike_intervals < refractory_period_seconds
    if not numpy.any(violations):
        return original_spike_times
    # Shift violating spikes forward so that each offending interval equals the refractory period
    spike_time_shifts = refractory_period_seconds - inter_spike_intervals[violations]
    cleaned_spike_times = original_spike_times.copy()
    cleaned_spike_times[violations] += spike_time_shifts
    return cleaned_spike_times


if numpy.isscalar(firing_rates):
    number_of_spikes_per_unit = random_number_generator.poisson(lam=firing_rates * duration, size=number_of_units)
else:
    number_of_spikes_per_unit = numpy.empty(shape=number_of_units, dtype="uint16")
    for unit_index in range(number_of_units):
        number_of_spikes_per_unit[unit_index] = int(
            random_number_generator.poisson(lam=firing_rates[unit_index] * duration, size=1)
        )

spike_times = list()
if sampling_frequency is None:
    for number_of_spikes in number_of_spikes_per_unit:
        spikes = numpy.sort(random_number_generator.uniform(low=0, high=duration, size=number_of_spikes))
        if refractory_period is not None:
            spikes = _clean_refractory_period(
                original_spike_times=spikes, refractory_period_seconds=refractory_period / 1e3
            )
        spike_times.append(spikes)
else:
    for number_of_spikes in number_of_spikes_per_unit:
        spikes = numpy.sort(
            random_number_generator.integers(
                low=0, high=int(duration * sampling_frequency), size=number_of_spikes, dtype="uint64"
            )
        ) / sampling_frequency
        if refractory_period is not None:
            spikes = _clean_refractory_period(
                original_spike_times=spikes, refractory_period_seconds=refractory_period / 1e3
            )
        spike_times.append(spikes)
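
# Hypothetical follow-up check (not part of the snippet above): the empirical rate of
# each generated train should sit close to the requested `firing_rates` (10 Hz here).
empirical_rates = [len(spikes) / duration for spikes in spike_times]
print(numpy.mean(empirical_rates), numpy.std(empirical_rates))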

Collaborator

@zm711 zm711 left a comment

docstring cleanup :)

Resolved (outdated) review comments on src/spikeinterface/core/basesorting.py and src/spikeinterface/core/generate.py.
Comment on lines 360 to 361
- The function uses a geometric distribution to simulate the discrete inter-spike intervals,
based that would be an exponential process for continuous time.
Collaborator

This isn't quite clear. Maybe it is missing a word. I'm not quite sure how to fix it.

Collaborator Author

@h-mayorquin h-mayorquin Nov 18, 2023

You are correct, I think; this is not clear. I will expand on it. Thanks also for all the other comments; they all make sense and are very helpful, as usual.

Collaborator Author

I modified the docstring here; let me know what you think and whether it is still unclear to you. Also, any other advice you might have is welcome.

@h-mayorquin
Collaborator Author

h-mayorquin commented Nov 20, 2023

@CodyCBakerPhD
Most of the computation in the function happens here:
https://github.com/catalystneuro/spikeinterface/blob/ac41b95eb31fc76d21fd2e8f0917092d086b481e/src/spikeinterface/core/generate.py#L400-L405

That is the bottleneck: generating sorted, concatenated spike frames and their corresponding unit indices. This is roughly what the output should look like:

spike_frames
[  452 15511 37989 42417 62234 71248 74939]
unit indexes
[1 1 0 0 1 1 0]

The concatenated frames have to be sorted, but frames from different units can occur at the same time. So you cannot sort until you concatenate, yet the cumulative sum has to happen on the compact per-unit version. That makes it hard to avoid the large memory allocation or to sort smaller vectors.

Reading your code, it seems that you just concatenate sorted spikes, and I can't see the vector of unit indices. After adding the concatenation of all the spikes and then sorting them (even without the units part), the code is already slower than the old implementation under %%timeit in a notebook.
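
For context, a minimal sketch of the kind of operation described above (illustrative names, not the actual lines in generate.py): per-unit cumulative sums give frames that are sorted within each unit, but the concatenated vector still has to be sorted globally, together with a matching vector of unit indices.

import numpy as np

rng = np.random.default_rng(0)
num_units, spikes_per_unit = 3, 5

# Frames per unit from cumulative sums of integer ISIs (sorted within each unit)
frames_per_unit = [np.cumsum(rng.integers(1, 10_000, size=spikes_per_unit)) for _ in range(num_units)]

# The expensive part: concatenate everything and sort globally, carrying the unit labels along
spike_frames = np.concatenate(frames_per_unit)
unit_indices = np.repeat(np.arange(num_units), spikes_per_unit)
order = np.argsort(spike_frames, kind="stable")
spike_frames, unit_indices = spike_frames[order], unit_indices[order]

print(spike_frames)
print(unit_indices)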

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

Maybe I am not understanding what you mean by negligible?

To me, the difference between 10,000; 10,200; and 10,400 is negligible.

I am not sure what you mean by bias, it is surely not in a formal sense

I meant that the resulting ISIs might be erroneous.
Let's take the (extreme) example where lambda is significantly smaller than the delta time between two samples.
Then you're only generating zeros. But if you cast after the cumulative sum instead, you still add the sub-sample values, and after multiple spikes the sum advances to the next sample.

@h-mayorquin
Collaborator Author

To me, the difference between 10,000; 10,200; and 10,400 is negligible.

Right, it was me who did not understand above. You are correct. Let's switch to 4 stds; I think it makes sense.

I meant that the resulting ISIs might be erroneous.
Let's take the (extreme) example where lambda is significantly smaller than the delta time between two samples.
Then you're only generating zeros. But if you cast after the cumulative sum instead, you still add the sub-sample values, and after multiple spikes the sum advances to the next sample.

I still don't understand this. Lambda is the firing rate, right?
What would be an erroneous ISI? Maybe we can test this. Are you saying that the frames generated by this function, transformed to times, will not have an exponential distribution in some sense?

Also, generating the frames as integers is something that the current implementation already does. Is that being avoided right now? Maybe this will help me understand.

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

Let's take an example where the generation would give [4.4, 3.3, 6.3]. The cumulative sum is [4.4, 7.7, 14.0], which after casting is [4, 7 (or 8), 14] (depending on the rounding rule).

But if you first convert to integers, then you have [4, 7, 13], which is clearly different, and what I was referring to when I said bias.
Of course, if the firing rate is low, you won't notice any difference. But with a high firing rate and high refractory period, I think it can be noticeable.
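
A tiny sketch of the two orderings being compared (illustrative, not the PR code):

import numpy as np

isis = np.array([4.4, 3.3, 6.3])  # continuous inter-spike intervals, in samples

cast_after = np.cumsum(isis).astype(int)   # [ 4,  7, 14]: sub-sample parts accumulate
cast_before = np.cumsum(isis.astype(int))  # [ 4,  7, 13]: sub-sample parts are discarded
print(cast_after, cast_before)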

@h-mayorquin
Collaborator Author

h-mayorquin commented Nov 20, 2023

@DradeAW
But I am not sure that what I am doing is equivalent. I am generating the integers already, where a spike can be (or not) on each of the ticks of the sampling rate. I am not converting to integers. If anything, the problem would be the other way around: how well a binomial distribution approximates a Poisson, and those assumptions hold very well here.

But with a high firing rate and high refractory period, I think it can be noticeable.

Yeah, maybe try to run some experiments? Maybe you are right, but "noticeable" is not very actionable to me. Do you expect it to fail a goodness-of-fit test? Would the mean be far from the true mean? Would it be more or less skewed than the exponential? I think that would be the easiest thing to do: try the current implementation and this one in different scenarios, and let's see if it does worse in any of them. That would be very useful.

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

I am generating the integers already, where a spike can be (or not) on each of the ticks of the sampling rate

I'm sorry, I misunderstood!
Then yes, it is fine :)

@h-mayorquin
Collaborator Author

Ah, OK. Still, it would be useful to know if this breaks in some limit. Check it out:

[image]

Blue is empirical.

@h-mayorquin
Collaborator Author

Your math seems to be failing you

Here is the mathematical proof then:

The mean of an exponential distribution is E(X) = 1 / lambda = beta, so E(X + refractory_period) = E(X) + refractory_period.

Thus, E(X) + refractory_period = 1 / firing_rate <=> E(X) = 1 / firing_rate - refractory_period. QED

I did not mean that; I meant your expectations about how the exponential distribution should look. But now that I understand that you are privileging the count divided by long-term time, that makes sense.

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

I really don't understand what you are doing,

We just want a unit with a given mean firing rate and refractory period, and your method just outputs a wrong firing rate.

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

you are privileging the count divided by long-term time that makes sense

Yes! It is the definition of the mean firing rate!!

@h-mayorquin
Collaborator Author

I am implementing the Poisson statistics as described in books like:
https://mitpress.mit.edu/9780262041997/theoretical-neuroscience/#:~:text=Larry%20Abbott%20is%20Professor%20of,Unit%20at%20University%20College%20London.

Check out page 31:
[image: excerpt from page 31]

The first references on Google indicate a similar implementation.

The thing that you are proposing I have never seen. Maybe it is correct and it makes a lot of sense to you, but this is not how it is done in the books I have read. Can you show me some reference or another library implementation showing that your proposed convention is commonplace:

  • Increase the firing rate outside of the refractory period so as to keep the number of spikes constant over long intervals.

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

I've not read this anywhere, this is what I do by simply manipulating statistics :)

It may not be the correct answer in some cases, but in the case of firing_rate = 99.9 Hz and refractory_period = 10 ms, this version simply fails.
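
To put rough numbers on this example (a back-of-the-envelope check, not from the thread): at a requested 99.9 Hz the mean exponential ISI is 1/99.9 ≈ 10.01 ms. Without rescaling, adding the 10 ms dead time gives a mean ISI of about 20.01 ms, i.e. an output rate of roughly 50 Hz instead of 99.9 Hz; with rescaling, the exponential part is left with a mean of only about 0.01 ms.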

@h-mayorquin
Collaborator Author

I've not read this anywhere, this is what I do by simply manipulating statistics :)

x D
Lol

I will ask Alessio and Sam what they prefer for this case the next time we meet, as I don't think they will dare to read this super long thread that you and I created.

@alejoe91
Member

I have been passively enjoying the passionate exchange ;)

@JoeZiminski
Collaborator

JoeZiminski commented Nov 20, 2023

Interesting! Could I confirm that the only outstanding point of discussion is whether to rescale $\lambda$ such that the passed firing rate to the function is the firing rate of the final spike train? Alternatively, the passed firing rate is the firing rate $\lambda$ used for the Poisson process prior to adding refractory effects. Are there any other differences between the implementations? (let me know if I have misunderstood)

On this, I am not familiar with the conventions in the field; some sources seem to do this rescaling [1, 2], whereas other sources [3, Dayan & Abbott] do not mention it.

@h-mayorquin
Collaborator Author

h-mayorquin commented Nov 20, 2023

Could I confirm that the only outstanding point of discussion is whether to rescale $\lambda$ such that the passed firing rate to the function is the firing rate of the final spike train? Alternatively, the passed firing rate is the firing rate $\lambda$ used for the Poisson process prior to adding refractory effects.

@JoeZiminski this is a good summary.

Thanks for the references. So Bartoz does indeed modify the rate, with no constraints. Andrew also modifies the rate, but excludes the case of the refractory period being larger than the inter-spike interval. Plus, Andrew provides a reference to Stefan's paper, where they call this a Poisson process with dead time to make clear that this is not a Poisson process anymore:

https://link.springer.com/article/10.1007/s10827-011-0362-8

This answers my concerns. If we decide to go for the version in the aforementioned paper, I am fine with it. Thanks, @DradeAW, I learned something today. I had never thought carefully before about when the assumptions of the Poisson renewal process that I learned from Abbott's treatment break down. I still think that the bursting model they have over there makes more sense, but this seems to be just another equally valid convention in the field. It also has the advantage that it is easier to implement, now that I think about it.
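
For illustration, a minimal sketch of the rescaled "Poisson process with dead time" convention under discussion (a hedged sketch with illustrative names, not the merged implementation):

import numpy as np

def dead_time_poisson_spike_times(firing_rate_hz, refractory_period_ms, duration_s, seed=0):
    # Exponential ISIs rescaled so the long-run rate matches firing_rate_hz, then shifted by the dead time
    rng = np.random.default_rng(seed)
    dead_time_s = refractory_period_ms / 1000.0
    mean_isi_s = 1.0 / firing_rate_hz
    assert mean_isi_s > dead_time_s, "firing rate too high for this refractory period"
    # Draw more ISIs than the expected count, then clip to the duration
    n = int(firing_rate_hz * duration_s * 1.5) + 10
    isis = dead_time_s + rng.exponential(scale=mean_isi_s - dead_time_s, size=n)
    spike_times = np.cumsum(isis)
    return spike_times[spike_times < duration_s]

spikes = dead_time_poisson_spike_times(firing_rate_hz=20.0, refractory_period_ms=4.0, duration_s=100.0)
print(spikes.size / 100.0)  # long-run rate should be close to the requested 20 Hz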

@DradeAW
Contributor

DradeAW commented Nov 20, 2023

only outstanding point of discussion is whether to rescale such that the passed firing rate to the function is the firing rate of the final spike train?

Maybe this whole conversation is just my bad for not understanding this 😅
But as a user, if I ask for a spike train with a given firing rate, I expect the output to be that firing rate?

Sorry if I seemed a bit harsh, but criticizing my math when I spent many hours / days on the question tends to trigger me ahah

@h-mayorquin
Collaborator Author

@DradeAW

But as a user, if I ask for a spike train with a given firing rate, I expect the output to be that firing rate?

I think this is a very strong point. In fact, I feel convinced by it.

The red herring was that we started discussing the statistics. You came to a PR that is called "add poisson statistics" and then told me that the ISIs should not follow the theoretically expected distribution for a Poisson process. It turns out that you have a good reason for it not to be Poisson, but I instead doubled down on "this is a Poisson distribution, why do you want me to make my statistics not Poissonian, you must be getting something wrong!". At the end I think we came to the right crux, and we should take pride in that:

I asked you:

Your proposal keeps the total spike count divided by time equal to the firing rate by sacrificing the instantaneous firing rate. Mine does the opposite and preserves the instantaneous firing rate as given, at the expense of the count. I care about the statistics being changed; why should I change my statistics to keep the count over longer averages like that?

And I think that you have a great answer for my question there which is:

But as a user, if I ask for a spike train with a given firing rate, I expect the output to be that firing rate?

I do believe that user-centered decisions trump mathematical soundness, so that argument convinces me.

And looking at it in retrospect, the discussion went to the distribution because I failed to read the following. You said:

This looks wrong, can you check that what you get is at 99.99 Hz?

I actually did not understand that you were asking me to measure the firing rate there (probably because of the preposition "at" and the omission of the words "firing rate"; writing is hard). And then I started discussing the pdf of the distribution.

But overall the discussion was good for me. I learned something new, and I think making this a Poisson distribution with dead time (as Stefan and co. call it) is better for maintenance and usability. So thanks for bearing with me; I think the library is better thanks to your efforts.

@DradeAW
Contributor

DradeAW commented Nov 22, 2023

Alright, everything turned out well ^^

Sorry I know I'm not always super great at expressing stuff (even in my native language 😅), but we both learned something!

Although I'm still curious about something: is your implementation still Poissonian?
I thought (but I might be wrong about this) that a Poisson distribution with a refractory period was not a true Poisson distribution?
Genuinely asking!

@h-mayorquin
Collaborator Author

Not strictly: the support is changed (by the shifting), but the ISI is still an exponential, and the mean and std ratio should remain constant on the new support.

That said, I am not sure that is not the case for the Poisson process with dead time. I tried to fit the exponential with a shift and that did not work, but maybe the parameters need to be changed in some other way.

I pushed a version of your suggestion now, but the tests are failing. It seems those are the new metrics. I will take a look later.

@h-mayorquin
Collaborator Author

And these are some related plots that I made after our discussion.

Discretization

This should answer @DradeAW's question about how the discretization behaves in the extreme case:
[image]

The bins accumulate at 1/sampling_rate because that is the smallest time step representable by the sorting.

How much does the method actually modify the firing rates

For myself, I was curious about how much the instantaneous firing rate is modified depending on the refractory period:
[image]
(note: the color map is cut at 10, but the values go much higher)

[image]

[image]

So, yes, the lesson is: quite a lot. For high firing rates, a small refractory period can easily double the firing rate.

# We estimate how many spikes we will have in the duration
max_frames = duration * sampling_frequency
max_binomial_p = float(np.max(binomial_p))
num_spikes_expected = ceil(max_frames * max_binomial_p)
Collaborator

Suggested change
num_spikes_expected = ceil(max_frames * max_binomial_p)
num_spikes_expected = int(np.ceil(max_frames * max_binomial_p))

Any interest in this instead and then you can remove the ceil import from math? Or did you really only want to use math.ceil?

Collaborator Author

What's the advantage of this?

Last time I checked, the math module functions are faster for scalars than the numpy functions, as they avoid the overhead. Speed won't matter that much at this scale, though.
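
For what it's worth, a throwaway micro-benchmark sketch of the scalar overhead (assuming timeit's default of one million calls per statement):

import timeit

math_time = timeit.timeit("ceil(3.7)", setup="from math import ceil")
numpy_time = timeit.timeit("np.ceil(3.7)", setup="import numpy as np")
print(f"math.ceil : {math_time:.3f} s per 1e6 calls")
print(f"numpy.ceil: {numpy_time:.3f} s per 1e6 calls")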

Collaborator

Honestly, the only advantage for this scalar is that you import one less function into code where it is used only once. But reducing imports is not necessarily a good reason, so my comment was more a question than a hard recommendation.

Contributor

Last time I checked, the math module functions are faster for scalars than the numpy functions

Last time I checked, even math.pi was faster than np.pi, which I still don't understand ahah
I agree, math for scalars is better

Collaborator Author

@zm711 I see. Yes, importing from the standard library at will is my prior until proven otherwise.

Run the following script:

import pkgutil
import timeit
import sys 

# Get a list of all standard library modules
standard_lib_modules = [module for module in pkgutil.iter_modules() if module.name in sys.stdlib_module_names]

# Dictionary to store import times
import_times = {}

for module in standard_lib_modules:
    # Measure the import time
    time = timeit.timeit(f"import {module.name}", number=1)
    import_times[module.name] = time

# Print or process the import times
for module, time in import_times.items():
    print(f"{module}: {time} seconds")

You will see that importing from the standard library is at the scale of a main memory reference:
https://brenocon.com/dean_perf.html

Collaborator

Thanks @h-mayorquin! Makes sense.

@alejoe91 alejoe91 added this to the 0.100.0 milestone Jan 9, 2024
@alejoe91 alejoe91 added the hybrid (Related to Hybrid testing) label Jan 19, 2024
@samuelgarcia
Member

Hi Ramon and Aurelien.
Thanks a lot for this PR and this discussion.
This is really an excellent discussion and piece of work.

I have to admit (with a lot of shame) that I did not have time to read it before today.
So let's merge this now.

@alejoe91 alejoe91 merged commit 581d8d1 into SpikeInterface:main Jan 22, 2024
11 checks passed
@alejoe91 alejoe91 deleted the improve_generate_sorting branch January 22, 2024 14:31