-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend whitening tests #3531
base: main
Are you sure you want to change the base?
Extend whitening tests #3531
Conversation
5e91f67
to
19105e2
Compare
19105e2
to
602e6ce
Compare
dda72da
to
8841f3d
Compare
8841f3d
to
14c83af
Compare
@@ -223,27 +198,88 @@ def compute_whitening_matrix( | |||
# type and we estimate a more reasonable eps in the case | |||
# where the data is on a scale less than 1. | |||
if eps is None: | |||
eps = 1e-8 | |||
if data.dtype.kind == "f": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the random_chunk
-> data
is always float, I removed the int16
option, let me know if this makes sense.
Out of interest, what is the purpose of this scaling? will it have much effect if it is capped in the maximum direction by 1e-16
?
0070301
to
1d84e5d
Compare
@JoeZiminski for the |
Thanks @alejoe91, there is no added value! I just didn't know |
The NoiseGeneratorRecording already has the option of passing the covariance matrix: spikeinterface/src/spikeinterface/core/generate.py Lines 1228 to 1264 in e525d85
Would not that eliminate some of your scaffolding? Is there something that you are missing there? Also, maybe the following would be useful: |
Thanks @h-mayorquin! I did look at that but I think the one thing I need which is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend, the former also I think actually, but would it be useful? If so I could extend the NoiseRecording to take Awesome thanks for that importdata skip will check it out! 🤤 |
I think that since you need to craft specific sets of traces, you'd be better off using the |
Yeah, I think @alejoe91 suggestion is the simpler one. I would vote for that unless you need considerable memory heavy traces as it would simplify your scaffolding and would not require any other changes. For the NoiseRecording You could `astype' preprocessing if you need the type of that data (the problem is that we use white noise under the hood and that' can be generated on the fly, I can give you more details on slack if you are curios). Extend NoiseRecording to accept a mean seems like a good thing though, maybe something to do but is outside of the scope of this PR. |
Great thanks both! I went for NumpyRecording in the end as suggested, I think this is ready for review. |
Hi. |
Hi @samuelgarcia this is a good point, I will think about this more and we should discuss on call. I will leave some thoughts here as notes. For sure the mean is biased by the spikes when estimating correlated noise. But, the whole procedure is biased by the spikes, as the covariance is computed over data with the spikes included. To estimate covariance unbiased by the spikes we will need to remove the spikes explicitly at the start of the procedure. Otherwise, there is no way around the fact we are whitening the traces over a covariance that includes both noise and signal (spikes), and including vs. not including the mean does not get around that. The only way would be to cut the spikes first. As an aside, I wonder whether in kilosort the first paper says the covariance is estimated from traces with spikes removed, but all later versions include the spikes in the covariance estimate. It may be that this was on purpose, so the whitening step not only removes correlated noise but also removes correlated signal i.e. forces spikes waveforms onto a subset of channels. Maybe in the end this is better for the sorter, and is an explicit aim of the whitening step. The above is slightly philosophical, I think a more convincing point is that later preprocessing steps will re-introduce noise and so the means might no longer be close to zero, as they are after filtering. For example CMR (computed at separate timepoints) and drift correction (interpolate separate time chunks) will add noise to the data across the timeseries of a single channel. So the bias introduced here may outweigh the bias introduced by the spikes. I think some benchmarking could be useful, mainly:
|
Hey @yger might you have some time to take a look at this? would be great to get your feedback |
Sorry, I'm only catching up late with this thread. Whitening is indeed a very sensitive issue, since it can improve the overall results, but at the same time easilly be illconditionned and worsen everything. This is why I introduced the possibility to use sklearn.covariance and all the estimators that are built-in to properly estimate regularized covariance matrix. The problem is that they can be fairly slow. As can bee seen here https://scikit-learn.org/1.5/auto_examples/covariance/plot_sparse_cov.html#sphx-glr-auto-examples-covariance-plot-sparse-cov-py however, most of these estimators are better than the naive one we are using, only relying on random snippets of data (with spikes). In KS (and also in spyking-circus), covariance has always been estimated on snippets with spikes. A while ago, we thought about concatenating only snippets with no peaks, assuming they were pure noise, and computing the covariance matrix on such concatenated recording, but it was rather complicated to build and not making so much difference (might depend on the recording). That beeing said, the level of benchmark back then was not the one we are expecting today, so this might be worth re-checking. Happy to discuss everything with a call whenever you want |
Further to #3505, this PR extends the whitening tests to check the values are correct for both
int16
andfloat32
, as well as checking most of the parameters. Also, some functions are factored out inwhiten.py
to help with testing.For the most part the approach is to generate some test data with known covariance, then compute whitening and check that a) the covariance matrix is recovered correctly b) the whitened data is indeed white.
Some points for discussion @yger would be great to get your thoughts:
apply_mean=False
by default, I don't think (?) there is any reason not to remove the mean prior to computing the covariance, which will reduce bias. Indeed the whitened data is not really white whenapply_mean=False
but is whenapply_mean=True
. Should this be defaultTrue
?For regularizing,
regularize_kwargs["assume_centered"] = True
is always set in the code, though the defaultapply_mean=False
as above. I added an error in this case, to force the user to switch toapply_mean=True
. Also, in case the user explicitly setsregularize_kwargs["assume_centered"] = False
an error is raised, as this will be overwritten toTrue
. However, maybe a better approach is to setregularize_kwargs["assume_centered"]
directly fromapply_mean
?At the moment sklearn_covariance provides some features that are represented as classes, and others as functions. At the moment the
regularize
kwargs can only accept the class ones and will crash for function ones. I just added a note to the docstring, but wanted to raise this in case it is important to support both.Question on
CustomRecording
@alejoe91 @samuelgarcia I wanted to make a quick recording object and set some custom data on it for testing purposes. The result is the small
CustomRecording
(top oftest_whiten,py
) which basically exposes aset_data
function, andget_traces()
just returns the data that was set directly without processing. Is this okay for you both? If so I will factor this out into a separate PR.