Add scale_to_uV preprocessing #3053
Conversation
This is great! I'm constantly getting confused about how scaling and dtype are handled, and this is very straightforward. A short how-to or tutorial exemplifying this preprocessor would be useful in future (I could add a suggestion to the documentation thread).
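As a rough illustration of what such a how-to could show (a hedged sketch; the import path and property handling follow this PR's scale_to_uV and may differ in detail):

```python
import numpy as np
from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV

# Toy recording; in practice you would start from an extractor whose
# gain_to_uV / offset_to_uV properties come from the raw file header.
recording = generate_recording(num_channels=4, durations=[1.0])
recording.set_property("gain_to_uV", np.full(4, 0.195))
recording.set_property("offset_to_uV", np.zeros(4))

recording_uV = scale_to_uV(recording)
traces_uV = recording_uV.get_traces()  # float traces, now in microvolts
```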
```python
import numpy as np
num_channels = 4  # illustrative
rng = np.random.default_rng(0)
gains = rng.random(size=num_channels).astype(np.float32)
```
Is it worth parameterising over some extreme cases (e.g. very small values, typical values, extremely large values) for gains and offsets?
Mmm, do you have some suggestions about what the purpose would be, more specifically? I agree with this philosophy (e.g. "think about what will break the test"), but nothing concrete comes to mind other than testing overflow problems. Let me think more, but if you come up with anything it would be good.
I would not like to stop the PR if we don't come up with good extreme-case criteria though : P
I think gains and offsets should really be based on reality. We could check a bunch of gains and offsets from Neo if we want to parameterize against real values. For example, I think gain=1, offset=0 comes up because some data formats don't scale. We could also test against Intan's scaling, since the format is pretty common.
I.e., to be clearer: I don't think our tests should specifically look at "wrong" values that a user could input, but rather test real inputs that our library should handle.
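For illustration, parameterizing over such real-world scalings might look like the sketch below (gain=1 / offset=0 is the unscaled case from the comment above; ~0.195 uV/bit is Intan's commonly cited ADC step, used here only as an example, and the helper names assume SpikeInterface's public API):

```python
import numpy as np
import pytest
from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV

# Example real-world scalings: unscaled formats and an Intan-like step.
@pytest.mark.parametrize("gain, offset", [(1.0, 0.0), (0.195, 0.0)])
def test_scale_to_uV_real_scalings(gain, offset):
    num_channels = 4
    recording = generate_recording(num_channels=num_channels, durations=[1.0])
    recording.set_property("gain_to_uV", np.full(num_channels, gain))
    recording.set_property("offset_to_uV", np.full(num_channels, offset))

    scaled_traces = scale_to_uV(recording).get_traces()
    expected = recording.get_traces() * gain + offset
    np.testing.assert_allclose(scaled_traces, expected, rtol=1e-5)
```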
Yes, good points; in general I am in two minds. For sure, I think it's good to test with a range of expected values. But the tests can also be viewed as a way of stress-testing the software, and putting extreme values in can reveal strange or unexpected behaviours. For example, we sometimes struggle with overflow errors from numpy dtypes that would probably be caught in tests if we were using some extreme values. So, as including more values in these short tests is basically free, we might as well, since more information is better. But I'm not strong on this.
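As a concrete illustration of the kind of numpy dtype overflow being described (a standalone sketch, not code from this PR):

```python
import numpy as np

traces = np.array([30000], dtype=np.int16)

# Multiplying within int16 silently wraps around: 60000 does not fit.
print(traces * 2)                       # -> [-5536]

# Widening the dtype first gives the expected result.
print(traces.astype(np.float32) * 2.0)  # -> [60000.]
```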
My idea was that the random gains kind of cover all the common use cases, save the stress (outlier / weird) part. For the overflow problem, I think we could come up with something, but I am leaning towards testing overflow in a more general check. I actually do feel that that one needs generalization to be useful.
Yes, agreed it's not super important to check the overflow errors in sporadic tests; it would be better centralised somewhere. I like the rand, but isn't rand bounded to [0, 1), while gains can in theory be (0, some large number]? For me it seems most efficient, rather than guessing bounds on realistic values, just to test a very small, a commonplace, and a very large number, and for sure cover all cases.
OK, to move this forward I propose the following:
- You give me the numbers you want to test.
- I increase the bounds of the floats to be very large.
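For example, widening the bounds beyond rng.random()'s [0, 1) range could look like this (a sketch; the log-uniform spread is my own illustrative choice, not code from this PR):

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels = 4  # illustrative

# Draw gains log-uniformly across twelve orders of magnitude (1e-6 to 1e6)
# instead of rng.random()'s default [0, 1) range.
gains = (10.0 ** rng.uniform(-6, 6, size=num_channels)).astype(np.float32)
```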
This sounds good; this was a suggestion that I'm not super-strong on and I wouldn't want it to be a blocker! Please feel free to proceed however you think best. Personally, I would do something like parameterize over small, medium and large numbers for gain and offset:

```python
@pytest.mark.parametrize("gains", [1e-6, 50, 1e6])
@pytest.mark.parametrize("offsets", [1e-6, 50, 1e6])
```

I'm not advocating this as necessarily the best approach, but it is the kind of habit I have fallen into to test bounds.
(To avoid repeatedly generating a recording in such a case, the recording can be generated in a session-scoped fixture.)
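A minimal sketch of such a fixture (assuming SpikeInterface's generate_recording helper; the parameters are illustrative):

```python
import pytest
from spikeinterface.core import generate_recording

@pytest.fixture(scope="session")
def recording():
    # Built once per test session and shared across parametrized cases.
    return generate_recording(num_channels=4, durations=[1.0])
```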
Thanks @h-mayorquin! Do you guys think it would make sense to just have a function to do this instead of a class? It could simply be something like the sketch below.
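(A hedged sketch of what such a function could look like, assuming the existing scale preprocessor and the standard gain_to_uV / offset_to_uV channel properties; this is illustrative, not the PR's actual code.)

```python
from spikeinterface.preprocessing import scale

def scale_to_uV(recording):
    # Read the per-channel scaling stored on the recording and apply it
    # with the existing scale preprocessor, yielding float traces in uV.
    gains = recording.get_property("gain_to_uV")
    offsets = recording.get_property("offset_to_uV")
    return scale(recording, gain=gains, offset=offsets, dtype="float32")
```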
Yeah, I mean that also makes sense to me. And way less surface area to have to test and maintain, which I like, and @h-mayorquin likes too!
Yeah, maybe that's better 🤔
I am not against this, but I think that write_binary_recording() with an option to scale to microvolts would be easier for the end user, no? In short, I am not sure I share the starting point of this.
Easier to maintain than adding parameters everywhere, and more general.
After having got stuck / confused trying to write a recording recently, I'm now +1 on adding this.
But I think that and this PR are not mutually exclusive, and I still think it's worthwhile.
I'm currently slightly on Heberto's side, in that this makes the step explicit, which puts the onus on the user to know they are doing this. I also hate when functions have 1000 arguments, especially when they start getting super mutually exclusive. So I prefer the Unix philosophy of small tools that do one thing, and stringing the tools together. (I also never write scaled binary, so I'm not the target audience.)
Can you describe more about your confusion? I think people should not be using write_binary_recording directly. But maybe I am wrong about the purpose of write_binary_recording.
Don't the same complexities appear when using

```python
recording_in_uV = scale_to_uv(recording)
some_other_function(recording_in_uV, **kwargs)
```

versus

```python
some_other_function(recording, return_scaled=True, **kwargs)
```

What is a complexity that appears in the first that does not appear in the second?
We can argue about which one is a better API (I prefer fluent APIs) and that's fine, but the argument for this is the following:
The latter point is crucial. Finally, I personally think that it is easier to reason about preprocessing when it is in units of uV: I don't have to be concerned about rounding or whether the result will mean what I think it does (does applying a PCA to the waveforms in raw units have the same statistical properties as a PCA of the waveforms in uV, for example?). The cost is more memory, but I would rather understand something first, know what to expect, and then optimize memory and run the pipeline on raw data once I am done and want to fine-tune performance. Fuck, I write a lot. Sorry.
Thanks @h-mayorquin, I'm glad you write a lot, that was super interesting! I think we are largely in agreement, and the problems are with the clarity of the existing saving API.
I will stop bogging down your PR with this discussion and we can continue elsewhere; maybe #2958 can prompt a more general documentation / refactoring of the data-saving pipeline. LGTM!
I have been convinced since #2958 that we need an IO tutorial.
How about creating a new function? Here is my suggestion: create another function, so the final code snippet could be:

```python
recording.save_to_binary(return_scale=True, **kwargs)
```

Maybe this code is convenient for the user and keeps with the philosophy discussed above.
P.S. I'm not sure if I should write in PR comments, since it seems like a developer's discussion area. If it bothers, sorry.
@JoeZiminski feel free to keep making comments, man. I discovered an error by changing the assertion to an error, and your comments about how to handle double scaling have been very useful. I don't think I would have noticed that problem as quickly as I did without you, and I think your solution is the one that will work. I am thankful for your feedback.
Thanks @jnjnnjzch, I also appreciate people commenting. Sometimes this is a space for future developers ;) I just want to clarify one thing you said:
We actually very much encourage using the spikeinterface.core module, as it has the core components. You're right that we need to move more functions to private status to make the distinction between the external and internal API clear, but the module itself is completely public!
Ok, two changes:
Awesome! LGTM!
Thanks.
The justification from:
#2958 (comment)