
Add scale_to_uV preprocessing #3053

Merged

Conversation

@h-mayorquin (Collaborator) commented Jun 20, 2024

The justification from:

#2958 (comment)

Now that I think about it, what I would like is a preprocessing step called scale_to_uV that allows me to write:

recording_in_uV = spre.scale_to_uV(recording=raw_recording)
si.write_binary_recording(recording_in_uV, file_paths=file_path, dtype=dtype, **job_kwargs)

I think this is concise while at the same time generalizing to situations beyond write_binary_recording. What do you think @jnjnnjzch, do you feel this would make the API more in line with your concept of elegance?

I actually would like your take on this @JoeZiminski, as you have been thinking about preprocessing. My thinking here is that it is better to combine simple functions and apply them flexibly depending on the situation at hand, rather than adding a new parameter to each of the core functions for every situation. The latter also has the downside of making the core functions more difficult to optimize and document. See here for another instance where I think this question arises: #2985. I recognize there might be performance limitations to this approach, but I would rather take it as a baseline.

@h-mayorquin h-mayorquin added the core Changes to core module label Jun 20, 2024
@h-mayorquin h-mayorquin self-assigned this Jun 20, 2024
@JoeZiminski (Collaborator) left a comment

This is great! I'm constantly getting confused about how scaling and dtype are handled, and this is very straightforward. A short how-to or tutorial exemplifying the use of this preprocessor would be useful in future (I could add a suggestion to the documentation thread).

Review thread on src/spikeinterface/preprocessing/tests/test_scaling.py, on the lines:

rng = np.random.default_rng(0)
gains = rng.random(size=(num_channels)).astype(np.float32)

Collaborator:
is it worth parameterising over some extreme cases (e.g. very small values, typical values, extremely large values) for gains and offsets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, do you have some suggestions of what the purpose would be, more specifically? I agree with this philosophy (e.g. "think about what will break the test"), but nothing concrete comes to mind other than testing overflow problems. Let me think more, but if you come up with anything it would be good.

I would not like to stop the PR if we don't come up with good extreme-case criteria though :P

@zm711 (Collaborator) commented Jun 24, 2024

I think gains and offsets should really be based on reality. We could check a bunch of gains and offsets from Neo if we want to parameterize against real values. For example, I think gain=1, offset=0 comes up because some data formats don't scale. We could also test against Intan's scaling, since that format is pretty common.

I.e., to be clearer: I don't think our tests should specifically look at "wrong" values that a user could input, but rather test real inputs that our library should handle.

@JoeZiminski (Collaborator) commented Jun 24, 2024

Yes, good points; in general I am in two minds. For sure I think it's good to test with a range of expected values. But the tests can also be viewed as a way of stress-testing the software, and putting extreme values in can reveal strange or unexpected behaviours. For example, we sometimes struggle with overflow errors from numpy dtypes that would probably be caught in tests if we were using some extreme values. So I think, since including more values in these short tests is basically free, we might as well, as more information is better. But I'm not strong on this.

Collaborator (Author):

My idea was that the random gains kind of cover all the common use cases, save the stress (outlier/weird) part. For the overflow problem I think we could come up with something, but I am leaning towards testing overflow in a more general check. I actually do feel that that one needs generalization to be useful.

Collaborator:

Yes, agreed it's not super important to check the overflow errors in sporadic tests; that would be better centralised somewhere. I like the rand, but isn't rand bounded to [0, 1), while gains can in theory be anywhere in (0, some large number]? For me it seems most efficient, rather than guessing bounds on realistic values, just to test a very small, a commonplace, and a very large number, and cover all the cases for sure.

Collaborator (Author):

OK, to move this forward I propose the following:

  1. You give me the numbers you want to test.
  2. I increase the bounds of the floats to be very large.

Collaborator:

This sounds good; this was a suggestion that I'm not super-strong on and I wouldn't want it to be a blocker! Please feel free to proceed however you think best. Personally, I would do something like parameterize over small, medium and large numbers for gain and offset:

@pytest.mark.parametrize("gains", [1e-6,, 50, 1e6])
@pytest.mark.parametrize("offsets", [1e-6,, 50, 1e6])

I'm not advocating this as necessarily the best approach, but it is the kind of habit I have fallen into for testing bounds.

Collaborator:

(To avoid repeatedly generating a recording in such a case, the recording can be generated in a session-scoped fixture, along the lines of the sketch below.)
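
For concreteness, a rough sketch of that combination (a session-scoped fixture plus the parametrization above), assuming generate_recording from spikeinterface.core and the scale_to_uV preprocessor from this PR; the specific gains, offsets and tolerances are only illustrative, not the final test:

import numpy as np
import pytest

from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV


@pytest.fixture(scope="session")
def base_recording():
    # Built once per test session so every parametrized case below reuses it.
    return generate_recording(num_channels=4, durations=[1.0])


@pytest.mark.parametrize("gain", [1e-6, 0.195, 1e6])  # 0.195 uV/bit is an Intan-like value
@pytest.mark.parametrize("offset", [-1e6, 0.0, 1e6])
def test_scale_to_uV_extreme_gains(base_recording, gain, offset):
    # Overwrite the scaling properties with the case under test
    # (the fixture is shared, but each case sets them before use).
    base_recording.set_channel_gains(gain)
    base_recording.set_channel_offsets(offset)

    scaled_recording = scale_to_uV(base_recording)

    traces_uV = scaled_recording.get_traces(segment_index=0)
    expected = base_recording.get_traces(segment_index=0).astype("float32") * gain + offset
    np.testing.assert_allclose(traces_uV, expected, rtol=1e-5)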

@alejoe91 (Member):

Thanks @h-mayorquin !

Do you guys think it would make sense to just have a function to do this instead of a class?

It could simply be:

def scale_to_uv(recording):
    """
    ...
    """
    assert recording.has_scaleable_traces(), "Recording must have scaleable traces"
    gain = recording.get_channel_gains()[None, :]
    offset = recording.get_channel_offsets()[None, :]
    return scale(recording, gain=gain, offset=offset, dtype="float32")

@zm711 (Collaborator) commented Jun 24, 2024

@alejoe91 ,

Yeah, I mean that also makes sense to me. And way less surface area to have to test and maintain, which I like and which @h-mayorquin likes too!

@h-mayorquin (Collaborator, Author):
> Do you guys think it would make sense to just have a function to do this instead of a class?

Yeah, maybe that's better 🤔

@samuelgarcia (Member):

I am not against this, but I think that write_binary_recording() with an option to scale to microvolts would be easier for the end user, no? In short, I am not sure I share the starting point of this.

@h-mayorquin (Collaborator, Author):

  1. This solution can be used anywhere a user needs it. If a need to scale to uV appears in a function other than write_binary, we can point them to this instead of adding another parameter to every function. Teach a man to fish instead of giving him a fish, kind of thing.
  2. return_scaled works after the segment traces are scaled; this one, like the scale preprocessor, works before. In other words, you can do your operations in natural units if that is what you need (see the sketch after this list).

It is easier to maintain than adding parameters everywhere, and it is more general.
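
Roughly, the contrast between the two workflows looks like this (bandpass_filter is only a stand-in for any preprocessing step; scale_to_uV is the function added by this PR, and the sketch assumes the filtered recording still carries the original channel gains):

from spikeinterface import preprocessing as spre

# Workflow enabled by this PR: convert to microvolts up front, then do the
# preprocessing in physical units.
recording_in_uV = spre.scale_to_uV(recording=raw_recording)
filtered_in_uV = spre.bandpass_filter(recording_in_uV, freq_min=300.0, freq_max=6000.0)

# return_scaled-style workflow: preprocess on the raw traces and only apply the
# gain/offset when the traces are finally read out.
filtered_raw = spre.bandpass_filter(raw_recording, freq_min=300.0, freq_max=6000.0)
traces_in_uV = filtered_raw.get_traces(return_scaled=True)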

@JoeZiminski (Collaborator) commented Jun 27, 2024

After having got stuck / confused trying to write a recording recently, I'm now +1 on adding return_scaled as an argument to write_binary_recording() (and any other relevant writer functions).

The reasons are:

  1. It keeps with the get_traces() convention that whenever you are 'accessing' the data, you have the option to return it scaled or not. At the moment the behaviour is ambiguous for writer functions because this pattern is broken.
  2. It is maximally convenient for the user: i) the option is very easy to see as a function argument, rather than behaviour hidden further down in a docstring; ii) applying scaling requires having to read about and understand the behaviour of a new function (which is not necessarily clear, considering the complexities of scaling and dtype).

But I think that and this PR are not mutually exclusive; I still think scale_to_uV is a great addition that should be included.

@alejoe91 alejoe91 added this to the 0.101.0 milestone Jun 27, 2024
@zm711 (Collaborator) commented Jun 27, 2024

I'm currently slightly on Heberto's side, in that this makes the step explicit, which puts the onus on the user to know what they are doing, and I also hate when functions have 1000 arguments, especially when they start getting super mutually exclusive. So I prefer the Unix philosophy of small tools that do one thing and get strung together. (I also never write scaled binary, so I'm not the target audience.)

@h-mayorquin (Collaborator, Author) commented Jun 27, 2024

@JoeZiminski

> After having got stuck / confused trying to write a recording recently, I'm now +1 on adding return_scaled as an argument to write_binary_recording()

Can you describe your confusion in more detail? I think people should not be using write_binary_recording to save their data; it loses a lot of information. I am fine with adding return_scaled in a bunch of places (like it is in the widgets, for example), but there are core functions like write_binary_recording, to_dict, etc. that I want to keep as simple as possible because they are used in way too many places. I want to keep complexity out of the core and close to the application interface for two reasons: 1) the core is simpler to optimize, document and maintain; 2) the errors make sense and are easier to debug when they surface close to the application layer.

But maybe I am wrong about the purpose of write_binary_recording and it should be user facing. The user in #2958 proposed that we could have a public write_binary_recording that is user facing with a lot of bells and whistles, and I think that is a fine idea. If the powers that be (@alejoe91 and @samuelgarcia) decide that is the way to go, I would like to have a private _write_binary_function that does not have those bells and whistles and is used in core. The function in this PR enables that workflow.

> ii) applying scaling requires having to read about and understand the behaviour of a new function (which is not necessarily clear, considering the complexities of scaling and dtype).

Don't the same complexities appear when using return_scaled? That is, these two should do the same:

recording_in_uV = scale_to_uv(recording)
some_other_function(recording_in_uV, **kwargs)
some_other_function(recording, return_scaled=True, **kwargs)

What is a complexity that appears in the first that does not appear in the second?

> it is maximally convenient for the user

We can argue about which one is a better API (I prefer fluent APIs), and that's fine, but the argument for this is the following:

  • With the kwargs approach you need to document that everywhere. This is simpler than job_kwargs, but look at how that turned out to see the difficulties. Plus, as per the discussion in [Feature request] Add return_scaled option for spikeinterface.core.write_binary_recording #2958, return_scaled is not really an intuitive parameter either.
  • If we forget to put it somewhere, then the users have no recourse until we decide to add it and enable it there.
  • Finally, the function proposed here allows you to operate in the preprocessing phase in physical units. This workflow can't be enabled with return_scaled, which only does the scaling after the whole preprocessing pipe has run.

The latter point is crucial for the following reason:
We have taken advantage of the fact that most formats use linear scaling, and thanks to the magic of linear algebra a lot of preprocessing can be done on the raw data and then scaled at the end. This workflow might have some rough corners, like the numeric acrobatics with the z-scored recording and all the rounding we do so histograms come out as we expect, but it works and is a bit cheaper memory-wise in some places. But this is an accident: some formats do not have a linear conversion between the raw and physical-units representations, and there you cannot operate in the raw space and hope your results hold. One example is the EDF format, which uses a polynomial interpolation.
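
As a toy illustration of that linear-algebra point (gain only, ignoring offsets, with a moving average standing in for any linear preprocessing step; the polynomial conversion is made up, just to mimic a non-linear format):

import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(-1000, 1000, size=(100, 4)).astype(float)
gain = np.array([0.195, 0.5, 1.0, 2.5])  # per-channel gains (illustrative values)

def moving_average(traces, width=5):
    # A stand-in for any linear preprocessing step, applied per channel.
    kernel = np.ones(width) / width
    return np.column_stack([np.convolve(traces[:, c], kernel, mode="same") for c in range(traces.shape[1])])

# With linear scaling, preprocessing the raw data and scaling afterwards gives the
# same result as scaling first and preprocessing in physical units.
assert np.allclose(moving_average(raw) * gain, moving_average(raw * gain))

# With a non-linear raw-to-physical conversion (as in EDF-like formats), the two
# orders no longer agree, so preprocessing must happen in physical units.
to_physical = lambda x: 0.1 * x + 1e-6 * x**2
assert not np.allclose(moving_average(to_physical(raw)), to_physical(moving_average(raw)))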

Finally, I personally think that it is easier to reason about preprocessing when it is in units of uV; I don't have to be concerned about rounding or whether the result means what I think it does (does applying a PCA to the waveforms in raw units have the same statistical properties as a PCA of the waveforms in uV, for example?). The cost is more memory, but I would rather understand something first, know what to expect, and then optimize memory and run the pipeline on raw data once I am done and want to fine-tune performance.

Fuck I write a lot. Sorry.

@JoeZiminski (Collaborator) commented Jun 27, 2024

Thanks @h-mayorquin, I'm glad you write a lot, that was super interesting! I think we are largely in agreement, and the problems with the clarity of write_binary_recording can be solved in other ways instead of adding return_scaled. To summarise:

  1. I definitely think this is a useful PR and I also prefer the way of working you describe. I think whatever happens with write_binary_recording, this should be merged.
  2. The problem I was trying to solve by adding return_scaled to write_binary_recording is that, at first, the data-scaling issue is very confusing, and understanding it properly requires quite a deep knowledge of how SI represents data. Often it is not feasible to properly understand this in one sitting, and it takes some familiarity with the codebase to use the scale functions. As such, having an easy-to-interpret function argument is nice, so it is explicit to newer users how their data is handled.
  3. However, I think the above can be solved by, as you say, finessing the API to make the workflow super-clear. A big problem is that recording.save() is not properly listed in the API docs; even now I am not sure of the recommended way to save data. So I would agree: make write_binary_recording private and have a super-clear, well-documented workflow for saving data.

I will stop bogging down your PR with this discussion and we can continue elsewhere; maybe #2958 can prompt a more general documentation/refactoring of the data-saving pipeline. LGTM!

@h-mayorquin (Collaborator, Author):

I have been convinced since #2958 that we need an IO tutorial.

@jnjnnjzch:

How about creating a new function in spre or sw, etc.? I mean, let the core be the core. Calling a spikeinterface.core function explicitly in the pipeline should not be encouraged anyway.

So here is my suggestion: create another function like save_to_binary (which has a naming format consistent with the existing save_to_folder) somewhere else. This function contains the return_scale or return_to_uV argument. Additionally, the documentation might need to be changed to discourage direct calls to write_binary_recording.

The final code snippet could be:

recording.save_to_binary(return_scale=True, **kwargs)

Maybe this code is convenient for the user and keeps with the get_traces() convention? @JoeZiminski
Also, in this way there is no need to change the write_binary_recording function, which keeps its purity... @h-mayorquin

P.S. I'm not sure if I should write in PR comments, since it seems like a developers' discussion area. If it bothers you, sorry.

@h-mayorquin (Collaborator, Author):

@JoeZiminski feel free to keep making comments, man. I discovered an error by changing the assertion to an error, and your comments about how to handle double scaling have been very useful. I don't think I would have noticed that problem as quickly as I did if not for you, and I think your solution is the one that will work. I am thankful for your feedback.

@jnjnnjzch
No, you are welcome; thanks for your feedback. You made us think about it. Basically, I now think we:

  1. Might need to rethink the naming of return_scaled, which makes sense when it is close to get_traces but makes less sense far away from it (like in the widgets). Maybe something like data_in_uV, but we need to think about it and definitely document the behavior better.
  2. Should add an IO tutorial.
  3. Should remember that save_to_folder uses write_to_binary and has had this TODO since forever to change the name to save_to_binary_folder:

https://github.com/h-mayorquin/spikeinterface/blob/62485d5ee5c9dbb30085b3eaad9ea07a448d2ace/src/spikeinterface/core/base.py#L848-L856

@h-mayorquin h-mayorquin mentioned this pull request Jun 27, 2024
@zm711 (Collaborator) commented Jun 27, 2024

Thanks @jnjnnjzch, I also appreciate people commenting. Sometimes this is a space for future developers ;)

I just want to clarify one thing you said:

> I mean, let the core be the core. Calling a spikeinterface.core function explicitly in the pipeline should not be encouraged anyway.

We actually very much encourage using the spikeinterface.core module, as it has the core components. You're right that we need to move more functions to private status to make the distinction between the external API and the internal one clear, but the module itself is completely public!

@h-mayorquin (Collaborator, Author):

OK, two changes:

  • I have added a test to handle the case of users calling return_scaled=True. This preprocessor is not affected by it and the result is as expected (a sketch of that property is below).
  • I changed the definition from a class to a function, as @alejoe91 suggested.
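
A sketch of the property that first bullet is checking (not the literal test added here, and it assumes the merged preprocessor resets the output gains/offsets so that an extra return_scaled=True is a no-op, as discussed above):

import numpy as np
from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV


def test_scale_to_uV_with_user_return_scaled():
    # A recording with known scaling attached (values are illustrative).
    recording = generate_recording(num_channels=4, durations=[1.0])
    recording.set_channel_gains(0.195)
    recording.set_channel_offsets(0.0)

    scaled_recording = scale_to_uV(recording)
    expected = recording.get_traces(segment_index=0, return_scaled=True)

    # The preprocessor output is already in uV ...
    np.testing.assert_allclose(scaled_recording.get_traces(segment_index=0), expected, rtol=1e-5)
    # ... and a user passing return_scaled=True on top of it should not scale twice.
    np.testing.assert_allclose(scaled_recording.get_traces(segment_index=0, return_scaled=True), expected, rtol=1e-5)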

@alejoe91 (Member) left a comment

Awesome! LGTM!

@JoeZiminski JoeZiminski self-requested a review July 1, 2024 11:51
@samuelgarcia (Member):

merci.

@samuelgarcia merged commit 5a7d890 into SpikeInterface:main on Jul 3, 2024
17 checks passed