
Add scale_to_uV preprocessing #3053

Merged

Conversation

@h-mayorquin (Collaborator) commented Jun 20, 2024

The justification from:

#2958 (comment)

Now that I think about it, what I would like is a preprocessing step called scale_to_uV that allows me to write:

recording_in_uV = spre.scale_to_uV(recording=raw_recording)
si.write_binary_recording(recording_in_uV, file_paths=file_path, dtype=dtype, **job_kwargs)

I think this is concise while at the same time generalizing to situations beyond write_binary_recording. What do you think @jnjnnjzch, do you feel this would make the API more in line with your concept of elegance?

I actually would like your take on this @JoeZiminski, as you have been thinking about preprocessing. My thinking here is that it is better to combine simple functions and apply them flexibly depending on the situation at hand, rather than adding a new parameter to each of the core functions for every situation. The latter also has the downside of making the core functions more difficult to optimize and document. See here for another instance where I think this question arises: #2985. I recognize there might be performance limitations to this approach, but I would rather take it as a baseline.

@h-mayorquin h-mayorquin added the core Changes to core module label Jun 20, 2024
@h-mayorquin h-mayorquin self-assigned this Jun 20, 2024
@JoeZiminski (Collaborator) left a comment

This is great! I'm constantly getting confused about how scaling and dtype are handled, and this is very straightforward. A short how-to or tutorial exemplifying the use of this preprocessor would be useful in future (I could add a suggestion to the documentation thread).

Review thread on src/spikeinterface/preprocessing/tests/test_scaling.py, on the lines:

rng = np.random.default_rng(0)
gains = rng.random(size=(num_channels)).astype(np.float32)

Collaborator:
is it worth parameterising over some extreme cases (e.g. very small values, typical values, extremely large values) for gains and offsets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, do you have some suggestions of what the purpose would be, more specifically? I agree with this philosophy (e.g. "think about what will break the test"), but nothing concrete comes to mind other than testing overflow problems. Let me think more, but if you come up with anything it would be good.

I would not like to stop the PR if we don't come up with good extreme-case criteria though :P

@zm711 (Collaborator) commented Jun 24, 2024

I think gains and offsets should really be based on reality. We could check a bunch of gains and offsets from Neo if we want to parameterize against real values. For example, I think gain=1, offset=0 comes up because some data formats don't scale. We could also test against Intan's scaling, since that format is pretty common.

I.e., to be clearer: I don't think our tests should specifically look at "wrong" values that a user could input, but rather test real inputs that our library should handle.

@JoeZiminski (Collaborator) commented Jun 24, 2024

Yes, good points; in general I am in two minds. For sure I think it's good to test with a range of expected values. But the tests can also be viewed as a way of stress-testing the software, and putting extreme values in can reveal strange or unexpected behaviours. For example, we sometimes struggle with overflow errors from numpy dtypes that would probably be caught in tests if we were using some extreme values. So I think, since including more values in these short tests is basically free, we might as well, as more information is better. But I'm not strong on this.

Collaborator (Author):

My idea was that the random gains kind of cover all the common use cases, save the stress (outlier/weird) part. For the overflow problem I think we could come up with something, but I am leaning towards testing overflow in a more general check. I actually do feel that that one needs generalization to be useful.

Collaborator:

Yes, agreed it's not super important to check the overflow errors in sporadic tests; that would be better centralised somewhere. I like the rand, but isn't rand bounded to [0, 1), while gains can in theory be anywhere in (0, some large number]? For me it seems most efficient, rather than guessing bounds on realistic values, just to test a very small, a commonplace, and a very large number, and cover all the cases for sure.

Collaborator (Author):

OK, to move this forward I propose the following:

  1. You give me the numbers you want to test.
  2. I increase the bounds of the floats to be very large.

Collaborator:

This sounds good; this was a suggestion that I'm not super-strong on and I wouldn't want it to be a blocker! Please feel free to proceed however you think best. Personally, I would do something like parameterize over small, medium and large numbers for gain and offset:

@pytest.mark.parametrize("gains", [1e-6,, 50, 1e6])
@pytest.mark.parametrize("offsets", [1e-6,, 50, 1e6])

I'm not advocating this as necessarily the best approach, but it is the kind of habit I have fallen into for testing bounds.

Collaborator:

(To avoid repeatedly generating a recording in such a case, the recording can be generated in a session-scoped fixture, along the lines of the sketch below.)
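
For concreteness, a rough sketch of that combination (a session-scoped fixture plus the parametrization above), assuming generate_recording from spikeinterface.core and the scale_to_uV preprocessor from this PR; the specific gains, offsets and tolerances are only illustrative, not the final test:

import numpy as np
import pytest

from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV


@pytest.fixture(scope="session")
def base_recording():
    # Built once per test session so every parametrized case below reuses it.
    return generate_recording(num_channels=4, durations=[1.0])


@pytest.mark.parametrize("gain", [1e-6, 0.195, 1e6])  # 0.195 uV/bit is an Intan-like value
@pytest.mark.parametrize("offset", [-1e6, 0.0, 1e6])
def test_scale_to_uV_extreme_gains(base_recording, gain, offset):
    # Overwrite the scaling properties with the case under test
    # (the fixture is shared, but each case sets them before use).
    base_recording.set_channel_gains(gain)
    base_recording.set_channel_offsets(offset)

    scaled_recording = scale_to_uV(base_recording)

    traces_uV = scaled_recording.get_traces(segment_index=0)
    expected = base_recording.get_traces(segment_index=0).astype("float32") * gain + offset
    np.testing.assert_allclose(traces_uV, expected, rtol=1e-5)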

@alejoe91 (Member):

Thanks @h-mayorquin !

Do you guys think it would make sense to just have a function to do this instead of a class?

It could simply be:

def scale_to_uv(recording):
    """
    ...
    """
    assert recording.has_scaleable_traces(), "Recording must have scaleable traces"
    gain = recording.get_channel_gains()[None, :]
    offset = recording.get_channel_offsets()[None, :]
    return scale(recording, gain=gain, offset=offset, dtype="float32")

@zm711 (Collaborator) commented Jun 24, 2024

@alejoe91 ,

Yeah, I mean that also makes sense to me. And way less surface area to have to test and maintain, which I like and which @h-mayorquin likes too!

@h-mayorquin (Collaborator, Author):
> Do you guys think it would make sense to just have a function to do this instead of a class?

Yeah, maybe that's better 🤔

@samuelgarcia (Member):

I am not against this, but I think that write_binary_recording() with an option to scale to microvolts would be easier for the end user, no? In short, I am not sure I share the starting point of this.

@h-mayorquin (Collaborator, Author):

  1. This solution can be used anywhere a user needs it. If a need to scale to uV appears in a function other than write_binary, we can point them to this instead of adding another parameter to every function. Teach a man to fish instead of giving him a fish, kind of thing.
  2. return_scaled works after the segment traces are scaled; this one, like the scale preprocessor, works before. In other words, you can do your operations in natural units if that is what you need (see the sketch after this list).

It is easier to maintain than adding parameters everywhere, and it is more general.
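
Roughly, the contrast between the two workflows looks like this (bandpass_filter is only a stand-in for any preprocessing step; scale_to_uV is the function added by this PR, and the sketch assumes the filtered recording still carries the original channel gains):

from spikeinterface import preprocessing as spre

# Workflow enabled by this PR: convert to microvolts up front, then do the
# preprocessing in physical units.
recording_in_uV = spre.scale_to_uV(recording=raw_recording)
filtered_in_uV = spre.bandpass_filter(recording_in_uV, freq_min=300.0, freq_max=6000.0)

# return_scaled-style workflow: preprocess on the raw traces and only apply the
# gain/offset when the traces are finally read out.
filtered_raw = spre.bandpass_filter(raw_recording, freq_min=300.0, freq_max=6000.0)
traces_in_uV = filtered_raw.get_traces(return_scaled=True)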

@JoeZiminski (Collaborator) commented Jun 27, 2024

After having got stuck / confused trying to write a recording recently, I'm now +1 on adding return_scaled as an argument to write_binary_recording() (and any other relevant writer functions).

The reasons are:

  1. It keeps with the get_traces() convention that whenever you are 'accessing' the data, you have the option to return it scaled or not. At the moment the behaviour is ambiguous for writer functions because this pattern is broken.
  2. It is maximally convenient for the user: i) the option is very easy to see as a function argument, rather than behaviour hidden further down in a docstring; ii) applying scaling requires having to read about and understand the behaviour of a new function (which is not necessarily clear, considering the complexities of scaling and dtype).

But I think that and this PR are not mutually exclusive; I still think scale_to_uV is a great addition that should be included.

@alejoe91 alejoe91 added this to the 0.101.0 milestone Jun 27, 2024
@zm711 (Collaborator) commented Jun 27, 2024

I'm currently slightly on Heberto's side, in that this makes the step explicit, which puts the onus on the user to know what they are doing, and I also hate when functions have 1000 arguments, especially when they start getting super mutually exclusive. So I prefer the Unix philosophy of small tools that do one thing and get strung together. (I also never write scaled binary, so I'm not the target audience.)

@h-mayorquin (Collaborator, Author) commented Jun 27, 2024

@JoeZiminski

> After having got stuck / confused trying to write a recording recently, I'm now +1 on adding return_scaled as an argument to write_binary_recording()

Can you describe your confusion in more detail? I think people should not be using write_binary_recording to save their data; it loses a lot of information. I am fine with adding return_scaled in a bunch of places (like it is in the widgets, for example), but there are core functions like write_binary_recording, to_dict, etc. that I want to keep as simple as possible because they are used in way too many places. I want to keep complexity out of the core and close to the application interface for two reasons: 1) the core is simpler to optimize, document and maintain; 2) the errors make sense and are easier to debug when they surface close to the application layer.

But maybe I am wrong about the purpose of write_binary_recording and it should be user facing. The user in #2958 proposed that we could have a public write_binary_recording that is user facing with a lot of bells and whistles, and I think that is a fine idea. If the powers that be (@alejoe91 and @samuelgarcia) decide that is the way to go, I would like to have a private _write_binary_function that does not have those bells and whistles and is used in core. The function in this PR enables that workflow.

> ii) applying scaling requires having to read about and understand the behaviour of a new function (which is not necessarily clear, considering the complexities of scaling and dtype).

Don't the same complexities appear when using return_scaled? That is, these two should do the same:

recording_in_uV = scale_to_uv(recording)
some_other_function(recording_in_uV, **kwargs)
some_other_function(recording, return_scaled=True, **kwargs)

What is a complexity that appears in the first that does not appear in the second?

> it is maximally convenient for the user

We can argue about which one is a better API (I prefer fluent APIs), and that's fine, but the argument for this is the following:

  • With the kwargs approach you need to document that everywhere. This is simpler than job_kwargs, but look at how that turned out to see the difficulties. Plus, as per the discussion in [Feature request] Add return_scaled option for spikeinterface.core.write_binary_recording #2958, return_scaled is not really an intuitive parameter either.
  • If we forget to put it somewhere, then the users have no recourse until we decide to add it and enable it there.
  • Finally, the function proposed here allows you to operate in the preprocessing phase in physical units. This workflow can't be enabled with return_scaled, which only does the scaling after the whole preprocessing pipe has run.

The latter point is crucial for the following reason:
We have taken advantage of the fact that most formats use linear scaling, and thanks to the magic of linear algebra a lot of preprocessing can be done on the raw data and then scaled at the end. This workflow might have some rough corners, like the numeric acrobatics with the z-scored recording and all the rounding we do so histograms come out as we expect, but it works and is a bit cheaper memory-wise in some places. But this is an accident: some formats do not have a linear conversion between the raw and physical-units representations, and there you cannot operate in the raw space and hope your results hold. One example is the EDF format, which uses a polynomial interpolation.
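
As a toy illustration of that linear-algebra point (gain only, ignoring offsets, with a moving average standing in for any linear preprocessing step; the polynomial conversion is made up, just to mimic a non-linear format):

import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(-1000, 1000, size=(100, 4)).astype(float)
gain = np.array([0.195, 0.5, 1.0, 2.5])  # per-channel gains (illustrative values)

def moving_average(traces, width=5):
    # A stand-in for any linear preprocessing step, applied per channel.
    kernel = np.ones(width) / width
    return np.column_stack([np.convolve(traces[:, c], kernel, mode="same") for c in range(traces.shape[1])])

# With linear scaling, preprocessing the raw data and scaling afterwards gives the
# same result as scaling first and preprocessing in physical units.
assert np.allclose(moving_average(raw) * gain, moving_average(raw * gain))

# With a non-linear raw-to-physical conversion (as in EDF-like formats), the two
# orders no longer agree, so preprocessing must happen in physical units.
to_physical = lambda x: 0.1 * x + 1e-6 * x**2
assert not np.allclose(moving_average(to_physical(raw)), to_physical(moving_average(raw)))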

Finally, I personally think that it is easier to reason about preprocessing when it is in units of uV; I don't have to be concerned about rounding or whether the result means what I think it does (does applying a PCA to the waveforms in raw units have the same statistical properties as a PCA of the waveforms in uV, for example?). The cost is more memory, but I would rather understand something first, know what to expect, and then optimize memory and run the pipeline on raw data once I am done and want to fine-tune performance.

Fuck I write a lot. Sorry.

@JoeZiminski (Collaborator) commented Jun 27, 2024

Thanks @h-mayorquin, I'm glad you write a lot, that was super interesting! I think we are largely in agreement, and the problems with the clarity of write_binary_recording can be solved in other ways instead of adding return_scaled. To summarise:

  1. I definitely think this is a useful PR and I also prefer the way of working you describe. I think whatever happens with write_binary_recording, this should be merged.
  2. The problem I was trying to solve by adding return_scaled to write_binary_recording is that, at first, the data-scaling issue is very confusing, and understanding it properly requires quite a deep knowledge of how SI represents data. Often it is not feasible to properly understand this in one sitting, and it takes some familiarity with the codebase to use the scale functions. As such, having an easy-to-interpret function argument is nice, so it is explicit to newer users how their data is handled.
  3. However, I think the above can be solved by, as you say, finessing the API to make the workflow super-clear. A big problem is that recording.save() is not properly listed in the API docs; even now I am not sure of the recommended way to save data. So I would agree: make write_binary_recording private and have a super-clear, well-documented workflow for saving data.

I will stop bogging down your PR with this discussion and we can continue elsewhere; maybe #2958 can prompt a more general documentation/refactoring of the data-saving pipeline. LGTM!

@h-mayorquin (Collaborator, Author):

I have been convinced since #2958 that we need an IO tutorial.

@jnjnnjzch:

How about creating a new function in spre or sw, etc.? I mean, let the core be the core. Calling a spikeinterface.core function explicitly in the pipeline should not be encouraged anyway.

So here is my suggestion: create another function like save_to_binary (which has a naming format consistent with the existing save_to_folder) somewhere else. This function contains the return_scale or return_to_uV argument. Additionally, the documentation might need to be changed to discourage direct calls to write_binary_recording.

The final code snippet could be:

recording.save_to_binary(return_scale=True, **kwargs)

Maybe this code is convenient for the user and keeps with the get_traces() convention? @JoeZiminski
Also, in this way there is no need to change the write_binary_recording function, which keeps its purity... @h-mayorquin

P.S. I'm not sure if I should write in PR comments, since it seems like a developers' discussion area. If it bothers you, sorry.

@h-mayorquin (Collaborator, Author):

@JoeZiminski feel free to keep making comments, man. I discovered an error by changing the assertion to an error, and your comments about how to handle double scaling have been very useful. I don't think I would have noticed that problem as quickly as I did if not for you, and I think your solution is the one that will work. I am thankful for your feedback.

@jnjnnjzch
No, you are welcome; thanks for your feedback. You made us think about it. Basically, I now think we:

  1. Might need to rethink the naming of return_scaled, which makes sense when it is close to get_traces but makes less sense far away from it (like in the widgets). Maybe something like data_in_uV, but we need to think about it and definitely document the behavior better.
  2. Should add an IO tutorial.
  3. Should remember that save_to_folder uses write_to_binary and has had this TODO since forever to change the name to save_to_binary_folder:

https://github.com/h-mayorquin/spikeinterface/blob/62485d5ee5c9dbb30085b3eaad9ea07a448d2ace/src/spikeinterface/core/base.py#L848-L856

@h-mayorquin h-mayorquin mentioned this pull request Jun 27, 2024
@zm711 (Collaborator) commented Jun 27, 2024

Thanks @jnjnnjzch, I also appreciate people commenting. Sometimes this is a space for future developers ;)

I just want to clarify one thing you said:

> I mean, let the core be the core. Calling a spikeinterface.core function explicitly in the pipeline should not be encouraged anyway.

We actually very much encourage using the spikeinterface.core module, as it has the core components. You're right that we need to move more functions to private status to make the distinction between the external API and the internal one clear, but the module itself is completely public!

@h-mayorquin (Collaborator, Author):

OK, two changes:

  • I have added a test to handle the case of users calling return_scaled=True. This preprocessor is not affected by it and the result is as expected (a sketch of that property is below).
  • I changed the definition from a class to a function, as @alejoe91 suggested.
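
A sketch of the property that first bullet is checking (not the literal test added here, and it assumes the merged preprocessor resets the output gains/offsets so that an extra return_scaled=True is a no-op, as discussed above):

import numpy as np
from spikeinterface.core import generate_recording
from spikeinterface.preprocessing import scale_to_uV


def test_scale_to_uV_with_user_return_scaled():
    # A recording with known scaling attached (values are illustrative).
    recording = generate_recording(num_channels=4, durations=[1.0])
    recording.set_channel_gains(0.195)
    recording.set_channel_offsets(0.0)

    scaled_recording = scale_to_uV(recording)
    expected = recording.get_traces(segment_index=0, return_scaled=True)

    # The preprocessor output is already in uV ...
    np.testing.assert_allclose(scaled_recording.get_traces(segment_index=0), expected, rtol=1e-5)
    # ... and a user passing return_scaled=True on top of it should not scale twice.
    np.testing.assert_allclose(scaled_recording.get_traces(segment_index=0, return_scaled=True), expected, rtol=1e-5)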

@alejoe91 (Member) left a comment

Awesome! LGTM!

@JoeZiminski JoeZiminski self-requested a review July 1, 2024 11:51
@samuelgarcia (Member):

merci.

@samuelgarcia merged commit 5a7d890 into SpikeInterface:main on Jul 3, 2024
17 checks passed