-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add new ERP measures #12434
base: main
Are you sure you want to change the base?
[ENH] Add new ERP measures #12434
Conversation
@larsoner @drammock I went ahead and created a pull request. There's nothing to review yet. Do you think Also, I saw another issue related to Type Hints, do you think these are still useful even though these are 'global' functions not called from |
Hi all, I also wanted to ask if the edit: I guess it would be nice to not have to use other operations to restrict the time interval. So in that case maybe I should just remove (or update) the tutorial that mentions using |
In MNE-Python sample rates are always uniform
Yeah if it's as easy as |
trapz should be plenty fast, I think
Currently the plan is to add type hints to whole submodules at once (e.g., do it for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is draft mode still, but I wanted to ask why all the other files within stats
are at the root level (stats/_adjacency.py
, stats/cluster_level.py
, etc), but erp
is going within a subfolder (stats/erp/_erp.py
)? Is there a particular reason you're choosing to organize it that way?
Hey @drammock, thanks for helping me with this!
I thought that's how @larsoner mentioned it should be structured: |
ah OK. He probably had a good reason :) |
@drammock @larsoner Is it better to match the functionality of For that reason, I am leaning more towards all the new functions just returning the new measures for all channels even |
I am not familiar with what people actually do for ERPs -- for example averaging over a subset of channels then computing area, computing it for a single channel, computing it for multiple channels, etc. It's a reasonable starting point to match
or similar, too. |
same here; I've read some papers (but not recently) and don't have a complete grasp of the ERP literature. I tend to rank the goal state like this:
Does knowing that resolve any questions in your mind @withmywoessner? If not, we could ping a couple other folks to weigh in on what should be included in "super common" vs "less common" |
@drammock @larsoner Thank you for the responses. I also think it's important to match existing functionality to make it less confusing, but my main question was, if we are going in the direction of returning a single value for all channels, what single value should I return for |
Sorry if I've missed the discussion/resolution of this question, but are we leaning in the direction of returning a single value across all channels? I know that's what From an API perspective, it's easy enough to have a switch like A further question is that if we do offer to aggregate, does that mean "combine channels first, then compute the metric" or does it mean "compute metric for each channel, and then combine the per-channel metrics"? If the latter makes more sense, we could at least consider not building in the aggregation (I think we can return a numpy array or pandas series and expect that people can use its
When dealing with latencies, yeah, returning the smallest could be one sensible way to aggregate across channels. So there are further questions around how each metric should be aggregated (is there one obviously right choice for each metric, or do we need to let the users pick? etc). But the question "should we aggregate at all" needs to be decided first. |
Thanks for the in-depth response @drammock! I am also confused on what approach is best. I did some looking and in this tutorial by Steven J. Luck who created ERPLAB. He mentions that it is common to average across channels (link):
However, talking to a member of my lab he stated that they prefer to find the peaks/areas of each channel then do a correction for multiple comparisons if they end up doing some statistical analysis.
Personally, I am leaning more towards returning all the channels even though that does not match the functuality of
I think it would be good to include an option to aggregate across channels before computing measures from what I have read. If people are confused on how to do the later method of averaging, I could always write a tutorial later even if it is quite simple. Also, I know this may be a bad idea, but would it be horrible if we made another |
If Steve Luck recommends (at least in some cases) to aggregate across channels before computing these metrics, then we should make it easy to do that. But I do lean toward making it optional via
horrible because of name clash? I.e., is it acceptable to have both |
Yes that's what I was referencing. @drammock What if we have a parameter So for instance |
I'm struggling to keep track of what all the desired computations are, which makes it hard to weigh on on whether the
Side note: remember that if it is best/easiest from a UX perspective for the user to have 10 distinct functions with clear, descriptive function names and few or no parameters to supply, then we can build that as the public API --- even if under the hood they all just wrap to a (more complicated) private function (if that's the most efficient / least redundant way to actually do the computations). |
Sure, no problem @drammock! Let me know if this makes it more clear.
Areas Amps and areas
For amps and areas For latencies return_offset -> return onset as well as the 'offset' - the first time after the peak that the ERP drops to X% as large as the peak. |
fractional_peak will be renamed to fractional_amp |
Hey @drammock! Just pinging here to see if you have a chance to look over my suggestions because I saw you edited one of my posts. Let me know if I am pinging too much. Thank you! |
Edited other post because it cross-ref'd the wrong issue --- I look quickly at (nearly) every new issue/PR as it comes in if I can, just to mentally triage its urgency/difficulty and chime in as appropriate. Rest assured that this one is still on my radar, but there's a lot going on right now and I haven't had to wrap my head around the (long!) list of supported functionality. One thing slowing me down is that I asked for / expected a list of prose descriptions of the functionality you want to support (as a way of helping figure out what the API should look like). What you wrote is a mock-up of the API call for each case. |
Okay, I understand, thanks for letting me know!
I placed prose descriptions of every keyword and value that keyword can take at the bottom, I thought that would be easier to see everything laid out in the most succinct format instead of just having a prose description of each so that way if I miss something it is easier to see. If I did a prose description there would be 12 possible combinations for some of them and I thought that would be hard to parse. EDIT: I'Il add a prose description below each. |
This comment was marked as outdated.
This comment was marked as outdated.
Yes, I saw that. It requires scrolling/scanning back and forth between the API descriptions and the "key".
Could you do what you need to do locally to make sure you've included all possibilities, rather than imposing that structure here?
I appreciate the "trying to be helpful" aspect of this choice. But so there is no further confusion: it would be easier for me to see descriptions in prose when I'm trying to figure out what an API should look like. By describing what you want to do already in the form of an API makes that task harder for me, because I have to first "un-api" your proposal in order to imagine / visualize what other possible APIs might look like. Feel free to use nested bullets if that helps, or comments like "each of the next 5 bullets also should have a way to specify a restricted time window over which to operate" or whatever --- in other words you don't necessarily need to spell out every detail on every line, if it helps clarity. |
This comment was marked as outdated.
This comment was marked as outdated.
OK @withmywoessner sorry for the delay. Here is my take on what the API should be like:
WDYT about that proposal? |
Thanks!
This one matches evoked.get_peak() which has a boolean
I personally like
Do you think accounting for something like this would ever be useful. Maybe I should look for papers of people needing this. |
That sounds good. I don't want us to spend extra effort coding/reviewing/testing/maintaining this code path if nobody is going to use it.
that's a good reason (internal consistency) to stick with
see above, I'm not convinced that the
we can't use |
I browsed a couple papers and most just mentioned using ERPLAB. I believe this is not something we need to support. I thought about emailing Steven Luck, but if he doesn't get back to me. I think I'll just go ahead and proceed with supporting this:
and
|
ok so in summary I think the proposal on the table is: def get_erp_peak(
tmin,
tmax,
picks,
mode, # "pos" | "neg" | "abs"
average, # bool, whether to compute per-channel or avg across picks
strict, # bool, whether to error if e.g. mode is "pos" and signal is all neg
output, # "pandas" | "dict" (DF with ch_name as index, or dict of namedtuples)
):
pass
def get_erp_area(
tmin,
tmax,
picks,
mode, # "pos" | "neg" | "abs" | "intg"
average, # bool, whether to compute per-channel or avg across picks
output, # "pandas" | "dict" (DF with ch_name as index, or dict of namedtuples)
):
pass
def get_erp_frac_peak_latency(
frac,
which, # "onset" | "offset"
tmin,
tmax,
picks,
mode, # "pos" | "neg" | "abs"
average, # bool, whether to compute per-channel or avg across picks
strict, # bool, whether to error if e.g. mode is "pos" and signal is all neg
output, # "pandas" | "dict" (DF with ch_name as index, or dict of namedtuples)
):
pass
def get_erp_frac_area_latency(
frac,
which, # "onset" | "offset"
tmin,
tmax,
picks,
mode, # "pos" | "neg" | "abs" | "intg"
average, # bool, whether to compute per-channel or avg across picks
output, # "pandas" | "dict" (DF with ch_name as index, or dict of namedtuples)
):
pass The only bit I'm still hesitating over is
@withmywoessner WDYT about how to handle @larsoner @cbrnr this might be a good time for you to weigh in too, now that the proposal is (almost) complete |
My one major comment is that going with function names
To simplify the |
I forget, did we conclusively decide
Tuple seems like a good choice because people might not be as familar with the syntax of a namedtuple.
What is the behavior of get_area(..., average=True, pandas=True)? Does it return a scalar or pandas data structure?
Can we also return amplitude for this set of parameters? |
Sounds good. If we go this route maybe we can get rid of the
|
|
@larsoner is suggesting we don't need to be so cautious with requiring pandas, so let's scrap the whole
I meant to say (not just imply) before that
If that seems desirable I don't object. it's information that could be derived from the value of Side note: your mock-up output tables all still show channel name in a column; I've said a couple of times I think it's preferable to put channel name as the Series/DataFrame Index. Are you not on board with that? (if not, why not?) |
👍🏻 |
I am! Sorry forget to change it to include that. |
I don't have any detailed feedback due to limited time, but from browsing the discussion I'd like to mention the following four points:
|
I'm not sure if this was discussed in the issue or PR, but mean amplitude can be extracted by
I think @larsoner implied that this would be best to add as a separate PR to make it less onerous to review. But I plan on it! |
I'll acknowledge that this could be interpreted to go against the "each variable forms a column" maxim, but it is aligned with the more general principle of "structure datasets to facilitate analysis". If channel names are in the index, the resulting dataframe will be all numeric, so users can get separate stats for each channel and very easily do things like >>> import pandas as pd
>>> foo = pd.DataFrame(dict(foo=range(5), bar=list('abcde')), index=range(5))
>>> foo.mean()
TypeError: Could not convert ['abcde'] to numeric So to me it's not clear that there's one "right" way to do it here. My proposal seems like it will save users a bit of extra effort in what is likely a very common use case... but ultimately it's a single operation to switch back and forth ( |
@drammock if it doesn't make a huge difference then let's please use a regular column. All other data analysis packages (including polars) do not use indexes, because operations are generally much more readable without an index (see also https://docs.pola.rs/user-guide/migration/pandas/#polars-does-not-have-a-multi-indexindex). I appreciate the use case you mentioned, and in this case it is nice to have, but for me still not worth the downsides of an index. |
(And the added benefit of having an integer sequence as an index is that we automatically have channel numbers, which are also nice to have.) |
what about pandas?! :) But sure, as I said before I don't think it makes a huge difference, so since you feel strongly about it we can go with no index. |
I think this is the last thing that needs to be addressed @drammock. Just a thought, but could we also have |
eh, let's be consistent and return a Series there too. If I had to guess, the problems from expecting a Pandas object and not getting one will be more annoying than the problems from having a length-one Series when you expect a scalar.
that feels like a slippery slope (why not |
Okay! Sounds good! I think I will begin implementing everything if there is nothing else. Thanks everyone for the help! |
Another thing @drammock, do you know what the purpose of |
The behavior of because we are now passing multiple channels of which a few may not contain positive or negative values it may be annoying to raise an error. That is why I am thinking it is best to return Is this fine with everyone? |
I disagree. We cannot expect our users to be familiar with pandas, so they will expect a single number for the average area and not a single-element pandas series. I don't think it's a good idea to be consistent with respect to the return type, I'd rather be consistent with users' expectations. |
I don't know if I agree with this. I find it very unlikely that users would be doing erp analysis that only required average area so if they're going to be exposed to pandas anyway we might as well keep it consistent. I may have said it previously, but when one is examining an erp component it is often the case that you are interested in finding several areas per erp (200-300 ms, 450-550 ms...) and pandas operations seem easier to work with when you have several subjects each with several areas and would like to combine their data in some way. |
OK, let me put it another way. If the function returns a scalar via |
if that is true, they won't know how to handle the output of any of the other functions (or the output of
How do we know what those expectations are? You're citing your own introspection as evidence of what users in general will expect. I don't think that's valid. If the docstring says "returns a pd.Series if average=True and a pd.Dataframe if average=False" then any user who reads the docs will have the right expectations. We can also easily add a note saying "in the case of average=True, if you just want the scalar, do import pandas as pd
area_scalar = 5.1
area_series = pd.Series([5.1], index=["area"])
peak_series = pd.Series([3.0, 0.98], index=["amplitude", "latency"])
2 * area_series + 1 # arithmetic works
pd.concat([area_series, peak_series]) # combining with output from other funcs works
pd.concat([area_scalar, peak_series]) # TypeError:
# cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid ...so I think many users won't even need to call |
Reference issue
Adds features from #7848
What does this implement/fix?
Mean amplitudeEDIT: Also adds function to find erp peak.