Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Define channel column for events and Delimiter field for column descriptions #1483

Merged
merged 7 commits into from
Nov 13, 2023

Conversation

effigies
Copy link
Collaborator

@effigies effigies commented May 2, 2023

Closes #1446.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/schema/objects/columns.yaml Outdated Show resolved Hide resolved
@effigies
Copy link
Collaborator Author

effigies commented May 2, 2023

Thanks! Perhaps we could extend the example here? --> https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files

I would be up for an added example in https://bids-specification.readthedocs.io/en/latest/modality-specific-files/task-events.html, but I think it would be cluttering in common principles. @christinerogers Would you be able to put together a small example of how channel would be used in EEG tasks?

@effigies effigies force-pushed the enh/array_data branch 2 times, most recently from 52e80eb to 3c6b978 Compare May 2, 2023 14:05
@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8dc5221) 87.83% compared to head (d1056c1) 87.83%.
Report is 13 commits behind head on master.

❗ Current head d1056c1 differs from pull request most recent head 2a12171. Consider uploading reports for the commit 2a12171 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1483   +/-   ##
=======================================
  Coverage   87.83%   87.83%           
=======================================
  Files          16       16           
  Lines        1356     1356           
=======================================
  Hits         1191     1191           
  Misses        165      165           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@christinerogers
Copy link
Contributor

Thanks! Perhaps we could extend the example here? --> https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files

I would be up for an added example in https://bids-specification.readthedocs.io/en/latest/modality-specific-files/task-events.html, but I think it would be cluttering in common principles. @christinerogers Would you be able to put together a small example of how channel would be used in EEG tasks?

Sure @effigies - @Andesha and i will get back to you on that

Also, point taken about cluttering common principles, though how about a simple "See also" link, @effigies ? Just so the pointer is there to help the next person embedding a data structure to find supported formats. Would help avoid divergence across modalities and major headaches for the bids-validator and other software platforms at the end of the day.
Or what's a better place contributors will come across it like BEP maintainers guide?

@dorahermes
Copy link
Member

If this is still awaiting an example, here is a link to an HED-SCORE example undergoing review of how channel would be used in an EEG dataset in the BIDS examples repository.

@effigies
Copy link
Collaborator Author

effigies commented Jun 1, 2023

Thanks for that. It looks like that has both comma separation and dashes with unclear meaning. Would this be considered valid under the current proposal and everybody agrees FP1-F7,F7-T3 would parse as ["FP1-F7", "F7-T3"] but it's up to tools to understand what "FP1-F7" means? Or would there be an expectation that the validator would check that all listed channels are in channels.tsv, so these would fail?

What tools are generating and consuming these columns at this point?

@dorahermes
Copy link
Member

Thanks for that. It looks like that has both comma separation and dashes with unclear meaning. Would this be considered valid under the current proposal and everybody agrees FP1-F7,F7-T3 would parse as ["FP1-F7", "F7-T3"] but it's up to tools to understand what "FP1-F7" means? Or would there be an expectation that the validator would check that all listed channels are in channels.tsv, so these would fail?

Thanks for noting. This is actually something that should be corrected in this example. This example is an EEG set with a bipolar montage, so each channel is a difference between 2 channels, hence the channel name is typically F7-T3. However, in this case the corresponding channels.tsv file correctly names the channel F7 and the reference T3. I think the channel columns should be at least recommended to match the channel names. In the example I made comments to correct this, such that the channel column in the events file matches the channel name.

What tools are generating and consuming these columns at this point?

The artifact and seizure examples were translated to BIDS from the TUH EEG corpus . This is a large dataset where EEG data were annotated by experts. I use custom code to process the channel column. Perhaps @christinerogers can comment on other tools that are generating and consuming these.

@sappelhoff
Copy link
Member

I think the channel columns should be at least recommended to match the channel names.

agreed, I would make this REQUIRED, because otherwise it may lead to confusion (i.e., when documenting that an event is related to channels that are nowhere to be found in the data).

@effigies
Copy link
Collaborator Author

effigies commented Jun 2, 2023

That sounds pretty painful to validate, especially trying to write as a declarative rule instead of custom code. We need to take each entry in a channel column, check to see if there's a specified delimiter (falling back to comma) in the sidecar.channel object, split by that delimiter, collect all the resulting channels, and verify they exist in the associated channels.tsv.

@Andesha
Copy link
Collaborator

Andesha commented Jun 15, 2023

@effigies My new draft PR on bids-examples here is the first step in contributing a dataset to your point. There should be more commits soon with the channel lists.

I believe @VisLab is planning commits to bids-examples to cover some more cases, too.

cc @christinerogers

@sappelhoff sappelhoff self-assigned this Jun 23, 2023
@sappelhoff
Copy link
Member

I will add an example within the next days/weeks and merge this. We have agreed on this at the BIDS derivatives meeting Copenhagen 2023.

@sappelhoff
Copy link
Member

I finally got around to adding an example. please feel free to review:

old new
image image

@christinerogers
Copy link
Contributor

christinerogers commented Nov 2, 2023

looping @dorahermes and @Andesha in here --- Dora does this work for you as well?

@effigies
Copy link
Collaborator Author

effigies commented Nov 2, 2023

This is adding "Delimiter" as a valid sidecar field for any column in any TSV. What is the obligation of the validator for a column that declares a "Delimiter"?

If I have channels that include F,1 and I specify a channel column with no delimiter because I do not use lists, I think the proposed behavior above would cause us to parse this as two channels (F and 1) which do not exist in channels.tsv.

I think at best this should be a warning, because otherwise the complexity of writing a validation rule in schema that covers all edge cases is going to be painful.

@sappelhoff
Copy link
Member

sappelhoff commented Nov 2, 2023

If I have channels that include F,1 and I specify a channel column with no delimiter because I do not use lists, I think the proposed behavior above would cause us to parse this as two channels (F and 1) which do not exist in channels.tsv.

Or we make the Delimiter metadata REQUIRED. If it's not declared, then channels cells are not parsed as arrays.

However, the majority of people in #1146 was in favor of a (in my opinion only "seemingly") simpler solution that is currently proposed in this PR.


I think at best this should be a warning, because otherwise the complexity of writing a validation rule in schema that covers all edge cases is going to be painful.

you mean that IF

  1. channels is a column in channels.tsv
  2. cells in the channels column contain ,
  3. no Delimiter is declared in channels.json under the channels key

THEN

  1. we warn

???

This is adding "Delimiter" as a valid sidecar field for any column in any TSV.

yes, but currently defined columns other than channels do not declare that their cells are ,-separated arrays by default ... so how I read it, for all (current) non-channels-columns, you would HAVE TO declare Delimiter to parse cells as arrays.

If I have channels that include F,1 and I specify a channel column with no delimiter because I do not use lists, I think the proposed behavior above would cause us to parse this as two channels (F and 1) which do not exist in channels.tsv.

I think if you wanted to specify a channel F,1 in a channels column and do NOT wish this to be interpreted as an array ["F", "1"], then you would HAVE to declare something like: "Delimiter": "+" in your channels.json file.

@VisLab
Copy link
Member

VisLab commented Nov 3, 2023 via email

@sappelhoff
Copy link
Member

I think to validate this, we will need to add/modify schema functions. We don't currently have a way of saying "x is a subset of y", or splitting a list of strings into a list of lists and combining into a list.

when talking about validation, what exactly do you want to validate?

  1. for each row in the channel column
  2. splitting a potential array into values
  3. for each value
  4. checking whether that channel is in the corresponding channels.tsv
  5. if not, warn

is that it? If you expected an error instead of a warning, we need to tune the language to include something like.

Channel names listed in the channel column MUST refer to entries in the channels.tsv file

Was there something else you wanted to validate that I am not thinking of right now?


I tagged some folks for reviews.

@effigies
Copy link
Collaborator Author

effigies commented Nov 6, 2023

Channel names listed in the channel column MUST refer to entries in the channels.tsv file

Was there something else you wanted to validate that I am not thinking of right now?

Yes, and the check you describe is the one I'm thinking of. The wording would indicate to me that an error needs to be raised if the check fails.

I should clarify though: I don't want to validate this at all because it's a pain that requires aggregating data from a minimum of three files (events.tsv, events.json, channels.tsv), handling n/a and performing a check that's not currently possible in our schema expression language.

If XEG users/developers are happy to handle this without the validator producing a warning or error, then that works for me. The size of the audience for this whose use wouldn't be mediated by a tool seems like it's probably small?

@sappelhoff
Copy link
Member

The size of the audience for this whose use wouldn't be mediated by a tool seems like it's probably small?

yes.

Yes, and the check you describe is the one I'm thinking of. The wording would indicate to me that an error needs to be raised if the check fails.

given that we do not have said language in the text currently, the situation is slightly ambiguous. We are neither requiring, nor recommending that each listed channel be in channels.tsv.

I am fine not validating it, which probably effectively means that we make it OPTIONAL whether or not listed channels are in channels.tsv.

I would, however, be in favor of adding REQUIRED language to the spec, but putting off validating to a future timepoint (potentially never), or non-bids-validator tools.

Please react with a thumbs-up here if you agree, and then I'll add the language. Or comment otherwise.

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a minor comment but should not block merging.

@effigies
Copy link
Collaborator Author

effigies commented Nov 7, 2023

Oh wow, I really thought I'd read the language that they have to line up. My brain is failing me...

I will leave it to your judgment whether to add the language, but I would treat a MUST as a commitment to eventually either validate it or relax the rule. If we don't add the MUST then I think it can only ever at most be a warning for backwards compatibility's sake, but the difficulty of the validation might justify limiting to a warning.

@sappelhoff
Copy link
Member

Thanks, I am fine with a warning then. It seems reasonable to me. Waiting some days with this for potential further reviews -- if none are coming in, I will merge it, as this has been discussed at length.

@VisLab
Copy link
Member

VisLab commented Nov 7, 2023 via email

@@ -10,6 +10,11 @@ TaskEvents:
response_time: optional
HED: optional
stim_file: optional
channel:
level: optional
description_addendum: |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't completely understand why this is in the task.yaml file. Is task.yaml where it determines if tabular data applies to a particular modality?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task.yaml is an arbitrary name. It could have been called events.yaml, but the rule schema.rules.tabular_data.task.TaskEvents defines what are valid columns in an _events.tsv file. The filename construction (and hence what modalities apply) are defined in the filename rules:

events:
suffixes:
- events
extensions:
- .tsv
- .json
datatypes:
- beh
- eeg
- ieeg
- meg
- nirs
entities:
subject: required
session: optional
task: required
acquisition: optional
run: optional

events:
suffixes:
- events
extensions:
- .tsv
- .json
datatypes:
- beh
- eeg
- ieeg
- meg
- nirs
entities:
subject: required
session: optional
task: required
acquisition: optional
run: optional

@@ -574,7 +574,7 @@ Delimiter:
name: Delimiter
display_name: Delimiter
description: |
If a column may be interpreted as a list of values, the character that
If rows in a column may be interpreted as a lists of values, the character that
Copy link
Member

@VisLab VisLab Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better as:
"If a column entry may be interpreted as..."

Also, does the Delimiter only apply to "channel" column or to any column as the above implies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delimiter will from now on apply to any column.

onset duration trial_type response_time stim_file channel artifact
1.23 0.65 start 1.435 images/red_square.jpg n/a n/a
5.65 0.65 stop 1.739 images/blue_square.jpg n/a n/a
12.1 2.35 n/a n/a n/a F,1|F,2|Cz sweat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "sweat" was "musc" in a later example and the "artifact" column was called "annot". I think the same example should be continued along.

Thus, the channels related to the event in the third row of the example
are called `F,1`, `F,2`, and `Cz`.

1. The example contains a column called `artifact`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the two examples that relate to this should have this column either called "artifact" or "annot".

@effigies
Copy link
Collaborator Author

@VisLab Some of your comments are marked "Outdated". I'm not sure if you're using a link to a specific commit, or just need to refresh. Can you check that you're reviewing the latest version?

@VisLab
Copy link
Member

VisLab commented Nov 11, 2023 via email

@VisLab
Copy link
Member

VisLab commented Nov 13, 2023

BTW --- I think you should go ahead and merge this PR. If any clarifications are needed, they can be added in an additional PR.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @VisLab, I am merging this now.

@sappelhoff sappelhoff merged commit 51a40fb into bids-standard:master Nov 13, 2023
17 checks passed
@effigies effigies deleted the enh/array_data branch November 13, 2023 16:26
@effigies effigies changed the title ENH: Define channel column for events and Delimiter field for column descriptions [ENH] Define channel column for events and Delimiter field for column descriptions Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Array data in .tsv cells
7 participants