-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Array data in .tsv
cells
#1446
Comments
I always like to see what I'm in for downstream, so I did a couple of simple tests of the alternatives on Python (pandas DataFrame) and Matlab (Table), which I believe will be the two most common downstream data structures to hold tabular data for users. After view the results (below and attached files), I would prefer option 3 of just having channels be [x , y, z]. It seems that the tabular input for standard tools sees the [ ] and immediately thinks this is a string. Here is my code and the results. The three test files are attached. All of the Matlab results have outer { } because they all come back as a string in a cell array of size 1 x 1. table1 = readtable('test1_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table2 = readtable('test2_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table3 = readtable('test3_events.tsv', 'Delimiter', 'tab', 'FileType', 'text'); Python: table1 = pd.read_csv('test1_events.tsv', delimiter='\t')
table2 = pd.read_csv('test2_events.tsv', delimiter='\t')
table3 = pd.read_csv('test3_events.tsv', delimiter='\t')
|
Thanks @sappelhoff and @VisLab . My EEGNet colleague Tyler @Andesha Collins will weigh below, and I'll re-iterate part of our concern with embedding json in a tsv --: It adds challenges for researchers trying to BIDS-ify their data, since it's harder to read and format in a .tsv cell, more prone to formatting errors, and can't be validated as easily (afaik) - a recipe for frustration and attrition in data sharing. I understand the reasons why embedded json can be preferred, and at the same time it would be nice to also aim for reducing barriers in onboarding many new research groups to BIDS (as my group does with large open-source projects like LORIS and data-sharing platforms like EEGNet). p.s. @sappelhoff per my original question in the BIDS Maintainers meeting March 9 - We can't be the only corner of the BIDS world looking at this, can we? who else should be consulted in from other specs? |
Note to self: check how octave + bids-matlab will deal with this. |
Some notes on Kay's comment: I would suggest that we don't allow heterogeneous arrays, and instead say the type is "array of strings". Encoded in JSON it would be:
(I've added another row to show an empty list.) There are two approaches with Pandas (note that the reprs don't make clear the string versus object; if Convert on load>>> import pandas as pd
>>> import json
>>> import math
>>> def loader(val):
... if val == 'n/a':
... return None
... return json.loads(val)
...
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t', converters={'channels': loader})
>>> table
onset duration trial_type channels
0 3.452 NaN blink [P1, P2, P3]
1 6.112 0.0 show None
2 6.820 0.2 response [CP1]
3 7.345 NaN start [1, 2, 3]
4 8.543 5.0 mixed [1, CP1, 2]
5 9.112 0.0 mixed [] 1, 2 and 3 are strings here. Load then convert>>> import pandas as pd
>>> import json
>>> def from_default(val):
... if isinstance(val, str):
... return json.loads(val)
... return val
...
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t')
>>> table
onset duration trial_type channels
0 3.452 NaN blink ["P1", "P2", "P3"]
1 6.112 0.0 show NaN
2 6.820 0.2 response ["CP1"]
3 7.345 NaN start ["1", "2", "3"]
4 8.543 5.0 mixed ["1", "CP1", "2"]
5 9.112 0.0 mixed []
>>> table.channels = table.channels.apply(from_default)
>>> table
onset duration trial_type channels
0 3.452 NaN blink [P1, P2, P3]
1 6.112 0.0 show NaN
2 6.820 0.2 response [CP1]
3 7.345 NaN start [1, 2, 3]
4 8.543 5.0 mixed [1, CP1, 2]
5 9.112 0.0 mixed [] 1, 2 and 3 are also strings here. The other thing we considered was a simple comma-separated string, so it would be
This would be easy enough to deal with simply by using When suggesting JSON-encoded strings, I assumed that these would not be hand-edited but generated from spreadsheets or annotation tools. If this will be tooling mediated, the JSON will be better structured and the validator could distinguish numbers from strings and give errors. If people are going to be writing these by hand, then I think the comma-separated strings will be the easiest. |
Just some wild thoughts: Besides it being JSON, what would be the advantage of using the Possible answers to myself: The The Why would we use a different separator inside the list (comma) than outside the list (tab); that could also both be a tab. Yup, I realize this makes parsing harder, both for human and machine. But would that mean that if we have a nested list, we would use yet another separator (e.g., I like the idea what the downstream consequences are when people use software to process the data. Some softwares that I suggest to consider (besides native Python, Python+Pandas, MATLAB ) are R, SPSS, Excel, ... |
I hadn't personally considered nested lists as part of my suggestion based on the simple List format... @robertoostenveld Is that something that we would need to support? 🤔 |
The comma separated list inside of brackets seems to be the most simple (and safest) representation that fits the range of use cases that I am aware of.. from what I can tell, things like adding the quotations around items inside the brackets would only allow the users to do things like have commas in channels names for example (which I think I am okay with restricting). |
Talking with @sappelhoff, one of the problems with comma-separated channels is that in practice channels might have commas in their names. I can see two approaches that would allow custom delimiters that are only incrementally more complex than comma-separated strings.
These could be implemented: def wrapped_delimiters(val):
if isinstance(val, str):
delimiter = val[0]
if not delimiter.isalnum() and val[-1] == delim:
return val[1:-1].split(delimiter)
return val
def prefixed_delimiters(val):
if isinstance(val, str) and len(val) > 2:
if val[1] == ":":
delimiter = val[0]
return val[2:].split(delimiter)
return val Unless there's some globally safe delimiter (or we're willing to force people to mass-rename channels in their data), A dynamic delimiter or escape characters are likely the only options. One additional step of complexity is an overridable default:
|
No, I do not see the use case for it at the moment. But then, I also did not see the use case in the past for lists within one cell of the TSV table. Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use |
Thanks for the suggestion, Chris. I like the idea of delimited lists where the delimiter is wrapped and may be freely selected from a list of defined delimiters (like |
I'm glad this discussion is progressing towards a resolution. I don't see a use case for nested lists at this time and don't think we should address this until there is a need for it. Either the wrapped delimiter solution or the overridable delimiter solution is fine. |
This is great, thank you, either the wrapped delimiter solution or the overridable delimiter solution should work for our use cases. |
Personally, I prefer the simple format of separated values with commas. P1,P2,P3 BIDS is supposed to be readable by users. Numerical array can be 1,2,3 and if you want string "1","2","3" |
Okay, if you need to distinguish numbers and strings, then that makes sense, but you're now two characters away from a JSON string, which I thought we were trying to avoid. Adding in quoted strings makes this a parsing problem and syntax errors will be possible. |
I think instead of distinguishing between an |
I take comfort in @robertoostenveld 's comment (and the majority here of array data CSV cells in TSV files supporters):
I also like a comma-separated list's simplicity for "array data" inside of a TSV cell (which is trying to work around the spirit of a TSV already) and the acknowledgment that anything you might put inside a CSV array inside an otherwise TSV file should be absent of commas. Names can use a delimiter that is not a comma, such as @effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation? |
In EEG I have only seen [A-Za-z0-9], - and _. I don't know about eye tracking and motion capture. |
It's not common to have non-alphanumeric characters in channel names (except maybe a space character ... not sure about the statistics here and what's "common"). However it does happen -- I have seen it raised in issues on ephys software, and for example, BrainProducts explicitly mention in their spec, how commas are supposed to be escaped:
(taken from the BrainVision data format spec) And given that the suggestion with wrapped delimiters is in my opinion a robust and easy-to-understand solution, I see no reason why we should force people to rename their channels.
I currently don't see a need for "mixed" arrays (str and int/float), so I agree with the suggestion to define it in the JSON, which data type is supposed to be expected. However having that said, if the need ever does arise, then a JSON array probably would be the best solution for us. |
Another proposal could be to define the delimiter of a See this TSV file:
and this accompanying JSON file: {
"my_channels1":{
"LongName":"My Channels 1",
"Description":"The first batch of channels associated with a given event.",
"Delimiter":","
},
"my_channels2":{
"LongName":"My Channels 2",
"Description":"The second batch of channels associated with a given event.",
"Delimiter":"|"
}
} We would basically extend the list of "defined JSON keywords to describe TSV columns" (current doc on this: https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files) by an additional keyword: If In most cases, we will have EDIT: Or we think in this general direction and make this a key like so: {
"CellArray": {
"Delimiter": ","
"DataType": "int"
}
} and so on ... I propose this just as a new direction for our thoughts, and because I am worried that a simple "comma separated list" will not be a good idea to capture the heterogeneity that we have with data. |
@sappelhoff I get where you're coming from with explicit delimiter metadata, but I think putting parsing instructions into JSON sidecars would be extremely error-prone. I can't think of an existing case where the data file cannot be fully parsed without reference to the sidecar, and TSV files in general have not had mandatory sidecars. I suspect you'll end up with a lot of tools just assuming comma-separated values. Not to cut the conversation short, but does it make sense to put forward some questions for a vote? I have a sense that there's at least a camp that's pretty settled on comma-separated-unquoted-strings, and if that camp is larger than those trying to find an acceptable dynamic delimiter, we may just be wasting time coming up with possibilities. |
Thanks for your perspective @effigies I agree. I proposed this because I am just really unhappy with the plain comma-separated list, as I the wrapped delimiter solution is so much more versatile at a minuscule increase in "complexity". I think a vote is a good idea. |
Just to clarify what is being voted upon, it sounds like it should either be "Comma-separation" or "Wrapped delimiters" that are used to create an array of values inside any TSV cell. A brief summary based on discussion in this issue is below. I think it is important to note that any way you want to use delimiters for an array in a TSV cell, it will always be interpreted as a string which you would have to parse in your software. Comma-separationUse commas as the delimiter to separate values in an array.
Wrapped delimitersIf the first and last character are non-alphanumeric and identical, then that character is identified as the delimiter and used to split the string. That would permit the user to select whatever they find most readable, given their channel names.
@sappelhoff @effigies Did I miss anything else important? |
Would it perhaps be possible to clarify 2 things before voting:
|
At the moment this is just about lists of channels. I think this would end up as instructions that will show up in a column description table, not something to go in "Common Principles". The most this would do is create a precedent, making it more likely that the next column to have an array type (if there ever is one) will have whatever format we declare here, but it wouldn't be so strong that we couldn't make a different choice.
I think we're talking 2-3 lines of (Python) code difference if we picked one vs the other and then later wanted to add dynamic delimiting. For what it's worth, here are the questions I wrote up (but didn't post) last time:
Would getting votes on these be sufficiently clarifying? |
@sappelhoff your original suggestion would be to circulate this to the bids community (off github?) -- Seems like we're ready? cc @effigies you have 7 upvotes so far! Thanks also @dorahermes above for circling back to visibility and generalization across all BIDS (hence this spin-off thread) -- @effigies @sappelhoff let us know if you need help/ideas for increasing visibility or linkages across modalities etc. |
I vote 1) with @VisLab since it covers most of our use cases as well. Only question I have left is whether the channel column always contains a list or whether this should be specified in the .json sidecar? |
I think I agree basically the same as @VisLab and @dorahermes.
|
Agree based on comments from @VisLab and @dorahermes |
👍 with @VisLab and @dorahermes and @Andesha 's votes and @ericearl's helpful clarification @arnodelorme - for EEGNet this seems to be the most workable, a :+1 from you as well? |
Using something like , |
Hello! The whole idea of adding arrays to a cell of a tsv file seems a bit unsavory to me, because it would takes what is otherwise a tidy table (in this sense) and make it a bit more messy, as is apparent in this discussion. Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value. I believe that this comes with no loss of information, and retains the tidy format of the table. And IIUC, requires less obtrusive changes to the spec (if at all?). For example, the table in this comment would become:
For disambiguation, we could add a column that explicitly highlights that these rows together form an array (a Boolean) and even the index within the array (i.e., integer values: 0, 1, 2). Does this preclude any use-cases that this proposal is designed to meet? |
In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging. Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode. |
In the HED Working Group, we are considering how to explicitly enter lines
in the *events.tsv* files for *non*-onset markers into event processes -
for example, markers of the onsets of syllables in a spoken word event. HED
will refer to these as 'Inset' event markers. Explicit 'Offset' event
markers (also occupying a line in the .tsv file) are also being supported.
Here, a new (optional) column in the .tsv file, 'Marker_type' can be used -
in this case now-standard event onset entries in the .tsv file should be
referred to as being an event 'Onset' marker_type.
This advance in event annotation raises the same issue of how to group
event markers to refer to the same event (process). This requires another
(optional) column in the .tsv file titled 'Event_index'. Currently, we are
thinking to allow the user to use their own event indexing scheme
(including the obvious possibility, monotonically increasing event number).
I have no opinion about the proposal to allow interpretation of
vector entries. For small (short) vectors, thsi doesn't seem a problem -
for long vectors, however, ...
Scott Makeig
…On Thu, Apr 13, 2023 at 10:22 AM Tyler Collins ***@***.***> wrote:
In my opinion the strategy of breaking up the events across multiple lines
has the problem of how to group them back together. For example, the table
requires post processing to recognise the first three events are the "same"
just on different channels. Further, it does not account for the situation
of: "this event is an artifact that is present across these three channels"
versus "there are three different artifacts separate across three channels
at the same time". This makes the previously mentioned grouping challenging.
Lastly, I agree that this breaks the "tidy" assumptions. I feel it is
necessary though. Pandas does have functions for manipulating data like
this through explode
<https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9FJqgOae$>
.
—
Reply to this email directly, view it on GitHub
<https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507061592__;Iw!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9JDtJUZc$>,
or unsubscribe
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFTPPM37ERMQWIXV2DLXBAD3XANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9DEGcUK1$>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Scott Makeig, Research Scientist and Director, Swartz Center for
Computational Neuroscience, Institute for Neural Computation, University of
California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott
|
Thanks for the comments @Andesha. Regarding this one:
I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig in his comment. |
Yes, Kay Robbins has demonstrated using events.tsv columns to reveal the
statistical design of the experiment. For example in a simple 2X2 design
({loud, soft}X{beep,boop}) specifying the nature of the event (onset) using
two columns (Loudness, Pitch), abetted by HED definitions of the (2x2)
levels of these two variables in an events.json file (typically at the
top level of the BIDS file hierarchy) allows HED tools to compile a full
HED annotation for the events in the dataset automatically. I suppose it
might be possible to allow vector column entries along with (already
supported) multicolumn entries? (the key there being that the definitions
of the columns and their values are given in the .json file). -Scott
…On Thu, Apr 13, 2023 at 7:22 PM Ariel Rokem ***@***.***> wrote:
Thanks for the comments @Andesha
<https://urldefense.com/v3/__https://github.com/Andesha__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEUStLqMy$>.
Regarding this one:
Pandas does have functions for manipulating data like this through explode
<https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEfzLhmoN$>
.
I think that this could work both ways. On the one hand, it means that a
vector cell could be interpreted provided software that knows how to
interpret it (but the complexity of that is apparent in the above
discussion). On the other hand, it also suggests that (almost?) anything
that can be represented using vectors could be represented in a tidy table,
provided informative columns. For example, event identifier columns, as
suggested by @smakeig
<https://urldefense.com/v3/__https://github.com/smakeig__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEZ3qR8V8$>
in his comment.
—
Reply to this email directly, view it on GitHub
<https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507723742__;Iw!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReERO9A-0W$>,
or unsubscribe
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFQQMI5PGKJG3YYRJNDXBCDD5ANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEW8_z4bp$>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Scott Makeig, Research Scientist and Director, Swartz Center for
Computational Neuroscience, Institute for Neural Computation, University of
California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott
|
For me that would work. |
per @Andesha's response to @arokem - This is going to get a little messy no matter which way we slice it. For data platforms (like LORIS and EEGNet.org), distributing an event over multiple rows is problematic (normalization etc) and also leads to more user typos, and zeroing out repeated values is tricky too. -- @effigies @sappelhoff or is it time for a wider recap/survey? p.s. @smakeig agreed this might grow further, and to support HED tagging at a basic level, I think we'd shoot for the simplest approach needed currently here. |
I'm not aware of any precedent. |
I'm not aware of any precedent in BIDS either. The tidy suggestion (maybe together with an index column that "binds rows together" (is this what you mean by "event identifier columns" above?)) sounds like a reasonable alternative to the wrapped delimiter suggestion to me. (A combination of both would even easily allow us a one-level nesting 😉, but I'm not advocating for that) I'm not entirely sure (yet) how this constitutes a problem for loris, eegnet, or HED. And would love to see an example of the potential problems. |
As HED was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list. |
Yes, there is the precedent of the HED column in the event file. Each
entry in the HED column is treated as a string as far as BIDS is concerned.
The hed-javascript validator verifies the actual contents in detail as part
of bids-validation.
From an annotator's perspective, putting each channel on a separate line in
the event file is a nightmare, especially when you want to annotate
different features that start at the same time, many channels are
included, and the feature needs to be identified with markers a starting
time and an ending time (which will be the predominant use case for which
this question arose) and you don't know what the ending point is when you
identify the starting point.
One possibility (not necessarily ideal) is to pass this off to the
HED Working Group to deal with. From the BIDS perspective, the list column
entry would be treated as a string. Users would need to provide HED
annotation in an events.json file (or as part of the HED column in the
events.tsv file) using the Collection tag. The string would need to conform
to the syntax HWG imposes on the values for the Collection tag (which
haven't been specified yet). @scott Makeig ***@***.***>
@christinerogers @dorahermes.
…On Sat, Apr 15, 2023 at 3:52 AM Robert Oostenveld ***@***.***> wrote:
Is there any precedent in BIDS for either following tidyverse or nesting a
list within a column ?
As HED
<https://bids-specification.readthedocs.io/en/stable/appendices/hed.html>
was brought up, is that itself not a precedent? See
https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv
where the HED column contains a comma-separated list.
—
Reply to this email directly, view it on GitHub
<#1446 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJCJOU3KBOTSBJ2X4JYY6LXBJOWDANCNFSM6AAAAAAV5D2DSA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Agreed @robertoostenveld @VisLab - consistency with what has already been accepted by BIDS makes sense vs adding tidyverse compliance into the mix. |
While it would be nice to keep the .tsv files tidy, I am not sure whether the one line version is always the easiest to human read. Putting each channel on a separate line in the event file is extremely cumbersome as in the iEEG case seizures can start on a few channels and quickly spread to many channels, which would add many lines for 1 event in the _events.tsv. This is the particular use-case that we have in mind now. HED columns containing a comma-separated list may be a precedent? |
Thanks @dorahermes - @effigies @sappelhoff there was discussion in the BIDS maintainers meeting that we had enough consensus to move forward and note this elsewhere in the BIDS ecosystem. Any thoughts? |
note that 'we' (the steering group) have nothing against the proposed solution - just when we discussed it, @arokem came up with repeated line solution, and we just wanted to see if that would work - avoiding introducing new things. Since there is a precedent, and @dorahermes makes a compelling argument that it will be cumbersome -- I vote (again) for the simplest option as in e.g. |
Agreed. Based on the responses here, I think that the original line of thought seems to be prevailing. I would much rather have folks who are actually using this kind of data weigh in on what is practical for them. |
Okay I also agree on dropping the "tidy solution" from the options for BIDS then. Following @effigies summary from here: #1446 (comment) I am still in favor of comma separated lists with:
However I would also be fine with this (although I find it less nice):
|
I'm fine with either of these solutions. Thx
…On Wed, Apr 26, 2023 at 3:06 AM Stefan Appelhoff ***@***.***> wrote:
Okay I also agree on dropping the "tidy solution" from the options for
BIDS then.
Following @effigies <https://github.com/effigies> summary from here: #1446
(comment)
<#1446 (comment)>
I am still in favor of comma separated lists with:
- "dynamic" delimiters, where the delimiter is declared in the value:
For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for
|P1|P2|P3|, the delimiter is |
However I would *also* be fine with this (although I find it less nice):
- dynamic delimiters, where the delimiter is declared in the JSON
sidecar
- if no delimiter is declared in JSON, then a comma is the default
delimiter
—
Reply to this email directly, view it on GitHub
<#1446 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Or, is it then also possible to declare a 2D matrix using p1,p2,p3;
p4,p5,p6 (commas/semi-colons) notation?
For both vectors and matrices, there should be some guideline advice not to
make them too big...
IF the tsv file cell vectors (or matrices) need to be quite large (e.g.,
say 97 sEEG channels showing some event), might these vectors or matrices
go into a 'objects.json' file as named objects?
Then, in the 'events.tsv' file the object defined in the 'objects.json'
file can be referenced in the 'events.tsv' cell as, e.g., 'obj:thing27
where the object names (here, 'thing27') are the user's choice.
This would preserve the (human) readability of the events.tsv files...
…On Wed, Apr 26, 2023 at 6:10 AM VisLab ***@***.***> wrote:
I'm fine with either of these solutions. Thx
On Wed, Apr 26, 2023 at 3:06 AM Stefan Appelhoff ***@***.***>
wrote:
> Okay I also agree on dropping the "tidy solution" from the options for
> BIDS then.
>
> Following @effigies <https://github.com/effigies>
<https://urldefense.com/v3/__https://github.com/effigies*3E__;JQ!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe95VplLHI$>
summary from here: #1446
> (comment)
> <
#1446 (comment)>
<https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1494687731*3E__;IyU!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9-PA52hO$>
>
> I am still in favor of comma separated lists with:
>
> - "dynamic" delimiters, where the delimiter is declared in the value:
> For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for
> |P1|P2|P3|, the delimiter is |
>
> However I would *also* be fine with this (although I find it less nice):
>
> - dynamic delimiters, where the delimiter is declared in the JSON
> sidecar
> - if no delimiter is declared in JSON, then a comma is the default
> delimiter
>
> —
> Reply to this email directly, view it on GitHub
> <
#1446 (comment)>
<https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1522967227*3E__;IyU!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9zlQekmX$>
,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA>
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA*3E__;JQ!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe96GLZFCE$>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
—
Reply to this email directly, view it on GitHub
<https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1523161407__;Iw!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9waD44aj$>,
or unsubscribe
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFWZ53A654LK4NOVTO3XDDYC5ANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe98rBRAxS$>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Scott Makeig, Research Scientist and Director, Swartz Center for
Computational Neuroscience, Institute for Neural Computation, University of
California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott
|
Happy with the following from @sappelhoff 's breakdown
|
Please bear in mind that any rule that goes in BIDS needs to be validatable. I think adding matrices or object references into the mix is asking for pain and, in any case, it is beyond the scope of what people have discussed and formed some kind of agreement around. At this point, it seems that there is a majority for simple comma-separated strings, and if people want/need to use alternative delimiters, they can add a It's time to make a concrete PR to verify that this can be specified and validated to everyone's satisfaction. The following is my assessment: The bids-specification/src/common-principles.md Lines 460 to 472 in 77066b3
I suppose we add a bids-specification/src/schema/rules/tabular_data/task.yaml Lines 2 to 16 in 77066b3
This could be narrowed to only apply to MEG/EEG/iEEG/NIRS. |
Array data in .tsv cells discussion is converging towards a comma-separated list (possibly wrapped delimiters list), so the example was edited accordingly - combined rows representing the same event with a list of channels in the channel column. See full discussion here: bids-standard/bids-specification#1446
Potentially to be dealt with:
Discussion originated in this thread:
Currently, cells in BIDS
.tsv
files may take on numeric or string values (see tabluar data).Here I list some examples:
n/a
, reflecting a missing value or that a value is not applicable for the present cellmystr
, that is further specified in thecolumn_name.Levels
object in a.json
file that accompanies the.tsv
file, wherecolumn_name
refers to the name of the column in the.tsv
file under which the stringmystr
occursonset
column inevents.tsv
column_name.Units
, analogously to themystr
example aboveHowever, for some cases, it would be desirable to define an array of values in a particular
.tsv
cell. For example when a given event (in a row) in anevents.tsv
file is associated with multiple channels from a neural recording (e.g., EEG, iEEG, NIRS, ...):Currently, there is no formal way to specify such an array in BIDS. We think that it could be helpful for several data modalities (now and in the future) to have such a concept to work with.
In bids-standard/bids-examples#324 (comment), @effigies suggests that we use JSON arrays for this case (see his comment for advantages over more basic structures, like a comma-separated string).
I think if we were to adopt JSON arrays as a valid input for
.tsv
cells, we would would benefit by (at least initially) restricting the kinds of values that may be listed in a given JSON array. Values are defined as such:For our present purposes, I think it would suffice to confine ourselves to:
What are your opinions? Do you see other use cases apart from annotations in ephys data (in
events.tsv
files)? Please comment and advise.The text was updated successfully, but these errors were encountered: