-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistencies Between Code, Documentation, and Schema #534
Comments
+1
@ptth222 this should be fixed when we merge 'issue-511' into 'develop'.
@ptth222 the "components" were left out from the implementation as hardly used / inconsistently used. "Parameters" provide a similar capability. Yet, you are right about the undocumented discrepancy between schema and implementation.
Yes. Currently 'derivesFrom' is only available from Sample.py, and not from Material.py., which is inconsistant with the current schemas (1.0).
@ptth222 , the schema 'material_attribute_schema.json' is the one you are looking for and it is used by the Study_schema.json and assay_schema.json. Agreed the naming should have been consistent.
known issue: @ptth222 this will be addressed by merging 'issue-511' into 'develop'
correct.
known issue: @ptth222 this will be addressed by merging 'issue-511' into 'develop'
Indeed. considered minor but if this is really becoming a nuisance, a fix will be issued. @terazus , what do you think?
There are indeed plans to revamp but as the changes would likely break background compatibility, we have held back.
@ptth222 +1, we discussed this with @terazus . could also be an Organization (CRO use case). We did not implement to avoid complexity and also possible privacy issue of 'performer' (e.g. animal studies).
there is no arguing on this one.
@ptth222 +1 we need to check. 'Extract Name' , 'Labeled Extract Name' are Material and should get "Characteristics".
This is what happened de-factor-> both 'Material Type' and 'Description' recast as 'Characteristics'.
+1
Agreed about clarifying that aspect .
+1, I think it is best to have this as a separate issue but as we go over the merge review for issue-511 (which covers more than the initial issue), we will double check this. Thanks for all the detailed reporting. |
@proccaserra Thank you for the detailed response. I apologize for double posting about some of this. I'm not familiar with all of the issues and was not aware of #511. I have been trying to pay attention to the develop branch and keep my fork up to date. I will take a look at this again once #511 is merged. |
@ptth222 , no apologies needed at all. We were hoping to push the corrections sooner , thus avoiding you that pain. I am the one to apologise for these inconsistencies. Your input is very much appreciated |
A small update. The JSON Schema for data_schema has the following types listed for data files: 3 of these don't have a class in datafile.py in the model: |
I have found another inconsistency with the "Assay Nodes". The terminology gets kind of confusing here, so I am going to try and explain. In the specification, https://isa-specs.readthedocs.io/en/latest/isatab.html, in the "Assay Table file" section, it discusses the use of the column names "Assay Name", "Data Transformation Name", and "Normalization Name". In the code, in "isatab\defaults.py", these column names and some others are listed in a constant "_LABELS_ASSAY_NODES". The first thing that is rather confusing is the naming. Having a column, "Assay Name", inside the assay table file, which already has a name, is confusing. If you look in the code for ProcessSequenceFactory all these columns do is end up giving a process a name, so a column name like "Process Name" would make much more sense. It's also confusing why there are so many name variations on a column that just serves to name a process. What is inconsistent between the code and the specification is that the specification indicates that these name columns MUST be used, but that is not the case in the code. There are no validations that require them, and the ProcessSequenceFactory just looks for 1 to be present and uses it to set the process name if 1 and only 1 exist for the process. Essentially, they are treated as qualifying columns for a process in the code. It looks to me like there is some history with these columns that I'm not aware of and their use has changed over time. One of the example assay table files, a_proteome.txt, seems to use them more like a synonym for "Protocol REF". The "preprocess" function in ProcessSequenceFactory.py also seems to support this because it will insert "Protocol REF" columns in front of these name columns if it finds one of these columns unconnected to a "Protocol REF" column. What is difficult now is how to validate these columns. If they should be qualifying columns for "Protocol REF", then reporting an error when they are out of position is reasonable. If they could be a synonym for "Protocol REF", then you shouldn't report an error. These are mutually exclusive, and I need to know which behavior should be correct for the "load_table_checks" validating function. |
I went through the code in the "model" folder, the JSON Schemas in "resources/schemas/isa_model_version_1_0_schemas", the JSON Schemas on https://isa-specs.readthedocs.io/en/latest/isajson.html, and the specification on https://isa-specs.readthedocs.io/en/latest/isatab.html to try and find inconsistencies. My impetus to do this was after trying to figure out the correct way to fix one of the validation functions and realizing that I don't have a source of truth for what should be valid.
I am going to list off some of what I found and categorize them a little bit.
Hard Inconsistencies
Minor Things I Noticed
Questions
Suggestions or Possible Improvements
The Hard Inconsistencies between the model code and JSON Schemas really needs addressed, and most importantly, which one is the ground truth? Does the model code need to change or the JSON Schemas?
Deprecated Columns?
This next bit I want to do separately because I really need to understand it to fix the validation function that started this whole thing. The function in question is "load_table_checks" in "isatab/validate/rules/rules_40xx.py". It is printing an unexpected column error for a "Characteristic" column after an "Extract Name" column. That is to say the function is saying that it is not valid to have a "Characteristic" column after an "Extract Name" column. I am quite confident that this is not correct. When I went to double check the truth of this I ran into what seems like contradictory information from my point of view between some of the sources mentioned above. That's why I did a deeper dive into things to try and find a decent list of issues that we can hopefully get resolved so there isn't conflicting information and code.
The main thing I want to ask about now is in the specification here: https://isa-specs.readthedocs.io/en/latest/isatab.html#assay-table-file. Under "2.3.7 Assay Table file" where it talks about "Extract Name" and "Labeled Extract Name" it says they "MAY be qualified with Characteristics, Material Type and Description". The "Characteristics" is consistent with the JSON Schema and model code, but I don't think "Material Type" and "Description" are relevant anymore. At the very least I cannot see what "Material Type" and "Description" would translate to in the JSON or model. I think maybe "Material Type" would just be a characteristic, and "Description" would just be a comment.
I have looked over some of the conversion code and loading code, and I can find where there is some special handling for "Material Type" in the "load_table" function in the "isatab/load/core.py" file, and the first version of the tab2json converter, "ISATab2ISAjson_v1", also has special handling for "Material Type", but I can't find anything for "Description". It should be noted that this "load_table" function is only used for the ISA-Tab validator, the converter, new and old, load tables in a different way. The way the new converter loads tables does not have any special handling for "Material Type" or "Description". All of the special handling for "Material Type" makes it into a characteristic.
Similarly, "Label" has the same issues as "Material Type" and "Description". To me it looks like it should just be a characteristic. I can find special handling for "Label" in the old tab2json converter and in the new converter (ProcessSequenceFactory). Both turn it into a characteristic.
Can we decide exactly what should be done with these 3 columns, "Label", "Material Type", and "Description"? The easiest thing is to just drop them and remove their mention in the documentation, but if you want to stay backward compatible then they need some special handling. "Label" and "Material Type" already have some special handling precedent that simply turn them into characteristics, which I think is fine, but needs to be consistent everywhere in the code. The 3 different loading paradigms in the code all do things differently. "Description" has no precedent for special handling that I can find, so it might be okay to just drop this and remove all mentions. I do need to have an answer to this before I can fix the "load_table_checks" function though.
This last part may need to be it's own issue, but it is related to everything. If you would rather have it separated, then that's fine.
The text was updated successfully, but these errors were encountered: