Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use clearer class names following SPHN conventions #3

Merged
merged 32 commits into from
Jan 12, 2024

Conversation

cmdoret
Copy link
Member

@cmdoret cmdoret commented Dec 22, 2023

Expectations: I would like you to raise any issue with the clarity or structure of the schema. This will drive the development of a companion library (sdsc-ordes/smoc-api).

Notes:

  • I manually deployed this branch to the docs website in case that's useful https://sdsc-ordes.github.io/smoc-schema/
  • The only file to review is src/smoc_schema/schema/smoc_schema.yaml All other files are auto-generated from this one.

This PR makes some class names clearer by taking inspiration from SPHN conventions.
Note that we cannot be fully compliant with SPHN 2024 for the DataFile, as we aim to support Zarr arrays. This means that assay's outputs may be arrays nested inside of a Zarr file instead of individual file.

Here is the schema without showing subclasses:

erDiagram
StudyCollection {

}
Study {
    datetime start_date  
    datetime completion_date  
    uriorcurie id  
    string name  
    string description  
}
Assay {
    OmicsTypeList omics_type  
    uriorcurie id  
    string name  
    string description  
}
DataEntity {
    uri location  
    DataFormat data_format  
    uriorcurie id  
    string name  
    string description  
}
ReferenceGenome {
    uri location  
    integerList taxon_id  
    uri source_uri  
    uriorcurie id  
    string name  
    string description  
}
ReferenceSequence {
    string sequence_md5  
    uri source_uri  
    uriorcurie id  
    string name  
    string description  
}
Sample {
    integerList taxon_id  
    stringList collector  
    uriorcurie id  
    string name  
    string description  
}

StudyCollection ||--}o Study : "entries"
Study ||--}o Assay : "has_assay"
Assay ||--}o Sample : "has_sample"
Assay ||--}o DataEntity : "has_data"
DataEntity ||--}o Sample : "has_sample"
DataEntity ||--|o ReferenceGenome : "has_reference"
ReferenceGenome ||--}o ReferenceSequence : "has_sequence"

Loading

See here for a schema of the full schema with subclasses and cardinalities.

@cmdoret cmdoret requested a review from supermaxiste December 22, 2023 10:34
@cmdoret cmdoret self-assigned this Dec 22, 2023
@supermaxiste
Copy link
Member

Thank you for letting me review the schema @cmdoret!
praise: the schema seems to be covering all the necessary concepts and all of my suggestions are related to changing and adding stuff and clarifications 🎉
note: I'm aware that you're providing a schema to have a starting point. Feel free to ignore some points if they go beyond this goal. Those are labeled them suggestion (development).

Major

  1. thought/issue: What seems to be missing in the schema is anything related to experiments and I'm not sure this is something planned or not. My main point is that some of the data might have specific conditions (which I call "experiment") that influence the outcome, particularly for transcriptomic, proteomic, metabolomic data. Do we want this information in the schema or are we focusing specifically on bundling data and that's it?
    I'm not sure it's relevant, but I'd like to point it out because it might be. I also have some suggestions in case this might be relevant, so let me know!

  2. thought/issue: I'm wondering if we can keep track of data updates within the schema too? I'm thinking specifically about DataEntity where maybe we want to know if there was an update to the data. This can result from either errors or also updates if the reference changes.

Minor

  1. suggestion (development): sample is currently defined minimally with taxon_id and collector. I would recommend trying to follow BioSamples, specifically terms such as organism, tissue, sex, strain, cell_type, etc. The BioSamples standard is used by the 3 largest databases in the world for bio stuff: NCBI (US), DDBJ (Japan) and ENA (Europe).

  2. clarification (concept): dataEntity has a property location defined as The uniform resource identifier to access a resource, either on the web or the filesystem. Local paths and web links can be treated very differently, are we sure we want location to be defined as both? I'm also thinking that the term location could be confused with the physical location, which is another information that might also be included. My suggestion would be to either have sub-concepts for location to distinguish exactly what we mean location hasType fileSystemPath or to directly split location into web or path.

  3. suggestion (development): ReferenceSequence doesn't include any version information which I think would be practical and useful. It would be nice to add a property specifying the version of the reference used.

  4. nitpicky (non-blocking): just because I worked on it, I'm wondering if Epigenomics falls under Genomics in OmicsType? The data is quite different, but it might not be the focus of the project anyway.

  5. clarification (definition): currently has_data has a range DataEntity which includes Array, AlignmentSet and VariantSet. What is Array there for? Is it coming from Proteo-, Metabolomics?

  6. question (development, non-blocking): for taxon_id were you thinking of using the NCBI Taxonomy?

  7. suggestion (development): Studies currently have a start and completion date, if there's an ongoing study, should there be a lastUpdate property?

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Thanks @supermaxiste, these are Excellent points ! :)

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Supporting experiments
⏳ Indeed, this is maybe too ambitious for this PR, but I see 2 ways to structurally ad these "experimental conditions": add some property on the sample/assay, or do it via SampleProcessing class like SPHN. I imagine could be freetext and/or codes. We will definitely have to tackle this at some point, and I created a separate issue to keep track of it.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Tracking data changes
⏳ We could indeed represent processing / transformation that the data went through. This will probably come at a later stage, as this is not strictly required to have a functional object (but makes it more FAIR) and I kept track of it in another issue.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Biosamples
✅ Thanks for the suggestion! The biosamples website also refers to bioschemas:BioSample, on which smoc:Sample is loosely based. We will follow this model focusing on human-centric properties.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Data location
❓ Deciding how to handle a URI is often done based on the "scheme" of the uri (the first part before :// , e.g. s3:// file:// ftp://. This was inspired from sphn:has_uniform_resource_identifier. I was hoping to avoid multiplying properties for the sake of reducing the code complexity downstream.

I agree that the naming is problematic. Other options I considered were uri, path, at_location, access_path, access_location, content_uri, content_location, data_path, data_uri. Any suggestion / preference?

EDIT: temporarily went with data_path to disambiguate from physical location. Let me know if you can think of something better.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

ReferenceSequence version
✅ Indeed, references now have an optional version property with string value.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Using ncbi taxonomy
⏳ For now we will keep an integer, but yes we eventually want to move to controlled vocabularies. Created an issue for this #6

Based on your advice, we also added source_material, sex and cell_type properties for sample. For now they take string as values, but we will want to use controlled sets as well.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Epigenomics
❌ For now we are limiting the scope to pure genomics (aka WGS, WES, ...)

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

has_data with Array
ℹ️ Array is a multi-dimensional array contained inside the zarr archive. It is not a file, which is why we don't use DataFile, but DataEntity as superclass.

In practice, yes it will likely be used for {prote,metabol}omics data, but not only.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Study completion_date
Based on feedback from our collaborators, the Study object has been replaced with Multi-Omics Digital Object (abbr. MODO).

In the process, we removed properties associated with studies, incl. start/completion date instead and went with "Creation date" instead.

EDIT: added last_modified_date property on MODO

@supermaxiste
Copy link
Member

To wrap things up from my previous comments:

Major

  1. Issue for tracking ✅
  2. Issue for tracking ✅

Minor

  1. Integrated ✅
  2. DataLocation term❓
    I would suggest access_path or access_location because I think in this case two words are better than one 👍
  3. Integrated ✅
  4. Clarified ✅
  5. Array ℹ️
    question: just to make sure I get it right, we're now defining AlignmentSet and VariantSet which are the only two data types I can think of in genomics and transcriptomics. Since I have extremely limited knowledge on proteo- and metabolomics, I guess they might also have their own "standard sets". Is Array there to make sure to include other data types that we don't know yet or is it a placeholder for other sets besides AlignmentSet and VariantSet?
    For context: to me "Array" is a very generic word so I'm trying to pin down what we're aiming for more or less here.
  6. Issue for tracking ✅
  7. Integrated ✅

@cmdoret
Copy link
Member Author

cmdoret commented Jan 10, 2024

Thanks @supermaxiste !
You're right, Array is a placeholder and we'll likely subclass it or make additional variants for proteo/metabolomics stuff once we know exactly what's required 👍

For now I've already replaced location with data_path in the schema + companion API, and we'll use it for the prototype, but the schema will likely be revisited later.

Copy link
Member

@supermaxiste supermaxiste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all of your replies @cmdoret!

I'll provide a couple more in-depth comments here that you can decide whether we can turn them into issues or address them in this PR. To some extent they address clarity, but I'll let you be the judge.

  1. suggestion (ontology): For some of the genomic terms such as "Reference Genome" and "Reference Sequence" we could also point to the GENO Ontology since they provide some nice definitions too. With "point to" I mean adding a see_also property pointing to GENO entries:
  • Reference sequence: http://purl.obolibrary.org/obo/GENO_0000017 (link)
  • Reference genome: http://purl.obolibrary.org/obo/GENO_0000914 (link)

note: I checked for alignment set and variant set but couldn't find anything there

  1. question (cardinalities): in the project/ folder I saw a bunch of files with different formats and I noticed that some include shacl shapes. This lead me to check cardinalities and I don't think it's very clear what needs to be there and what not. In the overall diagram you shared it looks like a class existing doesn't require any other class from existing and I'm wondering if that should be changed. If you add a MODO, shouldn't it include at least 1 Assay with at least 1 DataEntity? Or are we thinking in a "placeholder" way where people can create empty MODO objects and fill them up later?
    On top of this, it's not clear right now 2.1) which properties are mandatory and 2.2) where the information is coming from, because the schema include some required entries, but the shacl shapes seem to go further than that.

@supermaxiste
Copy link
Member

To end on a sweet note:
praise (ontology): I double checked the ontology codes for file format and -omics to make sure they're indeed pointing at the right corresponding concept and all was perfect!
praise (repository): Excellent cookie cutter choice and I have to say that I like that the project/ folder offers so many different formats that users can interact with. I had some fun checking all the formats out of curiosity.

@cmdoret
Copy link
Member Author

cmdoret commented Jan 12, 2024

Thanks @supermaxiste !

  • Reference sequence meaning added in 7c85ca7
  • The reason for non-mandatory Assay / Data is indeed so that we can create a placeholder MODO container and fill it interactively, which would otherwise not be possible.
  • Mandatory properties are all those marked as required in the smoc_schema.yaml file which acts as the single source of truth for the schema.
  • I can see why this is confusing; in the yaml schema all properties are optional and single valued by default. In SHACL, all properties are optional and multivalued by default. Specifically, these are the following possibilities when shacl cardinality constraints are generated from the yaml:
    • default -> sh:maxCount 1
    • required -> sh:minCount 1 sh:maxCount 1
    • multivalued -> no constraint
    • required + multivalued -> sh:minCount 1

@supermaxiste
Copy link
Member

Thank you for all the work @cmdoret and your clarification about the yaml make it fully clear now. I also found out that the docs specify all the cardinalities nicely too.

All looking good to me now 🌞

@cmdoret cmdoret merged commit 3b5e996 into main Jan 12, 2024
2 checks passed
@cmdoret cmdoret deleted the refactor/sphn branch October 7, 2024 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants