refactor: use clearer class names following SPHN conventions #3

cmdoret · 2023-12-22T10:34:18Z

Expectations: I would like you to raise any issue with the clarity or structure of the schema. This will drive the development of a companion library (sdsc-ordes/smoc-api).

Notes:

I manually deployed this branch to the docs website in case that's useful https://sdsc-ordes.github.io/smoc-schema/
The only file to review is src/smoc_schema/schema/smoc_schema.yaml. All other files are auto-generated from this one.

This PR makes some class names clearer by taking inspiration from SPHN conventions.
Note that we cannot be fully compliant with SPHN 2024 for the DataFile, as we aim to support Zarr arrays. This means that an assay's outputs may be arrays nested inside a Zarr file instead of individual files.

Here is the schema without showing subclasses:

erDiagram
StudyCollection {

}
Study {
    datetime start_date  
    datetime completion_date  
    uriorcurie id  
    string name  
    string description  
}
Assay {
    OmicsTypeList omics_type  
    uriorcurie id  
    string name  
    string description  
}
DataEntity {
    uri location  
    DataFormat data_format  
    uriorcurie id  
    string name  
    string description  
}
ReferenceGenome {
    uri location  
    integerList taxon_id  
    uri source_uri  
    uriorcurie id  
    string name  
    string description  
}
ReferenceSequence {
    string sequence_md5  
    uri source_uri  
    uriorcurie id  
    string name  
    string description  
}
Sample {
    integerList taxon_id  
    stringList collector  
    uriorcurie id  
    string name  
    string description  
}

StudyCollection ||--}o Study : "entries"
Study ||--}o Assay : "has_assay"
Assay ||--}o Sample : "has_sample"
Assay ||--}o DataEntity : "has_data"
DataEntity ||--}o Sample : "has_sample"
DataEntity ||--|o ReferenceGenome : "has_reference"
ReferenceGenome ||--}o ReferenceSequence : "has_sequence"

See here for a diagram of the full schema, including subclasses and cardinalities.

supermaxiste · 2024-01-09T10:09:02Z

Thank you for letting me review the schema @cmdoret!
praise: the schema seems to be covering all the necessary concepts and all of my suggestions are related to changing and adding stuff and clarifications 🎉
note: I'm aware that you're providing a schema to have a starting point. Feel free to ignore some points if they go beyond this goal. Those are labeled suggestion (development).

Major

thought/issue: What seems to be missing in the schema is anything related to experiments, and I'm not sure whether this is planned. My main point is that some of the data might have specific conditions (which I call "experiment") that influence the outcome, particularly for transcriptomic, proteomic and metabolomic data. Do we want this information in the schema, or are we focusing specifically on bundling data and that's it?
I'm not sure it's relevant, but I'd like to point it out because it might be. I also have some suggestions in case this might be relevant, so let me know!
thought/issue: I'm wondering if we can keep track of data updates within the schema too? I'm thinking specifically about DataEntity, where we may want to know whether the data was updated. This can result from errors, or from updates when the reference changes.

Minor

suggestion (development): sample is currently defined minimally with taxon_id and collector. I would recommend trying to follow BioSamples, specifically terms such as organism, tissue, sex, strain, cell_type, etc. The BioSamples standard is used by the 3 largest databases in the world for bio stuff: NCBI (US), DDBJ (Japan) and ENA (Europe).
clarification (concept): DataEntity has a property location defined as "The uniform resource identifier to access a resource, either on the web or the filesystem." Local paths and web links can be treated very differently; are we sure we want location to cover both? I'm also thinking that the term location could be confused with the physical location, which is another piece of information that might also be included. My suggestion would be either to have sub-concepts for location that say exactly what we mean (e.g. location hasType fileSystemPath), or to split location directly into web and path.
suggestion (development): ReferenceSequence doesn't include any version information which I think would be practical and useful. It would be nice to add a property specifying the version of the reference used.
nitpicky (non-blocking): just because I worked on it, I'm wondering if Epigenomics falls under Genomics in OmicsType? The data is quite different, but it might not be the focus of the project anyway.
clarification (definition): currently has_data has a range DataEntity which includes Array, AlignmentSet and VariantSet. What is Array there for? Is it coming from Proteo-, Metabolomics?
question (development, non-blocking): for taxon_id were you thinking of using the NCBI Taxonomy?
suggestion (development): Studies currently have a start and completion date, if there's an ongoing study, should there be a lastUpdate property?

cmdoret · 2024-01-10T00:29:20Z

Thanks @supermaxiste, these are excellent points! :)

cmdoret · 2024-01-10T00:29:34Z

Supporting experiments
⏳ Indeed, this is maybe too ambitious for this PR, but I see 2 ways to structurally add these "experimental conditions": add some property on the sample/assay, or do it via a SampleProcessing class like SPHN. I imagine the values could be free text and/or codes. We will definitely have to tackle this at some point, and I created a separate issue to keep track of it.

cmdoret · 2024-01-10T00:29:45Z

Tracking data changes
⏳ We could indeed represent processing / transformation that the data went through. This will probably come at a later stage, as this is not strictly required to have a functional object (but makes it more FAIR) and I kept track of it in another issue.

cmdoret · 2024-01-10T00:29:52Z

Biosamples
✅ Thanks for the suggestion! The biosamples website also refers to bioschemas:BioSample, on which smoc:Sample is loosely based. We will follow this model focusing on human-centric properties.

cmdoret · 2024-01-10T00:50:29Z

Data location
❓ Deciding how to handle a URI is often done based on the "scheme" of the URI (the first part before ://, e.g. s3://, file://, ftp://). This was inspired by sphn:has_uniform_resource_identifier. I was hoping to avoid multiplying properties in order to reduce code complexity downstream.
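
To illustrate the scheme-based handling described above, here is a minimal sketch using only the Python standard library. The function name `classify_uri`, the backend labels and the scheme table are hypothetical and not taken from smoc-api; treating a value without a scheme as a local path is also an assumption.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a storage backend label.
BACKENDS = {
    "file": "local",
    "s3": "object-store",
    "ftp": "remote",
    "http": "remote",
    "https": "remote",
}


def classify_uri(uri: str) -> str:
    """Return a backend label based on the scheme of `uri`."""
    scheme = urlparse(uri).scheme
    if not scheme:
        # Assumption: a value without a scheme is a plain filesystem path.
        return "local"
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    return BACKENDS[scheme]
```

With a single property, downstream code only needs one dispatch point like this instead of one code path per property. One caveat of this sketch: a Windows path such as `C:\data\x.cram` would be parsed as having the scheme `c`, so a real implementation would need to handle that case.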

I agree that the naming is problematic. Other options I considered were uri, path, at_location, access_path, access_location, content_uri, content_location, data_path, data_uri. Any suggestion / preference?

EDIT: temporarily went with data_path to disambiguate from physical location. Let me know if you can think of something better.

cmdoret · 2024-01-10T10:25:08Z

ReferenceSequence version
✅ Indeed, references now have an optional version property with string value.

cmdoret · 2024-01-10T10:29:53Z

Using ncbi taxonomy
⏳ For now we will keep an integer, but yes, we eventually want to move to controlled vocabularies. Created an issue for this: #6

Based on your advice, we also added source_material, sex and cell_type properties for sample. For now they take string as values, but we will want to use controlled sets as well.

cmdoret · 2024-01-10T10:31:42Z

Epigenomics
❌ For now we are limiting the scope to pure genomics (aka WGS, WES, ...)

cmdoret · 2024-01-10T10:33:57Z

has_data with Array
ℹ️ Array is a multi-dimensional array contained inside the zarr archive. It is not a file, which is why we don't use DataFile, but DataEntity as superclass.

In practice, yes it will likely be used for {prote,metabol}omics data, but not only.

cmdoret · 2024-01-10T10:40:48Z

Study completion_date
Based on feedback from our collaborators, the Study object has been replaced with Multi-Omics Digital Object (abbr. MODO).

In the process, we removed properties associated with studies, incl. start/completion date, and went with "Creation date" instead.

EDIT: added last_modified_date property on MODO

supermaxiste · 2024-01-10T15:33:23Z

To wrap things up from my previous comments:

Major

Issue for tracking ✅
Issue for tracking ✅

Minor

Integrated ✅
DataLocation term❓
I would suggest access_path or access_location because I think in this case two words are better than one 👍
Integrated ✅
Clarified ✅
Array ℹ️
question: just to make sure I get it right, we're now defining AlignmentSet and VariantSet which are the only two data types I can think of in genomics and transcriptomics. Since I have extremely limited knowledge on proteo- and metabolomics, I guess they might also have their own "standard sets". Is Array there to make sure to include other data types that we don't know yet or is it a placeholder for other sets besides AlignmentSet and VariantSet?
For context: to me "Array" is a very generic word so I'm trying to pin down what we're aiming for more or less here.
Issue for tracking ✅
Integrated ✅

cmdoret · 2024-01-10T19:37:07Z

Thanks @supermaxiste !
You're right, Array is a placeholder and we'll likely subclass it or make additional variants for proteo/metabolomics stuff once we know exactly what's required 👍

For now I've already replaced location with data_path in the schema + companion API, and we'll use it for the prototype, but the schema will likely be revisited later.

supermaxiste

Thank you for all of your replies @cmdoret!

I'll provide a couple more in-depth comments here; you can decide whether to turn them into issues or address them in this PR. To some extent they address clarity, but I'll let you be the judge.

suggestion (ontology): For some of the genomic terms such as "Reference Genome" and "Reference Sequence" we could also point to the GENO Ontology since they provide some nice definitions too. With "point to" I mean adding a see_also property pointing to GENO entries:

Reference sequence: http://purl.obolibrary.org/obo/GENO_0000017 (link)
Reference genome: http://purl.obolibrary.org/obo/GENO_0000914 (link)

note: I checked for alignment set and variant set but couldn't find anything there

question (cardinalities): in the project/ folder I saw a bunch of files with different formats and I noticed that some include shacl shapes. This led me to check cardinalities, and I don't think it's very clear what needs to be there and what does not. In the overall diagram you shared, it looks like no class requires any other class to exist, and I'm wondering if that should be changed. If you add a MODO, shouldn't it include at least 1 Assay with at least 1 DataEntity? Or are we thinking in a "placeholder" way where people can create empty MODO objects and fill them up later?
On top of this, it's not clear right now 2.1) which properties are mandatory and 2.2) where the information is coming from, because the schema includes some required entries, but the shacl shapes seem to go further than that.

supermaxiste · 2024-01-11T09:56:55Z

To end on a sweet note:
praise (ontology): I double checked the ontology codes for file format and -omics to make sure they're indeed pointing at the right corresponding concept and all was perfect!
praise (repository): Excellent cookie cutter choice and I have to say that I like that the project/ folder offers so many different formats that users can interact with. I had some fun checking all the formats out of curiosity.

cmdoret · 2024-01-12T10:08:31Z

Thanks @supermaxiste !

Reference sequence meaning added in 7c85ca7
The reason for non-mandatory Assay / Data is indeed so that we can create a placeholder MODO container and fill it interactively, which would otherwise not be possible.
Mandatory properties are all those marked as required in the smoc_schema.yaml file which acts as the single source of truth for the schema.
I can see why this is confusing: in the yaml schema, all properties are optional and single-valued by default. In SHACL, all properties are optional and multivalued by default. Specifically, these are the possibilities when shacl cardinality constraints are generated from the yaml:
- default -> sh:maxCount 1
- required -> sh:minCount 1 sh:maxCount 1
- multivalued -> no constraint
- required + multivalued -> sh:minCount 1
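
The four cases above can be restated as a small Python function. This is only an illustration of the mapping as described in this comment, not the code of the actual generator; the function name `shacl_cardinality` and the dictionary representation are hypothetical.

```python
def shacl_cardinality(required: bool = False, multivalued: bool = False) -> dict:
    """Map yaml slot flags to the SHACL count constraints listed above."""
    constraints = {}
    if required:
        # A required slot must have at least one value.
        constraints["sh:minCount"] = 1
    if not multivalued:
        # A single-valued slot (the yaml default) has at most one value.
        constraints["sh:maxCount"] = 1
    return constraints
```

Written this way, the two flags are independent: `required` only controls `sh:minCount` and `multivalued` only controls `sh:maxCount`, which is why a multivalued, non-required slot ends up with no constraint at all.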

supermaxiste · 2024-01-12T10:15:50Z

Thank you for all the work @cmdoret, your clarification about the yaml makes it fully clear now. I also found out that the docs specify all the cardinalities nicely too.

All looking good to me now 🌞

refactor: use clearer class names following SPHN conventions

49658fb

cmdoret requested a review from supermaxiste December 22, 2023 10:34

cmdoret self-assigned this Dec 22, 2023

cmdoret added 18 commits December 22, 2023 13:06

feat: add cardinalities

7d76e56

chore: regenerate schemas+classes

b6d45a9

fix: format -> data_format to avoid shadowing python built-in keyword

d9b40d2

chore: regenerate project

eb09c01

chore: regenerate project (bis)

67dedc0

test: fix test instance

a18c67d

chore: bump linkml to 1.6.7 to fix enum instantiation

eeb02fd

refactor: rename uri->location to avoid name collision with uri type

b02979b

chore: regen

96c64eb

fix: add missing omics_type on assay

8cd6e45

chore: regen

a07f11d

fix: make most classes inherit from NamedEntity to be addressable.

0c47d38

refactor: location moved to ReferenceGenome

13d29f0

refactor: default prefix smoc_schema -> smoc

a11d574

feat: make has_... subproperties of schema:hasPart

9d994a6

chore: regen

8931625

docs(readme): update mmd diagram

97ef786

doc: add author field

f7c6305

cmdoret added 2 commits January 10, 2024 10:50

feat: add version slot for references

c42da64

feat: add sample properties sex, source_material and cell_type

611050e

refactor: rm study properties from Modo object

f440527

cmdoret added 7 commits January 10, 2024 11:44

fix: typos in schema

dc7f4a0

feat: add creation/update dates

9639499

chore: regenerate project

c00fd5a

refactor: location -> data_path

83561dc

chore: regen

23244f6

test: update import StudyCollection -> MODOCollection

96581ea

test: update test data Study->MODO

169054e

supermaxiste reviewed Jan 11, 2024

docs: add meaning binding to GENO for reference genome/sequence

7c85ca7

cmdoret added 2 commits January 12, 2024 11:10

chore: regen

ff2e144

docs(readme): update er diagram

e75ce4d

supermaxiste approved these changes Jan 12, 2024

cmdoret merged commit 3b5e996 into main Jan 12, 2024
2 checks passed

cmdoret deleted the refactor/sphn branch October 7, 2024 09:27
