C&P/GKS/Discovery components #67

Open
mcourtot opened this issue Oct 11, 2018 · 10 comments

@mcourtot

  1. Minimum core metadata attributes for search. There was a spreadsheet with a minimal set that was circulated on the discovery side which I can’t find again. I think it may have been from Tony Brookes? It has only 20 attributes or so and could provide a good starting point to make sure C&P covers those appropriately.
  2. General metadata schema (to be worked on by the C&P side) should accommodate work from the Variant group (GKS)
@mcourtot
Author

To clarify, I am not proposing Discovery develops a new metadata model, rather that it aligns with the SchemaBlocks at https://github.com/ga4gh-metadata/schemas and informs their further development.

@mbaudis

mbaudis commented Oct 11, 2018

@mcourtot +1 - We want the requirements from SearchAPI, Beacon & data exchange products (think phenopackets PXF) to inform the development of standards in SchemaBlocks, which then should serve as the reference for "product" developments.

(Maybe this isn't the final place for the code; could be starting point for a work stream subgroup...)

@Relequestual
Member

I think there's a lot of overlap here between what Discovery has been asked to do regarding its component architecture and component data model definitions, and the MetaData Schema Blocks.

I need to discuss this further with @mfiume.

I'll be putting out a Discovery Search API work retrospective soon regarding the standard not getting approved at the SC meeting in Basel. Part of the feedback I heard was that we need to be clear about where components are defined, by whom, and how the process works.

I need to write up some documents detailing those aspects.

My personal expectation is that these processes will change, and obviously there are others who want to define schemas or blocks of data for interoperability.

Search (and more broadly, Discovery) does have specific requirements, but from my perspective these are less about the models, and more about the format of the models and how they can be used. For example, using JSON Schema to make the components unambiguous and automatically validatable.
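
As a rough illustration of what "unambiguous and automatically validatable" could look like, here is a minimal Python sketch. The component (an ontology term with `id` and `label`), the schema contents, and the helper `check` are all hypothetical and not taken from any GA4GH document. The checker handles only `type: object`, `required`, and string-typed properties; a real setup would presumably rely on a full JSON Schema validator rather than this hand-rolled subset.

```python
import json

# Hypothetical component definition, written as a JSON Schema document.
# The property names are illustrative assumptions, not an agreed model.
ONTOLOGY_TERM_SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["id"],
  "properties": {
    "id": {"type": "string"},
    "label": {"type": "string"}
  }
}
""")


def check(instance, schema):
    """Return a list of error messages (empty when the instance passes).

    Supports only a tiny subset of JSON Schema: object type,
    required properties, and string-typed properties.
    """
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return ["instance is not an object"]
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name in instance and rules.get("type") == "string":
            if not isinstance(instance[name], str):
                errors.append(f"{name}: expected a string")
    return errors


if __name__ == "__main__":
    print(check({"id": "ncit:C3324"}, ONTOLOGY_TERM_SCHEMA))
    print(check({"label": "no id given"}, ONTOLOGY_TERM_SCHEMA))
```

The point of the sketch is only that the same schema document can be read by people and applied mechanically to every payload, whichever schema language ends up being chosen.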

It looks like Schema Blocks may have intended to make machine readable schemas, but has missed that mark. I'm hoping the documents I draw up regarding process and format of schemas can help with this moving forward, for wherever these schemas sit.

@mbaudis

mbaudis commented Oct 22, 2018

> It looks like Schema Blocks may have intended to make machine readable schemas, but has missed that mark.

No, actually not (intended to be machine readable). It is just "human readable but decent in consistency", waiting to be picked up for formalisation ... Format so far "informed by" OpenAPI, but more in a pseudo-code way.

@Relequestual
Member

@mbaudis OK, that's actually good news from my perspective!

Payload definitions in OpenAPI are actually a sub/superset of JSON Schema, which is better suited for individual model definitions for data represented in JSON.

There are lots of aspects at play here regarding model definitions and formalisation. Happy to discuss more in detail if you think that's useful right now.

@mcourtot
Author

The intent was indeed to have a human readable spec, but leave the implementation free. For example, Phenopackets uses protobuf, but for my purposes I want to use JSON as we are also working on a common validator (with Elixir/HCA).

With respect to Discovery search I would like to have a consensus on attributes to be used, i.e., the Minimum core metadata attributes for search mentioned in 1. above. Anything that can be reused should be, anything that is missing could be added. It'd be good to discover the same things consistently :)

@Relequestual
Member

I think one of the aspects here is that you're defining models which can also be used for storage, and not just for transmission of data, right? That's a tangent anyway.

Could you expand on what you mean by "metadata attributes"?
Maybe you could give an example of what you think one COULD be.
I think maybe we have a different understanding of "metadata".

In terms of the previously mentioned spreadsheet sent round by Tony, it's something we still need to consider, but it doesn't reflect anything in terms of group requirements. There is work to be done.

@mcourtot
Author

In his very first email in June 2018 @mfiume said "The GA4GH Discovery Work Stream is incubating a new standard for data discovery. Many of you have or are currently developing data exploration portals for faceted search of genomics and clinical data, and we think there is value in developing a set of standards to create a common API for data exploration and/or defining and harmonizing metadata (e.g. sex vs. gender) so that searches across the universe of genomics datasets can be made to be more consistent."
This is what I mean by metadata - we should agree on having an attribute 'biological sex' (or whichever label we want to agree on), which will be the one we will be able to search against. Expected values are one of [male, http://purl.obolibrary.org/obo/PATO_0000384; female ....]

This would allow data providers to have a GA4GH compliant API over their own data sources. This GA4GH API would provide the federated search capabilities as well as common export formats such as phenopackets.
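
To make the idea a bit more concrete, here is a minimal Python sketch of how a provider might map its own fields onto one agreed attribute. It is only an illustration under stated assumptions: the local field names (`sex`, `donor_sex`), the output shape, and the function `harmonize` are hypothetical, and the permitted values list contains only the one label/IRI pair spelled out above, since the rest has not been agreed here.

```python
# Agreed attribute and its permitted values (label -> ontology IRI).
# Only the value given in the comment above is listed; further values
# would be added once the group settles on them.
AGREED_ATTRIBUTE = "biological sex"
PERMITTED_VALUES = {
    "male": "http://purl.obolibrary.org/obo/PATO_0000384",
}

# Hypothetical mapping from one provider's local field names
# to the agreed attribute name.
LOCAL_FIELD_ALIASES = {
    "sex": AGREED_ATTRIBUTE,
    "donor_sex": AGREED_ATTRIBUTE,
}


def harmonize(record):
    """Rename local fields to the agreed attribute and attach the ontology IRI.

    Fields without an alias are passed through unchanged.
    Raises ValueError for a value outside the permitted list.
    """
    out = {}
    for field, value in record.items():
        target = LOCAL_FIELD_ALIASES.get(field.lower(), field)
        if target == AGREED_ATTRIBUTE:
            label = str(value).strip().lower()
            if label not in PERMITTED_VALUES:
                raise ValueError(
                    f"{value!r} is not a permitted value for {AGREED_ATTRIBUTE!r}"
                )
            out[target] = {"label": label, "id": PERMITTED_VALUES[label]}
        else:
            out[target] = value
    return out


if __name__ == "__main__":
    print(harmonize({"Donor_Sex": "Male", "age": 40}))
```

With a mapping like this in front of each data source, a federated search for the agreed attribute would hit the same field and the same ontology identifiers everywhere, regardless of what each source calls it internally.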

Did you mean something different? If you want we could try and have a quick f2f and see where there is intersection then report to the group?

@mbaudis

mbaudis commented Oct 22, 2018

@Relequestual Metadata attributes: "Everything but the sequence" (well, sequence and associated positioning, quantitative elements). So it could mean phenotypic attributes, disease codes, time attributes, geolocation data ... See https://ga4gh-metadata.github.io for examples.

For Beacon (but anywhere else "data search to transmit" for that matter), it is not sufficient (in the long run) to have some attribute being queried; it has to be scoped, too. If you query a disease code, it has to be clear if this is directed at a biosample level, or at the individual:

  • "breast cancer" in "biosample" => variants => somatic variants for driver gene identification
  • "breast cancer" in "individual" from GWAS dataset => germline variants associated with predisposition

So in the current Beacon demonstrator (click "CNV example"), the "metadata" query (cancer ontology codes) is directed against the "biosamples" annotation level.

Therefore, one doesn't only need the proper terminology for an attribute (e.g. "biocharacteristics.type.id" : "ncit:C3324"), but also the object scope ({ "biosample_query" : { "biocharacteristics.type.id" : "ncit:C3324" } })
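
A minimal Python sketch of the scoping idea, under stated assumptions: the toy records, the `"<scope>_query"` key convention as a dispatch rule, and the helpers `values_at` and `run_scoped_query` are made up for illustration and do not describe how the Beacon demonstrator is implemented. The code `example:other-code` is a placeholder, not a real ontology term.

```python
# Toy data store with two object scopes. Record contents are invented.
RECORDS = {
    "biosample": [
        {"id": "bs-1", "biocharacteristics": [{"type": {"id": "ncit:C3324"}}]},
        {"id": "bs-2", "biocharacteristics": [{"type": {"id": "example:other-code"}}]},
    ],
    "individual": [
        {"id": "ind-1", "biocharacteristics": [{"type": {"id": "ncit:C3324"}}]},
    ],
}


def values_at(obj, dotted_path):
    """Collect every value reachable via a dotted path, descending into lists."""
    current = [obj]
    for key in dotted_path.split("."):
        found = []
        for item in current:
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    found.append(candidate[key])
        current = found
    return current


def run_scoped_query(query, records=RECORDS):
    """Evaluate a query of the form {"<scope>_query": {path: expected_value}}.

    Returns a dict mapping each queried scope to the ids of matching records.
    """
    suffix = "_query"
    hits = {}
    for scope_key, conditions in query.items():
        scope = scope_key[: -len(suffix)] if scope_key.endswith(suffix) else scope_key
        hits[scope] = [
            rec["id"]
            for rec in records.get(scope, [])
            if all(expected in values_at(rec, path) for path, expected in conditions.items())
        ]
    return hits


if __name__ == "__main__":
    q = {"biosample_query": {"biocharacteristics.type.id": "ncit:C3324"}}
    print(run_scoped_query(q))
```

The same attribute path and the same code return different record sets depending on whether the query is wrapped as `biosample_query` or `individual_query`, which is the distinction the two bullets above describe.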

There was a use case assessment done previously by the C/P group (I need a pointer to it again). But anyway, a lot of the pieces are preformed & just have to be put, gradually, in place (using your preferred schema language ...).

@Relequestual
Member

@mcourtot Thanks for the clarification, and thanks @mbaudis too, I think this is now clearer to me.
Meta depends on the context, so I need to be clear what the context is. I try to prefix "meta" to give context, like "Search meta".

I think scoping objects as you described above is interesting, and it is a discussion we need to have, but not one I'm keen to have right now, at least not in the long form required to do the problem justice. My hope is that scoping can be done differently from your example, but I'm not sure that's possible, and I know Tony Brookes has strong opinions on this also.

@mcourtot I'm not sure we now need a face to face, but appreciate the offer. I expect we will at some point in the next month or so!

@mbaudis I'm limited in how much time I can put in, so I need to focus on core work for now. I want to table the scoping discussion, but I fully acknowledge it's one that needs to be had!
