support ga4gh identifiers #90

Open
jmmut opened this issue Sep 2, 2020 · 0 comments
Collaborator

jmmut commented Sep 2, 2020

from the refget paper:

Refget defines three supported identifier algorithms; MD5, TRUNC512 and GA4GH Identifier. All three algorithms normalise sequence input by stripping all whitespace characters and restricting to characters in the range A-Z. We chose this character range as a compromise between the methods and requirements employed by CRAM, ENA and the Variation Representation Specification (VRS) [10]. MD5 is the default checksum algorithm used by the CRAM format’s M5 tag and hence the CRR. It is provided for backwards compatibility with existing CRAM files. However, there are limitations to md5’s algorithm the occurrence of a checksum collision between non-identical sequences would be catastrophic. To mitigate this concern, we co-developed two schemes with the Genomic Knowledge Standards’ Variation Representation Specification (VRS) based on the SHA-512 checksum algorithm called TRUNC512 and GA4GH identifier. Both schemes use the first 24 bytes of a SHA-512 digest. TRUNC512 chooses to represent this as a hex encoded string. GA4GH identifier converts these bytes into a base64 URL encoded string formatted as “ga4gh:SQ.XXXX”. Both algorithms are interchangeable since both represent the same underlying SHA-512 digest, however the GA4GH identifier is preferred to maintain VRS compatibility.
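To make the quoted scheme concrete, here is a rough sketch of how the three identifiers could be derived from one sequence. The function names `normalise` and `sequence_ids` are made up for illustration and are not part of any refget library. Rejecting lowercase input, rather than upper-casing it, is my assumption, since the excerpt only says the input is restricted to A-Z.

```python
import base64
import hashlib
import re

GA4GH_PREFIX = "ga4gh:SQ."  # prefix as given in the quoted excerpt


def normalise(sequence: str) -> str:
    """Strip all whitespace and require the remaining characters to be A-Z."""
    stripped = "".join(sequence.split())
    # Assumption: anything outside A-Z (including lowercase) is an error.
    if not re.fullmatch(r"[A-Z]*", stripped):
        raise ValueError("sequence contains characters outside A-Z")
    return stripped


def sequence_ids(sequence: str) -> dict:
    """Return the md5, trunc512 and ga4gh identifiers for one sequence."""
    data = normalise(sequence).encode("ascii")
    # Both SHA-512 based schemes share the first 24 bytes of the digest.
    truncated = hashlib.sha512(data).digest()[:24]
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "trunc512": truncated.hex(),
        "ga4gh": GA4GH_PREFIX + base64.urlsafe_b64encode(truncated).decode("ascii"),
    }
```

Because 24 bytes is a multiple of 3, the base64 URL encoding is always 32 characters long and never needs `=` padding.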

I thought that refget only used trunc512 and md5, but it seems we should support the GA4GH identifiers as well. Luckily, I think we can keep storing just trunc512 and md5 as we do at the moment, and allow searches by ga4gh id by transforming it on the fly to the trunc512 id.
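A minimal sketch of that on-the-fly transformation, assuming the id has the `ga4gh:SQ.` prefix followed by the base64 URL encoding of the same 24 bytes that trunc512 shows as hex. The function names are hypothetical and not taken from this codebase.

```python
import base64
import binascii

GA4GH_PREFIX = "ga4gh:SQ."
DIGEST_BYTES = 24  # both schemes use the first 24 bytes of the SHA-512 digest


def ga4gh_to_trunc512(ga4gh_id: str) -> str:
    """Convert 'ga4gh:SQ.<base64url>' into the equivalent trunc512 hex string."""
    if not ga4gh_id.startswith(GA4GH_PREFIX):
        raise ValueError("not a ga4gh sequence identifier")
    encoded = ga4gh_id[len(GA4GH_PREFIX):]
    try:
        digest = base64.urlsafe_b64decode(encoded)
    except binascii.Error as error:
        raise ValueError("invalid base64 in ga4gh identifier") from error
    # Also catches inputs where stray characters were silently dropped.
    if len(digest) != DIGEST_BYTES:
        raise ValueError("ga4gh identifier must encode exactly 24 bytes")
    return digest.hex()


def trunc512_to_ga4gh(trunc512: str) -> str:
    """Inverse conversion, e.g. for returning ga4gh ids in responses."""
    digest = bytes.fromhex(trunc512)
    if len(digest) != DIGEST_BYTES:
        raise ValueError("trunc512 must be 48 hex characters")
    return GA4GH_PREFIX + base64.urlsafe_b64encode(digest).decode("ascii")
```

With this, a search by ga4gh id would only need to call `ga4gh_to_trunc512` before the existing trunc512 lookup, so nothing extra has to be stored.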
