Transcript/Genome alignments (with gaps) - Necessary for RefSeq HGVS c./g. conversion #81

Open
davmlaw opened this issue Sep 19, 2024 · 1 comment

davmlaw commented Sep 19, 2024

Hi, this project looks good! Thanks!

I would like to use Tark as a source of transcripts for the Biocommons HGVS Python library.

RefSeq transcripts can differ from the genome sequence, so they can align to a genome build with indels.

For instance, NM_001205122.2 (ATG13) aligned to GRCh38 has a 2 bp deletion in exon 15 (the alignment is a 509 bp match, a 2 bp deletion, then a 1753 bp match). This is critical to know when converting between genomic (g.) and coding (c.) HGVS, so that coordinates can be adjusted for these gaps.

I have already done this in my own project, cdot, which reads RefSeq/Ensembl GFF/GTF files. Ideally I would like to stop maintaining this myself and move over to Tark.

E.g. https://cdot.cc/transcript/NM_001205122.2 has this alignment info (in Biocommons HGVS style):

[46672254, 46674518, 14, 1635, 3896, "M509 D2 M1753"]
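To make the gap string concrete, here is a minimal sketch of how it could be parsed and used to shift coordinates within this exon. The function names are hypothetical (they are not part of cdot, Biocommons HGVS, or Tark). It assumes `M` consumes both genome and transcript and `D` consumes genome only, which matches the numbers in the record above. The handling of `I` (transcript only) is an assumption, and only the plus strand is considered.

```python
import re
from typing import List, Tuple


def parse_gap_string(gap: str) -> List[Tuple[str, int]]:
    """Parse a gap string such as 'M509 D2 M1753' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognised gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops


def span_lengths(ops: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Return (genomic_length, transcript_length) covered by the alignment.

    Assumption: M consumes both, D consumes genome only, I transcript only.
    """
    genome = sum(n for op, n in ops if op in "MD")
    transcript = sum(n for op, n in ops if op in "MI")
    return genome, transcript


def exon_offset_to_genome_offset(ops: List[Tuple[str, int]], tx_offset: int) -> int:
    """Map a 0-based offset in the exon's transcript sequence to a 0-based
    offset in the exon's genomic span (plus strand only)."""
    g = t = 0  # bases consumed so far on genome / transcript
    for op, n in ops:
        if op == "M":
            if tx_offset < t + n:
                return g + (tx_offset - t)
            g += n
            t += n
        elif op == "D":
            g += n  # genome bases with no transcript counterpart
        else:  # "I": transcript bases with no genome counterpart
            if tx_offset < t + n:
                raise ValueError("inserted base has no genomic position")
            t += n
    raise ValueError("offset lies outside the exon")


ops = parse_gap_string("M509 D2 M1753")
genome_len, tx_len = span_lengths(ops)
```

For the exon above this gives a genomic length of 2264 (equal to 46674518 - 46672254) and a transcript length of 2262 (equal to 3896 - 1635 + 1), and any transcript offset past the first 509 bases is shifted by 2 on the genome.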

As far as I can see, Tark doesn't have this yet:

https://tark.ensembl.org/api/transcript/?stable_id=NM_001205122&stable_id_version=2&expand_all=true

                {
                    "exon_id": 73193759,
                    "stable_id": "exon-NR_144423.2-19",
                    "stable_id_version": 1,
                    "assembly": "GRCh38",
                    "loc_start": 46672255,
                    "loc_end": 46674518,
                    "loc_strand": 1,
                    "loc_region": "11",
                    "loc_checksum": "F44BD3F6F8F8764182282A78AE315772F78ECCF8",
                    "exon_checksum": "55D9C6A38CC3510856809E31ED688BB19C01786A",
                    "exon_order": 15
                }
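
For reference, a small sketch of how the query URL above can be built and how the exon coordinates compare with the cdot record. The helper names are hypothetical. Based only on the two records shown here, it assumes Tark's `loc_start`/`loc_end` are 1-based inclusive, whereas the cdot start is 0-based.

```python
from typing import Tuple
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in this issue.
TARK_TRANSCRIPT_ENDPOINT = "https://tark.ensembl.org/api/transcript/"


def tark_transcript_url(accession: str) -> str:
    """Build the Tark query URL for a versioned accession like 'NM_001205122.2'."""
    stable_id, _, version = accession.partition(".")
    query = urlencode(
        {
            "stable_id": stable_id,
            "stable_id_version": version,
            "expand_all": "true",
        }
    )
    return f"{TARK_TRANSCRIPT_ENDPOINT}?{query}"


def tark_exon_to_zero_based(exon: dict) -> Tuple[int, int]:
    """Convert a Tark exon record to a 0-based, half-open (start, end) pair.

    Assumption: loc_start and loc_end are 1-based inclusive.
    """
    return exon["loc_start"] - 1, exon["loc_end"]


url = tark_transcript_url("NM_001205122.2")
start, end = tark_exon_to_zero_based({"loc_start": 46672255, "loc_end": 46674518})
```

Under that assumption the exon converts to (46672254, 46674518), the same genomic span as the cdot entry, so the only missing piece is the gap string.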

Could you please add these alignment strings to RefSeq transcript exons? Knowing about mismatches would also be useful.

I hope to write a JSON client for HGVS that will only be enabled for Ensembl to start with. Thanks!

@davmlaw changed the title from "Transcript/Genome alignments (CIGAR) - Necessary for RefSeq HGVS conversion" to "Transcript/Genome alignments (with gaps) - Necessary for RefSeq HGVS c./g. conversion" on Sep 20, 2024

davmlaw commented Oct 9, 2024

Hi, I've made an initial implementation of the biocommons HGVS TARK loader - review/comments would be very helpful!

I compare the TARK transcript sequence to the sequence obtained by concatenating the exon sequences from the genome. If they differ, I report that transcript/genome alignment as unsupported, so at least we don't get it wrong.
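
The check described above could look roughly like this. It is a simplified sketch and not the actual loader code: the function names are hypothetical, exons are assumed to be 0-based half-open intervals in transcript order, and only the plus strand is handled (a minus-strand transcript would also need reverse-complementing).

```python
from typing import Sequence, Tuple


def spliced_genome_sequence(chrom_seq: str, exons: Sequence[Tuple[int, int]]) -> str:
    """Concatenate the genomic sequence of each exon, in transcript order."""
    return "".join(chrom_seq[start:end] for start, end in exons)


def alignment_is_ungapped(
    transcript_seq: str, chrom_seq: str, exons: Sequence[Tuple[int, int]]
) -> bool:
    """True if the transcript equals the spliced genome sequence exactly.

    Any indel or mismatch makes this False, in which case the transcript
    would be reported as unsupported rather than converted incorrectly.
    """
    spliced = spliced_genome_sequence(chrom_seq, exons)
    return transcript_seq.upper() == spliced.upper()


# Toy data, not real sequence.
toy_chrom = "AAACCCGGGTTTAAA"
toy_exons = [(3, 6), (9, 12)]
matches = alignment_is_ungapped("CCCTTT", toy_chrom, toy_exons)
gapped = alignment_is_ungapped("CCCTT", toy_chrom, toy_exons)
```

This is deliberately conservative: it rejects transcripts that only differ by a mismatch as well as those with gaps, since without alignment strings the two cases cannot be told apart.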
