Investigate clinical annotations without RS IDs #28

Open
apriltuesday opened this issue Oct 23, 2023 · 6 comments

Comments

@apriltuesday
Collaborator

For example, CYP2C9*1.
Look into the data and discuss how we should represent these.

@apriltuesday
Collaborator Author

apriltuesday commented Dec 21, 2023

Some earlier investigation here, particularly the use of PharmVar.

See also representation in PharmGKB itself, including an allele definition spreadsheet we could leverage.

@apriltuesday
Collaborator Author

Notebook here, looking at how named/star alleles are annotated and a bit at how we might resolve them to specific variants. Summary and some questions are at the bottom.

@M-casado @tcezard any thoughts? What else should we look into before discussing with OT?

@tcezard
Member

tcezard commented Jan 10, 2024

The Notebook looks great and you clearly highlight the main issues.
My main concern is around the different ways the phenotype is associated with:

  • a genotype, like here or here
  • a single allele, where the text describes the effect of that allele, like here
  • a single allele, where the text describes the effect of the hemizygous, heterozygous and homozygous states, like here

We could pass that heterogeneity on to OT, but I think it is worth highlighting.

@M-casado
Collaborator

As always, very good job on the thorough analysis with the metrics in a notebook, @apriltuesday. I also have to say that I like all the comments in it, it makes it almost like a novel of a coder's mind when exploring data.

> is it safe to just comma-split these strings

I'm pretty sure it's not, given that some names seem to have commas between them (e.g. Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham). Perhaps I would actually use the "gene name" as the breaking token, since they seem to add it at the beginning of each variant name.
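A minimal sketch of the "gene name as the breaking token" idea: split only on commas that are immediately followed by the gene symbol. The function name is hypothetical, the example string is illustrative (its last allele name is a made-up placeholder), and it assumes every allele name really does start with the gene symbol.

```python
import re


def split_named_alleles(allele_field, gene):
    """Split a multi-allele string on commas that precede the gene symbol.

    Commas inside a single allele name (e.g. a list of city names) are kept,
    because they are not followed by the gene symbol.
    """
    separator = re.compile(r",\s*(?=" + re.escape(gene) + r")")
    return [part.strip() for part in separator.split(allele_field) if part.strip()]


# Illustrative input only; "G6PD PlaceholderAllele" is not a real allele name.
example = "G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham, G6PD PlaceholderAllele"
for allele in split_named_alleles(example, "G6PD"):
    print(allele)
```

This would still fail for any record where the gene symbol is omitted from some allele names, so it is worth counting how many split results do not start with the gene symbol.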

Now that I've seen the PGKB API (api.pharmgkb.org/v1), I reckon it may be feasible to parse their spreadsheets and extract some metrics from them. An approach we took at the EGA for a similar issue was to parse the first column of the spreadsheets, assign row numbers to the rows of interest (e.g. rsID) and interpret them that way. That said, for now there are not that many genes with allele tables (pharmgkb_genes).
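The first-column approach could look roughly like the sketch below. It assumes the spreadsheet has already been loaded as a list of rows (lists of cell values); the function name, the "Allele" label and all cell values are placeholders, and only the "rsID" label comes from the discussion above.

```python
def index_rows_by_label(rows, labels_of_interest):
    """Map each label of interest to the number of the first row whose
    first cell matches it (case-insensitive, surrounding whitespace ignored)."""
    wanted = {label.strip().lower(): label for label in labels_of_interest}
    found = {}
    for row_number, row in enumerate(rows):
        if not row:
            continue  # skip completely empty rows
        first_cell = str(row[0] or "").strip().lower()
        if first_cell in wanted and wanted[first_cell] not in found:
            found[wanted[first_cell]] = row_number
    return found


# Placeholder table: the layout and every value here are invented for illustration.
example_rows = [
    ["GENE: PLACEHOLDER", None, None],
    [],
    ["rsID", "rs_placeholder_1", "rs_placeholder_2"],
    ["Allele", None, None],
    ["*1", "A", "C"],
    ["*2", None, "T"],
]
print(index_rows_by_label(example_rows, ["rsID", "Allele"]))
```

Once the row numbers are known, the rest of the table can be interpreted relative to them instead of relying on fixed positions.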

Speaking of, is there a directory with all download/file/attachment/ files? Just in case the allele_definition_url.format(gene=gene) is not always working because of a wrong name (e.g. someone put an underscore in the filename or something) and we are counting fewer genes than we should. Although I assume you did due diligence, since you also mention 90% of the non-rsIDs have the tables, so the gap shouldn't be big if there was one at all. It's probably just my little trust in files without proper naming conventions.
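One way to make the filename concern measurable is to try several candidate URL patterns per gene and record which genes match none of them. This is only a sketch: the templates below are placeholders on example.org, not the real PGKB paths, and `url_exists` is a hypothetical callable that the caller supplies (for instance a wrapper around an HTTP request).

```python
def locate_allele_tables(genes, url_templates, url_exists):
    """Return ({gene: first URL that exists}, [genes with no matching URL]).

    url_templates: format strings containing a "{gene}" field, tried in order.
    url_exists: caller-supplied function taking a URL and returning True/False.
    """
    found = {}
    missing = []
    for gene in genes:
        for template in url_templates:
            candidate = template.format(gene=gene)
            if url_exists(candidate):
                found[gene] = candidate
                break
        else:
            # No template produced an existing URL for this gene.
            missing.append(gene)
    return found, missing


# Placeholder templates and a fake existence check, for illustration only.
templates = [
    "https://example.org/{gene}_allele_definition_table.xlsx",
    "https://example.org/{gene}-allele-definition-table.xlsx",
]
available = {"https://example.org/GENE_A_allele_definition_table.xlsx"}
print(locate_allele_tables(["GENE_A", "GENE_B"], templates, available.__contains__))
```

Genes in the `missing` list are the ones worth checking by hand or asking PGKB about.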

> Or is e.g. *1/first row the reference?

I'm positive it's that way, like we discussed. I checked a few of the rsIDs of CYP2D6, and the ref allele at NCBI was the one at *1.

> If so what does missing value mean?

Not sure, since I wasn't able to find an example in the spreadsheet that had the rsID and not the reference, so I couldn't compare.
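If the `*1` row is taken as the reference, reading the reference allele per rsID could be sketched as follows. The table layout (labels in the first column, one variant per column) and the function name are assumptions, and the values are placeholders. Because the meaning of a missing value is still open, empty cells are kept as `None` rather than filled with a guess.

```python
def reference_alleles(rows, rsid_row_number, reference_label="*1"):
    """Return {rsID: reference allele} read from the row labelled reference_label.

    Empty cells are reported as None; no interpretation of missing values is made.
    """
    rsids = rows[rsid_row_number][1:]
    for row in rows:
        if row and str(row[0]).strip() == reference_label:
            reference_cells = row[1:]
            break
    else:
        raise ValueError(f"no row labelled {reference_label!r}")

    result = {}
    for rsid, cell in zip(rsids, reference_cells):
        if not rsid:
            continue  # column without an rsID, nothing to key on
        result[rsid] = cell if cell else None
    return result


# Placeholder table, invented for illustration.
table = [
    ["rsID", "rs_placeholder_1", None, "rs_placeholder_2"],
    ["*1", "A", "G", None],
    ["*2", "T", None, None],
]
print(reference_alleles(table, rsid_row_number=0))
```

Comparing this output against the reference allele reported at NCBI for each rsID would turn the spot check on CYP2D6 into a systematic one.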

> we can rely on the "Gene" column in PGKB data

Similarly, I would advise getting as many raw tables from PGKB as we can, rather than parsing the text they produce. I'm talking especially about the annotation text field. The less text generated from structured fields that we need to parse, the better.

> Note that our PGx schema uses genotype IDs not variant IDs

Sounds wacky, but if we have the variant IDs and we know the reference, could we not craft a similar genotype ID? Except for the black sheep ones with weird naming conventions (?)

> Do we want to resolve named alleles to variants, and if so how to convey this information?

Related to my question during today's meeting: we might as well ask them directly or search for why there is this legacy naming convention, when some have proper variant IDs that are not being used. There may be a good reason, or just a "hey, we didn't make the rules", to which we can adapt.

@apriltuesday
Collaborator Author

Thanks Marcos, I'm on the same page as you for pretty much everything you mention. A couple specific points:

> I also have to say that I like all the comments in it, it makes it almost like a novel of a coder's mind when exploring data.

Haha I'm glad you appreciate it, I usually clean these up a bit before posting them (because you don't really want to look too closely into my mind...), but it can also be nice to keep them as a sort of "real" research notebook, in case one of you picks up on something I didn't see.

> Speaking of, is there a directory with all download/file/attachment/ files?

Yes, I'm also mistrustful of the "guess the filename" method of fetching these. I didn't find such a central location, but I did at least check that I also couldn't find spreadsheets for the ones that the code couldn't find. I think if we wanted to use this in the pipeline, we should ask PGKB about a central location or API.

> Sounds wacky, but if we have the variant IDs and we know the reference, could we not craft a similar genotype ID?

I think we could craft a genotype ID for these, the problem would be associating them with the correct annotations when PGKB only has the allele annotated vs. the full genotype - like the examples Tim highlighted above. We used the genotype ID for SNPs because that was consistently the level at which they were annotated, but that's not the case here unfortunately.

I'm planning to spend a bit of time today seeing how many of the allele definition tables are informative (i.e. consist of variants rather than just "not callable"), stay tuned...

@apriltuesday
Collaborator Author

Updated the notebook with informativeness, plus some basic counts of how many alleles and how many variants the tables contain. About 64% of the tables that we get (corresponding to 64% of the non-rs records) should list actual variants. It looks like all the "not callable" ones are HLA, which I guess is expected?
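For reference, an informativeness count of this kind can be sketched as below. This is not the notebook's code: the function names are hypothetical, each table is reduced to a flat list of its allele cells, and it assumes uninformative cells are either empty or contain the literal text "not callable".

```python
def is_informative(cells, placeholder="not callable"):
    """True if at least one cell holds something other than the placeholder."""
    for cell in cells:
        text = str(cell or "").strip().lower()
        if text and text != placeholder:
            return True
    return False


def informative_fraction(tables):
    """Fraction of tables (gene -> list of allele cells) that are informative."""
    if not tables:
        return 0.0
    informative = sum(1 for cells in tables.values() if is_informative(cells))
    return informative / len(tables)


# Invented tables, for illustration only.
example_tables = {
    "GENE_A": ["A", None, "T"],
    "GENE_B": ["not callable", "Not callable"],
}
print(informative_fraction(example_tables))
```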


3 participants