Investigate clinical annotations without RS IDs #28
Comments
Some earlier investigation here, particularly the use of PharmVar. See also the representation in PharmGKB itself, including an allele definition spreadsheet we could leverage.
The Notebook looks great and you clearly highlight the main issues.
We could pass that heterogeneity to OT but I think it is worth highlighting
As always, a very good job on the thorough analysis with the metrics in a notebook, @apriltuesday. I also have to say that I like all the comments in it; it reads almost like a novel of a coder's mind when exploring data.
I'm pretty sure it's not, given that some names seem to have commas between them (e.g. Now that I have seen the PGKB API (api.pharmgkb.org/v1), I reckon it may be feasible to parse them and extract some metrics from their spreadsheets. An approach we took for a similar issue at the EGA was to parse the first column of the spreadsheets and assign row numbers to the rows of interest (e.g. Speaking of, is there a directory with all
I'm positive it's that way, like we discussed. I checked a few of the rsIDs of CYP2D6, and the ref allele at NCBI was the one at *1.
Not sure, since I wasn't able to find an example in the spreadsheet that had the rsID and not the reference, so I couldn't compare.
Similarly, I would advise getting as many raw tables from PGKB as we can, rather than parsing the text they produce. I'm talking especially about the
Sounds wacky, but if we have the variant IDs and we know the reference, could we not craft a similar genotype ID? Except for the black sheep ones with weird naming conventions (?)
Related to my question during today's meeting: we might as well ask them directly or search for why there is this legacy naming convention, when some have proper variant IDs that are not being used. There may be a good reason, or just a "hey, we didn't make the rules", to which we can adapt.
Thanks Marcos, I'm on the same page as you for pretty much everything you mention. A couple specific points:
Haha I'm glad you appreciate it, I usually clean these up a bit before posting them (because you don't really want to look too closely into my mind...), but it can also be nice to keep them as a sort of "real" research notebook, in case one of you picks up on something I didn't see.
Yes, I'm also mistrustful of the "guess the filename" method of fetching these. I didn't find such a central location, but I did at least check that I also couldn't find spreadsheets for the ones that the code couldn't find. I think if we wanted to use this in the pipeline we should ask PGKB about a central location or API.
I think we could craft a genotype ID for these, the problem would be associating them with the correct annotations when PGKB only has the allele annotated vs. the full genotype - like the examples Tim highlighted above. We used the genotype ID for SNPs because that was consistently the level at which they were annotated, but that's not the case here unfortunately. I'm planning to spend a bit of time today seeing how many of the allele definition tables are informative (i.e. consist of variants rather than just "not callable"), stay tuned...
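To make the "craft a genotype ID" idea concrete, here is a minimal sketch, assuming an allele definition can be reduced to a mapping from variant ID to base, with the reference allele (e.g. *1) given in the same shape. The function name `craft_allele_id`, the `rsID:ref>alt` format, and the placeholder rsIDs are all hypothetical; none of this is PGKB's or the pipeline's actual ID scheme.

```python
from typing import Dict


def craft_allele_id(reference: Dict[str, str], allele: Dict[str, str]) -> str:
    """Build an identifier from the positions where `allele` differs from `reference`.

    Both arguments map a variant ID to the base observed at that variant.
    The output format is an assumption made for illustration only.
    """
    diffs = [
        f"{var_id}:{reference[var_id]}>{base}"
        for var_id, base in sorted(allele.items())
        if var_id in reference and base != reference[var_id]
    ]
    # An allele identical to the reference yields an empty string.
    return ";".join(diffs)


# Made-up example data (placeholder rsIDs, not real variants).
reference = {"rs0000001": "C", "rs0000002": "G"}
star_x = {"rs0000001": "T", "rs0000002": "G"}
print(craft_allele_id(reference, star_x))
```

This only covers the ID itself. It does not address the harder problem raised above, namely matching such IDs to annotations that are made at the allele level rather than the full genotype level.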
Updated notebook with informativeness, plus some basic counts on how many alleles and how many variants are contained in the tables - basically about 64% of the tables that we get (corresponding to 64% of the non-rs records) should list actual variants. It looks like all the "not callable" ones are HLA, which I guess is expected?
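For reference, the informativeness check can be sketched roughly as follows. This is not the notebook's code: it assumes each allele definition table has already been loaded into a dict mapping allele name to a list of cell strings, and that uninformative tables contain only the literal text "not callable". The names `is_informative` and `informative_fraction` and the toy data are hypothetical.

```python
from typing import Dict, List

NOT_CALLABLE = "not callable"  # assumed marker text for uninformative cells

Table = Dict[str, List[str]]  # allele name -> cells of its definition row


def is_informative(table: Table) -> bool:
    """True if at least one non-empty cell is something other than the marker."""
    return any(
        cell.strip().lower() != NOT_CALLABLE
        for cells in table.values()
        for cell in cells
        if cell.strip()
    )


def informative_fraction(tables: Dict[str, Table]) -> float:
    """Fraction of tables (keyed by gene) that list actual variants."""
    if not tables:
        return 0.0
    return sum(is_informative(t) for t in tables.values()) / len(tables)


# Toy input with placeholder gene names and rsIDs.
tables = {
    "GENE_A": {"*1": ["rs0000001 C"], "*2": ["rs0000001 T"]},
    "GENE_B": {"*01": ["not callable"], "*02": ["not callable"]},
}
print(informative_fraction(tables))
```

Keeping the per-table check separate from the aggregate makes it easy to also list which genes are uninformative, for example to confirm that they are all HLA.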
For example CYP2C9*1.
Look into the data and discuss how we should represent these.