Compare genes from VEP and genes from PGKB #21
Assuming we don't flag any problematic discrepancies and hence can confirm the PGKB gene has the same semantics as the VEP gene, OT has requested that we continue to report the VEP genes but fall back on the PGKB genes when VEP fails to provide one.
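The fallback rule requested above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function name `choose_genes` and the `"VEP"`/`"PGKB"` source labels are hypothetical, and genes are assumed to be plain identifier strings.

```python
from typing import List, Optional, Sequence, Tuple


def choose_genes(
    vep_genes: Sequence[str], pgkb_genes: Sequence[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (genes, source), preferring VEP and falling back on PGKB.

    Hypothetical helper: the source label is an assumption added here so the
    caller can still tell where the reported genes came from.
    """
    if vep_genes:
        # VEP provided at least one gene: report VEP only.
        return sorted(set(vep_genes)), "VEP"
    if pgkb_genes:
        # VEP provided nothing: fall back on the PGKB genes.
        return sorted(set(pgkb_genes)), "PGKB"
    # Neither source has a gene for this annotation.
    return [], None
```

Returning the source next to the genes is a design choice of this sketch; it keeps the two cases distinguishable downstream.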
@tcezard Quick counts from the 2023.12 submission - this is a count of annotations, including only records with RS IDs and only where VEP genes and PGKB genes don't match (so the 770 from here):
These are "CMAT paper"-style categories, so I think the case we're really worried about is
Here is a spreadsheet with just the mismatches, in case you can pick out any patterns. I think it would also be good to count how many records have both genes missing, as those are cases where adding PGKB genes won't help. I think I would have to rerun the pipeline for that, though.
Updated counts, now including
Also updated the spreadsheet with variant locations for the 106 mismatches, and added a tab with the only 12 records where PGKB has a gene but Biomart could not get Ensembl IDs. Notebook is viewable here if you're interested...
Thank you for the counts and the spreadsheet.

Assumptions
I'm going to start by assuming that VEP is providing the correct answer to the question we're asking, which is "what is the gene impacted by the variant?"

Added Contribution of PharmGKB genes
There are 4505 variants being queried here, 4063 of which have an associated gene that can be compared between VEP and PGKB. This assumes we would use VEP primarily and only add the PGKB gene when we cannot get results from VEP.

Accuracy of PharmGKB genes
I looked in more detail at some cases in the spreadsheet where VEP and PGKB disagree, and it looks like the PGKB genes are off by a couple of kb. In some cases they provide a gene related to the one VEP provides, and in others they provide a completely different one, but it is usually within the vicinity of the variant (without overlapping it). This makes me think that PGKB might be using an older (or at least different) annotation set compared to VEP. My concern is that the 4.1% of genes found in PGKB and not in VEP are very likely enriched in genes that diverge between the two annotation sets.

Conclusion
I'm not convinced that pulling the gene from PGKB and adding it to the same field in the schema is a sensible thing to do. I think we would lose the ability to know the mechanism by which the gene was assigned to the variant, and therefore would not be able to provide provenance. That said, OT might disagree.
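The "within the vicinity but not overlapping" observation could be checked per record with something like the sketch below. It is a hypothetical helper, not part of the pipeline: coordinates are assumed to be 1-based inclusive positions on the same chromosome and assembly, and the 5000 bp `flank` is an arbitrary placeholder for "a couple of kb".

```python
def locate_variant(pos: int, gene_start: int, gene_end: int, flank: int = 5000) -> str:
    """Classify a variant position relative to a gene interval.

    Returns "overlapping" if the position falls inside the gene,
    "nearby" if it lies within `flank` bases of either end,
    and "distant" otherwise. `flank` is an assumed threshold.
    """
    if gene_start <= pos <= gene_end:
        return "overlapping"
    # Distance to the closest gene boundary.
    distance = gene_start - pos if pos < gene_start else pos - gene_end
    return "nearby" if distance <= flank else "distant"
```

Running this for both the VEP gene and the PGKB gene of each mismatching record would show how often the PGKB gene is merely adjacent to the variant.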
I've also looked into why the variant specifically highlighted in this OT issue didn't get an annotation from VEP. In this case, the reason is that we failed to resolve the reference allele and thus couldn't query VEP with the coordinates. The location provided by PGKB is
So if falling back on PGKB genes isn't a reliable option, as Tim suggests above, two more options would be:
@DSuveges @ireneisdoomed @tskir, please take a look and let us know your thoughts.
Closing this as the PGKB vs. VEP comparison is complete, but will continue the work to improve gene coverage under the new GitHub issue.
Currently the pipeline only includes genes from VEP, but outputs both sets of genes where they differ on a single annotation. We should investigate any discrepancies and plan any necessary next steps.
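As a rough illustration of the comparison described here, each annotation's two gene sets can be put into a simple category and tallied. The category names below are made up for this sketch and are not the "CMAT paper"-style categories mentioned in the thread; the input is assumed to be an iterable of `(vep_genes, pgkb_genes)` pairs.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def compare_gene_sets(vep_genes: Sequence[str], pgkb_genes: Sequence[str]) -> str:
    """Label one annotation by how its VEP and PGKB gene sets relate."""
    vep, pgkb = set(vep_genes), set(pgkb_genes)
    if not vep and not pgkb:
        return "both_missing"  # adding PGKB genes would not help here
    if not vep:
        return "pgkb_only"     # candidates for the PGKB fallback
    if not pgkb:
        return "vep_only"
    return "match" if vep == pgkb else "mismatch"


def count_categories(
    annotations: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> Counter:
    """Tally the category of every (vep_genes, pgkb_genes) pair."""
    return Counter(compare_gene_sets(vep, pgkb) for vep, pgkb in annotations)
```

The `both_missing` and `pgkb_only` counts are the two numbers that bound how much a PGKB fallback could improve gene coverage.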