Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare genes from VEP and genes from PGKB #21

Closed
apriltuesday opened this issue Aug 24, 2023 · 6 comments
Closed

Compare genes from VEP and genes from PGKB #21

apriltuesday opened this issue Aug 24, 2023 · 6 comments

Comments

@apriltuesday
Copy link
Collaborator

Currently the pipeline only includes genes from VEP, but outputs both sets of genes where they differ on a single annotation. We should investigate any discrepancies and plan any necessary next steps.

@apriltuesday
Copy link
Collaborator Author

Assuming we don't flag any problematic discrepancies and hence can confirm the PGKB gene has the same semantics as the VEP gene, OT has requested that we continue to report the VEP genes but fall back on the PGKB genes when VEP fails to provide one.

@apriltuesday
Copy link
Collaborator Author

@tcezard Quick counts from the 2023.12 submission - this is a count of annotations, including only records with RS IDs and only where VEP genes and PGKB genes don't match (so the 770 from here):

Counter({'vep_superset': 140,
         'vep_subset': 273,
         'mismatch': 106,
         'vep_missing': 182,
         'pgkb_missing': 61,
         'divergent_match': 8})

These are "CMAT paper"-style categories, so I think the case we're really worried about is mismatch (though maybe not? what do you think?)

Here is a spreadsheet with just the mismatches, in case you can pick out any patterns.

I think it would also be good to get how many records have both missing, as those are cases where adding PGKB genes won't help... I think for that I would have to rerun the pipeline though.

@apriltuesday
Copy link
Collaborator Author

Updated counts, now including exact_match and both_missing - this is also using newer data, though it doesn't make a big difference.

Counter({'exact_match': 3535,
         'vep_superset': 143,
         'vep_subset': 271,
         'both_missing': 196,
         'mismatch': 106,
         'vep_missing': 185,
         'pgkb_missing': 61,
         'divergent_match': 8})

Also updated the spreadsheet with variant locations for the 106 mismatches, and added a tab with the only 12 records where PGKB has a gene but Biomart could not get Ensembl IDs.

Notebook is viewable here if you're interested...

@tcezard
Copy link
Member

tcezard commented Mar 15, 2024

Thank you for the counts and the spreadsheet.

Assumptions

I'm going to start by assuming that VEP is providing the correct answer to the question we're asking which is "what is the gene impacted by the variant?"
The point of this exercise is to confirm that VEP and PGKB are providing an answer to the same question and that PGKB is answer is similarly accurate.

Added Contribution of PharmGKB genes

There are 4505 variants being queried here 4063 of which have gene associated that can be compared between VEP and PGKB.
Overall the two methods are giving similar results 3535 have perfect match, (78.4% total, 87.0% testable) and 3957 (87.8% total, 97.4% testable) are in relative agreement.

Assuming we would use VEP primarily and only add PGKB gene when we cannot get results from VEP.
VEP provide a gene in 91.3% of the cases and adding PGKB results would add 4.1%

Accuracy of PharmGKB genes

I looked in more detail at some cases where VEP and PGKB disagree in the spreadsheet and it looks like PGKB genes are off by a couple of KBs. In some case they provide a related gene to the one VEP provides and in others they provide a completely different but it is usually within the vicinity (but not overlapping) of the variant.

This makes me think that PGKB might be using an older (or at least different) set of annotation compare to VEP. My concern is that adding the 4.1% of genes found in PGKB and not in VEP is very likely enriched in gene that would be divergent between the two annotation sets.

Conclusion

I'm not convinced that pulling the gene from PGKB and adding it to the same field in the schema is a sensible thing to do. I think we would loose the ability to know the mechanism by which this gene was assign to the variant and therefore not be able to provide provenance. This said OT might disagree.

@apriltuesday
Copy link
Collaborator Author

apriltuesday commented Mar 15, 2024

I've also looked into why the variant specifically highlighted in this OT issue didn't get an annotation from VEP. The reason in this case is because we failed to resolve the reference allele and thus couldn't query VEP with the coordinates. The location provided by PGKB is NC_000001.11:97740411_97740418 (same as the one provided in dbSNP) which corresponds to the sequence GATGAATGA in the reference (with context base added). But the annotated alleles in the clinical annotation are ATGA,DEL. Since we can't match either of these to the reference, we fail to generate the complete coordinates.

So if falling back on PGKB genes isn't a reliable option, as Tim suggests above, two more options would be:

  • Implement a more clever reference allele determination algorithm. Some of the complications here were discussed back in this issue, and I've added a tab to the spreadsheet with the "no genotype ID" errors I get from the pipeline (repeats are a known issue which we'll get to at some point)
  • Query VEP with just position information to get just the overlapping gene. This resolves the most pressing issue and I'm guessing it would be accurate since the positions come from dbSNP, but we wouldn't get a functional consequence and wouldn't have a genotype ID.

@DSuveges @ireneisdoomed @tskir, please take a look and let us know your thoughts.

@apriltuesday
Copy link
Collaborator Author

Closing this as the PGKB vs. VEP comparison is complete, but will continue the work to improve gene coverage under the new Github issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants