Compare genes from VEP and genes from PGKB #21

apriltuesday · 2023-08-24T13:34:58Z

Currently the pipeline only includes genes from VEP, but outputs both sets of genes where they differ on a single annotation. We should investigate any discrepancies and plan any necessary next steps.

apriltuesday · 2024-02-26T13:30:38Z

Assuming we don't flag any problematic discrepancies and hence can confirm the PGKB gene has the same semantics as the VEP gene, OT has requested that we continue to report the VEP genes but fall back on the PGKB genes when VEP fails to provide one.

apriltuesday · 2024-03-12T16:28:20Z

@tcezard Quick counts from the 2023.12 submission - this is a count of annotations, including only records with RS IDs and only where VEP genes and PGKB genes don't match (so the 770 from here):

Counter({'vep_superset': 140,
         'vep_subset': 273,
         'mismatch': 106,
         'vep_missing': 182,
         'pgkb_missing': 61,
         'divergent_match': 8})

These are "CMAT paper"-style categories, so I think the case we're really worried about is mismatch (though maybe not? what do you think?)

Here is a spreadsheet with just the mismatches, in case you can pick out any patterns.

I think it would also be good to get how many records have both missing, as those are cases where adding PGKB genes won't help... I think for that I would have to rerun the pipeline though.

apriltuesday · 2024-03-13T14:47:57Z

Updated counts, now including exact_match and both_missing - this is also using newer data, though it doesn't make a big difference.

Counter({'exact_match': 3535,
         'vep_superset': 143,
         'vep_subset': 271,
         'both_missing': 196,
         'mismatch': 106,
         'vep_missing': 185,
         'pgkb_missing': 61,
         'divergent_match': 8})

Also updated the spreadsheet with variant locations for the 106 mismatches, and added a tab with the only 12 records where PGKB has a gene but Biomart could not get Ensembl IDs.

Notebook is viewable here if you're interested...

tcezard · 2024-03-15T10:33:43Z

Thank you for the counts and the spreadsheet.

Assumptions

I'm going to start by assuming that VEP is providing the correct answer to the question we're asking which is "what is the gene impacted by the variant?"
The point of this exercise is to confirm that VEP and PGKB are providing an answer to the same question and that PGKB is answer is similarly accurate.

Added Contribution of PharmGKB genes

There are 4505 variants being queried here 4063 of which have gene associated that can be compared between VEP and PGKB.
Overall the two methods are giving similar results 3535 have perfect match, (78.4% total, 87.0% testable) and 3957 (87.8% total, 97.4% testable) are in relative agreement.

Assuming we would use VEP primarily and only add PGKB gene when we cannot get results from VEP.
VEP provide a gene in 91.3% of the cases and adding PGKB results would add 4.1%

Accuracy of PharmGKB genes

I looked in more detail at some cases where VEP and PGKB disagree in the spreadsheet and it looks like PGKB genes are off by a couple of KBs. In some case they provide a related gene to the one VEP provides and in others they provide a completely different but it is usually within the vicinity (but not overlapping) of the variant.

This makes me think that PGKB might be using an older (or at least different) set of annotation compare to VEP. My concern is that adding the 4.1% of genes found in PGKB and not in VEP is very likely enriched in gene that would be divergent between the two annotation sets.

Conclusion

I'm not convinced that pulling the gene from PGKB and adding it to the same field in the schema is a sensible thing to do. I think we would loose the ability to know the mechanism by which this gene was assign to the variant and therefore not be able to provide provenance. This said OT might disagree.

apriltuesday · 2024-03-15T11:33:47Z

I've also looked into why the variant specifically highlighted in this OT issue didn't get an annotation from VEP. The reason in this case is because we failed to resolve the reference allele and thus couldn't query VEP with the coordinates. The location provided by PGKB is NC_000001.11:97740411_97740418 (same as the one provided in dbSNP) which corresponds to the sequence GATGAATGA in the reference (with context base added). But the annotated alleles in the clinical annotation are ATGA,DEL. Since we can't match either of these to the reference, we fail to generate the complete coordinates.

So if falling back on PGKB genes isn't a reliable option, as Tim suggests above, two more options would be:

Implement a more clever reference allele determination algorithm. Some of the complications here were discussed back in this issue, and I've added a tab to the spreadsheet with the "no genotype ID" errors I get from the pipeline (repeats are a known issue which we'll get to at some point)
Query VEP with just position information to get just the overlapping gene. This resolves the most pressing issue and I'm guessing it would be accurate since the positions come from dbSNP, but we wouldn't get a functional consequence and wouldn't have a genotype ID.

@DSuveges @ireneisdoomed @tskir, please take a look and let us know your thoughts.

apriltuesday · 2024-04-03T09:51:54Z

Closing this as the PGKB vs. VEP comparison is complete, but will continue the work to improve gene coverage under the new Github issue.

apriltuesday mentioned this issue Apr 3, 2024

Improve gene coverage via improvements to genotype ID algorithm #36

Closed

apriltuesday closed this as completed Apr 3, 2024

apriltuesday mentioned this issue Apr 3, 2024

Issue 36: Improve genotype ID algorithm #37

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compare genes from VEP and genes from PGKB #21

Compare genes from VEP and genes from PGKB #21

apriltuesday commented Aug 24, 2023

apriltuesday commented Feb 26, 2024

apriltuesday commented Mar 12, 2024

apriltuesday commented Mar 13, 2024

tcezard commented Mar 15, 2024

apriltuesday commented Mar 15, 2024 •

edited

Loading

apriltuesday commented Apr 3, 2024

Compare genes from VEP and genes from PGKB #21

Compare genes from VEP and genes from PGKB #21

Comments

apriltuesday commented Aug 24, 2023

apriltuesday commented Feb 26, 2024

apriltuesday commented Mar 12, 2024

apriltuesday commented Mar 13, 2024

tcezard commented Mar 15, 2024

Assumptions

Added Contribution of PharmGKB genes

Accuracy of PharmGKB genes

Conclusion

apriltuesday commented Mar 15, 2024 • edited Loading

apriltuesday commented Apr 3, 2024

apriltuesday commented Mar 15, 2024 •

edited

Loading