Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Merging of different variants VarDict and TNscope #1519

Closed
mathiasbio opened this issue Jan 15, 2025 · 7 comments · May be fixed by #1524
Closed

[Bug] Merging of different variants VarDict and TNscope #1519

mathiasbio opened this issue Jan 15, 2025 · 7 comments · May be fixed by #1524
Assignees
Labels
Bug Something isn't working
Milestone

Comments

@mathiasbio
Copy link
Collaborator

Description

The current method in balsamic v16.0.0 using bcftools concat to merge the VCFs doesn't require that the variants are matching perfectly in the ALT column, so for instance if a variant has been called as a MNV in VarDict and as separate SNVs in TNscope, it merges only the first variant.

Such as if such a MNV exists in VarDict

REF: ATTC -> ALT: TTTA

And in TNscope as two separate variants:

REF: A -> ALT: T
REF: C -> ALT: A

Then it would merge the 1st of the TNscope variants into the VarDict one, and keep the second variant.

How to reproduce

No response

Expected behaviour

Ideally the two different ways of representing the same variant should be able to be consolidated into 1 event.

But this may require quite a bit of work, and an intermediate solution would be to not merge these variants, but to maintain both representations of the event.

Anything else?

Example from IGV:

image

Top track is 2 variants from TNscope, bottom is 1 variant from VarDict:

Here is the variants after merging:

image

Top track is using the current v16.0.0 method bcftools concat, and the bottom one is using a custom python script

Pipeline version

16.0.0

@mathiasbio mathiasbio added the Bug Something isn't working label Jan 15, 2025
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Jan 15, 2025
@mathiasbio mathiasbio added this to the Release 17 milestone Jan 15, 2025
@mathiasbio mathiasbio linked a pull request Jan 15, 2025 that will close this issue
54 tasks
@mathiasbio
Copy link
Collaborator Author

Here is another example, of variants that are larger than 2 bases:

image

Track:

  1. Merged with bcftools concat
  2. Merged with custom python script
  3. TNscope calls
  4. VarDict calls

@mathiasbio
Copy link
Collaborator Author

I emailed Sentieon about the possibility of TNscope outputting MNVs in the case of these variants and got this response from Don Freed:

"
Sentieon's first variant callers were designed to match the GATK's variant representation, and the GATK produced an atomic representation of variants. TNscope has inherited this atomic representation from Sentieon's GATK-matching variant callers.

TNscope does not have any argument for merging multiple variants into a single MNP. However, we have an open-source tool that is able to merge variants into MNPs in a post-calling script, https://github.com/Sentieon/sentieon-scripts/tree/master/merge_mnp. Here is an example usage:

# Add a PASS FILTER to the variants. This is required by the `merge_mnp` script. 

# `util vcfconvert` compresses and indexes the VCF file read from stdin

bcftools filter -i 'QUAL>=10' <output.vcf.gz> | sentieon util vcfconvert - result/sample_pass.vcf.gz

 

# Merge MNPs with the `merge_mnp.py` script. The script requires the vcflib library

# which is packaged with the Sentieon software and `sentieon pyexec` will load this

# library

sentieon pyexec merge_mnp.py result/sample_pass.vcf.gz /home/regression/references/hg19/ucsc.hg19.fasta --max_distance 5 | \

  sentieon util vcfconvert -  result/merged.vcf.gz 

The first step to add the PASS filter may or may not be required depending if your input VCF already has PASS variants.

Here is a real-world example from some test data that I had on hand. Here are the variants before the merge script:

20      1052872 .       G       A       9.45    PASS    ECNT=2;FS=0.000;HCNT=1;MAX_ED=1;MIN_ED=1;NLOD=1.89;NLODF=1.11;PV=0.2055;PV2=0.2055;SOR=0.446;TLOD=9.74  GT:AD:AF:ALTHC:ALT_F1R2:ALT_F2R1:BaseQRankSumPS:ClippingRankSumPS:DP:DPHC:FOXOG:MQRankSumPS:NBQPS:PGT:PID:QSS:REF_F1R2:REF_F2R1:ReadPosEndDistPS:ReadPosRankSumPS       0/1:8,4:0.333333:4:3:1:1.820:-0.000:12:11:0.750:-0.000:36.186:0|1:1052872_G_A:232,146:3:5:21.167:-0.409 0/0:6,0:0:0:0:0:.:.:6:6:.:.:.:0|1:1052872_G_A:148,0:2:4:24.000:.
20      1052873 .       A       G       9.45    PASS    ECNT=2;FS=3.522;HCNT=1;MAX_ED=1;MIN_ED=1;NLOD=1.89;NLODF=1.11;PV=0.2055;PV2=0.2055;SOR=1.518;TLOD=9.74  GT:AD:AF:ALTHC:ALT_F1R2:ALT_F2R1:BaseQRankSumPS:ClippingRankSumPS:DP:DPHC:FOXOG:MQRankSumPS:NBQPS:PGT:PID:QSS:REF_F1R2:REF_F2R1:ReadPosEndDistPS:ReadPosRankSumPS       0/1:8,4:0.333333:4:3:1:-0.000:-0.000:12:11:0.250:-0.000:36.000:0|1:1052872_G_A:273,137:3:5:21.167:-0.163        0/0:7,0:0:0:0:0:.:.:7:6:.:.:.:0|1:1052872_G_A:223,0:3:4:20.571:.

The merge script command line, sentieon pyexec ./merge_mnp/merge_mnp.py test-somatic/test_merge-mnp.vcf.gz /home/regression/references/b37/hs37d5.fa --max_distance 5. Here is the output:

20      1052872 .       GA      AG      9.45    PASS    ECNT=2.0;FS=1.761;HCNT=1.0;MAX_ED=1.0;MIN_ED=1.0;ML_PROB=0.20500000000000002;NLOD=1.89;NLODF=1.11;PV=0.2055;PV2=0.2055;SOR=0.982;TLOD=9.74      GT:AD:AF:ALTHC:ALT_F1R2:ALT_F2R1:BaseQRankSumPS:ClippingRankSumPS:DP:DPHC:FOXOG:MQRankSumPS:NBQPS:PGT:PID:QSS:REF_F1R2:REF_F2R1:ReadPosEndDistPS:ReadPosRankSumPS       0/1:8,4:0.333333:4:3:1:1.82:-0.0:12:11:0.75:-0.0:36.186:0|1:1052872_G_A:232,146:3:5:21.167:-0.409       0/0:6,0:0.0:0:0:0:.:.:6:6:.:.:.:0|1:1052872_G_A:148,0:2:4:24.0:.
20      1052872 .       G       A       9.45    MERGED  ECNT=2;FS=0.0;HCNT=1;MAX_ED=1;MIN_ED=1;ML_PROB=0.19;NLOD=1.89;NLODF=1.11;PV=0.2055;PV2=0.2055;SOR=0.446;TLOD=9.74       GT:AD:AF:ALTHC:ALT_F1R2:ALT_F2R1:BaseQRankSumPS:ClippingRankSumPS:DP:DPHC:FOXOG:MQRankSumPS:NBQPS:PGT:PID:QSS:REF_F1R2:REF_F2R1:ReadPosEndDistPS:ReadPosRankSumPS       0/1:8,4:0.333333:4:3:1:1.82:-0.0:12:11:0.75:-0.0:36.186:0|1:1052872_G_A:232,146:3:5:21.167:-0.409       0/0:6,0:0.0:0:0:0:.:.:6:6:.:.:.:0|1:1052872_G_A:148,0:2:4:24.0:.
20      1052873 .       A       G       9.45    MERGED  ECNT=2;FS=3.522;HCNT=1;MAX_ED=1;MIN_ED=1;ML_PROB=0.22;NLOD=1.89;NLODF=1.11;PV=0.2055;PV2=0.2055;SOR=1.518;TLOD=9.74     GT:AD:AF:ALTHC:ALT_F1R2:ALT_F2R1:BaseQRankSumPS:ClippingRankSumPS:DP:DPHC:FOXOG:MQRankSumPS:NBQPS:PGT:PID:QSS:REF_F1R2:REF_F2R1:ReadPosEndDistPS:ReadPosRankSumPS       0/1:8,4:0.333333:4:3:1:-0.0:-0.0:12:11:0.25:-0.0:36.0:0|1:1052872_G_A:273,137:3:5:21.167:-0.163 0/0:7,0:0.0:0:0:0:.:.:7:6:.:.:.:0|1:1052872_G_A:223,0:3:4:20.571:.

The script will output a new MNP from the original calls. The original records are also output with a "MERGED" flag.

Best regards,
Don
"

This sounds like a decent solution, though as it doesn't have the bamfile as an input I don't think it can know which variants are actually part of the same molecule, and therefore should be interpreted together, and which ones are in different molecules and should be interpreted separately.

It seems in the case of the VarDict variants the adjacent SNVs are part of the same reads, and should be interpreted together as they should influence how VEP determines the effect on the protein.

I think for the moment maybe it is best to just keep both representations, the atomised versions from TNscope, and the MNVs from VarDict. Because I don't know how else to merge the variants in a reliable way, other than making a complex tool to read the bamfile and seeing which adjacent TNscope variants should be merged, or to use VarDict as the reference and merge variants that fit into a VarDict MNV. But I feel like it's all too much work for little benefit at the moment when we have so much else that's prioritised.

@mathiasbio
Copy link
Collaborator Author

mathiasbio commented Jan 17, 2025

Update on this...seems like TNscope does output phased variants, so it should be possible to connect adjacent SNVs into a MNV. So if the script that Don linked earlier takes this into account then it should be possible to create true MVNs. There's also the possibility that VEP already reads this phased info and takes it into account.

But we would still have the issue of different representations of the same biological event. To start I'll look into this script and see if it uses the phasing info. (Five seconds later...ok, it does look at phasing info)

@mathiasbio
Copy link
Collaborator Author

mathiasbio commented Jan 17, 2025

Alternatives:

  1. Merge TNscope MNVs (up to what size?)
  2. Split VarDict MNVs. But then can VarDict correctly predict the effect on the protein if the variants are in the same codon? (from what I've seen it doesn't look like it, but I can be wrong)
  3. Keep both representations, don't do anything...

@mathiasbio
Copy link
Collaborator Author

I looked into bcftools norm again as suggested during the refinement meeting, and saw this option which I had seen before -m, --multiallelics -|+TYPE Split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both] and it seemed promising. But after testing it I realized why I had discarded that option earlier.

It's not aiming to solve the issue here, which is to merge variants at different positions, if they occur within the same reads. But to merge and create 1 variant if there are multiple variants at the same position.

Such as: REF: TAAAAAAAAA ALT: T,TAAAAAAAAAAA,TAAAAAAAAAAAAA

I haven't seen any other option from bcftools that seem promising. I'll continue with testing the Sentieon script for now

@mathiasbio
Copy link
Collaborator Author

I took tested the sentieon merge script on the TWIST reference sample selectbengal, and put the results of my test in this google sheet: https://docs.google.com/spreadsheets/d/1pocrlClqrNoDNBAIF_hZMyurMxYe5K649e1iLXPkYhI/edit?gid=0#gid=0

As written by Sentieon the tool keeps the variants that were merged as well, setting the filter to MERGED. I checked 4 of the MNVs that it created, and compared it to the MNV called by VarDict, and in all 4 variants that I checked the final variant was the same as the one made by VarDict.

This means that if the tnscope VCF had been pre-processed with this script before merging with VarDict in my python-script, all of these variants would be merged to 1, which is the behaviour that we would prefer.

The only remaining issue is how the script handles variants with filters set. If a variant has a filter it is never merged with any other variant, but simply output as it is. This is an issue for us for at least a couple of reason that I can think of now:

  1. We have lots of variants with triallelic site set, and some of these SNVs that would constitute an MNV would have this filter set, and not be merged. (potentially we can delete this triallelic site filter and just set it to PASS, but this would lose us some info)
  2. We are in this new feature feat: add option to disable normal hardfilter #1509 trying to soft filter normal variants which means that we will allow for yet another filter which is not PASS, and these variants would never be merged into MNVs.

The script that Sentieon has shared is under a copy-right which allows for modifications however, so maybe this can be fixed with some tweaking. In which case maybe we can create a different solution and simply stack filters if any occur in separate SNVs, removing PASS and keeping the unique set.

Copyright (c) Sentieon Inc. All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:
  
  * Redistributions of source code must retain the above copyright notice, this
    list of conditions and the following disclaimer.
  
  * Redistributions in binary form must reproduce the above copyright notice,
    this list of conditions and the following disclaimer in the documentation
    and/or other materials provided with the distribution.

@mathiasbio mathiasbio removed a link to a pull request Jan 20, 2025
54 tasks
@mathiasbio mathiasbio linked a pull request Jan 20, 2025 that will close this issue
59 tasks
@mathiasbio mathiasbio moved this from Todo to In Progress in BALSAMIC Jan 20, 2025
@mathiasbio
Copy link
Collaborator Author

Replaced by user story: #1525

@github-project-automation github-project-automation bot moved this from In Progress to Completed in BALSAMIC Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
Status: Completed
Development

Successfully merging a pull request may close this issue.

1 participant