Duplicate entries in vcf file #10

Zoeyoungxy · 2024-12-20T04:55:27Z

I’m encountering an issue while using Sawfish for joint calling.
In the generated VCF file, I noticed that some loci appear more than once as seemingly identical entries, unlike the typical multiallelic case.
Here is an example:

chr1 122014316 sawfish:106:1718:0:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 999 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=2;HOMSEQ=TT GT:GQ:PL:AD:PS 0/1:32:32,0,232:5,1:. 1/1:12:200,12,0:0,4:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/1:2:702,0,2:1,15:.
chr1 122014316 sawfish:60:1727:1:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 72 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=6;HOMSEQ=TTGTAA GT:GQ:PL:AD:PS 0/0:15:0,15,250:5,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/0:6:0,6,100:2,0:.

In this case, both entries describe a DEL with the same SVLEN (-340) and identical POS, but the GT results for individual samples differ.
What could be the cause of such duplicate entries in the VCF file, and should I filter out one of them? If so, what criteria should I use to decide which row to keep?

Best wishes

ctsa · 2024-12-20T17:13:03Z

Thanks for checking on this example @Zoeyoungxy

These cases typically occur when the same deletion is found within 2 (or N) different contextual haplotypes. Sometimes these differences are biologically interesting, but at present they often result from sequencing and assembly noise as well. This is another aspect of scaling joint genotyping to larger cohorts (besides runtime) that will need more optimization in the future, since the rate of this phenomenon tends to increase with sample count. The filtration decision will depend on the downstream application. We'll be adding more outputs soon that make it easier to match the full assembly contig to each VCF entry, which should help in understanding these cases.
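
For readers who want to see how often this happens in their own output, here is a minimal Python sketch that groups VCF records by (CHROM, POS, REF, ALT) and reports keys that occur more than once. It is not part of Sawfish: the function names `find_duplicate_records` and `pick_highest_qual` are hypothetical, and keeping the record with the highest QUAL is only one possible criterion, assumed here for illustration. INFO is deliberately left out of the key, because the two records in the example above differ in HOMLEN and HOMSEQ.

```python
from collections import defaultdict


def find_duplicate_records(vcf_lines):
    """Return {(chrom, pos, ref, alt): [(record_id, qual), ...]} for keys seen more than once."""
    groups = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on whitespace so both tab-delimited files and pasted text work;
        # only the first six columns are needed.
        fields = line.split(None, 6)
        if len(fields) < 6:
            continue
        chrom, pos, record_id, ref, alt, qual = fields[:6]
        # A missing QUAL is written as "." in VCF; treat it as 0 (an assumption).
        qual_value = 0.0 if qual == "." else float(qual)
        groups[(chrom, int(pos), ref, alt)].append((record_id, qual_value))
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


def pick_highest_qual(records):
    """One illustrative tie-break: return the ID of the record with the highest QUAL."""
    return max(records, key=lambda rec: rec[1])[0]


def report_duplicates(path):
    """Print each duplicated key with its record IDs and the ID this sketch would keep."""
    with open(path) as handle:
        duplicates = find_duplicate_records(handle)
    for key, records in duplicates.items():
        ids = [record_id for record_id, _ in records]
        print(key[0], key[1], ids, "keep:", pick_highest_qual(records))
```

Calling `report_duplicates` on an uncompressed VCF path lists every duplicated locus. Whether to drop any record at all, and by which rule, still depends on the downstream application, as noted above.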
