Skip to content

collaborativebioinformatics/SVHack_Mendelian

Repository files navigation

Typing SVG

   _____       _            __      __   _            _   _              
  / ____|     | |           \ \    / /  | |          | | (_)             
 | (___   __ _| |___  __ _   \ \  / /_ _| | ___ _ __ | |_ _ _ __   __ _  
  \___ \ / _` | / __|/ _` |   \ \/ / _` | |/ _ \ '_ \| __| | '_ \ / _` | 
  ____) | (_| | \__ \ (_| |    \  / (_| | |  __/ | | | |_| | | | | (_| | 
 |_____/ \__,_|_|___/\__,_|     \/ \__,_|_|\___|_| |_|\__|_|_| |_|\__,_|

                                                                         

SV detection & filtering pipeline to identify putative de novo variants from mendelian inconsistencies.

Table of Contents

  1. How to use it
  2. Installation
  3. Dependendencies
  4. Usage
  5. Input
  6. Output
  7. Workflow Diagram
  8. Background
  9. Results
  10. Citation
  11. Contributors

Background

Candidate de novo SVs can be identified from trios as variants that do not follow Mendelian inheritance patterns. Mendelian inconsistencies are identified when a child has a genotype that is not possible given the genotypes of the parents, for example, when a child is homozygous for an allele that does not exist in either parent. Mendelian inconsistency in SV calls can indicate two possibilities: challenges in SV calling leading to false positive or negative calls across the trio, or a genuine de novo SV. De novo SVs are rare, with an estimated rate of 0.16 de novo SVs per genome in healthy individuals. Despite their rarity, de novo SVs have been associated with human diseases such as Autism Spectrum Disorder, Hereditary Pulmonary Alveolar Proteinosis and Alzheimer's disease. In addition, some benchmarking studies have used the rarity of de novo structural variants to support the validity of their SV calls under the assumption than any calls inconsistent with Mendelian inheritance are incorrect (Zook et al., 2021; Parikh et at., 2016; Eberth et al.,2021). Here, we aim to investigate putative de novo SVs to either validate them as genuine, which could assist in the diagnosis of rare diseases, or use their properties to inform strategies for more accurate SV calling.

Method: Verification of de-novo SVs from trios via visualization and local assembly of complex variants

True de novo SVs are expected to be rare; however, in practice, a high rate of inconsistent SVs will be identified, indicating false positives or negatives due to noise inherent in SV calling and merging. SalsaValentina creates a ‘naive’ de novo SV candidate list, develops a QC-framework tool to filter those candidates and enables users to visualize & create a local assembly of every de novo SV locus to aid in confirmation of the variant as either a de novo SV or an incorrect call.

SalsaValentina is an integrated pipeline for Mendelian inconsistency of SVs. We demonstrate the pipeline using the Genome in a Bottle (GIAB) Ashkenazim trio ( HG002 son, HG003 father & HG004 mother) sequenced on Sequel II System with 2.0 chemistry and aligned to the GRCh38 genome reference. SVs are called using the Sniffles2 variant caller.

To merge the SV calls into a single VCF, two methods are compared: multi-sample SV calling using Sniffles2 and variant merging with SURVIVOR using default parameters (Jeffares, D. et al., 2017). Each of the resulting merged VCFs is annotated for Mendelian inconsistencies using the mendelian plugin of BCFtools (Heng Li, 2011). The positions of each SV inconsistent with Mendelian inheritance is extracted from the merged VCFs and Samplot is used to visualize the region of each variant in each member of the trio (Belyeu, J.R.,2021).

To further investigate the validity of the candidate variants, local assembly was performed by extracting reads aligned 50 kb upstream and downstream of the region of interest and assembling with Hifiasm (--primary to generate primary and alternate assembly) (Cheng, H. et al., 2021). The YASS Genomic Similarity Search Tool webserver was used to create dotplots visualizing pairwise alignments of the resulting contigs to GRCh37 to verify the deletion in HG002 (Noe, L. and Kucherov, G., 2005).

How to use it

Installation

git clone https://github.com/collaborativebioinformatics/SVHack_Mendelian.git

Dependencies

  • Sniffles2 Version 2.0.7
  • SURVIVOR Version 1.0.7
  • bcftools Version 1.17
  • samplot Version 1.3.0
  • Hifiasm Version 0.19.6

Usage

Edit the input paths in workflow/Snake.config.yaml

  • SAMPLES: A list of sample names. These should correspond to the names of the input BAM files. For example: ["HG002", "HG003", "HG004"]
  • REF_GENOME: Path to the reference FASTA file. For example: reference/genome.fa
cd workflow
snakemake --use-conda

Candidate de novo SVs can be filtered using workflow/scripts/filtering_denovo_sniffles.py and workflow/scripts/filtering_denovo_survivor.py

Local assemblies for regions of interest can be generated using workflow/scripts/localassembly.py

Input Files

/working_dir/workflow
|-- input_data/reference  # Reference genome
|   |-- genome.fa
|-- aligned  # BAM file for each sample
|   |-- HG002.bam
|   |-- HG003.bam
|   |-- HG004.bam

Output Files

/working_dir/workflow/
|-- variants  # Variant calls from Sniffles
|   |-- HG002.vcf
|   |-- HG003.vcf
|   |-- HG004.vcf
|-- snfs  # SNF files from Sniffles
|   |-- HG002.snf
|   |-- HG003.snf
|   |-- HG004.snf
|-- merged  # SV calls merged by SURVIVOR
|   |-- merged_variants.vcf
|-- analysis  # Mendelian inconsistencies
|   |-- mendelian_results.txt
|-- visualization  # Visualizations from samplot
|   |-- visualization.pdf

Workflow

Workflow

Results

SalsaValentina compares two different methods of merging SV calls within the trio: multi-sample calling using Sniffles and merging using SURVIVOR. The two methods give different numbers of overall SV calls within the trio as well as percentages of SVs that are inconsistent with Medelian inheritance. We found a total of approximately 32,000 SV calls in our merged call set using either Sniffles multi-sample calling or SURVIVOR. For Sniffles multi-sample calling, 5.2% of these were Mendelian inconsistent while for SURVIVOR 2.4% were inconsistent (Fig. 1). The different number of inconsistent SV calls between the two methods is due to differences in genotype assignment between the tools, with SURVIVOR treating some variants as missing while Sniffles reports them as reference.

A mendelian inconsistent deletion was identified in HG002 at chr7:142,786,222-142,796,849 by the Sniffles multi-sample calling method (Fig. 2). This is in the T cell receptor beta locus and thus likely the result of somatic recombination rather than a de novo germline variant. However, it can still be used to demonstrate the method. This deletion was called heterozygous with 12 reads supporting the reference and 13 supporting the variant in HG002, while it was homozygous reference supported by 45 and 44 reads respectively in HG003 and HG004. In addition, GIAB previously reported a de novo deletion in HG002 at chr17:51417826–51417932 using GRCH37 reference as part of their v0.6 SV benchmark set, which was derived from high confidence calls supported by multiple methods (Zook et al., 2020). This deletion was also identified in this study, at chr17:53,340,465-53,340,571 when using GRCH38 as reference (Fig. 3). This heterozygous deletion was supported by 30 reads and the reference at this location by 27 reads, while the parents had only reads supporting the reference allele (65 in HG003 and 72 in HG004).

Figure 1. Comparison of Mendelian Inconsistencies in Sniffles multi-sample calling and single-sample calling followed by SURVIVOR.

Discrepancy: integrated calling (Sniffles) reports 0/0 for reference alleles, Survivor reports ./. For now: Continue with Sniffles file.

Figure 2. Literature-known de novo Deletion at chr17:53340465-53340571 visualized with samplot

Figure 3. Potential de novo Deletion at chr7:142786222-142796849 visualized with samplot

Figure 4

Pairwise alignment of hg38 reference (y-axis) vs assembled contig (x-axis) using YASS.

image

Figure 5

Pairwise alignment of hg38 reference (y-axis) vs assembled contig (x-axis) as output of our pipeline using pafr. c0b211c1-1c1b-483b-b6b1-3b7d89f8d58c

Future work

- Fully automate the evaluation of SV-containing contigs

- Test assembly in the most complex genomic regions

Contributors