Kids First Data Resource Center Germline Variant Workflow

The Kids First Data Resource Center (KFDRC) Germline Variant Workflow is a Common Workflow Language (CWL) implementation that generates variant calls from an aligned reads BAM or CRAM file. The workflow uses copy number, single nucleotide, and structural variant calling software to call variants. Annotation is performed on the single nucleotide and structural variants.

Relevant Software and Versions

Callers

Annotators

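| Method    | CNV | SNV | SV | Annotation |
|-----------|-----|-----|----|------------|
| CNVnator  | x   |     |    |            |
| GATK gCNV | x   |     |    |            |
| Freebayes |     | x   |    | VEP/gnomAD |
| GATK gSNV |     | x   |    | VEP/gnomAD |
| Strelka2  |     | x   |    | VEP/gnomAD |
| Manta     |     |     | x  | AnnotSV    |
| SvABA     |     |     | x  | AnnotSV    |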

CNV

The CNV portion of this workflow uses CNVnator and GATK gCNV. For more information see the CNV module documentation.

SNV

The SNV portion of this workflow uses Freebayes, GATK gSNV, and Strelka2. Additional annotation is performed using VEP, CADD, gnomAD, ClinVar, InterVar, and dbNSFP. For more information see the SNV module documentation.

SV

The SV portion of this workflow uses Manta and SvABA. Additional annotation is performed using AnnotSV. For more information see the SV module documentation.

Enabling and Disabling Callers

Any of the individual subworkflows can be enabled or disabled at runtime. By default, all subworkflows are run. To disable a module, set the control boolean for that module to false. The control booleans are:

  • run_gatk_gcnv
  • run_cnvnator
  • run_gatk_gsnv
  • run_freebayes
  • run_strelka
  • run_svaba
  • run_manta
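
As a minimal sketch of how these booleans might be set, the Python below builds the entries of a CWL job-input file that switch off selected callers. The flag names come from the list above; the helper name `caller_overrides` and the choice to emit only the disabled flags are assumptions made for illustration, not part of the workflow.

```python
import json

# Control booleans listed in this README; every caller runs by default.
CALLER_FLAGS = [
    "run_gatk_gcnv",
    "run_cnvnator",
    "run_gatk_gsnv",
    "run_freebayes",
    "run_strelka",
    "run_svaba",
    "run_manta",
]


def caller_overrides(disabled):
    """Return job-input entries that switch off the named callers.

    Hypothetical helper: it only guards against misspelled flag names.
    """
    unknown = set(disabled) - set(CALLER_FLAGS)
    if unknown:
        raise ValueError(f"unknown control booleans: {sorted(unknown)}")
    # Emit only the flags being turned off; the rest keep their defaults.
    return {flag: False for flag in CALLER_FLAGS if flag in disabled}


# Example: keep the SNV callers and disable the CNV and SV modules.
overrides = caller_overrides(
    {"run_gatk_gcnv", "run_cnvnator", "run_svaba", "run_manta"}
)
print(json.dumps(overrides, indent=2))
```

The resulting dictionary can be merged into the rest of the job inputs before the file is written out as JSON or YAML.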

Disabling GATK gVCF Creation

When running GATK gSNV, a gVCF file is required. By default, the workflow creates this file from the aligned_reads input, which is resource and time intensive. If the user already has a gVCF generated from the aligned_reads file, gVCF creation can be skipped. To skip gVCF creation, the user must provide the associated gVCF in the input_gvcf input.
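
The choice described above could be expressed in a job-input file roughly as follows. This is a sketch under assumptions: the file paths are hypothetical, the helper `gvcf_inputs` is illustrative, and any index files the workflow may expect alongside these inputs are omitted. Only the input names `aligned_reads` and `input_gvcf` come from this README; the `class`/`path` layout is the standard CWL way to describe a File input.

```python
import json


def gvcf_inputs(aligned_reads, existing_gvcf=None):
    """Build the read-related job inputs, reusing a gVCF when one is supplied."""
    inputs = {"aligned_reads": {"class": "File", "path": aligned_reads}}
    if existing_gvcf is not None:
        # Supplying input_gvcf is what lets the workflow skip gVCF creation.
        inputs["input_gvcf"] = {"class": "File", "path": existing_gvcf}
    return inputs


# Hypothetical paths, for illustration only.
job = gvcf_inputs("sample.cram", existing_gvcf="sample.g.vcf.gz")
print(json.dumps(job, indent=2))
```

Calling `gvcf_inputs("sample.cram")` without a second argument leaves `input_gvcf` out, so the workflow would fall back to generating the gVCF from the reads.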

Input Files

  • Universal

    • aligned_reads: The germline BAM/CRAM input that has been aligned to a reference genome.
    • indexed_reference_fasta: The reference genome fasta (and associated indices) to which the germline BAM/CRAM was aligned.
  • Copy Number Variants

    • Calling
      • cnv_intervals/cnv_blacklist_intervals: Intervals to include or exclude from CNV analysis
      • contig_ploidy_model_tar: The contig-ploidy model directory generated by the DetermineGermlineContigPloidyCohortMode task in the Cohort workflow.
      • gcnv_model_tars: Array of tars of the gCNV model directories generated by the GermlineCNVCallerCohortMode tasks in the Cohort workflow.
  • Single Nucleotide Variants

    • Calling

      • snv_calling_regions: File, in BED or INTERVAL_LIST format, containing a set of genomic regions over which SNVs will be called.
      • snv_unpadded_intervals_file: Hand-curated intervals over which the gVCF will be genotyped to create a VCF.
      • snv_evaluation_interval_list: Evaluation regions for gVCF and known sites VCF metrics.
      • contamination_sites_bed: .bed file for markers used in contamination analysis
      • contamination_sites_mu: .mu matrix file of genotype matrix
      • contamination_sites_ud: .UD matrix file from SVD result of genotype matrix
      • dbsnp_vcf: Population resource used for both indel and SNP recalibration as well as gVCF/VCF metrics; available from GATK
      • axiomPoly_resource_vcf: Population resource used for indel recalibration; available from GATK
      • mills_resource_vcf: Population resource used for indel recalibration; available from GATK
      • hapmap_resource_vcf: Population resource used for SNP recalibration; available from GATK
      • omni_resource_vcf: Population resource used for SNP recalibration; available from GATK
      • one_thousand_genomes_resource_vcf: Population resource used for SNP recalibration; available from GATK
      • ped: PED file to establish familial relationships. For a single sample, this file is a single line. For example, if you are providing only a single CRAM from NA12878, the ped file would look like this: NA128 NA12878 0 0 2 2
    • Annotation

      Recommended:

      • vep_cache: TAR.GZ cache from Ensembl, or a locally converted cache
      • echtvar_anno_zips: echtvar-formatted gnomAD v3.1.1 reference. See annotation docs for more info

      Optional:

      • clinvar_annotation_vcf: ClinVar VCF used for annotation
      • dbnsfp: VEP-formatted plugin file, index, and readme file containing dbNSFP annotations
      • cadd_indels: VEP-formatted plugin file and index containing CADD indel annotations
      • cadd_snvs: VEP-formatted plugin file and index containing CADD SNV annotations
      • intervar: InterVar VCF-formatted file. Covers exonic SNVs only; for more comprehensive results, run InterVar. See docs for custom build instructions
  • Structural Variants

    • Annotation
      • annotsv_annotations_dir: These annotations are simply those from the install-human-annotation installation process run during AnnotSV installation (see: https://github.com/lgmgeo/AnnotSV/#quick-installation). Specifically these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be slotted in here as needed.
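
For the ped input described under Single Nucleotide Variants, a single-sample file can be written by hand or generated. The sketch below assumes the conventional six-column PED layout (family ID, sample ID, father, mother, sex, phenotype, with 0 meaning unknown or absent) and uses tabs as the separator, which is an assumption since the README example does not show the delimiter. The helper `ped_line` is hypothetical; the example values are the ones given in this README.

```python
def ped_line(family_id, sample_id, father="0", mother="0", sex="0", phenotype="0"):
    """Format one PED record: family, sample, father, mother, sex, phenotype."""
    fields = [family_id, sample_id, father, mother, sex, phenotype]
    for field in fields:
        # PED is whitespace-delimited, so empty or space-containing
        # values would shift the columns.
        if not field or any(ch.isspace() for ch in field):
            raise ValueError(f"invalid PED field: {field!r}")
    return "\t".join(fields)


# The single-sample NA12878 example from this README.
line = ped_line("NA128", "NA12878", sex="2", phenotype="2")
print(line)
```

Writing `line` followed by a newline to a file gives the one-line PED described for the single-sample case.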

Output Files

  • Copy Number Variant
    • GATK
      • gatk_gcnv_genotyped_intervals_vcfs: Per-sample VCF files providing a detailed listing of the most likely copy-number call for each genomic interval included in the call-set, along with call quality, call genotype, and the phred-scaled posterior probability vector for all integer copy-number states.
      • gatk_gcnv_genotyped_segments_vcfs: Per-sample VCF files containing coalesced contiguous intervals that share the same copy-number call
      • gatk_gcnv_denoised_copy_ratios: Per-sample files concatenating posterior means for denoised copy ratios from all the call shards produced by the GermlineCNVCaller.
    • CNVnator
      • cnvnator_vcf: Called CNVs in VCF format by CNVnator
      • cnvnator_called_cnvs: Called CNVs from aligned_reads by CNVnator
      • cnvnator_average_rd: Average RD stats by CNVnator
    • Annotation (AnnotSV)
      • cnvnator_annotated_cnvs: This file contains all records from the cnvnator_vcf that AnnotSV could annotate.
      • gatk_gcnv_annotated_genotyped_segments: Per-sample TSV files containing AnnotSV-annotated CNVs from gatk_gcnv_genotyped_segments_vcfs
  • Single Nucleotide Variant
    • Freebayes
      • freebayes_unfiltered_vcf: Raw variants output from freebayes
    • GATK
      • gatk_gvcf: GATK HaplotypeCaller generated gVCF
      • gatk_gvcf_metrics: GATK/Picard variant calling detail and summary metrics for the gVCF
      • gatk_vcf_metrics: GATK/Picard variant calling detail and summary metrics for the known sites VCF
      • peddy_html: HTML metrics files from Peddy
      • peddy_csv: CSV metrics for het_check, ped_check, and sex_check from Peddy
      • peddy_ped: PED file with additional metrics information from Peddy
      • verifybamid_output: VerifyBAMID output, including contamination score
    • Strelka2
      • strelka2_prepass_variants: Raw variants output from Strelka2
      • strelka2_gvcfs: gVCF output from Strelka2
    • Annotation (VEP/gnomAD)
      • vep_annotated_gatk_vcf: VQSR, Hard-filtered, and VEP annotated known sites VCF
      • vep_annotated_freebayes_vcf: Quality filtered and VEP annotated freebayes_unfiltered_vcf
      • vep_annotated_strelka_vcf: Pass filtered and VEP annotated strelka2_prepass_variants
  • Structural Variant
    • Manta
      • manta_svs: Structural Variants called by Manta
      • manta_indels: Small INDELs called by Manta
    • SvABA
      • svaba_svs: Structural Variants called by SvABA
      • svaba_indels: Small INDELs called by SvABA
    • Annotation (AnnotSV)
      • manta_annotated_svs: This file contains all records from the manta_svs that AnnotSV could annotate.
      • svaba_annotated_svs: This file contains all records from the svaba_svs that AnnotSV could annotate.

Basic Info

References