The Kids First Data Resource Center (KFDRC) Germline Variant Workflow is a Common Workflow Language (CWL) implementation that generates variant calls from an aligned reads BAM or CRAM file. The workflow uses copy number, single nucleotide, and structural variant calling software to call variants. Annotation is performed on the single nucleotide and structural variants.
- Ensembl VEP: `105`
- gnomAD: `3.1.1`
- AnnotSV: `3.1.1`
| Method | CNV | SNV | SV | Annotation |
|---|---|---|---|---|
| CNVnator | x | | | |
| GATK gCNV | x | | | |
| Freebayes | | x | | VEP/gnomAD |
| GATK gSNV | | x | | VEP/gnomAD |
| Strelka2 | | x | | VEP/gnomAD |
| Manta | | | x | AnnotSV |
| SvABA | | | x | AnnotSV |
The CNV portion of this workflow uses CNVnator and GATK gCNV. For more information see the CNV module documentation.
The SNV portion of this workflow uses Freebayes, GATK gSNV, and Strelka2. Additional annotation is performed using VEP, CADD, gnomAD, ClinVar, InterVar, and dbNSFP. For more information see the SNV module documentation.
The SV portion of this workflow uses Manta and SvABA. Additional annotation is performed using AnnotSV. For more information see the SV module documentation.
Any of the individual subworkflows can be enabled or disabled at runtime. By default, all subworkflows are run. To disable a module, set the control boolean for that module to false. The control booleans are:
- `run_gatk_gcnv`
- `run_cnvnator`
- `run_gatk_gsnv`
- `run_freebayes`
- `run_strelka`
- `run_svaba`
- `run_manta`
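As a minimal sketch of how the control booleans might be set, the Python below builds the boolean portion of a job file and prints it as JSON. It assumes the workflow is launched with a JSON or YAML inputs file, as is typical for CWL runners; the helper name `build_module_flags` and the choice of which callers to skip are hypothetical.

```python
import json

# Control booleans listed in this README; the README states that all
# subworkflows run by default.
CONTROL_BOOLEANS = (
    "run_gatk_gcnv",
    "run_cnvnator",
    "run_gatk_gsnv",
    "run_freebayes",
    "run_strelka",
    "run_svaba",
    "run_manta",
)


def build_module_flags(disabled):
    """Return every control boolean, set to False for the disabled modules."""
    unknown = set(disabled) - set(CONTROL_BOOLEANS)
    if unknown:
        # Catch typos early instead of silently running every module.
        raise ValueError(f"Unknown control booleans: {sorted(unknown)}")
    return {name: name not in disabled for name in CONTROL_BOOLEANS}


if __name__ == "__main__":
    # Hypothetical choice: skip both structural variant callers.
    flags = build_module_flags({"run_svaba", "run_manta"})
    print(json.dumps(flags, indent=2))
```

The printed object can be merged with the file inputs of a run. Listing all seven flags explicitly, rather than only the disabled ones, keeps the job file self-documenting.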
When running GATK gSNV, a gVCF file is required. By default, the workflow creates this file from the `aligned_reads` input, which is resource- and time-intensive. If you already have a gVCF generated from the `aligned_reads` file, gVCF creation can be skipped. To skip gVCF creation, provide the associated gVCF in the `input_gvcf` input.
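The gVCF shortcut described above could be expressed in a job file roughly as follows. This is a sketch only: it assumes `aligned_reads` and `input_gvcf` are declared as CWL `File` inputs, the helper name `gsnv_inputs` and the file paths are hypothetical, and any index files the workflow may require alongside these inputs are not shown.

```python
import json


def gsnv_inputs(aligned_reads_path, input_gvcf_path=None):
    """Build the gSNV-related part of a job file using CWL File objects."""
    inputs = {
        "run_gatk_gsnv": True,
        "aligned_reads": {"class": "File", "path": aligned_reads_path},
    }
    if input_gvcf_path is not None:
        # Supplying an existing gVCF is what lets the workflow skip
        # gVCF creation, per the README.
        inputs["input_gvcf"] = {"class": "File", "path": input_gvcf_path}
    return inputs


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    job = gsnv_inputs("sample.cram", "sample.g.vcf.gz")
    print(json.dumps(job, indent=2))
```

Leaving `input_gvcf` out of the job file entirely, rather than setting it to an empty value, is the conservative way to request the default behavior.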
- Universal
  - `aligned_reads`: The germline BAM/CRAM input that has been aligned to a reference genome.
  - `indexed_reference_fasta`: The reference genome FASTA (and associated indices) to which the germline BAM/CRAM was aligned.
- Copy Number Variants
  - Calling
    - `cnv_intervals`/`cnv_blacklist_intervals`: Intervals to include or exclude from CNV analysis.
    - `contig_ploidy_model_tar`: The contig-ploidy model directory generated by the DetermineGermlineContigPloidyCohortMode task in the Cohort workflow.
    - `gcnv_model_tars`: Array of tars of the contig-ploidy model directories generated by the GermlineCNVCallerCohortMode tasks in the Cohort workflow.
- Single Nucleotide Variants
  - Calling
    - `snv_calling_regions`: File, in BED or INTERVALLIST format, containing a set of genomic regions over which SNVs will be called.
    - `snv_unpadded_intervals_file`: Hand-curated intervals over which the gVCF will be genotyped to create a VCF.
    - `snv_evaluation_interval_list`: Evaluation regions for gVCF and known sites VCF metrics.
    - `contamination_sites_bed`: .bed file for markers used in contamination analysis.
    - `contamination_sites_mu`: .mu matrix file of genotype matrix.
    - `contamination_sites_ud`: .UD matrix file from SVD result of genotype matrix.
    - `dbsnp_vcf`: Population resource used for both indel and SNP recalibration as well as gVCF/VCF metrics; available from GATK.
    - `axiomPoly_resource_vcf`: Population resource used for indel recalibration; available from GATK.
    - `mills_resource_vcf`: Population resource used for indel recalibration; available from GATK.
    - `hapmap_resource_vcf`: Population resource used for SNP recalibration; available from GATK.
    - `omni_resource_vcf`: Population resource used for SNP recalibration; available from GATK.
    - `one_thousand_genomes_resource_vcf`: Population resource used for SNP recalibration; available from GATK.
    - `ped`: Ped file to establish familial relationship. For a single sample, this file is a single line. For example, if you are handing in only a single CRAM from NA12878, the ped file would look like this: `NA128 NA12878 0 0 2 2`
  - Annotation
    - Recommended:
      - `vep_cache`: TAR.GZ cache from Ensembl/local converted cache.
      - `echtvar_anno_zips`: echtvar-formatted gnomAD v3.1.1 reference. See annotation docs for more info.
    - Optional:
      - `clinvar_annotation_vcf`: ClinVar VCF used for annotation.
      - `dbnsfp`: VEP-formatted plugin file, index, and readme file containing dbNSFP annotations.
      - `cadd_indels`: VEP-formatted plugin file and index containing CADD indel annotations.
      - `cadd_snvs`: VEP-formatted plugin file and index containing CADD SNV annotations.
      - `intervar`: InterVar VCF-formatted file. Exonic SNVs only; for more comprehensive results, run InterVar. See docs for custom build instructions.
- Structural Variants
  - Annotation
    - `annotsv_annotations_dir`: These annotations are those from the install-human-annotation installation process run during AnnotSV installation (see: https://github.com/lgmgeo/AnnotSV/#quick-installation). Specifically, these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be slotted in here as needed.
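The single-line ped file described for the `ped` input can be written by hand, or generated with a small helper like the sketch below. The six columns follow the standard PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype). The function name `ped_line` and the tab separator are assumptions; the README example only shows the fields separated by whitespace.

```python
def ped_line(family_id, sample_id, father_id="0", mother_id="0",
             sex=2, phenotype=2, sep="\t"):
    """Format one PED record.

    Standard PED coding: sex 1 = male, 2 = female, 0 = unknown;
    phenotype 1 = unaffected, 2 = affected; a parent ID of "0" means
    the parent is not present in the file.
    """
    fields = [family_id, sample_id, father_id, mother_id,
              str(sex), str(phenotype)]
    return sep.join(fields)


if __name__ == "__main__":
    # Reproduces the fields of the single-sample NA12878 example.
    # The output file name is hypothetical.
    with open("NA12878.ped", "w") as handle:
        handle.write(ped_line("NA128", "NA12878") + "\n")
```

For a trio or larger family, one such line per individual would be written, with the parental IDs pointing at the other records.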
- Copy Number Variant
  - GATK
    - `gatk_gcnv_genotyped_intervals_vcfs`: Per-sample VCF files providing a detailed listing of the most likely copy-number call for each genomic interval included in the call-set, along with call quality, call genotype, and the phred-scaled posterior probability vector for all integer copy-number states.
    - `gatk_gcnv_genotyped_segments_vcfs`: Per-sample VCF files containing coalesced contiguous intervals that share the same copy-number call.
    - `gatk_gcnv_denoised_copy_ratios`: Per-sample files that concatenate posterior means for denoised copy ratios from all the call shards produced by the GermlineCNVCaller.
  - CNVnator
    - `cnvnator_vcf`: Called CNVs in VCF format by CNVnator.
    - `cnvnator_called_cnvs`: Called CNVs from `aligned_reads` by CNVnator.
    - `cnvnator_average_rd`: Average RD stats by CNVnator.
  - Annotation (AnnotSV)
    - `cnvnator_annotated_cnvs`: This file contains all records from the `cnvnator_vcf` that AnnotSV could annotate.
    - `gatk_gcnv_annotated_genotyped_segments`: Per-sample TSV files containing AnnotSV-annotated CNVs from `gatk_gcnv_genotyped_segments_vcfs`.
- Single Nucleotide Variant
  - Freebayes
    - `freebayes_unfiltered_vcf`: Raw variants output from Freebayes.
  - GATK
    - `gatk_gvcf`: GATK HaplotypeCaller-generated gVCF.
    - `gatk_gvcf_metrics`: GATK/Picard variant calling detail and summary metrics for the gVCF.
    - `gatk_vcf_metrics`: GATK/Picard variant calling detail and summary metrics for the known sites VCF.
    - `peddy_html`: HTML metrics files from Peddy.
    - `peddy_csv`: CSV metrics for het_check, ped_check, and sex_check from Peddy.
    - `peddy_ped`: PED file with additional metrics information from Peddy.
    - `verifybamid_output`: VerifyBAMID output, including contamination score.
  - Strelka2
    - `strelka2_prepass_variants`: Raw variants output from Strelka2.
    - `strelka2_gvcfs`: gVCF output from Strelka2.
  - Annotation (VEP/gnomAD)
    - `vep_annotated_gatk_vcf`: VQSR, hard-filtered, and VEP-annotated known sites VCF.
    - `vep_annotated_freebayes_vcf`: Quality-filtered and VEP-annotated `freebayes_unfiltered_vcf`.
    - `vep_annotated_strelka_vcf`: Pass-filtered and VEP-annotated `strelka2_prepass_variants`.
- Structural Variant
  - Manta
    - `manta_svs`: Structural variants called by Manta.
    - `manta_indels`: Small INDELs called by Manta.
  - SvABA
    - `svaba_svs`: Structural variants called by SvABA.
    - `svaba_indels`: Small INDELs called by SvABA.
  - Annotation (AnnotSV)
    - `manta_annotated_svs`: This file contains all records from the `manta_svs` that AnnotSV could annotate.
    - `svaba_annotated_svs`: This file contains all records from the `svaba_svs` that AnnotSV could annotate.
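Because modules can be switched off, it can help to know which of the outputs listed above to expect from a given run. The sketch below groups the output names by control boolean, following the caller groupings in the list. It assumes that disabling a caller also removes that caller's annotated outputs, and it does not model whether `gatk_gvcf` is still emitted when `input_gvcf` is supplied; neither point is confirmed by this README. The name `expected_outputs` is hypothetical.

```python
# Output names grouped by the control boolean of the caller they come from.
# Annotated outputs are assigned to the caller whose calls they annotate.
OUTPUTS_BY_FLAG = {
    "run_gatk_gcnv": [
        "gatk_gcnv_genotyped_intervals_vcfs",
        "gatk_gcnv_genotyped_segments_vcfs",
        "gatk_gcnv_denoised_copy_ratios",
        "gatk_gcnv_annotated_genotyped_segments",
    ],
    "run_cnvnator": [
        "cnvnator_vcf",
        "cnvnator_called_cnvs",
        "cnvnator_average_rd",
        "cnvnator_annotated_cnvs",
    ],
    "run_freebayes": ["freebayes_unfiltered_vcf", "vep_annotated_freebayes_vcf"],
    "run_gatk_gsnv": [
        "gatk_gvcf",
        "gatk_gvcf_metrics",
        "gatk_vcf_metrics",
        "peddy_html",
        "peddy_csv",
        "peddy_ped",
        "verifybamid_output",
        "vep_annotated_gatk_vcf",
    ],
    "run_strelka": [
        "strelka2_prepass_variants",
        "strelka2_gvcfs",
        "vep_annotated_strelka_vcf",
    ],
    "run_manta": ["manta_svs", "manta_indels", "manta_annotated_svs"],
    "run_svaba": ["svaba_svs", "svaba_indels", "svaba_annotated_svs"],
}


def expected_outputs(flags):
    """List output names for enabled modules; missing flags count as enabled."""
    names = []
    for flag, outputs in OUTPUTS_BY_FLAG.items():
        if flags.get(flag, True):
            names.extend(outputs)
    return names


if __name__ == "__main__":
    # Hypothetical run with Manta disabled.
    for name in expected_outputs({"run_manta": False}):
        print(name)
```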
- D3b dockerfiles
- Testing Tools:
- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0