NGS_DNA pipeline

Manual

The manual on installation and use is available at https://molgenis.gitbooks.io/ngs_dna

Preprocessing

During the first preprocessing steps of the pipeline, PhiX reads are inserted into each sample to create control SNPs in the dataset. Next, the Illumina quality encoding of the reads is checked and QC metrics are calculated using FastQC¹.
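
The encoding check can be illustrated with a minimal sketch. It is not the pipeline's own implementation: the function name `guess_phred_offset` and the cut-off values are assumptions based on the commonly described ASCII ranges of Phred+33 and Phred+64 FASTQ quality strings.

```python
def guess_phred_offset(quality_strings):
    """Guess the FASTQ quality offset (33 or 64) from the observed characters.

    Hypothetical helper for illustration only. Returns 33, 64, or None when
    the observed character range fits both encodings.
    """
    codes = [ord(ch) for line in quality_strings for ch in line]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters below ';' (ASCII 59) do not occur in the +64 encodings.
        return 33
    if max(codes) > 74:
        # Characters above 'J' (ASCII 74) are unusual for Phred+33 Illumina data.
        return 64
    # Everything falls in the overlapping range, so more reads are needed.
    return None


if __name__ == "__main__":
    print(guess_phred_offset(["IIII", "!!#5"]))
    print(guess_phred_offset(["hhhh", "ffgh"]))
```

In practice such a check would look at many reads before deciding, because a handful of high-quality reads can fall entirely in the range shared by both encodings.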

Alignment to a reference genome

The `bwa mem` command of the Burrows-Wheeler Aligner (BWA)² is used to align the sequence data to a reference genome, resulting in a SAM (Sequence Alignment Map) file. The reads in the SAM file are sorted with Sambamba³, resulting in a sorted BAM file. When multiple lanes are used during sequencing, all lane BAMs are merged into a single sample BAM using Sambamba. Duplicates of the same read pair are then marked in the (merged) BAM file, again using Sambamba.
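
The order of these steps can be sketched as a small planner that only builds the command lines and does not run them. This is a sketch under stated assumptions, not the pipeline's actual protocols: the function name `alignment_commands`, the file naming scheme, the thread count and the explicit SAM-to-BAM conversion with `sambamba view` are all hypothetical choices made for illustration.

```python
def alignment_commands(sample, lanes, reference, threads=4):
    """Plan the alignment steps for one sample.

    `lanes` maps a lane name to a (read 1 FASTQ, read 2 FASTQ) tuple.
    Returns a list of (argv, stdout_path) tuples; stdout_path is None when
    the tool writes its own output file. Hypothetical helper, nothing is run.
    """
    steps = []
    sorted_bams = []
    for lane, (fastq_r1, fastq_r2) in lanes.items():
        prefix = f"{sample}.{lane}"
        # bwa mem writes SAM to standard output, so record where to redirect it.
        steps.append((["bwa", "mem", "-t", str(threads), reference,
                       fastq_r1, fastq_r2], f"{prefix}.sam"))
        # sambamba sort expects BAM input, so convert the SAM file first.
        steps.append((["sambamba", "view", "-S", "-f", "bam",
                       "-o", f"{prefix}.bam", f"{prefix}.sam"], None))
        steps.append((["sambamba", "sort", "-t", str(threads),
                       "-o", f"{prefix}.sorted.bam", f"{prefix}.bam"], None))
        sorted_bams.append(f"{prefix}.sorted.bam")

    if len(sorted_bams) > 1:
        # Several lanes: merge the lane BAMs into one sample BAM.
        sample_bam = f"{sample}.merged.bam"
        steps.append((["sambamba", "merge", sample_bam] + sorted_bams, None))
    else:
        # Single lane: the sorted lane BAM already is the sample BAM.
        sample_bam = sorted_bams[0]

    steps.append((["sambamba", "markdup", sample_bam,
                   f"{sample}.dedup.bam"], None))
    return steps


if __name__ == "__main__":
    lanes = {"L1": ("s1_L1_R1.fq.gz", "s1_L1_R2.fq.gz"),
             "L2": ("s1_L2_R1.fq.gz", "s1_L2_R2.fq.gz")}
    for argv, stdout_path in alignment_commands("s1", lanes, "ref.fa"):
        suffix = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + suffix)
```

Returning argument lists instead of running the tools keeps the sketch testable without BWA or Sambamba installed; each tuple could be passed to `subprocess.run` in a real setup.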

Variant discovery

The GATK⁴ HaplotypeCaller estimates the most likely genotypes and allele frequencies in an alignment using a Bayesian likelihood model. It does so for every position of the genome, regardless of whether a variant was detected at that site or not. This information is used later in the project-based genotyping step, in which all samples in the project are analyzed jointly. The joint analysis yields a posterior probability of a variant allele at each site. SNPs and small indels are written to a VCF file, along with information such as genotype quality, allele frequency, strand bias and read depth for each SNP or indel. Based on quality thresholds from the GATK "best practices"⁵, the SNPs and indels are filtered and marked as Lowqual or Pass, resulting in a final VCF file.
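
The final filtering step can be illustrated with a minimal sketch. It is not the pipeline's own filter: the function name `classify_snp` is hypothetical, and the default thresholds on the `QD`, `FS` and `MQ` annotations are illustrative values modelled on commonly cited GATK hard-filtering recommendations for SNPs. The thresholds the pipeline actually applies are not stated here.

```python
def classify_snp(info, qd_min=2.0, fs_max=60.0, mq_min=40.0):
    """Label one SNP as "Pass" or "Lowqual" from its VCF INFO annotations.

    `info` maps annotation names to numeric values. The thresholds are
    assumptions for illustration. A missing annotation is treated as passing,
    which is also an assumption. Returns (label, names of failed annotations).
    """
    failed = []
    if "QD" in info and info["QD"] < qd_min:
        failed.append("QD")   # low variant confidence relative to depth
    if "FS" in info and info["FS"] > fs_max:
        failed.append("FS")   # strong strand bias
    if "MQ" in info and info["MQ"] < mq_min:
        failed.append("MQ")   # poorly mapped supporting reads
    return ("Lowqual" if failed else "Pass"), failed


if __name__ == "__main__":
    print(classify_snp({"QD": 12.3, "FS": 1.2, "MQ": 60.0}))
    print(classify_snp({"QD": 1.5, "FS": 75.0, "MQ": 60.0}))
```

Indels would normally get their own set of thresholds; the sketch covers SNPs only to keep the example short.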

References

1. Andrews S (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
2. Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform.
3. Tarasov A et al. (2015). Sambamba: Fast processing of NGS alignment formats.
4. McKenna A et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
5. Van der Auwera GA et al. (2013). From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

