Main genome analytics workflow powering the production analysis of WGS samples for the Singapore NPM Program Phase 1A (AKA SG10K Health)

gis-rpd/rpd-sg10k-grch38-gatk4-gvcf-freebayes-vcf

SG10K Health: GRCh38 GATK4-gVCF Freebayes-VCF

Introduction

This is the main genome analytics workflow powering the production analysis of whole genome samples for the Singapore National Precision Medicine (NPM) Program Phase 1A, sometimes also referred to as SG10K Health. It processes samples from FASTQ to lossless CRAM and computes multiple QC metrics, Freebayes variant calls and GATK4 gVCFs.

To ensure reproducibility, scalability and mobility, the workflow is implemented as a Nextflow recipe and uses containers (Singularity on NSCC's Aspire 1 and Docker on AWS Batch). Bioconda simplifies container building.

Output

All results can be found in the results folder of a pipeline execution. Results there are grouped per sample, with the exception of Goleft indexcov, which summarises over the sample set.

Main results

  • GATK4 gVCF (indexed): {sample}/{sample}.g.vcf.gz
  • Freebayes VCF (Q>=20; indexed): {sample}/{sample}.fb.vcf.gz
  • CRAM (lossless, with OQ, indexed): {sample}/{sample}.bqsr.cram

QC etc.

  • Goleft indexcov: indexcov/all/ (main file indexcov/all/all.html)
  • Samtools stats: {sample}/stats/ (main files: {sample}/stats/{sample}.stats and {sample}/stats/{sample}.html)
  • Verifybamid for the three ethnicities: {sample}/verifybamid/ (main files: {sample}/verifybamid/{sample}.SGVP_MAF0.01.{ethnicity}.selfSM)
  • Coverage as per SOP: {sample}/{sample}.cov-062017.txt
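
The per-sample layout listed above can be checked programmatically. The following is a minimal sketch, not part of the workflow itself: the function names `expected_outputs` and `missing_outputs` are hypothetical, and the results folder location and sample ID are whatever you pass in. Index files and the Verifybamid outputs are left out because their exact file names (index suffixes, ethnicity labels) are not spelled out in this README.

```python
from pathlib import Path


def expected_outputs(results_dir, sample):
    """Return the documented per-sample output paths for one sample.

    Only paths spelled out in this README are included (hypothetical helper).
    """
    base = Path(results_dir) / sample
    return {
        "gvcf": base / f"{sample}.g.vcf.gz",
        "freebayes_vcf": base / f"{sample}.fb.vcf.gz",
        "cram": base / f"{sample}.bqsr.cram",
        "samtools_stats": base / "stats" / f"{sample}.stats",
        "samtools_stats_html": base / "stats" / f"{sample}.html",
        "coverage": base / f"{sample}.cov-062017.txt",
    }


def missing_outputs(results_dir, sample):
    """List the names of documented outputs that are not present on disk."""
    return sorted(
        name
        for name, path in expected_outputs(results_dir, sample).items()
        if not path.is_file()
    )
```

Calling `missing_outputs("results", "SAMPLE_ID")` after a run returns an empty list when every listed file exists for that sample; `SAMPLE_ID` is a placeholder.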

Notes

  • We share this code for transparency. This is not meant to be a generic whole genome workflow for wider use, but rather specific to the program's needs. For the same reason this documentation is rudimentary.
  • See this file for the execution DAG
  • GATK commandline parameters are based on the official WDL implementation
  • Developers: work on devel or feature branches. Only merge to master if tests/run.sh completes successfully

Authors

The workflow was implemented in the Genome Institute of Singapore (GIS) by:
