To make this case study faster to run through, we restrict it to chromosome 20. Docker is used to run DeepVariant and hap.py.
We use GRCh38 as the reference for this case study.
mkdir -p reference
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.fai
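As an optional sanity check (this assumes samtools is installed locally; it is not needed for the rest of the case study), you can confirm that chr20 is present in the downloaded reference:
# Optional: confirm chr20 is listed in the index, then extract a small region.
awk '$1 == "chr20"' reference/GRCh38_no_alt_analysis_set.fasta.fai
samtools faidx reference/GRCh38_no_alt_analysis_set.fasta chr20:1000000-1000060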
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG003.
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
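Optionally, you can peek at the truth set to confirm it covers chr20 (this uses only zcat and awk, which should already be available):
# Optional: count truth variants on chr20.
zcat benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | awk '!/^#/ && $1 == "chr20"' | wc -l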
The BAM file we use here contains HG003 Illumina WGS reads publicly available from the PrecisionFDA Truth v2 Challenge, mapped with VG Giraffe. See this document for how the reads were mapped.
We'll download the BAM file to use:
mkdir -p input
HTTPDIR=https://storage.googleapis.com/deepvariant/vg-case-study
curl ${HTTPDIR}/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam > input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam
curl ${HTTPDIR}/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam.bai > input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam.bai
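If samtools is available (optional; DeepVariant itself does not need it), you can verify that the BAM downloaded intact and glance at its header:
# Optional: verify BAM integrity and inspect the header.
samtools quickcheck input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam && echo "BAM looks OK"
samtools view -H input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam | head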
Pangenome-aware DeepVariant also needs the pangenome itself, so we download the HPRC v1.1 Minigraph-Cactus GRCh38 GBZ:
mkdir -p input
HTTPDIR=https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38
curl ${HTTPDIR}/hprc-v1.1-mc-grch38.gbz > input/hprc-v1.1-mc-grch38.gbz
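Before starting the run, it is worth confirming that both the BAM and the pangenome GBZ downloaded completely, since the GBZ is a large file:
# Optional: list the downloaded inputs and their sizes.
ls -lh input/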
The DeepVariant pipeline consists of 3 steps: make_examples, call_variants, and postprocess_variants. You can now run DeepVariant with one command using the run_pangenome_aware_deepvariant script.
In this example, we used an n2-standard-96 machine.
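The run command below shards work across all available cores via $(nproc); you can check that count on your machine first:
# Optional: see how many CPU cores --num_shards=$(nproc) will use.
nproc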
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="pangenome_aware_deepvariant-1.8.0"
sudo docker pull google/deepvariant:"${BIN_VERSION}"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
--shm-size 12gb \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_pangenome_aware_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam \
--pangenome /input/hprc-v1.1-mc-grch38.gbz \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--intermediate_results_dir /output/intermediate_results_dir
By specifying --model_type WGS, you'll be using a model that is best suited for short-read WGS data.
NOTE: If you want to run each of the steps separately, add --dry_run=true to the command above to figure out what flags you need in each step.
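As a sketch, the same command with --dry_run=true appended prints the per-step commands (with their flags) instead of executing them:
# Print the make_examples, call_variants, and postprocess_variants commands
# without running them (--dry_run=true; all other flags are unchanged).
sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_pangenome_aware_deepvariant \
  --model_type WGS \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.pcr-free.35x.vg-1.55.0.chr20.bam \
  --pangenome /input/hprc-v1.1-mc-grch38.gbz \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --regions chr20 \
  --intermediate_results_dir /output/intermediate_results_dir \
  --dry_run=true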
The --intermediate_results_dir flag is optional. If you specify it, the intermediate outputs of each step are kept in that directory. After the command finishes, you can find these intermediate files there:
call_variants_output-?????-of-?????.tfrecord.gz
gvcf.tfrecord-?????-of-?????.gz
make_examples_pangenome_aware_dv.tfrecord-?????-of-?????.gz
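For example, after the run you can list both the final outputs and the sharded intermediates:
# Optional: list the final calls and the sharded intermediate files.
ls -lh output/HG003.output.vcf.gz output/HG003.output.g.vcf.gz
ls output/intermediate_results_dir/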
For running on GPU machines, or using Singularity instead of Docker, see Quick Start.
mkdir -p happy
sudo docker pull jmcdani20/hap.py:v0.3.12
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG003.output.vcf.gz \
-f /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /reference/GRCh38_no_alt_analysis_set.fasta \
-o /happy/happy.output \
--engine=vcfeval \
--pass-only \
-l chr20
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10594 34 21276 32 10189 21 8 0.996801 0.997114 0.478896 0.996957 NaN NaN 1.748961 2.231995
INDEL PASS 10628 10594 34 21276 32 10189 21 8 0.996801 0.997114 0.478896 0.996957 NaN NaN 1.748961 2.231995
SNP ALL 70166 70090 76 90303 94 20078 21 5 0.998917 0.998661 0.222340 0.998789 2.296566 1.942569 1.883951 1.599631
SNP PASS 70166 70090 76 90303 94 20078 21 5 0.998917 0.998661 0.222340 0.998789 2.296566 1.942569 1.883951 1.599631
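hap.py also writes several files with the /happy/happy.output prefix; assuming the standard hap.py outputs, the same summary metrics shown above should be available in the summary CSV:
# Optional: view the summary metrics written by hap.py (assumes the standard *.summary.csv output).
cat happy/happy.output.summary.csv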