PEPPER-Margin-DeepVariant is a haplotype-aware variant calling pipeline for long reads.
We evaluated this pipeline on ~30x HG002 data. The data is publicly available, so feel free to download it, run the pipeline, and evaluate the results yourself.
Sample: HG002
Coverage: ~25-90x
Basecaller: Guppy 5.0.7 "SUP"
Region: chr20
Reference: GRCh38_no_alt
Please install docker and wget if you don't have them installed already. Docker installation guides for other distros:
- CentOS docker installation guide
- Debian/Raspbian docker installation guide
- Fedora installation guide
- Ubuntu installation guide
We show the installation instructions for Ubuntu here:
# Install wget to download data files.
sudo apt-get -qq -y update
sudo apt-get -qq -y install wget
# Install docker using instructions on:
# https://docs.docker.com/install/linux/docker-ce/ubuntu/
sudo apt-get -qq -y install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get -qq -y update
sudo apt-get -qq -y install docker-ce
docker --version
# Add your user to the docker group to avoid running docker with sudo:
# Details: https://docs.docker.com/engine/install/linux-postinstall/
sudo groupadd docker
sudo usermod -aG docker $USER
# Log out and log back in so that your group membership is re-evaluated.
# After logging back in, check that docker runs without sudo:
docker run hello-world
# If you can run docker without sudo, drop the sudo prefix from the docker commands below.
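The note about running docker with or without sudo can also be handled once, by storing the docker invocation in a variable. This is a minimal sketch and not part of the original instructions: `pick_docker_cmd` and `DOCKER` are hypothetical names, and it assumes that `docker info` succeeds only when the current user can reach the docker daemon.

```shell
# Hypothetical helper (not part of the pipeline): print "docker" if the
# current user can talk to the docker daemon, otherwise print "sudo docker".
pick_docker_cmd() {
  if docker info >/dev/null 2>&1; then
    echo "docker"
  else
    echo "sudo docker"
  fi
}

DOCKER="$(pick_docker_cmd)"
echo "Using: ${DOCKER}"

# Example use. ${DOCKER} is left unquoted on purpose so that
# "sudo docker" is split into two words:
#   ${DOCKER} pull kishwars/pepper_deepvariant:r0.6
```

With this in place, each `sudo docker` in the commands that follow could be written as `${DOCKER}` instead.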
BASE="${HOME}/ont-case-study"
# Set up input data
INPUT_DIR="${BASE}/input/data"
REF="GRCh38_no_alt.chr20.fa"
BAM="HG002_guppy_507_2_GRCh38_pass.chr20.30x.bam"
# Set the number of CPUs to use
THREADS="64"
# Set up output directory
OUTPUT_DIR="${BASE}/output"
OUTPUT_PREFIX="HG002_ONT_30x_2_GRCh38_PEPPER_Margin_DeepVariant.chr20"
OUTPUT_VCF="PEPPER_MARGIN_DEEPVARIANT_OUTPUT.vcf.gz"
## Create local directory structure
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${INPUT_DIR}"
# Download the data to input directory
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/HG002_guppy_507_2_GRCh38_pass.chr20.30x.bam
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/HG002_guppy_507_2_GRCh38_pass.chr20.30x.bam.bai
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/GRCh38_no_alt.chr20.fa
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/GRCh38_no_alt.chr20.fa.fai
## Pull the docker image.
sudo docker pull kishwars/pepper_deepvariant:r0.6
# Run PEPPER-Margin-DeepVariant
sudo docker run \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
kishwars/pepper_deepvariant:r0.6 \
run_pepper_margin_deepvariant call_variant \
-b "${INPUT_DIR}/${BAM}" \
-f "${INPUT_DIR}/${REF}" \
-o "${OUTPUT_DIR}" \
-t "${THREADS}" \
--ont_r9_guppy5_sup
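Before moving on to benchmarking, it can help to confirm that the run produced a non-empty VCF. The following is a minimal sketch, assuming the result is written to `${OUTPUT_DIR}/${OUTPUT_VCF}` as the variables above suggest. `count_vcf_records` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: count the non-header records in a gzip-compressed VCF.
count_vcf_records() {
  local vcf="$1"
  if [ ! -s "${vcf}" ]; then
    echo "Missing or empty file: ${vcf}" >&2
    return 1
  fi
  # Header lines start with '#'; every other line is one variant record.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function's exit status at 0 in that case.
  gzip -dc "${vcf}" | grep -vc '^#' || true
}

# Usage, with the paths defined earlier in this guide:
#   count_vcf_records "${OUTPUT_DIR}/${OUTPUT_VCF}"
```

A count of zero, or a missing file, would suggest the run did not finish as expected.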
You can evaluate the variants using hap.py.
Download benchmarking data:
# Set up input data
TRUTH_VCF="HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_BED="HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
# Download the truth VCF and BED
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
wget -P "${INPUT_DIR}" https://storage.googleapis.com/pepper-deepvariant-public/usecase_data/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
Run hap.py:
# Pull the docker image
sudo docker pull jmcdani20/hap.py:v0.3.12
# Run hap.py
sudo docker run -it \
-v "${INPUT_DIR}":"${INPUT_DIR}" \
-v "${OUTPUT_DIR}":"${OUTPUT_DIR}" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
"${INPUT_DIR}/${TRUTH_VCF}" \
"${OUTPUT_DIR}/${OUTPUT_VCF}" \
-f "${INPUT_DIR}/${TRUTH_BED}" \
-r "${INPUT_DIR}/${REF}" \
-o "${OUTPUT_DIR}/happy.output" \
--pass-only \
-l chr20 \
--engine=vcfeval \
--threads="${THREADS}"
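The headline metrics can be pulled out of hap.py's summary file on the command line. This sketch rests on assumptions that are not stated in this guide: that hap.py writes a `happy.output.summary.csv` next to the `-o` prefix, and that the file has columns named `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision` and `METRIC.F1_Score`. Please check the header of your own file first. `summarize_happy` is a hypothetical helper.

```shell
# Hypothetical helper: print Type, Recall, Precision and F1 for the PASS rows
# of a hap.py summary CSV. Columns are looked up by header name, so their
# order in the file does not matter. The column names are assumptions.
summarize_happy() {
  awk -F',' '
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["Filter"]) == "PASS" {
      print $(col["Type"]), $(col["METRIC.Recall"]), $(col["METRIC.Precision"]), $(col["METRIC.F1_Score"])
    }
  ' "$1"
}

# Usage, with the output prefix used in the hap.py command:
#   summarize_happy "${OUTPUT_DIR}/happy.output.summary.csv"
```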
Expected output:
Type | Truth total | True positives | False negatives | False positives | Recall | Precision | F1-Score |
---|---|---|---|---|---|---|---|
INDEL | 11256 | 6897 | 4359 | 1211 | 0.61274 | 0.853443 | 0.713333 |
SNP | 71333 | 71012 | 321 | 256 | 0.99550 | 0.996409 | 0.995954 |
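As a sanity check, the Recall and F1-Score columns can be recomputed from the other values in the table. The sketch below uses the standard definitions: recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. Precision is taken as reported rather than recomputed, because 6897 / (6897 + 1211) does not reproduce 0.853443; hap.py appears to count query-side true positives separately from the truth-side counts shown here. `recall` and `f1_score` are hypothetical helper names.

```shell
# Recall = TP / (TP + FN), using the truth-side counts from the table.
recall() {
  awk -v tp="$1" -v fn="$2" 'BEGIN { printf "%.6f\n", tp / (tp + fn) }'
}

# F1 = harmonic mean of precision and recall.
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.6f\n", 2 * p * r / (p + r) }'
}

recall 6897 4359             # INDEL recall: prints 0.612740
f1_score 0.853443 0.61274    # INDEL F1:     prints 0.713333
recall 71012 321             # SNP recall:   prints 0.995500
f1_score 0.996409 0.99550    # SNP F1:       prints 0.995954
```

The recomputed values agree with the table to the precision shown.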
This pipeline is developed in a collaboration between the UCSC Genomics Institute and the Genomics team at Google Health.