Variant annotation

Variant annotation workflow

Variants are annotated with their predicted impact on the genes using SNPeff
Then with public variant databases:
- Small variants are annotated with with gnomAD's allele frequencies, conservation/CADD/MetaRNN scores, and ClinVar's clinical significance.
- Structural variants are annotated with the frequency in SV catalogs, and their overlap with dbVar clinical SVs, or DGV.
Structural variants can also be re-genotyped from the long reads to compute the read support and help rank/filter them by confidence.

The workflow was written in

WDL: see wdl/workflow.wdl and WDL section below.
Snakemake: see workflow/Snakefile (and config, profile folders) and Snakemake section below.

Input

The worklow annotates a VCF file containing small variants and structural variants (SVs). Mapped reads should also be provided to perform some QC on the SVs.

After preparing the annotation files, jump to the WDL or Snakemake sections to run the workflow.

Prepare annotation files

SnpEff

wget https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_GRCh38.105.zip
unzip snpEff_v5_1_GRCh38.105.zip

gnomAD

To reduce the size of the VCF with the variants in gnomAD, keeping only the field that we want to use (AF):

for CHR in `seq 1 22` X Y
do
    curl https://storage.googleapis.com/gcp-public-data--gnomad/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr${CHR}.vcf.bgz | zcat | bcftools annotate -x ^INFO/AF,ID,FILTER,QUAL -e 'FILTER~"AC0"' -Oz -o gnomad_chr.genomes.r3.0.sites.chr${CHR}.vcf.gz
done
bcftools concat gnomad_chr.genomes.r3.0.sites.chr*.vcf.gz | bcftools sort -Oz -o gnomad.genomes.r3.0.sites.small.vcf.bgz -m 3G
rm gnomad_chr.genomes.r3.0.sites.chr*.vcf.gz

The final bgzipped VCF is only 4.5G.

dbNSFP

To reduce the size of the dbNSFP database used by SNPeff:

## download dbNSFP zip from https://sites.google.com/site/jpopgen/dbNSFP
## list files in the zip
unzip -l dbNSFP4.4a.zip
## list columns in full txt file. helps finding/double-checking the column numbers we want to keep
unzip -p dbNSFP4.4a.zip dbNSFP4.4a_variant.chr1.gz | zcat | head -1 | awk 'BEGIN{RS="\t"}{N=N+1;print N" "$0}' | less
## keep only some columns. gzip to control the size of temporary files and use multi-threaded gzip (pigz) to speed up a bit
unzip -p dbNSFP4.4a.zip dbNSFP4.4a_variant.chr1.gz | zcat | head -1 | cut -f 1-4,77-79,128-130,165-167,676 | gzip > temp.txt.gz
for CHR in `seq 1 22` X Y M
do
    unzip -p dbNSFP4.4a.zip dbNSFP4.4a_variant.chr${CHR}.gz | zcat | sed 1d | cut -f 1-4,77-79,128-130,165-167,676 | pigz -c -p 4 >> temp.txt.gz
done
## bgzip instead of gzip
unpigz -c -p 4 temp.txt.gz | bgzip > dbNSFP4.4a.small.txt.gz
## index with tabix
tabix -b 2 -e 2 -s 1 dbNSFP4.4a.small.txt.gz

If we keep only information about GERP, CADD, MetaRNN, and ALFA, the file size decreases from 35G to about 2G.

AnnotSV

To also use the GeneHancer database for enhancers, download manually from the GeneHancer website. Then download the GeneHancer-v5.9-for-AnnotSV.zip file.

docker run -it -v `pwd`:/app -w /app -u `id -u $USER` quay.io/jmonlong/annotsv:3.4
# - # IN DOCKER CONTAINER
cp -r /build/AnnotSV .
cd AnnotSV
PREFIX=. make install-human-annotation
cd share/AnnotSV/Annotations_Human/FtIncludedInSV/RegulatoryElements
cp /app/GeneHancer-v5.9-for-AnnotSV.zip .
unzip GeneHancer-v5.9-for-AnnotSV.zip
cd ../../../../../
AnnotSV -SVinputFile tests/AnnotSV/test_06_VCFfromLumpy/input/test1.vcf -outputDir temp_out -annotationsDir share/AnnotSV
mv share/AnnotSV ../AnnotSV_annotations
cd ..
rm -fr AnnotSV
# - # OUT DOCKER CONTAINER

Misc annotations in R object

annotation_database.RData

Details and link soon...

Tandem repeat expansion in public controls

hprc-hgsvc-tr-catalog.tsv.gz

Details and link soon...

ClinVar VCF

Details soon...

Running the workflow

WDL

The input JSON should contain a VCF field with the path to the VCF file to annotate.

See test/test.inputs.bam.json for an example.

For instance, running the workflow locally with miniwdl:

miniwdl run --as-me -i test/test.inputs.bam.json wdl/workflow.wdl

Below are more details about the other input parameters.

Gene annotation

If SNPEFF_DB and SNPEFF_DB_NAME are provided, SNPeff will annotate variants with their impact based on the gene annotation.

Small variant databases

If GNOMAD_VCF, GNOMAD_VCF_INDEX, CLINVAR_VCF, CLINVAR_VCF_INDEX, DBNSFP_DB, and DBNSFP_DB_INDEX are provided, SNPeff/SNPsift will annotate small variants with:

their frequency in gnomAD
their presence and clinical significance in ClinVar
the predicted impact based on conservation (GERP++), CADD, or MetaRNN.

Structural variant databases

If SV_DB_RDATA is provided, SVs are annotated with:

the frequency of similar SVs in public SV catalogs
their similarity with variants in DGV
their similarity with the Clinical SV dataset from dbVar

Structural variant validation

If BAM, BAM_INDEX, and REFERENCE_FASTA are provided, SVs will be re-genotyped from the long reads using local pangenomes built with vg. Two new INFO fields will represent the read support for the SV allele:

RS_PROP with the proportion of supporting reads.
RS_AD with the read support for the reference and alternate alleles (e.g. 3,5 for 3 reference-supporting reads and 5 SV-supporting reads).

Snakemake

Assuming you're in the root of the repository, run the workflow with:

snakemake --use-singularity --configfile config.yaml -p reports --cores 4

The config.yaml file points to all the relevant files (FASTA, databases, sample information).

See test/snakemake.config.yaml and test/snakemake.sample_info.tsv for examples of those two files.

The config.yaml parameters are:

clinvar: VCF with ClinVar (prepared above)
dbnsfp: dbNSFP indexed file (prepared above)
gnomad: gnomAD VCF with allele frequencies (prepared above)
svdb: RData file with misc annotations, mostly for SV annotatio (prepared above)
annotsv: folder with the AnnotSV annotations (prepared above)
snpeff: folder with the SnpEff databse (prepared above)
ref: (indexed) FASTA of the reference genome
tr_catalog: Tandem repeat expansion in public controls (prepared above)
sample_info: TSV file with sample information (name, VCF, BAM)
sample: sample(s) to process. If empty, all samples in sample_info will be processed.
tmp_dir: a directory to use as temporary directory.

The sample_info parameter point to a TSV file with (at least) the following three columns:

sample: sample name
bam: path to the (indexed) BAM file
hvcf: harmonized VCF with small and structural variants (e.g. from the Napu pipeline).

Test locally

See test/README.md.

Name		Name	Last commit message	Last commit date
Latest commit History 93 Commits
config		config
docker		docker
profile/default		profile/default
scripts		scripts
test		test
wdl		wdl
workflow		workflow
.dockstore.yml		.dockstore.yml
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Variant annotation

Input

Prepare annotation files

SnpEff

gnomAD

dbNSFP

AnnotSV

Misc annotations in R object

Tandem repeat expansion in public controls

ClinVar VCF

Running the workflow

WDL

Gene annotation

Small variant databases

Structural variant databases

Structural variant validation

Snakemake

Test locally

About

Releases

Packages

Contributors 2

Languages

License

jmonlong/variant_annotation_wf

Folders and files

Latest commit

History

Repository files navigation

Variant annotation

Input

Prepare annotation files

SnpEff

gnomAD

dbNSFP

AnnotSV

Misc annotations in R object

Tandem repeat expansion in public controls

ClinVar VCF

Running the workflow

WDL

Gene annotation

Small variant databases

Structural variant databases

Structural variant validation

Snakemake

Test locally

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages