Skip to content

Germline SNVs and small Indels

Jason Walker edited this page Jun 13, 2019 · 18 revisions

Introduction

Tutorials

Command Line

Setup

Docker Environment

docker run -v `pwd`:/analysis-workflows -it mgibio/gatk-cwl:3.5.0 /bin/bash -l

Change working directory

cd analysis-workflows/example_data/exome_workflow

Haplotype Caller

/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T HaplotypeCaller -R chr17_test.fa -I H_NJ-HCC1395-HCC1395_BL.cram -ERC GVCF -L chr17_test_genes.interval_list -o H_NJ-HCC1395-HCC1395_BL.g.vcf.gz

/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T HaplotypeCaller -R chr17_test.fa -I H_NJ-HCC1395-HCC1395.cram -ERC GVCF -L chr17_test_genes.interval_list -o H_NJ-HCC1395-HCC1395.g.vcf.gz

View GVCF Output

zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL.g.vcf.gz | head

Potential variant sites

zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL.g.vcf.gz | grep ',<' | head

GenotypeGVCFs

/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T GenotypeGVCFs -R chr17_test.fa --variant H_NJ-HCC1395-HCC1395_BL.g.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz

/usr/bin/java -Xmx8g -jar /opt/GenomeAnalysisTK.jar -T GenotypeGVCFs -R chr17_test.fa --variant H_NJ-HCC1395-HCC1395.g.vcf.gz -o H_NJ-HCC1395-HCC1395_genotype.vcf.gz

View VCF output with genotypes

zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz | head

How many potential variations in each sample?

zgrep -v '^##' H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz | wc -l

zgrep -v '^##' H_NJ-HCC1395-HCC1395_genotype.vcf.gz | wc -l

THAT'S A LOT!!! Next, annotation and filtering.

Annotation

Setup VEP

exit
docker run -v `pwd`:/analysis-workflows -it mgibio/vep_helper-cwl:1.0.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow

Run VEP

/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --term SO --flag_pick --transcript_version --tsl --offline --cache --dir $PWD --check_existing --custom chr17_test_gnomADe.vcf.gz,gnomADe,vcf,exact,0,AF,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_SAS --hgvs --fasta chr17_test.fa --synonyms chromAlias.ensembl.txt -o H_NJ-HCC1395-HCC1395_BL_annotated.vcf -i H_NJ-HCC1395-HCC1395_BL_genotype.vcf.gz

/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --term SO --flag_pick --transcript_version --tsl --offline --cache --dir $PWD --check_existing --custom chr17_test_gnomADe.vcf.gz,gnomADe,vcf,exact,0,AF,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_SAS --hgvs --fasta chr17_test.fa --synonyms chromAlias.ensembl.txt -o H_NJ-HCC1395-HCC1395_annotated.vcf -i H_NJ-HCC1395-HCC1395_genotype.vcf.gz

View the annotated VCF

grep -v '^##' H_NJ-HCC1395-HCC1395_BL_annotated.vcf | head -n 3

Binary compression and index

/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_annotated.vcf
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_BL_annotated.vcf
/usr/local/bin/tabix -p vcf H_NJ-HCC1395-HCC1395_BL_annotated.vcf.gz
/usr/local/bin/tabix -p vcf H_NJ-HCC1395-HCC1395_annotated.vcf.gz

Filter Variants

Coding Only Variant Filter

/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --ontology --filter "Consequence is coding_sequence_variant" -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf -i H_NJ-HCC1395-HCC1395_BL_annotated.vcf.gz

/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --ontology --filter "Consequence is coding_sequence_variant" -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf -i H_NJ-HCC1395-HCC1395_annotated.vcf.gz

How many where filtered?

grep -v '^#' H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf | wc -l
grep -v '^#' H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf | wc -l

Remove common alleles based on gnomADe population allele frequencies

/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --filter "gnomADe_AF < 0.001 or not gnomADe_AF" -i H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf

/usr/bin/perl /opt/vep/src/ensembl-vep/filter_vep --format vcf --filter "gnomADe_AF < 0.001 or not gnomADe_AF" -i H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf

How many were filtered?

zgrep -v '^#' H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf | wc -l
zgrep -v '^#' H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf | wc -l

Compress and index

/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
/usr/local/bin/bgzip H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf
/usr/local/bin/tabix H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz
/usr/local/bin/tabix H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz

Setup GATK SelectVariants

exit
docker run -v `pwd`:/analysis-workflows -it mgibio/gatk-cwl:3.6.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow

Run SelectVariants with Target Regions, ie. Exons

/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T SelectVariants -R chr17_test.fa -L chr17_test_target.interval_list --excludeFiltered --variant H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz

/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T SelectVariants -R chr17_test.fa -L chr17_test_target.interval_list --excludeFiltered --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz

Reporting

VariantsToTable

/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_variants.tsv

/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_BL_variants.tsv

/usr/bin/java -jar /opt/GenomeAnalysisTK.jar -T VariantsToTable -R chr17_test.fa -F CHROM -F POS -F REF -F ALT -GF GT -GF AD -F AF -GF DP --variant H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz -o H_NJ-HCC1395-HCC1395_all_coding_variants.tsv

Add VEP fields to table

exit
docker run -v `pwd`:/analysis-workflows -it mgibio/annotation_table-cwl:1.0.0 /bin/bash -l
cd /analysis-workflows/example_data/exome_workflow

mkdir H_NJ-HCC1395-HCC1395
mkdir H_NJ-HCC1395-HCC1395_BL
mkdir H_NJ-HCC1395-HCC1395_all_coding

/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_variants.tsv H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.gnomADe_AF_filtered.target_interval_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395

/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_all_coding_variants.tsv H_NJ-HCC1395-HCC1395_annotated.coding_variant_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395_all_coding

/usr/bin/python /usr/bin/add_annotations_to_table_helper.py H_NJ-HCC1395-HCC1395_BL_variants.tsv H_NJ-HCC1395-HCC1395_BL_annotated.coding_variant_filtered.target_interval_filtered.vcf.gz Consequence,SYMBOL,Feature_type,Feature,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,HGNC_ID,Existing_variation,gnomADe_AF,CLIN_SIG,SOMATIC,PHENO $PWD/H_NJ-HCC1395-HCC1395_BL

View the final Germline variants for the tumor and normal sample

cat H_NJ-HCC1395-HCC1395_BL/variants.annotated.tsv

cat H_NJ-HCC1395-HCC1395/variants.annotated.tsv

Variant ENSP00000269305.4:p.Arg175His in TP53 is likely a pathogenic somatic variant since it's only seen in the tumor sample.

Variant ENSP00000418960.2:p.Arg1772Ter in BRCA1 is likely a pathogenic germline variant that shows up in both the tumor and the normal.

Notice anything about the allele frequencies (AF) between the tumor and normal? What would explain this?

CWL Workflow

Steps

INSERT PROCESS DIAGRAM

Inputs

Outputs

Known Issues

Clone this wiki locally