variant-call-filter

TL;DR: This project is a set of variant call filtering experiments. The primary objective is to match the performance (precision and recall) of GATK's VQSR tool while requiring fewer modeling assumptions and less manual user input.

Variant Calling and Filtering

When a sequencing pipeline finds evidence of a genetic variant (a mutation, such as a SNP), it outputs a "variant call". In practice, many of these variant calls are false positives: the genome of interest does not in fact contain the called variant. In the datasets I used, approximately 10-15% of the variant calls are false positives. To obtain a high-quality variant call set, one must filter out the false positives. This is the goal of my project.
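As a rough illustration of how a filtered call set can be scored, the sketch below computes precision and recall against a gold-standard set. The function name, the tuple representation of a call, and the toy data are all hypothetical and are not taken from the project's notebooks.

```python
def precision_recall(kept, truth):
    """Precision and recall of a filtered call set.

    kept  -- set of variant calls that passed the filter
    truth -- set of variants in the gold-standard call set
    """
    true_positives = len(kept & truth)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data: each call is a (chromosome, position, ref, alt) tuple.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
kept = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 900, "T", "C")}

precision, recall = precision_recall(kept, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three kept calls are real (precision 2/3), and two of the three true variants were kept (recall 2/3). A stricter filter usually trades recall for precision.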

Dataset

I used variant calls obtained from Illumina reads of individual NA12878. Information about how to get the dataset is available on the Genome Analysis Toolkit (GATK) website. After obtaining the ".vcf" file you want to filter (and the corresponding ".idx" file), run my preprocessing steps. The preprocessing notebook applies VQSR, converts the data to a more convenient table format, selects usable numerical features (e.g. it drops features containing NaNs), and scales/normalizes the non-categorical data.
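The feature selection and scaling steps can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's code: the function name is hypothetical, the annotation values are made up, and standardizing to zero mean and unit variance is an assumption about what "scales/normalizes" means here. In practice a table library such as pandas would likely be more convenient.

```python
import math
import statistics


def select_and_scale(rows, feature_names):
    """Drop features containing NaN, then standardize the remaining ones."""
    kept = [
        name for name in feature_names
        if not any(math.isnan(row[name]) for row in rows)
    ]
    scaled = [{} for _ in rows]
    for name in kept:
        values = [row[name] for row in rows]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for out, value in zip(scaled, values):
            # A constant feature has zero spread; map it to 0.0.
            out[name] = (value - mean) / std if std > 0 else 0.0
    return kept, scaled


# Made-up values for three annotations commonly found in VCF INFO fields.
rows = [
    {"QD": 10.0, "FS": 1.0, "MQ": 60.0},
    {"QD": 20.0, "FS": float("nan"), "MQ": 58.0},
    {"QD": 30.0, "FS": 3.0, "MQ": 59.0},
]
kept, scaled = select_and_scale(rows, ["QD", "FS", "MQ"])
print(kept)
```

In this toy table, FS is dropped because one row has no value for it, and QD and MQ are each rescaled independently.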

GATK Approach

GATK has a variant call filtering tool called Variant Quality Score Recalibration (VQSR). In short, VQSR fits a Gaussian Mixture Model to genetic variant data in order to classify proposed variants as true variants or not.
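To give a feel for what fitting a Gaussian Mixture Model involves, here is a toy two-component, one-dimensional fit using expectation-maximization. It is only a sketch of the general technique and is not VQSR's implementation: the function names, the initialization, and the data are hypothetical, and the code is not numerically robust (a real implementation would work in log space and in several dimensions).

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    means = [min(xs), max(xs)]  # simple deterministic initialization
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


# Made-up values of a single annotation, drawn from two visible clusters.
xs = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(xs)
print(means)
```

Once such a model is fitted, each call can be scored by how likely its annotation values are under the mixture, and low-scoring calls can be filtered out.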

My Approach

Since we now have access to gold-standard variant call datasets, I used such data to train supervised learning algorithms, such as logistic regression, support vector machines, and random forests. I ran several experiments, which are contained in Jupyter notebooks for transparency.
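The supervised setup can be sketched with a small logistic regression trained by gradient descent. This is a standard-library illustration under stated assumptions, not the code from the experiment notebooks: the function names, hyperparameters, and the six labeled calls are hypothetical, and a library implementation would normally be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return True if the call is classified as a true variant."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5


# Made-up scaled features; label 1 means the call is in the gold-standard set.
features = [[1.2, 0.8], [0.9, 1.1], [1.5, 0.3],
            [-1.0, -0.7], [-1.3, -0.2], [-0.8, -1.4]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

Unlike the mixture-model approach, the labels from the gold-standard set drive the fit directly, so no assumption about the shape of the feature distribution is needed.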

nmchaves/variant-call-filter
