Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work on getting Mutect algorithm up and running #167

Open
wants to merge 63 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
c34d30c
Added base quality score based read filters.
tdanford Apr 24, 2015
aa88c53
Miscellaneous read filter cleanup. Refactored preprocessing stages th…
fnothaft Apr 24, 2015
16da202
Updates to depth filter. Genotypes are not discarded, rather the filt…
fnothaft Apr 24, 2015
65c68dd
adding in @tdanford's mutect work, merging into current avocado
jstjohn May 13, 2015
cc840b9
Merge branch 'master' into jsj_mutect_hack
jstjohn May 13, 2015
10181ec
Merge branch 'common-filters' of github.com:fnothaft/avocado into jsj…
jstjohn May 14, 2015
3fc75b2
Merge branch 'master' of github.com:bigdatagenomics/avocado into jsj_…
jstjohn May 14, 2015
265b142
restoring successful compile with in-progress mutect against lastest …
jstjohn May 17, 2015
fe6772c
Merge branch 'common-filters' of github.com:fnothaft/avocado into mut…
jstjohn May 17, 2015
310c924
pulling over MuTect LikelihoodModel, fixing some bugs, and making tests
jstjohn May 18, 2015
571bb2a
initial pull over of Mutect.scala base function, and a base suite of …
jstjohn May 18, 2015
c0503f6
more fixes and clarifications on the mutect likelihood model
jstjohn May 18, 2015
5d96d8b
adding in initial tests of likelihood given an actual mutation
jstjohn May 18, 2015
c8cb648
added in tests for mutant site
jstjohn May 18, 2015
5506148
reorganizing tests slightly
jstjohn May 18, 2015
f9fe9c2
added in a final test covering the normal somatic classification
jstjohn May 18, 2015
aac84f9
Adding mutect tests and TODO comments to main code
jstjohn May 18, 2015
cde7a00
adding somatic classification code/tests into MuTect
jstjohn May 19, 2015
c32ef7f
making test reads slightly more realistic, assigned to both strands now
jstjohn May 19, 2015
fe85550
adding a small mutect test for sites with no tumor coverage, but some…
jstjohn May 19, 2015
1a92281
Merge branch 'master' into mutect_work
jstjohn May 20, 2015
4d5487e
cleaning up code, refactoring location of Mutect.scala, adding in lic…
jstjohn May 26, 2015
b5ac34b
Adding harness for running Mutect algorithm on real reads.
fnothaft May 26, 2015
2820113
switching to phredToErrorProbability from internal function
jstjohn May 26, 2015
af3040f
Merge pull request #1 from fnothaft/mutect-runner
jstjohn May 26, 2015
3c1511d
adding license header to MutectGenotyperSuite and removing old versions
jstjohn May 26, 2015
095c2d4
merging in new AlleleObservation code from Master, updating tests
jstjohn May 27, 2015
b3a3f23
Adding more pre-processing filters for Mutect.
tdanford May 28, 2015
504c0a8
Merge pull request #2 from fnothaft/mo-mutect
jstjohn May 30, 2015
bb2755c
adding 5 fields to AlleleObservation for allele-specific filtering: d…
jstjohn May 30, 2015
c7021ee
Adding two new fields to AlleleAlleleObservation to store the distanc…
jstjohn May 30, 2015
c2aa3c5
soft clipped read should have the clipped bases in the sequence
jstjohn May 30, 2015
e55d1a9
initial calculation of distance from each allele to nearest indel and…
jstjohn May 31, 2015
7218475
Tests pass ReadExplorer, adding initial version using those filters t…
jstjohn Jun 1, 2015
2083c8c
implementing more mutect filters, and adding TODOs for the last few f…
jstjohn Jun 2, 2015
fc5bf4d
putting magic numbers into config variables
jstjohn Jun 2, 2015
79e3547
implement the allele cluster filter
jstjohn Jun 2, 2015
27bb516
removing unneeded fields from AlleleObservation and replacing with a …
jstjohn Jun 2, 2015
0dc1ea1
adding more tests to Mutect internal filters
jstjohn Jun 2, 2015
31d1f00
passing mate rescue status through to AlleleObservation
jstjohn Jun 2, 2015
b034e7b
adding in custom mutect strand bias method
jstjohn Jun 2, 2015
b0ee22d
fixing bug in MutectGenotyper tests where filters were not being prop…
jstjohn Jun 2, 2015
03aaaf8
refactoring variables to camelCase
jstjohn Jun 3, 2015
607ebab
faster O(log(n)) recursive search for minimal passing k value
jstjohn Jun 3, 2015
0f9931c
cleaning up code a bit
jstjohn Jun 4, 2015
ab9b0ca
cleaning up variable names
jstjohn Jun 4, 2015
cc088c9
Implementation of the sum of mismatching quality scores in the Allele…
jstjohn Jun 10, 2015
fb18208
Merge branch 'master' of github.com:bigdatagenomics/avocado into mute…
jstjohn Jun 26, 2015
8d0c286
Merge branch 'master' of github.com:bigdatagenomics/avocado into mute…
jstjohn Jul 1, 2015
88a5cba
Adding known-sites variant filter.
fnothaft Aug 1, 2015
d5de796
Merge pull request #3 from fnothaft/add-dbsnp
jstjohn Aug 1, 2015
4ed4e93
Added site detection to mutect genotyper. Also, made site filter and …
fnothaft Aug 3, 2015
00a89c1
Fixed MD tag bug caused by upstream change to MD tag code in ADAM.
fnothaft Aug 3, 2015
7b5da97
Merge pull request #4 from fnothaft/add-dbsnp
jstjohn Aug 3, 2015
b501595
adding in the required settings to enable Panel of Normal (PON) filte…
jstjohn Aug 30, 2015
ba1b236
Merge branch 'master' of github.com:bigdatagenomics/avocado into mute…
jstjohn Aug 30, 2015
5b313c8
Merge pull request #2 from bigdatagenomics/master
stewSquared Jan 7, 2016
d3b16d1
Merge branch 'master' into mutect_work
stewSquared Jan 7, 2016
2444c9a
Merge pull request #3 from bigdatagenomics/master
stewSquared Jan 8, 2016
4819062
Merge branch 'master' into mutect_work
stewSquared Jan 8, 2016
9bb2980
Added support for having 2 input ADAM reads file, with the option to …
Jan 28, 2016
0b2955e
implementing isabels changes
jstjohn Mar 10, 2016
f7ad48b
adding isabels changes
jstjohn Mar 11, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
*.vcf*
target
*.log
*.jar

# Intellij
.idea/
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,9 @@ class AvocadoArgs extends Args4jBase with ParquetArgs {
@Argument(metaVar = "CONFIG", required = true, usage = "avocado configuration file", index = 3)
var configFile: String = _

@Argument(metaVar = "NORMAL", required = false, usage = "ADAM normal data", index = 4)
var normalInput: String = _

@option(name = "-debug", usage = "If set, prints a higher level of debug output.")
var debug = false

Expand Down Expand Up @@ -204,10 +207,18 @@ class Avocado(protected val args: AvocadoArgs) extends BDGSparkCommand[AvocadoAr

log.info("Loading reads in from " + args.readInput)
// load in reads from ADAM file
val reads: RDD[AlignmentRecord] = LoadReads.time {
var reads: RDD[AlignmentRecord] = LoadReads.time {
Input(sc, args.readInput, reference, config)
}

// load in reads from normal ADAM file if there is one
var normal: RDD[AlignmentRecord] = null

if (args.normalInput != null) {
normal = Input(sc, args.normalInput, reference, config)
reads = sc.union(reads, normal)
}

// create stats/config item
val stats = new AvocadoConfigAndStats(sc, args.debug, reads, reference)

Expand All @@ -233,4 +244,4 @@ class Avocado(protected val args: AvocadoArgs) extends BDGSparkCommand[AvocadoAr
args.disableDictionaryEncoding)
}
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
/*
* Licensed to Big Data Genomics (BDG) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The BDG licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.bdgenomics.avocado.algorithms.mutect

import org.bdgenomics.avocado.models.AlleleObservation
import org.bdgenomics.adam.util.PhredUtils.phredToErrorProbability
import scala.math._

trait LikelihoodModel extends Serializable {
def logLikelihood(ref: String,
alt: String,
obs: Iterable[AlleleObservation],
f: Option[Double]): Double
}

case class LogOdds(m1: LikelihoodModel, m2: LikelihoodModel) {

def logOdds(ref: String, alt: String,
obs: Iterable[AlleleObservation],
f: Option[Double]): Double =
m1.logLikelihood(ref, alt, obs, f) - m2.logLikelihood(ref, alt, obs, f)
}

object MutectLogOdds extends LogOdds(MfmModel, M0Model) {
}

/**
* Use for the log odds that a normal is not a heterozygous site
*/
object MutectSomaticLogOdds extends LogOdds(M0Model, MHModel) {
}

object M0Model extends LikelihoodModel {

def logLikelihood(ref: String,
alt: String,
obs: Iterable[AlleleObservation],
f: Option[Double]): Double =
MfmModel.logLikelihood(ref, alt, obs, Some(0.0))
}

/**
* M_{m, 0.5} -- probability of a heterozygous site
*/
object MHModel extends LikelihoodModel {

def logLikelihood(ref: String,
alt: String,
obs: Iterable[AlleleObservation],
f: Option[Double]): Double =
MfmModel.logLikelihood(ref, alt, obs, Some(0.5))
}

/**
* M_{m, f}
*/
object MfmModel extends LikelihoodModel {

def P_bi(obs: AlleleObservation, r: String, m: String, f: Double): Double = {
val ei = phredToErrorProbability(obs.phred)

if (obs.allele == r) {
f * (ei / 3.0) + (1.0 - f) * (1.0 - ei)
} else if (obs.allele == m) {
f * (1.0 - ei) + (1.0 - f) * (ei / 3.0)
} else {
ei / 3.0
}
}

def logLikelihood(ref: String, alt: String,
obs: Iterable[AlleleObservation],
f: Option[Double]): Double = {
val fEstimate: Double = f.getOrElse(obs.count(_.allele == alt).toDouble / obs.size)
obs.map { ob => log10(P_bi(ob, ref, alt, fEstimate)) }.sum
}
}

Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ import org.apache.spark.Logging
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.models.ReferencePosition
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rich.RichAlignmentRecord
import org.bdgenomics.adam.rich.{ DecadentRead, RichAlignmentRecord }
import org.bdgenomics.adam.util.MdTag
import org.bdgenomics.avocado.Timers._
import org.bdgenomics.avocado.models.{ AlleleObservation, Observation }
import org.bdgenomics.avocado.stats.AvocadoConfigAndStats
Expand All @@ -43,10 +44,38 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit

val companion: ExplorerCompanion = ReadExplorer

def mdTagToMismatchPositions(mdTag: MdTag, cigar: List[CigarElement]): Seq[Int] = {
var idx = 0
val insertions = cigar.map(c => {
(c, c.getLength)
}).map(kv => {
val r = (kv._1, idx)
idx += kv._2
r
}).flatMap(kv => {
val (ce, i) = kv
if (ce.getOperator == CigarOperator.I) {
(0 until ce.getLength).map(_ + i)
} else {
Seq.empty
}
})

val deletions = mdTag.deletions
val oriPositions = mdTag.mismatches.keys
var mismatchPositions = oriPositions.zip(oriPositions)
for (iPos <- insertions) {
mismatchPositions = mismatchPositions.map({ case (p, i) => if (i <= iPos) (p + 1, i) else (p, i) })
}
for ((dPos, _) <- deletions) {
mismatchPositions = mismatchPositions.map({ case (p, i) => if (i > dPos) (p - 1, i) else (p, i) })
}
mismatchPositions.map({ case (p, i) => p.toInt }).toSeq
}

def readToObservations(r: (AlignmentRecord, Long)): Seq[Observation] = ExploringRead.time {
val (read, readId) = r
val richRead: RichAlignmentRecord = RichAlignmentRecord(read)

// get read start, contig, strand, sample, mapq, and sequence
var pos: Long = read.getStart
val contig: String = read.getContig.getContigName
Expand All @@ -64,14 +93,115 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
// get cigar, md tag, and phred scores for bases
val cigar: List[CigarElement] = richRead.samtoolsCigar.getCigarElements
val quals = richRead.qualityScores
val mdTag = richRead.mdTag
val mdString = read.getMismatchingPositions
val mismatchPositions: Option[Seq[Int]] = if (mdString != null && mdString != "")
Some(mdTagToMismatchPositions(MdTag(read.getMismatchingPositions,
if (cigar.head.getOperator == CigarOperator.S) cigar.head.getLength else 0,
richRead.samtoolsCigar), cigar))
else None

// observations
var observations = Seq[Observation]()

// position in the current read
var readPos = 0

// get the sum of mismatching bases
val qscores: Option[Seq[Int]] = mismatchPositions.map(l => {
l.map(p => {
quals(p)
})
})

val mismatchQScoreSum = qscores.map(_.sum)

// Helper function to get the unclipped read length (hard or soft) from CIGAR
def unclippedLenFromCigar(cigar: Cigar): Int = {
cigar.getCigarElements.map(ce => ce.getOperator match {
case CigarOperator.D | CigarOperator.N | CigarOperator.P => 0
case _ => ce.getLength
}).sum
}

def alignedLenFromCigar(cigar: Cigar): Int = {
cigar.getCigarElements.map(alignedElementLength).sum
}

def alignedElementLength(ce: CigarElement): Int = {
ce.getOperator match {
case CigarOperator.D | CigarOperator.N | CigarOperator.P | CigarOperator.H | CigarOperator.S => 0
case _ => ce.getLength
}
}

// Helper function to calculate the length of an element, if it is a clipping element
def basesTrimmed(cigarElement: CigarElement): Int = {
cigarElement.getOperator match {
case CigarOperator.S | CigarOperator.H => cigarElement.getLength
case _ => 0
}
}

// Set up variables to help with tracking the distance from indels, and
// the distance from the current allele to the first and last trimmed base
// within this read.
val readLen = unclippedLenFromCigar(richRead.samtoolsCigar)
val trimmedFromStart = basesTrimmed(cigar.head)
val trimmedFromEnd = basesTrimmed(cigar.last)
var softclippedBases = 0
val alignedLen = alignedLenFromCigar(richRead.samtoolsCigar)

val cigarLenOps = cigar.zipWithIndex.map({
case (ce: CigarElement, idx: Int) =>
(idx, (ce.getLength, ce.getOperator))
}).toMap

val alignedLenFromCigars = cigar.map(alignedElementLength)

// List of changes that are only insertions along with their lengths and pos (pos, (len, CigarOperator.I))
val insertions = cigarLenOps.filter({ case (idx, (len, op)) => op == CigarOperator.I })
// List of changes that are only deletions along with their lengths and pos (pos, (len, CigarOperator.D))
val deletions = cigarLenOps.filter({ case (idx, (len, op)) => op == CigarOperator.D })

/**
*
* @param idx position of the insertion in the Cigar list
* @param len lenght of the insertion from the Cigar list
* @param del whether it is deletion or insertion
* @return
*/
def makeNaiveDistanceVec(idx: Int, len: Int, del: Boolean): Vector[Int] = {
// Finds the distance (num of bases in the read) before the event
val lpre = (0 until idx).map(alignedLenFromCigars(_)).sum

// Finds the num of bases in the read after the event (at idx)
val lpost = ((idx + 1) until cigar.length).map(alignedLenFromCigars(_)).sum

// Makes a list for every base pair in the read based on how far it is from this particular event
// eg. if it is an insertion, lpre = 5, lpost = 6, insertion len = 3, read length = 14
// (5, 4, 3, 2, 1, 0, 0, 0, 1, 2, 3, 4, 5, 6)
val distanceVec = ((1 to lpre).reverse ++ (if (del) Vector.empty[Int] else Vector.fill(len)(0)) ++ (1 to lpost).map(-_)).toVector
assert(distanceVec.length == alignedLen)
distanceVec
}

val insertionDistVecs = insertions.map({ case (idx, (len, _)) => makeNaiveDistanceVec(idx, len, false) })
val deletionDistVecs = deletions.map({ case (idx, (len, _)) => makeNaiveDistanceVec(idx, len, true) })

val posToInsDist: Option[Vector[Int]] = if (insertionDistVecs.size > 0) Some(insertionDistVecs.transpose.map(l => l.minBy(Math.abs(_))).toVector) else None
val posToDelDist: Option[Vector[Int]] = if (deletionDistVecs.size > 0) Some(deletionDistVecs.transpose.map(l => l.minBy(Math.abs(_))).toVector) else None

def getTags(read: RichAlignmentRecord): Option[Seq[org.bdgenomics.adam.models.Attribute]] = {
try {
Option(read.tags)
} catch {
case e: NullPointerException => None
}
}

val tags: Option[Seq[org.bdgenomics.adam.models.Attribute]] = getTags(richRead)
val mateRescue: Boolean = tags.getOrElse(Seq()).exists(a => a.tag == "XT" && a.value == "M")

def processAlignmentMatch() {
observations = AlleleObservation(ReferencePosition(contig, pos),
1,
Expand All @@ -80,7 +210,15 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
mapq,
negativeStrand,
firstOfPair,
readPos,
readPos - softclippedBases,
alignedLen,
posToInsDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
posToDelDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
trimmedFromStart,
trimmedFromEnd,
readLen,
mismatchQScoreSum,
mateRescue,
sample,
readId) +: observations
readPos += 1
Expand All @@ -100,13 +238,22 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
mapq,
negativeStrand,
firstOfPair,
readPos,
readPos - softclippedBases,
alignedLen,
posToInsDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
posToDelDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
trimmedFromStart,
trimmedFromEnd,
readLen,
mismatchQScoreSum,
mateRescue,
sample,
readId).asInstanceOf[Observation] +: observations

// increment read pointers
readPos += alleleLength
pos += 1

} else if (idx + 1 < cigar.length && cigar(idx + 1).getOperator == CigarOperator.D) {
// the allele includes the matching base
val alleleLength = 1 + cigar(idx + 1).getLength
Expand All @@ -119,7 +266,15 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
mapq,
negativeStrand,
firstOfPair,
readPos,
readPos - softclippedBases,
alignedLen,
posToInsDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
posToDelDist.flatMap((lst: Vector[Int]) => Some(lst(readPos - softclippedBases))),
trimmedFromStart,
trimmedFromEnd,
readLen,
mismatchQScoreSum,
mateRescue,
sample,
readId).asInstanceOf[Observation] +: observations

Expand Down Expand Up @@ -152,7 +307,9 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
// no op; handle inserts by looking ahead from match/mismatch operator
}
case CigarOperator.S => {
readPos += cigar(i).getLength
val clippedLen = cigar(i).getLength
readPos += clippedLen
softclippedBases += clippedLen
}
case CigarOperator.H =>
case _ => {
Expand All @@ -177,4 +334,4 @@ class ReadExplorer(referenceObservations: RDD[Observation]) extends Explorer wit
.flatMap(readToObservations)
} ++ referenceObservations
}
}
}
Loading