Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tool to consensus call overlapping bases in paired end reads. Adds direct support in the consensus calling tools too. #805

Merged
merged 25 commits into from
Apr 4, 2022

Conversation

nh13
Copy link
Member

@nh13 nh13 commented Mar 4, 2022

This PR adds a tool CallOverlappingConsensusBases that will examine read pairs who overlap and consensus call those overlapping bases in-place. This functionality is extracted into the OverlappingBasesConsensusCaller class so that both the CallMolecularConsensusReads and CallDuplexConsensusReads tools can also support this (opt-in).

For a given read pair, it will iterate through the the mapped bases that overlap between the read and mate in a read pair. If the read and mate agree at a given reference position, then read and mate base will not change and the base quality returned is controlled by maxQualOnAgreement. If they disagree at a given reference position, then the base and quality returned is controlled by onlyMaskDisagreements. See OverlappingBasesConsensusCaller for more details on the two arguments.

To facilitate all this, a new iterator implementation ReadAndRefPosIterator has been built. For a given mapped read, this iterates over each mapped read base (i.e. has a corresponding reference base), with each value being the 1-based position in the read and reference respectively. Insertions and deletions will not be returned. For example, an insertion consumes a read base, but has no corresponding reference base. A deletion consumes a reference base, but has no corresponding read base. This class is extended in the new MateOverlappingReadAndRefPosIterator, which can iterate by reference position through just the overlapping bases in a read and its mate. This iteration is how the consensus caller for overlapping bases lines up bases to examine between a read and its mate.

@codecov-commenter
Copy link

codecov-commenter commented Mar 4, 2022

Codecov Report

Merging #805 (f447fcf) into main (560f089) will increase coverage by 0.08%.
The diff coverage is 99.39%.

@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
+ Coverage   95.58%   95.67%   +0.08%     
==========================================
  Files         122      125       +3     
  Lines        7068     7255     +187     
  Branches      511      521      +10     
==========================================
+ Hits         6756     6941     +185     
- Misses        312      314       +2     
Flag Coverage Δ
unittests 95.67% <99.39%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...genomics/bam/OverlappingBasesConsensusCaller.scala 98.61% <98.61%> (ø)
...umgenomics/bam/CallOverlappingConsensusBases.scala 100.00% <100.00%> (ø)
...om/fulcrumgenomics/bam/ReadAndRefPosIterator.scala 100.00% <100.00%> (ø)
...fulcrumgenomics/umi/CallDuplexConsensusReads.scala 100.00% <100.00%> (ø)
...crumgenomics/umi/CallMolecularConsensusReads.scala 100.00% <100.00%> (ø)
.../scala/com/fulcrumgenomics/util/NumericTypes.scala 95.60% <0.00%> (-0.35%) ⬇️
.../scala/com/fulcrumgenomics/bam/api/SamWriter.scala 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 560f089...f447fcf. Read the comment docs.

@nh13 nh13 force-pushed the feature/correct-overlapping-reads branch from 8588fcd to f6e9f09 Compare March 10, 2022 00:39
@nh13 nh13 changed the base branch from main to sam-record-native-pos-at-methods March 10, 2022 00:40
@nh13 nh13 marked this pull request as ready for review March 10, 2022 19:19
@nh13 nh13 requested a review from tfenne March 10, 2022 19:20
@nh13 nh13 changed the base branch from sam-record-native-pos-at-methods to feature/clip-bam-no-sort March 10, 2022 22:04
@nh13 nh13 force-pushed the feature/correct-overlapping-reads branch from 5ebd61f to 31d7fe5 Compare March 10, 2022 22:07
Base automatically changed from feature/clip-bam-no-sort to main March 11, 2022 15:45
@nh13 nh13 force-pushed the feature/correct-overlapping-reads branch from dd02a62 to 5415287 Compare March 21, 2022 18:42
@nh13 nh13 changed the title Personal tool to correct overlapping bases Tool to consensus call overlapping bases in paired end reads. Adds direct support in the consensus calling tools too. Mar 21, 2022
@nh13 nh13 marked this pull request as draft March 21, 2022 19:24
@nh13 nh13 marked this pull request as ready for review March 21, 2022 23:50
Copy link

@kstromhaug kstromhaug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tfenne tfenne assigned nh13 and unassigned tfenne Mar 24, 2022
@nh13
Copy link
Member Author

nh13 commented Mar 26, 2022

@tfenne back to you.

  1. There's the question about capping the base quality coming out, which I don't think is needed now that we have agreement and disagreement strategies:

Wait - so the default is to output Q60 if the two inputs are Q30? Should there be some cap like in the main consensus callers?

We don't have an explicit cap in the main consensus callers, except to make sure it doesn't go beyond 93 or whatever the htsjdk maximum value is. The other callers have an effective cap due to the pre and post labeling error rates, which we aren't using in this simplified model. I could switch over to ConsensusCaller and have the pre-error rate be effectively zero, and the post-error rate by 40 or whatever the default is, but is it worth it?

  1. fixed everything else you asked for except your comment here.

See this comment in SamRecords that you wrote on 5/19/2017:

// TODO long-term: replace these two methods with methods on [[Cigar]] to save creating alignment blocks in memory

I took that to heart and started an PR for replacing those methods in #809, but didn't end up needing them here, since it was rather than parsing and iterating through the cigar for each overlapped read/reference position, it was more efficient to create iterators (one for a read, one for a mate) that kept track of where it was in the cigar for each read and returned the desired overlapping read/references positions. Yes it's complex, but this booking keeping is hard!

@tfenne tfenne self-requested a review April 4, 2022 16:30
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Please merge when done.

@nh13 nh13 merged commit a96cdd7 into main Apr 4, 2022
@nh13 nh13 deleted the feature/correct-overlapping-reads branch April 4, 2022 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants