diff --git a/README.md b/README.md
index 185b4d5..324060b 100644
--- a/README.md
+++ b/README.md
@@ -76,13 +76,7 @@ The sort order is defined by the barcode indices, lowest first.
* Enhanced filtering options to remove ambiguous calls
## Latest Version
-
-Version **1.8.0**:
- * Add clip lengths as `bx` tag
- * Enable single-barcode samples
- * Implicitly call `--dump-clips` in `--isoseq` mode
-
-[Full changelog here](#full-changelog)
+Version **1.9.0**: [Full changelog here](#full-changelog)
## Execution
@@ -196,6 +190,7 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
Below min score lead : 11656 (32%)
Below min ref span : 3124 (8%)
Without adapter : 25094 (68%)
+ With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio
Undesired hybrids : xxx (xx%) <- Only with --peek-guess
Undesired same barcode pairs : xxx (xx%) <- Only with --different
Undesired diff barcode pairs : xxx (xx%) <- Only with --same
@@ -212,6 +207,8 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
ZMWs for (A):
Allow diff barcode pair : 157264 (74%)
Allow same barcode pair : 188026 (88%)
+ Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio
+ Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio
Reads for (B):
Above length : 1278461 (100%)
@@ -509,6 +506,8 @@ A `prefix.lima.guess` file shows the decision process. Example:
### `--guess-min-count`
The minimum ZMW abundance to whitelist a barcode. This filter is `AND`ed with
the minimum barcode score provided by `--guess`. The default is 0.
+If there are in total less barcoded ZMWs than the provided threshold,
+the guess feature is automatically deactivated.
### `--peek` && `--guess`
The optimal way is to use both advanced options in combination, e.g.,
@@ -522,6 +521,9 @@ Equivalent to the `Infer Barcodes Used parameter` option in SMRT Link.
Sets the following options:
`--peek 50000 --guess 45 --guess-min-count 10`.
+If used in combination with `--isoseq`:
+`--peek 50000 --guess 75 --guess-min-count 100`.
+
### `--single-side`
Identify barcodes in molecules that only have barcodes adjacent to one adapter.
This approach makes no assumption about an alternating pattern of barcoded and
@@ -678,6 +680,10 @@ HQ length distribution per barcode:
+Bad adapter ratio histogram:
+
+
+
## Implementation details
### Smith-Waterman 101
Barcode score and clipping position are computed by a Smith-Waterman algorithm.
@@ -828,6 +834,49 @@ For asymmetric designs with *different* barcodes in a pair, at least a single
full-pass read is required; this can be two adapters, two half-adapters, or a
combination.
+### What are bad adapters?
+In the subreads.bam file, each subread has a context flag `cx`.
+It annotates, among other things, if a subread has flanking adapters,
+before and/or after. Adapter finding has been improved and can also find
+molecularly missing adapters or those obscured by a local decrease in accuracy.
+This may lead to missing or obscured bases in the flanking barcode.
+Such adapters are called "bad", since they don't align with the adapter reference
+sequence(s).
+Regions flanking those bad adapters are problematic, because they can fully or
+partially miss the barcode bases, leading to wrong classification of the
+molecule.
+*Lima* can handle those adapters, by ignoring regions flanking
+bad adapters. For this, *lima* computes the ratio of
+number of bad adapters divided by number of all adapters.
+
+By default, `--bad-adapter-ratio` is set to `0` and does not perform any filtering.
+In this mode, bad adapters are handled just like good adapters.
+But the `*.lima.summary` file contains one row with the number of
+ZMWs that have at least 25% bad adapters, but otherwise pass all other filters.
+This metric can be used as a diagnostic to assess the library prep.
+
+If `--bad-adapter-ratio` is set non-zero positive `(0,1]`,
+bad adapter flanking barcode regions are treated as missing.
+If a ZMW has a higher ratio of bad adapters than provided, the ZMW
+is being filtered and consequently removed from the output.
+The `*.lima.summary` file contains two additional rows.
+
+ With bad adapter : 10349 (28%)
+ Bad adapter yield loss : 10112 (5%)
+
+The first row counts the number of ZMWs that have too high bad adapter ratios
+and the percentage is with respected to the number of all ZMW not passing.
+The second row counts the number of ZMWs that only get removed because of
+too high bad adapter ratios and the percentage is with respect the number of all
+input ZMWs and consequently is the effective yield loss caused by bad adapters.
+
+If a ZMW has ~50% bad adapters, one side of the molecule is molecularly missing
+an adapter. For 100% bad adapter, both sides are missing adapters.
+A lower than ~40% percentage indicates decreased local accuracy during
+sequencing leading to adapter sequences not being found. If a high percentage
+of ZMWs is molecularly missing adapters, you should improve library prep.
+
+
### Why are *different* barcode pair hits reported in --same mode?
*Lima* tries all barcode combinations and `--same` only filters BAM output.
Sequences flanked by *different* barcodes are still reported, but are not
@@ -878,7 +927,7 @@ Even if you only want to remove IsoSeq primers, *lima* is the tool of choice.
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
```
-3) Use the `--isoseq` mode; cannot be combined with `--guess`.
+3) Use the `--isoseq` mode. Run in combination with `--peek-guess` to remove spurious false positive.
4) Output will be only different pairs with a `5p` and `3p` combination:
```
@@ -939,6 +988,11 @@ false positives.
## Full Changelog
+ * **1.9.0**:
+ * Add `--bad-adapter-ratio` to remove ZMWs with molecularly missing adapters
+ * Fix rare case, where a read only matches one barcode and not a single alternative
+ * Fix `--no-bam` to automatically omit pbi
+ * Allow combination of `--isoseq` with `--peek-guess`
* 1.8.0:
* Add clip lengths as `bx` tag
* Enable single-barcode samples
diff --git a/img/plots/summary_bad_adapter_ratio.png b/img/plots/summary_bad_adapter_ratio.png
new file mode 100644
index 0000000..22904ba
Binary files /dev/null and b/img/plots/summary_bad_adapter_ratio.png differ
diff --git a/scripts/r/report_detail.R b/scripts/r/report_detail.R
old mode 100644
new mode 100755
index 202b7ce..4d2a866
--- a/scripts/r/report_detail.R
+++ b/scripts/r/report_detail.R
@@ -79,7 +79,7 @@ titration_pass = report %>% filter(Barcoded, PassedFilters) %>% filter(BarcodePa
titration_pass$Filter = "PASS"
readLengthsUnnested = report %>% select(ReadLengths, Barcoded) %>% unnest(ReadLengths)
-scoresCombinedUnnested = report %>% select(ScoresCombined, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS")) %>% unnest(ScoresCombined)
+scoresCombinedUnnested = report %>% select(ScoresCombined, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS")) %>% unnest(ScoresCombined) %>% filter(ScoresCombined >= 0)
refSpans = report %>% select(NumBarcodedRegionsPassed, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS"))
dpi = 150
diff --git a/scripts/r/report_summary.R b/scripts/r/report_summary.R
old mode 100644
new mode 100755
index 0e81bea..3699b7f
--- a/scripts/r/report_summary.R
+++ b/scripts/r/report_summary.R
@@ -32,6 +32,14 @@ zmwYieldVsMeanScore1 = report %>% filter(Barcoded) %>% filter(BarcodePair %in% u
zmwYieldVsMeanScore2 = report %>% filter(Barcoded) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% select(BarcodePair, ZMW, ScoreCombined) %>% group_by(BarcodePair) %>% count(BarcodePair)
zmwYieldVsMeanScore = full_join(zmwYieldVsMeanScore1,zmwYieldVsMeanScore2,by="BarcodePair") %>% rename(NumZMWs=n)
+g = ggplot(report) +
+ geom_histogram(aes(NumBadAdapterRatio), binwidth = 0.05) +
+ scale_x_continuous(breaks=seq(0,1,0.1),labels=seq(0,1,0.1),limits=c(-.05,1.05)) +
+ theme_minimal() +
+ xlab("Ratio #BadAdapters/#Adapters")+
+ ylab("ZMW yield")
+ggsave("summary_bad_adapter_ratio.png",g,width=20,height=15,dpi=150,units="cm")
+
g = ggplot(zmwYieldVsMeanScore) +
geom_jitter(aes(MeanScore, NumZMWs)) +
coord_cartesian(xlim = c(0, 100), ylim = c(0, max(zmwYieldVsMeanScore$NumZMWs)*1.1)) +
@@ -98,4 +106,4 @@ g = ggplot(report %>% filter(Barcoded) %>% filter(IdxFirst == IdxCombined)) +
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),plot.title = element_text(hjust = 0.5))+
xlab("Barcoded Samples") +
ylab("HQ Length")
-ggsave("summary_hq_length_hist_2d.png",width=50,height=15,dpi=150,units="cm")
+ggsave("summary_hq_length_hist_2d.png",width=50,height=15,dpi=150,units="cm")
\ No newline at end of file