diff --git a/README.md b/README.md index 185b4d5..324060b 100644 --- a/README.md +++ b/README.md @@ -76,13 +76,7 @@ The sort order is defined by the barcode indices, lowest first. * Enhanced filtering options to remove ambiguous calls ## Latest Version - -Version **1.8.0**: - * Add clip lengths as `bx` tag - * Enable single-barcode samples - * Implicitly call `--dump-clips` in `--isoseq` mode - -[Full changelog here](#full-changelog) +Version **1.9.0**: [Full changelog here](#full-changelog) ## Execution @@ -196,6 +190,7 @@ how ZMWs many are *same/different*, and how many reads have been filtered. Below min score lead : 11656 (32%) Below min ref span : 3124 (8%) Without adapter : 25094 (68%) + With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio Undesired hybrids : xxx (xx%) <- Only with --peek-guess Undesired same barcode pairs : xxx (xx%) <- Only with --different Undesired diff barcode pairs : xxx (xx%) <- Only with --same @@ -212,6 +207,8 @@ how ZMWs many are *same/different*, and how many reads have been filtered. ZMWs for (A): Allow diff barcode pair : 157264 (74%) Allow same barcode pair : 188026 (88%) + Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio + Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio Reads for (B): Above length : 1278461 (100%) @@ -509,6 +506,8 @@ A `prefix.lima.guess` file shows the decision process. Example: ### `--guess-min-count` The minimum ZMW abundance to whitelist a barcode. This filter is `AND`ed with the minimum barcode score provided by `--guess`. The default is 0. +If there are in total less barcoded ZMWs than the provided threshold, +the guess feature is automatically deactivated. ### `--peek` && `--guess` The optimal way is to use both advanced options in combination, e.g., @@ -522,6 +521,9 @@ Equivalent to the `Infer Barcodes Used parameter` option in SMRT Link. Sets the following options: `--peek 50000 --guess 45 --guess-min-count 10`. +If used in combination with `--isoseq`: +`--peek 50000 --guess 75 --guess-min-count 100`. + ### `--single-side` Identify barcodes in molecules that only have barcodes adjacent to one adapter. This approach makes no assumption about an alternating pattern of barcoded and @@ -678,6 +680,10 @@ HQ length distribution per barcode: +Bad adapter ratio histogram: + + + ## Implementation details ### Smith-Waterman 101 Barcode score and clipping position are computed by a Smith-Waterman algorithm. @@ -828,6 +834,49 @@ For asymmetric designs with *different* barcodes in a pair, at least a single full-pass read is required; this can be two adapters, two half-adapters, or a combination. +### What are bad adapters? +In the subreads.bam file, each subread has a context flag `cx`. +It annotates, among other things, if a subread has flanking adapters, +before and/or after. Adapter finding has been improved and can also find +molecularly missing adapters or those obscured by a local decrease in accuracy. +This may lead to missing or obscured bases in the flanking barcode. +Such adapters are called "bad", since they don't align with the adapter reference +sequence(s). +Regions flanking those bad adapters are problematic, because they can fully or +partially miss the barcode bases, leading to wrong classification of the +molecule. +*Lima* can handle those adapters, by ignoring regions flanking +bad adapters. For this, *lima* computes the ratio of +number of bad adapters divided by number of all adapters. + +By default, `--bad-adapter-ratio` is set to `0` and does not perform any filtering. +In this mode, bad adapters are handled just like good adapters. +But the `*.lima.summary` file contains one row with the number of +ZMWs that have at least 25% bad adapters, but otherwise pass all other filters. +This metric can be used as a diagnostic to assess the library prep. + +If `--bad-adapter-ratio` is set non-zero positive `(0,1]`, +bad adapter flanking barcode regions are treated as missing. +If a ZMW has a higher ratio of bad adapters than provided, the ZMW +is being filtered and consequently removed from the output. +The `*.lima.summary` file contains two additional rows. + + With bad adapter : 10349 (28%) + Bad adapter yield loss : 10112 (5%) + +The first row counts the number of ZMWs that have too high bad adapter ratios +and the percentage is with respected to the number of all ZMW not passing. +The second row counts the number of ZMWs that only get removed because of +too high bad adapter ratios and the percentage is with respect the number of all +input ZMWs and consequently is the effective yield loss caused by bad adapters. + +If a ZMW has ~50% bad adapters, one side of the molecule is molecularly missing +an adapter. For 100% bad adapter, both sides are missing adapters. +A lower than ~40% percentage indicates decreased local accuracy during +sequencing leading to adapter sequences not being found. If a high percentage +of ZMWs is molecularly missing adapters, you should improve library prep. + + ### Why are *different* barcode pair hits reported in --same mode? *Lima* tries all barcode combinations and `--same` only filters BAM output. Sequences flanked by *different* barcodes are still reported, but are not @@ -878,7 +927,7 @@ Even if you only want to remove IsoSeq primers, *lima* is the tool of choice. AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG ``` -3) Use the `--isoseq` mode; cannot be combined with `--guess`. +3) Use the `--isoseq` mode. Run in combination with `--peek-guess` to remove spurious false positive. 4) Output will be only different pairs with a `5p` and `3p` combination: ``` @@ -939,6 +988,11 @@ false positives. ## Full Changelog + * **1.9.0**: + * Add `--bad-adapter-ratio` to remove ZMWs with molecularly missing adapters + * Fix rare case, where a read only matches one barcode and not a single alternative + * Fix `--no-bam` to automatically omit pbi + * Allow combination of `--isoseq` with `--peek-guess` * 1.8.0: * Add clip lengths as `bx` tag * Enable single-barcode samples diff --git a/img/plots/summary_bad_adapter_ratio.png b/img/plots/summary_bad_adapter_ratio.png new file mode 100644 index 0000000..22904ba Binary files /dev/null and b/img/plots/summary_bad_adapter_ratio.png differ diff --git a/scripts/r/report_detail.R b/scripts/r/report_detail.R old mode 100644 new mode 100755 index 202b7ce..4d2a866 --- a/scripts/r/report_detail.R +++ b/scripts/r/report_detail.R @@ -79,7 +79,7 @@ titration_pass = report %>% filter(Barcoded, PassedFilters) %>% filter(BarcodePa titration_pass$Filter = "PASS" readLengthsUnnested = report %>% select(ReadLengths, Barcoded) %>% unnest(ReadLengths) -scoresCombinedUnnested = report %>% select(ScoresCombined, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS")) %>% unnest(ScoresCombined) +scoresCombinedUnnested = report %>% select(ScoresCombined, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS")) %>% unnest(ScoresCombined) %>% filter(ScoresCombined >= 0) refSpans = report %>% select(NumBarcodedRegionsPassed, PassedFilters, BarcodePair) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% mutate(Filter = PassedFilters) %>% mutate(Filter=ifelse(Filter==0,"NONE","PASS")) dpi = 150 diff --git a/scripts/r/report_summary.R b/scripts/r/report_summary.R old mode 100644 new mode 100755 index 0e81bea..3699b7f --- a/scripts/r/report_summary.R +++ b/scripts/r/report_summary.R @@ -32,6 +32,14 @@ zmwYieldVsMeanScore1 = report %>% filter(Barcoded) %>% filter(BarcodePair %in% u zmwYieldVsMeanScore2 = report %>% filter(Barcoded) %>% filter(BarcodePair %in% unique_bps$BarcodePair) %>% select(BarcodePair, ZMW, ScoreCombined) %>% group_by(BarcodePair) %>% count(BarcodePair) zmwYieldVsMeanScore = full_join(zmwYieldVsMeanScore1,zmwYieldVsMeanScore2,by="BarcodePair") %>% rename(NumZMWs=n) +g = ggplot(report) + + geom_histogram(aes(NumBadAdapterRatio), binwidth = 0.05) + + scale_x_continuous(breaks=seq(0,1,0.1),labels=seq(0,1,0.1),limits=c(-.05,1.05)) + + theme_minimal() + + xlab("Ratio #BadAdapters/#Adapters")+ + ylab("ZMW yield") +ggsave("summary_bad_adapter_ratio.png",g,width=20,height=15,dpi=150,units="cm") + g = ggplot(zmwYieldVsMeanScore) + geom_jitter(aes(MeanScore, NumZMWs)) + coord_cartesian(xlim = c(0, 100), ylim = c(0, max(zmwYieldVsMeanScore$NumZMWs)*1.1)) + @@ -98,4 +106,4 @@ g = ggplot(report %>% filter(Barcoded) %>% filter(IdxFirst == IdxCombined)) + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),plot.title = element_text(hjust = 0.5))+ xlab("Barcoded Samples") + ylab("HQ Length") -ggsave("summary_hq_length_hist_2d.png",width=50,height=15,dpi=150,units="cm") +ggsave("summary_hq_length_hist_2d.png",width=50,height=15,dpi=150,units="cm") \ No newline at end of file