Version 1.2.0

Add isoseq and split bam streaming docs
PacificBiosciences · Feb 27, 2018 · 8b8ccdb · 8b8ccdb
1 parent ce0ad16
commit 8b8ccdb
Showing 1 changed file with 119 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -36,15 +36,15 @@ by PCR or ligation:
 2. Barcoded universal primers
 3. Barcoded adapters
 
-<img src="img/barcoding-schemes.png" width="1000px">
+<img src="img/barcoding-schemes.png" width="886px">
 
 In addition, there are three different barcode library designs.
 In order to describe a barcode library design, one can view it
 from a SMRTbell or read perspective.
 As *lima* supports raw subread and CCS read demultiplexing,
 the following terminology is based on the per (sub-)read view.
 
-<img src="img/barcode_overview.png" width="1000px">
+<img src="img/barcode_overview.png" width="886px">
 
 In the overview above, the input sequence is flanked by adapters on both sides.
 The bases adjacent to an adapter are referred to as barcode regions.
@@ -81,6 +81,14 @@ The sort order is defined by the barcode indices, lowest first.
  * Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
  * Enhanced filtering options to remove ambiguous calls
 
+## Changelog
+
+ * 1.2.0:
+   * Streaming of split BAM files
+   * New fat binary build approach
+ * 1.1.0: IsoSeq support
+ * 1.0.0: Initial release, included in SMRT Link 5.1.0
+
 ## Execution
 **Attention: Existing output files will be overwritten!**
 
@@ -111,7 +119,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
 Input data is either raw unaligned subreads, straight from a Sequel, or
 unaligned CCS reads, generated by [CCS 2](https://github.com/PacificBiosciences/unanimity);
 both in the PacBio enhanced BAM format. If you want to demux RSII data, first
-use SMRT Link or bax2bam to convert h5 to BAM.
+use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
+with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
 
 Barcodes are provided as a FASTA file, one entry per barcode sequence,
 **no duplicate** sequences, only upper-case bases,
@@ -542,54 +551,65 @@ Or any combination of those two:
 
 #### Yield
 Per-barcode base yield:
-<img src="img/plots/detail_yield_base.png" width="1000px">
+<img src="img/plots/detail_yield_base.png" width="886px">
 
 Per-barcode read yield:
-<img src="img/plots/detail_yield_read.png" width="1000px">
+
+<img src="img/plots/detail_yield_read.png" width="886px">
 
 Per-barcode ZMW yield:
-<img src="img/plots/detail_yield_zmw.png" width="1000px">
+<img src="img/plots/detail_yield_zmw.png" width="886px">
 
 #### Scores
 Score per number of adapters (lines) and all adapters (histogram).
 [What are half adapters?](#what-are-half-adapters)
-<img src="img/plots/detail_scores_per_adapter.png" width="1000px">
+
+<img src="img/plots/detail_scores_per_adapter.png" width="886px">
 
 Read length vs. mean score (99.9% percentile)
-<img src="img/plots/detail_read_length_vs_score.png" width="1000px">
+<img src="img/plots/detail_read_length_vs_score.png" width="886px">
 
 HQ length vs. mean score (99.9% percentile)
-<img src="img/plots/detail_hq_length_vs_score.png" width="1000px">
+<img src="img/plots/detail_hq_length_vs_score.png" width="886px">
 
 #### Read length (99.9% percentile, 1000 binwidth)
 Grouped by barcode, same y-axis :
-<img src="img/plots/detail_read_length_hist_group_same_y.png" width="1000px">
+
+<img src="img/plots/detail_read_length_hist_group_same_y.png" width="886px">
 
 Grouped by barcode, free y-axis:
-<img src="img/plots/detail_read_length_hist_group_free_y.png" width="1000px">
+
+<img src="img/plots/detail_read_length_hist_group_free_y.png" width="886px">
 
 Not grouped into facets, line histogram:
-<img src="img/plots/detail_read_length_linehist_nogroup.png" width="1000px">
+
+<img src="img/plots/detail_read_length_linehist_nogroup.png" width="886px">
 
 Barcoded vs. non-barcoded:
-<img src="img/plots/detail_read_length_hist_barcoded_or_not.png" width="1000px">
+
+<img src="img/plots/detail_read_length_hist_barcoded_or_not.png" width="886px">
 
 #### HQ length (99.9% percentile, 2000 binwidth)
 Grouped by barcode, same y-axis:
-<img src="img/plots/detail_hq_length_hist_group_same_y.png" width="1000px">
+
+<img src="img/plots/detail_hq_length_hist_group_same_y.png" width="886px">
 
 Grouped by barcode, free y-axis:
-<img src="img/plots/detail_hq_length_hist_group_free_y.png" width="1000px">
+
+<img src="img/plots/detail_hq_length_hist_group_free_y.png" width="886px">
 
 Not grouped into facets, line histogram:
-<img src="img/plots/detail_hq_length_linehist_nogroup.png" width="1000px">
+
+<img src="img/plots/detail_hq_length_linehist_nogroup.png" width="886px">
 
 Barcoded vs. non-barcoded:
-<img src="img/plots/detail_hq_length_hist_barcoded_or_not.png" width="1000px">
+
+<img src="img/plots/detail_hq_length_hist_barcoded_or_not.png" width="886px">
 
 #### Adapters  (99.9% percentile, 1 binwidth)
 Number of adapters:
-<img src="img/plots/detail_num_adapters.png" width="1000px">
+
+<img src="img/plots/detail_num_adapters.png" width="886px">
 
 ### `counts_detail.R`
 
@@ -598,10 +618,12 @@ This script cannot be called from the CLI yet, to be implemented.
 
 #### ZMW Counts
 Single plot:
-<img src="img/plots/counts_nogroup.png" width="1000px">
+
+<img src="img/plots/counts_nogroup.png" width="886px">
 
 Grouped by barcode into facets:
-<img src="img/plots/counts_group.png" width="1000px">
+
+<img src="img/plots/counts_group.png" width="886px">
 
 ### `report_summary.R`
 The third script is for high-plex data in one `lima.report` file:
@@ -612,19 +634,24 @@ Example:
     $ Rscript --vanilla scripts/r/report_summary.R prefix.lima.report
 
 Yield per barcode:
+
 <img src="img/plots/summary_yield_zmw.png" width="750px">
 
 Score distribution across all barcodes:
+
 <img src="img/plots/summary_score_hist.png" width="750px">
 
 Score distribution per barcode:
-<img src="img/plots/summary_score_hist_2d.png" width="1000px">
+
+<img src="img/plots/summary_score_hist_2d.png" width="886px">
 
 Read length distribution per barcode:
-<img src="img/plots/summary_read_length_hist_2d.png" width="1000px">
+
+<img src="img/plots/summary_read_length_hist_2d.png" width="886px">
 
 HQ length distribution per barcode:
-<img src="img/plots/summary_hq_length_hist_2d.png" width="1000px">
+
+<img src="img/plots/summary_hq_length_hist_2d.png" width="886px">
 
 ## Implementation details
 ### Smith-Waterman 101
@@ -707,16 +734,61 @@ The optional parameter `--split-bam-named`, names the files by their barcode nam
 of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
 in barcode names are replaced with an underscore in the file name.
 
-This mode consumes significantly more memory, as output cannot be streamed.
+This mode might consume more memory. Read the next FAQ entry for more information.
 
 In addition, a `prefix.datastore.json` is generated to wrap the individual dataset
 files.
 
 ### Why is the memory consumption really high?
 Most likely this is due to `--split-bam` or `--split-bam-named` mode.
 The latter is activated per default in SMRT Link.
-Memory usage (RES column in top) is approximately the size of the input,
-uncompressed.
+*Lima* is able to stream up to 500 barcode pairs to individual split BAM files.
+If more than 500 barcode pairs are detected, additional output is buffered first.
+In this case, memory usage (RES column in top) is approximately the size of the
+input, uncompressed.
+
+The maximum concurrent output BAM file handles can be adjusted with
+`--bam-handles N`. The default is 500.
+
+Examples, how memory usage is affected by `--bam-handles`. Option
+`--bam-handles-verbose` is only used to visualize the BAM output file handles.
+Memory usage reported using [memusg](https://gist.github.com/netj/526585):
+
+    $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 9 --bam-handles-verbose
+    Open stream 7--7
+    Open stream 3--3
+    Open stream 5--5
+    Open stream 1--1
+    Open stream 4--4
+    Open stream 6--6
+    Open stream 0--0
+    Open stream 2--2
+    Open stream 210--210
+    memusg: peak=86,728
+
+    $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 4 --bam-handles-verbose
+    Open stream 7--7
+    Open stream 3--3
+    Open stream 5--5
+    Open stream 1--1
+    Buffered stream 0--0
+    Buffered stream 2--2
+    Buffered stream 4--4
+    Buffered stream 6--6
+    Buffered stream 210--210
+    memusg: peak=113,476
+
+    $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 0 --bam-handles-verbose
+    Buffered stream 0--0
+    Buffered stream 1--1
+    Buffered stream 2--2
+    Buffered stream 3--3
+    Buffered stream 4--4
+    Buffered stream 5--5
+    Buffered stream 6--6
+    Buffered stream 7--7
+    Buffered stream 210--210
+    memusg: peak=132,276
 
 ### What are half adapters?
 If there is an adapter call with only one barcode region,
@@ -767,7 +839,27 @@ individually, but not as a pair. Those unwanted barcode pairs are called
 hybrids in *lima*.
 
 ### How can I demultiplex IsoSeq data?
-This feature will be added in the upcoming version!
+Even if you only want to remove IsoSeq primers, *lima* is the tool of choice.
+
+1) Remove all duplicate sequences.
+2) Annotate sequence names with a P5 or P3 suffix. Example:
+
+```
+    >Forward_P5
+    AAGCAGTGGTATCAACGCAGAGTACATGGGG
+    >SampleBrain_P3
+    AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
+    >SampleLiver_P3
+    AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
+```
+
+3) Use the `--isoseq` mode; cannot be combined with `--guess`.
+4) Output will be only different pairs with a `P5` and `P3` combination:
+
+```
+    demux.Forward_P5--SampleBrain_P3.bam
+    demux.Forward_P5--SampleLiver_P3.bam
+```
 
 ### Why do most of my ZMWs get filtered by the score lead threshold?
 The score lead measures how close the best barcode call is to the second best.