diff --git a/README.md b/README.md
index 6435d06..7d4323e 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ by PCR or ligation:
2. Barcoded universal primers
3. Barcoded adapters
-
+
In addition, there are three different barcode library designs.
In order to describe a barcode library design, one can view it
@@ -44,7 +44,7 @@ from a SMRTbell or read perspective.
As *lima* supports raw subread and CCS read demultiplexing,
the following terminology is based on the per (sub-)read view.
-
+
In the overview above, the input sequence is flanked by adapters on both sides.
The bases adjacent to an adapter are referred to as barcode regions.
@@ -81,6 +81,14 @@ The sort order is defined by the barcode indices, lowest first.
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
* Enhanced filtering options to remove ambiguous calls
+## Changelog
+
+ * 1.2.0:
+ * Streaming of split BAM files
+ * New fat binary build approach
+ * 1.1.0: IsoSeq support
+ * 1.0.0: Initial release, included in SMRT Link 5.1.0
+
## Execution
**Attention: Existing output files will be overwritten!**
@@ -111,7 +119,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
Input data is either raw unaligned subreads, straight from a Sequel, or
unaligned CCS reads, generated by [CCS 2](https://github.com/PacificBiosciences/unanimity);
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
-use SMRT Link or bax2bam to convert h5 to BAM.
+use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
+with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
Barcodes are provided as a FASTA file, one entry per barcode sequence,
**no duplicate** sequences, only upper-case bases,
@@ -542,54 +551,65 @@ Or any combination of those two:
#### Yield
Per-barcode base yield:
-
+
Per-barcode read yield:
-
+
+
Per-barcode ZMW yield:
-
+
#### Scores
Score per number of adapters (lines) and all adapters (histogram).
[What are half adapters?](#what-are-half-adapters)
-
+
+
Read length vs. mean score (99.9% percentile)
-
+
HQ length vs. mean score (99.9% percentile)
-
+
#### Read length (99.9% percentile, 1000 binwidth)
Grouped by barcode, same y-axis :
-
+
+
Grouped by barcode, free y-axis:
-
+
+
Not grouped into facets, line histogram:
-
+
+
Barcoded vs. non-barcoded:
-
+
+
#### HQ length (99.9% percentile, 2000 binwidth)
Grouped by barcode, same y-axis:
-
+
+
Grouped by barcode, free y-axis:
-
+
+
Not grouped into facets, line histogram:
-
+
+
Barcoded vs. non-barcoded:
-
+
+
#### Adapters (99.9% percentile, 1 binwidth)
Number of adapters:
-
+
+
### `counts_detail.R`
@@ -598,10 +618,12 @@ This script cannot be called from the CLI yet, to be implemented.
#### ZMW Counts
Single plot:
-
+
+
Grouped by barcode into facets:
-
+
+
### `report_summary.R`
The third script is for high-plex data in one `lima.report` file:
@@ -612,19 +634,24 @@ Example:
$ Rscript --vanilla scripts/r/report_summary.R prefix.lima.report
Yield per barcode:
+
Score distribution across all barcodes:
+
Score distribution per barcode:
-
+
+
Read length distribution per barcode:
-
+
+
HQ length distribution per barcode:
-
+
+
## Implementation details
### Smith-Waterman 101
@@ -707,7 +734,7 @@ The optional parameter `--split-bam-named`, names the files by their barcode nam
of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
in barcode names are replaced with an underscore in the file name.
-This mode consumes significantly more memory, as output cannot be streamed.
+This mode might consume more memory. Read the next FAQ entry for more information.
In addition, a `prefix.datastore.json` is generated to wrap the individual dataset
files.
@@ -715,8 +742,53 @@ files.
### Why is the memory consumption really high?
Most likely this is due to `--split-bam` or `--split-bam-named` mode.
The latter is activated per default in SMRT Link.
-Memory usage (RES column in top) is approximately the size of the input,
-uncompressed.
+*Lima* is able to stream up to 500 barcode pairs to individual split BAM files.
+If more than 500 barcode pairs are detected, additional output is buffered first.
+In this case, memory usage (RES column in top) is approximately the size of the
+input, uncompressed.
+
+The maximum concurrent output BAM file handles can be adjusted with
+`--bam-handles N`. The default is 500.
+
+Examples, how memory usage is affected by `--bam-handles`. Option
+`--bam-handles-verbose` is only used to visualize the BAM output file handles.
+Memory usage reported using [memusg](https://gist.github.com/netj/526585):
+
+ $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 9 --bam-handles-verbose
+ Open stream 7--7
+ Open stream 3--3
+ Open stream 5--5
+ Open stream 1--1
+ Open stream 4--4
+ Open stream 6--6
+ Open stream 0--0
+ Open stream 2--2
+ Open stream 210--210
+ memusg: peak=86,728
+
+ $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 4 --bam-handles-verbose
+ Open stream 7--7
+ Open stream 3--3
+ Open stream 5--5
+ Open stream 1--1
+ Buffered stream 0--0
+ Buffered stream 2--2
+ Buffered stream 4--4
+ Buffered stream 6--6
+ Buffered stream 210--210
+ memusg: peak=113,476
+
+ $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 0 --bam-handles-verbose
+ Buffered stream 0--0
+ Buffered stream 1--1
+ Buffered stream 2--2
+ Buffered stream 3--3
+ Buffered stream 4--4
+ Buffered stream 5--5
+ Buffered stream 6--6
+ Buffered stream 7--7
+ Buffered stream 210--210
+ memusg: peak=132,276
### What are half adapters?
If there is an adapter call with only one barcode region,
@@ -767,7 +839,27 @@ individually, but not as a pair. Those unwanted barcode pairs are called
hybrids in *lima*.
### How can I demultiplex IsoSeq data?
-This feature will be added in the upcoming version!
+Even if you only want to remove IsoSeq primers, *lima* is the tool of choice.
+
+1) Remove all duplicate sequences.
+2) Annotate sequence names with a P5 or P3 suffix. Example:
+
+```
+ >Forward_P5
+ AAGCAGTGGTATCAACGCAGAGTACATGGGG
+ >SampleBrain_P3
+ AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
+ >SampleLiver_P3
+ AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
+```
+
+3) Use the `--isoseq` mode; cannot be combined with `--guess`.
+4) Output will be only different pairs with a `P5` and `P3` combination:
+
+```
+ demux.Forward_P5--SampleBrain_P3.bam
+ demux.Forward_P5--SampleLiver_P3.bam
+```
### Why do most of my ZMWs get filtered by the score lead threshold?
The score lead measures how close the best barcode call is to the second best.