diff --git a/README.md b/README.md index 6435d06..7d4323e 100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ by PCR or ligation: 2. Barcoded universal primers 3. Barcoded adapters - + In addition, there are three different barcode library designs. In order to describe a barcode library design, one can view it @@ -44,7 +44,7 @@ from a SMRTbell or read perspective. As *lima* supports raw subread and CCS read demultiplexing, the following terminology is based on the per (sub-)read view. - + In the overview above, the input sequence is flanked by adapters on both sides. The bases adjacent to an adapter are referred to as barcode regions. @@ -81,6 +81,14 @@ The sort order is defined by the barcode indices, lowest first. * Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold * Enhanced filtering options to remove ambiguous calls +## Changelog + + * 1.2.0: + * Streaming of split BAM files + * New fat binary build approach + * 1.1.0: IsoSeq support + * 1.0.0: Initial release, included in SMRT Link 5.1.0 + ## Execution **Attention: Existing output files will be overwritten!** @@ -111,7 +119,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result. Input data is either raw unaligned subreads, straight from a Sequel, or unaligned CCS reads, generated by [CCS 2](https://github.com/PacificBiosciences/unanimity); both in the PacBio enhanced BAM format. If you want to demux RSII data, first -use SMRT Link or bax2bam to convert h5 to BAM. +use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json` +with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed. Barcodes are provided as a FASTA file, one entry per barcode sequence, **no duplicate** sequences, only upper-case bases, @@ -542,54 +551,65 @@ Or any combination of those two: #### Yield Per-barcode base yield: - + Per-barcode read yield: - + + Per-barcode ZMW yield: - + #### Scores Score per number of adapters (lines) and all adapters (histogram). [What are half adapters?](#what-are-half-adapters) - + + Read length vs. mean score (99.9% percentile) - + HQ length vs. mean score (99.9% percentile) - + #### Read length (99.9% percentile, 1000 binwidth) Grouped by barcode, same y-axis : - + + Grouped by barcode, free y-axis: - + + Not grouped into facets, line histogram: - + + Barcoded vs. non-barcoded: - + + #### HQ length (99.9% percentile, 2000 binwidth) Grouped by barcode, same y-axis: - + + Grouped by barcode, free y-axis: - + + Not grouped into facets, line histogram: - + + Barcoded vs. non-barcoded: - + + #### Adapters (99.9% percentile, 1 binwidth) Number of adapters: - + + ### `counts_detail.R` @@ -598,10 +618,12 @@ This script cannot be called from the CLI yet, to be implemented. #### ZMW Counts Single plot: - + + Grouped by barcode into facets: - + + ### `report_summary.R` The third script is for high-plex data in one `lima.report` file: @@ -612,19 +634,24 @@ Example: $ Rscript --vanilla scripts/r/report_summary.R prefix.lima.report Yield per barcode: + Score distribution across all barcodes: + Score distribution per barcode: - + + Read length distribution per barcode: - + + HQ length distribution per barcode: - + + ## Implementation details ### Smith-Waterman 101 @@ -707,7 +734,7 @@ The optional parameter `--split-bam-named`, names the files by their barcode nam of their barcode indices. Non-word characters, anything except [A-Za-z0-9_], in barcode names are replaced with an underscore in the file name. -This mode consumes significantly more memory, as output cannot be streamed. +This mode might consume more memory. Read the next FAQ entry for more information. In addition, a `prefix.datastore.json` is generated to wrap the individual dataset files. @@ -715,8 +742,53 @@ files. ### Why is the memory consumption really high? Most likely this is due to `--split-bam` or `--split-bam-named` mode. The latter is activated per default in SMRT Link. -Memory usage (RES column in top) is approximately the size of the input, -uncompressed. +*Lima* is able to stream up to 500 barcode pairs to individual split BAM files. +If more than 500 barcode pairs are detected, additional output is buffered first. +In this case, memory usage (RES column in top) is approximately the size of the +input, uncompressed. + +The maximum concurrent output BAM file handles can be adjusted with +`--bam-handles N`. The default is 500. + +Examples, how memory usage is affected by `--bam-handles`. Option +`--bam-handles-verbose` is only used to visualize the BAM output file handles. +Memory usage reported using [memusg](https://gist.github.com/netj/526585): + + $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 9 --bam-handles-verbose + Open stream 7--7 + Open stream 3--3 + Open stream 5--5 + Open stream 1--1 + Open stream 4--4 + Open stream 6--6 + Open stream 0--0 + Open stream 2--2 + Open stream 210--210 + memusg: peak=86,728 + + $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 4 --bam-handles-verbose + Open stream 7--7 + Open stream 3--3 + Open stream 5--5 + Open stream 1--1 + Buffered stream 0--0 + Buffered stream 2--2 + Buffered stream 4--4 + Buffered stream 6--6 + Buffered stream 210--210 + memusg: peak=113,476 + + $ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 0 --bam-handles-verbose + Buffered stream 0--0 + Buffered stream 1--1 + Buffered stream 2--2 + Buffered stream 3--3 + Buffered stream 4--4 + Buffered stream 5--5 + Buffered stream 6--6 + Buffered stream 7--7 + Buffered stream 210--210 + memusg: peak=132,276 ### What are half adapters? If there is an adapter call with only one barcode region, @@ -767,7 +839,27 @@ individually, but not as a pair. Those unwanted barcode pairs are called hybrids in *lima*. ### How can I demultiplex IsoSeq data? -This feature will be added in the upcoming version! +Even if you only want to remove IsoSeq primers, *lima* is the tool of choice. + +1) Remove all duplicate sequences. +2) Annotate sequence names with a P5 or P3 suffix. Example: + +``` + >Forward_P5 + AAGCAGTGGTATCAACGCAGAGTACATGGGG + >SampleBrain_P3 + AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG + >SampleLiver_P3 + AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG +``` + +3) Use the `--isoseq` mode; cannot be combined with `--guess`. +4) Output will be only different pairs with a `P5` and `P3` combination: + +``` + demux.Forward_P5--SampleBrain_P3.bam + demux.Forward_P5--SampleLiver_P3.bam +``` ### Why do most of my ZMWs get filtered by the score lead threshold? The score lead measures how close the best barcode call is to the second best.