Skip to content

Commit

Permalink
Version 1.2.0
Browse files Browse the repository at this point in the history
Add isoseq and split bam streaming docs
  • Loading branch information
armintoepfer committed Feb 27, 2018
1 parent ce0ad16 commit 8b8ccdb
Showing 1 changed file with 119 additions and 27 deletions.
146 changes: 119 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,15 +36,15 @@ by PCR or ligation:
2. Barcoded universal primers
3. Barcoded adapters

<img src="img/barcoding-schemes.png" width="1000px">
<img src="img/barcoding-schemes.png" width="886px">

In addition, there are three different barcode library designs.
In order to describe a barcode library design, one can view it
from a SMRTbell or read perspective.
As *lima* supports raw subread and CCS read demultiplexing,
the following terminology is based on the per (sub-)read view.

<img src="img/barcode_overview.png" width="1000px">
<img src="img/barcode_overview.png" width="886px">

In the overview above, the input sequence is flanked by adapters on both sides.
The bases adjacent to an adapter are referred to as barcode regions.
Expand Down Expand Up @@ -81,6 +81,14 @@ The sort order is defined by the barcode indices, lowest first.
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
* Enhanced filtering options to remove ambiguous calls

## Changelog

* 1.2.0:
* Streaming of split BAM files
* New fat binary build approach
* 1.1.0: IsoSeq support
* 1.0.0: Initial release, included in SMRT Link 5.1.0

## Execution
**Attention: Existing output files will be overwritten!**

Expand Down Expand Up @@ -111,7 +119,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.
Input data is either raw unaligned subreads, straight from a Sequel, or
unaligned CCS reads, generated by [CCS 2](https://github.com/PacificBiosciences/unanimity);
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
use SMRT Link or bax2bam to convert h5 to BAM.
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.

Barcodes are provided as a FASTA file, one entry per barcode sequence,
**no duplicate** sequences, only upper-case bases,
Expand Down Expand Up @@ -542,54 +551,65 @@ Or any combination of those two:

#### Yield
Per-barcode base yield:
<img src="img/plots/detail_yield_base.png" width="1000px">
<img src="img/plots/detail_yield_base.png" width="886px">

Per-barcode read yield:
<img src="img/plots/detail_yield_read.png" width="1000px">

<img src="img/plots/detail_yield_read.png" width="886px">

Per-barcode ZMW yield:
<img src="img/plots/detail_yield_zmw.png" width="1000px">
<img src="img/plots/detail_yield_zmw.png" width="886px">

#### Scores
Score per number of adapters (lines) and all adapters (histogram).
[What are half adapters?](#what-are-half-adapters)
<img src="img/plots/detail_scores_per_adapter.png" width="1000px">

<img src="img/plots/detail_scores_per_adapter.png" width="886px">

Read length vs. mean score (99.9% percentile)
<img src="img/plots/detail_read_length_vs_score.png" width="1000px">
<img src="img/plots/detail_read_length_vs_score.png" width="886px">

HQ length vs. mean score (99.9% percentile)
<img src="img/plots/detail_hq_length_vs_score.png" width="1000px">
<img src="img/plots/detail_hq_length_vs_score.png" width="886px">

#### Read length (99.9% percentile, 1000 binwidth)
Grouped by barcode, same y-axis :
<img src="img/plots/detail_read_length_hist_group_same_y.png" width="1000px">

<img src="img/plots/detail_read_length_hist_group_same_y.png" width="886px">

Grouped by barcode, free y-axis:
<img src="img/plots/detail_read_length_hist_group_free_y.png" width="1000px">

<img src="img/plots/detail_read_length_hist_group_free_y.png" width="886px">

Not grouped into facets, line histogram:
<img src="img/plots/detail_read_length_linehist_nogroup.png" width="1000px">

<img src="img/plots/detail_read_length_linehist_nogroup.png" width="886px">

Barcoded vs. non-barcoded:
<img src="img/plots/detail_read_length_hist_barcoded_or_not.png" width="1000px">

<img src="img/plots/detail_read_length_hist_barcoded_or_not.png" width="886px">

#### HQ length (99.9% percentile, 2000 binwidth)
Grouped by barcode, same y-axis:
<img src="img/plots/detail_hq_length_hist_group_same_y.png" width="1000px">

<img src="img/plots/detail_hq_length_hist_group_same_y.png" width="886px">

Grouped by barcode, free y-axis:
<img src="img/plots/detail_hq_length_hist_group_free_y.png" width="1000px">

<img src="img/plots/detail_hq_length_hist_group_free_y.png" width="886px">

Not grouped into facets, line histogram:
<img src="img/plots/detail_hq_length_linehist_nogroup.png" width="1000px">

<img src="img/plots/detail_hq_length_linehist_nogroup.png" width="886px">

Barcoded vs. non-barcoded:
<img src="img/plots/detail_hq_length_hist_barcoded_or_not.png" width="1000px">

<img src="img/plots/detail_hq_length_hist_barcoded_or_not.png" width="886px">

#### Adapters (99.9% percentile, 1 binwidth)
Number of adapters:
<img src="img/plots/detail_num_adapters.png" width="1000px">

<img src="img/plots/detail_num_adapters.png" width="886px">

### `counts_detail.R`

Expand All @@ -598,10 +618,12 @@ This script cannot be called from the CLI yet, to be implemented.

#### ZMW Counts
Single plot:
<img src="img/plots/counts_nogroup.png" width="1000px">

<img src="img/plots/counts_nogroup.png" width="886px">

Grouped by barcode into facets:
<img src="img/plots/counts_group.png" width="1000px">

<img src="img/plots/counts_group.png" width="886px">

### `report_summary.R`
The third script is for high-plex data in one `lima.report` file:
Expand All @@ -612,19 +634,24 @@ Example:
$ Rscript --vanilla scripts/r/report_summary.R prefix.lima.report

Yield per barcode:

<img src="img/plots/summary_yield_zmw.png" width="750px">

Score distribution across all barcodes:

<img src="img/plots/summary_score_hist.png" width="750px">

Score distribution per barcode:
<img src="img/plots/summary_score_hist_2d.png" width="1000px">

<img src="img/plots/summary_score_hist_2d.png" width="886px">

Read length distribution per barcode:
<img src="img/plots/summary_read_length_hist_2d.png" width="1000px">

<img src="img/plots/summary_read_length_hist_2d.png" width="886px">

HQ length distribution per barcode:
<img src="img/plots/summary_hq_length_hist_2d.png" width="1000px">

<img src="img/plots/summary_hq_length_hist_2d.png" width="886px">

## Implementation details
### Smith-Waterman 101
Expand Down Expand Up @@ -707,16 +734,61 @@ The optional parameter `--split-bam-named`, names the files by their barcode nam
of their barcode indices. Non-word characters, anything except [A-Za-z0-9_],
in barcode names are replaced with an underscore in the file name.

This mode consumes significantly more memory, as output cannot be streamed.
This mode might consume more memory. Read the next FAQ entry for more information.

In addition, a `prefix.datastore.json` is generated to wrap the individual dataset
files.

### Why is the memory consumption really high?
Most likely this is due to `--split-bam` or `--split-bam-named` mode.
The latter is activated per default in SMRT Link.
Memory usage (RES column in top) is approximately the size of the input,
uncompressed.
*Lima* is able to stream up to 500 barcode pairs to individual split BAM files.
If more than 500 barcode pairs are detected, additional output is buffered first.
In this case, memory usage (RES column in top) is approximately the size of the
input, uncompressed.

The maximum concurrent output BAM file handles can be adjusted with
`--bam-handles N`. The default is 500.

Examples, how memory usage is affected by `--bam-handles`. Option
`--bam-handles-verbose` is only used to visualize the BAM output file handles.
Memory usage reported using [memusg](https://gist.github.com/netj/526585):

$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 9 --bam-handles-verbose
Open stream 7--7
Open stream 3--3
Open stream 5--5
Open stream 1--1
Open stream 4--4
Open stream 6--6
Open stream 0--0
Open stream 2--2
Open stream 210--210
memusg: peak=86,728

$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 4 --bam-handles-verbose
Open stream 7--7
Open stream 3--3
Open stream 5--5
Open stream 1--1
Buffered stream 0--0
Buffered stream 2--2
Buffered stream 4--4
Buffered stream 6--6
Buffered stream 210--210
memusg: peak=113,476

$ lima input.bam barcodes.fasta out.bam --same --split-bam --bam-handles 0 --bam-handles-verbose
Buffered stream 0--0
Buffered stream 1--1
Buffered stream 2--2
Buffered stream 3--3
Buffered stream 4--4
Buffered stream 5--5
Buffered stream 6--6
Buffered stream 7--7
Buffered stream 210--210
memusg: peak=132,276

### What are half adapters?
If there is an adapter call with only one barcode region,
Expand Down Expand Up @@ -767,7 +839,27 @@ individually, but not as a pair. Those unwanted barcode pairs are called
hybrids in *lima*.

### How can I demultiplex IsoSeq data?
This feature will be added in the upcoming version!
Even if you only want to remove IsoSeq primers, *lima* is the tool of choice.

1) Remove all duplicate sequences.
2) Annotate sequence names with a P5 or P3 suffix. Example:

```
>Forward_P5
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>SampleBrain_P3
AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
>SampleLiver_P3
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
```

3) Use the `--isoseq` mode; cannot be combined with `--guess`.
4) Output will be only different pairs with a `P5` and `P3` combination:

```
demux.Forward_P5--SampleBrain_P3.bam
demux.Forward_P5--SampleLiver_P3.bam
```

### Why do most of my ZMWs get filtered by the score lead threshold?
The score lead measures how close the best barcode call is to the second best.
Expand Down

0 comments on commit 8b8ccdb

Please sign in to comment.