Update cli examples and changelog for v4 (#53)
Update pigeon v1.1.0 docs

Add custom annotation docs

Co-authored-by: Jessica Mattick <[email protected]>
jmattick and Jessica Mattick authored Nov 1, 2023
1 parent 3a42cd4 commit 90890b1
Showing 12 changed files with 346 additions and 109 deletions.
4 changes: 3 additions & 1 deletion docs/changelog.md
@@ -7,7 +7,9 @@ nav_order: 99
# Version changelog
* **4.0.0**
* Rename `isoseq3` to `isoseq`
* Add new tool `cluster2`
* Add new tool `isoseq cluster2`
* Update `--max-5p-diff` default value for `isoseq collapse`
* Add `X` design option to `isoseq tag` to remove TSO sequences

* 3.8.2
* Update `groupdedup` to output consistent molecular IDs across runs
58 changes: 37 additions & 21 deletions docs/classification/isoseq-collapse.md
@@ -1,11 +1,11 @@
---
layout: default
parent: Classification
title: IsoSeq Collapse
title: Iso-Seq Collapse
nav_order: 2
---

# IsoSeq Collapse
# Iso-Seq Collapse

After transcript sequences are mapped to a reference genome, `isoseq collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.

@@ -15,37 +15,53 @@ After transcript sequences are mapped to a reference genome, `isoseq collapse` c

### Execution

Map reads using _pbmm2_ before collapsing
**Map reads using _pbmm2_ before collapsing**

```
pbmm2 align --preset ISOSEQ --sort <input.bam> <ref.fa> <mapped.bam>
```

Collapse mapped reads into unique isoforms using _isoseq collapse_.
**Collapse mapped reads into unique isoforms using _isoseq collapse_.**

For single-cell Iso-Seq:

```
isoseq collapse <mapped.bam> <collapse.gff>
```

Note: `collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.
For bulk Iso-Seq:
```
isoseq collapse --do-not-collapse-extra-5exons <mapped.bam> <flnc.bam> <collapsed.gff>
```
Notes:
- The optional `<flnc.bam>` input is required to get the correct FLNC counts for bulk Iso-Seq in the `flnc_count.txt` supplemental file.
- `collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk Iso-Seq.

### Ouptut
### Output

- `collapse.gff` contains the collapsed isoforms in gff format.
- `*.abundance.txt` contains information about the number of FLNC reads supporting each isoform and cell barcodes if applicable. Each unique isoform has the ID format PB.X.Y, while `count_fl` denotes the number of unique molecules (after UMI deduplication) supporting the isoform, and `fl_assoc` denotes the number of reads (before UMI deduplication) supporting it. `cell_barcodes` shows the list of single cell barcodes from which the reads came from, if applicable.
- `*.abundance.txt` contains information about the number of FLNC reads supporting each isoform and cell barcodes if applicable. Each unique isoform has the ID format PB.X.Y, while `count_fl` denotes the number of unique molecules (after UMI deduplication) supporting the isoform, and `fl_assoc` denotes the number of reads (before UMI deduplication) supporting it. `cell_barcodes` shows the list of single cell barcodes from which the reads came from, if applicable. This file should be used for downstream `pigeon` steps for Single-cell Iso-Seq.
```
pbid count_fl fl_assoc cell_barcodes
PB.1.1 2 2 ATCCATTCACCTCTGT,ATCGGCGCAGAGATGC
PB.2.1 1 1 CGGACACCATTGCCGG
PB.3.1 1 1 ACTTCGCGTCTAACTG
```
- `*.group.txt` shows the grouping of redundant isoforms (based on mapped exonic structures), where the read names `molecule/<number>` denote a unique molecule after UMI deduplication.
- `*.flnc_count.txt` contains information about the number of FLNC reads supporting each isoform before any clustering or deduplication. Each unique isoform has the ID format PB.X.Y and the FLNC counts will be separated by sample if multiple samples are present.
This file should be used for downstream `pigeon` steps for Bulk Iso-Seq.
```
id BioSample1 BioSample2
PB.1.1 2 2
PB.2.1 1 2
PB.3.1 1 1
```
- `*.group.txt` shows the grouping of redundant isoforms (based on mapped exonic structures), where the read names `molecule/<number>` denote a unique molecule after UMI deduplication and the read names `transcript/<number>` denote a clustered transcript.
```
PB.1.1 molecule/7343975,molecule/7738347
PB.2.1 molecule/14601188
PB.3.1 molecule/3998518
```
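The `*.abundance.txt` and `*.group.txt` layouts shown above can be loaded with a few lines of Python. This is a minimal sketch, not part of `isoseq`: the function names `parse_abundance` and `parse_groups` are hypothetical, and the code assumes whitespace-separated columns with comma-separated lists in the last column, as in the examples above. Skipping lines that start with `#` is a defensive assumption.

```python
from typing import Dict, List, NamedTuple


class Abundance(NamedTuple):
    count_fl: int             # unique molecules (after UMI deduplication)
    fl_assoc: int             # reads (before UMI deduplication)
    cell_barcodes: List[str]  # empty when no barcodes are reported


def parse_abundance(text: str) -> Dict[str, Abundance]:
    """Parse the contents of an *.abundance.txt file, keyed by PB.X.Y."""
    records: Dict[str, Abundance] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' lines (assumption) and the header row.
        if not fields or fields[0].startswith("#") or fields[0] == "pbid":
            continue
        barcodes = fields[3].split(",") if len(fields) > 3 else []
        records[fields[0]] = Abundance(int(fields[1]), int(fields[2]), barcodes)
    return records


def parse_groups(text: str) -> Dict[str, List[str]]:
    """Parse the contents of a *.group.txt file into isoform -> member names."""
    groups: Dict[str, List[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        groups[fields[0]] = fields[1].split(",")
    return groups


SAMPLE_ABUNDANCE = """pbid count_fl fl_assoc cell_barcodes
PB.1.1 2 2 ATCCATTCACCTCTGT,ATCGGCGCAGAGATGC
PB.2.1 1 1 CGGACACCATTGCCGG
"""

SAMPLE_GROUPS = """PB.1.1 molecule/7343975,molecule/7738347
PB.2.1 molecule/14601188
"""

abundance = parse_abundance(SAMPLE_ABUNDANCE)
groups = parse_groups(SAMPLE_GROUPS)
```

In this sample, the number of members listed for `PB.1.1` in the group file equals its `count_fl` value, which gives a quick consistency check between the two files for single-cell data.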
- `*.read_stat.txt` shows the assignment of each read (before UMI deduplication) to the final, unique isoforms PB.X.Y. Read names with the format `<movie>/<zmw>/ccs` indicate a CCS read, whereas `<movie>/<zmw>/ccs/<start>_<end>` further denotes a segment of a CCS read (S-read), likely as a result of segmentation (using, for example, [Skera](http://skera.how/)) of concatenated single cell libraries.
- `*.read_stat.txt` shows the assignment of each read (before UMI deduplication or clustering) to the final, unique isoforms PB.X.Y. Read names with the format `<movie>/<zmw>/ccs` indicate a CCS read, whereas `<movie>/<zmw>/ccs/<start>_<end>` further denotes a segment of a CCS read (S-read), likely as a result of segmentation (using, for example, [Skera](http://skera.how/)) of concatenated single cell libraries.
```
id pbid
m64012_220421_000242/120719489/ccs/10460_11196 PB.1.1
@@ -56,33 +72,33 @@ Note: `collapse` by default will collapse isoforms containing 5p degradation as
# Collapse FAQ
As of *isoseq3 v3.8.0* `collapse` has algorithmic updates.
These updates include performance improvements and updates to isoform collapse logic.
As of *isoseq3 v3.8.0* `collapse` has algorithmic updates.
These updates include performance improvements and updates to isoform collapse logic.
## What is new in *v3.8.0* and later?
### Collapsing extra 5p exons
For applications like single-cell IsoSeq where there is a higher percentage of 5p truncated isoforms,
it is useful to collapse isoforms that have a matching exon structure with the exception of extra 5p exons.
Previous versions of `collapse` did not merge isoforms with extra 5p exons.
For applications like single-cell Iso-Seq where there is a higher percentage of 5p truncated isoforms,
it is useful to collapse isoforms that have a matching exon structure with the exception of extra 5p exons.
Previous versions of `collapse` did not merge isoforms with extra 5p exons.
As of *v3.8.0*, `collapse` will merge these isoforms by default.
To not allow merging isoforms with extra 5p exons, use `--do-not-collapse-extra-5exons`.
This option is used in the bulk IsoSeq workflow.
To not allow merging isoforms with extra 5p exons, use `--do-not-collapse-extra-5exons`.
This option is used in the bulk Iso-Seq workflow.
<img src="../img/collapse-5p-exons.png" alt="collapse 5p exons" width="1000px"/>
### Flexible first/last exon differences
Previous versions of `collapse` used stringent maximum differences (5bp) for both internal junctions and external junctions.
As of *v3.8.0*, the maximum 5p and 3p differences have been increased and parameters added to allow adjustments.
Note: the maximum 5p difference only applies when `--do-not-collapse-extra-5exons` is set.
As of *v3.8.0*, the maximum 5p and 3p differences have been increased and parameters added to allow adjustments.
Note: the maximum 5p difference only applies when `--do-not-collapse-extra-5exons` is set.
New *v3.8.0* `collapse` maximum junction difference parameters:
Latest *v4.0.0* `collapse` maximum junction difference parameters:
```
--max-fuzzy-junction INT Ignore mismatches or indels shorter than or equal to N. [5]
--max-5p-diff INT Maximum allowed 5' difference if on same exon. [1000]
--max-5p-diff INT Maximum allowed 5' difference if on same exon. [50]
--max-3p-diff INT Maximum allowed 3' difference if on same exon. [100]
```
@@ -94,4 +110,4 @@ The legacy `collapse` logic can be recreated using the following parameters:
```
isoseq collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
```
```
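To make the effect of the three junction difference parameters concrete, here is a simplified Python sketch of how such thresholds could decide whether two plus-strand isoforms with the same number of exons are treated as one. It is an illustration only and not the `isoseq collapse` implementation: the function `can_collapse` and the exon tuples are hypothetical, extra 5p exon handling is left out, and only the default values (5, 50, 100) are taken from the v4.0.0 parameter list above.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), plus strand, sorted by position


def can_collapse(
    a: List[Exon],
    b: List[Exon],
    max_fuzzy_junction: int = 5,
    max_5p_diff: int = 50,
    max_3p_diff: int = 100,
) -> bool:
    """Return True if two isoforms fall within the given thresholds.

    Simplifying assumption: both isoforms must have the same exon count.
    """
    if len(a) != len(b):
        return False
    # Internal junctions, donor side: end of every exon except the last.
    for (_, a_end), (_, b_end) in zip(a[:-1], b[:-1]):
        if abs(a_end - b_end) > max_fuzzy_junction:
            return False
    # Internal junctions, acceptor side: start of every exon except the first.
    for (a_start, _), (b_start, _) in zip(a[1:], b[1:]):
        if abs(a_start - b_start) > max_fuzzy_junction:
            return False
    # External ends: 5' is the first exon start, 3' is the last exon end.
    if abs(a[0][0] - b[0][0]) > max_5p_diff:
        return False
    if abs(a[-1][1] - b[-1][1]) > max_3p_diff:
        return False
    return True


iso_a = [(1000, 1200), (2000, 2300)]
iso_b = [(1040, 1200), (2000, 2380)]  # 5' differs by 40 bp, 3' by 80 bp

with_defaults = can_collapse(iso_a, iso_b)
with_legacy = can_collapse(iso_a, iso_b, max_5p_diff=5, max_3p_diff=5)
```

Under these assumptions the two isoforms merge with the v4.0.0 defaults but stay separate with the legacy 5 bp limits, which mirrors the difference between the default invocation and the legacy command above.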
120 changes: 120 additions & 0 deletions docs/classification/pigeon-annotation.md
@@ -0,0 +1,120 @@
---
layout: default
parent: Classification
title: Pigeon Annotations
nav_order: 6
---

## How to create a pigeon‐compatible annotation GTF

Pigeon is designed to work with GTF files in the [Gencode annotation](https://www.gencodegenes.org/) format. Other GTF formats will need to be modified to work with `pigeon classify`.

The pigeon GTF format requirements are:

A tab-delimited, 9-column file (see [GFF/GTF File Format](https://useast.ensembl.org/info/website/upload/gff.html))

* Column 1 must be the chromosome
* Column 2 is ignored
* Column 3 will only be processed if it is gene, transcript, or exon. All other types (e.g. CDS) are ignored.
* Columns 4 and 5 are the 1-based start and end coordinates
* Columns 6 and 8 are ignored
* Column 7 is the strand, which must be + or -
* Column 9 is the attribute field, a semicolon-separated list of tag-value pairs. To be processed properly, the following tags must have values: gene_id, transcript_id, and gene_name. Example: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
* No extra blank lines at the beginning or end of the file
* Annotations must be organized with a "gene" record, followed by one or more associated "transcript" records, and each "transcript" record is followed by one or more associated "exon" records. Example:
```
gene
transcript_1
exon_1_1
exon_1_2
transcript_2
exon_2_1
exon_2_2
```
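The requirements above can be checked before running `pigeon classify` with a small script. The following Python sketch is a hypothetical helper (`check_gtf` is not part of pigeon) and encodes a few assumptions: lines starting with `#` are skipped, any blank line is reported, and `transcript_id` is only required on transcript and exon records, since the gene records in the Gencode and SIRV examples on this page carry no `transcript_id`.

```python
import re
from typing import Dict, List

PROCESSED_TYPES = {"gene", "transcript", "exon"}
# Matches tag-value pairs such as: gene_id "ENSG0001";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Return the quoted tag-value pairs of column 9 as a dict."""
    return {tag: value for tag, value in ATTR_RE.findall(field)}


def check_gtf(lines: List[str]) -> List[str]:
    """Return a list of human-readable problems found in the GTF lines."""
    errors: List[str] = []
    seen_gene = seen_transcript = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            errors.append(f"line {number}: blank line")
            continue
        if line.startswith("#"):  # assumption: header/comment lines are fine
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            errors.append(f"line {number}: expected 9 tab-separated columns")
            continue
        feature, start, end, strand = cols[2], cols[3], cols[4], cols[6]
        if feature not in PROCESSED_TYPES:
            continue  # other types (e.g. CDS) are ignored
        if not (start.isdigit() and end.isdigit() and 1 <= int(start) <= int(end)):
            errors.append(f"line {number}: invalid 1-based start/end")
        if strand not in {"+", "-"}:
            errors.append(f"line {number}: strand must be + or -")
        attrs = parse_attributes(cols[8])
        required = ["gene_id", "gene_name"]
        if feature != "gene":
            required.append("transcript_id")
        for tag in required:
            if not attrs.get(tag):
                errors.append(f"line {number}: missing value for {tag}")
        # Ordering: gene, then its transcripts, each followed by its exons.
        if feature == "gene":
            seen_gene, seen_transcript = True, False
        elif feature == "transcript":
            if not seen_gene:
                errors.append(f"line {number}: transcript before any gene record")
            seen_transcript = True
        elif not seen_transcript:
            errors.append(f"line {number}: exon before any transcript record")
    return errors
```

A typical use would be `check_gtf(open("annotation.gtf").read().splitlines())`, where the file name is a placeholder; an empty list means none of the checks in this sketch failed, not that pigeon is guaranteed to accept the file.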

## Example 1: Gencode annotation

Below is a snippet of a Gencode annotation as a reference:

```
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 ENSEMBL transcript 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1 ENSEMBL exon 17369 17436 . - . gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 29554 30039 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30564 30667 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1 HAVANA exon 30976 31097 . + . gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
```

## Example 2: modified non-model organism annotation for Pigeon

Here is an example of a pigeon-compatible annotation after it's been manually modified.

```
Pf3D7_13_v3 VEuPathDB gene 21364 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB transcript 21364 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 21364 26538 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 27474 28787 . + . gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB CDS 21364 26538 . + 0 Parent=PF3D7_1300100.1
Pf3D7_13_v3 VEuPathDB CDS 27474 28787 . + 0 Parent=PF3D7_1300100.1
Pf3D7_13_v3 VEuPathDB gene 30605 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB transcript 30605 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 30605 31597 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB exon 31828 31881 . - . gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3 VEuPathDB CDS 30605 31597 . - 0 Parent=PF3D7_1300200.1
Pf3D7_13_v3 VEuPathDB CDS 31828 31881 . - 0 Parent=PF3D7_1300200.1
```
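Instead of editing every record by hand, the attribute column can be generated with a short script. The Python sketch below uses two hypothetical helpers, `gtf_attributes` and `gtf_line`, to emit records in the style of the example above. Falling back to the gene ID when no gene name is available is an assumption, chosen because the example uses identical values for `gene_id` and `gene_name`.

```python
from typing import Optional


def gtf_attributes(
    gene_id: str,
    transcript_id: Optional[str] = None,
    gene_name: Optional[str] = None,
) -> str:
    """Build a column 9 string with the tags pigeon needs."""
    name = gene_name or gene_id  # assumption: reuse the ID when unnamed
    parts = [f'gene_id "{gene_id}";']
    if transcript_id is not None:
        parts.append(f'transcript_id "{transcript_id}";')
    parts.append(f'gene_name "{name}";')
    return " ".join(parts)


def gtf_line(
    chrom: str,
    source: str,
    feature: str,
    start: int,
    end: int,
    strand: str,
    attributes: str,
) -> str:
    """Join the nine GTF columns with tabs; columns 6 and 8 are set to '.'."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -")
    return "\t".join(
        [chrom, source, feature, str(start), str(end), ".", strand, ".", attributes]
    )


attrs = gtf_attributes("PF3D7_1300100", "PF3D7_1300100.1")
records = [
    gtf_line("Pf3D7_13_v3", "VEuPathDB", "gene", 21364, 28787, "+", attrs),
    gtf_line("Pf3D7_13_v3", "VEuPathDB", "transcript", 21364, 28787, "+", attrs),
    gtf_line("Pf3D7_13_v3", "VEuPathDB", "exon", 21364, 26538, "+", attrs),
    gtf_line("Pf3D7_13_v3", "VEuPathDB", "exon", 27474, 28787, "+", attrs),
]
```

Emitting the records in gene, transcript, exon order keeps the output consistent with the ordering requirement listed earlier on this page.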

## Example 3: SIRV control annotation

Here is an example of an SIRV control annotation compatible with pigeon.

```
SIRV1 LexogenSIRVData gene 1001 11643 . - 0 gene_name "SIRV1"; gene_id "SIRV1";
SIRV1 LexogenSIRVData transcript 1001 10786 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101";
SIRV1 LexogenSIRVData exon 1001 1484 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_0";
SIRV1 LexogenSIRVData exon 6338 6473 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_1";
SIRV1 LexogenSIRVData exon 6561 6813 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_2";
SIRV1 LexogenSIRVData exon 7553 7814 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_3";
SIRV1 LexogenSIRVData exon 10283 10366 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_4";
SIRV1 LexogenSIRVData exon 10445 10786 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_5";
SIRV1 LexogenSIRVData transcript 1007 10366 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102";
SIRV1 LexogenSIRVData exon 1007 1484 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_0";
SIRV1 LexogenSIRVData exon 6338 6813 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_1";
SIRV1 LexogenSIRVData exon 7553 7814 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_2";
SIRV1 LexogenSIRVData exon 10283 10366 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_3";
SIRV1 LexogenSIRVData transcript 1001 10791 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103";
SIRV1 LexogenSIRVData exon 1001 1484 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_0";
SIRV1 LexogenSIRVData exon 6338 6473 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_1";
SIRV1 LexogenSIRVData exon 6561 6813 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_2";
SIRV1 LexogenSIRVData exon 7553 7814 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_3";
SIRV1 LexogenSIRVData exon 10283 10366 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_4";
SIRV1 LexogenSIRVData exon 10648 10791 . - 0 gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_5";
```
9 changes: 7 additions & 2 deletions docs/classification/pigeon-changelog.md
@@ -7,9 +7,14 @@ nav_order: 99

# Pigeon version changelog

**1.0.0**
**1.1.0**
* Multi-sample support
* Add `prepare` tool
* Bug fixes

1.0.0
* Fix indexing in `make-seurat` gene matrix

0.1.2
* Improved filtering to `make-seurat`
