Update cli examples and changelog for v4 (#53)

Update pigeon v1.1.0 docs
Add custom annotation docs

Co-authored-by: Jessica Mattick
PacificBiosciences · Nov 1, 2023 · 90890b1
1 parent 3a42cd4

Showing 12 changed files with 346 additions and 109 deletions.
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -7,7 +7,9 @@ nav_order: 99
 # Version changelog
  * **4.0.0**
    * Rename `isoseq3` to `isoseq`
-   * Add new tool `cluster2`
+   * Add new tool `isoseq cluster2`
+   * Update `--max-5p-diff` default value for `isoseq collapse`
+   * Add `X` design option to `isoseq tag` to remove TSO sequences
 
  * 3.8.2
    * Update `groupdedup` to output consistent molecular IDs across runs

diff --git a/docs/classification/isoseq-collapse.md b/docs/classification/isoseq-collapse.md
@@ -1,11 +1,11 @@
 ---
 layout: default
 parent: Classification
-title: IsoSeq Collapse
+title: Iso-Seq Collapse
 nav_order: 2
 ---
 
-# IsoSeq Collapse
+# Iso-Seq Collapse
 
 After transcript sequences are mapped to a reference genome, `isoseq collapse` can be used to collapse redundant transcripts (based on exonic structures) into unique isoforms. Output consists of unique isoforms in GFF format and secondary files containing information about the number of reads supporting each unique isoform.
 
@@ -15,37 +15,53 @@ After transcript sequences are mapped to a reference genome, `isoseq collapse` c
 
 ### Execution
 
-Map reads using _pbmm2_ before collapsing
+**Map reads using _pbmm2_ before collapsing**
 
 ```
 pbmm2 align --preset ISOSEQ --sort <input.bam> <ref.fa> <mapped.bam>
 ```
 
-Collapse mapped reads into unique isoforms using _isoseq collapse_.
+**Collapse mapped reads into unique isoforms using _isoseq collapse_.**
+
+For single-cell Iso-Seq:
 
 ```
 isoseq collapse <mapped.bam> <collapse.gff>
 ```
 
-Note: `collapse` by default will collapse isoforms containing 5p degradation as of version `3.8.0`. To turn this off `--do-not-collapse-extra-5exons` should be used. This option is recommended for bulk IsoSeq.
+For bulk Iso-Seq:
+```
+isoseq collapse --do-not-collapse-extra-5exons <mapped.bam> <flnc.bam> <collapsed.gff>
+```
+Notes:
+  - The optional `<flnc.bam>` input is required to get the correct FLNC counts for bulk Iso-Seq in the `flnc_count.txt` supplemental file.
+  - As of version `3.8.0`, `collapse` collapses isoforms containing 5p degradation by default. To turn this off, use `--do-not-collapse-extra-5exons`. This option is recommended for bulk Iso-Seq.
 
-### Ouptut
+### Output
 
 - `collapse.gff` contains the collapsed isoforms in gff format.
-- `*.abundance.txt` contains information about the number of FLNC reads supporting each isoform and cell barcodes if applicable. Each unique isoform has the ID format PB.X.Y, while `count_fl` denotes the number of unique molecules (after UMI deduplication) supporting the isoform, and `fl_assoc` denotes the number of reads (before UMI deduplication) supporting it. `cell_barcodes` shows the list of single cell barcodes from which the reads came from, if applicable.
+- `*.abundance.txt` contains information about the number of FLNC reads supporting each isoform and cell barcodes if applicable. Each unique isoform has the ID format PB.X.Y, while `count_fl` denotes the number of unique molecules (after UMI deduplication) supporting the isoform, and `fl_assoc` denotes the number of reads (before UMI deduplication) supporting it. `cell_barcodes` lists the single-cell barcodes from which the reads came, if applicable. This file should be used for downstream `pigeon` steps for single-cell Iso-Seq.
     ```
     pbid	count_fl	fl_assoc	cell_barcodes
     PB.1.1	2	2	ATCCATTCACCTCTGT,ATCGGCGCAGAGATGC
     PB.2.1	1	1	CGGACACCATTGCCGG
     PB.3.1	1	1	ACTTCGCGTCTAACTG
     ```
-- `*.group.txt` shows the grouping of redundant isoforms (based on mapped exonic structures), where the read names `molecule/<number>` denote a unique molecule after UMI deduplication.
+- `*.flnc_count.txt` contains information about the number of FLNC reads supporting each isoform before any clustering or deduplication. Each unique isoform has the ID format PB.X.Y and the FLNC counts will be separated by sample if multiple samples are present.
+This file should be used for downstream `pigeon` steps for Bulk Iso-Seq.
+    ```
+    id	BioSample1	BioSample2
+    PB.1.1	2	2
+    PB.2.1	1	2
+    PB.3.1	1	1
+    ```
+- `*.group.txt` shows the grouping of redundant isoforms (based on mapped exonic structures), where the read names `molecule/<number>` denote a unique molecule after UMI deduplication and the read names `transcript/<number>` denote a clustered transcript.
     ```
     PB.1.1	molecule/7343975,molecule/7738347
     PB.2.1	molecule/14601188
     PB.3.1	molecule/3998518
     ```
-- `*.read_stat.txt` shows the assignment of each read (before UMI deduplication) to the final, unique isoforms PB.X.Y. Read names with the format `<movie>/<zmw>/ccs` indicate a CCS read, whereas `<movie>/<zmw>/ccs/<start>_<end>` further denotes a segment of a CCS read (S-read), likely as a result of segmentation (using, for example, [Skera](http://skera.how/)) of concatenated single cell libraries.
+- `*.read_stat.txt` shows the assignment of each read (before UMI deduplication or clustering) to the final, unique isoforms PB.X.Y. Read names with the format `<movie>/<zmw>/ccs` indicate a CCS read, whereas `<movie>/<zmw>/ccs/<start>_<end>` further denotes a segment of a CCS read (S-read), likely as a result of segmentation (using, for example, [Skera](http://skera.how/)) of concatenated single cell libraries.
     ```
     id	pbid
     m64012_220421_000242/120719489/ccs/10460_11196	PB.1.1
@@ -56,33 +72,33 @@ Note: `collapse` by default will collapse isoforms containing 5p degradation as
 
 # Collapse FAQ
 
-As of *isoseq3 v3.8.0* `collapse` has algorithmic updates. 
-These updates include performance improvements and updates to isoform collapse logic. 
+As of *isoseq3 v3.8.0* `collapse` has algorithmic updates.
+These updates include performance improvements and updates to isoform collapse logic.
 
 ## What is new in *v3.8.0* and later?
 
 ### Collapsing extra 5p exons
 
-For applications like single-cell IsoSeq where there is a higher percentage of 5p truncated isoforms, 
-it is useful to collapse isoforms that have a matching exon structure with the exception of extra 5p exons. 
-Previous versions of `collapse` did not merge isoforms with extra 5p exons. 
+For applications like single-cell Iso-Seq where there is a higher percentage of 5p truncated isoforms,
+it is useful to collapse isoforms that have a matching exon structure with the exception of extra 5p exons.
+Previous versions of `collapse` did not merge isoforms with extra 5p exons.
 As of *v3.8.0*, `collapse` will merge these isoforms by default.
-To not allow merging isoforms with extra 5p exons, use `--do-not-collapse-extra-5exons`. 
-This option is used in the bulk IsoSeq workflow. 
+To not allow merging isoforms with extra 5p exons, use `--do-not-collapse-extra-5exons`.
+This option is used in the bulk Iso-Seq workflow.
 
 <img src="../img/collapse-5p-exons.png" alt="collapse 5p exons" width="1000px"/>
 
 ### Flexible first/last exon differences
 
 Previous versions of `collapse` used stringent maximum differences (5bp) for both internal junctions and external junctions.
-As of *v3.8.0*, the maximum 5p and 3p differences have been increased and paramaters added to allow adjustments. 
-Note: the maximum 5p difference only applies when `--do-not-collapse-extra-5exons` is set. 
+As of *v3.8.0*, the maximum 5p and 3p differences have been increased and parameters added to allow adjustments.
+Note: the maximum 5p difference only applies when `--do-not-collapse-extra-5exons` is set.
 
-New *v3.8.0* `collapse` maximum junction difference parameters:
+Latest *v4.0.0* `collapse` maximum junction difference parameters:
 
 ```
   --max-fuzzy-junction            INT    Ignore mismatches or indels shorter than or equal to N. [5]
-  --max-5p-diff                   INT    Maximum allowed 5' difference if on same exon. [1000]
+  --max-5p-diff                   INT    Maximum allowed 5' difference if on same exon. [50]
   --max-3p-diff                   INT    Maximum allowed 3' difference if on same exon. [100]
 ```
 
@@ -94,4 +110,4 @@ The legacy `collapse` logic can be recreated using the following parameters:
 
 ```
 isoseq collapse --do-not-collapse-extra-5exons --max-5p-diff 5 --max-3p-diff 5 <mapped.bam> <collapsed.gff>
-```
+```
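
As an illustration of the `*.abundance.txt` layout documented in the `isoseq-collapse.md` changes above, here is a minimal Python sketch that reads the table and totals `count_fl`. It is not part of `isoseq` or `pigeon`; the function name `read_abundance` is hypothetical, and the sketch assumes the file is a plain tab-delimited table whose first line is the header shown in the example (real files may carry extra header lines that would need to be skipped).

```python
import csv
import io


def read_abundance(handle):
    """Yield (pbid, count_fl, fl_assoc, barcodes) tuples from a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # cell_barcodes is only filled for single-cell data; treat empty as no barcodes.
        raw = row.get("cell_barcodes") or ""
        barcodes = raw.split(",") if raw else []
        yield row["pbid"], int(row["count_fl"]), int(row["fl_assoc"]), barcodes


# Rows copied from the example table in the documentation above.
example = io.StringIO(
    "pbid\tcount_fl\tfl_assoc\tcell_barcodes\n"
    "PB.1.1\t2\t2\tATCCATTCACCTCTGT,ATCGGCGCAGAGATGC\n"
    "PB.2.1\t1\t1\tCGGACACCATTGCCGG\n"
    "PB.3.1\t1\t1\tACTTCGCGTCTAACTG\n"
)
records = list(read_abundance(example))

# Total unique molecules (after UMI deduplication) across all isoforms.
total_molecules = sum(count_fl for _, count_fl, _, _ in records)
print(total_molecules)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` example.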
diff --git a/docs/classification/pigeon-annotation.md b/docs/classification/pigeon-annotation.md
@@ -0,0 +1,120 @@
+---
+layout: default
+parent: Classification
+title: Pigeon Annotations
+nav_order: 6
+---
+
+## How to create a pigeon‐compatible annotation GTF
+
+Pigeon is designed to work with [Gencode annotation](https://www.gencodegenes.org/) GTF files. Other GTF formats will need to be modified to work with `pigeon classify`.
+
+The pigeon GTF format requirements are:
+
+The file must be a tab-delimited, 9-column file (see [GFF/GTF File Format](https://useast.ensembl.org/info/website/upload/gff.html)).
+
+* Column 1 must be the chromosome
+* Column 2 is ignored
+* Column 3 will only be processed if it is gene, transcript, or exon. All other types (e.g. CDS) are ignored.
+* Columns 4 & 5 are the 1-based start/end
+* Columns 6 & 8 are ignored
+* Column 7 is the strand which must be + or -
+* Column 9 is the attribute field, a semicolon-separated list of tag-value pairs. To be processed properly, the following tags must have values: gene_id, transcript_id, and gene_name. Ex: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
+* No extra blank lines at the beginning or end of the file
+* Annotations must be organized with a "gene" record, followed by one or more associated "transcript" records, and each "transcript" record is followed by one or more associated "exon" records. Example:
+```
+  gene
+  transcript_1
+    exon_1_1
+    exon_1_2
+  transcript_2
+    exon_2_1
+    exon_2_2
+```
+
+## Example 1: Gencode annotation
+
+Below is a snippet of a Gencode annotation as a reference:
+
+```
+chr1    ENSEMBL gene    17369   17436   .       -       .       gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR68
+59-1"; level 3;
+chr1    ENSEMBL transcript      17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "mi
+RNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "
+basic"; transcript_support_level "NA";
+chr1    ENSEMBL exon    17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; ge
+ne_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id
+ "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
+chr1    HAVANA  gene    29554   31109   .       +       .       gene_id "ENSG00000243485.3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP1
+1-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
+chr1    HAVANA  transcript      29554   31097   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "li
+ncRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; leve
+l 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT0000
+0002840.1";
+chr1    HAVANA  exon    29554   30039   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA";
+gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number
+1; exon_id "ENSE00001947070.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG0000000095
+9.2"; havana_transcript "OTTHUMT00000002840.1";
+chr1    HAVANA  exon    30564   30667   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA";
+gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number
+2; exon_id "ENSE00001922571.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG0000000095
+9.2"; havana_transcript "OTTHUMT00000002840.1";
+chr1    HAVANA  exon    30976   31097   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA";
+gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number
+3; exon_id "ENSE00001827679.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG0000000095
+9.2"; havana_transcript "OTTHUMT00000002840.1";
+```
+
+## Example 2: modified non-model organism annotation for Pigeon
+
+Here is an example of a pigeon-compatible annotation after it's been manually modified.
+
+```
+Pf3D7_13_v3     VEuPathDB       gene    21364   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "
+PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       transcript      21364   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gen
+e_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       exon    21364   26538   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "
+PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       exon    27474   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "
+PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       CDS     21364   26538   .       +       0       Parent=PF3D7_1300100.1
+Pf3D7_13_v3     VEuPathDB       CDS     27474   28787   .       +       0       Parent=PF3D7_1300100.1
+Pf3D7_13_v3     VEuPathDB       gene    30605   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "
+PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       transcript      30605   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gen
+e_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       exon    30605   31597   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "
+PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       exon    31828   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "
+PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
+Pf3D7_13_v3     VEuPathDB       CDS     30605   31597   .       -       0       Parent=PF3D7_1300200.1
+Pf3D7_13_v3     VEuPathDB       CDS     31828   31881   .       -       0       Parent=PF3D7_1300200.1
+```
+
+## Example 3: SIRV control annotation
+
+Here is an example of an SIRV control annotation compatible with pigeon.
+
+```
+SIRV1	LexogenSIRVData	gene	1001	11643	.	-	0	gene_name "SIRV1"; gene_id "SIRV1";
+SIRV1	LexogenSIRVData	transcript	1001	10786	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101";
+SIRV1	LexogenSIRVData	exon	1001	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_0";
+SIRV1	LexogenSIRVData	exon	6338	6473	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_1";
+SIRV1	LexogenSIRVData	exon	6561	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_2";
+SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_3";
+SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_4";
+SIRV1	LexogenSIRVData	exon	10445	10786	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV101"; exon_assignment "SIRV101_5";
+SIRV1	LexogenSIRVData	transcript	1007	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102";
+SIRV1	LexogenSIRVData	exon	1007	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_0";
+SIRV1	LexogenSIRVData	exon	6338	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_1";
+SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_2";
+SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV102"; exon_assignment "SIRV102_3";
+SIRV1	LexogenSIRVData	transcript	1001	10791	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103";
+SIRV1	LexogenSIRVData	exon	1001	1484	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_0";
+SIRV1	LexogenSIRVData	exon	6338	6473	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_1";
+SIRV1	LexogenSIRVData	exon	6561	6813	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_2";
+SIRV1	LexogenSIRVData	exon	7553	7814	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_3";
+SIRV1	LexogenSIRVData	exon	10283	10366	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_4";
+SIRV1	LexogenSIRVData	exon	10648	10791	.	-	0	gene_name "SIRV1"; gene_id "SIRV1"; transcript_id "SIRV103"; exon_assignment "SIRV103_5";
+```
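
The format requirements listed in `pigeon-annotation.md` above can be checked mechanically before running `pigeon classify`. Below is a minimal Python sketch of such a pre-check. It is not part of pigeon and does not reproduce pigeon's own parsing; `check_gtf_lines` and `parse_attributes` are hypothetical helper names. Two assumptions go beyond the text: lines starting with `#` are skipped as comments, and `transcript_id` is only required on `transcript` and `exon` records, because the gene records in the Gencode and SIRV examples do not carry one.

```python
import re

PROCESSED_TYPES = {"gene", "transcript", "exon"}
# Matches quoted tag-value pairs such as: gene_id "ENSG0001";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(field):
    """Return {tag: value} for the quoted tag-value pairs in column 9."""
    return dict(ATTR_RE.findall(field))


def check_gtf_lines(lines):
    """Return a list of (line_number, message) problems; empty if none were found."""
    problems = []
    lines = [ln.rstrip("\n") for ln in lines]
    if lines and (not lines[0].strip() or not lines[-1].strip()):
        problems.append((0, "blank line at the beginning or end of the file"))
    seen_gene = seen_transcript = False
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # assumption: comment lines are tolerated
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append((n, f"expected 9 tab-separated columns, got {len(cols)}"))
            continue
        feature = cols[2]
        if feature not in PROCESSED_TYPES:
            continue  # other types (e.g. CDS) are ignored
        start, end = cols[3], cols[4]
        if not (start.isdigit() and end.isdigit() and 1 <= int(start) <= int(end)):
            problems.append((n, "start/end must be 1-based integers with start <= end"))
        if cols[6] not in ("+", "-"):
            problems.append((n, "strand must be + or -"))
        attrs = parse_attributes(cols[8])
        required = ["gene_id", "gene_name"]
        if feature != "gene":
            required.append("transcript_id")  # assumption, see lead-in
        for tag in required:
            if not attrs.get(tag):
                problems.append((n, f"missing value for {tag}"))
        # Ordering: gene, then its transcripts, each followed by its exons.
        if feature == "gene":
            seen_gene, seen_transcript = True, False
        elif feature == "transcript":
            if not seen_gene:
                problems.append((n, "transcript record appears before any gene record"))
            seen_transcript = True
        elif not seen_transcript:
            problems.append((n, "exon record appears before any transcript record"))
    return problems
```

A possible usage is `check_gtf_lines(open("annotation.gtf"))`, where `annotation.gtf` is a placeholder path; an empty list means none of the checks above failed, which does not guarantee that pigeon will accept the file.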
diff --git a/docs/classification/pigeon-changelog.md b/docs/classification/pigeon-changelog.md
@@ -7,9 +7,14 @@ nav_order: 99
 
 # Pigeon version changelog
 
-**1.0.0**
+**1.1.0**
+   * Multi-sample support
+   * Add `prepare` tool
+   * Bug fixes
+
+1.0.0
    * Fix indexing in `make-seurat` gene matrix
-   
+
 0.1.2
    * Improved filtering to `make-seurat`