Skip to content

Commit

Permalink
Merge pull request #38 from umranyaman/develop
Browse files Browse the repository at this point in the history
Update README.md
  • Loading branch information
cmatKhan authored Sep 1, 2023
2 parents df0de8f + 4c947b1 commit d52c8f2
Showing 1 changed file with 39 additions and 33 deletions.
72 changes: 39 additions & 33 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

# Isocomp: comparing high-quality IsoSeq3 isoforms between samples

![](images/logo.png)
![ISOCOMP](https://i.ibb.co/vHLhrZq/Isocomp-logo1.png)

## Contributors
1. Yutong Qiu (Carnegie Mellon)
Expand All @@ -11,15 +11,20 @@
4. Rupesh Kesharwani (Baylor College of Medicine)
5. Bida Gu (University of Southern California)
6. Muhammad Sohail Raza (Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation)
6. Evan Biederstedt (HMS)

7. Evan Biederstedt (HMS)
8. Umran Yaman (UK Dementia Research Institute, University College London)
9. Abdullah Al Nahid (Shahjalal University of Science and Technology)
10. Trinh Tat (Houston Methodist Research Institute)
11. Sejal Modha (Theolytics Limited)
12. Jędrzej Kubica (University of Warsaw)

## Github Codespace for Development

To use codespaces for development purposes, do the following:

1. fork the repo
2. switch to the 'develop' branch
- **NOTE**: if you are planning to code/add a feature, create a branch from the 'develop' branch. Switch to it, and then continue on with the steps below.
- **NOTE**: if you plan to code/add a feature, create a branch from the 'develop' branch. Switch to it, and then continue on with the steps below.
4. click the green 'code' button. **But**, rather than copying the https or ssh link, click the tab that says "Codespace"
5. click the button that says "create codespace on develop". Go make some tea -- it takes ~5 minutes or so to set up the environment. But, once it is set up, you
will have a fully functioning vscode environment with all the dependencies installed. Start running the tests, set some breakpoints, take a look around!
Expand All @@ -28,67 +33,68 @@ To use codespaces for development purposes, do the following:
https://github.com/collaborativebioinformatics/isocomp/blob/main/FinalPresentation_BCM_Hackathon_12Oct2022.pdf

## Introduction
NGS targeted sequencing and WES have become routine for clinical diagnosis of Mendelian disease [CITATION]. Family sequencing (or "trio sequencing") involves sequencing a patient and parents (trio) or other relatives. This improves the diagnostic potential via the interpretation of germline mutations and enables detection of de novo mutations which underlie most Mendelian disorders.

Transcriptomic profiling has been gaining used over the past several decades. However, this endeavor has been hampered by short-read sequencing, especially for inferring alternative splicing, allelic imbalance, and isoform variation.
NGS-targeted sequencing and WES have become routine for diagnosing Mendelian disease (Xue et al., 2015). Family sequencing (or "trio sequencing") involves sequencing a patient and parents (trio) or other relatives. This improves the diagnostic potential via the interpretation of germline mutations and enables the detection of de novo mutations which underlie most Mendelian disorders.

The promise of long-read sequencing has been to overcome the inherent uncertainties of short-reads.
While transcriptomic profiling has gained traction over the past few decades, its progress has been hindered by short-read sequencing, particularly in tasks such as inferring alternative splicing, allelic imbalance, and isoform variation.

Something something Isoseq3: https://www.pacb.com/products-and-services/applications/rna-sequencing/
The potential of long-read sequencing lies in its ability to overcome the inherent limitations of short-reads. Tools like Isoseq3 [link: https://www.pacb.com/products-and-services/applications/rna-sequencing/] offer high-quality, polished and assembled full-length isoforms. This advancement allows us to identify alternatively spliced isoforms and detect gene fusions.

Provides high-quality, polished, assembled full isoforms. With this, we will be able to identify alternatively-spliced isoforms and detect gene fusions.
Provides high-quality, polished, assembled full isoforms. With this, we will be able to identify alternatively spliced isoforms and detect gene fusions.

Since the advent of HiFi reads, the error rates have plummeted.
With the introduction of HiFi error rates have significantly decreased.

The goal of this project will be to extend the utility of long-read RNAseq for investigating Mendelian diseases between multiple samples.
This project aims to expand the applicability of long-read RNAseq for investigating Mendelian diseases across multiple samples.

And what about gene fusions? We detect these in the stupidest possible way with short-read sequencing, and we think they're cancer-specific. What about the germline?
-And what about gene fusions? We detect these in the stupidest possible way with short-read sequencing, and we think they're cancer-specific. What about the germline?


## Goals

Given high-quality assembled isoforms from 2-3 samples, we want to algorithmically (definitively) characterize the "unique" (i.e. differing) isoforms between samples.
The goal of this project is to algorithmically characterize the "unique" (differing) isoforms between 2-3 samples using high-quality assembled isoforms.

## Methods

### Methods overview
[Isoform set comparison problem] Given two sets of isoform sam, find two subsets so that each subset of isoform is unique to each sample.
### Overview of Methods

The core challenge, referred to as the "Isoform set comparison problem," involves identifying distinct isoforms between two sets of samples.

To solve this problem, the naive solution is to perform sequence match between two sets of isoforms. However, this method is time consuming due to the size of the isoform sets in human genome (give an example number (way larger than 10,000, e.g.)).
A direct approach to solving this problem is through sequence matching between the sets of isoforms. However, this becomes time-consuming given the size of isoform sets in the human genome (consider a number significantly larger than 10,000, for instance).

We note that it is unnecessary to perform all-against-all alignment between complete sets of isoforms. In fact, we only need to compare the isoforms that are aligned to the same genomic region. We extract region windows from the genome that contain at least one isoform from any sample, then, we divide the set of isoforms into smaller subsets by their origin in the extracted genomic regions.
We recognize that an all-against-all alignment across complete isoform sets isn't necessary. Instead, the focus is on comparing isoforms aligned to the same genomic regions. Genomic windows containing at least one isoform from any sample are extracted. The isoform sets are then subdivided into smaller subsets based on their origin in these extracted regions.

For each pair of samples that we are comparing, we perform intersection between two subsets of isoforms within each genomic window and identify isoforms that are shared by both samples and unique to each sample.
For each pair of samples under comparison, intersections are made between subsets of isoforms within each genomic window. This process identifies isoforms shared by both samples and isoforms unique to each sample.

For each unique isoform S from sample A, we further investigate the differences between S and other isoforms from sample B within the same genomic window.
For each unique isoform S from sample A, a deeper analysis is conducted on the differences between S and other isoforms from sample B within the same genomic window.

### Aligning isoforms to the reference genome

For every sample, we first prepend sample name to FASTA sequences in final corrected FASTA generated by SQANTI to make sure that all sequence names are unique, then align the renamed FASTA to the human Telomere-to-Telomere genome assembly of the CHM13 cell line (T2T-CHM13v2.0; RefSeq - GCF_009914755.1) using minimap2 (v2.24-r1122). The resulting SAM file is converted to BAM and sorted by samtools (v1.15.1; Danecek et al, 2021).
Dividing isoforms into subsets
For each individual sample, we initially prefix the sample name to the FASTA sequences in the finalized corrected FASTA output from SQANTI. This step ensures the uniqueness of all sequence names. Subsequently, we employ the minimap2 aligner (v2.24-r1122) to align the renamed FASTA sequences against the human Telomere-to-Telomere genome assembly of the CHM13 cell line (T2T-CHM13v2.0; RefSeq - GCF_009914755.1). The resultant alignment is presented in a SAM file, which is then converted into BAM format and sorted using samtools (v1.15.1; Danecek et al, 2021).

We extract regions from the CHM13v2.0 genome that overlap with at least one isoform from any sample. We first obtain the average coverage of isoform per base by using samtools mpileup (citation, version). We next extract the 20,042 annotated protein coding gene regions from the reference genome and take the union of overlapping regions to create windows. Finally, windows were filtered to those which contained a per-base coverage greater than 0.05, which reduced our final set of windows to 11936.
### Segmentation of Isoforms into Subsets

Regions from the CHM13v2.0 genome that intersect with at least one isoform from any given sample are extracted. We initially determine the average coverage of isoforms per base using samtools mpileup (version?, Danecek et al, 2021). Subsequently, we identify and extract the 20,042 annotated protein-coding gene regions from the reference genome. To create windows, we merge these regions where overlaps occur. Further refinement is applied by filtering windows to those displaying per-base coverage greater than 0.05, resulting in a final set of 11,936 windows.

In addition to the annotated gene regions, each sample contains more than 100,000 isoforms (Table 1) of isoforms that aligned to intron region. These isoforms are usually regarded as novel and may be important to the phenotypes of interest. Therefore, in addition to known gene regions, we divide the genome into 100 bp windows and retain the ones that has per-base coverage higher than 0.05.
Apart from the annotated gene regions, each sample encompasses over 100,000 isoforms (Table 1) that align with intron regions. These isoforms, often considered novel, hold potential relevance to the observed phenotypes. To account for these, we divide the genome into 100-base-pair windows and retain those exhibiting per-base coverage exceeding 0.05.

After that, we merge the gene windows and 100bp windows to obtain a complete set of windows that overlap with any isoform.
Following this, the gene-related windows and the 100-base-pair windows are merged to form a comprehensive set of windows aligning with any isoform.

### Intersecting subsets of isoforms

For each isoform S in the subset of sample A, we perform exact string matching with all isoforms in the subset of sample B. If no isoform in sample B in the same genomic window matches S exactly, we say that S is unique.
For every isoform S within the subset of sample A, we conduct precise string matching against all isoforms within the subset of sample B. If no isoform in sample B, within the same genomic window, precisely matches S, we classify S as unique.

### Comparing unique isoforms with other isoforms
For each unique isoform U, we perform Needleman-Wunch alignment between U and other isoforms within the same genomic window. We measure similarities between isoforms by the percentage of matched bases in U.
Annotating the differences between unique isoforms and the other sequences

We categorize differences between isoforms into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusion, different exon usage and completely novel. Similar categories was used by SQANTI in annotating differences between sample isoforms and reference transcriptome. Note that we extend the categories by SQANTI by adding SNPs and large-scale variants.
For each unique isoform U, we employ the Needleman-Wunch alignment method to compare U with other isoforms within the identical genomic window. The comparison is quantified through the percentage of matched bases in U.

### Annotating the differences between unique isoforms and the other sequences

Differences among isoforms are categorized into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusions, distinct exon utilization, and entirely novel sequences. These categories build upon those used by SQANTI to annotate disparities between sample isoforms and the reference transcriptome. Notably, we extend SQANTI's categories by incorporating SNPs and large-scale variants.

### Iso-Seq analysis

Isoseq3 (v3.2.2) generated HQ (Full-length high quality) transcripts [Table 1] were mapped to GRCh38 (v33 p13) using Minimap2 long read alignment tools [1] (v2.24-r1122; commands: minimap2 -t 8 -ax splice:hq -uf --secondary=no -C5 -O6,24 -B4 GRCh38.v33p13.primary_assembly.fa sample.polished.hq.fastq.gz). The table 2 shows the basic statistics of the alignment of each sample [NA24385 /HG002, NA24143/HG004 and NA24631/HG005]. Next, we performed cDNA_cupcake [https://github.com/Magdoll/cDNA_Cupcake] workflow to collapse the redundant isoforms from bam, followed by filtering the low counts isoforms by 10 and filter away 5' degraded isoforms that might not be biologically significant. Next, sqanti3 [2] tool was used to generate final corrected fasta [Table 3a] transcripts and gtf [Table 3b] along with the isoform classification reports. The external databases including reference data set of transcription start sites (refTSS), list of polyA motif, tappAS-annotation and genecode hg38 annotation were utilized. Finally, IsoAnnotLite (v2.7.3) analysis was performed to annotate the gtf file from sqanti3.

We categorize differences between isoforms into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusion, different exon usage and completely novel. Similar categories was used by SQANTI in annotating differences between sample isoforms and reference transcriptome. Note that we extend the categories by SQANTI by adding SNPs and large-scale variants.
Isoseq3 (v3.2.2) generated HQ (Full-length high quality) transcripts [Table 1] were mapped to GRCh38 (v33 p13) using Minimap2 long-read alignment tools [1] (v2.24-r1122; commands: minimap2 -t 8 -ax splice:hq -uf --secondary=no -C5 -O6,24 -B4 GRCh38.v33p13.primary_assembly.fa sample.polished.hq.fastq.gz). Basic statistics of the alignment for each sample [NA24385 /HG002, NA24143/HG004, and NA24631/HG005] are provided in Table 2. The cDNA_cupcake workflow [https://github.com/Magdoll/cDNA_Cupcake] was then executed to collapse redundant isoforms from the BAM file. Low-count isoforms (<10) were filtered out, as well as 5' degraded isoforms that might lack biological significance. Subsequently, SQANTI3 [2] was employed to generate the final corrected fasta [Table 3a] transcripts and GTF [Table 3b] files, along with isoform classification reports. External databases, including the reference data set of transcription start sites (refTSS), a list of polyA motifs, tappAS-annotation, and Genecode hg38 annotation, were utilized. Finally, IsoAnnotLite (v2.7.3) analysis was conducted to annotate the GTF file from SQANTI3.

Differences between isoforms are categorized into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusion, different exon usage, and completely novel sequences. These categories build upon those used by SQANTI to annotate disparities between sample isoforms and the reference transcriptome. Note that we extend the categories provided by SQANTI by adding SNPs and large-scale variants.

## Description

Expand All @@ -101,7 +107,7 @@ We categorize differences between isoforms into [TODO] SNPs (<5bp), large-scale

## Example Output

For each isoforom that is unique to at least one sample, we output the information of the read and the similarity between that isoform and the most similiar isoform in the same window.
For each isoform that is unique to at least one sample, we provide information about the read and the similarity between that isoform and the most similar isoform within the same window.

The last column describes the normalized edit distance and the CIGAR string.

Expand Down

0 comments on commit d52c8f2

Please sign in to comment.