Skip to content

Latest commit

 

History

History
executable file
·
615 lines (452 loc) · 42.2 KB

README.md

File metadata and controls

executable file
·
615 lines (452 loc) · 42.2 KB

Bambu

bambu: Context-Aware Transcript Quantification from Long Read RNA-Seq data

Lifecycle: maturing Maintained? Install R build status
GitHub issues GitHub pulls BioC status BioC dev status License: GPL v3 CodeFactor codecov

bambu is a R package for multi-sample transcript discovery and quantification using long read RNA-Seq data. You can use bambu after read alignment to obtain expression estimates for known and novel transcripts and genes. The output from bambu can directly be used for visualization and downstream analysis such as differential gene expression or transcript usage.

Content

Installation

bambu is available through GitHub and Bioconductor

Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("bambu")

GitHub:

library(devtools)
install_github("GoekeLab/bambu")
library(bambu)

We can test if bambu is installed correctly and runs correctly by using a small test set that comes with the package.

test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")
fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu")

gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file)

General Usage

The default mode to run bambu is using a set of aligned reads (bam files), reference genome annotations (gtf file, TxDb object, or bambuAnnotation object that can be obtained from prepareAnnotations() function), and reference genome sequence (fasta file or BSgenome). bambu will return a summarizedExperiment object with the genomic coordinates for annotated and new transcripts and transcript expression estimates. If you do not have any data yet or would like to test bambu with a full data-set that has been proven to work (the test data-set that comes with the package is too small to do proper analysis on), we recommend the SG-NEx data project. You can find this data and instructions on how to install it at here. We highly recommend using the same annotations that were used for genome alignment. If you have a gtf file and fasta file you can run bambu with the following options:

se <- bambu(reads = sample, annotations = annotations, genome = fa.file)

reads - is a path to one or more bam files aligned to the same genome used in the genome argument, or a path to intermediate read class files (see Storing and using preprocessed files (rcFiles))

genome - is a path to the genome fasta file. This should be the same file used for read alignment. Bambu does not support alignment to the transcriptome as it requires the splice junctions from a genome alignment for transcript discovery.

annotations - takes a path to a gtf file, a txdb object, or annotations prepared by prepareAnnotations() (see Use precalculated annotation objects). When not provided, de novo transcript discovery is performed (see De-novo transcript discovery)

NDR - Novel Discovery Rate threshold. A value between 0 and 1 representing the maximum NDR threshold for candidate transcripts to be included in the analysis. By default, bambu will recommend a threshold for your analysis. For more information see Modulating the sensitivity of discovery (pre and post analysis).

For the full parameter list see Arguments

For information on the output and how to export it to a file see Output.

Transcript discovery only (no quantification)

If you are only interested in identifying novel transcripts, the quantification module of bambu can be skipped by setting quant to FALSE. Note that the output will be a GRangeslist object containing the reference and novel annotations (See rowRanges() in Output). We recommend running transcript discovery only mode with NDR = 1, and doing filtering in the downstream analysis to allow flexibility in the analysis. See Modulating the sensitivity of discovery (pre and post analysis)

se.discoveryOnly <- bambu(reads = test.bam, annotations = gtf.file, genome = fa.file, quant = FALSE)

Quantification of annotated transcripts and genes only (no transcript/gene discovery)

If you are only interested in quantifying transcripts, the discovery module of bambu can be skipped by setting discovery to FALSE.

se.quantOnly <- bambu(reads = test.bam, annotations = gtf.file, genome = fa.file, discovery = FALSE)

Using precalculated annotation objects

Depending on the size of your reference annotations the prepareAnnotations() step may take a few minutes. You can also use precalculated annotations and if you plan to run bambu more frequently with the same annotations, we recommend to save the bambuAnnotations object. The bambuAnnotation object can be calculated from:

a) a .gtf file:

annotations <- prepareAnnotations(gtf.file)

b) a TxDb object

annotations <- prepareAnnotations(txdb)

Save the object

saveRDS(annotations, ”/path/to/annotations.rds” )

This object can then be used instead of a path to your reference annotations to the annotations argument.

annotations <- readRDS("/path/to/annotations.rds")
bambu(reads <- test.bam, annotations = annotations, genome = fa.file)

Running multiple samples

If you have multiple replicates for a sample, or plan to do comparative analysis between conditions, it may be beneficial to run all samples together instead of individually. This can be done by providing a vector of paths to all the bam files you want to analyze together.

se.multiSample <- bambu(reads = c(test1.bam, test2.bam, test3.bam), annotations = gtf.file, genome = fa.file)

The advantage of running samples together include: Novel transcripts that are identified in multiple samples are assigned unified IDs, enabling comparative analysis between different samples. This is especially important for downstream differential expression analysis when looking at novel transcripts. Running multiple samples can be multithreaded (see ncore). While running multiple samples, By default, bambu will train a model separately on each sample and score novel transcripts in each sample separately.

If you need to combine samples in multiple configurations (for example different pairwise comparisons) we would recommend using the intermediate rcFiles to save processing time (see Storing and using preprocessed files (rcFiles))

Modulating the sensitivity of discovery (pre and post analysis)

When doing transcript discovery there is a balance between sensitivity (the number of real novel transcripts that are detected) and the precision (how many of the novel transcripts are real). To control this balance, bambu uses the novel discovery rate (NDR) as the main parameter. The NDR threshold approximates the proportion of novel candidates output by bambu, relative to the number of known transcripts it found, i.e., an NDR of 0.1 would mean that 10% of all transcripts passing the threshold are classified as novel.

If you are using a genome where you expect a high amount of novel transcripts, a higher NDR is recommended so that these novel transcripts are not missed. Conversely if you are using a well annotated genome, we recommend a lower NDR threshold to reduce the presence of false positives. By default the NDR threshold is automatically chosen for the user based on predicted level of annotation completeness when compared to the default model trained on human reference annotations (Hg38). For more information see Training a model on another species/dataset and applying it

To manually select an NDR value, use the NDR argument in bambu:

se.NDR_0.3 <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, NDR = 0.3)

Alternatively transcript discovery can be run without thresholds, producing a GRangesList annotation object with all transcripts scored with its NDR score. Note that this means turning quant = FALSE in running bambu (refer to “Transcript discovery only” section). The annotations can be filtered by their NDR score (see example below), read count and gene read proportion between the discovery and quantification steps or used for other types of analysis.

newAnnotations <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, NDR = 1, quant = FALSE)
annotations.filtered <- newAnnotations[(!is.na(mcols(newAnnotations)$NDR) & mcols(newAnnotations)$NDR<0.1) | is.na(mcols(newAnnotations)$NDR)]
se.NDR_1 <- bambu(reads = test.bam, annotations = annotations.filtered, genome = fa.file, NDR = 1, discovery = FALSE)

Additionally there are other thresholds that advanced users can access through opt.discovery when running bambu (see arguments).

Output

bambu returns a SummarizedExperiment object which can be accessed as follows:

  • assays(se) returns a list of transcript abundance estimates as counts or CPM
  • rowRanges(se) returns a GRangesList with all annotated and newly discovered transcripts
  • rowData(se) returns additional information about each transcript

Access transcript expression estimates by extracting a variable (such as counts or CPM) using assays():

  • assays(se)$counts - expression estimates
  • assays(se)$CPM - sequencing depth normalized estimates
  • assays(se)$fullLengthCounts - estimates of read counts mapped as full length reads for each transcript
  • assays(se)$uniqueCounts - counts of reads that are uniquely mapped to each transcript

For a full description of the other outputs see Output Description

The full output can be written to a file using writeBambuOutput(). Using this function will generate three files, including a .gtf file for the extended annotations, and two .txt files for the expression counts at transcript and gene levels.

writeBambuOutput(se, path = "./bambu/")

If you are only interested in the novel transcripts, one can filter this 'se' object first to remove reference annotations.

se.novel = se[mcols(se)$novelTranscript,]
writeBambuOutput(se.novel, path = "./bambu/")

If you are only interested in full-length transcripts that were detected by Bambu.

se.novel = se[assays(se)$fullLengthCounts >= 1,]
writeBambuOutput(se.novel, path = "./bambu/")

If quant is set to FALSE i.e. only transcript discovery is performed, only the rowRanges output of the extended annotations is returned (a GRangesList object). The equivalent rowData can be accessed with mcols() These annotations can be written to a .gtf file using writeToGTF(GRangesList_object, output_path).

se.discoveryOnly <- bambu(reads = sample, annotations = annotations, genome = fa.file, quant = FALSE)
writeToGTF(se.discoveryOnly, "./output.gtf")

As above, to output only the novel annotations, you need to filter out the reference annotations.

se.discoveryOnly.novel = se.discoveryOnly[mcols(se.discoveryOnly)$novelTranscript,]
writeToGTF(se.discoveryOnly.novel, "./output.gtf")

If you are only interested in full-length transcripts that were detected by Bambu. If multiple transcripts share exon-junctions, only one will be displayed. To avoid this, do the filter after quantification as in the example above.

se.novel = se[!is.na(mcols(se)$readCount) & mcols(se)$readCount >= 1,]
writeBambuOutput(se.novel, path = "./bambu/")

If both quant and discovery are set to FALSE, bambu will return an intermediate object see Storing and using preprocessed files (rcFiles)

Visualization

You can visualize the novel genes/transcripts using plotBambu function. (Note that the visualization was done by running bambu on the three replicates of HepG2 cell line in the SG-NEx project)

plotBambu(se, type = "annotation", gene_id = "ENSG00000107104")

plotGene

plotBambu(se, type = "annotation", transcript_id = "tx.9")

plotTranscript

plotBambu can also be used to visualize the clustering of input samples on gene/transcript expressions. Only for multiple samples’ visualisation. See Running multiple samples

plotBambu(se, type = "heatmap") # heatmap 

plotHeapmap

plotBambu(se, type = "pca") # PCA visualization

plotPCA

plotBambu can also be used to visualize the clustering of input samples on gene/transcript expressions with grouping variable

plotBambu(se, type = "heatmap", group.var) # heatmap 

plotBambu(se, type = "pca", group.var) # PCA visualization

Bambu Advanced Options

Below we include several advanced options and use-cases for bambu. We recommend reading and understanding the paper before attempting to use these features.

Using a pretrained model

Bambu requires at least 1000 transcripts from the annotations to be detected in a sample in order to train a sample specific model. In use cases where this is not possible bambu will instead use a default pretrained model to calculate the transcript probability score (TPS) for each read class. Users can force this behavior if they believe their annotations are not sufficient for sample specific training (for example if they suspect a high proportion of real novel transcripts are present in their sample). This is advantageous when you want NDR calibration without the impacts of a model trained using low quality annotations.

se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, opt.discovery = list(fitReadClassModel = FALSE))

The default pretrained model was trained on SGNex_HepG2_directRNA_replicate5_run1 and has the following characteristics:

Genome: Homo_sapiens.GRCh38.dna_sm.primary_assembly
Annotations: Homo_sapiens.GRCh38.91
Read count: 7,861,846
Technology: Nanopore (ONT)
Library preparation: directRNA
Base Calling Accuracy: 79%
Average Read Length: 1093

We have found the pretrained model works successfully across species borders (on Arabidopsis thaliana) and on different technologies (PacBio), with only small decreases in performance compared to using a sample specific model. The pretrained model is not always effective in samples with large differences in sequencing quality or if the library preparation results in biases in the overall structure of the transcriptome. In this case, we would recommend training a new model using similar data from a different sample that has quality reference annotations (See Training a model on another species/dataset and applying it).

De-novo transcript discovery

In cases where the organism does not yet have reference annotations, or unreliable annotations, bambu can be run in de-novo mode. In de-novo mode, bambu does not train a model, and instead uses the pretrained model to classify novel transcripts (see Using a pretrained model. To learn how to train a new model for a more closely related organism/sample see Training a model on another species/dataset and applying it. Without annotations bambu is unable to calibrate the NDR output, nor be able to recommend a threshold and will instead use the TPS as the thresholded value. Therefore you should supply a manual NDR threshold (Modulating the sensitivity of discovery (pre and post analysis)) and note that the precision of the output is unlikely to linearly match an applied threshold. The TPS threshold used is (> 1-NDR). If an NDR is not provided, a default NDR threshold of <0.1 is used (an effective TPS threshold of > 0.9). As in Modulating the sensitivity of discovery (pre and post analysis) an NDR of 1 can be provided to output all possible read classes with their TPS scores

novelAnnotations <- bambu(reads = test.bam, annotations = NULL, genome = fa.file, NDR = 0.5, quant = FALSE)

Storing and using preprocessed files (rcFiles)

The first step of bambu involves the construction of read classes which is a large fraction of the running time. This could be time-consuming if we want to perform transcript discovery & quantification multiple times on the same dataset using different configurations (eg. NDR, or combinations of samples), especially when the sample is large. To mitigate this, we can store the read class information as read class files (rcFiles) during a bambu run. Then they can be used as an input argument in the bambu main function for the subsequent bambu run.

se <- bambu(reads = rcFiles, annotations = annotations, genome = fa.file)

rcFiles can be generated in two ways, either as a direct output of the bambu() function when quant and discovery are FALSE, or as written outputs when a path is provided to the rcOutdir argument. When rcFiles are output using rcOutdir this is done using BiocFileCache. For more details on how to access, use, and identify these files see here. A short example is shown below.

Example using rcOutDir to produce preprocessed files

se <- bambu(reads = test.bam, rcOutDir = "path/to/rcOutput/", annotations = annotations, genome = fa.file)

This will store a preprocessed rcFile in the provided directory for each sample file provided to reads. To access these files for future use, we recommend using the BioCFileCache package which provides the metadata needed to identify the sample.

library(BiocFileCache)
bfc <- BiocFileCache("path/to/rcOutput/", ask = FALSE)
info <- bfcinfo(bfc)

The info object is a tibble which associates the filename (fpath) with the sample (rname) to help you identify which .rds file you need.

info
# running bambu using the first file
se <- bambu(reads = info$rpath[1], annotations = annotations, genome = fa.file)

This output is also generated when both quant and discovery are set to false in a list form indexed by sample.

se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE)

As this is an intermediate object it will not be suitable to use for general use cases. We will document the object below for any potential advanced use cases that may arise.

rowData(se[[1]])
column name description
chr.rc The chromosome name the read class is found on
strand.rc The strand of the read class
startSD The standard deviation of the aligned genomic start positions of all reads assigned to the read class
readCount.posStrand The number of reads assigned to this read class that aligned to the positive strand
intronStarts A comma separated character vector of intron start coordinates
intronEnds A comma separated character vector of intron end coordinates
confidenceType Category of confidence:
highConfidenceJunctionReads - the read class contain no low confidence junctions
lowConfidenceJunctionReads - the read class contains low confidence junctions
unsplicedWithin - single exon read class that is within the exon boundaries of an annotation
unsplicedNew - single exon read class that does not fully overlap with annotated exons
readCount The number of reads assigned to this read class
readId *only present when trackReads = TRUE An integer list of bambu internal read ids that belong to the read class. (See the metadata of the object for full read names)
GENEID The gene ID the transcript is associated with
novelGene A logical that is true if the read class belongs to a novel gene (does not overlap with an annotated gene loci)
numExons The number of exons the read class has
geneReadProp The proportion of reads assigned to this read class relative to all the reads assigned to all read classes from its gene
geneReadCount The number of reads assigned to the gene of this read class
equal A logical that is true if the read classes exon-junctions perfectly and completely match the exon-junctions of a reference annotation
compatible An integer counting the number of reference annotations, where the read classes exon-junctions are contiguously present (a subset)
numAstart An integer counting the number of A nucleotides found within a 20bp window centered on the read class genomic start position
numAend An integer counting the number of A nucleotides found within a 20bp window centered on the read class genomic end position
numTstart An integer counting the number of T nucleotides found within a 20bp window centered on the read class genomic start position
numTend An integer counting the number of T nucleotides found within a 20bp window centered on the read class genomic end position
txScore This is the TPS generated by the sample trained model
txScore.noFit This is the TPS generated by the pretrained model

Tracking read-to-transcript assignment

Some use cases require knowing which individual reads support specific transcripts (novel and annotated). By default this feature is off due to the memory overhead it introduces but can be turned on using the trackReads argument. The output has three columns: read_id, a list of indices of equal matches, a list of indices of compatible matches. These indices match the annotations found in rowRanges(se)

se <- bambu(reads = test.bam, annotations = annotations, genome = fa.file, trackReads = TRUE)
metadata(se)$readToTranscriptMaps[[1]]
column name description
readId The read name as found in the bam file. If running from a rcFile where trackReads!=TRUE, bambu will not have stored the read names, this will instead be a unique bambu-assigned numerical ID (will not correlate with the bam file).
equalMatches A list of integers with the tx ids where the exon-junctions of the read match completely and contiguously. This matches the index of the transcript found in rowRanges()
compatibleMatches A list of integers with the tx ids where the exon-junctions of the read are found contiguously within the transcript (a subset). This matches the index of the transcript found in rowRanges()

Training a model on another species/dataset and applying it

In situations where training is not or cannot be performed, and the default model is also not suitable for the sample (the sample is generated from a different technology, species, condition, etc), bambu provides the option to train a new model, if well annotated similar data is available. For example one might train a model on arabidopsis to apply to an unannotated plant sample.

# first train the model using a related annotated dataset from .bam
se = bambu(reads = sample1.bam, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE, opt.discovery = list(returnModel = TRUE)) # note that discovery and quant need to be set to FALSE, alternatively you can have them set to TRUE and retrieve the model from the rcFile as long as returnModel = TRUE ([see here](#Storing-and-using-preprocessed-files-rcFiles)).
newDefaultModel = metadata(se[[1]])$model # [[1]] will select the model trained on the first sample

# alternatively train the model using an rcFile
rcFile <- readRDS(pathToRcFile)
newDefaultModel = trainBambu(rcFile)

# use the trained model on another sample
# sample2.bam and fa.file2 represent the aligned reads and genome for the poorly annotated sample
se <- bambu(reads = sample2.bam, annotations = NULL, genome = fa.file2, opt.discovery = list(defaultModels = newDefaultModel, fitReadClassModel = FALSE))

#trainBambu Arguments

rcFile <- NULL, min.readCount = 2, nrounds = 50, NDR.threshold = 0.1, verbose = TRUE
arguments description
rcFile A loaded rcFile sample (see Storing and using preprocessed files (rcFiles))
min.readCount The minimum read count threshold used for read classes during training
nrounds The number of stumps used in the xgboost tree
NDR.threshold The NDR threshold that will be used for the recommended NDR calibration when using this model.
verbose A logical if more information should be printed whilst the function is running

Quantification of gene expression

To obtain gene expression, simply summing up over all annotated transcripts will likely underestimate it, as Bambu assigns only reads to transcripts if they are compatible. Reads which are incompatible with transcripts, but which can be assigned to the gene are tracked by Bambu to obtain more accurate gene expression estimate.

To obtain the accurate gene expression estimates which uses all reads that can be assigned to each gene (including reads that are incompatible with all existing annotations) you can run the following command:

seGene <- transcriptToGeneExpression(se)

The output of this function is a SummarizedExperiment object, where

  • assays(seGene)$counts returns the estimated expression counts for each gene
  • assays(seGene)$CPM returns the estimated CPM for each gene
  • rowData(seGene) returns the gene information
  • rowRanges(seGene) returns the gene genomic ranges

Including single exons

By default bambu does not report single exon transcripts because they are known to have a high frequency of false positives and do not have splice junctions that are used by bambu to distinguish read classes. Nevertheless bambu trains a separate model on single-exon transcripts, and these predictions can be accessed and included in the annotations.

se <- bambu(reads = sample1.bam, annotations = annotations, genome = fa.file, opt.discovery = list(min.txScore.singleExon = 0))

Fusion gene/isoform detection

To facilitate fusion gene/isoform detection, bambu has implemented a fusion mode. When it is set to TRUE, it will assign multiple GENEIDs to fusion transcripts, separated by ":".

To use this feature, it is recommended to detect the fusion gene breakpoints using fusion detection tools like JAFFAL first. Then fusion chromosome fasta file can be created by concatenating the two fusion gene sequences. Similarly, the fusion annotation gtf file can also be created with coordinates of the transcripts from the relevant genes changed to fusion chromosome coordinates. It is then required to do the re-alignment of reads originating from fusion region to the generated fusion chromosome fasta file. Then users can apply bambu on the re-aligned bam files with fusion chromosome fasta and gtf files.

se <- bambu(reads = fusionAligned.bam, annotations = fusionAnnotations, genome = fusionFasta, fusionMode = TRUE)

Bambu Arguments

argument description
reads A string or a vector of strings specifying the paths of bam files for genomic alignments, or a BamFile object or a BamFileList object (from Rsamtools).
rcOutDir A string variable specifying the path to where read class files will be saved.
annotations A TxDb object, a path to a .gtf file, or a GRangesList object obtained by prepareAnnotations.
genome A fasta file or a BSGenome object. If a fa.gz is provided, the .fai and .gzi must also be present
stranded A boolean for strandedness, defaults to FALSE.
ncore specifying number of cores used when parallel processing is used, defaults to 1.
NDR specifying the maximum NDR rate to novel transcript output among detected transcripts, defaults to 0.1
yieldSize see Rsamtools.
opt.discovery A list of controlling parameters for isoform reconstruction process:
prefix specifying prefix for new gene Ids (genePrefix.number), defaults to empty
remove.subsetTx indicating whether filter to remove read classes which are a subset of known transcripts, defaults to TRUE
min.readCount specifying minimun read count to consider a read class valid in a sample, defaults to 2
min.readFractionByGene specifying minimum relative read count per gene, highly expressed genes will have many high read count low relative abundance transcripts that can be filtered, defaults to 0.05
min.sampleNumber specifying minimum sample number with minimum read count, gene read proportion, and TPS, defaults to 1
min.exonDistance specifying minimum distance to known transcript to be considered valid as new, defaults to 35bp
min.exonOverlap specifying minimum number of bases shared with annotation to be assigned to the same gene id, defaults to 10bp
min.primarySecondaryDist specifying the minimum number of distance threshold between a read class and the annotations internal exons. Read classes with distances less than the threshold are not annotated as novel and counted with the annotations for quantification, defaults to 5bp
min.primarySecondaryDistStartEnd1 specifying the minimum number of distance threshold between a read class and the annotations start/end exons. Read classes with distances less than the threshold are not annotated as novel, defaults to 5bp
min.primarySecondaryDistStartEnd2 specifying the minimum number of distance threshold between a read class and the annotations start/end exons. Read classes with distances less than the threshold are counted with the annotations, defaults to 5bp
min.txScore.multiExon specifying the minimum transcript probility score threshold for multi-exon transcripts for min.sampleNumber, defaults to 0
min.txScore.singleExon specifying the minimum transcript probability score threshold for single-exon transcripts for min.sampleNumber
fitReadClassModel a boolean specifying if bambu should train a model on each sample. If set to false bambu will use the default model for ranking novel transcripts. defaults to TRUE
defaultModels a bambu trained model object that bambu will use when fitReadClassModel==FALSE or the data is not suitable for training, defaults to the pretrained model in the bambu package
returnModel a boolean specifying if bambu will output the model it trained on the data, defaults to FALSE
baselineFDR a value between 0-1. Bambu uses this FDR on the trained model to recommend an equivilent NDR threshold to be used for the sample. By default, a baseline FDR of 0.1 is used. This does not impact the analysis if an NDR is set.
min.readFractionByEqClass indicating the minimum relative read count of a subset transcript compared to all superset transcripts (ie the relative read count within the minimum equivalent class). This filter is applied on the set of annotations across all samples using the total read count, this is not a per-sample filter. Please use with caution. defaults to 0
opt.em A list of controlling parameters for quantification algorithm estimation process:
maxiter specifying maximum number of run iterations, defaults to 10000
degradationBias correcting for degradation bias, defaults to TRUE
conv specifying the covergence threshold control, defaults to 0.0001
minvalue specifying the minvalue for convergence consideration, defaults to 0.00000001
trackReads When TRUE read names will be tracked and output as metadata in the final output as readToTranscriptMaps detailing the assignment of reads to transcripts.The output is a list with an entry for each sample.
returnDistTable When TRUE the calculated distance table between read classes and annotations will be output as metadata as distTables. The output is a list with an entry for each sample.
discovery A logical variable indicating whether annotations are to be extended for quantification, defaults to TRUE.
quant A logical variable indicating whether quantification will be performed, defaults to TRUE.
verbose A logical variable indicating whether processing messages will be printed.
lowMemory Reads will be processed by chromosomes instead of all together when lowMemory is specified. This option provides an efficient way to process big samples.

Output Description

Access annotations that are matched to the transcript expression estimates by rowRanges()

rowRanges(se)
column description
seqnames The scaffold name the transcript is found on
ranges An iRanges object containing the start and end coordinates of the transcript (not stranded)
strand The strand of the transcript (+, -, *)
exon_rank The exon index of the exons in the transcript starting from the 5’ end of the transcript
exon_endRank The exon index of the exons in the transcript starting from the 3’ end of the transcript

Access transcript level data which is matched to transcript expression estimates using rowData()

mcols(rowRanges(se)) 
#or
mcols(se) 
#or 
rowData(se)
column description
TXNAME The transcript name for the transcript. Will use either the transcript name from the provided annotations or tx.X if it is a novel transcript where X is a unique integer.
GENEID The gene name for the transcript. Will use either the gene name from the provided annotations or gene.X if it is a novel transcript where X is a unique integer.
eqClass A character vector with the transcript names of all the equivalent transcripts (those which have this transcripts contiguous exon junctions)
txId A bambu specific transcript id used for indexing purposes
eqClassById A integer list with the transcript ids of all equivalent transcripts
txClassDescription A concatenated string containing the classes the transcript falls under:
annotation - Transcript matches an annotation transcript
allNew - All the intron-junctions are novel
newFirstJunction - the first junction is novel and at least one other junction matches an annotated transcript
newLastJunction - the last junction is novel and at least one other junction matches an annotated transcript
newJunction - an internal junction is novel and at least one other internal junction matches an annotated transcript
newWithin - A novel transcript with matching junctions but is not a subset of an annotation
unsplicedNew - A single exon transcript that doesn’t completely overlap with annotations
compatible - Is a subset of an annotated transcript
newFirstExon - The first exon is novel
newLastExon - The last exon is novel
readCount The number of full length reads associated with this transcript (filtered by min.readCount)
NDR The NDR score calculated for the transcript
relReadCount The proportion of reads this transcript has relative to all reads assigned to its gene
relSubsetCount The proportion of reads this transcript has relative to all reads that either fully or partially match this transcript

Release History

bambu v3.2.5

Release date: 2023-July-07

Minor changes:

  • Fix crash when extremely large datasets provided
  • Speed up read class construction
  • Add LongRead BiocView
  • Update release history

bambu v3.2.4

Release date: 2023-Apr-26

Minor changes:

  • Fixes crash during Low Memory Mode when there are scaffolds with no reads
  • Fixes crash on windows machines caused by DNAStringSet
  • Adds NDR metadata when running discovery mode with recommended NDR, so users do not need to look at console for the recommended NDR.
  • Re-enabled GitHub actions for new devel branch name and the windows check
  • Fixed a crash that occurs with large datasets resulting in large overflow tables during novel gene id assignment
  • Remove nested bplapply in EM
  • Remove unused eqClassById list column in the readClassDist object to reduce memory usage
  • Fixed a bug that caused identical unspliced reads to not be tracked when trackReads = TRUE

bambu version 3.0.0

Release date: 2022-10-25

Major changes:

  • Updated the input parameters of Bambu to simplify the user experience
  • Introduced NDR threshold recommendation
  • Implemented trainBambu(), allowing users to train and use models on their own data
  • Reads that cannot be assigned to any transcript are grouped as incompatible counts
  • Partial estimates are removed from output as it can be directly obtained based on total count estimates and full-length count estimates
  • The fusion mode is now available, which assigns read classes that align to multiple genes to a new combined fusion gene

Minor changes:

  • Novel transcripts and genes are now by default output with a Bambu prefix
  • Updated the documentation, messages and errors output by Bambu
  • Annotated transcripts (with unique exon-junctions) with at least 1 full-length read are assigned a NDR rank

bambu version 1.99.0

Release date: 2021-10-18

Major Changes:

  • Implemented a machine learning model to estimate transcript-level novel discovery rate
  • Implemented full length estimates, partial length estimates and unique read counts in final output
  • Improved the performance when extending annotations with simplified code
  • Improved the performance when large amounts of annotations are missing.
  • Implemented a lowMemory option to reduce the memory requirements for very large samples (>100 million reads)

Minor fixes:

  • remove the use of get() which looks into environment variables (prone to crashes if a variable of the same name exists) and directly references the functions that should be used instead.
  • bug fix when a fa file is provdied as a string variable to non-windows system
  • bug fix when no single exon read class in provided samples
  • bug fix when no splice overlaps found between read class and annotations

bambu version 1.0.2

Release date: 2020-11-10

  • bug fix for author name display
  • bug fix for calling fasta file and bam file from ExperimentHub
  • update NEWS file

bambu version 1.0.0

Release date: 2020-11-06

  • bug fix for parallel computation to avoid bplapply

bambu version 0.99.4

Release date: 2020-08-18

  • remove codes using seqlevelStyle to allow customized annotation
  • update the requirement of R version and ExperimentHub version

bambu version 0.3.0

Release date: 2020-07-27

  • bambu now runs on windows with a fasta file
  • update to the documentation (vignette)
  • prepareAnnotations now works with TxDb or gtf file
  • minor bug fixes

bambu version 0.2.0

Release date: 2020-06-18

bambu version 0.1.0

Release date: 2020-05-29

Citation

Chen, Y., Sim, A., Wan, Y.K. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods (2023). https://doi.org/10.1038/s41592-023-01908-w

Contributors

This package is developed and maintained by Ying Chen, Andre Sim, Yuk Kei Wan, Keith Yeo, Min Hao Ling and Jonathan Goeke at the Genome Institute of Singapore. If you want to contribute, please leave an issue. Thank you.

Bambu