Simulated data were generated following the procedures of the previous APAtrap (Ye et al., Bioinformatics 2018) or DaPars (Xia et al., Nature Communications 2014) studies.
- Select candidate gene models
- Based on the human genome annotation file (version hg19), we first filtered out genes with fewer than 3 exons. From the remaining genes, 2000 were randomly selected as candidate gene models for the simulation study. To simulate genes with two isoforms, 1000 genes were randomly selected from the candidate gene set. For each of these genes, the 3' end of the first isoform is the same as in the hg19 gene model, while the 3' end of the second isoform lies 1000 bp beyond the first 3' end. We also simulated genes with a variable number of isoforms (one to four), using the same 1000 genes as for the two-isoform simulation. From these 1000 genes, 50, 450, 300, and 200 genes were randomly selected and treated as genes with 1, 2, 3, and 4 isoforms, respectively. The 50 single-isoform genes keep the 3' end of the hg19 gene model. For genes with two to four isoforms, the respective 3' ends lie 500 bp, 1000 bp, and 1500 bp beyond the first 3' end. Finally, two gtf files (two isoforms.gtf and multiple isoforms.gtf) were obtained. Relevant files and code are in the simulated data folder.
Rscript selectValidGene.R
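The isoform assignment described above can be sketched in Python. This is only an illustration and not the content of selectValidGene.R: the function names are hypothetical, the offsets are read as applying to the second, third, and fourth isoform of a gene, and "beyond" is assumed to mean downstream on the transcribed strand (lower coordinates for minus-strand genes).

```python
import random

# Split stated in the text: 50/450/300/200 genes with 1/2/3/4 isoforms.
ISOFORM_SPLIT = {1: 50, 2: 450, 3: 300, 4: 200}
# Offsets (bp) of the additional 3' ends beyond the annotated one
# (assumed reading: 2nd, 3rd, and 4th isoform, in that order).
EXTRA_END_OFFSETS = [500, 1000, 1500]


def assign_isoform_counts(gene_ids, split=ISOFORM_SPLIT, seed=0):
    """Randomly partition gene_ids into groups with 1 to 4 isoforms."""
    if sum(split.values()) != len(gene_ids):
        raise ValueError("split sizes must add up to the number of genes")
    rng = random.Random(seed)
    shuffled = list(gene_ids)
    rng.shuffle(shuffled)
    counts, start = {}, 0
    for n_isoforms, n_genes in sorted(split.items()):
        for gene in shuffled[start:start + n_genes]:
            counts[gene] = n_isoforms
        start += n_genes
    return counts


def isoform_three_prime_ends(annotated_end, strand, n_isoforms):
    """Return the 3' end coordinate of each isoform of one gene.

    The first isoform keeps the annotated 3' end; each further isoform
    is extended downstream on the transcribed strand (an assumption).
    """
    direction = 1 if strand == "+" else -1
    offsets = [0] + EXTRA_END_OFFSETS[:n_isoforms - 1]
    return [annotated_end + direction * off for off in offsets]
```

For example, `isoform_three_prime_ends(10000, "+", 4)` yields the annotated end followed by ends 500, 1000, and 1500 bp further downstream.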
- Simulate RNA-seq reads
- We used the Flux Simulator tool, the hg19 genome FASTA file, and the gtf files from the first step (Select candidate gene models) to generate simulated RNA-seq reads. Simulated RNA-seq datasets of 50 bp paired-end reads with a 150 bp fragment size were generated by Flux Simulator, providing the human genome annotation (parameters: FRAG_UR_ETA=150; READ_NUMBER=10000000; READ_LENGTH=50). In the benchmarking study, 100 million reads were generated for the 1000 two-isoform genes and for the 1000 one- to four-isoform genes, respectively. To evaluate the performance of differential APA detection on the simulated data, we calculated the true percentage difference (PD) index (Ye, et al., 2018) from the simulated data and considered genes with PD larger than 0.2 as genes with dynamic APA usage.
flux-simulator -t simulator -p syn1000.par
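The PD-based labelling of dynamic APA genes can be sketched as follows. The exact PD definition is given in Ye et al. (2018) and is not reproduced in this document, so the formula below is an assumption: PD is taken as half the summed absolute difference in relative isoform usage between two conditions, which ranges from 0 to 1. The function names and the input layout are hypothetical.

```python
def percentage_difference(counts_a, counts_b):
    """Assumed PD: half the L1 distance between isoform usage fractions.

    counts_a and counts_b hold the true read (or molecule) counts of each
    isoform of one gene in two conditions, in the same isoform order.
    Returns None when the gene is not expressed in one of the conditions.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both conditions must list the same isoforms")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        return None
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(counts_a, counts_b)
    )


def dynamic_apa_genes(gene_counts, threshold=0.2):
    """Return genes whose PD exceeds the threshold (0.2 in the text)."""
    selected = []
    for gene, (counts_a, counts_b) in gene_counts.items():
        pd_value = percentage_difference(counts_a, counts_b)
        if pd_value is not None and pd_value > threshold:
            selected.append(gene)
    return selected
```

Under this assumed definition, a two-isoform gene whose distal isoform usage shifts from 20% to 60% has a PD of about 0.4 and would be labelled as dynamic.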
- Files in simulated data
- bed file: The BED format is used by default to describe the reads produced in a Flux Simulator run by the genomic regions from which they originate. Reads that fall partially in the poly-A tail are truncated to the part that consists of genomic sequence. In contrast, reads that fall completely into the poly-A tail are reported as located on the special reference sequence 'poly-A'.
- read sequences file (.fq): Read sequences, produced when a genomic sequence is (optionally) provided as input.
- Transcriptome Profile (.PRO): The Profile (.PRO) format is designed to describe the simulated characteristics of each transcript from the reference annotation.
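As a small illustration of the bed file convention described above, the following Python sketch counts how many simulated reads map to genomic regions and how many fall entirely in the poly-A tail. It assumes a tab-separated BED file whose first column is the reference sequence name, and it takes the name 'poly-A' from the description above; the function name is hypothetical.

```python
from collections import Counter


def count_reads_by_reference(bed_lines, polya_name="poly-A"):
    """Count BED records on genomic references versus the poly-A reference."""
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        reference = line.split("\t")[0]
        key = "polyA" if reference == polya_name else "genomic"
        counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle on the simulated bed file can be passed directly.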
- RNA-seq datasets used for the benchmarking analysis
Species | Data samples | NCBI accession number | Genome annotation and assembly version |
---|---|---|---|
Human | MAQC Brain | SRX016359 SRX016366 | hg19 |
Human | MAQC Universal Human Reference (UHR) | SRX016367 SRX016368 | hg19 |
Mouse | Mouse Brain | GSE41637 | mm10 |
Arabidopsis | Control and mild drought conditions | ERX697776 ERX697793 | TAIR10 |

Tool | Implementation | APA detection | Switching detection |
---|---|---|---|
MISO (Katz, et al., 2010) | Python | No | Yes |
roar (Grassi, et al., 2016) | R | No | Yes |
QAPA (Ha, et al., 2018) | R, Python | Yes | Yes |
PAQR (Gruber, et al., 2018) | R, Python | Yes | Yes |
3USS (Le Pera, et al., 2015) | Web | Yes | No |
PASA (Campbell, et al., 2006) | Perl | Yes | No |
Cufflinks (Trapnell, et al., 2012) | R | Yes | No |
ExUTR (Huang and Teeling, 2017) | Perl | No | No |
Scripture (Guttman, et al., 2010) | Java | No | No |
KLEAT (Birol, et al., 2015) | Python | Yes | No |
ContextMap2 (Bonfert and Friedel, 2017) | Java | Yes | No |
GETUTR (Kim, et al., 2015) | Python | Yes | No |
PHMM (Lu and Bushel, 2013) | R | No | Yes |
ChangePoint (Wang, et al., 2014) | Java | No | Yes |
IsoSCM (Shenker, et al., 2015) | Java | Yes | Yes |
DaPars (Xia, et al., 2014) | Python | Yes | Yes |
APAtrap (Ye, et al., 2018) | R, Perl | Yes | Yes |
TAPAS (Arefeen, et al., 2018) | R | Yes | Yes |
EBChangePoint (Zhang and Wei, 2016) | Java | Yes | Yes |