Data and Tools

Simulated Data

Simulated data were generated following the approach of the previous APAtrap (Ye et al., Bioinformatics, 2018) and DaPars (Xia et al., Nature Communications, 2014) studies.

Select candidate gene models
- Starting from the human genome annotation file (version hg19), we first filtered out genes with fewer than three exons. From the remaining genes, 2000 were randomly selected as candidate gene models for the simulation study. To simulate genes with two isoforms, 1000 genes were randomly selected from the candidate gene set. For each gene, the 3' end of the first isoform is the same as in the hg19 gene model, and the 3' end of the second isoform lies 1000 bp beyond it. We also simulated genes with a variable number of isoforms (one to four), using the same 1000-gene set as for the two-isoform simulation. Of these 1000 genes, 50, 450, 300, and 200 were randomly assigned 1, 2, 3, and 4 isoforms, respectively. The 50 single-isoform genes keep the 3' end of the hg19 gene model. For genes with two to four isoforms, the 3' ends of the second, third, and fourth isoforms lie 500 bp, 1000 bp, and 1500 bp beyond the first 3' end, respectively. This produced two gtf files (two isoforms.gtf and multiple isoforms.gtf). Relevant files and code are in the simulated data folder.
```
  Rscript selectValidGene.R 
```
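The assignment and 3' end extension described above can be sketched in Python. This is an illustrative sketch only, not the code in selectValidGene.R: the function names, the `seed` argument, and the strand handling ("beyond" is assumed to mean downstream in transcript orientation) are assumptions made for the example.

```python
import random


def assign_isoform_counts(gene_ids, counts=(50, 450, 300, 200), seed=0):
    """Randomly split genes into groups with 1, 2, 3, and 4 isoforms.

    `counts` gives the group sizes for 1..4 isoforms. The seed is a
    hypothetical parameter, added only to make the sketch reproducible.
    """
    gene_ids = list(gene_ids)
    if sum(counts) != len(gene_ids):
        raise ValueError("group sizes must sum to the number of genes")
    rng = random.Random(seed)
    rng.shuffle(gene_ids)
    assignment = {}
    start = 0
    for n_isoforms, size in enumerate(counts, start=1):
        for gene in gene_ids[start:start + size]:
            assignment[gene] = n_isoforms
        start += size
    return assignment


def isoform_three_prime_ends(annotated_end, strand, n_isoforms, step=500):
    """Return the 3' end coordinate of each isoform, proximal first.

    Assumption: "beyond" means downstream in transcript orientation, so
    the coordinate grows on the + strand and shrinks on the - strand.
    """
    direction = 1 if strand == "+" else -1
    return [annotated_end + direction * step * k for k in range(n_isoforms)]
```

With `step=500`, a four-isoform gene gets ends at 0, 500, 1000, and 1500 bp beyond the annotated 3' end. The two-isoform set corresponds to `n_isoforms=2` with `step=1000`.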
Simulate RNA-seq reads
- We used the Flux Simulator tool, the hg19 genome FASTA file, and the gtf files from the first step (Select candidate gene models) to generate simulated RNA-seq reads. Flux Simulator generated datasets of 50 bp paired-end reads with a 150 bp fragment size (parameters: FRAG_UR_ETA=150; READ_NUMBER=10000000; READ_LENGTH=50). In the benchmarking study, 100 million reads were generated for each of the two gene sets: the 1000 two-isoform genes and the 1000 genes with one to four isoforms. To evaluate differential APA detection on the simulated data, we calculated the true percentage difference (PD) index (Ye, et al., 2018) from the simulated data and considered genes with a PD larger than 0.2 to have dynamic APA usage.
```
  flux-simulator -t simulator -p syn1000.par 
```
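The PD > 0.2 labelling step can be illustrated as follows. The PD formula used here (half the summed absolute difference between the two samples' poly(A) site usage proportions) is an assumed form chosen for the example; check Ye et al. (2018) for the exact definition. The function names are hypothetical, and the per-site abundances are expected to come from the simulated ground truth.

```python
def usage_proportions(abundances):
    """Convert per-site abundances of one gene into usage proportions."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("gene has no expression in this sample")
    return [a / total for a in abundances]


def percentage_difference(abund_a, abund_b):
    """PD between two samples for one gene.

    Assumed form: half the L1 distance between the usage proportion
    vectors, which ranges from 0 (identical usage) to 1 (complete switch).
    """
    if len(abund_a) != len(abund_b):
        raise ValueError("both samples must list the same poly(A) sites")
    prop_a = usage_proportions(abund_a)
    prop_b = usage_proportions(abund_b)
    return 0.5 * sum(abs(x - y) for x, y in zip(prop_a, prop_b))


def dynamic_apa_genes(pd_by_gene, threshold=0.2):
    """Return the genes whose PD is strictly larger than the threshold."""
    return sorted(g for g, pd in pd_by_gene.items() if pd > threshold)
```

For example, a two-isoform gene whose proximal site accounts for 80% of expression in one sample and 50% in the other gives a PD of about 0.3 under this form, so it would be labelled as having dynamic APA usage.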
Files in simulated data
- bed file: BED is the default format for describing the reads produced in a Flux Simulator run by the genomic regions from which they originate. Reads that fall partially in the poly-A tail are truncated to the part that contains genomic sequence. In contrast, reads that fall completely within the poly-A tail are reported as located on the special reference sequence 'poly-A'.
- read sequences file (.fq): Read sequences, produced when a genomic sequence is supplied as (optional) input.
- Transcriptome Profile (.PRO): The Profile (.PRO) format is designed to describe the simulated characteristics of each transcript from the reference annotation.
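
As a quick sanity check on the bed file, the reads placed on the special 'poly-A' reference can be counted separately from reads mapped to genomic regions. This is a minimal sketch that assumes a standard tab-separated BED layout with the reference name in the first column; the function name is hypothetical, and the reference label is a parameter in case the file spells it differently.

```python
from collections import Counter


def count_reads_by_reference(bed_lines, polya_name="poly-A"):
    """Count BED records on the poly-A reference versus genomic references.

    Assumes tab-separated BED lines with the reference sequence name in
    the first column; blank lines and header lines are skipped.
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        reference = line.split("\t")[0]
        counts["poly-A" if reference == polya_name else "genomic"] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `with open(path) as handle: count_reads_by_reference(handle)`, where `path` points to the bed file of a run.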

Real RNA-seq data

RNA-seq datasets used for the benchmarking analysis

| Species | Data samples | NCBI accession number | Genome annotation and assembly version |
| --- | --- | --- | --- |
| Human | MAQC Brain | SRX016359 SRX016366 | hg19 |
| Human | MAQC Universal Human Reference (UHR) | SRX016367 SRX016368 | hg19 |
| Mouse | Mouse Brain | GSE41637 | mm10 |
| Arabidopsis | Control and mild drought conditions | ERX697776 ERX697793 | TAIR10 |

Tools evaluated in the benchmarking study

| Tool | Implementation | APA detection | Switching detection |
| --- | --- | --- | --- |
| MISO (Katz, et al., 2010) | Python | No | Yes |
| roar (Grassi, et al., 2016) | R | No | Yes |
| QAPA (Ha, et al., 2018) | R, Python | Yes | Yes |
| PAQR (Gruber, et al., 2018) | R, Python | Yes | Yes |
| 3USS (Le Pera, et al., 2015) | Web | Yes | No |
| PASA (Campbell, et al., 2006) | Perl | Yes | No |
| Cufflinks (Trapnell, et al., 2012) | R | Yes | No |
| ExUTR (Huang and Teeling, 2017) | Perl | No | No |
| Scripture (Guttman, et al., 2010) | Java | No | No |
| KLEAT (Birol, et al., 2015) | Python | Yes | No |
| ContextMap2 (Bonfert and Friedel, 2017) | Java | Yes | No |
| GETUTR (Kim, et al., 2015) | Python | Yes | No |
| PHMM (Lu and Bushel, 2013) | R | No | Yes |
| ChangePoint (Wang, et al., 2014) | Java | No | Yes |
| IsoSCM (Shenker, et al., 2015) | Java | Yes | Yes |
| DaPars (Xia, et al., 2014) | Python | Yes | Yes |
| APAtrap (Ye, et al., 2018) | R, Perl | Yes | Yes |
| TAPAS (Arefeen, et al., 2018) | R | Yes | Yes |
| EBChangePoint (Zhang and Wei, 2016) | Java | Yes | Yes |
