Data and Tools

Simulated Data

Simulated data were generated as in the previous studies of APAtrap (Ye et al. Bioinformatics 2018) or DaPars (Xia et al. Nature Communications 2014).

  • Select candidate gene models
    • Starting from the human genome annotation file (version hg19), we first filtered out genes with fewer than three exons. From the remaining genes, 2000 were randomly selected as candidate gene models for the simulation study. To simulate genes with two isoforms, 1000 genes were randomly selected from the candidate gene set; the 3' end of the first isoform is the same as in the hg19 gene model, while the 3' end of the second isoform lies 1000 bp beyond the first 3' end. We also simulated genes with a variable number of isoforms (one to four), using the same 1000 genes as for the two-isoform simulation. From these, 50, 450, 300, and 200 genes were randomly assigned 1, 2, 3, and 4 isoforms, respectively. The 50 single-isoform genes keep the 3' end of the hg19 gene model. For genes with two to four isoforms, the 3' ends of the second, third, and fourth isoforms lie 500 bp, 1000 bp, and 1500 bp beyond the first 3' end, respectively. Finally, two gtf files (two isoforms.gtf and multiple isoforms.gtf) were obtained. Relevant files and code are in the simulated data folder.
      Rscript selectValidGene.R 
    
  • Simulate RNA-seq reads
    • We used the Flux Simulator tool, the hg19 genome FASTA file, and the gtf files from the first step (Select candidate gene models) to generate simulated RNA-seq reads. Flux Simulator generated datasets of 50 bp paired-end reads with a 150 bp fragment size, providing the human genome annotation (parameters: FRAG_UR_ETA=150; READ_NUMBER=10000000; READ_LENGTH=50). In the benchmarking study, 100 million reads were generated for each of the two gene sets: the 1000 two-isoform genes and the 1000 genes with one to four isoforms. To evaluate differential APA detection on the simulated data, we calculated the true percentage difference (PD) index (Ye, et al., 2018) from the simulated data and considered genes with PD larger than 0.2 to have dynamic APA usage.
      flux-simulator -t simulator -p syn1000.par 
    
  • Files in simulated data
    • bed file: BED is the default format for describing the reads produced in a Flux Simulator run by the genomic regions from which they originate. Reads that fall partially in the poly-A tail are truncated to the part that overlaps genomic sequence. In contrast, reads that fall completely within the poly-A tail are reported on the special reference sequence 'poly-A'.
    • read sequences file (.fq): The read sequences, which are produced when a genomic sequence is supplied as (optional) input.
    • Transcriptome Profile (.PRO): The Profile (.PRO) format is designed to describe the simulated characteristics of each transcript from the reference annotation.
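
The two computational steps described in the list above (assigning isoform counts with extended 3' ends, and labelling genes with dynamic APA usage by PD) can be sketched in Python as follows. This is only an illustrative sketch, not the code in selectValidGene.R. All function names are hypothetical, "beyond" is assumed to mean downstream in transcript orientation, and the PD formula (half the summed absolute difference in relative poly(A) site usage) is an assumed form, since this document does not state it; see Ye, et al., 2018 for the actual definition.

```python
import random

# Group sizes and spacing are taken from the description above.
ISOFORM_GROUP_SIZES = {1: 50, 2: 450, 3: 300, 4: 200}
EXTENSION_STEP_BP = 500  # isoforms 2, 3, 4 end 500, 1000, 1500 bp beyond the first


def assign_isoform_counts(gene_ids, group_sizes=ISOFORM_GROUP_SIZES, seed=0):
    """Randomly partition genes into groups with a fixed number of isoforms."""
    if len(gene_ids) != sum(group_sizes.values()):
        raise ValueError("group sizes must add up to the number of genes")
    shuffled = list(gene_ids)
    random.Random(seed).shuffle(shuffled)
    counts, start = {}, 0
    for n_isoforms, size in sorted(group_sizes.items()):
        for gene in shuffled[start:start + size]:
            counts[gene] = n_isoforms
        start += size
    return counts


def isoform_three_prime_ends(annotated_end, strand, n_isoforms,
                             step=EXTENSION_STEP_BP):
    """Return one 3' end per isoform; the first is the annotated 3' end.

    ASSUMPTION: "beyond" means downstream of the transcript, so coordinates
    increase on the '+' strand and decrease on the '-' strand.
    """
    direction = 1 if strand == "+" else -1
    return [annotated_end + direction * step * k for k in range(n_isoforms)]


def percentage_difference(usage_a, usage_b):
    """ASSUMED form of PD: half the summed absolute difference in the
    relative usage of each poly(A) site between two conditions."""
    total_a, total_b = sum(usage_a), sum(usage_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each condition needs non-zero total expression")
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(usage_a, usage_b))


def has_dynamic_apa(usage_a, usage_b, threshold=0.2):
    """Apply the PD > 0.2 cutoff described in the text."""
    return percentage_difference(usage_a, usage_b) > threshold


if __name__ == "__main__":
    counts = assign_isoform_counts([f"gene{i}" for i in range(1000)])
    print(isoform_three_prime_ends(10_000, "+", 4))
    print(has_dynamic_apa([80, 20], [50, 50]))
```

The same helper covers the two-isoform data set by calling `isoform_three_prime_ends(end, strand, 2, step=1000)`. Writing the resulting coordinates back into gtf records is left out, because it depends on how the annotation is parsed.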

Real RNA-seq data

  • RNA-seq datasets used for the benchmarking analysis
| Species | Data samples | NCBI accession number | Genome annotation and assembly version |
| --- | --- | --- | --- |
| Human | MAQC Brain | SRX016359 SRX016366 | hg19 |
| Human | MAQC Universal Human Reference (UHR) | SRX016367 SRX016368 | hg19 |
| Mouse | Mouse Brain | GSE41637 | mm10 |
| Arabidopsis | Control and mild drought conditions | ERX697776 ERX697793 | TAIR10 |

Tools evaluated in the benchmarking study

| Tools | Program | APA detection | Switching detection |
| --- | --- | --- | --- |
| MISO (Katz, et al., 2010) | Python | No | Yes |
| roar (Grassi, et al., 2016) | R | No | Yes |
| QAPA (Ha, et al., 2018) | R, Python | Yes | Yes |
| PAQR (Gruber, et al., 2018) | R, Python | Yes | Yes |
| 3USS (Le Pera, et al., 2015) | Web | Yes | No |
| PASA (Campbell, et al., 2006) | Perl | Yes | No |
| Cufflinks (Trapnell, et al., 2012) | R | Yes | No |
| ExUTR (Huang and Teeling, 2017) | Perl | No | No |
| Scripture (Guttman, et al., 2010) | Java | No | No |
| KLEAT (Birol, et al., 2015) | Python | Yes | No |
| ContextMap2 (Bonfert and Friedel, 2017) | Java | Yes | No |
| GETUTR (Kim, et al., 2015) | Python | Yes | No |
| PHMM (Lu and Bushel, 2013) | R | No | Yes |
| ChangePoint (Wang, et al., 2014) | Java | No | Yes |
| IsoSCM (Shenker, et al., 2015) | Java | Yes | Yes |
| DaPars (Xia, et al., 2014) | Python | Yes | Yes |
| APAtrap (Ye, et al., 2018) | R, Perl | Yes | Yes |
| TAPAS (Arefeen, et al., 2018) | R | Yes | Yes |
| EBChangePoint (Zhang and Wei, 2016) | Java | Yes | Yes |