This folder contains a test data set for the RNA-Seq workflow.
- reads_1.fq.gz and reads_2.fq.gz: Fastq files containing the simulated paired-end RNA-Seq data
- transcriptome.fa: Transcriptome reference fasta file (representing 582 transcripts)
- truth.tsv: True read counts for the simulated data
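The transcript count stated above can be checked by counting header lines in the reference file. A minimal sketch in Python, assuming transcriptome.fa is in the current directory; the helper name count_fasta_records is hypothetical and not part of the workflow:

```python
import gzip
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    path = Path(path)
    # Open gzip-compressed files transparently; otherwise read as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    fasta = Path("transcriptome.fa")  # assumed location of the reference file
    if fasta.exists():
        print(f"{fasta}: {count_fasta_records(fasta)} transcripts")
```

If the file matches the description above, the reported number should be 582.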
The test data set was simulated to represent a small set of genes in a human cell line RNA-Seq sample with the following procedure (thanks to Rob Patro):
- The sample ERR188297 was downloaded from ENA (this is an experimental sample from GEUVADIS).
- The sample was quantified against the Gencode v38 human transcriptome.
- The results were loaded into R with tximport and aggregated to the gene level.
- Expressed genes were randomly selected until the sum of their estimated read counts exceeded 100,000 (resulting in 66 genes).
- All transcripts from these genes were selected to generate the test data transcriptome reference file (582 transcripts).
- The estimated transcript-level counts were then used to simulate the test data with polyester using simulate_experiment_countmat.
- The reads were shuffled (while maintaining the pairing) using bbmap.
- Fake quality scores were added to the reads, using bbmap.
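The gene selection step in the procedure above can be illustrated with a short sketch. This is not the original code (the selection was done in R); it is a Python approximation under stated assumptions: "expressed" is taken to mean an estimated count greater than zero, and the function name select_genes, the seed, and the input dictionary are all hypothetical.

```python
import random


def select_genes(gene_counts, threshold=100_000, seed=0):
    """Draw expressed genes in random order until their summed counts exceed threshold.

    gene_counts maps gene ID -> estimated read count (hypothetical input format).
    Returns the selected gene IDs and their total count.
    """
    # Assumption: a gene is "expressed" if its estimated count is above zero.
    expressed = sorted(g for g, c in gene_counts.items() if c > 0)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    rng.shuffle(expressed)

    selected, total = [], 0.0
    for gene in expressed:
        if total > threshold:
            break  # stop as soon as the running sum has exceeded the threshold
        selected.append(gene)
        total += gene_counts[gene]
    return selected, total
```

With real gene-level estimates, the number of genes returned depends on the random draw; the procedure above ended with 66 genes. If the summed counts never exceed the threshold, the sketch returns every expressed gene.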