bioinformatics-workflows/test_data at master · GoekeLab/bioinformatics-workflows

Test data for the RNA-Seq workflow

This folder contains a test data set for the RNA-Seq workflow.

reads_1.fq.gz and reads_2.fq.gz: FASTQ files containing the simulated paired-end RNA-Seq data
transcriptome.fa: Transcriptome reference FASTA file (582 transcripts)
truth.tsv: True read counts for the simulated data
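
As a quick sanity check after downloading the folder, the files listed above can be inspected with a few lines of Python. This is only a minimal sketch: the helper names (`count_fasta_records`, `count_fastq_reads`, `check_test_data`) are hypothetical and not part of the workflow, and it assumes standard four-line FASTQ records and an uncompressed FASTA file. The layout of truth.tsv is not described here, so the sketch does not parse it.

```python
import gzip
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def check_test_data(folder):
    """Summarise the test data folder (hypothetical helper, not part of the workflow)."""
    folder = Path(folder)
    n_r1 = count_fastq_reads(folder / "reads_1.fq.gz")
    n_r2 = count_fastq_reads(folder / "reads_2.fq.gz")
    return {
        "transcripts": count_fasta_records(folder / "transcriptome.fa"),
        "reads_1": n_r1,
        "reads_2": n_r2,
        # Paired-end files should contain the same number of reads.
        "paired": n_r1 == n_r2,
    }
```

Calling `check_test_data("test_data")` from the repository root should report the number of transcripts in the reference (582 according to this README) and whether both read files contain the same number of reads.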

Test data generation

The test data set was simulated to represent a small set of genes in a human cell line RNA-Seq sample with the following procedure (thanks to Rob Patro):

The sample ERR188297 was downloaded from ENA (this is an experimental sample from GEUVADIS).
The sample was quantified against the Gencode v38 human transcriptome.
The results were loaded into R with tximport and aggregated to the gene level.
Expressed genes were randomly selected until the sum of their estimated read counts exceeded 100,000 (resulting in 66 genes).
All transcripts from these genes were selected to generate the test data transcriptome reference file (582 transcripts).
The estimated transcript-level counts were then used to simulate the test data with polyester, using simulate_experiment_countmat.
The reads were shuffled (while maintaining the pairing) using bbmap.
Fake quality scores were added to the reads, using bbmap.
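
The gene selection and the pair-preserving shuffle described above can be illustrated in Python. This is a sketch of the logic only, not the code that produced the data: the original steps were run in R (tximport, polyester) and with bbmap. The function names `select_genes` and `shuffle_pairs`, the seed handling, and the definition of "expressed" as an estimated count greater than zero are all assumptions made for illustration.

```python
import random


def select_genes(gene_counts, threshold=100_000, seed=0):
    """Randomly draw expressed genes until their summed counts exceed threshold.

    gene_counts maps gene ID -> estimated read count. "Expressed" is assumed
    here to mean a count greater than zero.
    """
    rng = random.Random(seed)
    expressed = [gene for gene, count in gene_counts.items() if count > 0]
    rng.shuffle(expressed)

    selected, total = [], 0.0
    for gene in expressed:
        if total > threshold:
            break
        selected.append(gene)
        total += gene_counts[gene]
    # If the expressed genes run out first, total may still be below threshold.
    return selected, total


def shuffle_pairs(reads_1, reads_2, seed=0):
    """Shuffle two read lists with the same permutation so mates stay paired."""
    if len(reads_1) != len(reads_2):
        raise ValueError("paired read lists must have the same length")
    order = list(range(len(reads_1)))
    random.Random(seed).shuffle(order)
    return [reads_1[i] for i in order], [reads_2[i] for i in order]
```

Applying one permutation to both read lists is what keeps read i in the first file matched to read i in the second, which paired-end tools generally rely on.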
