A Snakemake workflow for bulk RNA-seq data preprocessing, including sequence alignment, quality checks, etc.
Zhengqiao Zhao, Yuri Pritykin
October, 2022
- The core environment can be installed using conda. The pipeline relies on hisat2, samtools and fastqc.
conda create -n snakemake_bulkRNA -c bioconda -c conda-forge snakemake hisat2 samtools fastqc gnuplot
- We also provide an R environment for building the gene count matrix and for differential gene expression analysis using Rsubread and DESeq2.
conda create -n r_deseq -c bioconda -c conda-forge -c r r-tidyverse r-data.table bioconductor-deseq2 bioconductor-rtracklayer bioconductor-summarizedexperiment bioconductor-rsubread r-pheatmap
The raw fastq files should be placed in one directory, e.g., raw_data. The names of the fastq files should follow this pattern: <sample_name>_R<1,2>.fastq. For example, raw_data should have the following directory structure:
raw_data
| SAMPLE1_R1.fastq
| SAMPLE1_R2.fastq
| SAMPLE2_R1.fastq
| SAMPLE2_R2.fastq
| ...
The config file should list the exact sample names (excluding the _R<1,2>.fastq suffix), and the Snakemake file should contain the absolute paths to the raw data directory and to the results directory, which is used to save the pipeline outputs, including the alignment results.
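Because the sample names in the config file must match the fastq file names exactly, it can help to derive them from the raw data directory rather than typing them by hand. The following is a minimal sketch, not part of the pipeline: the function name find_samples is hypothetical, and the code assumes paired-end data in which every sample has both an R1 and an R2 file, as in the example layout.

```python
import re
from pathlib import Path

# Naming rule from this README: <sample_name>_R<1,2>.fastq
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq$")


def find_samples(raw_dir):
    """Return the sorted sample names found in raw_dir.

    Hypothetical helper. Assumes paired-end data: a sample with only
    one of its R1/R2 files is reported as an error.
    """
    mates = {}
    for path in Path(raw_dir).iterdir():
        match = FASTQ_PATTERN.match(path.name)
        if match:  # files that do not follow the pattern are ignored
            mates.setdefault(match["sample"], set()).add(match["mate"])

    incomplete = sorted(s for s, found in mates.items() if found != {"1", "2"})
    if incomplete:
        raise ValueError(f"missing R1 or R2 file for: {', '.join(incomplete)}")
    return sorted(mates)
```

For a directory containing only the four files listed in the example, find_samples("raw_data") would return SAMPLE1 and SAMPLE2, which are the names to copy into the config file.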
- Modify the cluster.json file so that it is compatible with your computing cluster settings.
- Run the following command in a terminal:
bash run_snakemake.sh
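A hand-edited JSON file is easy to break with a stray comma or quote, so it may be worth checking that cluster.json still parses before submitting jobs. The sketch below is an optional check and not part of the pipeline: the function name load_cluster_config is hypothetical, and the requirement that the top level be a JSON object is an assumption about how the file is structured.

```python
import json


def load_cluster_config(path="cluster.json"):
    """Parse the cluster config and return it as a dict.

    Hypothetical helper. Raises ValueError on a JSON syntax error, or if
    the top level is not a JSON object (an assumption about the layout).
    """
    with open(path) as handle:
        try:
            config = json.load(handle)
        except json.JSONDecodeError as err:
            raise ValueError(f"{path} is not valid JSON: {err}") from err
    if not isinstance(config, dict):
        raise ValueError(f"{path} should contain a JSON object at the top level")
    return config
```

Calling load_cluster_config() from the repository root after editing the file reports the line and column of the first syntax error, if there is one.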
The following image shows the directed acyclic graph (DAG) of the workflow rules, where the edges represent dependencies.
It can be obtained by running the following command in the snakemake conda environment:
snakemake --forceall --rulegraph | dot -Tpng > RULE_DAG.png