This repository has been archived by the owner on Mar 26, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
lfdelzam/Automatic_transcriptome_analysis
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Project name: Automating the transcriptomics analysis Author: Fernando Delgado I. Objective: To Automate the transcriptomics analysis using Snakemake (a workflow-engine). II. Software and environment: A. List of software: 1. FastQC v0.11.8 2. Trimmomatic 0.38 3. hisat2 version: 2.1.0 4. htseq version: HTSeq-0.11.2 5. conda (version: 4.5.12), where the followings will be added: * R version: 3.5.0 * R packages: DESeq2 gplots RColorBrewer * snakemake (version: 5.4.2) B. Download and install software in home/bin directory: > cd > cd bin * Trimmomatic (binary packages) from http://www.usadellab.org/cms/?page=trimmomatic * FASTQC from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Then add to .bashrc file > echo "~/bin/FastQC" >> ~/.bashrc > source ~/.bashrc * conda > wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh > bash ~/miniconda.sh -b Then add to .bashrc file > echo ". ~/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc > source ~/.bashrc > conda config --add channels bioconda > conda config --add channels conda-forge > conda create --name snakenv rstudio bioconductor-DESeq2 snakemake * hisat2 > wget ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-Linux_x86_64.zip > unzip hisat2-2.1.0-Linux_x86_64.zip > echo export PATH=$PATH:~/bin/hisat2-2.1.0 >> ~/.bashrc > source ~/.bashrc * htseq > git clone https://github.com/simon-anders/htseq.git > cd htseq ~/bin/htseq> pip install HTSeq --user > echo export PATH=$PATH:~/bin/htseq >> ~/.bashrc > source ~/.bashrc #checking > htseq-count -h III. Preparation steps: A. first create the project directory : Analysis_Transcriptomic #commands in your terminal: > cd #move to root > mkdir Analysis_Transcriptomic > cd Analysis_Transcriptomic #move to the project directory SOFTWARE: B. Create bin directory in your project directory and link the executable software to you bin in your project directory > cd ~/Analysis_Transcroptomic > mkdir bin > cd bin > ln -s ~/bin/FastQC > ln -s ~/bin/trimmomatic-0.38.jar > ln -s ~/bin/hisat2-2.1.0 > ln -s ~/bin/htseq/ RAWDATA: C. creating a data directory > cd > cd Analysis_Transcriptomic > mkdir data > cd data download to this directory your rawdata: * genome data: genes.gff # annotation file genome.fna # reference genome * Single reads: * control (reference), 3 replicates: ref1.fastq ref2.fastq ref3.fastq * Condition to be evaluated, 3 replicates: fh1.fastq fh2.fastq fh3.fastq ---description of the data-- "We will use single reads representing two conditions. Each condition has been replicated three times (numbered 1, 2 and 3). The condition fh stands for forest hot and is a soil which is rich in nutrients (dead organic material). The term ref stands for reference. Here the fungus lives on poor media on an agar plate. The purpose is to see if there are (significantly) deferentially expressed genes when comparing the two conditions. The importance of replication is evaluated"(from Transcriptome analysis compendium, Lund University) D. creating a scr directory > cd.. ~/Analysis_Transcriptomic> mkdir scr > cd scr #import the following scripts: * DifferentialExpression.R * seqfromgenome.py # make it executable ~/Analysis_Transcriptomic/scr> chmod +x DifferentialExpression.R ~/Analysis_Transcriptomic/scr> chmod +x seqfromgenome.py IV. Runing transcriptomic analysis steps: write the following commands: > cd > cd Analysis_Transcriptomic > conda activate snakenv > snakemake Then: (snakenv) User ~/Analysis_Transcriptomic> pip install bin/htseq > conda install -c r r-gplots > conda install -c r r-RColorBrewer > snakemake after runing the snakefile, deactivate the conda environment suing the following command: > conda deactivate V. Analysis A. Data quality Start by checking the quality and correct for any quality problems encountered. This is done by using the FastQC program that takes fastq file as input. --- rule Quality in snakemake file (snakefile) --- Result are stored in directory 1_Quality. To see the html file open them using the following command: 1_Quality> opera *.html B. Trimming In this part, reads that have be corrected are processed (part are removed) by using the software trimmomatic. --- rule trimming in snakemake file (snakefile) --- Result are stored in directory 2_trimming. C. Index Here we index the genome file, creating a genome index called genome by using the software HISAT2. --- rule index in snakemake file (snakefile) --- Result are stored in directory 3_mapping. D. Mapping The cleaned reads will be used for mapping against the genome by using the software HISAT2. This is a splice junction aware aligner that can map different parts of the reads to different exons if the reads spans two or more exons --- rule mapping in snakemake file (snakefile) --- Result are stored in directory 3_mapping. E. Reads counts Here we use the software htseq to see the number of reads that mapped to each gene. --- rule readcounts in snakemake file (snakefile) --- Result are stored in directory 4_Read_counts. F. Differential Expression The differential expression analysis will be perform by using the DESeq2 package. We run the script DifferentialExpression.R. Here, we use a simple experiment design with the aim of investigate the difference between poor and rich conditions. Overexpression will mean that genes are overexpressed in the forest hot(fh) replicates as compared to the control(ref). The control replicates are grown in poor media and the forest hot in a rich media. For each gene the difference in (normalized) counts for the two conditions can be assigned a p-value, which is ajusted using the Benjamini-Hochberg method. Each gene is given a corrected p-value called q-value or p-adj. Results are stored in directory 5_differential_expression: --- rule DifferentialExpression in snakemake file (snakefile) --- *normalized.count:The gene expression is normalized using statistical methods. *diffExp.tab (all calculations), where: baseMean - mean estimated from both conditions (with normalization by size factors) log2FoldChange - log2 of the ratio lfcSE - standard error of log2 fold change stat - the Wald test statistics pval - uncorrected p-value from the negative binomial test padj - p-value adjusted for multiple testing using Benjamini-Hochberg to estimate the false discovery rate *diffExp.0.01.tab: A list of the genes that have a p-adj value below 0.01 *diffExp.0.05.tab: A list of the genes that have a p-adj value below 0.05 G. MORE EXPRESSED GENES The name and sequences (CDS) of the top most significantly (Padj < 0.01) deferentially expressed genes are stored in two files in the directory 5_differential_expression. --- rule top_more_expressed --- * topmorexpressed: the 20 most significantly (Padj < 0.01) deferentially expressed genes. First column (name of the gene), second column (log2FoldChange) --- rule rule sequences --- * morexpresedseq.fasta: fasta file containing the CDS sequences of the top 9 genes more expressed. H. VISUALIZE A few graphs are made to visualize the results, which are stored in directory 6_visualize: * fig1.pdf: Expression versus significance * fig2.pdf: Expression heatmap for the most expressed genes * fig3.pdf: Heatmap of similarity between replicates * fig4.png: Principal component analysis plot you can see the figures using the following command: 6_visualize > opera * I. file running.log in Analysis_Transcriptomic directory stores stderr of some programs run by snakemake.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published