Mouse_RNA_Seq_p53_genotoxic.Rmd

---
title: "RNA Seq for STAT736"
author: "Alex Soupir"
date: "9/10/2019"
output: 
  html_document:
    keep_md: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# **Background**

For STAT736-Fall-2019, we are analyzing the RNA-Seq from the publication [Genome-wide analysis of p53 transcriptional programs in B cells upon exposure to genotoxic stress in vivo.](https://www.ncbi.nlm.nih.gov/pubmed/26372730?dopt=Abstract) We are only using the [sequences](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP061386) *B cells from spleen* and not the *non-B cells from spleen* from the SRA Run Selector on NCBI.

The mice were exposed to whole-body ionizing radiation and sequences were extracted from both Bcells and non-B cells from the spleens of the mice. Two genotypes of mice were used: mice with p53 knocked out and the wild-type C57/Bl6. There were 4 different group combinations including the 2 different genotypes; each genotype was subjected to the ionizing radiation as well as control/mock.

```{r experimental_design, echo=FALSE}
library(knitr)
experimental_design = data.frame('Genotype' = c('p53', 'C57/Bl6', 'p53', 'C57/Bl6'), 'Treatment' = c('Mock', 'Mock', 'IR', 'IR'))
rownames(experimental_design) = c('Group 1', 'Group 2', 'Group 3', 'Group 4')

library(kableExtra)
library(reticulate)
kable(experimental_design[1:4,1:2]
      #, "latex"
      , caption = 'Treatment groups of the mice that were either controls or treated with ionizing radiation to determine reaction of p53.'
      #,booktabs = T
      ) %>%
  kable_styling(latex_options = c("striped", "repeat_header"))
```

This document will contain 2 different pipelines: The first one is going to be using the genome to map the reads too, and the secod is going to be *de novo*.

[Genome](#genome-mapping)
[*De novo*](#denovo-assembly)

#Genome Mapping {#genome-mapping}

The pipeline used in this analysis used **conda** on South Dakota State University's High Performance Computing cluster to run the programs **FastQC**, **Trimmomatic**, and **Tophat**. 

This is different than previous RNA-Seq analyses where I used my workstation pc with **Ubuntu 18.04** to run **FastQC**, **Trimmomatic**, **HiSat2**, **HTSeq**, and **DESeq2** locally. Also, the previous RNA-Seq alayses were of Soybean with treatment combinations of mycorrhizae and rhizobia inoculation.

# **Cleaning Data**

## Programs used?

+ FastQC
+ Trimmomatic 0.39
+ Bowtie 2.2.5.0
+ Tophat 2.1.1
+ STAR
+ Cufflinks
+ Kraken2
+ MultiQC
+ featureCount

### Picking the right node

To find a node that we can use on our own, we need to see which nodes are already allocated to jobs and which ones are idle. To do this, we can run **sinfo**. We want to pick one of the nodes that are marked 'idle' so we get the whole thing and we aren't interrupting someone elses job. For the sake of this exercise, lets work on **big-mem**.

Once a node that is idle has been found, you can *ssh* into it by typing **ssh -X big-mem00#** where # is the node number.
```{bash, eval=FALSE}
ssh -X big-mem005
```

Once on the node, the modules will have to be pulled from the shared folder again, otherwise we will be left with very basic ones. **NOTE:** if running programs that are in a personal folder such as miniconda (these examples), it is not necessary to add the other modules.

```{bash, eval=FALSE}
module use /cm/shared/modulefiles_local/
```

After loading the modules you can use it just as you would any other command line.

### Creating slurm scripts

When running on a cluster, it can sometime be difficult to find open nodes with the resources needed to run the jobs that we have. Making a slurm script is really easy. Fist we make a new file with the *touch* command.

```{bash, eval=FALSE}
touch commands.slurm
```

Now in our directory we have the file **commands.slurm** which we can edit to hold our code in. We can edit it with the *vi* command.

```{bash, eval=FALSE}
vi commands.slurm
```

We have a few things that we need to put in the file header so slurm knows what to do with our commands.

```{bash, eval=FALSE}
#!/bin/bash

#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --output=job-%j-%N.log
#SBATCH --partition=bigmem
#SBATCH --time=10:00:00
```

When we break this down, we see *--job-name* which is what we will see when we look at whats running later, *--nodes* is the number of nodes we have, *--ntasks-per-node* is the number of cores that we are requesting to have allocated, *--output* is the output log file of the job (here it names the output file with the job number and the node that we used), *--partition* here is requesting a big-mem node but **compute** can also be used, and finally *--time* is how long we are requesting the allocation for. 

If the time runs out before the job is done I believe that it just kills the job even if not finished so we need to think a little about how much time to set. If the time is set too low, the job is killed and if the time is set too long, we may face issues with getting the node allocated to us.

To submit a job we can use *sbatch commands.slurm* and then we have the job ID. To check the status of our submission we use *sbatch* and then it shows all of the submitted jobs and how long they have been running along with the name that we set in the script.

## Acquiring sequences

To download the sequences from the sequence read archive (SRA), the SRA Toolkit was used. The downloading of the files took a very long time, so this was left to run over night. The **--gzip** was used to keep the files a relatively small, although this can be left out to download uncompressed files, and **--split-files** was used to split the forward read from the reverse read for paired end read trimming through **Trimmomatic**.

```{bash, eval=FALSE}
~/tools/sratoolkit.2.9.6-1-centos_linux64/bin/fastq-dump --gzip --split-files SRR2121770
```

This is an example of the single file, but the above code needed to be ran for all of the following SRA numbers:

+ SRR2121770
+ SRR2121771
+ SRR2121774
+ SRR2121775
+ SRR2121778
+ SRR2121779
+ SRR2121780
+ SRR2121781
+ SRR2121786
+ SRR2121787
+ SRR2121788
+ SRR2121789

The results from downloading with **--split-files** gives 2 files per SRR, as mentioned before, one forward and one reverse. The suffix of the split files is one with **\_1.fastq.gz** and another with **\_2.fastq.gz**.

## FastQC

FastQC can be run on all of the read files by using the wild card (\*) as in **\*.fastq.gz**. This prevents the need to hard code each individual read file into a FastQC command, which saves a lot of time since there are 24 read files in total for these 12 samples.

```{bash, eval=FALSE}
~/miniconda2/bin/fastqc *.fastq.gz
```

The output from running FastQC is a zipped folder and an HTML file for each of the **\fastq.gz** files in the folder. The HTML document looks something like this:

![FastQC of Raw SRR2121770_1.fastq.gz Read](./TopFastQCRaw.PNG)

This is just the top of the file, and every category under the **Summary** heading has a graph that shows how the read quality looks for that particular metric. These reports can give insight into whether the reads are of decent quality or if the quality is poor. 

The raw reads we have here all passed for adapter content and sequence length distribution and everything failed per base sequence content. SRR2121770, SRR2121771, SRR2121774, SRR2121775, SRR2121788, SRR2121781-2, and SRR2121789-1 were fairly decent quality reads. SRR2121778, SRR2121779, SRR2121780, SRR2121786, SRR2121787, SRR2121781-1, and SRR2121789-2 were of fairly lower quality (failing 3 or more in both reads. All of them failed both per base sequence quality and per tile sequence quality. 

### MultiQC

First lets install **multiqc** with conda. The command for this is ***conda install \-c bioconda multiqc***.

When that is finished, we can run MultiQC in the folder with the QC files (they should be moved into a folder alone so things don't get cluttered later on in the analyses).

```{bash, eval=FALSE}
~/miniconda/bin/multiqc .
```

When MultiQC is finished running, there will be a new folder called **multiqc_data** where the summaries are stored. Now lets go back up a level where our raw data folder and fastqc folder is and make a new folder for all of our MultiQC data. We will copy the FastQC output from MultiQC to this new folder.

```{bash, eval=FALSE}
mkdir MultiQC_All

cp RawQC/multiqc_data/multiqc_fastqc.txt MultiQC_All/
```

## Trimming with Trimmomatic

Conda was used again to run Trimmomatic. This isn't as easy as using the wildcard like with FastQC because each output has to be personalized for the read files that are input into Trimmomatic. Also, we have to make sure that the adapter sequences are in the same folder that we are running so we can refer to them easily when calling the Trimmomatic program. In this case, we are using the TruSeq3-PE-2.fa adapter sequences For example:

```{bash, eval=FALSE}
~/miniconda2/bin/trimmomatic PE SRR2121770_1.fastq.gz SRR2121770_2.fastq.gz 770_fp.fq.gz 770_fu.fq.gz 770_rp.fq.gz 770_ru.fq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2keepBothReads LEADING:3 TRALING:3 MINLEN:36 &
```

This would be repeated for each of the pairs (12 in total). We are trimming paired-end reads with the TruSeq3-PE-2 adapters. We are chopping off the first and last 3 bases and if we end up with a sequence less than 36 bases, we get rid of it. We want to make sure that there are enough bases in a read to work with. These parameters can be tweaked for possibly better end results with less being discarded. When Trimmomatic is finished running, it will out put the total number of reads, the total number from both the forward and reverse reads that are kept, the number of only forward reads kept, the number of only reverse reads kept, and the number of discarded reads.

The highest number of reads dropped was from trimming SRR2121786, where 20.55% dropped. Most reads were between 5% and 10% dropped. SRR2121786, SRR2121787, and SRR2121779 had sequence drops greater than 15%.

The trimmed reads can be analyzed again with FastQC to see how well the trimming worked to make the file better quality. After running FastQC on the trimmed files we see that the quality of those that were really bad quality were improved. There were a few different metrics throughout all of the files that bounced from a warning before the failing, or from passing before to a warning, and so forth, overall creating better quality read files.

## Alignment

### Using Tophat

Tophat can be installed using the same **conda install** (***conda install \-c bioconda tophat***). When this is finished installing, then we will need to get the mouse genome from the [Johns Hopkins Univeristy Center for computational BIology](http://ccb.jhu.edu/software/tophat/igenomes.shtml). The version of the mouse genome that I am using here is the [NCBI build37.2](ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz). Instead of downloading this from the website and having to move it to the cluser, I will just download it using wget into the folder that has the raw reads, trimmed reads, and the FastQC files.

```{bash, eval=FALSE}
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz
```

This will take a long time to download because the file is a little less than 16GB zipped.

We notice here that we have a zipped tar file. To make this file easier to use, lets unzip it.

```{bash, eval=FALSE}
tar zxvf Mus_muculus_NCBI_build37.2.tar.gz
```

Since Tophat is requiring **\*.bt21** files (large index) and the files downloaded for the genome above are only small index files, we have to create a large index using **bowtie2-build**. For this, lets navigate to the WholeGenomeFasta folder within the extracted folder and then run **bowtie2-build**.

```{bash, eval=FALSE}
~/miniconda2/bin/bowtie2-build --large-index genome.fa genome
```

This process took about 26 minutes to run. Now lets copy the index files to a folder close to our reads so we can access them easier, rather than having to refer to the longer path where we build them. After they are copied to a new folder closer to our working directory, I went ahead and unzipped the trimmed read files to try and make the Tophat faster but it turned out not to work. The multicore call with **\-p** didn't use more cores than 1 until **bowtie2-align-s**, then 20 cores were used.

```{bash, eval=FALSE}
~/miniconda2/bin/tophat --no-converage-search -p 20 -G Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -0 770_thout ./Index/genome 770_fp.fq.gz 770_rp.fq.gz 770_fu.fq 770_ru.fq
```

This run took almost 3 hours to complete.. Running with 80 cores rather than 20 cores took just 4 minutes less, so the whole process must be limited by a single core and the core clock speed. The process does use close to 8,000% at its peak so there is a benefit to multicore, just isn't very scalable.

### Using STAR

STAR can be installed the same way as the previous programs with **conda install** (***conda install \-c bioconda star***). In order to run STAR, we need to creaate indices just like with tophat, but STAR has this built in. I'm going to be using the same genome and GTF file as previously downloaded, but Dr. Ge uses a different zipped genome from the *gencode* database.

```{bash, eval=FALSE}
~/miniconda2/bin/STAR \
--runThreadN 80 \
--runMode genomeGenerate \
--genomeDir starIndex \
--genomeFastaFiles Index/genome.fa \ #same when we made the bowtie indices 
--sjdbGTFfile Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf
```

With the index files made, we can start aligning with STAR. It's important here than we only pick the paired end reads and not use all of the reads. Tophat is able to use all 4 reads but STAR doesn't allow that, so we need to make sure that we feed in the large files from trimming.

```{bash, eval=FALSE}
~/miniconda2/bin/STAR --runThreadN 80 --genomeDir starIndex --readFilesIn 770_fp.fq 770_rp.fq --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix 2121770 --outSAMtype BAM SortedByCoordinate
```

## Assembling transcripts with Cufflinks

Once STAR is done running, we can assemble the transcripts with Cufflinks. This can also be installed with **conda install** (***conda install \-c bioconda cufflinks***).

```{bash, eval=FALSE}
~/miniconda2/bin/cufflinks -p 20 -o SRR2121771_clout --library-type fr-firststrand 2121770Aligned.sortedByCoord.out.bam
```

## Checking for Contamination

### PhiX contamination

Now we will look at what kind of contamination we are looking at. When samples are sequenced with Illumina, a PhiX control is run along side them. This control is for cluster generation, sequencing, alignment, and calibration for cross-talk matrix generation. We will use Bowtie to create a file to determine the PhiX contamination level.

```{bash, eval=FALSE}
~/miniconda2/bin/bowtie2 -p 20 -x PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome \
-1 TrimmedReads/770_fp.fq  -2 TrimmedReads/770_rp.fq -S phix.sam &> PhiXout/SRR2121770_phix.out
```

When the job is done running, the output file will show how much PhiX contamination we have. For example, lookin at the **SRR2121770_phix.out** created above, we see that 0.11% of the reads aligned with PhiX. The lower this value the better.

### rRNA Sequences

To retreive the rRNA sequences for mouse, we need to search the taxonomy database on NCBI for *Mus musculus*. Click on *Mus musculus* on the next page, and then the top *Mus musculus* at the head of the list. Now, select the top subtree link in the **Nucleotide** database. Select rRNA sequences on the left side of the page and download full list just downloading with Send > Complete Record > File > FASTA > Create File. Drag the file using WinSCP to the raw folder on the cluster and rename it to rRNA.fa.

We are going to need to install **bwa** with conda in order to get the alignments to work. This can be done with ***conda install -c bioconda bwa***. Following this, we will need to make indixes for the rRNA that we downloaded. To make this more clean, lets make a directory for the rRNA sequences that we downloaded and the indices that we make.

```{bash, eval=FALSE}
mkdir rRNA
```

Then we move the **rRNA.fa** to the new **rRNA** folder with WinSCP and then we can run the bwa.

```{bash, eval=FALSE}
time ~/miniconda2/bin/bwa mem -t 20 rRNA/rRNA.fa TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq > rnaAlign/770_rna.sam
```

When we are done creating the new *\*.sam* files for all of the forward/reverse read combinations, we can use samtools to convert the *\*.sam* file to *\*.bam* files which are essentially the same file just that *sam* is easier for us to look at while *bam* is binary. Samtools can be installed with ***conda install \-c bioconda samtools***.

```{bash, eval=FALSE}
~/miniconda2/bin/samtools view -@ 10 -bS -o rnaAlign/770_rna.bam rnaAlign/770.sam
```

Now in the rnaAlign folder we have our sam and bam file for each of the libraries. Lets create an output file with *flagstat*.

```{bash, eval=FALSE}
~/miniconda2/bin/samtools flagstat -@ 10 rnaAlign/770_rna.out
```

Wihtin this file we will be able to see the summary of our alignments to the rRNA file that we downloaded from NCBI.

```{bash, eval=FALSE}
#From 770_rna.out
205559289 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
85 + 0 supplementary
0 + 0 duplicates
4265179 + 0 mapped (2.07% : N/A)
205559204 + 0 paired in sequencing
102779602 + 0 read1
102779602 + 0 read2
4151684 + 0 properly paired (2.02% : N/A)
4187608 + 0 with itself and mate mapped
77486 + 0 singletons (0.04% : N/A)
4222 + 0 with mate mapped to a different chr
1026 + 0 with mate mapped to a different chr (mapQ>=5)
```

### Bacterial contamination

In order to find out the contamination, we need to install Kraken2 with ***conda install \-c bioconda kraken2*** and download a pre-built database containing bacteria, archaea, and viral sequences. The database we are going to download only contains about 5% of k-mers from the original database (but directions are sort of lacking to build an entirely new database). More information can be found at https://ccb.jhu.edu/software/kraken/ for the pre-built databases.

Using the code in the next chunk will download the 8GB database and then extract the files so we can use them with the **Kraken2** program. Lets do this in the main project folder.

```{bash, eval=FALSE}
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz

tar xzf minikraken2_v2_8GB_201904_UPDATE.tgz
```

Now lets make a directory for the output.

```{bash, eval=FALSE}
mkdir krakenOut
```

We can call **Kraken** with the extracted database folder and point it to the location of out paired end reads from trimming and to the output folder that we just created for the outputs.

```{bash, eval=FALSE}
~/miniconda2/bin/kraken2 --db minikraken2_v2_8GB_201904_UPDATE/ --output krakenOut/770.out --threads 10 --paired TrimmedReads/770_fp.fq TrimmedReads/770_rp.fq
```

When Kraken is done running, it will print out the number (and percentage) of reads that were classified. In this case, we have used 102779602 sequences, of which 19142843 sequences were classified (18.63%) and 83636759 sequences were unclassified (81.37%). My interpretation of this is that 18.63% of the reads are possibly from microbial cell contamination.

## Counting Transcripts

Since we have the **bam** files from the alignments of the different samples, we can count the features for each and get the transcipt counts using featureCounts form ***conda install \-c bioconda/label/cf201901 subread***. The genome and annotations that we previously downloaded were from genome **mm9** so we have to specify to *featureCounts* what we want to actually count. FeatureCounts defaults to using **gene_id** which our output bam files don't have described correctly for *featureCounts* to read them. This is a single line of code because we can use a wildcard to run through all of the **bam** files. 

```{bash, eval=FALSE}
#Move to the Star Alignment output folder for a working directory
cd StarOut

~/miniconda2/bin/featureCounts -a /gpfs/scratch/alex.soupir/Mus/raw/Mus_musculus/NCBI/build37.2/Annotation/Archives/archive-2015-07-17-14-32-40/Genes/genes.gtf -g 'transcript_id' -o readCounds.txt *bam
```

With the files that we are working with, this will take between 3.5 minutes to 5 minutes per **bam** file. The output will be a file that can be imported into excel and saved as csv which we then can work with in R.

### Final QC of cleaning the data

Lets look at the data that we have collected from all of the MultiQC runs that we had with initial FastQC, Trimmomatic, STAR alignment, PhiX contamination, rRNA contamination, and the final feature counts.

```{r}
qc = read.csv('Whole Data QC.csv', header=TRUE, na.strings="")
kable(qc) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")
```

# **Differentially Expressed Sequence Identification**

Programs Used

+ R
+ RStudio

Packages

+ readr (***install.packages('readr')***)
+ limma (***BiocManager::install('limma')***)
+ DESeq2 (***BiocManager::install('DESeq2')***)
+ dplyr (***install.packages("dplyr")***)
+ ggplot2 (***install.packages("ggplot2")***)
+ gplots (***install.packages("gplots")***)
+ Annotations (***BiocManager::install('AnnotationDbi')***)
+ org.Hs.eg.db (***BiocManager::install('org.Hs.eg.db')***)
  + This is for Human
+ org.Mm.eg.db (***BiocManager::install('org.Mm.eg.db')***)
  + This is for Mouse
+ ggrepel (***install.packages("ggrepel")***)
+ ReportingTools (***BiocManager::install('ReportingTools')***)
+ GO.db (***BiocManager::install('GO.db')***)
+ GOstats (***BiocManager::install('GOstats')***)
+ pathview (***BiocManager::install('pathview')***)
+ gage (***BiocManager::install('gage')***)
+ gageData (***BiocManager::install('gageData')***)
+ select (***BiocManager::install('Select')***)

With these, you most certainly will have to step through each and install extra things when you start calling the packages. Take it step by step to ensure that each dependency is installed. 

## Analyzing Reads Counts

When the count file is completed, we can import it into R and start working with it to determine differentially expressed genes. First we will import it into R

```{r, message=FALSE}
library(limma)
library(DESeq2)
library(dplyr)
library(readr)

countData = read_csv("readCounts.csv", skip = 1)
```

This gives us our dataframe from out featureCounts program, but if we look at the data we see that featureCounts added some extra information that characterizes each gene_id.

```{r}
kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "320px")
```

We also need to set out row names to the gene_id. We will do some data frame manipulation and then look at the data again.

```{r}
countData = as.data.frame(countData)
rownames(countData) = countData$Geneid
countData = countData[,-c(1:6)]
kable(head(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

### Quick Data Exploration

```{r}
dim(countData)
```

```{r}
kable(summary(countData)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

Let's also go ahead and change the names to describe out data a little better.

```{r}
columns = c('Trp53m_mock_1', 'Trp53m_mock_2', 'Trp53m_4h7Gy_1', 'Trp53m_4h7Gy_2', 'Trp53p_mock_1', 'Trp53p_mock_2', 'Trp53p_mock_3', 'Trp53p_mock_4', 'Trp53p_4h7Gy_1', 'Trp53p_4h7Gy_2', 'Trp53p_4h7Gy_3', 'Trp53p_4h7Gy_4')
colnames(countData) = columns
```

Here we have to make sure that we convert the *+/+* and *-/-* to characters. These characters

```{r}
par(mar=c(8,4,4,1)+0.1)
barplot( colSums(countData)/1e6, col="green",las=3,main="Total read counts (millions)", ylab="Total read counts in millions")
```

```{r}
hist(countData[,1], br=200, xlab="Number of Reads Counts per Feature", main="Histogram of Read Counts for Trp53-/- Mock")
```

We can see that our count data is highly skewed to the right. This is a great case for using **log** transformation!

```{r}
logCountData = log2(1+countData)
par(mfrow = c(1, 2), mar=c(8,4,4,1))  # two columns
hist(logCountData[,1], main="Histogram of Log Read Counts", xlab="Log transformed counts")
boxplot(logCountData,las=3, main="Boxplot of Log Read Counts")
```

```{r }
x <- logCountData
myColors = rainbow(dim(x)[2])
plot(density(x[,1]),col = myColors[1], lwd=2,
     xlab="Expresson values", ylab="Density", main= "Distribution of transformed data",
     ylim=c(0, max(density(x[,1])$y)+.02 ) )
  
for( i in 2:dim(x)[2] )
lines(density(x[,i]),col=myColors[i], lwd=2)
legend("topright", cex=1.1,colnames(x), lty=rep(1,dim(x)[2]), col=myColors )	
```

```{r}
plot(logCountData[,1],logCountData[,2], xlab="Trp53-/- mock replication 1", ylab="Trp53-/- mock replication 2")
```

### Filtering, Normalization, and Trasformation using DESeq2

We have to make the experiment design into a small dataframe so we can tell DESeq how we want to analyze the data. Here will will make a small table that has the rep names that we changed the column names to previously, and then a column for which columns are Trp53+/+ or Trp53-/-, and which columns were control mice and which columns were treated with ionizing radiation.

```{r }
detectGroups <- function (x){  # x are col names
  tem <- gsub("[0-9]*$","",x) # Remove all numbers from end
  #tem = gsub("_Rep|_rep|_REP","",tem)
  tem <- gsub("_$","",tem); # remove "_" from end
  tem <- gsub("_Rep$","",tem); # remove "_Rep" from end
  tem <- gsub("_rep$","",tem); # remove "_rep" from end
  tem <- gsub("_REP$","",tem)  # remove "_REP" from end
  return( tem )
}

groups = as.character ( detectGroups( colnames( countData ) ) )
groups
p53 = c("m", "m", "m", "m",
        "p", "p", "p", "p", "p", "p", "p", "p")
treatment = c("control", "control", "IR", "IR", "control", "control", "control", "control",
              "IR", "IR", "IR", "IR")

```

```{r}
colData = cbind(colnames(countData), p53 )
colData

```

```{r}
colData = as.data.frame(cbind(colnames(countData), p53, treatment))
colData
str(colData)
```

Creating a DESeq Dataset

```{r message = FALSE }
dds = DESeqDataSetFromMatrix(countData=countData,
                             colData=colData,
                             design= ~ p53+treatment+p53*treatment)   # note that the study design is changed.
dds = DESeq(dds)  # main function
nrow(dds)
```

Filtering: we will only keep rows that have a sum count between all samples greater than 5. This will remove most of the genes that mostly have "0" counts.

```{r}
dds <- dds[ rowSums(counts(dds)) > 5, ]
nrow(dds)
```

Regularized log transformation - used for clustering

```{r}
rld <- rlog(dds, blind = FALSE)
kable(head(assay(rld), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

Variance Stabilizing Transformation

# Interactions cause a difference between the lfc betwen pooled data, e.g. p53+/+ (control and IR) and p53-/- (control and IR)

```{r}
vsd <- vst(dds, blind = FALSE)
kable(head(assay(vsd), 6)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

For the log2 approach, we need to first estimate size factors to account for sequencing depth, and then specify normalized=TRUE. Sequencing depth correction is done automatically for the rlog and the vst.

Size Factor

```{r}
dds <- estimateSizeFactors(dds)
kable(sizeFactors(dds)) %>%
  kable_styling() %>%
  scroll_box(width = "300px", height = "520px")
```

We will first look at the log transformed data

```{r}
slog <- log2(counts(dds, normalized=TRUE)+1)
kable(head(slog)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

```{r}
par(mfrow = c(1, 3))  # 3 columns
plot(slog[,1],slog[,2])
plot(assay(rld)[,1],assay(rld)[,2])
plot(assay(vsd)[,1],assay(vsd)[,2])
```

As the log transformation constant increases, the information of the data is lost.

```{r}
par(mfrow = c(1, 3))  # 3 columns
slog <- log2(counts(dds, normalized=TRUE)+1)
plot(slog[,1],slog[,2])
slog <- log2(counts(dds, normalized=TRUE)+4)
plot(slog[,1],slog[,2], xlim=c(0,20))
slog <- log2(counts(dds, normalized=TRUE)+20)
plot(slog[,1],slog[,2], xlim=c(0,20))
```

```{r}
library("dplyr")
library("ggplot2")

df <- bind_rows(
  as_data_frame(slog[,1:2]) %>%
         mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))
  
colnames(df)[1:2] <- c("x", "y")  

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)

```

### Exploratory Data Analysis

PCA plot

```{r}
plotPCA(rld, intgroup = c("p53", "treatment")) + theme(aspect.ratio=1)
```

A prettier PCA plot created with GGPlot2

```{r}
pca.object <- prcomp(t(assay(rld))) # PCA 
pcaData = as.data.frame(pca.object$x[,1:2]); 
pcaData = cbind(pcaData,detectGroups(colnames(assay(rld)) ))
colnames(pcaData) = c("PC1", "PC2", "Type")
percentVar=round(100*summary(pca.object)$importance[2,1:2],0)
#plot
p=ggplot(pcaData, aes(PC1, PC2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab(paste0("PC1: ",percentVar[1],"% variance")) 
p=p+ylab(paste0("PC2: ",percentVar[2],"% variance")) 
p=p+ggtitle("Principal component analysis (PCA)")+coord_fixed(ratio=1.0)+ 
    theme(plot.title = element_text(size = 16,hjust = 0.5)) + theme(aspect.ratio=1) +
    theme(axis.text.x = element_text( size = 16),
    axis.text.y = element_text( size = 16),
    axis.title.x = element_text( size = 16),
    axis.title.y = element_text( size = 16) ) +
  theme(legend.text=element_text(size=16))
print(p)
```

Multidimensional Scaling Plot

```{r}
dist2 <- function(x, ...)   # distance function = 1-PCC (Pearson's correlation coefficient)
  as.dist(1-cor(t(x), method="pearson"))

fit = cmdscale( dist2(t(assay(rld))) , eig=T, k=2)
mdsData <- as.data.frame(fit$points[,1:2]); 
mdsData <- cbind(mdsData,detectGroups(colnames(assay(rld))) )
colnames(mdsData) = c("x1", "x2", "Type")
	
p<-ggplot(mdsData, aes(x1, x2, color=Type, shape = Type)) + geom_point(size=5) 
p=p+xlab("Dimension 1") 
p=p+ylab("Dimension 2") 
p=p+ggtitle("Multidimensional scaling (MDS)")+ coord_fixed(ratio=1.)+ 
     theme(plot.title = element_text(hjust = 0.5)) + theme(aspect.ratio=1) +
	 	 theme(axis.text.x = element_text( size = 16),
        axis.text.y = element_text( size = 16),
        axis.title.x = element_text( size = 16),
        axis.title.y = element_text( size = 16) ) +
	   theme(legend.text=element_text(size=16))
print(p)
```

Creating a heatmap

```{r heatmap, message = FALSE,fig.width=7,fig.height=7}
library(gplots)

hclust2 <- function(x, method="average", ...)  # average linkage in hierarchical clustering
  hclust(x, method=method, ...)

n=100 # number of top genes by standard deviation

x = assay(rld)
if(n>dim(x)[1]) n = dim(x)[1] # max	as data

x = x[order(apply(x,1,sd),decreasing=TRUE),]  # sort genes by standard deviation

x = x[1:n,]   # only keep the n genes

# this will cutoff very large values, which could skew the color 
x=as.matrix(x[1:n,])-apply(x[1:n,],1,mean)
cutoff = median(unlist(x)) + 4*sd (unlist(x)) 
x[x>cutoff] <- cutoff
cutoff = median(unlist(x)) - 4*sd (unlist(x)) 
x[x< cutoff] <- cutoff
	
groups = detectGroups(colnames(x) )
groups.colors = rainbow(length(unique(groups) ) )


	lmat = rbind(c(5,4),c(0,1),c(3,2))
	lwid = c(1.5,4)
	lhei = c(1,.2,4)


heatmap.2(x, distfun = dist2,hclustfun=hclust2,
	 col=greenred(75), density.info="none", trace="none", scale="none", keysize=.5
	,key=T, symkey=F
	,ColSideColors=groups.colors[ as.factor(groups)]
	,margins=c(8,12)
	,cexRow=1
	,srtCol=45
	,cexCol=1.  # size of font for sample names
	,lmat = lmat, lwid = lwid, lhei = lhei
	)

```

### Identification of Differentially Expressed Genes

```{r, message=FALSE}
dds <- DESeq(dds)
res <- results(dds)

kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

```

DESeq2 uses the Benjamini-Hochberg (BH) adjustment (Benjamini and Hochberg 1995) as implemented in the base R p.adjust function

```{r}
res <- results(dds, alpha = 0.5, lfcThreshold=0.01)
summary(res)
```

Now lets sort genes by fold change

```{r}
res <- res[order(abs( res$log2FoldChange), decreasing=TRUE),]
kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

MA Plot

Show the significant genes. The lower the average read counts for all samples and the higher the variation between the samples, the less significant those genes are. 

```{r}
DESeq2::plotMA(res,  ylim = c(-5, 5))
```

Volcano plot

```{r}
library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.05, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<0.5),'sig'] <- "Not Sig"


p = ggplot(res1, aes(log2FoldChange, -log10(padj))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p
```

### Gene Annotations

Plot counts of top gene

```{r}
topGene <- rownames(res)[1]
plotCounts(dds, gene = topGene, intgroup=c("p53", "treatment"))
```

Here we see an interesting point or our normalized counts under the Trp53p_4h7Gy group that seems to be extremely high, while the other 3 replicates are around 0.5. I cannot get this portion to work, however. I get an error stating that None of the keys entered are valid keys for 'SYMBOL'. 

Let's look at the keys that we have to work with for our ***res*** file from DESeq2

```{r}
head(row.names(res))
```

Now we need to find the same key in the Mm database.

```{r, message=FALSE}
library(AnnotationDbi)
library(org.Mm.eg.db)

columns(org.Mm.eg.db)
#key = gsub("\\..*","", row.names(res))
res$symbol <- gsub("\\..*","", row.names(res))
#res$symbol <- gsub(" ","",row.names(res)) 
```

```{r}
message("Ensembl IDs")
key.en = keys(org.Mm.eg.db, keytype="ENSEMBL")
head(key.en)
cat("\n\n")
message("SYMBOL names")
key.sy = keys(org.Mm.eg.db, keytype="SYMBOL")
head(key.sy)
```

These are ENSEMBL symbols, so we need to designate that when looking for the genes that we have. 

```{r}
res$ensembl <- gsub("\\..*","", row.names(res))
res$entrez <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
res$symbol <- mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="SYMBOL",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
write.csv(res, file = "results.csv")
```

```{r}
kable(head(res)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

```

Let's make a file with just the genes with an adjusted p-value < 0.5

```{r}
resSig = as.data.frame(subset(res,padj<0.5) )
resSig = resSig[order(resSig$log2FoldChange,decreasing=TRUE),]
head(resSig)

write.csv(resSig,"SigGenes.csv")
```

Here is a volcano plot that shows the symbol that we created at each point.

```{r}
library(dplyr)
res1 = as.data.frame(res)
# add a new column using the mutate function in dplyr
res1 = mutate(res1, sig=ifelse(res1$padj<0.5, "FDR<0.05", "Not Sig"))
res1[which(abs(res1$log2FoldChange)<1),'sig'] <- "Not Sig"
p = ggplot(res1, aes(log2FoldChange, -log10(pvalue))) +
  geom_point(aes(col=sig)) +
  scale_color_manual(values=c("red", "black"))
p+geom_text(data=filter(res1, padj<1e-50), aes(label=symbol))
```

```{r, eval=FALSE, echo=FALSE}
library(dplyr)
# Install ggrepel package if needed
# install.packages("devtools")
# devtools::install_github("slowkow/ggrepel")
library(ggrepel) # "repels" overlapping text

p+geom_text_repel(data=filter(res1,  abs(log2FoldChange)>10 | padj < 1e-50 ), aes(label=symbol))
```

```{r, eval=FALSE, echo=FALSE}
library("ReportingTools")
htmlRep <- HTMLReport(shortName="report", title="My report",
                      reportDirectory="./report")
publish(resSig, htmlRep)
url <- finish(htmlRep)
#browseURL(url)
```

# 7. GO Enrichment analysis using GOstats

Here we will do a GO Enrichment analysis for genes that have a decreased fold-change of 5 or more

```{r,message=FALSE}
library(GO.db)
library(GOstats)
selectedGenes = unique(resSig[resSig$log2FoldChange>5,'entrez'])  # upregulated genes
universeGenes =  unique( mapIds(org.Mm.eg.db,
                     keys= res$ensembl,
                     column="ENTREZID",
                     keytype="ENSEMBL", #Out ID is ENSMBL
                     multiVals="first")
                    )


 hgCutoff <- 0.001
 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,]
```

```{r}
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
```


```{r}
params1 <- params
ontology(params1) <- "CC"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOCCID","Pvalue","Term")]
```

```{r}
params1 <- params
ontology(params1) <- "MF"
hgOver <- hyperGTest(params1)
summary(hgOver)[1:10,c("GOMFID","Pvalue","Term")]
```

## GO Enrichment analysis  of downregulated genes

Next we will have a look at the genes that are upregulated by a fold-change of 5 or greater. 

```{r,message=FALSE}

selectedGenes = unique(resSig[resSig$log2FoldChange<5,'entrez'])  # upregulated genes

 params <- new("GOHyperGParams",
     geneIds=selectedGenes,
     universeGeneIds=universeGenes,
     annotation="org.Mm.eg.db",
     ontology="BP",
     pvalueCutoff=hgCutoff,
     conditional=FALSE,
     testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,c("GOBPID","Pvalue","Term")]
```

# 8. Pathway analysis using expression data

```{r, message=FALSE}
# bioconductor packages
# source("https://bioconductor.org/biocLite.R");
# biocLite(c("pathview","gage","gageData"))
library(pathview) 
library(gage) 

```

## Prepare data

```{r}
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
```

```{r}
library(gageData)
data(go.sets.mm)
data(go.subs.mm)
gobpsets = go.sets.mm[go.subs.mm$BP]
gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)
#lapply(gobpres, head)
message("Greater")
kable(head(gobpres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Less")
kable(head(gobpres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Stats")
kable(head(gobpres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")

```

## KEGG pathways

```{r}
library(gageData)
data(kegg.sets.mm)
data(sigmet.idx.mm)
kegg.sets.mm = kegg.sets.mm[sigmet.idx.mm]
#head(kegg.sets.mm, 3)

message("Greater")
kable(head(kegg.sets.mm$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Less")
kable(head(kegg.sets.mm$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Stats")
kable(head(kegg.sets.mm$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.mm, same.dir=TRUE)

# Look at both up (greater), down (less), and statatistics.
#lapply(keggres, head, n=10)
message("Greater")
kable(head(keggres$greater)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Less")
kable(head(keggres$less)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
message("Stats")
kable(head(keggres$stats)) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "300px")
```

```{r}
# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>% 
  tbl_df() %>% 
  filter(row_number()<=5) %>% 
  .$id %>% 
  as.character()
keggrespathways
```

```{r}
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
```

```{r, message=FALSE, warning=FALSE}

# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu", new.signature=FALSE)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mmu"))

```

## Pathway and regulation of genes for Oxidative phosphorylation.

![Oxidative Phosphorylation](./Mouse_RNA_Seq_p53_genotoxic_files/figure-html/mmu00190.pathview.png)

## Pathway and regulation of genes for Glycosylphosphatidylinositol(GPI)-anchor biosynthesis.

![Glycosylphosphatidylinositol(GPI)-anchor biosynthesis](./Mouse_RNA_Seq_p53_genotoxic_files/figure-html/mmu00563.pathview.png)

## Pathway and regulation of genes for RNA polymerase.

![RNA polymerase](./Mouse_RNA_Seq_p53_genotoxic_files/figure-html/mmu03020.pathview.png)

## Pathway and regulation of genes for Nucleotide excision repair.

![Nucleotide excision repair](./Mouse_RNA_Seq_p53_genotoxic_files/figure-html/mmu03420.pathview.png)

## Pathway and regulation of genes for Non-homologous end-joining.

![Non-homologous end-joining](./Mouse_RNA_Seq_p53_genotoxic_files/figure-html/mmu03450.pathview.png)


#De novo Assembly {#denovo-assembly}

*AT THIS TIME, MINICONDA3 WAS INSTALLED SO FURTHER TOOLS ARE ALL INSTALLED UNDER THIS*

Having the genome for RNA-sequencing analysis is very useful, but sometimes it is not available so *de novo* assembly is used. *de novo* assembly is using the reads from sequencing to create longer seqeunces called contigs (contiguous sequences). These sequences are then compared to a protein database to get an idea of what proteins are possibly present. The raw sequences are also mapped back to the contigs, treating the contigs as the "genome", much like we do when we have the genome from the host available.

Lets jump right into it. We are going to be using [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki) for the *de novo* assembly and analysis because there are many tools built into the Trinity tool that allow for the building of contigs, counting of sequences, and even DeSeq/edgeR analyses.

To install Trinity, all I did was ran **conda install -c bioconda trinity**. This installs all of the previously mentioned tools in one go. In addition to Trinity, BLAST (**conda install -c bioconda blast**) and RSEM (**conda install -c bioconda rsem**) were installed.

```{bash, eval=FALSE}
Trinity --seqType fq \
--left trimmedReads/SRR2121770_Trimmed_1P.fq.gz,trimmedReads/SRR2121771_Trimmed_1P.fq.gz,trimmedReads/SRR2121774_Trimmed_1P.fq.gz,trimmedReads/SRR2121775_Trimmed_1P.fq.gz,trimmedReads/SRR2121778_Trimmed_1P.fq.gz,trimmedReads/SRR2121779_Trimmed_1P.fq.gz,trimmedReads/SRR2121780_Trimmed_1P.fq.gz,trimmedReads/SRR2121781_Trimmed_1P.fq.gz,trimmedReads/SRR2121786_Trimmed_1P.fq.gz,trimmedReads/SRR2121787_Trimmed_1P.fq.gz,trimmedReads/SRR2121788_Trimmed_1P.fq.gz,trimmedReads/SRR2121789_Trimmed_1P.fq.gz \
--right trimmedReads/SRR2121770_Trimmed_2P.fq.gz,trimmedReads/SRR2121771_Trimmed_2P.fq.gz,trimmedReads/SRR2121774_Trimmed_2P.fq.gz,trimmedReads/SRR2121775_Trimmed_2P.fq.gz,trimmedReads/SRR2121778_Trimmed_2P.fq.gz,trimmedReads/SRR2121779_Trimmed_2P.fq.gz,trimmedReads/SRR2121780_Trimmed_2P.fq.gz,trimmedReads/SRR2121781_Trimmed_2P.fq.gz,trimmedReads/SRR2121786_Trimmed_2P.fq.gz,trimmedReads/SRR2121787_Trimmed_2P.fq.gz,trimmedReads/SRR2121788_Trimmed_2P.fq.gz,trimmedReads/SRR2121789_Trimmed_2P.fq.gz \
--CPU 80 --max_memory 2000G --min_contig_length 150
```

```{bash, eval=FALSE}
TrinityStats.pl trinity_out_dir/Trinity.fasta

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  602870
Total trinity transcripts:      730297
Percent GC: 45.89

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 8589
        Contig N20: 5955
        Contig N30: 4400
        Contig N40: 3276
        Contig N50: 2413

        Median contig length: 413
        Average contig: 1028.24
        Total assembled bases: 750921120


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 5470
        Contig N20: 3434
        Contig N30: 2391
        Contig N40: 1741
        Contig N50: 1285

        Median contig length: 343
        Average contig: 703.04
        Total assembled bases: 423841627
```

```{bash, eval=FALSE}
bowtie2-build trinity_out_dir/Trinity.fasta trinity_out_dir/Trinity.fasta
bowtie2 --local --no-unal -x trinity_out
```

#download uniprot swiss-prot db
#ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

```{bash, eval=FALSE}
gunzip uniprot_sprot.fasta.gz
mkdir -p blast_protdb
mv uniprot_sprot.fasta blast_protdb/.
makeblastdb -in uniprot_sprot.fasta -dbtype prot
cd ..
```

```{bash, eval=FALSE}
blastx -query trinity_out_dir/Trinity.fasta -db blast_protdb/uniprot_sprot.fasta -out blastx.outfmt6 -evalue 1e-20 -num_threads 80 -max_target_seqs 1 -outfmt 6
analyze_blastPlus_topHit_coverage.pl blastx.outfmt6 trinity_out_dir/Trinity.fasta blast_protdb/uniprot_sprot.fasta | column -t

#hit_pct_cov_bin  count_in_bin  >bin_below
100               7361          7361
90                1117          8478
80                981           9459
70                971           10430
60                1003          11433
50                865           12298
40                924           13222
30                1102          14324
20                1478          15802
10                932           16734
```

```{bash, eval=FALSE}
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121770_1.fastq.gz -2 rawReads/SRR2121770_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121770.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121771_1.fastq.gz -2 rawReads/SRR2121771_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121771.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121774_1.fastq.gz -2 rawReads/SRR2121774_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121774.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121775_1.fastq.gz -2 rawReads/SRR2121775_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121775.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121778_1.fastq.gz -2 rawReads/SRR2121778_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121778.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121779_1.fastq.gz -2 rawReads/SRR2121779_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121779.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121780_1.fastq.gz -2 rawReads/SRR2121780_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121780.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121781_1.fastq.gz -2 rawReads/SRR2121781_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121781.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121786_1.fastq.gz -2 rawReads/SRR2121786_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121786.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121787_1.fastq.gz -2 rawReads/SRR2121787_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121787.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121788_1.fastq.gz -2 rawReads/SRR2121788_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121788.coordSorted.bam
$ bowtie2 --local --no-unal -x trinity_out_dir/Trinity.fasta -q -1 rawReads/SRR2121789_1.fastq.gz -2 rawReads/SRR2121789_2.fastq.gz --threads 10 | samtools view --threads 10 -Sb - | samtools sort --threads 10 -o SRR2121789.coordSorted.bam
```

```{bash, eval=FALSE}
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121770_1.fastq.gz --right rawReads/SRR2121770_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121770.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121771_1.fastq.gz --right rawReads/SRR2121771_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121771.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121774_1.fastq.gz --right rawReads/SRR2121774_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121774.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121775_1.fastq.gz --right rawReads/SRR2121775_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121775.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121778_1.fastq.gz --right rawReads/SRR2121778_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121778.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121779_1.fastq.gz --right rawReads/SRR2121779_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121779.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121780_1.fastq.gz --right rawReads/SRR2121780_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121780.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121781_1.fastq.gz --right rawReads/SRR2121781_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121781.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121786_1.fastq.gz --right rawReads/SRR2121786_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121786.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121787_1.fastq.gz --right rawReads/SRR2121787_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121787.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121788_1.fastq.gz --right rawReads/SRR2121788_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121788.RSEM
$ time align_and_estimate_abundance.pl --seqType fq --thread_count $SLURM_NTASKS --left rawReads/SRR2121789_1.fastq.gz --right rawReads/SRR2121789_2.fastq.gz --transcripts trinity_out_dir/Trinity.fasta --est_method RSEM --aln_method bowtie2 --trinity_mode --prep_reference --output_dir SRR2121789.RSEM
```

```{bash, eval=FALSE}
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_trans SRR2121770.RSEM/SRR2121770.RSEM.trans.results \
SRR2121771.RSEM/SRR2121771.RSEM.trans.results \
SRR2121774.RSEM/SRR2121774.RSEM.trans.results \
SRR2121775.RSEM/SRR2121775.RSEM.trans.results \
SRR2121778.RSEM/SRR2121778.RSEM.trans.results \
SRR2121779.RSEM/SRR2121779.RSEM.trans.results \
SRR2121780.RSEM/SRR2121780.RSEM.trans.results \
SRR2121781.RSEM/SRR2121781.RSEM.trans.results \
SRR2121786.RSEM/SRR2121786.RSEM.trans.results \
SRR2121787.RSEM/SRR2121787.RSEM.trans.results \
SRR2121788.RSEM/SRR2121788.RSEM.trans.results \
SRR2121789.RSEM/SRR2121789.RSEM.trans.results --gene_trans_map none
```

```{bash, eval=FALSE}
time abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix Trinity_genes SRR2121770.RSEM/SRR2121770.RSEM.genes.results \
SRR2121771.RSEM/SRR2121771.RSEM.genes.results \
SRR2121774.RSEM/SRR2121774.RSEM.genes.results \
SRR2121775.RSEM/SRR2121775.RSEM.genes.results \
SRR2121778.RSEM/SRR2121778.RSEM.genes.results \
SRR2121779.RSEM/SRR2121779.RSEM.genes.results \
SRR2121780.RSEM/SRR2121780.RSEM.genes.results \
SRR2121781.RSEM/SRR2121781.RSEM.genes.results \
SRR2121786.RSEM/SRR2121786.RSEM.genes.results \
SRR2121787.RSEM/SRR2121787.RSEM.genes.results \
SRR2121788.RSEM/SRR2121788.RSEM.genes.results \
SRR2121789.RSEM/SRR2121789.RSEM.genes.results --gene_trans_map none
```

```{bash, eval=FALSE}
contig_Exn50_statistic.pl Trinity_trans.TMM.EXPR.matrix trinity_out_dir/Trinity.fasta > ExN50_trans.stats
```

Make samples.txt

```{bash, eval=FALSE}
mock-	SRR2121770.RSEM
mock-	SRR2121771.RSEM
IR-	SRR2121774.RSEM
IR-	SRR2121775.RSEM
mock+	SRR2121778.RSEM
mock+	SRR2121779.RSEM
mock+	SRR2121780.RSEM
mock+	SRR2121781.RSEM
IR+	SRR2121786.RSEM
IR+	SRR2121787.RSEM
IR+	SRR2121788.RSEM
IR+	SRR2121789.RSEM
```

The way that the installed version of *run_DE_analysis.pl* tries to install or check for edgeR doesn't work for the new versions of R (3.6.1). In the perl script I just had to edit a few lines under the edgeR section for it to run smoothly, as well as install *BiocManager before hand.

I changed the perl script, under *sub run_edgeR_sample_pair*, when writing the R script to look like:

```{perl, eval=FALSE}
    print $ofh "if (! require(edgeR)) {\n";
    print $ofh "   install.packages(\"BiocManager\")\n";
    print $ofh "   BiocManager::install(\"edgeR\")\n";
    print $ofh "   library(edgeR)\n";
    print $ofh "}\n\n";
```

To make sure BiocManager and edgeR work, they were installed in the R terminal.

```{r, eval=FALSE}
R #runs the installed version of R in terminal mode
install.packages("BiocManager")
BiocManager::install("edgeR")
```

Since we don't have at least 3 replicates for each of the treatments, edgeR wants a dispersion parameter. 0.035 seems to be pretty common. If the dispersion parameter has the squareroot taken of it we get ~ 0.19, meaning that we are saying the true abundance for each gene can vary up or down by 19% between replicates.

```{bash, eval=FALSE}
run_DE_analysis.pl --matrix Trinity_trans.counts.matrix --method edgeR --output edgeR_trans --dispersion 0.035 --samples_file samples.txt
```

```{bash, eval=FALSE}
head edgeR_trans/Trinity_trans.counts.matrix.IR-_vs_mock-.edgeR.DE_results | column -t

                          sampleA  sampleB  logFC           logCPM             PValue                FDR
TRINITY_DN2468_c0_g4_i1   IR-      mock-  -14.7753615825754  4.45315041918     5.51021995613131e-115  5.85185359341145e-110
TRINITY_DN2505_c0_g1_i2   IR-      mock-  -10.1817621187735  4.11815268306374  8.54695600746168e-88   4.53843363996215e-83
TRINITY_DN3231_c0_g2_i15  IR-      mock-  -13.5922145319911  3.27307743751541  6.29068575237937e-81   2.2269027563423e-76
TRINITY_DN4975_c0_g1_i7   IR-      mock-  13.4941229058571   3.17132827495398  5.25978730838885e-79   1.39647353037724e-74
TRINITY_DN4360_c0_g1_i5   IR-      mock-  -13.5215746751431  3.20272600532575  1.0222845072488e-70    2.17133229339645e-66
TRINITY_DN128_c0_g1_i20   IR-      mock-  13.1582688371709   2.83599676163376  1.8443766792831e-69    3.26454672233108e-65
TRINITY_DN644_c1_g1_i6    IR-      mock-  -5.91850038396836  4.31456225100214  8.01440426808572e-67   1.21589961895815e-62
TRINITY_DN2043_c0_g1_i13  IR-      mock-  13.7803059024088   3.45719012794474  3.30519380237759e-66   4.38764477265625e-62
TRINITY_DN3321_c0_g1_i2   IR-      mock-  12.8215942182179   2.49995284585527  9.5059376972195e-61    1.1217006482719e-56
```

```{bash, eval=FALSE}

```

```{bash, eval=FALSE}

```

```{bash, eval=FALSE}

```

```{bash, eval=FALSE}

```

```{bash, eval=FALSE}

```


```{r,echo=FALSE, eval=FALSE}
#BiocManager::install("EnsDb.Mmusculus.v79")
library(GO.db)
library(GOstats)
library(EnsDb.Mmusculus.v79)

res.3 = subset(res.2, !is.na(res.2$entrezgene_id))
selectedGenes = unique(resSig[resSig$log2FoldChange>0,'entrezgene_id'])  # upregulated genes
universeGenes =  unique( mapIds(org.Mm.eg.db,
                     keys=gsub("\\.[0-9]*$","",res.3$entrezgene_id),  # this is causing problems for mapping
                     column="ENSEMBL",
                     keytype="SYMBOL",
                     multiVals="first")
                    )


hgCutoff <- 0.001
params <- new("GOHyperGParams",
              geneIds=selectedGenes,
              universeGeneIds=universeGenes,
              annotation="org.Hs.eg.db",
              ontology="BP",
              pvalueCutoff=hgCutoff,
              conditional=FALSE,
              testDirection="over")

hgOver <- hyperGTest(params)
summary(hgOver)[1:10,]


```

```{r Medicago_trucatula_Database, echo=FALSE, eval=FALSE}
#BiocManager::install("AnnotationHub")
library(AnnotationHub)
hub = AnnotationHub()
query(hub, c("Medicago", "OrgDb"))

Mt.orgdb = hub[["AH72235"]]
Mt.orgdb

```

```{r Glycine_max_Database, echo=FALSE, eval=FALSE}
#BiocManager::install("AnnotationHub")
library(AnnotationHub)
hub = AnnotationHub()

query(hub, c("Glycine", "OrgDb"))

Gm.orgdb = hub[["AH72142"]]
Gm.orgdb

```

```{r, echo=FALSE, eval=FALSE}
library(AnnotationDbi)
keytypes(Mt.orgdb)

#key.Gm = keys(Mt.orgdb, keytype="GENENAME")
#head(key.Gm)
```