rna-seq-preprocessing.Rmd

---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    toc: yes
    toc_float: yes
minutes: 300
layout: page
subtitle: Pre-processsing RNA-seq data
bibliography: ref.bib
---

**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016

```{r knitrOpts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## Resources and data files

This material has been created using the following resources:  
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]  
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html  

Data files downloaded from:  
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60450/suppl/GSE60450_Lactation-GenewiseCounts.txt.gz
http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5.rdata
http://bioinf.wehi.edu.au/software/MSigDB/mouse_H_v5.rdata

Data files:  
sampleinfo.txt  
GSE60450_Lactation-GenewiseCounts.txt  
mouse_c2_v5.rdata  
mouse_H_v5.rdata

Data files available from: [https://figshare.com/s/1d788fd384d33e913a2a](https://figshare.com/s/1d788fd384d33e913a2a)
You should download these files and place them in your `/data` directory.

Packages used:
limma,
edgeR,
gplots,
org.Mm.eg.db,
RColorBrewer,
Glimma

## Overview

* Reading in table of counts
* Filtering lowly expressed genes
* Quality control
* Normalisation for composition bias


## Introduction

Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.

There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today we will be starting from the count data and getting stuck into analysis.

## Data import
*Set up an RStudio project specifying the directory where you have saved the `/data` directory*.

First, let's load all the packages we will need to analyse the data.

```{r setup, message = FALSE}
library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
library(RColorBrewer)
```

### Mouse mammary gland dataset

The data for this tutorial comes from a Nature Cell Biology paper, [*EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival*](http://www.ncbi.nlm.nih.gov/pubmed/25730472) [@Fu2015]. Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number [GSE60450](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450).

This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.

The sampleinfo file contains basic information about the samples that we will need for the analysis today.

```{r loadSampleInfo}
# Read the sample information into R
sampleinfo <- read.delim("data/SampleInfo.txt")
View(sampleinfo)
sampleinfo
```

### Reading in the  count data

We will first use the counts file as a starting point for our analysis. This data has already been aligned to the mouse genome. The command line tool featureCounts [@Liao2014] was used to count reads mapped to mouse genes from Refseq annotation (see the [paper](http://www.ncbi.nlm.nih.gov/pubmed/25730472) for details).

Let's take a look at the data. You can use the `head` command to see the first 6 lines. In RStudio the `View` command will open the dataframe in a new tab. The `dim` command will tell you how many rows and columns the data frame has.

```{r loadData}
# Read the data into R
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
head(seqdata)
View(seqdata)
dim(seqdata)
```

The `seqdata` object contains information about genes (one gene per row), the first column has the Entrez gene id, the second has the gene length and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample. There are two replicates for each cell type and timepoint (detailed sample info can be found in file "GSE60450_series_matrix.txt" from the [GEO website](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450)). 

### Format the data

We will be manipulating and reformating the counts matrix into a suitable format for downstream analysis. The first two columns in the `seqdata` dataframe contain annotation information. We need to make a new matrix containing only the counts, but we can store the gene identifiers (the `EntrezGeneID` column) as rownames. We will add more annotation information about each gene later on in the workshop.

```{r rowNames}
head(seqdata)
rownames(seqdata)[1:10]
```


Let's create a new data object, `countdata`, that contains only the counts for the 12 samples.  

```{r createCountMatrix}
# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]

# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]

View(countdata)
```

Now take a look at the column names

```{r colNames}
colnames(countdata)
```

These are the sample names which are pretty long so we'll shorten these to contain only the relevant information about each sample. We will use the `substr` command to extract the first 7 characters and use these as the colnames.

```{r modifyColNames}
substr("ThisIsAString", start=1, stop=5)
# using substr, you can extract the samplename:
colnames(countdata) <- substr(colnames(countdata), 1, 7)
View(countdata)
colnames(countdata)
```

Note that the column names are now the same as `SampleName` in the `sampleinfo` file. This is good because it means our sample information in `sampleinfo` is in the same order as the columns in `countdata`.

```{r compareNames}
colnames(countdata)==sampleinfo$SampleName

all(colnames(countdata)==sampleinfo$SampleName)
```


## Filtering to remove lowly expressed genes

Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in this case we have a sample size of 2 in each group, we favour filtering on a minimum counts per million threshold present in at least 2 samples. Two represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least two samples.

```{r librarySizes}
colSums(countdata)
```

We'll use the `cpm` function from the *edgeR* library [@robinson2010edgeR] to generate the CPM values and then filter. Note that by converting to CPMs we are normalising for the different sequencing depths for each sample.

```{r getCPM}
# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)

col1sum <- sum(countdata[,1])/1000000
countdata[1,1]/col1sum
```


```{r subsetThreshold}
# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)
```


```{r countPass}
# Summary of how many TRUEs there are in each row
# There are 11433 genes that have TRUEs in all 12 samples.
rowSums(head(thresh))
table(rowSums(thresh))
```

```{r subsetMatrix}
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
summary(keep)
```

```{r subsetData}
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
dim(countdata)
dim(counts.keep)
```

A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample. A requirement for expression in two or more libraries is used as each group contains two replicates. This ensures that a gene will be retained if it is only expressed in one group. Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5. You should filter with CPMs rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.

```{r plotCPMThreshold}
# Let's have a look and see whether our threshold of 0.5 does indeed correspond to a count of about 10-15
# We will look at the first sample
plot(myCPM[,1], countdata[,1], xlab="CPM", ylab="Raw Count", main=colnames(myCPM)[1])
# Add a vertical line at 0.5 CPM
abline(v=0.5)


# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1], xlab="CPM", ylab="Raw Count", main=colnames(myCPM)[1], 
     ylim=c(0,50), xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)
```

> ## Challenge 1 {.challenge}
>
> 1. Plot the counts-per-million versus counts for the second sample.
> 1. Add a vertical line at 0.5 and a horizontal line at 10.
> 1. Add the lines again, colouring them blue
>
> HINT: use the `col` parameter.
>

**Solution**
```{r solutionChallenge1, echo=FALSE}

```

## Convert counts to DGEList object

Next we'll create a `DGEList` object. This is an object used by *edgeR* to store count data. It has a number of slots for storing various parameters about the data.

```{r makeDGEObj}
dgeObj <- DGEList(counts.keep)
# have a look at dgeObj
dgeObj
# See what slots are stored in dgeObj
names(dgeObj)
# Look at the structure of the list
str(dgeObj)
# Library size information is stored in the samples slot
dgeObj$samples
```

## Quality control

Now that we have got rid of the lowly expressed genes and have our counts stored in a `DGEList` object, we can look at a few different plots to check that the data is good quality, and that the samples are as we would expect.

### Library sizes and distribution plots

First, we can check how many reads we have for each sample in the `dgeObj`.

```{r dgeLibrarySizes}
dgeObj$samples[,2]
dgeObj$samples[,"lib.size"]

```

We can also plot the library sizes as a barplot to see whether there are any major discrepanies between the samples more easily.

```{r plotLibrarySizes}

# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(dgeObj$samples$lib.size, names=colnames(dgeObj), las=2, main="Barplot of library sizes")
abline(h=20e6, lty=2)

```

Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. Next we'll use box plots to check the distribution of the read counts on the log2 scale. We can use the `cpm` function to get log2 counts per million, which are corrected for the different library sizes. The `cpm` function also adds a small offset to avoid taking log of zero.  

```{r plotLogCounts}
# Get log2 counts per million
logcounts <- cpm(dgeObj,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts), col="blue", main="Boxplots of logCPMs (unnormalised)")
```

From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further.

> ## Discussion {.challenge}
>
> Do any samples appear to be different compared to the others?
>


### Multidimensional scaling plots

By far, one of the most important plots we make when we analyse RNA-Seq data are MDSplots. An MDSplot is a visualisation of a principle components analysis, which determines the greatest sources of variation in the data. A principle components analysis is an example of an unsupervised analysis, where we don't need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the `plotMDS` function to create the MDS plot.

```{r plotMDSbasic}
plotMDS(dgeObj)
```

It is a bit difficult to see exactly what is going on with the default plot, although we do see samples grouping together in pairs. To make this plot more informative, we can colour the samples according to the grouping information. We can also change the labels, or instead of labels we can have points.

```{r setPlotParams}
# Let's set up colour schemes for CellType
# How many cell types and in what order are they stored?
levels(sampleinfo$CellType)
## Let's choose purple for basal and orange for luminal
col.cell <- c("purple","orange")[sampleinfo$CellType]
data.frame(sampleinfo$CellType,col.cell)
col.cell
sampleinfo$CellType

# Similarly for status
levels(sampleinfo$Status)
col.status <- c("blue","red","dark green")[sampleinfo$Status]
col.status
```

```{r plotMDScolored, fig.height=5, fig.width=10}
# We specify the option to let us plot two plots side-by-sde
par(mfrow=c(1,2))
# Redo the MDS with cell type colouring
plotMDS(dgeObj,col=col.cell, main="Cell type")
# Let's add a legend to the plot so we know which colours correspond to which cell type
legend("topleft",fill=c("purple","orange"),legend=levels(sampleinfo$CellType))

plotMDS(dgeObj,col=col.status, main="Status")
legend("topleft",fill=c("blue","red","dark green"),legend=levels(sampleinfo$Status))

```

> ## Discussion {.challenge}
>
> Look at the MDS plot coloured by cell type.
> Is there something strange going on with the samples?
> Identify the two samples that don't appear to be in the right place.
>

```{r correctSampleSheet}
# There is a sample info corrected file in your data directory
# Old sampleinfo
sampleinfo
# I'm going to write over the sampleinfo object with the corrected sample info
sampleinfo <- read.delim("data/SampleInfo_Corrected.txt")
sampleinfo
```

```{r plotMDSfixed, fig.height=5, fig.width=10}
# Redo the MDSplot with corrected information
par(mfrow=c(1,2))
col.cell <- c("purple","orange")[sampleinfo$CellType]
col.status <- c("blue","red","dark green")[sampleinfo$Status]

plotMDS(dgeObj,col=col.cell, main="Cell type")
legend("topleft",fill=c("purple","orange"),legend=levels(sampleinfo$CellType))

plotMDS(dgeObj,col=col.status, main="Status")
legend("topleft",fill=c("blue","red","dark green"),legend=levels(sampleinfo$Status))
```

> ## Discussion {.challenge}
>
> What is the greatest source of variation in the data (i.e. what does dimension 1 represent)?
> What is the second greatest source of variation in the data?
>

> ## Challenge 2 {.challenge}
>
> 1. Redo the plots choosing your own colours.
> 2. Change the plotting character to a symbol instead of the column names  
> HINT: use `pch` (**p**lotting **ch**aracters) argument. Try `pch=16` and see what happens.
> 3. Change the plotting characters such that basal samples have the value `1` and luminal samples have the value `4` and colour the points by status (lactate, pregnant, virgin)
>

**Solution**
```{r solutionChallenge2, echo=FALSE}

```


The distance between each pair of samples in the MDS plot is calculated as the leading fold change, defined as the root-mean-square of the largest 500 log2-fold changes between that pair of samples. Replicate samples from the same group cluster together in the plot, while samples from different groups form separate clusters. This indicates that the differences between groups are larger than those within groups, i.e., differential expression is greater than the variance and can be detected. In the MDS plot, the distance between basal samples on the left and luminal cells on the right is about 6 units, corresponding to a leading fold change of about 64-fold (2^6 = 64) between basal and luminal. The expression differences between virgin, pregnant and lactating are greater for luminal cells than for basal.

Notes

* The MDS plot can be simply generated with `plotMDS(dgeObj)`. The additional code is purely for aesthetics, to improve the visualization of the groups.
* Clustering in the MDS plot can be used to motivate changes to the design matrix in light of potential batch effects. For example, imagine that the first replicate of each group was prepared at a separate time from the second replicate. If the MDS plot showed separation of samples by time, it might be worthwhile including time in the down stream analysis to account for the time-based effect.

`plotMDS` plots the first two dimensions as a default, however you can plot higher dimensions using the `dim` argument.

```{r plotMDS3and4}
# Dimension 3 appears to separate pregnant samples from the rest. Dim4?
plotMDS(dgeObj,dim=c(3,4), col=col.status)
legend("topleft",fill=c("blue","red","dark green"),legend=levels(sampleinfo$Status))
```


Another alternative is to generate an interactive MDS plot using the *Glimma* package. This allows the user to interactively explore the different dimensions.

```{r glimmaMDS}
labels <- paste(sampleinfo$SampleName, sampleinfo$CellType, sampleinfo$Status)
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group <- factor(group)
glMDSPlot(dgeObj, labels=labels, groups=group, folder="mds")
```

*Glimma* was created to make interactive versions of some of the popular plots from the *limma* package. At present it can be used to obtain MDS plots and mean-difference (MD) plots, which will be covered later. The output of `glMDSPlot` is an html page (/mds/MDS-Plot.html) that shows the MDS plot on the left, and the amount of variation explained by each dimension in a barplot on the right. The user can hover over points to find out sample information, and switch between successive dimensions in the MDS plot by clicking on the bars in the barplot. The default MDS plots shows dimensions 1 and 2.

### Hierarchical clustering with heatmaps

An alternative to `plotMDS` for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the `heatmap.2` function from the *gplots* package. In this example `heatmap.2` calculates a matrix of euclidean distances from the logCPM (`logcounts` object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using `plotMDS`.)

The *RColorBrewer* package has nicer colour schemes, accessed using the `brewer.pal` function. "RdYlBu" is a common choice, and "Spectral" is also nice.

Note:The `png` function will create a png file to save the plots created straight after, and will close this file when `dev.off()` is called. To see your plots interactively, simply omit those two lines.

Let's select data for the 500 most variable genes and plot the heatmap

```{r getHMData}
# We estimate the variance for each row in the logcounts matrix
var_genes <- apply(logcounts, 1, var)
head(var_genes)
# Get the gene names for the top 500 most variable genes
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
head(select_var)
# Subset logcounts matrix
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
head(highly_variable_lcpm)
```

```{r plotHM}
## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
## http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[sampleinfo$CellType]

# Plot the heatmap
heatmap.2(highly_variable_lcpm, 
          col=rev(morecols(50)),
          trace="column", 
          main="Top 500 most variable genes across samples",
          ColSideColors=col.cell,scale="row")

# Save the heatmap
png(file="High_var_genes.heatmap.png")
heatmap.2(highly_variable_lcpm, 
          col=rev(morecols(50)), 
          trace="none", 
          main="Top 500 most variable genes\nacross samples", 
          ColSideColors=col.cell, 
          scale="row")
dev.off()
```

> ## Challenge 3 {.challenge}
>
> 1. Change the colour scheme to "PiYG" and redo the heatmap. Try `?RColorBrewer` and see what other colour schemes are available.
> 1. Change the sample names to `group` using the `labCol` argument
> 1. Redo the heatmap using the top 500 LEAST variable genes.
>

**Solution**
```{r solutionChallenge3,echo=FALSE, fig.height=8, fig.width=6}

```


-----

## Normalisation for composition bias

The trimmed mean of M-values normalization method (TMM) is performed to eliminate composition biases between libraries [@robinson2010tmm]. This generates a set of normalization factors, where the product of these factors and the library sizes defines the effective library size. The `calcNormFactors` function in `edgeR` calculates the normalization factors between libraries. TMM normalisation (and most scaling normalisation methods) scale relative to one sample.

```{r calcNormFactors}
# Apply normalisation to DGEList object
dgeObj <- calcNormFactors(dgeObj)
```

This will update the normalisation factors in the `DGEList` object (their default values are 1). Take a look at the normalisation factors for these samples.

```{r vizNormFactors}
dgeObj$samples
```

The normalization factors multiply to unity across all libraries. A normalization factor below one indicates that the library size will be scaled down, as there is more suppression (i.e., composition bias) in that library relative to the other libraries. This is also equivalent to scaling the counts upwards in that sample. Conversely, a factor above one scales up the library size and is equivalent to downscaling the counts.


The last two samples have much smaller normalisation factors, and MCL1.LA and MCL1.LB have the largest. If we plot mean difference plots using the `plotMD` function for these samples, we should be able to see the composition bias problem. We will use the `logcounts`, which have been normalised for library size, but not for composition bias.

```{r plotRawMD, fig.height=5, fig.width=10}
par(mfrow=c(1,2))
plotMD(logcounts,column = 7)
abline(h=0,col="grey")
plotMD(logcounts,column = 11)
abline(h=0,col="grey")
```

The mean-difference plots show average expression (mean: x-axis) against log-fold-changes (difference: y-axis).
Because our `DGEList` object contains the normalisation factors, if we redo these plots using `dgeObj`, we should see the composition bias problem has been solved.

```{r plotNormedMD, fig.height=5, fig.width=10}
par(mfrow=c(1,2))
plotMD(dgeObj,column = 7)
abline(h=0,col="grey")
plotMD(dgeObj,column = 11)
abline(h=0,col="grey")
```

> ## Challenge 4 {.challenge}
>
> Plot the biased and unbiased MD plots side by side for the same sample to see the before and after TMM normalisation effect.
>

**Solution**
```{r solutionChallenge4, echo=FALSE, fig.height=5, fig.width=10}

```

**We need to save a few data objects to use for Day 2 so we don't have to rerun everything**

```{r saveData}
save(group,dgeObj,sampleinfo,file="Robjects/preprocessing.Rdata")
```