forked from COMBINE-Australia/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 8
/
rna-seq-de-solutions.Rmd
198 lines (152 loc) · 5.89 KB
/
rna-seq-de-solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "RNA-seq analysis in R"
subtitle: "Differential Expression of RNA-seq data"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
minutes: 300
layout: page
bibliography: ref.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016
## Resources and data files
This material has been created using the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html
Data files downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60450/suppl/GSE60450_Lactation-GenewiseCounts.txt.gz
http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5.rdata
http://bioinf.wehi.edu.au/software/MSigDB/mouse_H_v5.rdata
Data files:
sampleinfo.txt
GSE60450_Lactation-GenewiseCounts.txt
mouse_c2_v5.rdata
mouse_H_v5.rdata
Data files available from: [https://figshare.com/s/1d788fd384d33e913a2a](https://figshare.com/s/1d788fd384d33e913a2a)
You should download these files and place them in your `/data` directory.
## Differential expression with edgeR
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESEQ2 or edgeR. We will use edgeR for the rest of this practical.
**First make sure we have all the objects and libraries loaded*
```{r}
library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
load("Robjects/preprocessing.Rdata")
```
### Recap of pre-processing
The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
```{r eval=FALSE}
## Read the counts from the downloaded data
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
#
# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
countdata
colnames(countdata) <- substr(colnames(countdata), 1, 7)
countdata
## Calculate the Counts Per Million measure
myCPM <- cpm(countdata)
## Identify genes with at least 0.5 cpm in at least 2 samples
thresh <- myCPM > 0.5
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
## Convert to an edgeR object
dgeObj <- DGEList(counts.keep)
## Perform TMM normalisation
dgeObj <- calcNormFactors(dgeObj)
## Obtain corrected sample information
sampleinfo <- read.delim("data/SampleInfo_Corrected.txt")
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group
```
### Create the design matrix
First we need to create a design matrix for the groups, as we have seen in the linear models lecture.
We have two variables, status and cell type. We will fit two models under two assumptions; no interaction and interaction of these two factors.
Let's start with the model with only main effects, that is no interaction. The main assumption here is that the effect of the status is the same in both type of cells.
```{r}
# Create the two variables
group <- as.character(group)
type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
# Specify a design matrix without an intercept term
design <- model.matrix(~ type + status)
design
```
### Data exploration
An MDS plot shows distances, in terms of biological coefficient of variation (BCV), between samples. What do you think of the quality of the data? Can you anticipate if the interaction term will be important?
```{r}
plotMDS(dgeObj, labels=group, cex=0.75, xlim=c(-4, 5))
```
### Estimating the dispersion
The common dispersion estimates the overall BCV of the dataset, averaged over all genes:
```{r}
dgeObj <- estimateCommonDisp(dgeObj)
```
Then we estimate gene-wise dispersion estimates, allowing a possible trend with averge count size:
```{r}
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)
```
Plot the estimated dispersions:
```{r}
plotBCV(dgeObj)
```
### Testing for differential expression
First, we fit genewise glms:
```{r}
# Fit the linear model
fit <- glmFit(dgeObj, design)
names(fit)
head(coef(fit))
```
Conduct likelihood ratio tests for luminal vs basal and show the top genes:
```{r}
lrt.BvsL <- glmLRT(fit, coef=2)
topTags(lrt.BvsL)
```
> ## Challenge {.challenge}
> Conduct likelihood ratio tests for virgin vs lactate and show the top genes.
```{r}
```
### Contrasts
Suppose we want to find differentially expressed genes between pregnant and virgin. We don't have a parameter that explicitly will allow us to test that hypothesis. We need to build a contrast:
```{r}
PvsV <- makeContrasts(statuspregnant-statusvirgin, levels=design)
lrt.pVsV <- glmLRT(fit, contrast=PvsV)
topTags(lrt.pVsV)
```
> ## Challenge {.challenge}
>
> 1.Fit a model with interaction: What is the rationale to include the interaction (What assumption are you relaxing?)
> 2. Is the number of replicates good enough to include the interaction?
> 3. Is the interaction needed in the model?
**Solution**
```{r,echo=FALSE}
# Solution
design2 <- model.matrix(~ type * status)
fit2 <- glmFit(dgeObj, design2)
names(fit2)
head(coef(fit2))
lrt2 <- glmLRT(fit2, coef=5)
topTags(lrt2)
lrt3 <- glmLRT(fit2, coef=6)
topTags(lrt3)
```
```{r}
save(lrt.BvsL,dgeObj,group,file="Robjects/DE.Rdata")
```