forked from YuLab-SMU/clusterProfiler-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-methods.Rmd
65 lines (37 loc) · 5.09 KB
/
02-methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# Functional Enrichment Analysis Methods {#chapter2}
## Over Representation Analysis
Over Representation Analysis (ORA) [@boyle2004] is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, *e.g.* a list of differentially expressed genes (DEGs).
The _p_-value can be calculated by hypergeometric distribution.
$p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}$
In this equation, `N` is the total number of genes in the background distribution,
`M` is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest,
`n` is the size of the list of genes of interest and `k` is the number of genes within that list which
are annotated to the gene set. The background distribution by default is
all the genes that have annotation. _P_-values should be adjusted for [multiple comparison](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).
**Example:** Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set^[example adopted from <https://guangchuangyu.github.io/cn/2012/04/enrichment-analysis/>].
```{r}
d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
```
whether the overlap of 25 genes are significantly over represented in the gene set can be assessed using hypergeometric distribution. This corresponding to a one-sided version of Fisher's exact test.
```{r}
fisher.test(d, alternative = "greater")
```
## Gene Set Enrichment Analysis
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated in [Disease enrichment analysis](enrichmentAnalysis.html) vignette were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)[@subramanian_gene_2005] directly addresses this limitation.
All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
Genes are ranked based on their phenotypes. Given a priori defined set of gene _S_ (e.g., genes shareing the same _DO_ category), the goal of GSEA is to determine whether the members of _S_ are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.
There are three key elements of the GSEA method:
* Calculation of an Enrichment Score.
+ The enrichment score (_ES_) represent the degree to which a set _S_ is over-represented at the top or bottom of the ranked list _L_. The score is calculated by walking down the list _L_, increasing a running-sum statistic when we encounter a gene in _S_ and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The _ES_ is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic[@subramanian_gene_2005].
* Esimation of Significance Level of _ES_.
+ The _p_-value of the _ES_ is calculated using permutation test. Specifically, we permute the gene labels of the gene list _L_ and recompute the _ES_ of the gene set for the permutated data, which generate a null distribution for the _ES_. The _p_-value of the observed _ES_ is then calculated relative to this null distribution.
* Adjustment for Multiple Hypothesis Testing.
+ When the entire gene sets were evaluated, `r Biocpkg("DOSE")` adjust the estimated significance level to account for multiple hypothesis testing and also _q_-values were calculated for FDR control.
We implemented GSEA algorithm proposed by Subramanian[@subramanian_gene_2005]. Alexey Sergushichev implemented an algorithm for fast GSEA analysis in the `r Biocpkg("fgsea")`[@alex_fgsea] package.
In `r Biocpkg("DOSE")`[@yu_dose_2015], user can use GSEA algorithm implemented in `DOSE` or `fgsea` by specifying the parameter `by="DOSE"` or `by="fgsea"`. By default, `r Biocpkg("DOSE")` use `fgsea` since it is much more fast.
## Leading edge analysis and core enriched genes
Leading edge analysis reports `Tags` to indicate the percentage of genes contributing to the enrichment score, `List` to indicate where in the list the enrichment score is attained and `Signal` for enrichment signal strength.
It would also be very interesting to get the core enriched genes that contribute to the enrichment.
`r Biocpkg("DOSE")` supports leading edge analysis and report core enriched genes in GSEA analysis.