02-methods.Rmd

# Functional Enrichment Analysis Methods {#chapter2}

## Over Representation Analysis

Over Representation Analysis (ORA) [@boyle2004] is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, *e.g.* a list of differentially expressed genes (DEGs). 


The _p_-value can be calculated by hypergeometric distribution.


$p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}$

In this equation, `N` is the total number of genes in the background distribution,
`M` is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest,
`n` is the size of the list of genes of interest and `k` is the number of genes within that list which
are annotated to the gene set. The background distribution by default is
all the genes that have annotation. _P_-values should be adjusted for [multiple comparison](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).


**Example:** Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differential expressed genes, 28 are annotated to a gene set^[example adopted from <https://guangchuangyu.github.io/cn/2012/04/enrichment-analysis/>]. 


```{r}
d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
```                        


whether the overlap of 25 genes are significantly over represented in the gene set can be assessed using hypergeometric distribution. This corresponding to a one-sided version of Fisher's exact test.

```{r}
fisher.test(d, alternative = "greater")
```


## Gene Set Enrichment Analysis


A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated in [Disease enrichment analysis](enrichmentAnalysis.html) vignette were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)[@subramanian_gene_2005] directly addresses this limitation.
All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

Genes are ranked based on their phenotypes. Given a priori defined set of gene _S_ (e.g., genes shareing the same _DO_ category), the goal of GSEA is to determine whether the members of _S_ are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.

There are three key elements of the GSEA method:

* Calculation of an Enrichment Score.
  + The enrichment score (_ES_) represent the degree to which a set _S_ is over-represented at the top or bottom of the ranked list _L_. The score is calculated by walking down the list _L_, increasing a running-sum statistic when we encounter a gene in _S_ and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The _ES_ is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic[@subramanian_gene_2005].
* Esimation of Significance Level of _ES_.
  + The _p_-value of the _ES_ is calculated using permutation test. Specifically, we permute the gene labels of the gene list _L_ and recompute the _ES_ of the gene set for the permutated data, which generate a null distribution for the _ES_. The _p_-value of the observed _ES_ is then calculated relative to this null distribution.
* Adjustment for Multiple Hypothesis Testing.
  + When the entire gene sets were evaluated, `r Biocpkg("DOSE")` adjust the estimated significance level to account for multiple hypothesis testing and also _q_-values were calculated for FDR control.

We implemented GSEA algorithm proposed by Subramanian[@subramanian_gene_2005]. Alexey Sergushichev implemented an algorithm for fast GSEA analysis in the `r Biocpkg("fgsea")`[@alex_fgsea] package.

In `r Biocpkg("DOSE")`[@yu_dose_2015], user can use GSEA algorithm implemented in `DOSE` or `fgsea` by specifying the parameter `by="DOSE"` or `by="fgsea"`. By default, `r Biocpkg("DOSE")` use `fgsea` since it is much more fast.


## Leading edge analysis and core enriched genes

Leading edge analysis reports `Tags` to indicate the percentage of genes contributing to the enrichment score, `List` to indicate where in the list the enrichment score is attained and `Signal` for enrichment signal strength.

It would also be very interesting to get the core enriched genes that contribute to the enrichment.

`r Biocpkg("DOSE")` supports leading edge analysis and report core enriched genes in GSEA analysis.