Commit: methods: add more details about the CCC and update algorithm
## Methods

The code needed to reproduce all of our analyses and generate the figures is available at [https://github.com/greenelab/ccc](https://github.com/greenelab/ccc).
We provide scripts to download the required data and run all the steps.
A Docker image is provided to reproduce the same runtime environment.

### The Clustermatch Correlation Coefficient (CCC) {#sec:ccc_algo .page_break_before}

#### Overview

The Clustermatch Correlation Coefficient (CCC) computes a similarity value $c \in \left[0,1\right]$ between any pair of numerical or categorical features/variables $\mathbf{x}$ and $\mathbf{y}$ measured on $n$ objects.
CCC assumes that if two features $\mathbf{x}$ and $\mathbf{y}$ are similar, then partitioning the $n$ objects by clustering each feature separately should yield matching partitions.
For example, given $\mathbf{x}=(11, 27, 32, 40)$ and $\mathbf{y}=10x=(110, 270, 320, 400)$, where $n=4$, partitioning each variable into two clusters ($k=2$) using their medians (29.5 for $\mathbf{x}$ and 295 for $\mathbf{y}$) results in partition $\pi^{\mathbf{x}}=(1, 1, 2, 2)$ for $\mathbf{x}$ and partition $\pi^{\mathbf{y}}=(1, 1, 2, 2)$ for $\mathbf{y}$.
Then, the agreement between $\pi^{\mathbf{x}}$ and $\pi^{\mathbf{y}}$ can be computed using the adjusted Rand index (ARI) [@doi:10.1007/BF01908075] or, in principle, any other similarity measure between partitions.
In this example, it returns the maximum value (1.0 in the case of the ARI).

#### Definitions

**Definition 1.1.** Given a data vector $\mathbf{x}=(x_{1},x_{2},\dots,x_{n})$, if $\mathbf{x} \in \mathbb{R}^n$, then define

$$\pi_{\ell} = \{i \mid \rho_\ell < x_{i} \leq \rho_{\ell+1}\}, \forall \ell \in \left[1,k\right]$$

as a *partition* of the $n$ objects of $\mathbf{x}$ into $\left\vert\pi\right\vert=k$ clusters, where $\boldsymbol{\rho}$ is a set of $k+1$ cutpoints (e.g., quantiles) that define the clusters, with $\rho_{1} = \min(\mathbf{x})$ and $\rho_{k+1} = \max(\mathbf{x})$.
If $\mathbf{x}$ is a categorical vector (i.e., with no intrinsic ordering), then a partition is defined as

$$\pi_{c} = \{i \mid x_{i} = \mathcal{C}_{c}\}, \forall c \in \left[1,r\right]$$

where $\mathcal{C} = \{x_{i} \mid x_{i} \in \mathbf{x}\}$ is the set of $r$ unique values in $\mathbf{x}$, corresponding to the categorical values that define the clusters.

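As a concrete illustration of Definition 1.1, the sketch below builds a quantile-based partition of a numerical vector. The function name is hypothetical and this is not the authors' implementation; it uses rank-based, equally sized bins as a simple stand-in for quantile cutpoints (ties are ignored).

```python
def quantile_partition(x, k):
    """Partition the n values of x into k clusters using rank-based,
    quantile-like cutpoints (equally sized bins; a simplification
    that ignores ties)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n  # cluster index in 0 .. k-1
    return labels

# k = 2 splits at the median: values below it go to cluster 0, the rest to 1
print(quantile_partition([11, 27, 32, 40], 2))  # → [0, 0, 1, 1]
```

This reproduces the $k=2$ partition of the overview example, where the median (29.5) separates the first two values from the last two.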
**Definition 1.2.** Given two partitions $\pi$ and $\pi^{\prime}$ of $n$ objects, the *adjusted Rand index (ARI)* [@doi:10.1007/BF01908075] is given by

$$\textrm{ARI}(\pi, \pi^{\prime}) = \frac{2(n_{0}n_{1} - n_{2} n_{3})}{(n_0 + n_2)(n_2 + n_1) + (n_0 + n_3)(n_3 + n_1)},$$

where $n_{0}$ is the number of object pairs that are in the same cluster in both partitions $\pi$ and $\pi^{\prime}$,
$n_{1}$ is the number of object pairs that are in different clusters in both partitions,
$n_{2}$ is the number of object pairs that are in the same cluster in $\pi$ but in different clusters in $\pi^{\prime}$,
and $n_{3}$ is the number of object pairs that are in different clusters in $\pi$ but in the same cluster in $\pi^{\prime}$.
Intuitively, $n_0 + n_1$ counts the object pairs on which the two partitions agree, and $n_2 + n_3$ those on which they disagree.

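The pair counts of Definition 1.2 can be computed directly. The sketch below is a naive $O(n^2)$ illustration (not the optimized routine used by CCC) that follows the formula term by term:

```python
from itertools import combinations

def adjusted_rand_index(p, q):
    """ARI via the pair counts n0..n3 of Definition 1.2.
    p and q are cluster-label sequences over the same n objects.
    Assumes non-degenerate partitions (denominator > 0), e.g., not
    all objects placed in singleton clusters in both partitions."""
    n0 = n1 = n2 = n3 = 0
    for i, j in combinations(range(len(p)), 2):
        same_p, same_q = p[i] == p[j], q[i] == q[j]
        if same_p and same_q:
            n0 += 1          # same cluster in both partitions
        elif not same_p and not same_q:
            n1 += 1          # different clusters in both partitions
        elif same_p:
            n2 += 1          # same in p, different in q
        else:
            n3 += 1          # different in p, same in q
    return 2 * (n0 * n1 - n2 * n3) / ((n0 + n2) * (n2 + n1) + (n0 + n3) * (n3 + n1))

print(adjusted_rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # identical partitions → 1.0
```

Note that the function is symmetric in its arguments, which is the first of the CCC properties discussed below.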
**Definition 1.3.** The *Clustermatch Correlation Coefficient (CCC)* for two equally sized vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as the maximum ARI over all pairs of partitions of $\mathbf{x}$ and $\mathbf{y}$:

$$\textrm{CCC}(\mathbf{x}, \mathbf{y}) = \max \lbrace 0, \max_{\substack{\pi_j \in \Pi^{\mathbf{x}} \\ \pi_l \in \Pi^{\mathbf{y}}}} \lbrace \textrm{ARI}(\pi_j, \pi_l) \rbrace \rbrace, \forall \left\vert\pi\right\vert \in [2, k_{\mathrm{max}}]$$

where $\Pi^{\mathbf{x}}$ is a set of partitions derived from vector $\mathbf{x}$, $\Pi^{\mathbf{y}}$ is a set of partitions derived from vector $\mathbf{y}$, and $k_{\mathrm{max}}$ specifies the maximum number of clusters allowed.
The ARI has an upper bound of 1 (achieved when both partitions are identical); although it does not have a well-defined lower bound, values equal to or less than zero are obtained when partitions are independent.
Therefore, $\textrm{CCC}(\mathbf{x}, \mathbf{y}) \in \left[0,1\right]$.
In the special case where all $n$ objects in either $\mathbf{x}$ or $\mathbf{y}$ have the same value, the CCC is undefined.

The CCC has the following basic properties derived from the ARI:
1) symmetry, since $\mathrm{ARI}(\pi, \pi^{\prime}) = \mathrm{ARI}(\pi^{\prime}, \pi)$;
2) normalization, since it takes a minimum value of zero and a maximum of one, given that $\mathrm{ARI}(\pi, \pi) = 1$;
3) a constant baseline: since the ARI is adjusted for chance [@doi:10.1007/BF01908075], it returns a value close to zero for independently drawn partitions, and this also holds when the partitions have different numbers of clusters [@Vinh2010].
The last is an important property, since CCC compares partitions with different numbers of clusters, and relationships between two variables (such as linear or quadratic) might be better represented with different numbers of clusters, as shown in Figure @fig:datasets_rel.

#### The maximum number of clusters $k_{\mathrm{max}}$ | ||
|
||
* ARI is symmetric, i.e., $\mathcal{A}(\pi, \pi^{\prime}) = \mathcal{A}(\pi^{\prime}, \pi)$. | ||
* ARI is normalized, i.e., $\mathcal{A}(\pi, \pi) = 1$. | ||
* ARI is "adjusted-for-chance" (i.e., independently drawn partitions have an ARI of zero), which also holds when partitions have a different number of clusters [@Vinh2010]. | ||
|
||
#### Algorithm | ||
The parameter $k_{\mathrm{max}}$ is the maximum number of clusters $k$ allowed for any partition derived from $\mathbf{x}$ or $\mathbf{y}$.
Note that a single value of $k$ might not be the right one to capture the relationship between any two variables.
For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects into four clusters using one variable and two using the other).
If we used only two clusters instead, CCC would return a similarity value of 0.02.
However, computational cost increases rapidly with $k_{\mathrm{max}}$, since the number of ARI comparisons grows quadratically, as $(k_{\mathrm{max}}-1)^2$.
In addition, given the constant baseline property of the ARI, CCC returns a value close to zero for independent variables regardless of the value of $k_{\mathrm{max}}$.
As shown in Figure @fig:constant_baseline:k_max, this holds even for very large values of $k_{\mathrm{max}}$ approaching the number of objects $n$.
Therefore, $k_{\mathrm{max}}$ represents a tradeoff between the ability to capture complex patterns and the computational cost.

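To make the cost concrete, the helper below (illustrative only, not part of the implementation) counts the partitions and ARI comparisons implied by a given $k_{\mathrm{max}}$:

```python
def ccc_work(k_max):
    """Partitions generated per variable (k = 2 .. k_max) and the
    resulting number of pairwise ARI comparisons."""
    partitions_per_variable = k_max - 1
    ari_comparisons = partitions_per_variable ** 2  # all cross-variable pairs
    return partitions_per_variable, ari_comparisons

# With k_max = 10: 9 partitions per variable (18 total), 81 ARI comparisons
print(ccc_work(10))  # → (9, 81)
```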
We found that $k_{\mathrm{max}}=10$ works well in practice, and we used it as the default maximum number of clusters across all our analyses.

#### Statistical significance

Our null hypothesis is that the variables represented by vectors $\mathbf{x}$ and $\mathbf{y}$ are independent.
To compute a $P$-value, we perform a set of permutations of the values in $\mathbf{y}$ and compute the CCC between $\mathbf{x}$ and each permuted vector.
The $P$-value is the proportion of CCC values on the permuted data that are greater than or equal to the CCC value between $\mathbf{x}$ and $\mathbf{y}$.
We used 1 million permutations in all our analyses, and we adjusted the $P$-values using the Benjamini-Hochberg procedure [@doi:10.1111/j.2517-6161.1995.tb02031.x] to control the false discovery rate (FDR).

In our analyses, we computed a $P$-value only for a subset of gene pairs, since computing statistical significance via permutations is computationally expensive with CCC.
For this, we selected gene pairs from the "Disagreements" group in Figure @fig:upsetplot_coefs, which contains gene pairs ranked differently by the correlation coefficients.

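The permutation scheme above can be sketched generically as follows. This is an illustrative helper, not the authors' code: `statistic` stands in for CCC (any similarity function works), and the names and defaults are hypothetical.

```python
import random

def permutation_pvalue(x, y, statistic, n_permutations=1000, seed=0):
    """Empirical P-value for the null hypothesis that x and y are
    independent: the proportion of permuted-y statistics that are
    greater than or equal to the observed one.  (The analyses in the
    text use CCC as the statistic and one million permutations.)"""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # break any association with x
        if statistic(x, y_perm) >= observed:
            hits += 1
    return hits / n_permutations
```

For strongly dependent inputs the returned proportion is near zero, while for independent inputs it is roughly uniform on $[0,1]$; the subsequent Benjamini-Hochberg adjustment operates on these raw values.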
#### Algorithm

<!-- The LaTeX code for the algorithm is here: https://www.overleaf.com/project/61b8c643eb0ed41c2d8aaadc -->
![
](images/intro/ccc_algorithm/ccc_algorithm.svg "CCC algorithm"){width="75%"}

The main function of the algorithm, `ccc`, generates a set of partitions $\Pi^{\mathbf{x}}$ for variable $\mathbf{x}$ (line 16) and another set of partitions $\Pi^{\mathbf{y}}$ for variable $\mathbf{y}$ (line 17).
Then, it computes the ARI between each pair of partitions $\pi_j \in \Pi^{\mathbf{x}}$ and $\pi_l \in \Pi^{\mathbf{y}}$, takes the maximum (line 18), and returns either this value or zero if it is negative (line 19).

Since CCC only needs a set of partitions to compute a correlation value, any type of variable that can be used to perform clustering is supported.
If variable $\mathbf{v}$ is numerical (lines 2 to 6 in the `get_partitions` function), then a set of partitions $\Pi$ is generated with different numbers of clusters.
Each partition $\pi$ is generated using a set of quantiles $\boldsymbol{\rho}$.
For example, if $k=2$, then $\boldsymbol{\rho}=(\rho_1, \rho_2, \rho_3)$, where $\rho_1$ is the minimum value of $\mathbf{v}$, $\rho_2$ is the median, and $\rho_3$ is the maximum value of $\mathbf{v}$.
Then, the first cluster $\pi_{1}$ contains all values of $\mathbf{v}$ that are less than or equal to $\rho_2$, and $\pi_2$ contains all values of $\mathbf{v}$ that are greater than $\rho_2$.
If variable $\mathbf{v}$ is categorical (lines 8 to 11), we compute a single partition $\pi$ with $k=\left\vert\mathcal{C}\right\vert$ clusters, where $\mathcal{C} = \{v_{i} \mid v_{i} \in \mathbf{v}\}$ is the set of unique categorical values.
Therefore, all variable types are internally represented as partitions, and it is not necessary to access the original data values to compute the ARI.
Consequently, numerical and categorical variables can be naturally integrated.

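Putting the pieces together, a compact, unoptimized sketch of the whole procedure might look as follows. Names are illustrative and this is not the authors' `numba`-optimized implementation; the numerical branch uses rank-based quantile binning as an approximation, and partitions whose clusters are all singletons (where the ARI is undefined) are skipped.

```python
from itertools import combinations

def get_partitions(v, k_max=10):
    """One partition per k in 2..k_max for numerical v (rank-based
    quantile binning), or a single partition from the categories
    for categorical v."""
    n = len(v)
    if all(isinstance(u, (int, float)) for u in v):
        order = sorted(range(n), key=lambda i: v[i])
        rank = [0] * n
        for r, i in enumerate(order):
            rank[i] = r
        # cap k at n - 1 to skip all-singleton partitions (ARI undefined)
        return [tuple(rank[i] * k // n for i in range(n))
                for k in range(2, min(k_max, n - 1) + 1)]
    categories = {c: idx for idx, c in enumerate(dict.fromkeys(v))}
    return [tuple(categories[u] for u in v)]

def ari(p, q):
    """Adjusted Rand index via the pair counts of Definition 1.2."""
    n0 = n1 = n2 = n3 = 0
    for i, j in combinations(range(len(p)), 2):
        same_p, same_q = p[i] == p[j], q[i] == q[j]
        if same_p and same_q:
            n0 += 1
        elif not same_p and not same_q:
            n1 += 1
        elif same_p:
            n2 += 1
        else:
            n3 += 1
    return 2 * (n0 * n1 - n2 * n3) / ((n0 + n2) * (n2 + n1) + (n0 + n3) * (n3 + n1))

def ccc(x, y, k_max=10):
    """Maximum ARI over all partition pairs, clipped at zero."""
    best = max(ari(px, py)
               for px in get_partitions(x, k_max)
               for py in get_partitions(y, k_max))
    return max(0.0, best)
```

For instance, `ccc(x, [10 * v for v in x])` returns 1.0 for any strictly increasing transformation of `x`, since the rank-based partitions coincide; mixing a numerical `x` with a categorical `y` works without any special handling.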
With the default $k_{\mathrm{max}}=10$, 18 partitions are generated for a variable pair (9 for each variable, from $k=2$ to $k=10$), and 81 ARI comparisons are performed.
Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships.
Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find only linear patterns, which could be a valid use case when only this kind of relationship is desired.
In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated and only one ARI comparison is performed.
In this regard, our Python implementation provides flexibility in specifying $k_{\mathrm{max}}$:
instead of the maximum $k$ (an integer), the parameter can be a custom list of integers; for example, `[2, 5, 10]` will partition the data into two, five, and ten clusters.

Generating partitions (lines 16 and 17) or computing their similarity (line 18) can be parallelized.
We used three CPU cores in our analyses to speed up the computation of CCC.
A future, improved implementation of CCC could use graphical processing units (GPUs) to parallelize its computation further.

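As a sketch of how the pairwise ARI comparisons (line 18) could be parallelized, the hypothetical helper below fans the partition pairs out over a worker pool. Threads are used here only for simplicity and portability; CPU-bound Python code benefits more from processes or a compiled kernel such as `numba`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def max_pairwise(score, parts_x, parts_y, workers=3):
    """Evaluate score on every (partition_x, partition_y) pair in
    parallel and return the maximum, mirroring lines 16-18 of the
    algorithm.  workers=3 echoes the three CPU cores used in the
    analyses; score stands in for the ARI."""
    pairs = list(product(parts_x, parts_y))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return max(pool.map(lambda pq: score(*pq), pairs))
```

The same fan-out structure applies to generating the partitions themselves, since each value of $k$ is independent of the others.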
A Python implementation of CCC (optimized with `numba` [@doi:10.1145/2833157.2833162]) is available in our GitHub repository [@url:https://github.com/greenelab/clustermatch-gene-expr], as well as a package published in the Python Package Index (PyPI) that can be easily installed.