methods: add Diego's suggestions
Co-authored-by: Diego Milone <[email protected]>
miltondp and dmilone committed May 17, 2024
1 parent cc06ba9 commit e085563
Showing 4 changed files with 697 additions and 641 deletions.
49 changes: 21 additions & 28 deletions content/08.01.methods.ccc.md
@@ -4,16 +4,16 @@

#### Definitions

**Definition 1.1.** Given a data vector $\mathbf{x}=(x_{1},x_{2},\dots,x_{n})$, if $\mathbf{x} \in \mathbb{R}^n$ then define
**Definition 1.1.** Given a data vector $\mathbf{x}=(x_{1},x_{2},\dots,x_{n}) \in \mathbb{R}^n$, define

$$\pi_{\ell} = \{i \mid \rho_\ell < x_{i} \leq \rho_{\ell+1}\}, \forall \ell \in \left[1,k\right]$$

as a *partition* of the $n$ objects of $\mathbf{x}$ into $\left\vert\pi\right\vert=k$ clusters, where $\boldsymbol{\rho}$ is a set of $k+1$ cutpoints (e.g., quantiles) that define the clusters and $\rho_{1} = \min(\mathbf{x})$ and $\rho_{k+1} = \max(\mathbf{x})$.
If $\mathbf{x}$ is a categorical vector (i.e., with no intrinsic ordering), then a partition is defined as
as a *partition* of the $n$ objects of $\mathbf{x}$ into $\left\vert\pi\right\vert=k$ clusters, where $\boldsymbol{\rho}$ is a set of $k+1$ cutpoints (e.g., quantiles) that define the clusters, with $\rho_{1} = \min(\mathbf{x})$ and $\rho_{k+1} = \max(\mathbf{x})$.
If $\mathbf{x}$ is a categorical vector with no intrinsic ordering, then a partition is defined as

$$\pi_{c} = \{i \mid x_{i} = \mathcal{C}_{c}\}, \forall c \in \left[1,k\right]$$
$$\pi_{\ell} = \{i \mid x_{i} = c_{\ell}\}, \forall \ell \in \left[1,m\right]$$

where $\mathcal{C} = \{x_{i} | x_{i} \in \mathbf{x}\}$ is a set of unique values in $\mathbf{x}$ corresponding to the $k = \mid\mathcal{C}\mid$ categorical values that define the clusters.
where $\mathcal{C} = \{c_1, c_2,\dots,c_m\}$ is the set of unique values in $\mathbf{x}$, that is, the $m = \lvert\mathcal{C}\rvert$ categorical values that define the clusters.
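Definition 1.1 can be illustrated with a minimal Python sketch. The function names `numerical_partition` and `categorical_partition` are hypothetical and not taken from the CCC implementation. The quantile estimator (`statistics.quantiles` with `method="inclusive"`) is an assumption, as is placing the minimum value in the first cluster.

```python
from statistics import quantiles


def numerical_partition(x, k):
    """Label each value of x with one of k quantile-based clusters (0..k-1)."""
    # The k - 1 interior cutpoints; min(x) and max(x) act as the outer cutpoints.
    # Assumption: the "inclusive" estimator stands in for whichever quantile
    # definition the actual implementation uses.
    interior = quantiles(x, n=k, method="inclusive")
    # A value's label is the number of interior cutpoints strictly below it,
    # so x_i falls in cluster l when rho_l < x_i <= rho_{l+1}.
    # Assumption: the minimum value is placed in the first cluster.
    return [sum(cut < value for cut in interior) for value in x]


def categorical_partition(x):
    """Label each value of x by its category, one cluster per unique value."""
    # dict.fromkeys keeps first-seen order, so labels are deterministic.
    label_of = {value: label for label, value in enumerate(dict.fromkeys(x))}
    return [label_of[value] for value in x]


print(numerical_partition([1.0, 2.0, 3.0, 4.0], 2))   # two clusters split at the median
print(categorical_partition(["a", "b", "a", "c"]))    # three clusters, one per category
```

Both functions return a list of cluster labels rather than index sets $\pi_{\ell}$; the two representations carry the same information.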

**Definition 1.2.** Given two partitions $\pi$ and $\pi^{\prime}$ of $n$ objects, the *adjusted Rand Index (ARI)* [@doi:10.1007/BF01908075] is given by

@@ -25,11 +25,11 @@ $n_{2}$ is the number of object pairs that are in the same cluster in $\pi$ but
and $n_{3}$ is the number of object pairs that are in different clusters in $\pi$ but in the same cluster in $\pi^{\prime}$.
Intuitively, $n_0 + n_1$ reflects the number of object pairs where both partitions agree, and $n_2 + n_3$ are those in which they disagree.
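The pair counts can be made concrete with a small sketch. It assumes the standard pair-counting form of the ARI, $2(n_0 n_1 - n_2 n_3) / \left[(n_0 + n_2)(n_2 + n_1) + (n_0 + n_3)(n_3 + n_1)\right]$, since the equation itself is collapsed in this diff view. It also assumes that $n_0$ counts pairs separated in both partitions and $n_1$ counts pairs grouped together in both; the formula is symmetric in $n_0$ and $n_1$, so swapping them gives the same result. The function names are hypothetical, and returning NaN for a zero denominator is an illustrative choice.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count object pairs by how the two partitions treat them."""
    n0 = n1 = n2 = n3 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n1 += 1  # together in both partitions (agreement)
        elif same_a:
            n2 += 1  # together in the first partition only
        elif same_b:
            n3 += 1  # together in the second partition only
        else:
            n0 += 1  # separated in both partitions (agreement)
    return n0, n1, n2, n3


def adjusted_rand_index(labels_a, labels_b):
    """ARI from pair counts; O(n^2) in the number of objects."""
    n0, n1, n2, n3 = pair_counts(labels_a, labels_b)
    denominator = (n0 + n2) * (n2 + n1) + (n0 + n3) * (n3 + n1)
    if denominator == 0:
        # Degenerate case, e.g. both partitions have a single cluster.
        return float("nan")
    return 2 * (n0 * n1 - n2 * n3) / denominator


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # crossed grouping
```

The quadratic loop over pairs is for readability only; a contingency-table formulation computes the same quantity more efficiently.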

**Definition 1.3.** The *Clustermatch Correlation Coefficient (CCC)* for two equally-sized vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as the maximum ARI between all possible partitions of $\mathbf{x}$ and $\mathbf{y}$
**Definition 1.3.** The *Clustermatch Correlation Coefficient (CCC)* between $\mathbf{x}$ and $\mathbf{y}$ is defined as the maximum ARI between all possible partitions of $\mathbf{x}$ and $\mathbf{y}$

$$\textrm{CCC}(\mathbf{x}, \mathbf{y}) = \max \lbrace 0, \max_{\substack{\pi_j \in \Pi^{\mathbf{x}} \\ \pi_l \in \Pi^{\mathbf{y}}}} \lbrace \textrm{ARI}(\pi_j, \pi_l) \rbrace \rbrace, \forall \left\vert\pi\right\vert \in [2, k_{\mathrm{max}}]$$

where $\Pi^{\mathbf{x}}$ is a set of partitions derived from vector $\mathbf{x}$, $\Pi^{\mathbf{y}}$ is a set of partitions derived from vector $\mathbf{y}$, and $k_{\mathrm{max}}$ specifies the maximum number of clusters allowed.
where $\Pi^{\mathbf{x}}$ is a set of partitions derived from $\mathbf{x}$, $\Pi^{\mathbf{y}}$ is a set of partitions derived from $\mathbf{y}$, and $k_{\mathrm{max}}$ specifies the maximum number of clusters allowed.
The ARI has an upper bound of 1 (achieved when both partitions are identical), and although it does not have a well-defined lower bound, values less than or equal to zero are achieved when partitions are independent.
Therefore, $\textrm{CCC}(\mathbf{x}, \mathbf{y}) \in \left[0,1\right]$.
In the special case where all $n$ objects in either $\mathbf{x}$ or $\mathbf{y}$ have the same value, the CCC is undefined.
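Definition 1.3 can be sketched end to end for two numerical vectors. This is a simplified illustration under stated assumptions, not the authors' implementation: `ccc_sketch`, `quantile_labels`, and `ari` are hypothetical names, partitions come from `statistics.quantiles` with the "inclusive" estimator, the ARI uses the standard pair-counting form, and a constant vector raises an error to reflect the undefined case.

```python
from itertools import combinations
from statistics import quantiles


def quantile_labels(x, k):
    # Label = number of interior quantile cutpoints strictly below the value.
    cuts = quantiles(x, n=k, method="inclusive")
    return [sum(cut < value for cut in cuts) for value in x]


def ari(a, b):
    # Standard pair-counting ARI (assumed form; see the lead-in).
    n0 = n1 = n2 = n3 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            n1 += 1
        elif same_a:
            n2 += 1
        elif same_b:
            n3 += 1
        else:
            n0 += 1
    denominator = (n0 + n2) * (n2 + n1) + (n0 + n3) * (n3 + n1)
    return 2 * (n0 * n1 - n2 * n3) / denominator if denominator else 0.0


def ccc_sketch(x, y, k_max=10):
    """Maximum ARI over all pairs of quantile partitions, floored at zero."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(set(x)) < 2 or len(set(y)) < 2:
        raise ValueError("CCC is undefined when a vector is constant")
    k_values = range(2, min(k_max, len(x)) + 1)
    partitions_x = [quantile_labels(x, k) for k in k_values]
    partitions_y = [quantile_labels(y, k) for k in k_values]
    best = max(ari(px, py) for px in partitions_x for py in partitions_y)
    return max(0.0, best)


x = list(range(-10, 11))
y = [value ** 2 for value in x]          # quadratic relationship
print(ccc_sketch(x, y, k_max=2))         # a single median split on each side
print(ccc_sketch(x, y, k_max=10))        # finer partitions are also compared
```

In this toy quadratic example, the single median split misses the pattern, while allowing more clusters on one variable recovers it, which mirrors the role of $k_{\mathrm{max}}$ in the definition.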
@@ -38,32 +38,28 @@ In the special case where all $n$ objects in either $\mathbf{x}$ or $\mathbf{y}$
The CCC has the following basic properties derived from the ARI:
1) symmetry, since $\mathrm{ARI}(\pi, \pi^{\prime}) = \mathrm{ARI}(\pi^{\prime}, \pi)$;
2) normalization, since it takes a minimum value of zero and a maximum of one, with $\mathrm{ARI}(\pi, \pi) = 1$;
3) constant baseline, since the ARI is adjusted-for-chance [@doi:10.1007/BF01908075], it returns a value close to zero for independently drawn partitions, and this also holds when partitions have different number of clusters [@Vinh2010];
this is an important property, since CCC compares partitions with different numbers of clusters, and relationships between two variables (such as linear or quadratic) might be better represented with different numbers of clusters as shown in Figure @fig:datasets_rel.
3) constant baseline, since the ARI is adjusted for chance [@doi:10.1007/BF01908075] and therefore returns a value close to zero for independently drawn partitions, which also holds when partitions have different numbers of clusters [@Vinh2010].
This is an important property, since CCC compares partitions with different numbers of clusters, and relationships between two variables (such as linear or quadratic) might be better represented with different numbers of clusters as shown in Figure @fig:datasets_rel.

#### The maximum number of clusters $k_{\mathrm{max}}$

The parameter $k_{\mathrm{max}}$ is the maximum number of clusters $k$ allowed for any partition derived from $\mathbf{x}$ or $\mathbf{y}$.
However, note that the same value of $k$ might not be the right one to find a relationship between any two variables.
On the one hand, note that a single value of $k$ might not be the right one to find the relationship between every pair of variables.
For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one variable and two using the other).
If we used only two clusters instead, CCC would return a similarity value of 0.02.
However, computational time increases exponentially with $k_{\mathrm{max}}$.
On the other hand, computational time increases quadratically with $k_{\mathrm{max}}$.
In addition, it is important to note that given the constant baseline property of the ARI, the CCC returns a value close to zero for independent variables regardless of the value of $k_{\mathrm{max}}$.
As shown in Figure @fig:constant_baseline:k_max, this holds even for very large values of $k_{\mathrm{max}}$ approaching the number of objects $n$.
Note that as $k_{\mathrm{max}}$ approaches $n$, the number of singleton clusters (i.e., clusters with only one object) increases, which would not be useful to find relationships between variables.
Therefore, given the constant baseline property, $k_{\mathrm{max}}$ only represents a tradeoff between the ability to capture complex patterns and the computational cost, with random/independent variables having a CCC value close to zero regardless of the value of $k_{\mathrm{max}}$.

We found that $k_{\mathrm{max}}=10$ works well in practice, and it was used as the default maximum number of clusters across all our analyses.
As shown in Figure @fig:constant_baseline:k_max, this holds even for very large values of $k_{\mathrm{max}}$, approaching the number of objects $n$.
Note that as $k_{\mathrm{max}}$ approaches $n$, the number of singleton clusters (i.e., clusters with only one object) increases, which would not be useful for finding relationships between variables.
Therefore, given the constant baseline property, $k_{\mathrm{max}}$ only represents a tradeoff between the ability to capture complex patterns and the computational cost, with random/independent variables having a CCC value close to zero regardless of the value of $k_{\mathrm{max}}$; we found that $k_{\mathrm{max}}=10$ works well in practice, and it was used as the default maximum number of clusters across all our analyses.

#### Statistical significance

Our null hypothesis is that the variables represented by vectors $\mathbf{x}$ and $\mathbf{y}$ are independent.
Our null hypothesis is that the variables represented by $\mathbf{x}$ and $\mathbf{y}$ are independent.
To compute a $P$-value, we perform a set of permutations of values in $\mathbf{y}$ and compute the CCC between $\mathbf{x}$ and each permuted vector.
The $P$-value is the proportion of CCC values using the permuted data that are greater than or equal to the CCC value between $\mathbf{x}$ and $\mathbf{y}$.
We used 1 million permutations in all our analyses, and we adjusted the $P$-values using the Benjamini and Hochberg procedure [@doi:10.1111/j.2517-6161.1995.tb02031.x] to control the false discovery rate (FDR).

In our analyses, we computed a $P$-value only for a subset of gene pairs, since computing the statistical significance via permutations is computationally expensive with CCC.
For this, we selected gene pairs from the "Disagreements" group in Figure @fig:upsetplot_coefs, which contains gene pairs ranked differently by the correlation coefficients.
We used 1 million permutations in all our analyses, and we adjusted the $P$-values using the Benjamini and Hochberg procedure [@doi:10.1111/j.2517-6161.1995.tb02031.x] to control the false discovery rate (FDR);
given the computational cost, we computed a $P$-value only for gene pairs from the "Disagreements" group in Figure @fig:upsetplot_coefs, which contains gene pairs ranked differently by the correlation coefficients.
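The permutation procedure described above can be sketched generically. The function name `permutation_p_value`, its parameters, and the fixed random seed are hypothetical; the `statistic` argument stands in for a CCC implementation. The $P$-value is computed exactly as the stated proportion, without any additional correction.

```python
import random


def permutation_p_value(x, y, statistic, n_permutations=1000, seed=0):
    """Proportion of permuted statistics >= the observed statistic."""
    rng = random.Random(seed)      # fixed seed only to make the sketch repeatable
    observed = statistic(x, y)
    y_permuted = list(y)           # copy so the caller's data is left untouched
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_permuted)    # break any association between x and y
        if statistic(x, y_permuted) >= observed:
            hits += 1
    return hits / n_permutations


def toy_statistic(a, b):
    # Stand-in for CCC in this example: sum of products, which is largest
    # when both vectors are sorted the same way.
    return sum(u * v for u, v in zip(a, b))


x = list(range(20))
print(permutation_p_value(x, x, toy_statistic, n_permutations=200))
```

With 1 million permutations, as used in the analyses, the smallest non-zero $P$-value this estimator can return is $10^{-6}$, and the cost is 1 million evaluations of the statistic per variable pair, which is why restricting the test to a subset of gene pairs matters.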

#### Algorithm

@@ -76,11 +72,10 @@ Then, it computes the ARI between each partition $\pi_j \in \Pi^{\mathbf{x}}$ an


Interestingly, since CCC only needs a set of partitions to compute a correlation value, any type of variable that can be used to perform clustering is supported.
If variable $\mathbf{v}$ is numerical (lines 2 to 6 in the `get_partitions` function), then a set of partitions $\Pi$ is generated with different number of clusters.
Each partition $\pi$ is generated using a set of quantiles $\boldsymbol{\rho}$.
If variable $\mathbf{v}$ is numerical (lines 2 to 6 in the `get_partitions` function), each partition $\pi$ is generated using a set of quantiles $\boldsymbol{\rho}$.
For example, if $k=2$, then $\boldsymbol{\rho}=(\rho_1, \rho_2, \rho_3)$, where $\rho_1$ is the minimum value of $\mathbf{v}$, $\rho_2$ is the median, and $\rho_3$ is the maximum value of $\mathbf{v}$.
Then, the first cluster $\pi_{1}$ contains all values of $\mathbf{v}$ that are less than or equal to $\rho_2$, and $\pi_2$ contains all values of $\mathbf{v}$ that are greater than $\rho_2$.
If variable $\mathbf{v}$ is categorical (lines 8 to 11), we compute a single partition $\pi$ with $k=\left\vert\mathcal{C}\right\vert$ clusters, where $\mathcal{C} = \{v_{i} | v_{i} \in \mathbf{v}\}$ is a set of unique categorical values.
If variable $\mathbf{v}$ is categorical (lines 8 to 11), we compute a single partition $\pi$ with $m=\left\vert\mathcal{C}\right\vert$ clusters, where $\mathcal{C} = \{c_1, c_2,\dots,c_m\}$ is a set of unique categorical values in $\mathbf{v}$.
Therefore, all variable types are internally represented as partitions and it is not necessary to access the original data values to compute the ARI.
Consequently, numerical and categorical variables can be naturally integrated.

@@ -90,9 +85,7 @@ This means that for a variable pair, 18 partitions are generated (9 for each var
Smaller values of $k_{\mathrm{max}}$ can reduce computation time, although at the expense of missing more complex/general relationships.
Our examples in Figure @fig:datasets_rel suggest that using $k_{\mathrm{max}}=2$ would force CCC to find linear-only patterns, which could be a valid use case when only this kind of relationship is desired.
In addition, $k_{\mathrm{max}}=2$ implies that only two partitions are generated, and only one ARI comparison is performed.
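As a rough worked count, assuming both variables are numerical and one partition is generated for each $k \in [2, k_{\mathrm{max}}]$, each variable yields $k_{\mathrm{max}}-1$ partitions, so the number of ARI comparisons for a variable pair is

$$\left\vert\Pi^{\mathbf{x}}\right\vert \cdot \left\vert\Pi^{\mathbf{y}}\right\vert = (k_{\mathrm{max}}-1)^{2}$$

For $k_{\mathrm{max}}=10$ this gives $9 \cdot 9 = 81$ comparisons, and for $k_{\mathrm{max}}=2$ it gives $1 \cdot 1 = 1$, which matches the quadratic growth with $k_{\mathrm{max}}$.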
In this regard, our Python implementation provides flexibility in specifying $k_{\mathrm{max}}$.
For instance, instead of the maximum $k$ (an integer), the parameter could be a custom list of integers: for example, `[2, 5, 10]` will partition the data into two, five and ten clusters.


Generating partitions (lines 16 and 17) or computing their similarity (line 18) can be parallelized.
We used three CPU cores in our analyses to speed up the computation of CCC.
As a final remark, it is interesting to note that generating partitions (lines 15 and 16) and computing their similarity (line 17) can be easily parallelized.
We used three CPU cores in our analyses to speed up the computation of CCC and this could be potentially extended to a large number of processors using a GPU.
Binary file modified content/images/intro/ccc_algorithm/ccc_algorithm.pdf
Binary file not shown.