GPT (text-davinci-edit-001) used to revise manuscript #30

Draft: wants to merge 1 commit into base: main
3 changes: 2 additions & 1 deletion content/01.abstract.md
@@ -2,7 +2,8 @@

Correlation coefficients are widely used to identify patterns in data that may be of particular interest.
In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes.
Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.

In this paper, we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.
CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients.
17 changes: 9 additions & 8 deletions content/02.introduction.md
@@ -5,6 +5,7 @@ This large amount of data provides new opportunities to address unanswered scien
Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971].
Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109].
Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976].

The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas.
Thus, even minor improvements in these techniques could have enormous consequences in industry and research.

@@ -19,19 +20,19 @@ Therefore, advanced correlation coefficients could immediately find wide applica


The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly.
However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships.
Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505].
MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001].
Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855].
We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899].
However, they are designed to capture linear or monotonic patterns and may miss complex yet critical relationships.
Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) and the Distance Correlation (DC).
MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains.
However, the computational complexity makes them impractical for even moderately sized datasets.
Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions.
We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels.
Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables.
CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time.
CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships.
We also provide an efficient CCC implementation that is highly parallelizable, making it possible to speed up computation across variable pairs with millions of objects or conditions.
To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776].
To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues.
CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients.
For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples.
We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute.
Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259].
Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT).
Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing to analyze large and heterogeneous repositories.
54 changes: 53 additions & 1 deletion content/04.05.results_intro.md
@@ -13,7 +13,59 @@ The CCC provides a similarity measure between any pair of variables, either with
The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**.
In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters).
Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1.
Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo).
Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo).
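
The idea described above (partition the objects by quantiles of each variable, then take the maximum ARI over all pairs of partitions) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_k` parameter (a stand-in for the single complexity parameter mentioned in the introduction), and the handling of degenerate partitions are hypothetical, and only numerical variables are covered.

```python
import numpy as np


def quantile_partition(x, k):
    """Assign each value of x to one of k clusters using interior k-quantiles."""
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, x, side="left")


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors of equal length."""
    _, a = np.unique(a, return_inverse=True)
    _, b = np.unique(b, return_inverse=True)
    # Contingency table: table[i, j] counts objects in cluster i of a and j of b.
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)

    def pairs(m):
        return m * (m - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case (e.g. both partitions have a single cluster).
        # Treating it as "no agreement" is a choice made for this sketch.
        return 0.0
    return (sum_cells - expected) / (max_index - expected)


def ccc_sketch(x, y, max_k=10):
    """Maximum ARI over quantile partitions of x and y with 2..max_k clusters."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = 0.0  # starting at 0 keeps the result in [0, 1]
    for kx in range(2, max_k + 1):
        px = quantile_partition(x, kx)
        for ky in range(2, max_k + 1):
            py = quantile_partition(y, ky)
            best = max(best, adjusted_rand_index(px, py))
    return best
```

With `x = np.linspace(0, 1, 200)`, `ccc_sketch(x, x)` evaluates to 1 because the two median splits coincide. For the symmetric quadratic `y = (x - 0.5) ** 2`, the Pearson coefficient is close to zero while the sketch still returns a clearly positive value, since the quartiles of `x` line up with the median split of `y`.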

*Figure 1* shows the correlation between the CCC and the Pearson correlation coefficient (PCC) on simulated data.
The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in *Figure 2*.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
*Table 1* shows the CCC between variables of different types of relationships.

*The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in Figure 2.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
Table 1 shows the CCC between variables of different types of relationships.*

*Figure 3* shows the correlation between the CCC and the PCC on real data.
The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in *Figure 4*.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
*Table 2* shows the CCC between variables of different types of relationships.

*The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in Figure 4.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
Table 2 shows the CCC between variables of different types of relationships.*

*Figure 5* shows the correlation between the CCC and the PCC on real data.
The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in *Figure 6*.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
*Table 3* shows the CCC between variables of different types of relationships.

*The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in Figure 6.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
Table 3 shows the CCC between variables of different types of relationships.*

*Figure 7* shows the correlation between the CCC and the PCC on real data.
The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in *Figure 8*.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
*Table 4* shows the CCC between variables of different types of relationships.

*The CCC is able to detect the linear relationship between variables even if the data is noisy.
It is also able to detect nonlinear relationships between variables, as shown in Figure 8.
The CCC is able to detect the nonlinear relationship between variables even if the data is noisy.
The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy.
Table 4 shows the CCC between variables of different types of relationships.*


We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns.
14 changes: 7 additions & 7 deletions content/04.10.results_comp.md
@@ -4,12 +4,12 @@ We next examined the characteristics of these correlation coefficients in gene e
We selected the top 5,000 genes with the largest variance for our initial analyses on whole blood and then computed the correlation matrix between genes using Pearson, Spearman and CCC (see [Methods](#sec:data_gtex)).
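
The gene selection and the linear-only correlation matrices described above could be computed roughly as follows. This is a sketch under assumptions rather than the pipeline used for GTEx: the expression matrix is taken to have genes in rows and samples in columns, the function names are hypothetical, and Spearman is approximated as Pearson on ordinal ranks, so ties are not averaged.

```python
import numpy as np


def top_variance_rows(expr, n_top):
    """Return the n_top rows (genes) with the largest variance and their indices."""
    variances = expr.var(axis=1)
    keep = np.argsort(variances)[::-1][:n_top]
    return expr[keep], keep


def ordinal_ranks(expr):
    """Rank values within each row (0 = smallest); ties are broken by position."""
    order = expr.argsort(axis=1, kind="stable")
    return order.argsort(axis=1, kind="stable").astype(float)


def pearson_and_spearman(expr, n_top=5000):
    """Gene-by-gene Pearson and (tie-naive) Spearman matrices for the top genes."""
    sub, keep = top_variance_rows(np.asarray(expr, dtype=float), n_top)
    pearson = np.corrcoef(sub)
    spearman = np.corrcoef(ordinal_ranks(sub))
    return keep, pearson, spearman
```

For data with many tied values, a rank function that averages ties would be the safer choice.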


We examined the distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs).
The distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs) was examined.
CCC (mean=0.14, median=0.08, sd=0.15) has a much more skewed distribution than Pearson (mean=0.31, median=0.24, sd=0.24) and Spearman (mean=0.39, median=0.37, sd=0.26).
The coefficients reach a cumulative set containing 70% of gene pairs at different values (Figure @fig:dist_coefs b), $c=0.18$, $p=0.44$ and $s=0.56$, suggesting that for this type of data, the coefficients are not directly comparable by magnitude, so we used ranks for further comparisons.
In GTEx v8, CCC values were closer to Spearman and vice versa than either was to Pearson (Figure @fig:dist_coefs c).
We also compared the Maximal Information Coefficient (MIC) in this data (see [Supplementary Note 1](#sec:mic)).
We found that CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)).
The Maximal Information Coefficient (MIC) was also compared in this data (see [Supplementary Note 1](#sec:mic)).
CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)).
MIC, an advanced correlation coefficient able to capture general patterns beyond linear relationships, represented a significant step forward in correlation analysis research and has been successfully used in various application domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
These results suggest that our findings for CCC generalize to MIC; therefore, in the subsequent analyses, we focus on CCC and linear-only coefficients.
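
Because the coefficients are not directly comparable by magnitude, the comparison earlier in this section relies on the value each coefficient reaches at a given cumulative fraction of gene pairs, and on ranks. A minimal sketch of both steps is shown below; the function names are hypothetical, and ties in the rank transform are broken by position rather than averaged.

```python
import numpy as np


def value_at_cumulative_fraction(coefs, fraction=0.70):
    """Absolute coefficient value below which `fraction` of the pairs fall."""
    return float(np.quantile(np.abs(np.asarray(coefs, dtype=float)), fraction))


def to_percentile_ranks(coefs):
    """Map absolute coefficient values to percentile ranks in (0, 1]."""
    a = np.abs(np.asarray(coefs, dtype=float))
    order = a.argsort(kind="stable")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks / len(a)
```

Applying `to_percentile_ranks` to each coefficient's vector of gene-pair values puts all three coefficients on the same scale, so a pair can be called "high" or "low" relative to the other pairs for that coefficient.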

@@ -40,10 +40,10 @@ A logarithmic scale was used to color each hexagon.
](images/coefs_comp/gtex_whole_blood/upsetplot-main.svg "Intersection of gene pairs"){#fig:upsetplot_coefs width="100%"}


While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure @fig:upsetplot_coefs a).
While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure 2A).
There were also gene pairs with a high Pearson value and either low CCC (1,075), low Spearman (87) or both low CCC and low Spearman values (531).
However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later).
We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees with Pearson, Spearman or both.
However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure 2B and analyzed later).
We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure 2A, right) where CCC disagrees with Pearson, Spearman or both.

![
**The expression levels of *KDM6A* and *UTY* display sex-specific associations across GTEx tissues.**
@@ -55,4 +55,4 @@ The following three gene pairs (*UTY* - *KDM6A*, *RASSF2* - *CYTIP*, and *AC0685
In particular, genes *UTY* and *KDM6A* (paralogs) show a nonlinear relationship where a subset of samples follows a robust linear pattern and another subset has a constant (independent) expression of one gene.
This relationship is explained by the fact that *UTY* is on chromosome Y (Yq11) whereas *KDM6A* is on chromosome X (Xp11), and samples with a linear pattern are males, whereas those with no expression for *UTY* are females.
This combination of linear and independent patterns is captured by CCC ($c=0.29$, above the 80th percentile) but not by Pearson ($p=0.24$, below the 55th percentile) or Spearman ($s=0.10$, below the 15th percentile).
Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs (Figure @fig:gtex_tissues:kdm6a_uty).
Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs.