A post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

SONGDONGYUAN1994/ClusterDE

ClusterDE


The R package ClusterDE is a post-clustering DE method for controlling the false discovery rate (FDR) of identified between-cell-type DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cell type, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Detailed tutorials illustrating the functionalities of ClusterDE are available on the package website. The following figure summarizes the usage of ClusterDE:

Rather than being a new pipeline, ClusterDE works as an add-on to popular pipelines such as Seurat. For more details about ClusterDE, see our manuscript on bioRxiv.

Changelog

  • 2024-06-29 Important changes
    • Add functions for spatial clustering

The motivation and application of ClusterDE: For the Seurat function FindMarkers, the authors point out: "p-values should be interpreted cautiously, as the genes used for clustering are the same genes tested for differential expression." This is the "double-dipping" issue. Because the clustering has already used your expression data, if the clustering results are inaccurate, the discovered DE genes may not reflect a separation into discrete cell types but rather other variation in your data (e.g., cell cycle, total UMI, or other sources you are unaware of). Such variation may still be biological, but it does not define discrete cell states.

ClusterDE aims to correct the double-dipping issue when comparing two dubious clusters, that is, clusters for which conventional DE analysis cannot tell you whether they are two discrete cell types or just an artifact of your clustering algorithm. ClusterDE controls the false discoveries in DE and prioritizes the true cell-type markers.

Note: the current version focuses on one-vs-one comparisons.

Installation

To install the development version from GitHub, please run:

if (!require("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("SONGDONGYUAN1994/scDesign3")
devtools::install_github("SONGDONGYUAN1994/ClusterDE")

Please note that ClusterDE is a wrapper around scDesign3. Therefore, you can also use scDesign3 directly to "design" your own synthetic null data. To better understand scDesign3, see our manuscript in Nature Biotechnology:

Song, D., Wang, Q., Yan, G. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol (2023).

Quick Start

The following code is a quick example of how to generate the synthetic null data. The input should be a gene-by-cell matrix containing the two clusters you want to compare. If the input matrix contains count data (especially UMI counts), nb (negative binomial) is usually the appropriate choice for family. Generating the synthetic null data is relatively time-consuming; to speed it up, you may use the fast version (fastVersion = TRUE).

data(exampleCounts)
nullData <- constructNull(mat = exampleCounts,
                          family = "nb",
                          formula = NULL,
                          extraInfo = NULL,
                          nCores = 1,
                          parallelization = "mcmapply",
                          fastVersion = FALSE,
                          corrCut = 0.2,
                          BPPARAM = NULL)

The parameters of constructNull() are:

  • mat: The input gene-by-cell matrix. It can be a sparse matrix.
  • family: A string naming the distribution to use when fitting the model. Must be one of 'poisson', 'nb', 'zip', 'zinb' or 'gaussian'.
  • formula: A string giving the formula for the mu parameter. It defines the relationship between gene expression in the synthetic null data and the extra covariates. Default is NULL (cell type case). For example, if your input is spatial data with X, Y coordinates, the formula can be "s(X, Y, bs = 'gp', k = 4)".
  • extraInfo: A data frame of the extra covariates used in formula, for example the 2D spatial coordinates. Default is NULL.
  • nCores: An integer. The number of cores to use. Using more cores can substantially speed up the computation.
  • parallelization: A string indicating the parallelization function to use. Must be one of 'mcmapply', 'bpmapply', or 'pbmcmapply', which correspond to the parallelization functions in the packages 'parallel', 'BiocParallel', and 'pbmcapply' respectively. The default value is 'pbmcmapply'.
  • fastVersion: A logical value. If TRUE, the fast approximation is used.
  • corrCut: A numeric value. The cutoff on the proportion of non-zero values a gene must have to be used in modelling correlation. Correlations are hard to estimate reliably for lowly expressed genes.
  • BPPARAM: A MulticoreParam object or NULL. When parallelization = 'mcmapply' or 'pbmcmapply', this parameter must be NULL. When parallelization = 'bpmapply', this parameter must be a MulticoreParam object from the package 'BiocParallel'. The default value is NULL.

The output of constructNull() is a new gene-by-cell matrix in the same format as your input.
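To make the idea of a one-cell-type synthetic null concrete, here is a toy base-R sketch. It is not how constructNull() works internally: it fits each gene's negative binomial marginal by the method of moments with all cells pooled together and samples genes independently, whereas the package relies on scDesign3 and also models gene-gene correlation (see corrCut). The function name toyNull is hypothetical, and the sketch assumes a dense count matrix with at least two cells.

```r
# Toy illustration only; toyNull() is a hypothetical helper, not a ClusterDE function.
toyNull <- function(mat) {
  nCells <- ncol(mat)
  nullMat <- t(apply(mat, 1, function(counts) {
    mu <- mean(counts)
    v <- var(counts)
    if (v > mu) {
      # Overdispersed gene: negative binomial with var = mu + mu^2 / size
      rnbinom(nCells, mu = mu, size = mu^2 / (v - mu))
    } else {
      # No overdispersion detected: fall back to Poisson
      rpois(nCells, lambda = mu)
    }
  }))
  dimnames(nullMat) <- dimnames(mat)
  nullMat
}

set.seed(1)
# Simulated "real" data with two groups of cells: genes 1-50 are shifted in cells 51-100
real <- matrix(rnbinom(200 * 100, mu = 5, size = 2), nrow = 200,
               dimnames = list(paste0("Gene", 1:200), paste0("Cell", 1:100)))
real[1:50, 51:100] <- real[1:50, 51:100] + rnbinom(50 * 50, mu = 10, size = 2)

# Every gene is resampled from a single pooled distribution, so the
# synthetic cells no longer form two groups
synthetic <- toyNull(real)
```

Because each gene is drawn from one distribution shared by all cells, any clusters later found in the synthetic matrix are artifacts of the clustering step, which is the property the real synthetic null is designed to have.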

The following figure briefly describes how ClusterDE generates the synthetic null data:

After obtaining the synthetic null data, apply to it the same clustering procedure and DE test that you used on your real data to get the null DE p-values (nullPvalues). Finally, callDE() compares the p-values from the null data with the p-values from the target data (real data). For illustration, here we use uniform random numbers as the p-values.

set.seed(123)
targetPvalues <- runif(10000)
nullPvalues <- runif(10000)
names(targetPvalues) <- names(nullPvalues) <- paste0("Gene", 1:10000)
res <- callDE(targetScores = targetPvalues,
              nullScores = nullPvalues,
              nlogTrans = TRUE,
              FDR = 0.05,
              correct = FALSE)
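With real data, both p-value vectors would come from one identical cluster-then-test routine, run once on the real matrix and once on the synthetic null matrix. Below is a minimal base-R sketch, assuming k-means with two centers on log-transformed counts and a per-gene Wilcoxon rank-sum test as stand-ins for whatever pipeline (for example Seurat) you actually use. clusterThenTest is a hypothetical helper, not part of ClusterDE, and the null matrix here is simulated rather than produced by constructNull().

```r
# Hypothetical helper: cluster cells into two groups, then test every gene.
clusterThenTest <- function(mat, seed = 1) {
  set.seed(seed)
  logMat <- log1p(mat)
  # Cells are columns, so transpose before clustering
  cl <- kmeans(t(logMat), centers = 2, nstart = 10)$cluster
  pvals <- apply(logMat, 1, function(x) {
    # Ties in count data trigger harmless warnings about exact p-values
    suppressWarnings(wilcox.test(x[cl == 1], x[cl == 2])$p.value)
  })
  pvals[is.na(pvals)] <- 1  # genes that are constant across cells
  pvals
}

set.seed(123)
realMat <- matrix(rnbinom(100 * 60, mu = 5, size = 2), nrow = 100,
                  dimnames = list(paste0("Gene", 1:100), paste0("Cell", 1:60)))
# Stand-in for the output of constructNull(); simulated here for illustration
nullMat <- matrix(rnbinom(100 * 60, mu = 5, size = 2), nrow = 100,
                  dimnames = dimnames(realMat))

targetPvalues <- clusterThenTest(realMat)
nullPvalues <- clusterThenTest(nullMat)
```

The important design point is that the routine is identical for both matrices, so any bias introduced by clustering and testing on the same data affects the target and null p-values alike.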

The parameters of callDE are:

  • targetScores: A named numeric vector of the DE scores from the target data, e.g., the p-values between two clusters from the real data.
  • nullScores: A named numeric vector of the DE scores from the synthetic null data, e.g., the p-values between two clusters from the null data.
  • nlogTrans: A logical value. If the input scores are p-values, take the -log10 transformation, because Clipper requires that larger scores represent more significant DE. Default is TRUE.
  • FDR: A numeric value of the target False Discovery Rate (FDR). Default is 0.05.
  • correct: A logical value. If TRUE, perform the correction to make the distribution of contrast scores approximately symmetric. Default is FALSE.

The output of callDE is a list containing the target FDR, the DE genes, and a detailed summary table.
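The parameter descriptions mention Clipper and contrast scores. As a rough sketch of that general idea (an assumption about Clipper-style thresholding, not a copy of callDE's implementation): each gene gets a contrast score equal to its target score minus its null score, and the cutoff is the smallest value t at which the estimated false discovery proportion, (1 + number of contrasts ≤ -t) / (number of contrasts ≥ t), is at or below the target FDR. The name toyCallDE and the planted signal are hypothetical.

```r
# Hypothetical helper illustrating contrast-score thresholding; not a ClusterDE function.
toyCallDE <- function(targetP, nullP, fdr = 0.05) {
  stopifnot(identical(names(targetP), names(nullP)))
  # Larger score = stronger evidence, hence -log10 of the p-values
  contrast <- -log10(targetP) - (-log10(nullP))
  candidates <- sort(unique(abs(contrast[contrast != 0])))
  # Estimated false discovery proportion at each candidate cutoff
  fdp <- vapply(candidates, function(t) {
    (1 + sum(contrast <= -t)) / max(sum(contrast >= t), 1)
  }, numeric(1))
  ok <- which(fdp <= fdr)
  if (length(ok) == 0) return(character(0))
  # Smallest cutoff that meets the target FDR
  names(contrast)[contrast >= candidates[ok[1]]]
}

set.seed(123)
n <- 2000
targetP <- runif(n)
nullP <- runif(n)
names(targetP) <- names(nullP) <- paste0("Gene", seq_len(n))
planted <- names(targetP)[1:200]
targetP[planted] <- 1e-10  # artificial true markers for illustration
hits <- toyCallDE(targetP, nullP, fdr = 0.05)
```

Genes whose null p-value is about as small as their target p-value end up with contrast scores near zero and are not called, which is how the synthetic null absorbs signal created by the clustering step itself.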

Tutorials

For detailed tutorials, please check the website. The tutorials demonstrate the application of ClusterDE in two cases: a cell line dataset (in which no distinct cell types exist) and a PBMC dataset.

Contact

Questions and suggestions about ClusterDE are welcome! Please report them on the issues page, or contact Dongyuan Song ([email protected]).

Related Papers

Other methods for double dipping problem
