-
Notifications
You must be signed in to change notification settings - Fork 5
/
manuscript.Rmd
343 lines (189 loc) · 70 KB
/
manuscript.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
title: "Normalizing need not be the norm: count-based math for analyzing single-cell data"
output:
#word_document: default
#bookdown::word_document2: default
bookdown::pdf_document2:
keep_tex: true
latex_engine: pdflatex
#bookdown::html_document2: default
toc: false
csl: theory_in_biosciences.csl
link-citations: yes
bibliography: countland.bib
header-includes:
\usepackage[left]{lineno}
\linenumbers
---
```{r global options, include = FALSE}
knitr::opts_chunk$set(echo=FALSE, include = FALSE, warning=FALSE, message=FALSE, cache=FALSE)
```
**Samuel H. Church^1,^\*, Jasmine L. Mah^1^, Günter Wagner^1,2,3,4^, Casey W. Dunn^1^**
^1^ Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
^2^ Yale Systems Biology Institute, Yale University, New Haven, CT, USA
^3^ Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA
^4^ Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
\* Corresponding author, [email protected]
<!-- \setstretch{1.5} -->
# Abstract {-}
Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package `countland`, available in `python` and `R`.
# Introduction
Counting transcripts in high-throughput assays is now both routine and highly productive for many lines of research [@wang2009rna;@van2013rna]. Next generation RNA sequencing assays, including single cell RNA sequencing (scRNA-seq), measure the abundance of gene transcripts across samples [@saliba2014single;@liu2016single]. In most assays, sequencing reads are collected, mapped to known genes, and measurements are reported as read counts per gene in a given sample (or in many modern assays, as counts of unique molecular identifiers, UMIs) [@van2013rna;@van2019rna;@luecken2019current]. When RNA samples are drawn from bulk tissue, the number of counts is typically large enough that counts for all but the rarest transcripts can be considered as measurements of proportional abundances and described with rational numbers [@van2013rna;@van2019rna;@wagner2012measurement]. As such, counts for each sample are often normalized to sequencing depth, scaled by a multiplier (e.g. 10,000), and log-transformed [@luecken2019current]. In recent years, single-cell and single-nucleus mRNA sequencing (scRNA-seq and snRNA-seq) have become practical and commonly applied assays [@saliba2014single;@liu2016single]. Where bulk tissue RNA-seq projects typically measure RNA for a handful of samples, scRNA-seq data are generated for thousands of cells in a single experiment. However, because the total counts per cell are much smaller than the counts per sample in bulk experiments, there are problems with interpreting counts as measurements of proportional abundances [@townes2019feature;@vallejos2017normalizing]. This is because for typical scRNA-seq data, the vast majority of UMI count values per gene and sample are zero, and the majority of non-zero counts are very small [@hicks2018missing].
The first step in most scRNA-seq analyses is to move data out of count-space by rescaling and transforming values. In the process, the measurements are transformed from natural numbers into rational numbers (numbers with fractional parts). The motivation of these steps is to reshape the data so that cells with similar biological proportions of transcripts end up with similar transformed measurements [@robinson2010scaling;@luecken2019current] and fall closer to each other in the multidimensional space of gene expression. However, a growing chorus has raised warnings that these transformations lead to unintended distortions [@hicks2018missing;@townes2019feature]. One of the primary issues is that stochasticity dominates the sampling process in this low-count regime, so that for many genes observed counts are a poor proxy for proportional abundance [@townes2019feature]. Because logarithmic transformations cannot be applied to zero values, additional steps of adding arbitrary pseudocounts (e.g. +1 to all values) are common, but this can introduce new biases [@lun2018overcoming;@townes2019feature]. Furthermore, rescaling low count values can invoke a false sense of confidence in relative gene abundances. From a philosophical point of view, transformations that lead to negative or fractional counts, that have no physical interpretation, violate the central tenet of measurement theory that data should correspond to the physical reality they represent [@houle2011measurement].
Though problems with transforming scRNA-seq data out of count-space have been well described, there is little consensus on how to remedy them. Recent suggestions include embracing zero values by converting single-cell data to binary zero/non-zero values [@qiu2020embracing], invoking alternative transformations (e.g. square root, proportional fitting) for variance-stabilization [@booeshaghi2022depth], or avoiding the use of normalization by modeling relative expression [@sarkar2021separating], such as with a generalized linear model (GLM), implemented with `Seurat's` `sctransform` [@hafemeister2019normalization] and in `GLM-PCA` [@townes2019feature]. Here we argue that there is great value in analyzing the data as what they are – counts of transcripts sampled from the total population of mRNA – rather than as estimates of relative expression. We present and evaluate an analytical approach that preserves the original measured properties of the data by considering them in a vector space over natural numbers (see Appendix 1), rather than transform them into rational numbers. Our objective is to find innovative solutions by integrating measurement theory into our analytical thinking. We demonstrate that a restricted algebra that is closed over the natural numbers, requiring no scaling or transforms that would result in values other than natural numbers, is sufficient to perform many of the key tasks in scRNA-seq analysis. These tasks include assessing cell similarity, measuring differential gene expression, and identifying clusters of cells with shared signatures of expression.
We present our implementation of count-based scRNA-seq analysis in the `python` and `R` package `countland`, available at `github.com/shchurch/countland`. This software seeks to complement existing approaches to model counts (e.g. `sctransfrom` in `Seurat` [@hafemeister2019normalization] , `GLM-PCA` [@townes2019feature] ) by offering an alternative that sidesteps estimates of relative expression altogether. We test this approach on benchmark data and show that it can achieve the standard objectives of scRNA-seq analysis, avoid pitfalls associated with transformation-based workflows, and improve the interpretability of results.
# Building intuition for count matrices
By increasing our familiarity with the structure and composition of a count matrix we can build our intuition about how common transformations are likely to reshape it. Count matrices of scRNA-seq data encode the number of counted transcripts for each gene in each cell (Fig. 1). Here we show genes as rows and cells as columns, though conventions vary (typically, cells are rows in `python` and columns in `R`). Because values within the matrix represent observations of physical transcripts by sequencing instruments, they are always either zero or positive integers.
![**Single-cell RNA count data are mostly zeros and small integers.** Visualizing a portion of the UMI-count matrix from the Silver standard PBMC dataset, described by Freytag _et al_ (2018) [@freytag2018comparison]. After count values of 0 (white), the most common values are 1 (gray) and 2 (blue). Count values of 3 (red) and higher (yellow) are rare and concentrated in a few genes with relatively high expression.](figures_and_panels/Figure_1_V2-01.png)
Visualizing a raw count matrix shows that the dominant feature is the presence of a large number of zeros (Fig. 1). These zeros can be considered as having three possible origins: biological, sampling, or technical [@silverman2020naught]. Biological zeros are the true absence of a given transcript in the cell, and may be the result of a gene not being expressed in a given cell or all transcripts of a gene having been degraded by the time of observation. Sampling zeros arise because we count only a small fraction of all transcripts in a given cell, so by chance many genes that have transcripts in the cell are not detected [@jiang2022statistics]. Technical zeros arise from artifacts that eliminate counts for some genes in some cells, and can be introduced during cell preparation, mapping, or other analysis steps. Several modern techniques such as using unique molecular identifiers (UMIs) have helped bring technical zeros to within distributional models of sampling [@svensson2020droplet;@cao2021umi]. Increased sequencing depth can reduce the number of sampling zeros [@jiang2022statistics]. Despite this, the benchmark for scRNA-seq is a data matrix with >90% of values zero, compared to 10-40% in bulk RNA-seq experiments [@jiang2022statistics].
Of the remaining non-zero values, most are of very small magnitude. In the widely used PBMC3k benchmark dataset, for example, 69.8% of non-zero UMI count values are exactly one and 88.6% are less than five (Fig. 1). This means that for a given cell or gene, it is often true that there are no or very few values greater than two across all measurements. Cells and genes with count values greater than ten are the exception rather than the rule.
The fact that count matrices are dominated by zeros and low integer values has implications for how we attribute meaning to the data. RNA sequencing seeks to profile the proportional abundances of all transcripts at a given instant in time, but our instruments do not directly measure proportions. Instead, with scRNA-seq we sample a small fraction, usually around 10%, of the total transcripts per cell [@liu2016single]. It is generally assumed that the probability of sampling a transcript is determined by the proportional abundance of the transcripts of each gene [@jiang2022statistics]. Normalizing the data to the total counts for each cell invokes an additional assumption that sampling has been sufficient such that observed counts are a robust proxy for proportional abundance. However, at low magnitudes, sampling one additional transcript results in big jumps in the apparent relative abundances. Incrementing from one count to two, for example, results in a two-fold increase in proportional abundance, and incrementing a count from zero to one results in an infinite-fold increase in proportional abundance.
# A restricted algebra for counts
## Count-space
In mathematical terms, scRNA-seq count measurements occupy a high-dimensional vector space over the natural numbers (here defined as inclusive of zero, see Appendix 1). These are computationally encoded as unsigned integers. There is no such thing as a negative or non-integer transcript count, as these are physically meaningless.
Group theory is the field of modern algebra that concerns collections of numbers and operations on those numbers. In group theory, a group is defined as closed when specified operations on members of the group result in objects that are also members. Groups can then be classified by which operations they are closed for. One familiar type of group is a field, which is closed under the operations addition, subtraction, multiplication, and division. Rational numbers form a field because applying any of these operations to rational numbers always results in a rational number. Natural numbers, such as transcript counts, do not form a field, because natural numbers are not closed under subtraction or division. These operations can result in values that are outside the group (e.g. non-natural numbers like -2 or ½). Groups that are closed under addition and multiplication but not under subtraction or division are classified as semirings. See Appendix 1 for a formal treatment of the topic.
Constraining operations to those that are closed over a semiring provides a restricted algebra for analyzing scRNA-seq data that maintains its original count nature, guides intuitive thinking, and is still surprisingly rich. To understand what operations in the space can tell us, we can visualize observations of transcripts in cells as vectors in count-space. In count-space each gene forms an axis, perpendicular to all other gene axes. This space, therefore, has as many dimensions as there are genes. Each sequenced cell exists at a point in count-space with coordinates given by the counts for all genes. This point is the terminus of a vector emanating from the origin, which is the point where all gene counts are zero. We can conceptualize the cell sitting at the origin, for which all counts are zero, as the null cell.
![**A simulated RNA counting process in a system with two genes and three cell populations with five cells each.** One step equals one new observation of a transcript and moves the cell one unit along the axis of the corresponding gene. Colors indicate cell populations. A, Paths after counting ten transcripts. Note that after an equivalent number of counts, all vectors terminate along a diagonal line with a slope of -1. B, Paths after counting 100 transcripts. Vectors begin to separate based on the actual proportions of transcripts in different cell populations (different colors). C, Paths after counting 1000 transcripts. Vectors corresponding to different cell populations terminate in distinct regions of count-space, and vectors from the same population are more similar to each other than to vectors from different populations.](figures_and_panels/Figure_2_V3-01.png)
The measurement process can be thought of as estimating this vector by walking through count-space from the null cell (Fig. 2). As RNA sampling proceeds, each new transcript moves the cell away from the origin, always by one integer value in the positive direction parallel to the axis of the corresponding gene (Fig. 2). The final number of steps is determined by the sampling depth for the cell in question, and is the total number of counts for a cell.
In this count-space, there is no concept of Euclidean distance. Calculating Euclidean distance requires the operations of subtraction and square roots, under which the natural numbers are not closed. Visualizing vectors in a reduced two-dimensional count-space helps clarify why Euclidean distance is not the measure we might have in mind. Vectors with the same number of total counts terminate along a diagonal (Fig. 2), not along a circle, meaning their Euclidean distances from the origin, could they be calculated, would be non-equal. Instead of distance from the origin, we have another straightforward measure of vector magnitude: the total number of counts, which is the number of steps from the origin to the vector terminus. This is equivalent to the Manhattan distance from the origin to the terminus.
## Operations in count-space
The restricted algebra of semirings gives rise to operations that have clear, intuitive, physical interpretations for count data. Addition of two cell vectors produces a new vector where each element is the sum of the corresponding elements in the original vectors. This operation is the equivalent to the physical act of pooling transcripts from two cells. Adding all cell vectors is equivalent to pooling all cells into a single bulk-tissue RNA-seq experiment.
Multiplying two cell vectors also has a clear physical interpretation. The inner product, also known as the dot product, is an assessment of the informational similarity of two vectors, and can be used to compare pairs of cells. The dot product of two vectors is calculated by multiplying each corresponding element between the vectors and summing the resulting products. Because the dot product is calculated using only the operations of addition and multiplication, the natural numbers are closed under this operation.
The dot product is a way of comparing cell-cell transcriptome count data. When cell vectors have no counts in common (they are orthogonal), the dot product will be zero. When cell vectors have many counts in common, the dot product will be large. Because the dot product relies on multiplication and addition, the dot product scales linearly with vector length, here sequencing depth. This means that short vectors (cells with low total counts), will be less similar to all other vectors than longer vectors. This can lead to a counter-intuitive result as when identical short vectors have a smaller dot product with each other than they each have with a longer vector. But this property also produces the intuitive result that a vector with no counts at all (the null cell, terminating at the origin), will have a dot product of zero with all other vectors. The dot product is not strictly a measure of vector alignment, it instead reflects both alignment and vector length, favoring similarity between long vectors– those with many read counts.
This property of the dot product is advantageous when we want our measure of similarity to reflect our confidence in the results. In the case of RNA-seq, with more counts we can make stronger statements about the similarity of RNA-composition between cells. Using sequencing depth to scale similarity serves to down-weight shallowly sequenced cells because short vectors will have small dot products with all other vectors. For a cell with no counts, we can't make any claims about its similarity to other cells. Because of this scaling property, dot products can be used to find the most and least similar pairs of cells, but to interpret the absolute value of the dot product in context requires the original total number of counts.
## Subspaces and subsampling
Subsetting the count matrix is equivalent to projecting the data onto a subspace of the original count-space. This can be conceptualized as multiplying count values for certain cells or genes by zero, and is permitted using our restricted algebra. Projections onto one feature axis allow us to consider counts in only one gene, and are useful when dealing with questions about gene-level differences, such as finding genes with differential or highly variable expression. Projections onto vectors, such as projecting one cell vector onto another, are not generally possible because such a projection could result in non-natural coordinates (see Appendix 1).
It is not always necessary to standardize sequencing depth across cells in order to make useful comparisons across cells, as we demonstrate in our results below. However there are certain comparisons where we might expect heterogeneity in sequencing depth to obscure biological differences, for example, when calculating differential gene expression across cells [@robinson2010scaling;@vallejos2017normalizing;@van2019rna;@booeshaghi2022depth]. One method of standardizing sequencing depth that is coherent with our physical understanding of the measurement process is to subsample counts to an equivalent number of counts for all cell vectors (also referred to as downsampling) [@grun2015design;@ziegenhain2017comparative]. Because the scRNA-seq counting process stops at an arbitrary point, we can randomly subsample from our observations per cell to stop the process at a specific number of our choosing. By subsampling counts to the same number across cells, we can standardize sequencing depth without converting counts to proportions.
There are also cases where we might expect heterogeneity in the magnitude of expression across genes to obscure biological variation. The dot product is calculated by summing the product of each gene, meaning genes with substantially larger count values will contribute more to the dot product than genes with smaller values. This feature of the dot product can be useful for emphasizing genes with the largest dynamic range of counts, given that expression variance scales with magnitude [@ahlmann2021comparison;@booeshaghi2022depth]. But when highly expressed genes do not contain informative biological signal, this variance might drown out signal from genes with lower expression magnitudes. Accounting for the mean-variance relationship is the driving motivation behind several common models and transformations in scRNA-seq workflows [@ahlmann2021comparison]. A count-based approach for tuning down the signal of highly expressed genes is to limit their total counts by randomly subsampling observations per gene. We can accomplish this by establishing a threshold for total counts and subsampling that number of observations from any gene vector that exceeds that threshold.
With both subsetting and subsampling, there is a trade-off between reducing unwanted variation and throwing away data. Discarding counts can seem particularly painful given the already sparse nature of scRNA-seq data. We note that the standard scRNA-seq workflow already invokes subsetting the data, e.g. using only the top fraction of genes, ranked by variance of transformed counts. In our recommended count-based workflow, subsampling and subsetting are optional steps that should be applied depending on the research question. Furthermore, since subsampling is stochastically accomplished, we can leverage the original data by repeatedly sampling and reanalyzing to test whether results are robust to sampling effects.
![**Standard and count-based workflows.** The scRNA-seq workflow as described in the documentation for `scanpy` and `Seurat` is shown in orange. A count-based approach, as implemented in `countland`, is shown in blue, with optional subsetting, subsampling, and rank reduction steps adjacent.](figures_and_panels/Figure_3_V2-01.png)
# countland: a count-based approach to RNA analysis
This restricted algebra can be used to achieve all the standard analytical objectives of scRNA-seq without invoking normalization or transformation (Fig. 3). We have implemented these in our software package called `countland`, written for both `python` and `R`.
## Comparing cells
A common approach for comparing cells in normalization-based scRNA-seq workflows is to use principal component analysis (PCA) to rotate and project expression values onto a smaller dimensional space, and then calculate a neighborhood graph using the cell-cell distances in the PCA representation [@luecken2019current]. In a count-based approach, we avoid the use of distances by calculating the dot product of untransformed counts between all pairwise combinations of cells. This process generates a similarity matrix, rather than a distance matrix. This similarity matrix has several important properties: it is square, with the number of rows and columns determined by the number of cells, and it is symmetric, with diagonal values giving the dot product of each vector with itself, which is the sum of squares of all its elements. Off-diagonal values are the dot products between different cell vectors, and these values range from zero to any positive integer, unbounded on the maximum end as alignment increases.
Cell similarity can be visualized by applying spectral embedding to this matrix. Spectral embedding involves calculating the graph Laplacian of the matrix and then estimating the eigenvectors and eigenvalues of this graph [@ng2001spectral]. The cell similarity matrix can be visualized in two dimensions by plotting points using two eigenvectors (Fig. 4). This matrix can also be used to cluster cells by applying a k-means algorithm to the eigenvector matrix (spectral clustering). Here we demonstrate the use of spectral clustering of the dot product matrix for identifying cell populations using the Gold standard scRNA-seq benchmark dataset, described by Freytag _et al_ (2018) [@freytag2018comparison]. This dataset consists of 925 cells from three human adenocarcinoma cell lines (HCC827, H1975, H2228). Using `countland`, we recover these populations with high fidelity (adjusted rand index = 0.994, see benchmark results below). While spectral embedding includes operations that are not closed for natural numbers (e.g. subtraction and division), here they are operating on comparisons between cells (the dot product matrix) that do respect the count nature of the data.
![**Clustering cells by similarity using a count-based approach.** The Gold standard scRNA-seq dataset of 925 cells drawn from three human adenocarcinoma cell lines (HCC827, H1975, H2228) are visualized using spectral embedding of the pairwise dot product matrix between all cells. A, cells are colored according to the ground truth labels as described in Freytag _et al_ (2018) [@freytag2018comparison]. B, cells are colored according to the results of spectral clustering. The two points outlined in black are the only cells for which cluster labels differed from ground-truth labels.](figures_and_panels/Figure_4_V3-01.png)
## Comparing genes
Gene expression can be compared with a count-based approach as well. There are several count-based measures that provide insight into expression dynamics, and which are useful in evaluating dataset quality, identifying marker genes and highly variable genes, and calculating differential expression. These measures include:
- The total number of counts per gene, summed across cells.
- The maximum observed count value per gene.
- The number of cells where a gene was detected, as well as the number where detected at a count value larger than one or some other value.
- The number of unique count values (e.g. 0, 1, 2). Given the discrete nature of low-magnitude count values, this measure can provide insight into expression variability across cells.
- The largest number $n$, where there are $n$ cells with $\geq n$ counts for the gene in question. We refer to this measure as the _count index_, and it can be helpful for finding genes that frequently show higher count values, as compared to genes that are mostly detected at values of 1 or 2, with a few high-count exceptions.
The ideal marker gene for a cluster of cells can be defined as the gene with the highest differential expression between cells in the cluster and all other cells, or in an alternative approach, as the gene that is most specifically expressed in cluster cells [@booeshaghi2022depth]. A count-based approach for identifying marker genes by specificity is to count the number of cells with non-zero observations for each gene, and then calculating the difference between the fraction of these cells in a cluster versus the fraction that are not. The ideal marker gene would return a value of one, indicating it was expressed in all cluster cells and no others.
Differential expression can be assessed using rank-sums tests, similar to those invoked in other workflows, but here applied to raw counts instead of transformed proportions. In this method, counts for a given gene are ranked between cluster and non-cluster cells, the ranks for each group are summed, and a test statistic is calculated. This statistic is used to test the hypothesis that observations from cluster cells are larger than those from non-cluster cells. Because heterogeneity in count depth can influence the magnitude of expression for individual gene [@robinson2010scaling;@vallejos2017normalizing], subsampling to a standard sequencing depth prior to calculating differential gene expression is recommended.
## Reducing dimensionality and low-rank approximation
The high dimensionality of single-cell count matrices is one of the fundamental challenges to analyzing, interpreting, and visualizing these data [@chari2021specious]. There are several motives for embedding the count matrix in a lower-dimensional space, primary among them being data visualization. In other workflows, dimensional reduction is achieved through linear transformation (e.g. PCA and projection) of the already-transformed count matrix, followed by further non-linear reductions to two-dimensions using t-SNE or UMAP (Fig. 3). Recent reports suggest that rather than reducing the distances between similar cells, this approach results in extensive distortions of cell-cell distances [@chari2021specious].
Here we implement a count-based method for reducing the number of dimensions that are not based on Euclidean distance. Integer matrix factorization is an alternative approach to achieve a low-rank approximation of matrices that include only natural numbers [@lin2005integer]. Like other matrix factorizations (e.g. singular-value decomposition), this method seeks to find lower-rank matrices that can be multiplied together to approximate a higher-rank matrix, here the count matrix. Integer matrix factorization generates three matrices, termed $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{\Lambda}$, similar to the three matrices generated by singular-value decomposition on data consisting of real numbers. When applying integer matrix factorization on single-cell count data, matrix $\mathbf{U}$ has the dimensions m cells by $k$ features, with $k$ provided as the target rank, $\mathbf{V}$ has the dimensions $k$ features by $n$ genes, and $\mathbf{\Lambda}$ is a diagonal matrix of $k$ scaling factors.
Because of the discrete nature of count data, integer matrix factorization cannot be accomplished conventionally, but an approximation for this factorization has been proposed for other types of count-based data [@dong2018integer;@perros2018sustain]. Here we implement the algorithm for integer matrix approximation in `python` and `R`, and apply it to the approximation of single-cell count data. Following integer matrix approximation, cells can be embedded in a lower-dimensional space by multiplying the count matrix with matrix $\mathbf{V}$, scaled by $\mathbf{\Lambda}$. Cells can then be visualized by plotting their coordinates in two resulting embedding components (Fig. 5C). This is conceptually similar to calculating principal components for visualization using singular-value decomposition.
Given the problems inherent to data transformation, Townes _et al_ (2019) [@townes2019feature] recently proposed a generalized version of PCA (`GLM-PCA`) that takes advantage of the exponential family of likelihoods, and which can be applied directly to raw counts. This approach, and other generalized linear model based approaches such as `sctranscform` in `Seurat` [@hafemeister2019normalization], avoid normalization by modeling counts directly. Furthermore, like the count-based algebra described here, `GLM-PCA` additionally does not rely on Euclidean distances, allowing for visualization without normalization. We highlight these developments in GLM approaches as evidence of a growing consensus that depth-based normalization is problematic for scRNA-seq data. Here we take the conversation one step further by demonstrating that all transformations and models that result in an estimate of relative expression can be avoided, without limiting our analytical objectives.
![**Comparing dimensional reduction approaches.** Cells from the Gold standard benchmark dataset, colored in all panels according to ground truth labels. A, spectral embedding of the dot product matrix. B, reduced dimensionality using integer matrix approximation. C, generalized PCA (`GLM-PCA` [@townes2019feature]), as implemented in the R package `glmpca` [@townes2019feature]. D, uniform manifold approximation (UMAP) following data transformation, as implemented in `Seurat`. E, UMAP following `sctransform` in `Seurat`. Note that UMAP on transformed values results in the most apparent cluster boundaries, but it also results in one ground truth population (cyan) being split into two groups (see performance evaluation below).](figures_and_panels/Figure_5_V3-01.png)
# Evaluating count-based approaches to clustering
```{r source, include=F, cache=F}
load("tutorials_and_vignettes/R_tutorials_and_vignettes/performance_evaluation_gold_standard_results.RData")
load("tutorials_and_vignettes/R_tutorials_and_vignettes/performance_evaluation_silver_standard_results.RData")
```
## Gold standard
We tested the `countland` methods described above on published benchmark data to evaluate its performance relative to ground truth and consistency with published workflows. As shown in Figure 4, `countland` recovers the ground truth cell identities from the Gold standard dataset with high fidelity. Clustering accuracy was evaluated using three measures: the adjusted rand index (ARI), normalized mutual information (NMI), and cluster homogeneity. ARI and NMI both evaluate the similarity between two sets of clusterings, while homogeneity measures the degree to which an identified cluster contains only members of one ground truth group. Against ground truth, `countland` returns an ARI score of `r gold_standard_results[[1]][1,1]`, an NMI of `r gold_standard_results[[1]][1,2]`, and homogeneity of `r gold_standard_results[[1]][1,3]` (Table 1). These results were achieved using `countland` on the raw count matrix, demonstrating that without any form of normalization or data transformation, count-based methods can accurately identify cells by expression.
For the same dataset, `Seurat` returns a lower ARI (`r gold_standard_results[[1]][2,1]`) when using the default parameters, but after reducing the resolution used in clustering, `Seurat` returns an ARI of `r gold_standard_results[[1]][3,1]` (Table 1). The difference in scores with `Seurat` can be attributed to the software splitting one ground truth cell population into two clusters, as seen in the cyan colored cells in Fig. 5E.
Because the Gold standard dataset is significantly less sparse than many scRNA-seq datasets, we tested `countland`’s performance on a modified Gold standard dataset where each cell contained only 1% of the original number of observations. Both `countland` and `Seurat` can recover ground truth cell identities with high fidelity (Table 2). We note that when analyzing the more sparse dataset, `countland` achieves slightly better results when the number of components (Laplacian eigenvectors) used in spectral clustering are increased from 5 to 10 (ARI increased from `r gold_standard_results[[4]][3,1]` to `r gold_standard_results[[4]][4,1]`).
We tested performance on datasets derived from the Gold standard, modified to introduce the kinds of data heterogeneity that are often invoked to justify data transformations. First, we modified the sparse Gold standard dataset so that 100 of the 925 cells had their original measurements from before reducing counts to 1% of their original number. These 100 cells were randomly drawn 50 each from two of the three cell populations. With this dataset we observed a reduction in cluster accuracy analyzing counts directly, but after applying `countland`'s subsampling procedure to bring cells to a standardized sequencing depth, `countland` accurately recovers ground truth cell identities (Table 3). In contrast, `Seurat` with default parameters fails to accurately identify cells, despite normalizing data by sequencing depth (Table 3). Visualizing the `Seurat` results shows that clusters separate cells by sequencing depth as well as their original identities. This is likely because cells with more total counts have observations for many genes that are not observed in lower-count cells, a fact which depth normalization cannot account for. Using `sctransform` in `Seurat`, which avoids depth normalization by applying a generalized linear model to counts, avoids this problem and recovers ground truth cell identities (Table 3).
We also tested performance on a version of the sparse Gold standard dataset modified to have substantial heterogeneity in gene expression. To accomplish this we added ten highly expressed genes with no variation attributed to cell population. Count values were simulated for these genes using a Poisson distribution with a lambda value 10x larger than the largest observed mean count value across genes. Adding these genes with high count values resulted in decreased accuracy for `countland` when analyzing raw counts, but this effect was eliminated when highly expressed genes were subsampled to a predetermined maximum count value (Table 4). The performance of `Seurat` was robust to the addition of highly expressed genes (Table 4).
```{r gold, include=T, cache=F}
knitr::kable(gold_standard_results[[1]],caption="Gold standard dataset")
knitr::kable(gold_standard_results[[3]],caption="Gold standard dataset, with 1\\% of original counts")
knitr::kable(gold_standard_results[[5]],caption="Gold standard dataset, modified to have substantial heterogeneity in sequencing depth")
knitr::kable(gold_standard_results[[6]],caption="Gold standard dataset, modified to have substantial heterogeneity in gene expression")
```
## Silver standard
We evaluated the performance of `countland` and `Seurat` on the Silver standard dataset (version 3a) of peripheral mononuclear cells (PBMCs). This dataset does not include ground truth labels, instead it includes cell labels derived from similarity to a reference dataset [@zheng2017massively]. Identifying biologically relevant datasets with robust ground truth labels is an outstanding problem in scRNA-seq analysis [@ahlmann2021comparison]. When published cell labels are derived using a transformation-based approach, any comparison to these labels using an alternative approach will be inherently biased. In addition, previous evaluations of the Silver standard dataset that all methods fail to recover all labeled cell groups, with different methods returning an ARI between ~0.2 and ~0.6. We reanalyzed this dataset using the standard `Seurat` workflow and with `sctransform` (Table 5).
When subsampling highly expressed genes to a maximum total count value equal to the number of cells, countland returned similar results to other clustering software (Table 5). Subsampling genes even further returned higher scores, as did subsetting the data to remove the top 5% of genes altogether. This suggests that, for the Silver standard dataset, highly expressed genes are not useful in identifying cells according to the published labels, and may mask signal present in other genes. Subsampling cells to a standard sequencing depth did not result in an increase in clustering scores.
```{r silver, include=T, cache=F}
knitr::kable(silver_standard_results,caption="Silver standard dataset")
```
## Sponge dataset
We evaluated the performance of `countland` on a large and complex dataset from the sponge _Spongilla lacustris_, originally published in Musser _et al_., 2021 [@musser2021profiling]. This dataset is ideal given it size (>10,000 cells), the diversity of both transitional and differentiated cells, and the evidence that there is very little variation attributed to batch effects. This latter point is critical as we have not proposed in this paper a count-based method for integrating across batches. We tested `countland` on two datasets: the full dataset, and a dataset including only cells originally labeled as differentiated (excluding those labeled as transitional cells or archaeocytes and their relatives).
The originally published cell labels [@musser2021profiling] for this dataset were determined using a custom clustering approach derived from a normalization-based workflow. Clusters were first identified using a high resolution search, resulting in many subclusters, which were collapsed using a similarity threshold, with the objective of retaining predicted rare but distinct cell types without subdividing other larger cell clusters. Here we compared results derived using the dot product and spectral clustering in `countland` to those published labels (Fig 6).
We observed that using the dot product on counts is sufficient to distinguish many differentiated cell types without any modification to the original count data (no transformation, normalization, or subsampling). We recovered clusters that correspond to the originally published labels (Fig. 6), including amoebocytes (Amb), basopinacocytes (basPin), granulocytes (Grl), lophocytes (Lph), two clusters of metabolocytes (Met1 and 2), two clusters of myopeptidocytes (Myp1 and 2), and two clusters of incurrent pinacocytes (incPin1 and 2). We also recovered clusters of choanoblasts (Chb), apendopinacocytes (apnPin), and choanocytes+apopylar cells (Cho+Apo), but with the resolution set to the number of published labels (18 clusters), spectral clustering did not split these into the subclusters originally described (Chb1 and 2, apnPin1 and 2, Cho and Apo). Each of these pairs of subclusters were described as sister cell types in the original publication [@musser2021profiling]. Instead, at this resolution spectral clustering split other cell types into subclusters (Fig. 1D-F, gray clusters), including new subclusters of granulocytes (Grl) and of metabolocytes (Met). At this resolution, `countland` did not recover the small clusters corresponding to sclerophorocytes (Scp) or neuroid cells (Nrd), but both were recovered when the number of clusters is set to 20 or higher. Visualizing this dataset with spectral embedding, we observed that the first two components separate choanoblasts+choanocytes+apopylar cells from metabolocytes, and from a continuous grouping that connects apendopinacocytes to incurrent pinacocytes, and incurrent pinacocytes to myopeptidocytes.
Analyses on the full dataset, including transitional and undifferentiated cells, also returned comparable cell clusters to those originally published (Supplemental Fig. 1). Multiple clusters corresponding to differentiated and transitional cell types were identified by `countland`, however, the correspondence between the transitional clusters and published labels was lower than for differentiated cell types (Supp. Fig. 1D-F, gray clusters). This is expected, as boundaries between clusters for cells along a developmental trajectory are less reliably defined. In general, the results using a count-based approach show that applying the dot product and spectral clustering to the original UMI-count matrix is sufficient to reveal biologically relevant cellular diversity, even when working with large and complex datasets.
![**Analyzing a dataset of differentiated cells from the sponge _Spongilla lacustris_ using `countland`.** 2,845 cells are visualized using three embeddings: tSNE (A and D), as originally reported in Musser _et al_., 2021 [@musser2021profiling], `GLM-PCA` [@townes2019feature] (B and E), and spectral embedding (C and F). Points are colored using the originally published cell labels (A-C) or the results of spectral clustering (D-F). Colors between published labels and `countland` clusters were matched based on composition; `countland` clusters without a corresponding published label are colored in grayscale. Cell labels are as follows: amoebocytes (Amb), apendopinacocytes (apnPin), apopylar cells (Apo), basopinacocytes (basPin), choanoblasts (Chb), choanocytes (Cho), granulocytes (Grl), incurrent pinacocytes (incPin), lophocytes (Lph), metabolocytes (Met), myopeptidocytes (Myp), neuroid cells (Nrd), sclerophorocytes (Scp)](figures_and_panels/Figure_6_V1-01.png)
## Sequencing depth and zeros
As described above, the dot product is based on multiplication and addition, and therefore scales with vector length, here sequencing depth. Given that we are using the dot product to compare cells without any other form of normalization, we examined whether the resulting cell clusters reflect either sequencing depth or the fraction of zeros in the data. We visualized the Gold and Silver standard datasets as well as the large dataset from the sponge, _Spongilla lacustris_, coloring by clusters identified with countland, total counts, and number of zeros per cell (Supplemental Figs. 2 and 3). We found that clusters did not correspond to either total counts or the number of zero values for any of these datasets, with the exception of when gene variance of the Gold standard dataset was simulated to be very heterogeneous (Supplemental Fig. 2E). In this case, the gene subsampling approach we describe was able to reverse the effect (Supplemental Fig. 2F). We note that when sequencing depth was simulated to be very heterogeneous, the dot product and spectral clustering were able to accurately distinguish cell clusters without the need for subsampling.
## Performance time
We measured the performance time required for the dot product, embedding, and clustering steps implemented in `countland` (Supplemental Table 1). These operations complete in seconds for datasets with hundreds or thousands of cells, and minutes for datasets with 10,000 cells (6.5 minutes in `countland-py` and 14 minutes in `countland-R` on a dataset of 10,106 cells).
# Discussion
The best way forward can sometimes require first taking a few steps back. While many advances come by adding new steps to existing workflows, there is great value in simplification. Measurement theory [@wagner2012measurement] provides a roadmap for this: start by considering the measurement process by which numbers are assigned to attributes, and then consider how mathematical operations on the measurements correspond to physical processes. This is fundamental to understanding how our measurements address our research question. This approach is especially valuable for navigating the data-rich world of next generation sequencing and functional genomics. Current practices using these high-dimensional data often involve _ad hoc_ transformations that are applied with a general sense that something must be done to the data before downstream analyses are possible (e.g. the data must be transformed so that Euclidean distances become meaningful). We advocate for a more intentional approach, where data processing steps are only taken when they respect the measurement process and will be informative for the research question.
Current practices in single cell analysis have been highly optimized with one primary objective: to identify and visualize clusters of cells. But modern research with scRNA-seq data has far more ambitious and diverse objectives, including identifying developmental trajectories, comparing cells and genes across species and evolutionary time, and associating expression differences with phenotypes of interest. The result is that a substantial research effort is currently dedicated to things other than clustering, and this requires rethinking data preprocessing steps that were developed with clustering in mind. Here we have described an approach that seeks to sidestep, rather than patch, many current practices in order to avoid these challenges altogether.
Our results show that the restricted algebra of count-space is sufficient to perform many common scRNA-seq analysis tasks, while respecting the underlying count nature of the data. For example, assessing cell similarity with the dot product of transcript counts is a powerful tool to identify distinct clusters that correspond to real cell populations (Fig. 3). The restricted algebra implemented in `countland` works well, even when datasets are sparse, without taking any steps to account for heterogeneity in sequencing depth or gene expression magnitude. This indicates that it is not universally necessary to convert counts to fractional abundances or log-transform values in order to categorize cells by expression. And in cases where heterogeneity in those measures obscures biological differences, we provide a count-based solution via subsampling, and demonstrate that this solution can match or outperform transformation-based approaches.
However, we anticipate that there are scenarios when the transformation-based approach can recover clusters that are not identified using count-based approaches. There are almost certainly circumstances in which these transformations have the effect of exaggerating subtle but perhaps real biological differences between cell populations. The challenge is that these circumstances would be indistinguishable from situations where the same transformations instead result in exaggerating spurious and artifactual population structure. Count-based approaches, on the other hand, are easy to interpret; more similar cells are those that have more transcripts in common.
There is great potential for improvements in count-based approaches, especially in the area of spectral embedding for visualization and clustering of the dot product matrix. While distance-based clustering methods have undergone many generations of improvements, here we have used an out-of-the-box approach to spectral clustering, as implemented in `scikit-learn`, a standard clustering library in `python` [@pedregosa2011scikit] (which we reimplemented in `R`). When visualization is the ultimate goal, the current implementation of spectral embedding of the dot product often emphasizes the overall similarity of cells in terms of counts, leading to visualizations with largely continuous distributions of cells (Supp. Fig. 1C and F). Other methods that model counts directly, such as `GLM-PCA` [@townes2019feature], appear to strike a good balance between segregating clusters in the plot while still reflecting the underlying similarity in counts between cells (Fig. 1B and E, Supp. Fig 1B and E).
Future optimizations to the choice of graph Laplacian or the clustering algorithm that are specific to single-cell data may result in improved performance in visualizing and identifying cell populations using spectral embedding and clustering. However, we stress that visualizing, segregating, and labeling cells are only one small portion of the analyses that are possible with scRNA-seq data. Cells do not always fall into clear populations that conform to clustering algorithms, especially when considering cells over their developmental lifetimes. The dot product is an intuitive and powerful measure of cell-cell similarity, even when not invoking clustering. Comparisons of cells across timepoints, treatments, tissues, and species are all facilitated by leveraging a count-based approach that avoids _ad hoc_ transformations and improves interpretability.
# Acknowledgments
We thank Abby Skwara, Kevin O'Neil, Milo S. Johnson, Seth Donoughe, and members of the Dunn lab for comments on early drafts of this manuscript. We thank Jacob Musser for providing insight on analysis of the sponge dataset. We thank the anonymous reviewers for their recommendations. This material is based upon work supported by the NSF Postdoctoral Research Fellowship in Biology under Grant No. 2109502.
# Competing interests
The authors declare no competing interests.
# Methods
## countland
We implemented `countland` in `python` and `R`, available at https://github.com/shchurch/countland. Implementation in both languages will allow these methods to be more widely used, and will provide an opportunity to cross validate results.
The implementation of the restricted algebra in our package `countland` is summarized here using mathematical notation. See Appendix 1 for a background on the mathematical principles that justify our restricted algebra, and the git repository for more specifics.
Let $\mathbf{C}$ be the count matrix containing the raw measurements, and let $\mathbb{N}_0$ be the set of natural numbers, inclusive of 0, where $\mathbf{C}_{ij} \in \mathbb{N}_0$. $\mathbf{C}$ has the dimensions $m$ cells and $n$ genes. The established convention for $\mathbf{C}$ in `python` has cells as matrix rows and genes as columns, whereas in `R`, genes are rows and cells are columns. Our descriptions in the text of this manuscript use the `R` convention, unless stated otherwise.
### Dot product matrix
Let $\mathbf{D}$ be the $m \times m$ dot product matrix, where element $\mathbf{D}_{ij}$ is the dot product of cell $i$ with cell $j$. $\mathbf{D}$ is calculated using matrix multiplication as $\mathbf{D}=\mathbf{CC}^T$. Because this requires only multiplication and addition operations on the elements, this does not move the data out of count-space and will return only values contained in $\mathbb{N}_0$. Note that this is conceptually very similar to the calculation of the covariance matrix, which is at the heart of methods often used in scRNA-seq analyses, such as PCA. Covariance matrices are calculated as $\mathbf{XX}^T$, where $\mathbf{X}$ is the mean-centered transformation of $\mathbf{C}$. However calculating the mean and centering requires division and subtraction and moves the data out of count-space, resulting in negative and fractional values not contained in $\mathbb{N}_0$.
### Spectral clustering
Spectral embedding and clustering require a similarity matrix as input. Here, $\mathbf{D}$ is used as the cell-cell similarity matrix. Prior to spectral embedding, the diagonal elements of $\mathbf{D}$ are replaced with zeros to remove edges of the similarity graph that connect cells to themselves.
In the `python` implementation of `countland`, we used spectral clustering functions as written in `scikit learn`, modified so that it returns the eigenvalues as well as the eigenvectors of the graph Laplacian. As with `scikit learn`, in `countland` the user can input the target number of clusters as well as the number of components (eigenvectors of the graph Laplacian) that should be considered in the spectral clustering algorithm.
To ensure that methods and results are directly comparable between the `python` and `R` versions of `countland`, instead of using an existing package for spectral clustering (e.g. R:Spectrum [@john2020spectrum]), we re-implemented the `scikit learn` algorithm from scratch in `R`.
To facilitate the choice of the number of clusters, we implemented a common heuristic for determining the optimal numbers of clusters by considering the difference between eigenvalues $\lambda_{1},\dots,\lambda_{n}$ of the graph Laplacian. According to this heuristic, a reasonable choice for the optimal number of clusters $k$ is one where the eigenvalues $\lambda_{1},\dots,\lambda_{k}$ are relatively small, and the gap $|\lambda_{k+1}-\lambda_{k}|$ is relatively large. This heuristic is intended as a guideline, and other methods of selecting $k$ may also be useful (e.g. using _a priori_ biological information).
### Subsampling counts
Sequencing depth in scRNA-seq experiments is not standardized across samples, with the result that the sum of counts per cell $\sum_{j=1}^{n}C_{ij}$ varies. Sequencing depth can be standardized in `countland` by subsampling observations to a fixed number $x$ of total counts per cell, such that $\sum_{j=1}^{n}C_{ij}=x$. This is accomplished by flattening each cell vector to an array of transcripts, with the frequency of each transcript given by its count value, and then randomly choosing $x$ transcripts without replacement. The new cell vector is given by the sampled transcript frequency.
A similar approach is taken when subsampling genes to reduce the impact of heterogeneity in expression magnitude. In this case, each gene vector is flattened to an array of cell observations. For any gene with $>x$ total observations, $x$ observations are randomly chosen without replacement, and the new gene vector is given by the sampled cell frequency.
### Expression scores
The following expression scores for cell $i$ can be calculated in `countland`:
* total counts $=\sum_{j=1}^{n}C_{ij}$
* maximum count value $=\max_{1\leq j\leq n}(C_{ij})$
* The number of observations above zero $=\sum_{j=1}^{n}[C_{ij}>0]$ where $[F]=0$ if $F$ is false, and $1$ if $F$ is true
* The number of observations above one $=\sum_{j=1}^{n}[C_{ij}>1]$
* The number of observations above ten $=\sum_{j=1}^{n}[C_{ij}>10]$
* The number of unique count values $=$ unique$_{1\leq j\leq n}(C_{ij})$
* The count index $c=\max_{1\leq j\leq n}(C_{ij})$ where $\sum_{j=1}^{n}[C_{ij}\geq c] \geq c$
As well as the corresponding scores along the $n$ dimension.
We implemented two methods for comparing differential expression across clusters. Let $M'$ be the cells within the cluster and $M''$ be the rest. For the first method, we calculate the difference between proportions of non-zero observations \[\sum_{j=1,i\in M'}^{n}[C_{ij}>0] - \sum_{j=1,i\in M''}^{n}[C_{ij}>0]\]
For the second method, we use a Wilcoxon Rank-sum test (Mann Whitney U test) between $M'$ and $M''$, followed by Benjamini-Hochberg false discovery rate correction of p-values ($\alpha$ for significance set at 0.05).
### Integer matrix approximation
Integer matrix factorization is a method for estimating lower rank matrices that can be multiplied to approximate a higher rank matrix that contains discrete values, such as integers [@lin2005integer]. This factorization generates three matrices, $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{\Lambda}$. $\mathbf{U}$ has the dimensions $m \times k$ features, with $k$ provided by the user. $\mathbf{V}$ has the dimensions $k \times n$ genes, and $\mathbf{\Lambda}$ is a $k \times k$ diagonal matrix of scaling factors. This is conceptually similar to other matrix factorizations (e.g. singular-value decomposition) that generated matrices $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{\Sigma}$.
An approximation for integer matrix factorization has been implemented for `MATLAB` in the application [`SUSTain`](https://github.com/kperros/SUSTain) [@dong2018integer;@perros2018sustain]. This approximation takes advantage of an alternating integer least squares approach to partition the factorization into subproblems that can be solved in an optimally efficient manner [@dong2018integer]. The algorithm developed in `SUSTain` also defines a suborder for solving these problems that allows for intermediate results to be reused rather than recalculated [@perros2018sustain]. We re-implemented the required functions for integer matrix approximation on a matrix in `python` and `R`, and have made them available for public use at https://github.com/shchurch/integer_matrix_approximation.
As in `SUSTain`, integer matrix approximation is accomplished in three steps. First, parameters are set, including: the target rank of the factorized matrices, the upper and lower bounds of the integer values (default lower bound is zero), the maximum number of iterations (default is 1000000), and the stopping criterion (default is a difference of 0.0001). Second, initial matrices $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{\Lambda}$ are calculated by sampling integers from the higher rank matrix, ensuring that values remain within the bounds. Third, $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{\Lambda}$ are updated via the algorithm described in the corresponding `SUSTain` manuscript [@perros2018sustain].
## Performance evaluation
We evaluated `countland`’s performance using the Gold and Silver standard benchmark scRNA-seq datasets provided by Freytag _et al_ (2018) [@freytag2018comparison]. The Gold standard dataset includes data from 925 cells drawn from three human adenocarcinoma cell lines (HCC827, H1975, H2228). This dataset has ground truth cell labels, but is far less sparse than most scRNA-seq datasets (29% of values are non-zero). To evaluate performance on a more sparse dataset, we created a modified Gold standard dataset where each cell contained only 1% of its original number of observations.
For the Gold standard dataset, we evaluated performance after increasing the amount of heterogeneity in sequencing depth and in gene expression. To manipulate heterogeneity in sequencing depth, we restored 100 cells from the sparse Gold standard dataset to their original sequencing depth, drawing 50 cells each from two of the three cell populations present in the data. To manipulate heterogeneity in gene expression, we simulated counts for 10 genes, using a Poisson distribution with lambda values 10x larger than the largest observed mean count value across genes.
The Silver standard datasets are composed of fresh peripheral mononuclear cells (PBMCs). Here we have evaluated performance on dataset 3a, named by the original authors, which contains 4,310 cells labeled by matching cells to a reference dataset by expression. These labels do not constitute a ground truth but have been shown to match identities found using marker genes to classify cells [@zheng2017massively].
For the Gold and Silver standard datasets, we tested `countland` first on raw count matrices, meaning no subsampling or subsetting was performed prior to calculating the dot product and performing spectral embedding and clustering. We then tested the impact of subsampling cells to a standard sequencing depth, and of subsampling genes to a maximum total count value. In the case of the Silver standard dataset, we tested two thresholds for maximum gene counts: a threshold equal to the number of cells, and one equal to 1/2 the number of cells. We also tested the effects of dropping the top 5% of genes according to total counts. For the Gold standard dataset, we tested the impact of varying the number of components (Laplacian eigenvectors) used to identify clusters.
We evaluated the performance of `countland` on a large and complex dataset from the sponge _Spongilla lacustris_, originally published in Musser _et al_., 2021 [@musser2021profiling]. This dataset contains 10,106 cells from whole-body scRNA-seq. The published cell labels [@musser2021profiling] for this dataset were determined using a custom clustering approach, where clusters were first identified using a high resolution search in `Seurat`, resulting in many subclusters. These were then collapsed using a similarity threshold, with the objective of retaining predicted rare but distinct cell types without subdividing other larger cell clusters. To test `countland` on a dataset of only differentiated cells, we generated a dataset excluding cells labeled as transitional as well as archaeocytes and their relatives, sclerocytes and mesocytes (final cell number = 2,845).
We repeated the analytical steps originally described to generate a tSNE visualization of the sponge dataset. These steps included filtering cells with fewer than 200 unique features, and features present in fewer than three cells. We visualized this same data using the `GLM-PCA` package [@townes2019feature]. We tested `countland` on this filtered dataset with no subsampling and with the number of clusters set to the same number of originally published cell labels. To visualize corresponding cell types, we matched published cell labels to `countland` clusters by counting the number of shared cells between groups.
For the Gold and Silver standard datasets, we compared the clustering results using `countland` to results using a naive PCA on raw count matrices, clustered using _k_-means clustering with _k_ set at the ground truth number of cell populations. We also compared to results using `Seurat`, standard workflow and parameters unless described otherwise. All datasets and code required to perform analyses are provided along with the software at github.com/shchurch/countland.
## Performance time evaluation
We compare the time for executation of `Seurat`, `scanpy`, `countland-R`, and `countland-py` on the Gold standard, Silver standard, and sponge datasets. For `countland`, we measured the time required to calculate the dot product and perform spectral embedding and clustering. For `Seurat` and `scanpy`, we measured the time required for all steps recommended in the documentation for those pacakges, from normalization to clustering and visualization.
## Data availability
The `countland` package for `python` and `R`, as well as all data and code required to reproduce the results shown here is available at https://github.com/shchurch/countland.
# Appendix 1. Properties of vector spaces over zero and the natural numbers
This appendix provides additional background and information on mathematical concepts relevant to the analysis of transcript count matrices.
## The natural numbers, $\mathbb{N}_0$
The natural numbers are a group of all positive integers. Depending on the definition, this group may also include zero (i.e. all non-negative integers), a convention we follow here so that $\mathbb{N}_0$={0,1,2,3,...}. We can classify groups of numbers by the operations under which they are closed, meaning operations on elements in the group result in elements that are also in the group. The most familiar group is a field, which is closed under addition, subtraction, multiplication, and addition, and includes rational numbers $\mathbb{Q}$ and real numbers $\mathbb{R}$. The natural numbers form a semiring, because $\mathbb{N}_0$ is closed for multiplication and addition, but not subtraction or division.
$\mathbb{N}_0$ contains the additive identity element, 0, that can be added to any element to return the same element (e.g. $x+0=x$). $\mathbb{N}_0$ also contains the multiplicative identity element, 1, that can be multiplied with any element to return the same element (e.g. $x1=x$).
Inverses are elements that return the identity elements under specified operations. For example, the negative numbers are inverses under addition, because $x+(-x)$ returns the additive identity element, 0. However, because $\mathbb{N}_0$ does not contain negative numbers, it doesn’t have additive inverses. Similarly, reciprocal values are inverses under multiplication, because $x(1/x)$ returns the multiplicative identity element, 1. $\mathbb{N}_0$ likewise does not contain reciprocals, so therefore doesn’t have multiplicative inverses.
Subtraction can be defined as the addition of an additive inverse (a negative number), and division can be defined as multiplication with a multiplicative inverse (a reciprocal). Because these two inverses are not contained in $\mathbb{N}_0$, there is no subtraction or division. This is equivalent to the observation that $\mathbb{N}_0$ is not closed for subtraction or division.
## Vector spaces over $\mathbb{N}_0$
We can build the vector space over $\mathbb{N}_0$ as the group of all vectors $\mathbf{V}$ such that \[\mathbf{V} = {(a_1,a_2,a_3,...a_n): a_1,a_2,a_3,...a_n \in \mathbb{N}_0 }\]
The vector space over $\mathbb{N}_0$ can be envisioned as being restricted to the integer grid that is located over the upper right quadrant of a coordinate system, inclusive of the origin and axes. Certain operations are possible in this restricted space, while others are not. For example, we can apply the operation of the inner product (dot product) because this operation requires only multiplication and addition of vector elements. However, unlike a vector space over a field, in the vector space over $\mathbb{N}_0$ there are no angles between vectors. Calculating angles from dot product requires division by vector length, and $\mathbb{N}_0$ is not closed for division.
Furthermore, in count-space, vector length is not a Euclidean measure of distance as there is no equivalent measure of distance in a space without subtraction or square roots. Instead of Euclidean distance, we can use the number of integer steps in a positive direction as a measure of length, which is equivalent to the Manhattan distance between the vector terminus and the origin.
Vector rotation is not possible in count-space as it would require rotation matrices with new basis vectors that include negative elements. Vector reflections are possible, because we are free to permute our count matrix, as are shears of vectors. Some vector projections are possible, but not all. For example, if we project vector $\mathbf{b}$ onto vector $\mathbf{a}$, the result will be a multiple $q$ of $\mathbf{a}$, $q\mathbf{a}$. The value of that multiple will be equal to $q=(\mathbf{a}^T\mathbf{b})/(\mathbf{a}^T\mathbf{a})$, which requires division to calculate unless $\mathbf{a}^T\mathbf{a} = 1$. Over $\mathbb{N}_0$, that only happens when there is a single entry that is 1, i.e. when vector $\mathbf{a}$ is one of the original basis vectors. Therefore we can project vectors onto basis vectors, but not onto arbitrary vectors (e.g. we cannot project one cell vector onto another). Projecting onto basis vectors is the equivalent of multiplying some values by 0 while retaining others.
Without rotation and vector projection, it is clear that certain complex operations like principal component analysis that rely on these are not possible in this vector space.
## High-dimensional, low-magnitude vector spaces over $\mathbb{N}_0$
While the above pertains to vector spaces over $\mathbb{N}_0$ in general, there are interesting properties of the very high dimensional, low magnitude vector spaces that describe scRNA-seq count data.
Most gene expression datasets contain measurements for many thousands of genes, meaning this vector space has many thousands of dimensions. Furthermore, due to the sparse nature of these count matrices, it is difficult if not impossible to find a reasonable lower-dimensional approximation. In other words, because many features contain only a few, non-overlapping observations, there is no way to reduce the rank of this matrix without discarding features.
Because most measures of gene expression are low-magnitude integers, most cell vectors terminate only a few steps from the origin in any given direction. This does not mean that vectors are close to the origin overall. Vector length is non-Euclidean; it is calculated as the sum of steps back to the origin (Manhattan distance), not the distance along a diagonal (Euclidean distance).
Because the vast majority of values in the count matrix are zero, cell vectors are perpendicular to each other in many directions. The may result in the outcome that cell-cell similarity has more to do with the number and distribution of non-zero observations than expression magnitude. This has been demonstrated by the fact that binary transformations of scRNA-seq data to zero/non-zero contain enough information to recapitulate major patterns in the data [@qiu2020embracing].
\newpage
<!-- \small -->
<!-- \setstretch{1} -->
# References {-}