-
Notifications
You must be signed in to change notification settings - Fork 7
/
README.Rmd
168 lines (143 loc) · 6.39 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# DelayedMatrixStats
**DelayedMatrixStats** is a port of the
[**matrixStats**](https://CRAN.R-project.org/package=matrixStats) API to work
with *DelayedMatrix* objects from the
[**DelayedArray**](http://bioconductor.org/packages/DelayedArray/) package.
For a *DelayedMatrix*, `x`, the simplest way to apply a function, `f()`, from
**matrixStats** is`matrixStats::f(as.matrix(x))`. However, this "*realizes*"
`x` in memory as a *base::matrix*, which typically defeats the entire purpose
of using a *DelayedMatrix* for storing the data.
The **DelayedArray** package already implements a clever strategy called
"block-processing" for certain common "matrix stats" operations (e.g.
`colSums()`, `rowSums()`). This is a good start, but not all of the
**matrixStats** API is currently supported. Furthermore, certain operations can
be optimized with additional information about `x`. I'll refer to these
"seed-aware" implementations.
## Installation
You can install **DelayedMatrixStats** from Bioconductor with:
```{r gh-installation, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DelayedMatrixStats")
```
## Example
This example compares two ways of computing column sums of a *DelayedMatrix*
object:
1. `DelayedMatrix::colSums()`: The 'block-processing strategy', implemented in the **DelayedArray** package. The block-processing strategy works for any *DelayedMatrix* object, regardless of the type of *seed*.
2. `DelayedMatrixStats::colSums2()`: The 'seed-aware' strategy, implemented in the
**DelayedMatrixStats** package. The seed-aware implementation is optimized for both speed and memory but only for *DelayedMatrix* objects with certain types of *seed*.
```{r real_pkg_load, message = FALSE, echo = FALSE}
devtools::load_all()
library(sparseMatrixStats)
library(microbenchmark)
library(profmem)
```
```{r fake_pkg_load, message = FALSE, echo = TRUE, eval = FALSE}
library(DelayedMatrixStats)
library(sparseMatrixStats)
library(microbenchmark)
library(profmem)
```
```{r example, message = FALSE}
set.seed(666)
# Fast column sums of DelayedMatrix with matrix seed
dense_matrix <- DelayedArray(matrix(runif(20000 * 600), nrow = 20000,
ncol = 600))
class(seed(dense_matrix))
dense_matrix
microbenchmark(DelayedArray::colSums(dense_matrix),
DelayedMatrixStats::colSums2(dense_matrix),
times = 10)
profmem::total(profmem::profmem(DelayedArray::colSums(dense_matrix)))
profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(dense_matrix)))
# Fast, low-memory column sums of DelayedMatrix with sparse matrix seed
sparse_matrix <- seed(dense_matrix)
zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix))
sparse_matrix[zero_idx] <- 0
sparse_matrix <- DelayedArray(Matrix::Matrix(sparse_matrix, sparse = TRUE))
class(seed(sparse_matrix))
sparse_matrix
microbenchmark(DelayedArray::colSums(sparse_matrix),
DelayedMatrixStats::colSums2(sparse_matrix),
times = 10)
profmem::total(profmem::profmem(DelayedArray::colSums(sparse_matrix)))
profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(sparse_matrix)))
# Fast column sums of DelayedMatrix with Rle-based seed
rle_matrix <- RleArray(Rle(sample(2L, 200000 * 6 / 10, replace = TRUE), 100),
dim = c(2000000, 6))
class(seed(rle_matrix))
rle_matrix
microbenchmark(DelayedArray::colSums(rle_matrix),
DelayedMatrixStats::colSums2(rle_matrix),
times = 10)
profmem::total(profmem::profmem(DelayedArray::colSums(rle_matrix)))
profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(rle_matrix)))
```
## Benchmarking
An extensive set of benchmarks is under development at [http://peterhickey.org/BenchmarkingDelayedMatrixStats/](http://peterhickey.org/BenchmarkingDelayedMatrixStats/).
## API coverage
- ✔ = Implemented in **DelayedMatrixStats**
- ☑️ = Implemented in [**DelayedArray**](http://bioconductor.org/packages/DelayedArray/) or [**sparseMatrixStats**](http://bioconductor.org/packages/sparseMatrixStats/)
- ❌: = Not yet implemented
```{r, echo = FALSE, comment="", results = "asis"}
matrixStats <- sort(
c("colsum", "rowsum", grep("^(col|row)",
getNamespaceExports("matrixStats"),
value = TRUE)))
sparseMatrixStats <- getNamespaceExports("sparseMatrixStats")
DelayedMatrixStats <- getNamespaceExports("DelayedMatrixStats")
DelayedArray <- getNamespaceExports("DelayedArray")
api_df <- data.frame(
Method = paste0("`", matrixStats, "()`"),
`Block processing` = ifelse(
matrixStats %in% DelayedMatrixStats,
"✔",
ifelse(matrixStats %in% c(DelayedArray, sparseMatrixStats), "☑️", "❌")),
`_base::matrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "matrix_OR_array_OR_table_OR_numeric"),
"✔",
"❌"),
`_Matrix::dgCMatrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "dgCMatrix"),
"✔",
"❌"),
`_Matrix::lgCMatrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "lgCMatrix"),
"✔",
"❌"),
`_DelayedArray::RleArray_ (_SolidRleArraySeed_) optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "SolidRleArraySeed"),
"✔",
"❌"),
`_DelayedArray::RleArray_ (_ChunkedRleArraySeed_) optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "ChunkedRleArraySeed"),
"✔",
"❌"),
`_HDF5Array::HDF5Matrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "HDF5ArraySeed"),
"✔",
"❌"),
`_base::data.frame_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "data.frame"),
"✔",
"❌"),
`_S4Vectors::DataFrame_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "DataFrame"),
"✔",
"❌"),
check.names = FALSE)
knitr::kable(api_df, row.names = FALSE)
```