---
format:
  gfm:
    default-image-extension: ""
always_allow_html: true
execute:
  cache: true
  freeze: auto
---
<!-- README.md is generated from README.qmd. Please edit that file -->
```{r}
#| include: false
# load {SLmetrics}
library(SLmetrics)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 1280,
  fig.height = 6
)

# 1) store data
DT <- SLmetrics:::DT

# 2) set colors of the lattice
lattice::trellis.par.set(
  modifyList(
    lattice::standard.theme(),
    list(
      background = list(col = "#212830"),
      panel.background = list(col = "#2b3139"),
      reference.line = list(col = "#4a5563", lty = 1),
      axis.text = list(col = "#848e9c"),
      par.xlab.text = list(col = "#848e9c"),
      par.ylab.text = list(col = "#848e9c"),
      par.main.text = list(col = "#848e9c"),
      axis.line = list(col = "#848e9c"),
      superpose.line = list(col = c("#5d8ca8", "#d5695d", "#65a479", "#d3ba68")),
      superpose.symbol = list(col = c("#5d8ca8", "#d5695d", "#65a479", "#d3ba68"), pch = 16)
    )
  )
)
```
# {SLmetrics}: Machine learning performance evaluation on steroids <img src="man/figures/logo.png" align="right" height="150" alt="" />
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/SLmetrics)](https://CRAN.R-project.org/package=SLmetrics)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/SLmetrics?color=blue)](https://r-pkg.org/pkg/SLmetrics)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/serkor1/SLmetrics/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/serkor1/SLmetrics/actions/workflows/R-CMD-check.yaml)
[![macOS-clang](https://github.com/serkor1/SLmetrics/actions/workflows/macos-check-clang.yaml/badge.svg)](https://github.com/serkor1/SLmetrics/actions/workflows/macos-check-clang.yaml)
[![codecov](https://codecov.io/gh/serkor1/SLmetrics/branch/development/graph/badge.svg?token=X2osJDSRlN)](https://app.codecov.io/gh/serkor1/SLmetrics)
[![CodeFactor](https://www.codefactor.io/repository/github/serkor1/slmetrics/badge)](https://www.codefactor.io/repository/github/serkor1/slmetrics)
<!-- badges: end -->
[{SLmetrics}](https://serkor1.github.io/SLmetrics/) is a lightweight `R` package written in `C++` and [{Rcpp}](https://github.com/RcppCore/Rcpp) for *memory-efficient* and *lightning-fast* machine learning performance evaluation; it's like using a supercharged [{yardstick}](https://github.com/tidymodels/yardstick) but without the risk of soft to super-hard deprecations. [{SLmetrics}](https://serkor1.github.io/SLmetrics/) covers both regression and classification metrics and provides (almost) the same array of metrics as [{scikit-learn}](https://github.com/scikit-learn/scikit-learn) and [{PyTorch}](https://github.com/pytorch/pytorch), all without [{reticulate}](https://github.com/rstudio/reticulate) and the Python compile-run-(crash)-debug cycle.
Depending on the mood and the alignment of the planets, [{SLmetrics}](https://serkor1.github.io/SLmetrics/) stands for Supervised Learning metrics or Statistical Learning metrics. If [{SLmetrics}](https://serkor1.github.io/SLmetrics/) catches on, the latter will become the core philosophy and the package will include unsupervised learning metrics. If not, it will remain a {pkg} for Supervised Learning metrics, and a sandbox for me to develop my `C++` skills.
## :books: Table of Contents
* [:rocket: Getting Started](#rocket-getting-started)
+ [:shield: Installation](#shield-installation)
+ [:books: Basic Usage](#books-basic-usage)
* [:information_source: Why?](#information_source-why)
* [:zap: Performance Comparison](#zap-performance-comparison)
+ [:fast_forward: Speed comparison](#fast_forward-speed-comparison)
+ [:floppy_disk: Memory-efficiency](#floppy_disk-memory-efficiency)
* [:information_source: Basic usage](#information_source-basic-usage)
+ [:books: Regression](#books-regression)
+ [:books: Classification](#books-classification)
* [:information_source: Enable OpenMP](#information_source-enable-openmp)
+ [:books: Entropy without OpenMP](#books-entropy-without-openmp)
+ [:books: Entropy with OpenMP](#books-entropy-with-openmp)
* [:information_source: Installation](#information_source-installation)
+ [:shield: Stable version](#shield-stable-version)
+ [:hammer_and_wrench: Development version](#hammer_and_wrench-development-version)
* [:information_source: Code of Conduct](#information_source-code-of-conduct)
## :rocket: Getting Started
Below you’ll find instructions to install [{SLmetrics}](https://serkor1.github.io/SLmetrics/) and get started with your first metric, the Root Mean Squared Error (RMSE).
### :shield: Installation
```{r}
#| eval: false
## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref = 'main'
)
```
### :books: Basic Usage
Below is a minimal example demonstrating how to compute both unweighted and weighted RMSE.
```{r}
#| message: false
library(SLmetrics)

actual    <- c(10.2, 12.5, 14.1)
predicted <- c(9.8, 11.5, 14.2)
weights   <- c(0.2, 0.5, 0.3)

cat(
  "Root Mean Squared Error", rmse(
    actual    = actual,
    predicted = predicted
  ),
  "Root Mean Squared Error (weighted)", weighted.rmse(
    actual    = actual,
    predicted = predicted,
    w         = weights
  ),
  sep = "\n"
)
```
That’s all! Now you can explore the rest of this README for in-depth usage, performance comparisons, and more details about [{SLmetrics}](https://serkor1.github.io/SLmetrics/).
## :information_source: Why?
Machine learning can be a complicated task; the steps from feature engineering to model deployment require carefully measured actions and decisions. One low-hanging fruit to simplify this process is *performance evaluation*.
At its core, performance evaluation is essentially just comparing two vectors — a programmatically and, at times, mathematically trivial step in the machine learning pipeline, but one that can become complicated due to:
1. Dependencies and potential deprecations
2. Needlessly complex or repetitive arguments
3. Performance and memory bottlenecks at scale
[{SLmetrics}](https://serkor1.github.io/SLmetrics/) solves these issues by being:
1. **Fast:** Powered by `C++` and [Rcpp](https://github.com/RcppCore/Rcpp)
2. **Memory-efficient:** Everything is structured around pointers and references
3. **Lightweight:** Only depends on [Rcpp](https://github.com/RcppCore/Rcpp), [RcppEigen](https://github.com/RcppCore/RcppEigen), and [lattice](https://github.com/deepayan/lattice)
4. **Simple:** S3-based, minimal overhead, and flexible inputs
Performance evaluation should be plug-and-play and “just work” out of the box — there’s no need to worry about *quasiquotations*, *dependencies*, *deprecations*, or argument-dependent variations of the same functions when using [{SLmetrics}](https://serkor1.github.io/SLmetrics/).
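As a quick illustration of that last point, here is a minimal sketch of the interface (purely illustrative; the full, worked examples follow in the Basic usage section):

```r
# Minimal sketch of the S3-based interface: classification functions
# take a plain pair of <factor> vectors directly, with no extra ceremony.
library(SLmetrics)

actual    <- factor(c("a", "b", "a", "b"))
predicted <- factor(c("a", "b", "b", "b"))

# a 2 x 2 confusion matrix from the two vectors
cmatrix(
  actual    = actual,
  predicted = predicted
)
```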
## :zap: Performance Comparison
One, obviously, can't build an `R` package on `C++` and [{Rcpp}](https://github.com/RcppCore/Rcpp) without a proper pissing contest at the urinals. Below is a comparison of the execution time and memory efficiency of two simple cases that any {pkg} should be able to handle gracefully: computing a 2 x 2 confusion matrix and computing the RMSE[^1].
### :fast_forward: Speed comparison
```{r}
#| echo: false
lattice::xyplot(
  median ~ sample_size | measure,
  data = do.call(rbind, DT$speed),
  groups = expr,
  type = 'l',
  auto.key = list(columns = 2, col = "#848e9c"),
  scales = list(
    y = list(log = FALSE, relation = "free"),
    x = list(log = FALSE)
  ),
  xlab = "Sample Size (N)",
  ylab = "Median Execution Time (Microseconds)",
  panel = function(...) {
    lattice::panel.grid(...)
    lattice::panel.xyplot(...)
  }
)
```
As shown in the chart, [{SLmetrics}](https://serkor1.github.io/SLmetrics/) maintains consistently low(er) execution times across different sample sizes.
### :floppy_disk: Memory-efficiency
Below are the results for garbage collections and total memory allocations when computing a 2×2 confusion matrix (N = 1e7) and RMSE (N = 1e7) [^2]. Notice that [{SLmetrics}](https://serkor1.github.io/SLmetrics/) requires no GC calls for these operations.
```{r}
#| echo: false
# 1) prepare data
packages <- c("{SLmetrics}", "{yardstick}", "{MLmetrics}", "{mlr3measures}")
measures <- c("Confusion Matrix", "Root Mean Squared Error")
column_names <- c("Iterations", "Garbage Collections [gc()]", "gc() pr. second", "Memory Allocation (MB)")
```
```{r}
#| echo: false
# 1) extract data
DT_ <- DT$memory[[1]][, c("n_itr", "n_gc", "gc/sec", "mem_alloc")]
DT_$mem_alloc <- round(DT_$mem_alloc / (1024^2))
colnames(DT_) <- column_names
rownames(DT_) <- packages

knitr::kable(DT_, caption = "2 x 2 Confusion Matrix (N = 1e7)", digits = 2)
```
```{r}
#| echo: false
# 1) extract data
DT_ <- DT$memory[[2]][, c("n_itr", "n_gc", "gc/sec", "mem_alloc")]
DT_$mem_alloc <- round(DT_$mem_alloc / (1024^2))
colnames(DT_) <- column_names
rownames(DT_) <- packages

knitr::kable(DT_, caption = "RMSE (N = 1e7)", digits = 2)
```
In both tasks, [{SLmetrics}](https://serkor1.github.io/SLmetrics/) remains extremely memory-efficient, even at large sample sizes.
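If you want to reproduce numbers like these yourself, a minimal sketch with [{bench}](https://github.com/r-lib/bench) might look as follows. This is illustrative only — the data sizes and iteration counts are assumptions, and the actual benchmark scripts live in the repository's `data-raw/` directory (see the footnotes):

```r
# Sketch of a memory/speed benchmark with {bench} (illustrative only;
# the N and iteration count here are assumptions, not the published setup).
library(bench)
library(SLmetrics)

N <- 1e7
actual    <- rnorm(N)
predicted <- actual + rnorm(N, sd = 0.1)

# bench::mark() reports median time, mem_alloc, and gc counts per expression
bench::mark(
  `{SLmetrics}` = rmse(actual, predicted),
  iterations = 10,
  check = FALSE
)
```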
> [!IMPORTANT]
>
> From [{bench}](https://github.com/r-lib/bench) documentation: *Total amount of memory allocated by R while running the expression. Memory allocated outside the R heap, e.g. by `malloc()` or new directly is not tracked, take care to avoid misinterpreting the results if running code that may do this.*
## :information_source: Basic usage
In its simplest form, [{SLmetrics}](https://serkor1.github.io/SLmetrics/)-functions work directly with pairs of \<numeric\> vectors (for regression) or \<factor\> vectors (for classification). Below we demonstrate this on two well-known datasets, `mtcars` (regression) and `iris` (classification).
### :books: Regression
We first fit a linear model to predict `mpg` in the `mtcars` dataset, then compute the in-sample RMSE:
```{r}
# Evaluate a linear model on mpg (mtcars)
model <- lm(mpg ~ ., data = mtcars)
rmse(mtcars$mpg, fitted(model))
```
### :books: Classification
Now we recode the `iris` dataset into a binary problem ("virginica" vs. "others") and fit a logistic regression. Then we generate predicted classes, compute the confusion matrix and summarize it.
```{r}
# 1) recode iris
# to binary problem
iris$species_num <- as.numeric(
  iris$Species == "virginica"
)

# 2) fit the logistic
# regression
model <- glm(
  formula = species_num ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial(
    link = "logit"
  )
)

# 3) generate predicted
# classes
predicted <- factor(
  as.numeric(
    predict(model, type = "response") > 0.5
  ),
  levels = c(1, 0),
  labels = c("Virginica", "Others")
)

# 4) generate actual
# values as factor
actual <- factor(
  x = iris$species_num,
  levels = c(1, 0),
  labels = c("Virginica", "Others")
)
```
```{r}
# 5) generate
# confusion matrix
summary(
  confusion_matrix <- cmatrix(
    actual = actual,
    predicted = predicted
  )
)
```
## :information_source: Enable OpenMP
```{r}
#| echo: false
# Set column names for both
# examples
column_names <- c("Iterations", "Runtime (sec)", "Garbage Collections [gc()]", "gc() pr. second", "Memory Allocation (MB)")
```
> [!IMPORTANT]
>
> OpenMP support in [{SLmetrics}](https://serkor1.github.io/SLmetrics/) is experimental. Use it with caution, as performance gains and stability may vary based on your system configuration and workload.
You can control OpenMP usage within [{SLmetrics}](https://serkor1.github.io/SLmetrics/) with the `setUseOpenMP()` function. Below are examples demonstrating how to enable and disable OpenMP:
```{r}
# enable OpenMP
SLmetrics::setUseOpenMP(TRUE)
# disable OpenMP
SLmetrics::setUseOpenMP(FALSE)
```
To illustrate the impact of OpenMP on performance, consider the following benchmarks for calculating entropy on a 1,000,000 x 200 matrix over 100 iterations[^3].
### :books: Entropy without OpenMP
```{r}
#| echo: false
# 1) extract data
DT_ <- DT$OpenMP$FALSE_[, c("n_itr", "median", "n_gc", "gc/sec", "mem_alloc")]
DT_$mem_alloc <- round(DT_$mem_alloc / (1024^2))
DT_$median <- round(DT_$median, 2)
colnames(DT_) <- column_names

knitr::kable(DT_, caption = "1e6 x 200 matrix without OpenMP", digits = 2)
```
### :books: Entropy with OpenMP
```{r}
#| echo: false
# 1) extract data
DT_ <- DT$OpenMP$TRUE_[, c("n_itr", "median", "n_gc", "gc/sec", "mem_alloc")]
DT_$mem_alloc <- round(DT_$mem_alloc / (1024^2))
DT_$median <- round(DT_$median, 2)
colnames(DT_) <- column_names

knitr::kable(DT_, caption = "1e6 x 200 matrix with OpenMP", digits = 2)
```
## :information_source: Installation
### :shield: Stable version
```{r}
#| echo: true
#| eval: false
## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref = 'main'
)
```
### :hammer_and_wrench: Development version
```{r}
#| echo: true
#| eval: false
## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref = 'development'
)
```
## :information_source: Code of Conduct
Please note that the [{SLmetrics}](https://serkor1.github.io/SLmetrics/) project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
[^1]: The source code is available [here](https://github.com/serkor1/SLmetrics/blob/development/data-raw/classification_performance.R) and [here](https://github.com/serkor1/SLmetrics/blob/development/data-raw/regression_performance.R).
[^2]: The source code is available [here](https://github.com/serkor1/SLmetrics/blob/development/data-raw/memory_performance.R).
[^3]: The source code is available [here](https://github.com/serkor1/SLmetrics/blob/development/data-raw/OpenMP_performance.R).