-
Notifications
You must be signed in to change notification settings - Fork 53
/
strategy-functions.qmd
133 lines (106 loc) · 5.91 KB
/
strategy-functions.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# Three functions in a trench coat {#sec-strategy-functions}
```{r}
#| include = FALSE
source("common.R")
```
## What's the problem?
Sometimes a function that implements multiple strategies might be better off as independent functions.
Two signs that this might be the case:
- You're struggling to document how the arguments interact. Maybe you can set `a` and `b` and `a` and `c` but not `b` and `c`.
- The implementation of your function has a couple of big if branches that share relatively little code.
Splitting one complex multi-strategy function into multiple simpler functions can make maintenance and testing easier, while also improving the user experience.
I think of this the three (or more generally $n$) functions in a trench coat[^strategy-functions-1] problem, because the reality of the separate functions tends to become very obvious in time.
[^strategy-functions-1]: <https://tvtropes.org/pmwiki/pmwiki.php/Main/TotemPoleTrench>
## What are some examples?
- `forcats::fct_lump()` chooses between one of three lumping strategies depending on whether you supply just `n`, just `prop`, or neither, while supplying both `n` and `prop` is an error.
Additionally, the `ties.method` argument only does anything if you supply only `n`.
`fct_lump()` is hard to understand and document because it's really three smaller functions.
- Depending on the arguments used `library()` can load a package, it can list all installed packages, or display the help for a package.
- `diag()` is used to both extract the diagonal of a matrix and construct a matrix from a vector of diagonal values.
This combination of purposes makes its arguments hard to understand: if `x` is a matrix, you can `names` but not `nrow` and `ncol`; if `x` is a vector, you can use `nrow` and `ncol` but not `names`.
- `sample()` is used to both randomly reorder a vector and generate a random vector of specified length.
This function is particularly troublesome because it picks between the two strategies based on the length of the first argument.
- `rep()` is used to both repeat each element of a vector and to repeat the complete vector.
I discuss this more in @sec-cs-rep.
These functions all also suffer from the problem that the strategies are implicit, not explicit.
That's because they use either the presence or absence of different arguments or the type of an argument pick strategies.
This combination tends to produce particularly opaque code.
## How do I identify the problem?
Typically this problem arises as the scope of your function grows over time, and because the growth tends to be gradual it's hard to notice exactly when it becomes an issue.
One way to splot this problem is notice that your function consists of a big if statement where the branches share very little code.
An extreme example of this is the base `sample()` function.
As of R 4.3.0 it looks something like this:
```{r}
sample <- function(x, size, replace = FALSE, prob = NULL) {
if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) {
if (missing(size))
size <- x
sample.int(x, size, replace, prob)
} else {
if (missing(size))
size <- length(x)
x[sample.int(length(x), size, replace, prob)]
}
}
```
You can see that there are two branches that share very little code, and each branch uses a different default value for `size`.
This suggests it might be better to have two functions:
```{r}
sample_vec <- function(x, size = length(x), replace = FALSE, prob = NULL) {
# check_vector(x)
# check_number_whole(size)
x[sample.int(length(x), size, replace, prob)]
}
sample_int <- function(x, size = x, replace = FALSE, prob = NULL) {
# check_number_whole(x)
# check_number_whole(size)
x[sample.int(length(x), size, replace, prob)]
}
```
In other cases you might spot the problem because you're having trouble explaining the arguments in the documentation.
If it feels like
## How do I remediate past mistakes?
Remediating past mistakes is straightforward: define, document, and export one function for each strategy.
Then rewrite the original function to use those strategies, deprecating that entire function if desired.
For example, this is what `fct_lump()` looked like after we realised it was really the combination of three simpler functions:
```{r}
fct_lump <- function(f,
n,
prop,
w = NULL,
other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max")) {
if (missing(n) && missing(prop)) {
fct_lump_lowfreq(f, w = w, other_level = other_level)
} else if (missing(prop)) {
fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method)
} else if (missing(n)) {
fct_lump_prop(f, prop, w = w, other_level = other_level)
} else {
cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.")
}
}
```
We decided to supersede `fct_lump()` rather than deprecating it, so we kept old function around and working.
If we wanted to deprecate it, we'd need to add one deprecation for each branch:
```{r}
fct_lump <- function(f,
n,
prop,
w = NULL,
other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max")) {
if (missing(n) && missing(prop)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_lowfreq()")
fct_lump_lowfreq(f, w = w, other_level = other_level)
} else if (missing(prop)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_n()")
fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method)
} else if (missing(n)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_prop()")
fct_lump_prop(f, prop, w = w, other_level = other_level)
} else {
cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.")
}
}
```