-
Notifications
You must be signed in to change notification settings - Fork 3
/
17-apply.Rmd
154 lines (122 loc) · 4.16 KB
/
17-apply.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
layout: page
title: Intermediate R for reproducible scientific analysis
subtitle: Apply functions
minutes: 20
---
```{r, include=FALSE}
source("tools/chunk-options.R")
library("data.table")
gap <- fread("data/gapminder-FiveYearData.csv")
```
> ## Learning objectives {.objectives}
>
> * To learn how to use *apply* to automate tasks efficiently
> * To know the difference between `apply`, `lapply`, `sapply`, `tapply` and
> `mapply`.
>
### Vectorized task automation
Previously we introduced you to `for` loops: a basic programming construct that
is common across many programming languages. R has more optimised way of
automating tasks that are not only faster than for loops, but also take away the
pain of having to pre-define your results object.
The most common function you will encounter is `lapply`, and the closely related
`sapply`.
Lets take a look at the following `for` loop:
```{r, eval=FALSE}
for (cc in gap[,unique(continent)]) {
popsum <- gap[year == 2007 & continent == cc, sum(pop)]
print(paste(cc, ":", popsum))
}
```
It calculates the total population on each continent, then prints out the result.
If instead we want to save these results, we can either make a vector in advance
and save the results, or use one of the `apply` to take care of this detail for
us:
```{r}
results <- lapply(gap[,unique(continent)], function(cc) {
popsum <- gap[year == 2007 & continent == cc, sum(pop)]
popsum
})
names(results) <- gap[,unique(continent)]
results
```
`lapply` takes a vector (or list) as its first argument (in this case a vector
of the continent names), then a function as its second argument. This function
is then executed on every element in the first argument. This is very similar to
a for loop: first, `cc` stores the first continent name, "Asia", then runs the
code in the function body, then `cc` stores the second continent name, and runs
the function body, and so on. The code in the function body can be thought of in
exactly the same way as the body of the `for` loop. The result of the last line
is then returned to `lapply`, which combines the results into a list.
`sapply` is identical to `lapply`, except that it tries to simplify the results
object. If we run the same code with `sapply` instead of `lapply` the results
will be returned as a vector:
```{r}
results <- sapply(gap[,unique(continent)], function(cc) {
popsum <- gap[year == 2007 & continent == cc, sum(pop)]
popsum
})
names(results) <- gap[,unique(continent)]
results
```
### apply
The `apply` function is useful for matrix data: it allows you loop over either
the rows or columns of a matrix.
```{r}
# create some dummy data
r <- matrix(rnorm(10*4), nrow=10)
colnames(r) <- letters[1:4]
rownames(r) <- LETTERS[1:10]
```
```{r}
# Get the maximum value in each row:
apply(r, 1, max)
# and for each column:
apply(r, 2, max)
````
> ### means and sums {.callout}
>
> R has inbuilt functions for summing or calculating the mean of rows and
> columns: `colSums`, `colMeans`, `rowSums`, `rowMeans`. These are faster than
> writing your own `apply`!
>
> ### the return of apply {.callout}
>
> When the function given to `apply` returns a vector instead of a single value
> the results will always be combined into columns, even if running the
> function across the rows!
>
### mapply
The `mapply` function can be used to run a function with different combinations
of arguments. Let's take a look at an example:
```{r}
a <- 1:4
b <- 4:1
mapply(rep, a, b)
```
This is the same as running the following code:
```{r}
rep(a[1], b[1])
rep(a[2], b[2])
rep(a[3], b[3])
rep(a[4], b[4])
```
or the following `lapply` statement:
```{r}
lapply(1:4, function(ii) {
rep(a[ii], b[ii])
})
```
### tapply
The `tapply` function allows you to run a function on different groups within
a vector. Going back to our first example of the lesson, we can use `tapply` to
calculate the total population for each continent in 2007:
```{r}
gap2007 <- gap[year == 2007] # first filter by the year
tapply(
gap2007[,pop], # The column of population counts from the data frame
gap2007[,continent], # The column of continent labels for each entry
sum # The function to apply to the population vector within each continent
)
```