forked from datacarpentry/R-dplyr-ecology-archived
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_data_manipulation.Rmd
219 lines (159 loc) · 6.48 KB
/
02_data_manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
layout: page
title: R dplyr ecology
subtitle: Data manipulation
minutes: 45
---
> ## Learning Objectives
>
> * being able to select a column from a data frame
> * being able to filter rows on specific criteria
> * being able to order rows on specific criteria
> * being able to add a column as a function of existing columns
## Load data
```{r check-data, echo=FALSE}
if (!file.exists("data/surveys.csv")) {
download.file("http://files.figshare.com/1919744/surveys.csv",
"data/surveys.csv")
}
if (!file.exists("data/species.csv")) {
download.file("http://files.figshare.com/1919741/species.csv",
"data/species.csv")
}
```
```{r dplyr.init}
# if dplyr is not yet installed, uncomment the next line
# install.packages("dplyr")
library(dplyr)
```
Read in the data in comma separated file into a data frame.
```{r readSurveys}
surveys <- read.csv(file="data/surveys.csv")
```
## Selecting columns of data
In particular for larger datasets, it can be tricky to remember the column
number that corresponds to a particular variable. (is hindfoot length in column 7
or 9? oh, right... they are in column 8). In some cases, in which column the
variable will be can change if the script you are using adds or removes
columns. It's therefore often better to use column names to refer to a
particular variable, and it makes your code easier to read and your intentions
clearer.
In the introduction to R, we have covered a first way of selecting columns, using the `$` sign.
If you want to look at the result, you can use head().
```{r dollar}
hindfoot_length_column <- surveys$hindfoot_length
# look at the data in the variable hindfoot_length_column
head(hindfoot_length_column)
```
Using dplyr, you can do operations on a particular column, by selecting it using the `select()` function.
The first argument is the data frame, the second argument the column name you want to select.
You can use `names(surveys)` or `colnames(surveys)` to remind yourself of the column names.
```{r select}
hindfoot_length_column <- select(surveys, hindfoot_length)
```
You can easily select multiple columns by adding the names as additional arguments.
```{r select_mult}
hindfoot_length_species_columns <- select(surveys, hindfoot_length, species_id)
```
## Filtering rows of data
When analyzing data, though, we often want to look at partial statistics, such
as the maximum value of a variable per species or the average value per plot.
One way to do this is to filter the data we want, and create a new temporary
data frame, using the `filter()` function. For instance, if we just want to look at
the animals of the species "DO":
```{r filter}
surveys_DO <- filter(surveys, species_id == "DO")
```
Filtering on multiple criteria (column names) can be done by adding extra filters:
```{r filter_mult}
surveys_DM <- filter(surveys, species_id == "DM", year >= "2001")
```
> ## Callout Box: Boolean operations and Filtering
>
> When you filter on a criterium, you compare each row of data.frame.
> Using boolean operators, such as AND (`&`) and OR (`|`), you can combine different criteria.
> Let's see how this works with some examples using a vector:
>
> * First create a vector with some numbers:
```{r boolean1}
u <- c(1, 4, 2, 5, 6, 3, 7)
```
> * Check which numbers ar smaller than e.g. three. This results in a vector of the same size
> with for each number a boolean value of `TRUE` or `FALSE`, depending if the respective number passed the criterium.
```{r boolean2}
u < 3
```
> * To know how many elements pass the criterium, you can use `sum()`
```{r boolean2b}
sum(u < 3)
```
> * If you want to select only the elements of the vector that are conform the selection criterium, you can use the following:
```{r boolean3}
u[u < 3]
```
> * To select elements that do not conform to the criterium, the NOT operator `!` can be used:
```{r boolean5}
u[! u > 5]
```
> * Selection criteria can be combined using AND (`&`) and OR (`|`):
```{r boolean4}
# elements below 3 OR larger or equal to 4:
u[u < 3 | u >= 4]
# elements larger than 5 AND smaller than 1.
# As this is mathematically impossible, nothing matches this condition!
u[u > 5 & u < 1 ]
u[u > 5 & ! u < 1 ]
u[u > 5 & u < 7]
```
> ## Challenge: filtering rows on multiple factors
>
> 1. Boolean operators can also be used in dplyr! Filter rows on two different species (e.g. both DO and DM).
>
> Have a look at the [Introduction to dplyr](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#filter-rows-with-filter) if you need a hint.
```{r filter_or, echo=FALSE, eval=FALSE}
#If you want to include e.g. multiple species, you can use a boolean operator OR ("|"):
surveys_DO_DM <- filter(surveys, species_id == "DO" | species_id == "DM", year == "1977")
```
## Ordering rows according to a column
In stead of filtering rows, you sometimes just want to reorder the data.
This can be done using `arrange()`. It takes a data frame as first argument and a list of
column names.
```{r arrange}
surveys_order <- arrange(surveys, year, month, day)
# have a look
head(surveys_order)
```
Default rows are ordered ascendingly, you can reverse this using `desc()`:
```{r arrange_desc}
surveys_desc <- arrange(surveys, desc(weight))
# have a look at the data
head(surveys_desc)
```
> ## Challenge ordering of rows
>
> 1. Arrange the surveys so you see the most recent month first, ascendingly according to the day
>
```{r arrange_challenge, echo=FALSE, eval=FALSE}
surveys_challenge <- arrange(surveys, desc(year), desc(month), day)
# have a look at the data
head(surveys_challenge)
```
## Adding a column to our dataset
Sometimes, you may have to add a new column to your dataset that represents a
new variable, based on existing columns. You can do this using the function `mutate()`.
Here we will add a column, converting the hindfoot_length from mm to cm:
```{r mutate}
surveys_plus <- mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
head(surveys_plus)
```
> ## Challenge adding column
>
> 1. Add a column calculating the ratio of hindfoot_lenght and weight.
```{r, echo=FALSE, eval=FALSE}
surveys_plus <- mutate(surveys, ratio = hindfoot_length / weight)
```
> 2. Because the weight and hindfoot_length is not known for all data, you can first filter the data to see some more meaningful results.
Select for rows where the weight is above 100 g and the hindfoot_length is longer than 10 mm.
```{r, echo=FALSE, eval=FALSE}
head(filter(surveys_plus, weight > 100, hindfoot_length > 10))
```