---
layout: page
title: Programming with R
subtitle: Understanding factors
minutes: 20
---
```{r, include = FALSE}
source('tools/chunk-options.R')
opts_chunk$set(fig.path = "fig/01-supp-factors-")
```
> ## Learning Objectives {.objectives}
>
> * Understand how to represent categorical data in R
> * Know the difference between ordered and unordered factors
> * Be aware of some of the problems encountered when using factors
This section is modeled after the [datacarpentry lessons](http://datacarpentry.org).
Factors are used to represent categorical data. Factors can be ordered or
unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique
integers. While factors look (and often behave) like character vectors, they are
actually integers under the hood, and you need to be careful when treating them
like strings.
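As a minimal sketch of that integer storage, the chunk below builds a small factor and inspects its pieces. The vector `colours` is a made-up example and is not part of the lesson data.

```{r factors-as-integers}
# Hypothetical example vector (not part of the lesson data)
colours <- factor(c("red", "blue", "red"))

typeof(colours)       # "integer": the underlying storage type
as.integer(colours)   # 2 1 2: the integer codes
levels(colours)       # "blue" "red": the labels the codes point to
```

The codes index into the levels, so `2` stands for `"red"` and `1` for `"blue"`.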
Once created, factors can only contain a pre-defined set of values, known as
*levels*. By default, R always sorts *levels* in alphabetical order. For
instance, if you have a factor with 2 levels:
> ## Tip {.callout}
>
> The `factor()` command is used to create and modify factors in R.
```{r intro-to-factors}
sex <- factor(c("male", "female", "female", "male"))
```
R will assign `1` to the level `"female"` and `2` to the level `"male"` (because
`f` comes before `m`, even though the first element in this vector is
`"male"`). You can check this by using the function `levels()`, and check the
number of levels using `nlevels()`:
```{r examining-factors}
levels(sex)
nlevels(sex)
```
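The integer codes behind the labels can be inspected directly, and the default alphabetical ordering can be overridden with the `levels` argument. The following is a small sketch; the name `sex_male_first` is introduced here only for illustration.

```{r explicit-level-order}
sex <- factor(c("male", "female", "female", "male"))
as.integer(sex)              # 2 1 1 2: "female" is level 1, "male" is level 2

# Same data, but with the level order stated explicitly
sex_male_first <- factor(c("male", "female", "female", "male"),
                         levels = c("male", "female"))
levels(sex_male_first)       # "male" "female"
as.integer(sex_male_first)   # 1 2 2 1
```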
Sometimes the order of the levels does not matter. Other times you might want
to specify the order because it is meaningful (e.g., "low", "medium", "high") or
because it is required by a particular type of analysis. Additionally, specifying
the order of the levels allows us to compare levels:
```{r, error=TRUE}
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
food <- factor(food, levels=c("low", "medium", "high"))
levels(food)
min(food) ## doesn't work
food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)
min(food) ## works!
```
In R's memory, these factors are represented by numbers (1, 2, 3). They are
better than simple integer labels because factors are self-describing:
`"low"`, `"medium"`, and `"high"` are more descriptive than `1`, `2`, `3`. Which
is low? You wouldn't be able to tell with just integer data. Factors have this
information built in. It is particularly helpful when there are many levels
(like the subjects in our example data set).
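Beyond `min()`, an ordered factor also supports the comparison operators and sorting. The sketch below uses a made-up vector called `intake`, which is an assumption for illustration rather than part of the lesson data.

```{r comparing-ordered-levels}
# Hypothetical ordered factor (not part of the lesson data)
intake <- factor(c("low", "high", "medium", "high"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

intake < "high"              # TRUE FALSE TRUE FALSE
max(intake)                  # high
as.character(sort(intake))   # "low" "medium" "high" "high"
```

Sorting follows the declared level order, not alphabetical order.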
> ## Challenge - Representing data in R {.challenge}
>
> You have a vector representing levels of exercise undertaken by 5 subjects
>
> **"l","n","n","i","l"** ; n=none, l=light, i=intense
>
> What is the best way to represent this in R?
>
> a) exercise<-c("l","n","n","i","l")
>
> b) exercise<-factor(c("l","n","n","i","l"), ordered=TRUE)
>
> c) exercise<-factor(c("l","n","n","i","l"), levels=c("n","l","i"), ordered=FALSE)
>
> d) exercise<-factor(c("l","n","n","i","l"), levels=c("n","l","i"), ordered=TRUE)
### Converting Factors
Converting from a factor to a number can cause problems:
```{r converting-factors}
f<-factor(c(3.4, 1.2, 5))
as.numeric(f)
```
This does not behave as expected (and there is no warning): `as.numeric()` returns the integer codes `2 1 3` rather than the original values.
The recommended way is to use the integer vector to index the factor levels:
```{r converting-factors-correctly}
levels(f)[f]
```
This returns a character vector, so the `as.numeric()` function is still required to convert the values to the proper type (numeric).
```{r converting-to-numeric}
f<-levels(f)[f]
f<-as.numeric(f)
```
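The two steps can also be written as a single expression, and converting through `as.character()` is a common alternative. The sketch below uses a new variable, `readings`, chosen only for illustration.

```{r converting-via-character}
# Hypothetical factor built from numbers (same values as above)
readings <- factor(c(3.4, 1.2, 5))

as.numeric(readings)                 # 2 1 3: the integer codes, not the values
as.numeric(as.character(readings))   # 3.4 1.2 5.0
as.numeric(levels(readings))[readings]   # 3.4 1.2 5.0
```

The last form converts only the unique levels before indexing, so it does less work when a factor has many repeated values.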
### Using Factors
Let's load our example data to see the use of factors:
```{r load-example-data}
dat<-read.csv(file='data/sample.csv', stringsAsFactors=TRUE)
```
> ## Tip {.callout}
>
> `stringsAsFactors=TRUE` was the default behaviour of `read.csv()` before R 4.0.0. Since R 4.0.0 the default is `FALSE`, so the argument is set explicitly here.
```{r examine-example-data}
str(dat)
```
Notice the first 3 columns have been converted to factors. These values were text in the data file, so R interpreted them as categorical variables.
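The effect of `stringsAsFactors` can be seen in isolation with a small inline table. This is a sketch only: the CSV text and the column names `ID`, `Group` and `Score` are invented for illustration and do not come from `data/sample.csv`.

```{r strings-as-factors-toy}
# Hypothetical two-row table, read from a string instead of a file
csv_text <- "ID,Group,Score\nSub001,Control,1.5\nSub002,Treatment,2.5"

as_text    <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$Group)      # "character"
class(as_factors$Group)   # "factor"
class(as_factors$Score)   # "numeric": only text columns are converted
```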
```{r examine-example-data2}
summary(dat)
```
Notice that the `summary()` function handles factors differently from numbers (and strings): for a factor it reports the occurrence count of each value, which is often more useful information.
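The difference is easy to see on a small vector. In this sketch, `answers` is a made-up example rather than a column from the lesson data.

```{r summary-by-type}
# Hypothetical survey answers (not part of the lesson data)
answers <- c("yes", "no", "yes", "yes")

summary(factor(answers))   # counts per level: no 1, yes 3
summary(answers)           # character vector: only Length, Class and Mode
summary(c(1.5, 2.5, 4))    # numeric vector: Min, quartiles, Mean, Max
```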
> ## Tip {.callout}
>
> The `summary()` function is a great way of spotting errors in your data: look at the `dat$Gender` column. It's also a great way of spotting missing data.
> ## Challenge - Reordering factors {.challenge}
>
> The function `table()` tabulates observations and can be used to create bar plots quickly. For instance:
>
> ```{r reordering-factors}
> table(dat$Group)
> barplot(table(dat$Group))
> ```
> Use the `factor()` command to modify the column `dat$Group` so that the *control* group is plotted last.
### Removing Levels from a Factor
Some of the Gender values in our dataset have been coded incorrectly.
Let's correct them and remove the levels that are no longer needed.
```{r gender-counts}
barplot(table(dat$Gender))
```
Values should have been recorded as lowercase 'm' and 'f'. We should correct this.
```{r recoding-gender}
dat$Gender[dat$Gender=='M']<-'m'
```
> ## Challenge - Updating factors {.challenge}
>
> ```{r updating-factors}
> plot(x=dat$Gender,y=dat$BloodPressure)
> ```
>
> Why does this plot show 4 levels?
>
> *Hint*: how many levels does `dat$Gender` have?
We need to tell R that "M" is no longer a valid value for this column.
We use the `droplevels()` function to remove extra levels.
```{r dropping-levels}
dat$Gender<-droplevels(dat$Gender)
plot(x=dat$Gender,y=dat$BloodPressure)
```
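The same behaviour can be reproduced on a small vector, which makes it easier to see that recoding the values leaves the old level in place. The vector `g` below is a made-up example, not the lesson data.

```{r droplevels-toy}
# Hypothetical gender codes with one miscoded value
g <- factor(c("m", "f", "M", "m"))

g[g == "M"] <- "m"    # recode the value; "m" is already a valid level
nlevels(g)            # 3: the empty "M" level is still there
table(g)              # "M" is listed with a count of 0

g <- droplevels(g)    # remove levels that no longer occur
levels(g)             # "f" "m"
```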
> ## Tip {.callout}
>
> Adjusting the `levels()` of a factor provides a useful shortcut for reassigning values in this case.
>
> ```{r adjusting-levels}
> levels(dat$Gender)[levels(dat$Gender) == 'F'] <- 'f'  # pick the level by name; its position depends on the locale
> plot(x = dat$Gender, y = dat$BloodPressure)
> ```
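Renaming a level to a name that already exists merges the two levels, which is why this shortcut works. The sketch below shows this on a made-up vector, `gender_toy`, and selects the level by name so the result does not depend on how the locale sorts upper and lower case.

```{r merging-levels-toy}
# Hypothetical gender codes with an uppercase "F"
gender_toy <- factor(c("f", "F", "m", "f"))

# Rename the "F" level to "f"; the two levels are merged into one
levels(gender_toy)[levels(gender_toy) == "F"] <- "f"

levels(gender_toy)   # "f" "m"
table(gender_toy)    # f 3, m 1
```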
> ## Key Points {.callout}
>
> * Factors are used to represent categorical data
> * Factors can be *ordered* or *unordered*
> * Some R functions have special methods for handling factors