# Regression when data are missing: multiple imputation
**Caution: in a highly developmental stage! See Section \@ref(caution).**
(DSCI 562 Tutorial)
```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(mice))
```
Let's take a closer look at mean imputation vs. multiple imputation.
## Mean Imputation
Let's consider a simple linear regression example, with one explanatory variable. We'll generate 100 data points, and make 10 of the response values missing.
```{r}
set.seed(13)
x <- rnorm(100)
y <- -1 + 2 * x + rnorm(100)
y[1:10] <- NA
```
Here are the data:
```{r}
x
y
```
Here's the scatterplot with the missing data removed, and the corresponding linear regression fit:
```{r}
p <- ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```
The mean imputation method replaces the `NA`'s with an estimate for the mean of $Y$. The simplest case is to use the sample average of the response. The imputed observations are shown in red, and the resulting `lm` fit is also in red.
```{r}
ybar <- mean(y, na.rm = TRUE)                # sample average of the observed responses
datrm <- na.omit(data.frame(x = x, y = y))   # complete cases only
datimp <- data.frame(x = x[1:10], y = ybar)  # impute the 10 missing y's with ybar
p + geom_point(data = datimp, colour = "red") +
  geom_smooth(data = rbind(datrm, datimp), method = "lm", se = FALSE, colour = "red")
```
Notice that the new regression line is flatter.
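We can confirm the flattening numerically (a quick check using the `datrm` and `datimp` objects from above):

```{r}
coef(lm(y ~ x, data = datrm))                 # complete-case fit
coef(lm(y ~ x, data = rbind(datrm, datimp)))  # slope shrinks toward zero
```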
Another mean-imputation method is to replace the `NA`'s with an alternative mean estimate: the regression predictions.
```{r}
fit2 <- lm(y ~ x, na.action = na.omit)                    # fit on the complete cases
yhat <- predict(fit2, newdata = data.frame(x = x[1:10]))  # predicted means at the missing x's
datimp2 <- data.frame(x = x[1:10], y = yhat)
p + geom_point(data = datimp2, colour = "red") +
  geom_smooth(data = rbind(datrm, datimp2), method = "lm", se = FALSE,
              colour = "red", linewidth = 0.5)
```
The regression line has not changed. This method seems smarter, but it still has consequences: the imputed points lie exactly on the regression line, suggesting the data are bound more tightly to the line than they really are. As a result, the residual variance estimate is biased downward.
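We can check this directly (a quick sketch using the `datrm` and `datimp2` objects defined above), comparing the residual standard error of the complete-case fit to that of the fit including the regression-imputed points:

```{r}
sigma(lm(y ~ x, data = datrm))                  # complete-case residual SE
sigma(lm(y ~ x, data = rbind(datrm, datimp2)))  # artificially smaller
```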
These are both mean imputation methods. So, in your Lab 2 assignment, you can use any mean imputation method -- your explanation of the comparison will just depend on what you choose.
## Multiple Imputation
Recall that _multiple imputation_ is a technique for handling missing data. It replaces the missing data with _many_ plausible values, to obtain multiple data sets. An analysis is done on each data set, and the results are combined.
A very powerful R package to assist with multiple imputation is the `mice` package. Some key things that it does:
- Displays patterns in missing data.
- Imputes data to obtain multiple data sets.
- Pools multiple analyses into one.
We'll look at the `airquality` dataset in R.
```{r}
head(airquality)
```
### Patterns
Where are the `NA`s?
```{r}
md.pattern(airquality)
```
A "1" indicates that an observation is present, and a "0" indicates absense. The periphery of the matrix are counts: to the right, are the number of `NA`s in the row; at the bottom, are the number of `NA`s in each column; to the left, are the number of observations having a missing data pattern indicated in the matrix.
So we can see that there are 7 missing Solar Radiation observations, and 37 missing Ozone observations. We could check that in another way as follows:
```{r}
sum(is.na(airquality$Solar.R))
sum(is.na(airquality$Ozone))
```
### Multiple Imputation
There are many methods of doing an imputation. But generally, they use other columns in the data set to do prediction on the missing data.
The function to do this is `mice`. Let's impute 50 data sets using the "Predictive Mean Matching" method.
```{r}
(dats <- mice(airquality, m=50, method="pmm", seed=123, printFlag=FALSE))
```
The `m` argument is the number of imputed datasets. `method` specifies the imputation method (the other available methods are listed in the "Details" section of the `mice` documentation). Because there's a random component to the imputation, `seed` initializes the random number generator -- useful for reproducibility! Finally, I didn't want `mice` to be verbose with its output, so I silenced it with `printFlag=FALSE`.
`dats` isn't just a list of 50 datasets. It has more information bundled in it. The info is bundled in an object of type "mids":
```{r}
class(dats)
```
But we can extract the data sets. Want to see the fourth imputed data set? Here it is:
```{r}
head(mice::complete(dats, 4))
```
### Pooling
The `mice` package allows you to pool many types of regression analyses. Let's try a simple linear regression to predict `Ozone` from `Solar.R`, `Wind`, and `Temp`. You'll need to use base R's `with` function.
```{r}
fits <- with(dats, lm(Ozone ~ Solar.R + Wind + Temp))
```
If you were to print `fits` to the screen, it would look like a list of 50 regression fits -- one for each of the imputed data sets. But it's not. Take a look:
```{r}
names(fits)
```
Like `dats`, `fits` has more info in it. But it _does_ have the 50 regression fits. And they can be pooled using the `pool` function:
```{r}
(fit <- pool(fits))
summary(fit)
```
And there are the results of the pooled fit. This pooling works for more than just `lm`!
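For instance, here is a sketch of pooling GLM fits on the same imputed datasets (reusing the `dats` object from above; the Gamma family is just an illustrative choice, echoing the GLM example later in this tutorial):

```{r}
glm_fits <- with(dats, glm(Ozone ~ Solar.R + Wind + Temp, family = Gamma(link = "log")))
summary(pool(glm_fits))
```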
```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(mice))
```
Consider predicting the air quality (ozone levels) in New York mid-year. We'll use the `airquality` dataset, recorded for mid 1973:
```{r}
airquality
```
## Step 0: What data are missing?
There are some missing data. Use `md.pattern` to see patterns in missingness:
```{r, fig.height=3.5}
md.pattern(airquality)
```
Fill in the following:
- There are **111** rows of complete data.
- There are **35** rows where only ozone is missing.
- There are **2** rows where both ozone and Solar.R are missing.
- There are **37** rows missing an ozone measurement.
- There are **44** `NA`'s in the dataset.
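These counts can be double-checked directly (a quick sanity check):

```{r}
sum(complete.cases(airquality))                            # 111 complete rows
sum(is.na(airquality$Ozone) & !is.na(airquality$Solar.R))  # 35 rows: only Ozone missing
sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))   # 2 rows: both missing
sum(is.na(airquality$Ozone))                               # 37 rows missing Ozone
sum(is.na(airquality))                                     # 44 NA's in total
```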
## Step 1: Handling Missing Data
### Any Ideas?
Here is a scatterplot of `Solar.R` and `Ozone`, with missing values "pushed" to the intercepts:
```{r, fig.height=3}
airquality %>%
  mutate(missing = is.na(Solar.R) | is.na(Ozone),
         Solar.R = ifelse(is.na(Solar.R), 0, Solar.R),
         Ozone   = ifelse(is.na(Ozone), 0, Ozone)) %>%
  ggplot(aes(Solar.R, Ozone)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(aes(colour = missing)) +
  theme_bw() +
  scale_colour_discrete(guide = "none")
```
Discussion: What are some ways of handling the missing data, and what are the consequences of each?

1. Remove data.
    - Remove rows with missing data (a _complete-case analysis_).
        - Consequence: we throw away information that could have reduced the SE of the final model estimates.
    - Remove rows where only the response is missing, and drop `Solar.R` from the regression (because it has some missing values).
        - Consequence: if `Solar.R` is predictive of `Ozone`, we lose that predictive power by excluding it.
2. Impute:
    - Mean imputation: replace an `NA` with a prediction of its mean, using other variables as predictors (a minimal sketch follows this list).
        - Consequence: imputed data fall artificially close to the center of the "data cloud". This means a fitted model function would have an artificially small SE.
    - Multiple imputation: impute with multiple draws from a predictive distribution.
        - A great choice! No real "consequences" here, aside from the inherent risk of biasing the model function that comes with imputing values.
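As promised, here is a minimal sketch of the mean-imputation option, using the simplest mean estimate (the column average). This is for comparison only, not part of the pipeline below, and assumes dplyr 1.0+ for `across()`:

```{r}
# Naive column-mean imputation: replace each NA with its column's average.
mean_imputed <- airquality %>%
  mutate(across(c(Ozone, Solar.R),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
lm(Ozone ~ Solar.R + Wind + Temp, data = mean_imputed)
```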
### `mice`
First, remove the day and month columns (we won't be using them):
```{r}
airquality <- airquality %>%
select(-Month, -Day)
```
Make `m` random imputations using `mice::mice()`:
```{r}
m <- 10
(init <- mice(airquality, m = m, printFlag = FALSE))
```
Check out the first imputed data set using `mice::complete()`. **WARNING**: there's also a `tidyr::complete()`! Rows 5 and 6, for example, originally contained missing data.
```{r}
mice::complete(init, 1)
```
Plot one of them:
```{r, fig.height=3}
mice::complete(init, 1) %>%
  mutate(missing = is.na(airquality$Solar.R) | is.na(airquality$Ozone)) %>%
  ggplot(aes(Solar.R, Ozone)) +
  geom_point(aes(colour = missing)) +
  theme_bw()
```
## Step 2: Fit a model on each dataset

Now, fit a linear model on each imputed dataset using the `with()` generic function (method `with.mids()`):
```{r}
(fits <- with(init, lm(Ozone ~ Solar.R + Wind + Temp)))
```
Looks can be deceiving. This is not actually a list of length `m`! Unveil its true nature:
```{r}
unclass(fits) %>%
str(max.level = 1)
```
It's now easier to find the `lm` fit on the first dataset:
```{r}
fits$analyses[[1]]
```
Or, we can obtain a summary of each fitted model:
```{r}
summary(fits)
```
As an aside, let's demonstrate that we can also use `mice` to fit GLMs:
```{r}
with(init, glm(Ozone ~ Solar.R + Wind + Temp,
               family = Gamma(link = "log"))) %>%
  summary()
```
## Step 3: Pool results
The last step is to pool the results together:
```{r}
pool(fits)
```
The `estimate` column contains just the averages of the regression coefficients across the `m` models.
The column names make more sense in light of the book _Multiple Imputation for Nonresponse in Surveys_ by Rubin (1987), pages 76-77:
- `estimate`: the average of the regression coefficients across the `m` models.
- `ubar`: the average variance (i.e., average squared SE) across the `m` models.
- `b`: the sample variance of the `m` regression coefficients.
- `t`: the final estimate of the squared SE of each regression coefficient:
    - `t = ubar + (1 + 1/m) * b`
- `df`: the degrees of freedom associated with the final regression coefficient estimates.
    - A `(1 - alpha)`-level CI is `estimate +/- qt(1 - alpha/2, df) * sqrt(t)`.
- `riv`: the relative increase in variance due to nonresponse:
    - `riv = t / ubar - 1`
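To make these formulas concrete, here is a sketch that recomputes `t`, `riv`, and a 95% CI from the components of the pooled object (this assumes the `pooled` data-frame layout of `mice` 3.x):

```{r}
pooled <- pool(fits)$pooled
# Rubin's rules, recomputed by hand:
all.equal(pooled$t,   pooled$ubar + (1 + 1/m) * pooled$b)
all.equal(pooled$riv, pooled$t / pooled$ubar - 1)
# 95% confidence intervals for the pooled coefficients:
data.frame(term  = pooled$term,
           lower = pooled$estimate - qt(0.975, pooled$df) * sqrt(pooled$t),
           upper = pooled$estimate + qt(0.975, pooled$df) * sqrt(pooled$t))
```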
## Concepts
- There are three common missing-data mechanisms:
    - Missing Completely At Random (MCAR): the chance of missingness does not depend on any variable; missingness is totally random.
    - Missing At Random (MAR): the chance of missingness depends on other _observed_ variables (e.g., ozone readings are dropped more often on windy days, and wind is recorded).
    - Missing Not At Random (MNAR): the chance of missingness depends on _unobserved_ variables (e.g., the sensor fails precisely on high-ozone days, so missingness depends on the very value that is missing).
- Proceeding with an analysis by removing missing data can result in standard errors of the estimates that are larger than they would be if the partially complete records were used.
- Proceeding with an analysis by imputing missing data with an estimate of the mean can result in standard errors of the estimates that are smaller than they ought to be.
- An approach that uses the information contained in partially complete records, yet does not assume any more information, is multiple imputation. The approach contains three steps (condensed into code after this list):
    1. Form multiple datasets containing imputed values. Each dataset should be formed by imputing the missing records in each unit/row with a random draw from a predictive distribution for those records.
    2. Fit the model of interest on each imputed dataset.
    3. Combine the models to obtain one final model.
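Putting the three steps together, the whole workflow condenses to a few lines (a sketch reusing this tutorial's model; the seed value is arbitrary):

```{r}
imputed <- mice(airquality, m = 10, printFlag = FALSE, seed = 562)  # Step 1: impute
lm_fits <- with(imputed, lm(Ozone ~ Solar.R + Wind + Temp))         # Step 2: fit each
summary(pool(lm_fits))                                              # Step 3: pool
```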