# Regression when data are missing: multiple imputation
**Caution: in a highly developmental stage! See Section \@ref(caution).**
(DSCI 562 Tutorial)
```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(mice))
```
Let's take a closer look at mean imputation vs. multiple imputation.
## Mean Imputation
Let's consider a simple linear regression example, with one explanatory variable. We'll generate 100 data points, and make 10 of the response values missing.
```{r}
set.seed(13)
x <- rnorm(100)
y <- -1 + 2 * x + rnorm(100)
y[1:10] <- NA
```
Here are the data:
```{r}
x
y
```
Here's the scatterplot with the missing data removed, and the corresponding linear regression fit:
```{r}
p <- ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```
The mean imputation method replaces the `NA`'s with an estimate for the mean of $Y$. The simplest case is to use the sample average of the response. The imputed observations are shown in red, and the resulting `lm` fit is also in red.
```{r}
ybar <- mean(y, na.rm = TRUE)                # sample average of the observed responses
datrm <- na.omit(data.frame(x = x, y = y))   # complete cases only
datimp <- data.frame(x = x[1:10], y = ybar)  # impute the 10 missing y's with ybar
p + geom_point(data = datimp, colour = "red") +
  geom_smooth(data = rbind(datrm, datimp), method = "lm", se = FALSE, colour = "red")
```
Notice that the new regression line is flatter.
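We can confirm the flattening numerically (a quick check using the `datrm` and `datimp` objects from above):

```{r}
coef(lm(y ~ x, data = datrm))                 # complete-case fit
coef(lm(y ~ x, data = rbind(datrm, datimp)))  # slope shrinks toward zero
```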
Another mean-imputation method is to replace the `NA`'s with an alternative mean estimate: the regression predictions.
```{r}
fit2 <- lm(y ~ x, na.action = na.omit)                    # fit on the complete cases
yhat <- predict(fit2, newdata = data.frame(x = x[1:10]))  # predicted means at the missing x's
datimp2 <- data.frame(x = x[1:10], y = yhat)
p + geom_point(data = datimp2, colour = "red") +
  geom_smooth(data = rbind(datrm, datimp2), method = "lm", se = FALSE,
              colour = "red", linewidth = 0.5)
```
The regression line has not changed. This method seems smarter, but it still has consequences: the imputed points lie exactly on the regression line, suggesting the data are bound more tightly to the line than they really are. As a result, the residual variance estimate is biased downward.
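We can check this directly (a quick sketch using the `datrm` and `datimp2` objects defined above), comparing the residual standard error of the complete-case fit to that of the fit including the regression-imputed points:

```{r}
sigma(lm(y ~ x, data = datrm))                  # complete-case residual SE
sigma(lm(y ~ x, data = rbind(datrm, datimp2)))  # artificially smaller
```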
These are both mean imputation methods. So, in your Lab 2 assignment, you can use any mean imputation method -- your explanation of the comparison will just depend on what you choose.
## Multiple Imputation
Recall that _multiple imputation_ is a technique for handling missing data. It replaces the missing data with _many_ plausible values, to obtain multiple data sets. An analysis is done on each data set, and the results are combined.
A very powerful R package to assist with multiple imputation is the `mice` package. Some key things that it does:
- Displays patterns in missing data.
- Imputes data to obtain multiple data sets.
- Pools multiple analyses into one.
We'll look at the `airquality` dataset in R.
```{r}
head(airquality)
```
### Patterns
Where are the `NA`s?
```{r}
md.pattern(airquality)
```
A "1" indicates that an observation is present, and a "0" indicates absense. The periphery of the matrix are counts: to the right, are the number of `NA`s in the row; at the bottom, are the number of `NA`s in each column; to the left, are the number of observations having a missing data pattern indicated in the matrix.
So we can see that there are 7 missing Solar Radiation observations, and 37 missing Ozone observations. We could check that in another way as follows:
```{r}
sum(is.na(airquality$Solar.R))
sum(is.na(airquality$Ozone))
```
### Multiple Imputation
There are many methods of doing an imputation. But generally, they use other columns in the data set to do prediction on the missing data.
The function to do this is `mice`. Let's impute 50 data sets using the "Predictive Mean Matching" method.
```{r}
(dats <- mice(airquality, m=50, method="pmm", seed=123, printFlag=FALSE))
```
The `m` argument is the number of imputed datasets. `method` specifies the imputation method (the other available methods are listed in the "Details" section of the `mice` documentation). Because there's a random component to the imputation, `seed` initializes the random number generator -- useful for reproducibility! Finally, I didn't want `mice` to be verbose with its output, so I silenced it with `printFlag=FALSE`.
`dats` isn't just a list of 50 datasets. It has more information bundled in it. The info is bundled in an object of type "mids":
```{r}
class(dats)
```
But we can extract the data sets. Want to see the fourth imputed data set? Here it is:
```{r}
head(mice::complete(dats, 4))
```
### Pooling
The `mice` package allows you to pool many types of regression analyses. Let's try a simple linear regression to predict `Ozone` from `Solar.R`, `Wind`, and `Temp`. You'll need to use base R's `with` function.
```{r}
fits <- with(dats, lm(Ozone ~ Solar.R + Wind + Temp))
```
If you were to print `fits` to the screen, it would look like a list of 50 regression fits -- one for each of the imputed data sets. But it's not. Take a look:
```{r}
names(fits)
```
Like `dats`, `fits` has more info in it. But it _does_ have the 50 regression fits. And they can be pooled using the `pool` function:
```{r}
(fit <- pool(fits))
summary(fit)
```
And there are the results of the pooled fit. This pooling works for more than just `lm`!
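For instance, here is a sketch of pooling GLM fits on the same imputed datasets (reusing the `dats` object from above; the Gamma family is just an illustrative choice, echoing the GLM example later in this tutorial):

```{r}
glm_fits <- with(dats, glm(Ozone ~ Solar.R + Wind + Temp, family = Gamma(link = "log")))
summary(pool(glm_fits))
```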
```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(mice))
```
Consider predicting the air quality (ozone levels) in New York mid-year. We'll use the `airquality` dataset, recorded for mid 1973:
```{r}
airquality
```
## Step 0: What data are missing?
There are some missing data. Use `md.pattern` to see patterns in missingness:
```{r, fig.height=3.5}
md.pattern(airquality)
```
Fill in the following:
- There are **111** rows of complete data.
- There are **35** rows where only ozone is missing.
- There are **2** rows where both ozone and Solar.R are missing.
- There are **37** rows missing an ozone measurement.
- There are **44** `NA`'s in the dataset.
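These counts can be double-checked directly (a quick sanity check):

```{r}
sum(complete.cases(airquality))                            # 111 complete rows
sum(is.na(airquality$Ozone) & !is.na(airquality$Solar.R))  # 35 rows: only Ozone missing
sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))   # 2 rows: both missing
sum(is.na(airquality$Ozone))                               # 37 rows missing Ozone
sum(is.na(airquality))                                     # 44 NA's in total
```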
## Step 1: Handling Missing Data
### Any Ideas?
Here is a scatterplot of `Solar.R` and `Ozone`, with missing values "pushed" to the intercepts:
```{r, fig.height=3}
airquality %>%
  mutate(missing = is.na(Solar.R) | is.na(Ozone),
         Solar.R = ifelse(is.na(Solar.R), 0, Solar.R),
         Ozone   = ifelse(is.na(Ozone), 0, Ozone)) %>%
  ggplot(aes(Solar.R, Ozone)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point(aes(colour = missing)) +
  theme_bw() +
  scale_colour_discrete(guide = "none")
```
Discussion: What are some ways of handling the missing data, and what are the consequences of each?

1. Remove data.
    - Remove rows with missing data (a _complete-case analysis_).
        - Consequence: we throw away information that could have reduced the SE of the final model estimates.
    - Remove rows where only the response is missing, and drop `Solar.R` from the regression (because it has some missing values).
        - Consequence: if `Solar.R` is predictive of `Ozone`, we lose that predictive power by excluding it.
2. Impute:
    - Mean imputation: replace an `NA` with a prediction of its mean, using other variables as predictors (a minimal sketch follows this list).
        - Consequence: imputed data fall artificially close to the center of the "data cloud". This means a fitted model function would have an artificially small SE.
    - Multiple imputation: impute with multiple draws from a predictive distribution.
        - A great choice! No real "consequences" here, aside from the inherent risk of biasing the model function that comes with imputing values.
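As promised, here is a minimal sketch of the mean-imputation option, using the simplest mean estimate (the column average). This is for comparison only, not part of the pipeline below, and assumes dplyr 1.0+ for `across()`:

```{r}
# Naive column-mean imputation: replace each NA with its column's average.
mean_imputed <- airquality %>%
  mutate(across(c(Ozone, Solar.R),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
lm(Ozone ~ Solar.R + Wind + Temp, data = mean_imputed)
```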
### `mice`
First, remove the day and month columns (we won't be using them):
```{r}
airquality <- airquality %>%
select(-Month, -Day)
```
Make `m` random imputations using `mice::mice()`:
```{r}
m <- 10
(init <- mice(airquality, m = m, printFlag = FALSE))
```
Check out the first imputed data set using `mice::complete()`. **WARNING**: there's also a `tidyr::complete()`! Rows 5 and 6, for example, originally contained missing data.
```{r}
mice::complete(init, 1)
```
Plot one of them:
```{r, fig.height=3}
mice::complete(init, 1) %>%
  mutate(missing = is.na(airquality$Solar.R) | is.na(airquality$Ozone)) %>%
  ggplot(aes(Solar.R, Ozone)) +
  geom_point(aes(colour = missing)) +
  theme_bw()
```
## Step 2: Fit a model on each dataset

Now, fit a linear model on each imputed dataset using the `with()` generic function (method `with.mids()`):
```{r}
(fits <- with(init, lm(Ozone ~ Solar.R + Wind + Temp)))
```
Looks can be deceiving. This is not actually a list of length `m`! Unveil its true nature:
```{r}
unclass(fits) %>%
str(max.level = 1)
```
It's now easier to find the `lm` fit on the first dataset:
```{r}
fits$analyses[[1]]
```
Or, we can obtain a summary of each fitted model:
```{r}
summary(fits)
```
As an aside, let's demonstrate that we can also use `mice` to fit GLMs:
```{r}
with(init, glm(Ozone ~ Solar.R + Wind + Temp,
               family = Gamma(link = "log"))) %>%
  summary()
```
## Step 3: Pool results
The last step is to pool the results together:
```{r}
pool(fits)
```
The `estimate` column contains just the averages of the regression coefficients across the `m` models.
The column names make more sense in light of the book _Multiple Imputation for Nonresponse in Surveys_ by Rubin (1987), pages 76-77:
- `estimate`: the average of the regression coefficients across the `m` models.
- `ubar`: the average variance (i.e., average squared SE) across the `m` models.
- `b`: the sample variance of the `m` regression coefficients.
- `t`: the final estimate of the squared SE of each regression coefficient:
    - `t = ubar + (1 + 1/m) * b`
- `df`: the degrees of freedom associated with the final regression coefficient estimates.
    - A `(1 - alpha)`-level CI is `estimate +/- qt(1 - alpha/2, df) * sqrt(t)`.
- `riv`: the relative increase in variance due to nonresponse:
    - `riv = t / ubar - 1`
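To make these formulas concrete, here is a sketch that recomputes `t`, `riv`, and a 95% CI from the components of the pooled object (this assumes the `pooled` data-frame layout of `mice` 3.x):

```{r}
pooled <- pool(fits)$pooled
# Rubin's rules, recomputed by hand:
all.equal(pooled$t,   pooled$ubar + (1 + 1/m) * pooled$b)
all.equal(pooled$riv, pooled$t / pooled$ubar - 1)
# 95% confidence intervals for the pooled coefficients:
data.frame(term  = pooled$term,
           lower = pooled$estimate - qt(0.975, pooled$df) * sqrt(pooled$t),
           upper = pooled$estimate + qt(0.975, pooled$df) * sqrt(pooled$t))
```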
## Concepts
- There are three common missing-data mechanisms:
    - Missing Completely At Random (MCAR): the chance of missingness does not depend on any variable; missingness is totally random.
    - Missing At Random (MAR): the chance of missingness depends on other _observed_ variables (e.g., ozone readings are dropped more often on windy days, and wind is recorded).
    - Missing Not At Random (MNAR): the chance of missingness depends on _unobserved_ variables (e.g., the sensor fails precisely on high-ozone days, so missingness depends on the very value that is missing).
- Proceeding with an analysis by removing missing data can result in standard errors of the estimates that are larger than they would be if the partially complete records were used.
- Proceeding with an analysis by imputing missing data with an estimate of the mean can result in standard errors of the estimates that are smaller than they ought to be.
- An approach that uses the information contained in partially complete records, yet does not assume any more information, is multiple imputation. The approach contains three steps (condensed into code after this list):
    1. Form multiple datasets containing imputed values. Each dataset should be formed by imputing the missing records in each unit/row with a random draw from a predictive distribution for those records.
    2. Fit the model of interest on each imputed dataset.
    3. Combine the models to obtain one final model.
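Putting the three steps together, the whole workflow condenses to a few lines (a sketch reusing this tutorial's model; the seed value is arbitrary):

```{r}
imputed <- mice(airquality, m = 10, printFlag = FALSE, seed = 562)  # Step 1: impute
lm_fits <- with(imputed, lm(Ozone ~ Solar.R + Wind + Temp))         # Step 2: fit each
summary(pool(lm_fits))                                              # Step 3: pool
```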