---
title: "Duncan Prestige Regression Example"
author: "Jeffrey B. Arnold"
date: "04/21/2015"
output:
  html_document:
    toc: yes
---
This example will use Duncan's Occupational Prestige Data, which has data on
prestige and other characteristics of 45 U.S. occupations in 1950. See the help
page for more info. This data is contained in the **car** package.

```{r}
library("car")
data("Duncan")
```

```{r eval = FALSE}
?Duncan
```

We'll also use a few other packages, which we should load now.

```{r message = FALSE}
library("ggplot2")
library("dplyr")
library("broom")
```
The `Duncan` data frame contains the names of the professions as row names. This
is something that is generally discouraged in modern R, so we will make a new
column named `occupation`. This can be done with the **dplyr** function
`add_rownames`:

```{r}
Duncan <- add_rownames(Duncan, var = "occupation")
```

Without **dplyr** we would have used `$` to assign a value to a column directly.
We've been using **dplyr** so much that we haven't had to do this, but it can be
easier at times.

```{r}
# Duncan$occupation <- rownames(Duncan)
```
## Scatterplots
Before starting, let's create a couple of scatterplots of the data. First,
prestige vs. income, labeling the points and coloring them by type:

```{r}
ggplot(Duncan, aes(x = income, y = prestige, colour = type, label = occupation)) +
  geom_text() +
  theme_minimal()
```

Now, prestige vs. income, coloring the points by education level:

```{r}
ggplot(Duncan, aes(x = income, y = prestige, colour = education)) +
  geom_point() +
  theme_minimal()
```
Run a regression of prestige on income and education:

```{r}
mod1 <- lm(prestige ~ income + education, data = Duncan)
mod1_summary <- summary(mod1)
mod1_summary
```
This extracts the coefficient estimates $\hat{\beta}$,

```{r}
beta <- coef(mod1)
```

This extracts the variance-covariance matrix of the coefficient estimates
$V(\hat{\beta})$.

```{r}
beta_vc <- vcov(mod1)
```

Standard errors of the coefficients are
$se(\hat{\beta}) = \sqrt{\mathrm{diag}(V(\hat{\beta}))}$,

```{r}
beta_se <- sqrt(diag(beta_vc))
```

The $t$-statistic of the hypothesis test $\beta = 0$ is
$\frac{\hat{\beta}}{se(\hat{\beta})}$,

```{r}
tstat <- beta / beta_se
```

And the $p$-value of the two-sided hypothesis test $H_0: \beta = 0$,
$H_a: \beta \neq 0$ is,

```{r}
pval <- 2 * (1 - pt(abs(tstat), mod1$df.residual))
```

This uses the `pt` function to calculate probabilities from the
$t$-distribution.
The degrees of freedom, $n - k - 1$:

```{r}
mod1$df.residual
```

The coefficients, $\hat{\beta}$:

```{r}
coef(mod1)
```

The coefficients can also be extracted directly from the element in the `lm`
object:

```{r}
mod1$coefficients
```
The fitted values, $\hat{y}$ (of the data used for fitting the regression):

```{r}
head(fitted(mod1))
head(mod1$fitted.values)
```

The residuals, $\hat{\epsilon} = y - \hat{y}$:

```{r}
head(residuals(mod1))
head(mod1$residuals)
```
You can check that the residuals are $y - \hat{y}$,

```{r}
head(Duncan$prestige - fitted(mod1))
```
To get the data used in the regression back from the regression,

```{r}
head(model.frame(mod1))
```

To get the $y$ values (the response vector) used in the regression,

```{r}
head(model.response(model.frame(mod1)))
```
Note that if you have missing values in the data used in the regression, R has
sophisticated but subtle ways of handling them, especially in `predict`. See the
help for `na.omit` and follow the links related to `lm`.
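
Here is a minimal sketch of that behavior; the missing value is introduced only
for illustration. With `na.action = na.exclude`, residuals and predictions are
padded with `NA` so that they line up with the rows of the original data.

```{r}
Duncan_na <- Duncan
Duncan_na$income[1] <- NA  # hypothetical missing value for illustration
mod_na <- lm(prestige ~ income + education, data = Duncan_na,
             na.action = na.exclude)
length(residuals(mod_na))  # padded to the full number of rows
head(predict(mod_na))      # the first prediction is NA
```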
The `anova` function returns an analysis-of-variance table with the sum of
squares for each term in the model and for the residuals.

```{r}
anova(mod1)
```
## Result post-processing with broom
[Broom](https://github.com/dgrtwo/broom) is a relatively new package that works
well with the **dplyr** and `%>%` workflow by converting the results of common
models into data frames that can be processed more easily than the default
objects R returns.
**broom** has three main functions, all of which return data frames (not lists,
numeric vectors, or other types of object).

`glance` returns a data frame with a single-row summary of the model:

```{r}
glance(mod1)
```

`tidy` returns a data frame with a row for each coefficient estimate:

```{r}
tidy(mod1)
```

`augment` returns the original data frame used in the model with additional
columns for fitted values, the standard errors of those fitted values,
residuals, etc.

```{r}
head(augment(mod1))
```
## Coefficient Plots
Also known as airplane plots or ropeladder plots:
```{r}
ggplot(tidy(mod1) %>% filter(term != "(Intercept)"),
       aes(x = term, y = estimate,
           ymin = estimate - 2 * std.error,
           ymax = estimate + 2 * std.error)) +
  geom_pointrange() +
  coord_flip()
```
This could also be done with the `coefplot` function from the **coefplot**
package:

```{r}
library("coefplot")
coefplot(mod1)
```

or, to drop the intercept,

```{r}
coefplot(mod1, coefficients = c("income", "education"))
```

Since `coefplot` uses **ggplot2** internally, you could also edit the object it
returns.
## Creating regression tables
Several packages (**stargazer**, **texreg**, **apsrtable**) are useful for
creating publication-quality regression tables. **stargazer** and **texreg** are
the most complete; both can output LaTeX or HTML tables for many types of
statistical models. We'll use **stargazer** here:

```{r}
library("stargazer")
stargazer(mod1, type = "text")
```

Now render that as HTML instead,

```{r results = 'asis'}
stargazer(mod1, type = "html")
```

Look at the source `.Rmd` file for this document; the chunk above used
`results = "asis"` so that the HTML is rendered rather than printed as raw R
output.

The usefulness of this function is most apparent when multiple regressions are
presented in a single table:

```{r results = "asis"}
mod3 <- lm(prestige ~ income, data = Duncan)
mod4 <- lm(prestige ~ income * type + education * type, data = Duncan)
stargazer(mod1, mod3, mod4, type = "html")
```
Additionally, the packages **xtable** and **pander** are not specific to the
problem of creating regression tables, but since they are more general purpose,
they are good for creating LaTeX / HTML / Markdown tables for a variety of R
objects.
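
For instance, here is a minimal sketch with **pander** (assuming the package is
installed); `pander` converts many R objects, including model summaries, into
Markdown tables:

```{r results = 'asis'}
library("pander")
# Render the summary of mod1 as a Markdown table.
pander(summary(mod1))
```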
## Predicted values
You could calculate predicted values manually. For example, the predicted
`prestige` from `mod1` for an occupation with an `income` of 41.9 and an
`education` of 52.5 is

```{r}
coef(mod1)["(Intercept)"] + coef(mod1)["income"] * 41.9 +
  coef(mod1)["education"] * 52.5
```

or

```{r}
c(1, 41.9, 52.5) %*% coef(mod1)
```
However, it is much easier to calculate this with the `predict` function. If
`predict` is used without a `newdata` argument, it acts similarly to `fitted`
and returns the predicted values for the data used to estimate the model
(although it will include predicted values for missing data):

```{r}
head(predict(mod1))
```
But if the `newdata` argument is used, then `predict` can be used to find the
predicted values for new data points. To calculate the predicted value of the
example above:

```{r}
predict(mod1, newdata = data.frame(education = 52.5, income = 41.9))
```
You can find the confidence intervals by specifying `interval = "confidence"`.

```{r}
yhat <- predict(mod1, interval = "confidence")
```
A note on help for `predict`: `predict` works with many different types of
analyses and objects. You need to go to `?predict.lm` to get help for `predict`
as it relates to `lm`.
To plot, convert `yhat` to a data frame:

```{r}
yhat_df <- as.data.frame(yhat) %>%
  add_rownames(var = "occupation")
```

Now it is easy to plot in **ggplot2**:

```{r}
ggplot() +
  geom_point(data = Duncan, aes(x = income, y = prestige))
```
Now do it with hypothetical values: a sequence of income values over the range
observed in the data, with education held at its mean.

```{r}
duncan_mean_education <-
  data.frame(education = mean(Duncan$education),
             income = seq(min(Duncan$income), max(Duncan$income),
                          length.out = 50))
mod1_predicted <-
  as.data.frame(predict(mod1, newdata = duncan_mean_education,
                        interval = "confidence", level = 0.95))
# An alternative with broom:
# augment(mod1, newdata = duncan_mean_education) %>%
#   mutate(lower = .fitted + qt(0.025, mod1$df.residual) * .se.fit,
#          upper = .fitted + qt(0.975, mod1$df.residual) * .se.fit)
```

To plot this, we'll need to combine it back with the original data.

```{r}
mod1_predicted <- cbind(duncan_mean_education, mod1_predicted)
```
Now, let's plot the predicted values and the 95% confidence interval of the
predicted values as income changes. Additionally, we'll plot the original
values.

```{r}
ggplot() +
  geom_line(data = mod1_predicted, mapping = aes(x = income, y = fit)) +
  geom_ribbon(data = mod1_predicted,
              mapping = aes(x = income, ymin = lwr, ymax = upr),
              alpha = 0.2) +
  geom_point(data = Duncan, mapping = aes(x = income, y = prestige)) +
  ylab("prestige")
```

Note that we use different datasets in each `geom` and do not set any defaults
in `ggplot()`. The `ggplot` function can provide default mappings in `aes` and a
default dataset, but if you are using multiple datasets, it can be safer and
less buggy to specify the `data` and `mapping` arguments for each `geom` layer.
For more interesting predicted-value plots, let's run a regression with a
categorical variable.

```{r}
mod2 <- lm(prestige ~ income + education + type, data = Duncan)
mod2
```
Now, let's predict values for each type of occupation over a range of income
values, holding education at its mean value. First, we need to create the data
that will be used for the predicted values. This can be made easier with the
function `expand.grid`, which returns a data frame with all combinations of its
arguments. For example,

```{r}
expand.grid(a = 1:3, b = c("a", "b"))
```

With that information, let's create the data frame with the values needed for
prediction:

```{r}
newdata_types_inc <-
  expand.grid(type = unique(Duncan$type),
              income = seq(min(Duncan$income), max(Duncan$income),
                           length.out = 5),
              education = mean(Duncan$education))
# # Another method
# newdata_types_inc <-
#   Duncan %>% {
#     expand.grid(type = unique(.$type),
#                 income = seq(min(.$income), max(.$income),
#                              length.out = 5),
#                 education = mean(.$education))
#   }
```
Then, create predicted values and confidence intervals with either `predict` or
`augment`.

```{r}
predicted_mod2 <-
  augment(mod2, newdata = newdata_types_inc) %>%
  mutate(lower = .fitted + qt(0.025, mod2$df.residual) * .se.fit,
         upper = .fitted + qt(0.975, mod2$df.residual) * .se.fit)
```

And then plot it,

```{r}
ggplot(predicted_mod2,
       aes(x = income, y = .fitted, ymin = lower, ymax = upper)) +
  geom_line(mapping = aes(colour = type)) +
  geom_ribbon(mapping = aes(fill = type), alpha = 0.2) +
  ylab("prestige") +
  ggtitle("Predicted values of prestige by type, holding education constant")
```
## Heteroskedasticity
## Unusual and Influential Data
### Leverage
The hat-value is a measure of leverage in regression. The fitted value
$\hat{y}_i$ is a weighted sum of the observed values of $y$,
$$
\hat{y}_i = h_{1i} y_1 + h_{2i} y_2 + \dots + h_{Ni} y_N = \sum_{j = 1}^N h_{ji} y_j .
$$
The weight $h_{ji}$ captures the contribution of observation $y_j$ to the
fitted value $\hat{y}_i$. The hat-value $h_i$ summarizes the total influence of
an observation on all of the fitted values,
$$
h_i \equiv h_{ii} = \sum_{j = 1}^N h_{ji}^2 .
$$
All hat-values are between $1 / N$ and 1. The average hat-value is
$$
\bar{h} = (K + 1) / N .
$$

In a simple regression the hat-value is a function of the distance from the
mean of $x$,
$$
h_i = \frac{1}{N} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^N (x_j - \bar{x})^2} .
$$
In the multivariate case, the squared distance from the center of the
$\vec{x}$'s, $\bar{\vec{x}}$, is used.
**R**

- The built-in functions
  [hatvalues()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
  and `hat()` calculate hat-values.
- The column `.hat` in the data frame returned by the **broom** function
  [augment](http://www.rdocumentation.org/packages/broom/functions/lm_tidiers.html)
  contains the hat-values.
- The **car** function
  [leveragePlot()](http://www.rdocumentation.org/packages/car/html/leveragePlot)
  is a variation of added-variable plots.
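
As a minimal sketch, here are the hat-values for `mod1`, along with the average
hat-value $(K + 1) / N$:

```{r}
h <- hatvalues(mod1)
# The observations with the highest leverage.
head(sort(h, decreasing = TRUE), 3)
# The average hat-value equals the number of coefficients divided by N.
mean(h)
length(coef(mod1)) / nrow(Duncan)
```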
### Outliers
It would seem sufficient to look at the residuals to identify outliers of a
regression. However, even though the errors, $\epsilon_i$, have equal variance,
the residuals of OLS, $\hat{\epsilon}_i$, do not:
$$
\mathrm{Var}(\hat{\epsilon}_i) = \sigma^2 (1 - h_i) .
$$
The regression variance $\sigma^2$ is adjusted by the leverage, since
high-leverage observations pull the regression line towards them and thus have
smaller residuals.
We will define two variations of the residuals. The standardized residual
divides the residual by its estimated standard deviation,
$$
\hat{\epsilon}'_i = \frac{\hat{\epsilon}_i}{\hat{\sigma} \sqrt{1 - h_i}} .
$$
However, the standard error of the regression, $\hat{\sigma}$, is still based on
the value of $\hat{\epsilon}_i$, which means that it will be large if the
residual of observation $i$ is also large. The studentized residual replaces the
standard error of the regression with the standard error of the regression when
observation $i$ is deleted, $\hat{\sigma}_{-i}$. Fortunately, there is an easy
way to calculate this standard error without having to rerun the regression for
each observation, by adjusting the standardized residual:
$$
\hat{\epsilon}^*_i = \frac{\hat{\epsilon}_i}{\hat{\sigma}_{-i} \sqrt{1 - h_i}}
= \hat{\epsilon}'_i \sqrt{\frac{N - K - 2}{N - K - 1 - \hat{\epsilon}_i'^2}} .
$$
Note that the distinction between these two residuals becomes negligible as the
sample size gets large, since in that case
$$
\frac{1}{N - K - 1} \sum_{j} \hat{\epsilon}_j^2
\approx \frac{1}{N - K - 2} \sum_{j : j \neq i} \hat{\epsilon}_j^2 .
$$
The studentized residual is distributed Student's $t$ with $N - K - 2$ degrees
of freedom (if the model is correct),
$$
\hat{\epsilon}_i^* \sim t_{N - K - 2} .
$$
How can studentized residuals be used to test for outliers? Since there are $N$
studentized residuals, running a significance test for outliers means performing
$N$ tests. If a typical 5% significance level is used, then in expectation 5% of
the residuals will be statistically significant even when there are no true
outliers. Thus, use a Bonferroni adjustment for the $p$-values, which corrects
for multiple tests. The function
[p.adjust](http://www.rdocumentation.org/packages/stats/functions/p.adjust) can
be used to apply the Bonferroni adjustment to $p$-values.
**R**

- [rstudent()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
  and
  [rstandard()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
  calculate the studentized and standardized residuals, respectively.
- The function
  [augment](http://www.rdocumentation.org/packages/broom/html/augment.lm.html)
  in the **broom** package includes the standardized residuals in the column
  `.std.resid`. The studentized residuals can be calculated from the
  standardized residuals with
  ```r
  .std.resid * sqrt((n - k - 2) / (n - k - 1 - .std.resid ^ 2))
  ```
  where $n$ is the number of observations and $k$ is the number of
  coefficients, excluding the intercept.
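
As a minimal sketch, here is the Bonferroni outlier test for `mod1` using the
built-in functions (the **car** function `outlierTest` reports the same test):

```{r}
rstud <- rstudent(mod1)
# Two-sided p-values: studentized residuals have df.residual - 1 degrees of
# freedom, i.e. N - K - 2.
p_raw <- 2 * pt(-abs(rstud), df = mod1$df.residual - 1)
# Bonferroni-adjusted p-values for the most extreme observations.
head(sort(p.adjust(p_raw, method = "bonferroni")), 3)
```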
### Influence
Influence is a direct measure of the effect of individual observations on
regression coefficients. This could be calculated by repeatedly running the
regression dropping one observation at a time. However, in the case of OLS,
there are several statistics which calculate this influence without running
those regressions.
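
For intuition, here is a minimal sketch of that brute-force approach for a
single observation, refitting `mod1` without the `minister` occupation (one of
the well-known unusual observations in this data); `dfbeta()`, discussed below,
returns the same differences for every observation without refitting.

```{r}
# Refit the model without "minister" and compare the coefficients.
mod1_no_minister <- lm(prestige ~ income + education,
                       data = filter(Duncan, occupation != "minister"))
coef(mod1) - coef(mod1_no_minister)
```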
Observations that have both high leverage and large studentized residuals have
large influences on the coefficients. There are several statistics, on different
scales, which measure this influence.
DFBETA is the difference in coefficient $j$ estimated with and without
observation $i$,
$$
D_{ij} = \hat{\beta}_{j} - \hat{\beta}_{j(-i)}
\quad \text{for $i = 1, \dots, N$ and $j = 0, 1, \dots, K$.}
$$
DFBETAS normalizes this by the (deleted) coefficient standard errors,
$$
D^*_{ij} = \frac{D_{ij}}{\widehat{\mathrm{se}}_{-i}(\hat{\beta}_j)} .
$$
The problem with DFBETAS is that there are $N \times (K + 1)$ values, one for
each coefficient for each observation. The statistics Cook's $D$ and DFFITS
summarize the overall influence of an observation on all of the coefficients.
Cook's $D$ statistic is scaled to produce a measure of distance that is similar
to an $F$-test statistic and independent of the scale of the $\vec{x}$
variables,
$$
D_i = \frac{(\hat{\epsilon}'_i)^2}{K + 1} \times \frac{h_i}{1 - h_i} .
$$
DFFITS is scaled so that it is on the scale of the fitted values,
$$
\mathit{DFFITS}_i = \hat{\epsilon}^*_i \sqrt{\frac{h_i}{1 - h_i}} ,
\qquad D_i \approx \mathit{DFFITS}_i^2 / (K + 1) .
$$
**R**

- DFBETA: [dfbeta()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
- DFBETAS: [dfbetas()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
- Cook's $D$: [cooks.distance()](http://www.rdocumentation.org/packages/stats/functions/influence.measures.html),
  or the column `.cooksd` in the results of the **broom** function
  [augment](http://www.rdocumentation.org/packages/broom/html/augment.lm.html)
- DFFITS: [dffits()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html)
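
A minimal sketch applying these functions to `mod1`:

```{r}
# DFBETA: change in each coefficient when observation i is deleted.
head(dfbeta(mod1), 3)
# DFBETAS: the same differences, scaled by the deleted standard errors.
head(dfbetas(mod1), 3)
# Cook's D and DFFITS summarize influence on all coefficients at once.
head(sort(cooks.distance(mod1), decreasing = TRUE), 3)
head(sort(abs(dffits(mod1)), decreasing = TRUE), 3)
```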
### Other measures

High-leverage observations can also have an influence on the standard errors of
the coefficients. If they have a large residual, the standard errors decrease
when they are deleted; and vice versa, if they have a small residual, the
standard errors increase when they are deleted. The statistic COVRATIO measures
this. You can calculate it using the R function
[covratio()](http://www.rdocumentation.org/packages/stats/html/influence.measures.html).
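
A minimal sketch for `mod1`; COVRATIO values far from 1 flag observations with a
large effect on the precision of the estimates:

```{r}
cr <- covratio(mod1)
# Observations whose COVRATIO is furthest from 1.
head(sort(abs(cr - 1), decreasing = TRUE), 3)
```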
Calculating the influence of sets of multiple observations is difficult, since
the number of sets to test increases combinatorially. Fox suggests using
**added-variable plots**. See the function
[avPlots](http://www.rdocumentation.org/packages/car/html/avPlots.html) in the
**car** package.
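
For example (**car** is already loaded at the top of this document):

```{r}
# Added-variable (partial regression) plots for each term in mod1.
avPlots(mod1)
```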
## References
### Outliers and Influential Observations
- Fox, Chapter 11. "Unusual and Influential Observations"