---
title: "Chapter 1: Linear regression"
subtitle: "Predictivity and variability"
author: "Joris Vankerschaver"
header-includes:
- \useinnertheme[shadow=true]{rounded}
- \usecolortheme{rose}
- \setbeamertemplate{footline}[frame number]
- \usepackage{color}
- \usepackage{graphicx}
- \usepackage{amsmath}
- \graphicspath{{./images/01-linear-regression}}
output:
beamer_presentation:
theme: "default"
keep_tex: true
includes:
in_header: columns.tex
---
```{r include = FALSE}
CWD <- read.table("./datasets/01-linear-regression/christ.csv", header = T, sep = ",", dec = ".")
attach(CWD)
model <- lm(CWD.BASA ~ RIP.DENS)
model3 <- lm(I(log(CWD.BASA)) ~ RIP.DENS + I(RIP.DENS^2))
summary(model)
confint(model)
needles <- read.table("./datasets/01-linear-regression/needles.txt", header = T, sep = "\t", dec = ".")
attach(needles)
model_l5 <- lm(length ~ nitrogen * phosphor + potassium
+ phosphor * residu)
model_l8 <- lm(length ~ nitrogen * phosphor + potassium)
library(car)
body.fat <- read.table("./datasets/01-linear-regression/bodyfatNKNW.txt", header = T, dec = ".")
attach(body.fat)
```
# Prediction
## Example
Use the model to predict the length of a larch from the mineral composition of its needles.
\footnotesize
```{r echo=FALSE}
coefs <- summary(model_l8)$coefficients
coefs
```
\normalsize
- Percentages: nitrogen = 1.9, phosphorus = 0.2, potassium = 0.7.
- Predicted **average** length:
\begin{multline*}
160.66-76.5\times 1.9-1120.7\times 0.2+138.06\times 0.7 + \\
724.38\times 1.9\times 0.2=163.1.
\end{multline*}
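As a sanity check, the hand computation above can be reproduced directly in R, using the rounded coefficients shown on the slide:

```{r}
# Predicted average length from the rounded coefficients
# (intercept, nitrogen, phosphor, potassium, nitrogen:phosphor)
160.66 - 76.5 * 1.9 - 1120.7 * 0.2 + 138.06 * 0.7 + 724.38 * 1.9 * 0.2
```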
## Accuracy of prediction
To determine the accuracy of a prediction, we need to take into account the
- **variability** of the observations around the regression line
- **precision** of the estimated regression line.
## Estimating variability via the residual standard error
\alert{Residual standard error} (RSE):
- CWD basal area: 1.01 on 13 degrees of freedom
- Larches: 35.55 on 21 degrees of freedom.
The residual standard error tells us that approximately 95\% of lengths, for nitrogen, phosphorus, and potassium percentages of 1.9, 0.2, and 0.7, are expected to lie within a distance
\[
2\times 35.55=71.1
\]
of the mean.
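Combined with the predicted average length of 163.1 from the previous slide, this gives a rough 95\% range (a sketch, assuming approximately normal residuals):

```{r}
# Rough 95% range: predicted mean +/- 2 RSE (larches)
rse <- 35.55
c(lower = 163.1 - 2 * rse, upper = 163.1 + 2 * rse)
```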
## Residual standard error in \texttt{R}
\scriptsize
```{r}
summary(model_l8)
```
\normalsize
## Residual standard error (by hand)
RSE can be calculated as
\[
RSE = \sqrt{\frac{SSE}{n - p}} = \sqrt{MSE}
\]
where SSE, the sum of squared residuals, is given by
\[
SSE = \sum_{i = 1}^n(y_i - \hat{y}_i)^2 = \sum_{i = 1}^n e_i^2,
\]
and $p$ is the number of parameters in the model.
For example (larches, $p = 5$):
```{r}
SSE <- sum(model_l8$residuals^2)
RSE <- sqrt(SSE/(26 - 5))
RSE
```
## Prediction/confidence intervals
- **Prediction intervals** combine both sources of inaccuracy
- Variability around the regression line
- Precision of the regression line.
- Designed to contain, with 95\% confidence, a random observation (e.g. CWD basal area or tree length) for given predictor values (e.g. tree density or given proportions of nitrogen, phosphorus, and potassium)
- **Confidence intervals** incorporate only the precision of the regression line.
- Designed to contain, with 95\% confidence, the **average** of random observations for given predictor values.
## Prediction and confidence intervals in \texttt{R}: CWD basal area
\small
```{r}
p <- predict(model3, newdata = data.frame(RIP.DENS=800:2200),
interval = "confidence")
```
```{r echo=FALSE}
p[1:3,]
```
```{r}
p <- predict(model3, newdata = data.frame(RIP.DENS=800:2200),
interval = "prediction")
```
```{r echo=FALSE}
p[1:3,]
```
## Prediction and confidence intervals in \texttt{R}: Larches
\small
```{r}
newdata <- data.frame(nitrogen = 1.9, phosphor = 0.2,
potassium = 0.7)
newdata
predict.lm(model_l8, newdata, interval = "confidence")
predict.lm(model_l8, newdata, interval = "prediction")
```
# Variability in regression models
## Predictivity
Another way to gain insight into predictivity is to compare
\begin{itemize}
\item the variability \alert{around} the regression line
\item with the variability \alert{on} the regression line, i.e.\ the variability explained by the regression.
\end{itemize}
## Total and residual variability
Idea: compare variability of residuals and variability of (centered) predictions.
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- x + 2 * rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
## High predictivity: low variability around line
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- x + 0.5 * rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
**High predictivity**: variability of residuals is much **lower** than variability of predictions.
## Low predictivity: small variability on line
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- 0.01*x + 4*rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
**Low predictivity**: variability of residuals is much **higher** than variability of predictions.
## Sum of squares
- Let $\hat{y}_i$ be the prediction for observation $i$, then
\begin{align*}
SST & = \sum_{i=1}^n (y_i-\bar y)^2 \\
& = \sum_{i=1}^n(\hat{y}_i-\bar y)^2+ \sum_{i=1}^n (y_i-\hat{y}_i)^2\\
& = \sum_{i=1}^n (\hat{y}_i-\bar y)^2+ \sum_{i=1}^n e_i^2\\
& = SSR + SSE.
\end{align*}
- Total sum of squares (SST) = Regression sum of squares (SSR) +
Residual sum of squares (SSE).
- Total variability = Variability captured by regression + Variability in residuals.
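The decomposition holds exactly for any least-squares fit with an intercept; a minimal sketch on simulated data:

```{r}
# Verify SST = SSR + SSE on simulated data
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
m <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(m) - mean(y))^2)
SSE <- sum(resid(m)^2)
all.equal(SST, SSR + SSE)  # TRUE
```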
## Multiple correlation coefficient
- **Multiple correlation coefficient** or coefficient of determination:
\[
R^2 = \frac{SSR}{SST}.
\]
- Expresses the proportion of variability in the data that is captured by the association with the explanatory variables.
- A measure of the \alert{predictive value} of the explanatory variables.
- Always between 0 and 1.
- Simple linear regression: the square of the correlation between $X$ and $Y$.
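For simple linear regression, the last point is easy to check numerically (a sketch on simulated data):

```{r}
# R^2 equals the squared correlation in simple linear regression
set.seed(42)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
m <- lm(y ~ x)
all.equal(summary(m)$r.squared, cor(x, y)^2)  # TRUE
```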
## Multiple correlation coefficient
Look at the R `summary` output:
- CWD basal area:
```{verbatim}
Multiple R-squared: 0.7159, Adjusted R-squared: 0.6722
```
71.59\% of the variability in CWD basal area is explained by tree density.
- Larches:
```{verbatim}
Multiple R-squared: 0.8836, Adjusted R-squared: 0.8614
```
88.36\% of the variability in tree length is explained by the mineral composition of the needles.
\alert{Be careful!} A high $R^2$ is only needed for prediction, not for estimating the effect of $X$ on $Y$.
## Aside: adjusted multiple correlation coefficient
- $R^2$ always increases (gets closer to 1) as the model becomes more complex.
- To "punish" complexity, use adjusted $R^2$:
\[
R^2_\mathrm{adj} = 1 - \frac{n-1}{n - p}(1 - R^2).
\]
- Adjusted $R^2$ is always lower than $R^2$.
- Interpretation not so straightforward, used mainly for **model comparison**.
Larches: $n = 26$, $p = 5$, $R^2 = 0.8836$, so
\[
R^2_\mathrm{adj} = 1 - \frac{25}{21}(1 - 0.8836) = 0.8614.
\]
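The computation above can be checked directly in R:

```{r}
# Adjusted R^2 by hand (larches: n = 26, p = 5)
n <- 26; p <- 5; R2 <- 0.8836
1 - (n - 1) / (n - p) * (1 - R2)  # approximately 0.8614
```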
# Comparing simple vs. complex models
## Example
```{r, echo=FALSE, out.height="50%", fig.align='center'}
set.seed(123)
x <- seq(-5, 5, length.out = 20)
y_weak <- 0.1 * x + 1 * rnorm(20)
y_strong <- 0.7 * x + 1 * rnorm(20)
m_weak <- lm(y_weak ~ x)
m_strong <- lm(y_strong ~ x)
par(mfrow=c(1, 2), cex=1.5)
plot(x, y_weak, xlab = "", ylab = "", main = "Weak")
abline(m_weak)
abline(h = mean(y_weak), lty = "dashed")
plot(x, y_strong, xlab = "", ylab = "", main = "Strong")
abline(m_strong)
abline(h = mean(y_strong), lty = "dashed")
```
Compare the linear model (solid line) with a model that just predicts the mean (dashed line):
- Left: the linear model is barely better than the mean.
- Right: the linear model is **obviously better** than the mean.
## Nested models
Nested models:
- Complex model: a model with many predictors.
- Simple model: the complex model with some predictors removed.
Example (larches):
- Complex: $E(Y|X) = \alpha + \beta_N X_N+ \beta_P X_P + \beta_K X_K +\beta_r X_r$
- Simple: $E(Y|X) = \alpha + \beta_P X_P$
How do we quantify which model is better?
- Single regression: hypothesis test for $\beta$.
- Multiple regression: need to compare effect of **all coefficients at once**.
## Intuition: comparing variance
Idea: compare the residual variability (SSE) of both models to assess fit.
- $SSE_\mathrm{complex}$ is *always* lower than $SSE_\mathrm{simple}$.
- If it is *much* lower, we decide that the complex model is better.
Formalized via $F$-test:
- Null hypothesis: the simple and complex models fit the data equally well.
- Alternative hypothesis: the complex model fits better.
- Test statistic:
$$
f = \frac{\frac{SSE_\mathrm{simple}- SSE_\mathrm{complex}}{p_\mathrm{complex} - p_\mathrm{simple}}}{\frac{SSE_\mathrm{complex}}{n - p_\mathrm{complex} }}
\sim F_{p_\mathrm{complex} - p_\mathrm{simple}, n - p_\mathrm{complex}}.
$$
## Example: larches
Residual sum of squares:
- $SSE_\mathrm{simple} = 91404.49$
- $SSE_\mathrm{complex} = 30121.92$
Number of parameters:
- $p_\mathrm{simple} = 2$
- $p_\mathrm{complex} = 5$
Hypothesis test:
- Test statistic: $f = 14.24139$
- $p$-value: $p = 0.00002744$.
Conclusion: complex model is significantly better.
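The test statistic and $p$-value can be reproduced by hand from these numbers:

```{r}
# F-test by hand (larches, numbers from the slide)
SSE_simple <- 91404.49; SSE_complex <- 30121.92
p_simple <- 2; p_complex <- 5; n <- 26
f <- ((SSE_simple - SSE_complex) / (p_complex - p_simple)) /
  (SSE_complex / (n - p_complex))
p_value <- pf(f, p_complex - p_simple, n - p_complex, lower.tail = FALSE)
c(f = f, p = p_value)
```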
## Example in \texttt{R}: larches
\footnotesize
```{r}
model_l1 <- lm(length ~ phosphor)
model_l2 <- lm(length ~ nitrogen + phosphor + potassium + residu)
anova(model_l1, model_l2)
```
\normalsize
## \texttt{R} summary command
```{r, echo=TRUE, results="hide"}
summary(model_l8)
```
\footnotesize
\begin{verbatim}
(...)
Residual standard error: 35.55 on 21 degrees of freedom
Multiple R-squared: 0.8836, Adjusted R-squared: 0.8614
F-statistic: 39.85 on 4 and 21 DF, p-value: 1.603e-09
\end{verbatim}
\normalsize
The last line:
- $F$-statistic: compares the model to the intercept-only model.
- **"Is my complex model capturing something meaningful?"**