---
title: "Chapter 1: Linear regression"
subtitle: "Predictivity and variability"
author: "Joris Vankerschaver"
header-includes:
- \useinnertheme[shadow=true]{rounded}
- \usecolortheme{rose}
- \setbeamertemplate{footline}[frame number]
- \usepackage{color}
- \usepackage{graphicx}
- \usepackage{amsmath}
- \graphicspath{{./images/01-linear-regression}}
output:
beamer_presentation:
theme: "default"
keep_tex: true
includes:
in_header: columns.tex
---
```{r include = FALSE}
CWD <- read.table("./datasets/01-linear-regression/christ.csv", header = T, sep = ",", dec = ".")
attach(CWD)
model <- lm(CWD.BASA ~ RIP.DENS)
model3 <- lm(I(log(CWD.BASA)) ~ RIP.DENS + I(RIP.DENS^2))
summary(model)
confint(model)
needles <- read.table("./datasets/01-linear-regression/needles.txt", header = T, sep = "\t", dec = ".")
attach(needles)
model_l5 <- lm(length ~ nitrogen * phosphor + potassium
+ phosphor * residu)
model_l8 <- lm(length ~ nitrogen * phosphor + potassium)
library(car)
body.fat <- read.table("./datasets/01-linear-regression/bodyfatNKNW.txt", header = T, dec = ".")
attach(body.fat)
```
# Prediction
## Example
Use the model to predict the length of a larch from the mineral composition of its needles.
\footnotesize
```{r echo=FALSE}
coefs <- summary(model_l8)$coefficients
coefs
```
\normalsize
- Percentages: nitrogen = 1.9, phosphorus = 0.2, potassium = 0.7.
- Predicted **average** length:
\begin{multline*}
160.66-76.5\times 1.9-1120.7\times 0.2+138.06\times 0.7 + \\
724.38\times 1.9\times 0.2=163.1.
\end{multline*}
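As a sanity check, the hand computation above can be reproduced directly in R, using the rounded coefficients shown on the slide:

```{r}
# Predicted average length from the rounded coefficients
# (intercept, nitrogen, phosphor, potassium, nitrogen:phosphor)
160.66 - 76.5 * 1.9 - 1120.7 * 0.2 + 138.06 * 0.7 + 724.38 * 1.9 * 0.2
```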
## Accuracy of prediction
To determine the accuracy of a prediction, we need to take into account the
- **variability** of the observations around the regression line
- **precision** of the estimated regression line.
## Estimating variability via the residual standard error
\alert{Residual standard error} (RSE):
- CWD basal area: 1.01 on 13 degrees of freedom
- Larches: 35.55 on 21 degrees of freedom.
The residual standard error tells us that approximately 95\% of lengths, for nitrogen, phosphorus, and potassium percentages of 1.9, 0.2, and 0.7, are expected to lie within a distance
\[
2\times 35.55=71.1
\]
of the mean.
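Combined with the predicted average length of 163.1 from the previous slide, this gives a rough 95\% range (a sketch, assuming approximately normal residuals):

```{r}
# Rough 95% range: predicted mean +/- 2 RSE (larches)
rse <- 35.55
c(lower = 163.1 - 2 * rse, upper = 163.1 + 2 * rse)
```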
## Residual standard error in \texttt{R}
\scriptsize
```{r}
summary(model_l8)
```
\normalsize
## Residual standard error (by hand)
RSE can be calculated as
\[
RSE = \sqrt{\frac{SSE}{n - p}} = \sqrt{MSE}
\]
where SSE, the sum of squared residuals, is given by
\[
SSE = \sum_{i = 1}^n(y_i - \hat{y}_i)^2 = \sum_{i = 1}^n e_i^2,
\]
and $p$ is the number of parameters in the model.
For example (larches, $p = 5$):
```{r}
SSE <- sum(model_l8$residuals^2)
RSE <- sqrt(SSE/(26 - 5))
RSE
```
## Prediction/confidence intervals
- **Prediction intervals** combine both sources of inaccuracy
- Variability around the regression line
- Precision of the regression line.
- Designed to contain, with 95\% confidence, a random observation (e.g. CWD basal area or tree length) for given predictor values (e.g. tree density or given proportions of nitrogen, phosphorus, and potassium)
- **Confidence intervals** incorporate only the precision of the regression line.
- Designed to contain, with 95\% confidence, the **average** of random observations for given predictor values.
## Prediction and confidence intervals in \texttt{R}: CWD basal area
\small
```{r}
p <- predict(model3, newdata = data.frame(RIP.DENS=800:2200),
interval = "confidence")
```
```{r echo=FALSE}
p[1:3,]
```
```{r}
p <- predict(model3, newdata = data.frame(RIP.DENS=800:2200),
interval = "prediction")
```
```{r echo=FALSE}
p[1:3,]
```
## Prediction and confidence intervals in \texttt{R}: Larches
\small
```{r}
newdata <- data.frame(nitrogen = 1.9, phosphor = 0.2,
potassium = 0.7)
newdata
predict.lm(model_l8, newdata, interval = "confidence")
predict.lm(model_l8, newdata, interval = "prediction")
```
# Variability in regression models
## Predictivity
Another way to gain insight into predictivity is to compare
\begin{itemize}
\item the variability \alert{around} the regression line
\item with the variability \alert{on} the regression line, i.e.\ the variability explained by the regression.
\end{itemize}
## Total and residual variability
Idea: compare variability of residuals and variability of (centered) predictions.
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- x + 2 * rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
## High predictivity: low variability around line
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- x + 0.5 * rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
**High predictivity**: variability of residuals is much **lower** than variability of predictions.
## Low predictivity: small variability on line
```{r, echo=FALSE, out.height="50%", fig.align="center"}
set.seed(123)
x <- seq(-5, 5, length.out=20) + 0.2 * rnorm(20)
y <- 0.01*x + 4*rnorm(20)
m <- lm(y ~ x)
y_pred <- predict(m, newdata = data.frame(x = x))
par(mfrow = c(1, 2), cex = 1.5)
plot(NULL, xlim=c(min(x), max(x)),
ylim = c(min(y, y_pred), max(y, y_pred)), xlab = "", ylab = "")
segments(x, y_pred, x, y, lwd=2, col = "lightblue")
abline(m)
points(x, y, pch = 19)
points(x, y_pred, pch = 19, col = "red")
res <- m$residuals
plot(NULL, xlab = "", ylab = "",
xlim = c(-0.5, 1.5), ylim = c(min(res, y_pred), max(res, y_pred)), xaxt = "n")
points(rep(0, length(res)), res, pch = 19, col = "lightblue")
points(rep(1, length(y_pred)), y_pred - mean(y_pred), pch = 19, col = "red")
axis(1, at = c(0, 1), labels = c("Residuals", "Centered predictions"))
```
**Low predictivity**: variability of residuals is much **higher** than variability of predictions.
## Sum of squares
- Let $\hat{y}_i$ be the prediction for observation $i$, then
\begin{align*}
SST & = \sum_{i=1}^n (y_i-\bar y)^2 \\
& = \sum_{i=1}^n(\hat{y}_i-\bar y)^2+ \sum_{i=1}^n (y_i-\hat{y}_i)^2\\
& = \sum_{i=1}^n (\hat{y}_i-\bar y)^2+ \sum_{i=1}^n e_i^2\\
& = SSR + SSE.
\end{align*}
- Total sum of squares (SST) = Regression sum of squares (SSR) +
Residual sum of squares (SSE).
- Total variability = Variability captured by regression + Variability in residuals.
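The decomposition holds exactly for any least-squares fit with an intercept; a minimal sketch on simulated data:

```{r}
# Verify SST = SSR + SSE on simulated data
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
m <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(m) - mean(y))^2)
SSE <- sum(resid(m)^2)
all.equal(SST, SSR + SSE)  # TRUE
```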
## Multiple correlation coefficient
- **Multiple correlation coefficient** or coefficient of determination:
\[
R^2 = \frac{SSR}{SST}.
\]
- Expresses the proportion of variability in the data that is captured by the association with the explanatory variables.
- A measure of the \alert{predictive value} of the explanatory variables.
- Always between 0 and 1.
- Simple linear regression: the square of the correlation between $X$ and $Y$.
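For simple linear regression, the last point is easy to check numerically (a sketch on simulated data):

```{r}
# R^2 equals the squared correlation in simple linear regression
set.seed(42)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
m <- lm(y ~ x)
all.equal(summary(m)$r.squared, cor(x, y)^2)  # TRUE
```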
## Multiple correlation coefficient
Look at the R `summary` output:
- CWD basal area:
```{verbatim}
Multiple R-squared: 0.7159, Adjusted R-squared: 0.6722
```
71.59\% of the variability in CWD basal area is explained by tree density.
- Larches:
```{verbatim}
Multiple R-squared: 0.8836, Adjusted R-squared: 0.8614
```
88.36\% of the variability in tree length is explained by the mineral composition of the needles.
\alert{Be careful!} A high $R^2$ is only needed for prediction, not for estimating the effect of $X$ on $Y$.
## Aside: adjusted multiple correlation coefficient
- $R^2$ always increases (gets closer to 1) as the model becomes more complex.
- To "punish" complexity, use adjusted $R^2$:
\[
R^2_\mathrm{adj} = 1 - \frac{n-1}{n - p}(1 - R^2).
\]
- Adjusted $R^2$ is always lower than $R^2$.
- Interpretation not so straightforward, used mainly for **model comparison**.
Larches: $n = 26$, $p = 5$, $R^2 = 0.8836$, so
\[
R^2_\mathrm{adj} = 1 - \frac{25}{21}(1 - 0.8836) = 0.8614.
\]
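The computation above can be checked directly in R:

```{r}
# Adjusted R^2 by hand (larches: n = 26, p = 5)
n <- 26; p <- 5; R2 <- 0.8836
1 - (n - 1) / (n - p) * (1 - R2)  # approximately 0.8614
```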
# Comparing simple vs. complex models
## Example
```{r, echo=FALSE, out.height="50%", fig.align='center'}
set.seed(123)
x <- seq(-5, 5, length.out = 20)
y_weak <- 0.1 * x + 1 * rnorm(20)
y_strong <- 0.7 * x + 1 * rnorm(20)
m_weak <- lm(y_weak ~ x)
m_strong <- lm(y_strong ~ x)
par(mfrow=c(1, 2), cex=1.5)
plot(x, y_weak, xlab = "", ylab = "", main = "Weak")
abline(m_weak)
abline(h = mean(y_weak), lty = "dashed")
plot(x, y_strong, xlab = "", ylab = "", main = "Strong")
abline(m_strong)
abline(h = mean(y_strong), lty = "dashed")
```
Compare the linear model (solid line) with a model that just predicts the mean (dashed line):
- Left: the linear model is barely better than the mean.
- Right: the linear model is **obviously better** than the mean.
## Nested models
Nested models:
- Complex model: a model with many predictors.
- Simple model: the complex model with some predictors removed.
Example (larches):
- Complex: $E(Y|X) = \alpha + \beta_N X_N+ \beta_P X_P + \beta_K X_K +\beta_r X_r$
- Simple: $E(Y|X) = \alpha + \beta_P X_P$
How do we quantify which model is better?
- Single regression: hypothesis test for $\beta$.
- Multiple regression: need to compare effect of **all coefficients at once**.
## Intuition: comparing variance
Idea: compare the residual variability (SSE) of both models to assess fit.
- $SSE_\mathrm{complex}$ is *always* lower than $SSE_\mathrm{simple}$.
- If it is *much* lower, we decide that the complex model is better.
Formalized via $F$-test:
- Null hypothesis: the simple and complex models fit the data equally well.
- Alternative hypothesis: the complex model fits better.
- Test statistic:
$$
f = \frac{\frac{SSE_\mathrm{simple}- SSE_\mathrm{complex}}{p_\mathrm{complex} - p_\mathrm{simple}}}{\frac{SSE_\mathrm{complex}}{n - p_\mathrm{complex} }}
\sim F_{p_\mathrm{complex} - p_\mathrm{simple}, n - p_\mathrm{complex}}.
$$
## Example: larches
Residual sum of squares:
- $SSE_\mathrm{simple} = 91404.49$
- $SSE_\mathrm{complex} = 30121.92$
Number of parameters:
- $p_\mathrm{simple} = 2$
- $p_\mathrm{complex} = 5$
Hypothesis test:
- Test statistic: $f = 14.24139$
- $p$-value: $p = 0.00002744$.
Conclusion: complex model is significantly better.
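The test statistic and $p$-value can be reproduced by hand from these numbers:

```{r}
# F-test by hand (larches, numbers from the slide)
SSE_simple <- 91404.49; SSE_complex <- 30121.92
p_simple <- 2; p_complex <- 5; n <- 26
f <- ((SSE_simple - SSE_complex) / (p_complex - p_simple)) /
  (SSE_complex / (n - p_complex))
p_value <- pf(f, p_complex - p_simple, n - p_complex, lower.tail = FALSE)
c(f = f, p = p_value)
```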
## Example in \texttt{R}: larches
\footnotesize
```{r}
model_l1 <- lm(length ~ phosphor)
model_l2 <- lm(length ~ nitrogen + phosphor + potassium + residu)
anova(model_l1, model_l2)
```
\normalsize
## \texttt{R} summary command
```{r, echo=TRUE, results="hide"}
summary(model_l8)
```
\footnotesize
\begin{verbatim}
(...)
Residual standard error: 35.55 on 21 degrees of freedom
Multiple R-squared: 0.8836, Adjusted R-squared: 0.8614
F-statistic: 39.85 on 4 and 21 DF, p-value: 1.603e-09
\end{verbatim}
\normalsize
The last line:
- $F$-statistic: compares the model to the intercept-only model.
- **"Is my complex model capturing something meaningful?"**