Update basis expansion slides
gpleiss committed Oct 3, 2024
1 parent f3eefac commit 8f63b3c
Showing 11 changed files with 4,888 additions and 1,109 deletions.

1,705 changes: 1,267 additions & 438 deletions _freeze/schedule/slides/10-basis-expansions/figure-revealjs/unnamed-chunk-1-1.svg
2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css


196 changes: 186 additions & 10 deletions schedule/slides/10-basis-expansions.qmd

{{< include _titleslide.qmd >}}

# Quick Review of Ridge and Lasso

## OLS: low bias, (potentially) high variance

$$
\begin{gathered}
\text{Model:} \quad y = x^\top \beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)
\\
\text{OLS:} \quad \bls = \argmin_\beta \| \y - \X\beta\|_2^2\quad
\end{gathered}
$$

- Bias: $\E[\bls] - \beta = \E[\E[\bls \mid \X]] - \beta = \ldots = 0$
- Variance: $\Var{\bls} = \sigma^2(\X^\top \X)^{-1}$

[When is $(\X^\top \X)^{-1}$ large?]{.secondary}

. . .

When we have *nearly collinear features*\
Nearly collinear features $\Rightarrow$ small singular values $\Rightarrow$ large matrix inverse\
[(Collinearity is more likely when $n$ is small)]{.secondary}
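
For example, here is a quick numerical sketch (simulated data, not part of the original slides) of how near collinearity inflates $(\X^\top \X)^{-1}$, and hence the variance of $\bls$:

```{r}
#| eval: false
# Sketch: nearly collinear columns make (X'X)^{-1} blow up
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)       # nearly collinear with x1
X_good <- cbind(x1, rnorm(n))        # well-conditioned design
X_bad <- cbind(x1, x2)               # nearly collinear design
max(diag(solve(crossprod(X_good))))  # modest
max(diag(solve(crossprod(X_bad))))   # huge
```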


## Reducing variance (at the cost of additional bias)

0. *Manual variable selection*
1. *Ridge regression*: $\min_\beta \| \y - \X\beta\|_2^2 + \lambda \Vert \beta \Vert_2^2$
2. *Lasso*: $\min_\beta \| \y - \X\beta\|_2^2 + \lambda \Vert \beta \Vert_1$

. . .

*Ridge* shrinks all parameters towards 0.\
*Lasso* performs automatic variable selection.\

. . .

\
Increasing $\lambda$ *increases bias* and *decreases variance*.\
[(For ridge, larger $\lambda$ $\rightarrow$ smaller $\beta$.)]{.small}\
[(For lasso, larger $\lambda$ $\rightarrow$ sparser $\beta$.)]{.small}


---

```{r, fig.width=11,fig.align="center",dev="svg",fig.height=7, echo=FALSE}
library(glmnet)
data(prostate, package = "ElemStatLearn")
X <- prostate |> dplyr::select(-train, -lpsa) |> as.matrix()
Y <- prostate$lpsa
ridge <- glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum
lasso <- glmnet(X, Y, alpha = 1, lambda.min.ratio = 1e-10) # added to get a minimum
ridge.cv <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum
lasso.cv <- cv.glmnet(X, Y, alpha = 1, lambda.min.ratio = 1e-10) # added to get a minimum
par(mfrow = c(2, 2))
plot(ridge, main = "Ridge", xvar = "lambda")
plot(lasso, main = "Lasso", xvar = "lambda")
plot(ridge.cv, main = "Ridge") # plot.cv.glmnet always uses log(lambda) on the x-axis
plot(lasso.cv, main = "Lasso")
```


## Computing ridge and lasso predictors

- *OLS:* $\bls = (\X^\top \X)^{-1}\X^\top \y$
- *Ridge:* $\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y$
- *Lasso:* No closed form solution 😔
- (Convex optimization problem solvable with iterative algorithms.)
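
A minimal sketch of the two closed forms (simulated data, intercept ignored; not the prostate example above):

```{r}
#| eval: false
set.seed(42)
n <- 100; p <- 5; lambda <- 1
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
beta_ols <- solve(crossprod(X), crossprod(X, y))                      # (X'X)^{-1} X'y
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y)) # (X'X + lambda I)^{-1} X'y
```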


## (Optional) Proof that ridge shrinks parameters

(This proof is not too hard if you use facts about SVDs and eigenvalues. I recommend working through it.)

Let $\mathbf{UDV^\top} = \mathbf X$ be the SVD of $\mathbf X$.

$$
\begin{align}
\brl &= (\X^\top \X + \lambda \mathbf{I})^{-1} \X^\top \y
\\
&= (\X^\top \X + \lambda \mathbf{I})^{-1} {\color{blue} \X^\top \X (\X^\top \X)^{-1}} \X^\top \y
\\
&= (\X^\top \X + \lambda \mathbf{I})^{-1} \X^\top \X \underbrace{\left( (\X^\top \X)^{-1} \X^\top \y \right)}_{\bls}
\\
&= (\mathbf V \mathbf D^2 \mathbf V^\top + \lambda \mathbf{I})^{-1} \mathbf V \mathbf D^2 \mathbf V^\top \bls
\\
&= (\mathbf V (\mathbf D^2 + \lambda I) \mathbf V^\top)^{-1} \mathbf V \mathbf D^2 \mathbf V^\top \bls
\\
&= \mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top \bls
\end{align}
$$

---

$(\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1}$ is a diagonal matrix with entries $d_i^2/(d_i^2 + \lambda) < 1$ (for any $\lambda > 0$).

So $\mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top$ is a matrix that *shrinks* all coefficients of any vector multiplied against it.

$$
\Vert \mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top \bls \Vert_2 < \Vert \bls \Vert_2.
$$
So $\Vert \brl \Vert_2 < \Vert \bls \Vert_2$
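
A quick numerical check of this identity (a sketch with simulated data):

```{r}
#| eval: false
set.seed(1)
n <- 100; p <- 4; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
beta_ols <- solve(crossprod(X), crossprod(X, y))
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
s <- svd(X)  # X = U D V'
shrunk <- s$v %*% (diag(s$d^2 / (s$d^2 + lambda)) %*% crossprod(s$v, beta_ols))
all.equal(c(beta_ridge), c(shrunk))  # TRUE, up to numerical error
c(ols = sqrt(sum(beta_ols^2)), ridge = sqrt(sum(beta_ridge^2)))  # ridge norm is smaller
```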


# Now onto new stuff
(But first, more clickers!)

## What about nonlinear things


$$\text{Our usual model:} \quad \Expect{Y \given X=x} = \sum_{j=1}^p x_j\beta_j$$

Now we relax this assumption of linearity:

Polynomials
: $x \mapsto \left(1,\ x,\ x^2, \ldots, x^p\right)$ (technically, not quite this, they are orthogonalized)

Linear splines
: $x \mapsto \bigg(1,\ x,\ (x-k_1)_+,\ (x-k_2)_+,\ldots, (x-k_p)_+\bigg)$ for some $\{k_1,\ldots,k_p\}$

<!--
Cubic splines
: $x \mapsto \bigg(1,\ x,\ x^2,\ x^3,\ (x-k_1)^3_+,\ (x-k_2)^3_+,\ldots, (x-k_p)^3_+\bigg)$ for some choices $\{k_1,\ldots,k_p\}$
-->

Fourier series
: $x \mapsto \bigg(1,\ \cos(2\pi x),\ \sin(2\pi x),\ \cos(2\pi 2 x),\ \sin(2\pi 2 x), \ldots, \cos(2\pi p x),\ \sin(2\pi p x)\bigg)$

```{r fig.height=3, fig.width=9}
#| code-fold: true
library(cowplot)
library(ggplot2)
relu_shifted <- function(x, shift) {pmax(0, x - shift)}
# Create a sequence of x values
x_vals <- seq(-3, 3, length.out = 1000)
# Create a data frame with all the shifted functions
data <- data.frame(
  x = rep(x_vals, 5),
  polynomial = c(x_vals, x_vals^2, x_vals^3, x_vals^4, x_vals^5),
  linear.splines = c(
    relu_shifted(x_vals, 2), relu_shifted(x_vals, 1), relu_shifted(x_vals, 0),
    relu_shifted(x_vals, -1), relu_shifted(x_vals, -2)
  ),
  fourier = c(
    cos(pi / 2 * x_vals), sin(pi / 2 * x_vals), cos(pi / 4 * x_vals),
    sin(pi / 4 * x_vals), cos(pi * x_vals)
  ),
  function_label = rep(c("f1", "f2", "f3", "f4", "f5"), each = length(x_vals))
)
# Plot each family of basis functions side by side
g1 <- ggplot(data, aes(x = x, y = polynomial, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
g2 <- ggplot(data, aes(x = x, y = linear.splines, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
g3 <- ggplot(data, aes(x = x, y = fourier, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
plot_grid(g1, g2, g3, ncol = 3)
```

## How do you choose?

[Procedure 1:]{.secondary}

1. Pick your favorite basis. (Think about whether the data might "prefer" one basis over another.)
- [How "smooth" is the response you're trying to model?]{.small}

<!-- This is not as easy as it sounds. For instance, if $f$ is a step function, linear splines will do well with good knots, but polynomials will be terrible unless you have __lots__ of terms. -->

2. Perform OLS on different orders.

3. Use a model selection criterion to choose the order (see the sketch below).
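
A minimal sketch of this procedure with a polynomial basis and simulated data, using AIC as the selection criterion:

```{r}
#| eval: false
set.seed(1)
n <- 200
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)  # smooth but nonlinear truth
orders <- 1:10
aics <- sapply(orders, function(p) AIC(lm(y ~ poly(x, p))))  # OLS at each order
orders[which.min(aics)]  # chosen order
```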

---

```{r echo=FALSE, fig.height=2, fig.width=6}
plot_grid(g1, g2, g3, ncol = 3)
```

```{r fig.height=4, fig.width=12, echo=FALSE}
set.seed(12345)
coeff <- rnorm(5)
funcs <- data.frame(x = x_vals,
f1 = (matrix(data$polynomial, nrow = length(x_vals), ncol = 5) %*% coeff),
f2 = (matrix(data$fourier, nrow = length(x_vals), ncol = 5) %*% coeff),
f3 = (matrix(data$linear.splines, nrow = length(x_vals), ncol = 5) %*% coeff))
g4 <- ggplot(funcs, aes(x = x, y = f1)) + geom_line(color = "blue")
g5 <- ggplot(funcs, aes(x = x, y = f2)) + geom_line(color = "blue")
g6 <- ggplot(funcs, aes(x = x, y = f3)) + geom_line(color = "blue")
plot_grid(g4, g5, g6, ncol = 3)
```

Which basis do you think will work best for $f_1$, $f_2$, and $f_3$?

. . .

[*Answer: $f_1$ was built from the polynomial basis, $f_2$ from the Fourier basis, and $f_3$ from linear splines.*]{.secondary}


---

## How do you choose?

[Procedure 2:]{.secondary}

1. Use a bunch of high-order bases, say linear splines, Fourier series, and whatever else you like.

2. Estimate polynomials up to 20 as before and choose best order.

3. Do ridge, lasso, and elastic net ($\alpha=.5$) on 20th-order polynomials, splines with 20 knots, and Fourier series with $p=20$. Choose the tuning parameter using `lambda.1se` (see the sketch after this list).

4. Repeat steps 1-3 ten times (with different splits)
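
A minimal sketch of step 3 (lasso only, simulated data; the slides' actual analysis uses the `arcuate` data below):

```{r}
#| eval: false
library(glmnet)
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(4 * pi * x) + rnorm(n, sd = 0.3)
Phi <- cbind(
  poly(x, 20),
  splines::bs(x, df = 20, degree = 1),
  cos(2 * pi * outer(x, 1:20)), sin(2 * pi * outer(x, 1:20))
)
lasso_cv <- cv.glmnet(Phi, y, alpha = 1)
yhat <- predict(lasso_cv, Phi, s = "lambda.1se")  # tuning parameter via the 1-SE rule
```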

Expand All @@ -162,7 +335,7 @@ mapto01 <- function(x, pad = .005) (x - min(x) + pad) / (max(x) - min(x) + 2 * p
x <- mapto01(arcuate$position)
Xmat <- cbind(
poly(x, 20),
splines::bs(x, df = 20, degree = 1),
cos(2 * pi * outer(x, 1:20)), sin(2 * pi * outer(x, 1:20))
)
y <- arcuate$fa

Used $2p+1$ dimensions with Fourier basis
Each case applied a [feature map]{.secondary} to $x$, call it $\Phi$

We used new "features" $\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)$
w/ a linear model

$$f(x) = \Phi(x)^\top \beta$$

Neural networks (coming in module 4) build upon this idea

<!-- You've also probably seen it in earlier courses when you added interaction terms or other transformations. -->

. . .

\
Some methods (notably Support Vector Machines and other Kernel Machines) allow $k=\infty$\
[See [ISLR] 9.3.2 for baby overview or [ESL] 5.8 (note 😱)]{.small}
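
In code, the pattern is just "map, then fit a linear model" (a sketch with a made-up $\Phi$ and simulated data):

```{r}
#| eval: false
Phi <- function(x) cbind(1, x, pmax(x - 0.5, 0), cos(2 * pi * x), sin(2 * pi * x))
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
beta_hat <- solve(crossprod(Phi(x)), crossprod(Phi(x), y))  # OLS in the mapped features
f_hat <- function(x_new) Phi(x_new) %*% beta_hat            # f(x) = Phi(x)' beta
```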


# Next time...
