---
title: "Principal component analysis: theory and concepts"
subtitle: Introduction to Statistical Modelling
author: Prof. Joris Vankerschaver
format:
  beamer:
    theme: Pittsburgh
    colortheme: default
    fonttheme: default
    header-includes: |
      \setbeamertemplate{frametitle}[default][left]
      \setbeamertemplate{footline}[frame number]
---
```{r, include=FALSE}
set.seed(1234)
library(tidyverse)
theme_set(theme_bw() + theme(text = element_text(size = 14)))
```

## Goal of dimensionality reduction

- Pre-processing
  - Remove collinear predictors (multicollinearity)
- Computational efficiency
  - Retain important features to speed up subsequent processing
- Visualization

## Learning outcomes

At the end of this lecture, you should be able to:

1. Explain the ideas behind PCA
2. Do a PCA by hand given a covariance matrix
3. Do a PCA with R
4. Interpret and explain the PCA results
5. Build and explain a PCR model

## References

- *Introduction to statistical modeling*. Chapter available on Ufora.
- *An Introduction to Statistical Learning*. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. Available for free online at https://www.statlearning.com/.
  - PCA: section 6.3

## Reminder: multicollinearity

Bodyfat dataset: 20 observations, predict amount of body fat from three body measurements.

:::: {.columns}
::: {.column width="50%"}
- Linear regression model:
\begin{align*}
\texttt{bodyfat} & = 117.085 \\
& + 4.334 \cdot \texttt{triceps} \\
& - 2.857 \cdot \texttt{thigh} \\
& - 2.186 \cdot \texttt{midarm}
\end{align*}
- **Do you see anything wrong with this?**
:::
::: {.column width="50%"}
```{r}
bodyfat <- read.csv("datasets/01-linear-regression/bodyfatNKNW.txt", sep = " ")
bodyfat_predictors <- bodyfat[, c("triceps.skinfold.thickness",
                                  "thigh.circumference",
                                  "midarm.circumference")]
library(GGally)
ggpairs(bodyfat_predictors)
```
:::
::::

##

\scriptsize
```{r, echo=FALSE}
m <- lm(bodyfat ~ ., data = bodyfat)
summary(m)
```

# Principal component analysis

## Directions of maximal variability

Intuitively:

- Find directions of maximal variability in the dataset
- Discard directions in which there is negligible variability

```{r, echo=FALSE}
#| fig-height: 3
#| fig-width: 5
#| fig-align: center
library(mvtnorm)

S <- matrix(c(5, 2, 2, 2), byrow = TRUE, nrow = 2)

adjust_cov_mean <- function(z, mean, sigma) {
  # Adjust NxD matrix z so that sample mean and sample
  # covariance are exactly `mean` and `sigma`

  # whiten the z's
  z_mean <- colMeans(z)
  z <- t(apply(z, 1, function(row) row - z_mean))
  R <- chol(cov(z))
  z <- t(solve(t(R), t(z)))

  # impose desired covariance, mean
  R <- chol(sigma)
  z <- t(apply(z %*% R, 1, function(row) row + mean))
  z
}

z <- rmvnorm(100, mean = c(0, 0))
z <- adjust_cov_mean(z, c(1, 1), S)

df <- as.data.frame(z)
colnames(df) <- c("X", "Y")

vx <- 2/5^0.5
vy <- 1/5^0.5

segment <- function(lx, ly, color = "cornflowerblue", start = c(1, 1), scale = 1) {
  geom_segment(aes(x = start[1], y = start[2],
                   xend = start[1] + scale * lx,
                   yend = start[2] + scale * ly),
               arrow = arrow(length = unit(0.5, 'cm')),
               color = color, lwd = 1.5, alpha = 1, lineend = "round")
}

l1_sqrt <- 6**0.5
l2_sqrt <- 1

ggplot() +
  geom_point(data = df, aes(x = X, y = Y)) +
  segment(vx, vy, color = "cornflowerblue", scale = l1_sqrt) +
  segment(-vy, vx, color = "chocolate", scale = l2_sqrt) +
  xlim(c(-5, 6)) +
  ylim(c(-2.5, 4.5)) +
  theme_void()
```

## Directions of less variability

Since `triceps` and `thigh` are highly correlated (see the correlation matrix on the next slide), specifying both is superfluous. What do we lose if we throw away one of these variables?

```{r}
#| fig-align: center
#| out-width: 4in
#| out-height: 3in
bodyfat <- read.csv("datasets/01-linear-regression/bodyfatNKNW.txt", sep = " ")
bodyfat_predictors <- bodyfat[, c("triceps.skinfold.thickness",
                                  "thigh.circumference",
                                  "midarm.circumference")]
library(GGally)
ggpairs(bodyfat_predictors)
```
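## How correlated are the predictors?

To quantify how strong the linear relation between `triceps` and `thigh` is, we can compute the correlation matrix of the three predictors:

```{r, echo=TRUE}
# Pairwise correlations; triceps and thigh are strongly correlated,
# which is the multicollinearity problem flagged on the previous slides.
round(cor(bodyfat_predictors), 3)
```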
## Notation

Dataset:

- $N$ observations $\mathbf{x}_k$, $k = 1, \ldots, N$
- Each observation is a (column) vector in $\mathbb{R}^D$

Data matrix:
$$
\mathbf{X} = \begin{bmatrix}
  \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T
\end{bmatrix} \in \mathbb{R}^{N \times D}
$$

Columns of the data matrix:

- Referred to as features, predictors, independent variables
- Denoted by $\mathbf{X}_i$, $i = 1, \ldots, D$

## Notation: example (body fat dataset)

- 20 observations with 3 features each
- Data matrix is a $20 \times 3$ matrix
- Features:
  - $\mathbf{X}_1$: `triceps.skinfold.thickness`
  - $\mathbf{X}_2$: `thigh.circumference`
  - $\mathbf{X}_3$: `midarm.circumference`

## The covariance matrix

Given observations $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{R}^D$, the variance-covariance matrix $\mathbf{S}$ is defined as:
$$
\mathbf{S} = \frac{1}{N} \sum_{k = 1}^N \left(\mathbf{x}_k\mathbf{x}_k^T - \bar{\mathbf{x}}\bar{\mathbf{x}}^T \right).
$$

Structure: variances and covariances between components of the data.
$$
\mathbf{S} = \begin{bmatrix}
  \text{Var}(x_1) & \text{Cov}(x_1, x_2) & \cdots & \text{Cov}(x_1, x_D) \\
  \text{Cov}(x_2, x_1) & \text{Var}(x_2) & \cdots & \text{Cov}(x_2, x_D) \\
  \vdots & \vdots & \ddots & \vdots \\
  \text{Cov}(x_D, x_1) & \text{Cov}(x_D, x_2) & \cdots & \text{Var}(x_D)
\end{bmatrix}
$$

## The covariance matrix: examples

```{r}
make_dataset <- function(S, n = 100) {
  z <- rmvnorm(n, mean = c(0, 0))
  z <- adjust_cov_mean(z, c(0, 0), S)
  df <- as.data.frame(z)
  colnames(df) <- c("X", "Y")
  return(df)
}
```

:::: {.columns}
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
S_ellipsoid <- matrix(c(6, 0, 0, 1), nrow = 2, byrow = TRUE)
df_ellipsoid <- make_dataset(S_ellipsoid)
ggplot(df_ellipsoid, aes(x = X, y = Y)) +
  geom_point() +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
$$
\mathbf{S} = \begin{bmatrix} 6 & 0 \\ 0 & 1 \end{bmatrix}
$$
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
S_sphere <- matrix(c(1, 0, 0, 1), nrow = 2)
df_sphere <- make_dataset(S_sphere)
ggplot(df_sphere, aes(x = X, y = Y)) +
  geom_point() +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
$$
\mathbf{S} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
S_rotated <- matrix(c(5, 2, 2, 2), nrow = 2)
df_rotated <- make_dataset(S_rotated)
ggplot(df_rotated, aes(x = X, y = Y)) +
  geom_point() +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
$$
\mathbf{S} = \begin{bmatrix} 5 & 2 \\ 2 & 2 \end{bmatrix}
$$
:::
::::

Clearly the covariance matrix will help us find directions of maximum variability, but how?
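## The covariance matrix in R

In practice we rarely compute $\mathbf{S}$ by hand. As a quick illustration with the simulated dataset `df_rotated` from the previous slide, R's `cov()` function returns the covariance matrix directly; note that `cov()` uses the denominator $N - 1$ rather than $N$, which makes little difference for moderately large $N$:

```{r, echo=TRUE}
# Sample covariance matrix of the simulated dataset;
# it should closely match S = [5 2; 2 2].
round(cov(df_rotated), 2)
```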
## Linear combination of features

- The **first principal component** $\mathbf{Z}_1$ is a linear combination of the columns of $\mathbf{X}$:
$$
\mathbf{Z}_1 = v_1 \mathbf{X}_1 + \cdots + v_D \mathbf{X}_D,
$$
where we will choose the coefficients $v_i$ so that the variance is maximal, in a sense made precise below.
- The coefficients $v_i$ are referred to as the **loadings** and the vector
$$
\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_D \end{bmatrix}
$$
is the **loadings vector**.
- Variance of $\mathbf{Z}_1$:
$$
\text{Var}(\mathbf{Z}_1) = \mathbf{v}^T \mathbf{S} \mathbf{v}.
$$

## Maximizing the variance

- Idea: choose the loadings $\mathbf{v}$ so that $\text{Var}(\mathbf{Z}_1)$ is **maximal**.
- Problem: just by increasing the norm of $\mathbf{v}$, the variance can be made as large as we want. Solution: impose that $\left\Vert \mathbf{v} \right\Vert = 1$.

::: {.callout-tip}
## PC1: maximization of variance

The loadings vector $\mathbf{v}$ for the first principal component is found by solving the following maximization problem:
$$
\text{maximize $\mathbf{v}^T \mathbf{S} \mathbf{v}$ so that $\mathbf{v}^T \mathbf{v} = 1$.}
$$
:::

## Geometric interpretation

- $\mathbf{Z}_1$: projection of the data onto the line in the direction of $\mathbf{v}$.
- Find the direction $\mathbf{v}$ so that the variance is maximal (blue)

![](./images/02-pca/max-variance-projection.svg){fig-align=center}

## Eigenvalue problem

Through Lagrange multipliers, one can show that maximizing the variance is equivalent to finding the eigenvalues and eigenvectors of $\mathbf{S}$.

::: {.callout-tip}
## PC1: eigenvalues

The loadings vector $\mathbf{v}$ for the first principal component is the eigenvector of $\mathbf{S}$ with the largest eigenvalue:
$$
\mathbf{S} \mathbf{v} = \lambda \mathbf{v}
$$
:::

Eigenvectors are typically quite efficient to compute.

## Amount of variance explained

Take the eigenvalue equation
$$
\mathbf{S} \mathbf{v} = \lambda \mathbf{v},
$$
and left-multiply by $\mathbf{v}^T$ to get
$$
\lambda = \lambda \mathbf{v}^T \mathbf{v} = \mathbf{v}^T \mathbf{S}\mathbf{v} = \text{Var}(\mathbf{Z}_1).
$$

::: {.callout-note}
## Eigenvalues and eigenvectors

- The largest eigenvalue of $\mathbf{S}$ is equal to the variance contained in ("explained by") the first principal component $\mathbf{Z}_1$.
- The corresponding eigenvector gives the loadings vector $\mathbf{v}$.
:::

## Example

:::: {.columns}
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_ellipsoid)
ggplot(df_ellipsoid, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_sphere)
ggplot(df_sphere, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_rotated)
ggplot(df_rotated, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::::

The loadings vectors may point in the opposite direction of what you expected... Why is this not a problem?
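## Numerical check: variance along PC1

As a quick numerical check of the identity $\text{Var}(\mathbf{Z}_1) = \mathbf{v}^T \mathbf{S} \mathbf{v} = \lambda_1$, we can project the simulated dataset `df_rotated` onto its first loadings vector (a sketch; `prcomp()` and `cov()` both use the $N - 1$ denominator):

```{r, echo=TRUE}
pca <- prcomp(df_rotated)
v <- pca$rotation[, 1]              # loadings vector of PC1
z1 <- as.matrix(df_rotated) %*% v   # scores along PC1
c(var(z1),                          # variance of the projected data
  t(v) %*% cov(df_rotated) %*% v,   # v^T S v
  pca$sdev[1]^2)                    # largest eigenvalue of S
```

All three values agree: the variance of the projected data is the largest eigenvalue of $\mathbf{S}$.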
## The remaining principal components

The next principal components $\mathbf{Z}_2, \mathbf{Z}_3, \ldots$ describe the variation in the data that remains after $\mathbf{Z}_1$ has been taken into account.

- For $\mathbf{Z}_2$:
$$
\text{maximize Var($\mathbf{Z}_2$) so that Cov($\mathbf{Z}_1$, $\mathbf{Z}_2$) = 0}
$$
- Equivalent to: find the second largest eigenvalue $\lambda_2$ and eigenvector $\mathbf{v}_2$.

Same story for the remaining principal components.

::: {.callout-note}
The principal components are *uncorrelated* linear combinations of features that *maximize variance*.
:::

## How many principal components are there?

Recall:

- $\mathbf{S}$ is a symmetric $D \times D$ matrix
- Such a matrix always has $D$ eigenvalues and eigenvectors

When $D \le N$ (more data points than features):

- In general, $D$ non-zero principal components

When $D > N$:

- $\mathbf{S}$ has rank at most $N$: $N$ non-zero principal components
- Can happen in high-dimensional datasets (e.g. gene assays)

## Percentage of variance explained

- Eigenvalue $\lambda_i$ is the amount of variance explained by PC $i$.
- Total amount of variance: $\lambda_1 + \lambda_2 + \cdots + \lambda_D$
- Percentage of variance explained by PC $i$:
$$
\frac{\lambda_i}{\lambda_1 + \cdots + \lambda_D}
$$

In many cases, the first few PCs will explain the majority of the variance (80% to 90%).

**Dimensionality reduction:** we can omit the remaining principal components with only a small loss of information

## Example

:::: {.columns}
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_ellipsoid)
ggplot(df_ellipsoid, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  segment(pca$rotation[1, 2], pca$rotation[2, 2],
          color = "chocolate", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_sphere)
ggplot(df_sphere, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  segment(pca$rotation[1, 2], pca$rotation[2, 2],
          color = "chocolate", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::: {.column width="33%"}
```{r}
#| fig-width: 2
#| fig-height: 2
#| out-width: 1.5in
#| out-height: 1.5in
pca <- prcomp(df_rotated)
ggplot(df_rotated, aes(x = X, y = Y)) +
  geom_point() +
  segment(pca$rotation[1, 1], pca$rotation[2, 1],
          color = "cornflowerblue", start = c(0, 0), scale = 3) +
  segment(pca$rotation[1, 2], pca$rotation[2, 2],
          color = "chocolate", start = c(0, 0), scale = 3) +
  xlab(NULL) + ylab(NULL) +
  xlim(c(-6, 6)) + ylim(c(-6, 6))
```
:::
::::

Blue: first PC, red: second PC.

## Worked out example (by hand)

:::: {.columns}
::: {.column width="50%"}
The dataset is chosen so that
$$
\mathbf{S} = \begin{bmatrix} 5 & 2 \\ 2 & 2 \end{bmatrix}.
$$

Eigenvalues:
$$
\lambda_1 = 6, \quad \lambda_2 = 1.
$$

Eigenvectors:
$$
\mathbf{v}_1 = \frac{1}{\sqrt{5}} \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad
\mathbf{v}_2 = \frac{1}{\sqrt{5}} \begin{bmatrix} -1 \\ 2 \end{bmatrix}.
$$
:::
::: {.column width="50%"}
```{r}
#| out-width: 2in
#| out-height: 1.2in
#| fig-width: 5
#| fig-height: 3
ggplot() +
  geom_point(data = df, aes(x = X, y = Y)) +
  segment(l1_sqrt*vx, l1_sqrt*vy, color = "cornflowerblue") +
  segment(-l2_sqrt*vy, l2_sqrt*vx, color = "chocolate") +
  xlim(c(-5, 6)) +
  ylim(c(-2.5, 4.5)) +
  theme(text = element_text(size = 20))
```
:::
::::
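## Worked out example (checking with `eigen`)

We can verify the by-hand computation with R's built-in `eigen()` function (a quick check on the covariance matrix above):

```{r, echo=TRUE}
S <- matrix(c(5, 2,
              2, 2), nrow = 2, byrow = TRUE)
eigen(S)   # eigenvalues 6 and 1, eigenvectors as columns
```

Note that `eigen()` may return an eigenvector with the opposite sign; as discussed before, this does not matter.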
## Worked out example (with R)

```{r, echo=TRUE}
prcomp(df)
```

Note:

- Standard deviations are **square roots** of the eigenvalues
- Columns of the rotation matrix give the loadings vectors

## Example: body fat dataset

```{r, echo=TRUE}
pca <- prcomp(bodyfat_predictors)
pca
```

## Percentage of variance explained

```{r, echo=TRUE}
summary(pca)
```

- The first 2 PCs explain over 99% of the variance in the data
- Interpretation:
\begin{align*}
\text{PC}_1 & = 0.693 \cdot \texttt{triceps} + 0.699 \cdot \texttt{thigh} + 0.179 \cdot \texttt{midarm} \\
\text{PC}_2 & = 0.151 \cdot \texttt{triceps} - 0.384 \cdot \texttt{thigh} - 0.910 \cdot \texttt{midarm}
\end{align*}

## Standardizing the features

Often, data are standardized before running PCA:
$$
\mathbf{Y}_i = \frac{\mathbf{X}_i - \bar{\mathbf{X}}_i}{\text{SD}(\mathbf{X}_i)}
$$

- Standardization puts all features on the same scale and affects the outcome of your PCA.
- Often a good idea when features have different units (e.g. mm, Watt, sec).
- **Not** a good idea when features have the same units (e.g. pixel intensities in an image).

In R: `prcomp(df, center = TRUE, scale = TRUE)`.
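## Standardizing the features: example

To see how standardization changes the result, we can rerun the PCA on the body fat predictors with scaling switched on and compare with the unscaled analysis from before (an illustration only; whether scaling is appropriate here depends on how comparable the measurement scales are):

```{r, echo=TRUE}
# PCA on standardized predictors; with scaling, every feature
# contributes a variance of 1, so the total variance equals
# the number of features.
pca_scaled <- prcomp(bodyfat_predictors, center = TRUE, scale. = TRUE)
summary(pca_scaled)
```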
# Interpretation of PCA results

## Score plot

- Scatter plot of two PCs (usually PC1 and PC2)
- Can be used to spot patterns in the data (see later)

```{r}
#| fig-align: center
#| fig-height: 4

# Explained variation for first and second component
total_var <- sum(pca$sdev^2)
pct_var_1 <- pca$sdev[1]^2 / total_var
pct_var_2 <- pca$sdev[2]^2 / total_var

df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2])
ggplot(df, aes(x = PC1, y = PC2)) +
  geom_point() +
  xlab(paste0("PC1 (", round(pct_var_1 * 100, 2), "% var. explained)")) +
  ylab(paste0("PC2 (", round(pct_var_2 * 100, 2), "% var. explained)"))
```

## Loadings plot

- Shows how much each variable contributes to each PC
- Useful to discern patterns in the PCs

```{r}
#| fig-align: center
#| fig-height: 4
pc <- prcomp(bodyfat_predictors)
df <- as.data.frame(pc$rotation) %>%
  rownames_to_column(var = "Variable") %>%
  pivot_longer(c("PC1", "PC2", "PC3"),
               names_to = "Component", values_to = "Loading")

ggplot(df, aes(x = as.factor(Variable), y = Loading,
               group = Component, color = Component)) +
  geom_line() +
  geom_point() +
  xlab("Variable")
```

## Scree plot

- Shows the proportion of variance explained per PC
- Useful to determine the PCs that contribute most to the variance
- Can be made with R's `screeplot` command, but it is better to make your own (it's just a line plot)

```{r}
#| fig-align: center
#| fig-height: 4
gg_screeplot <- function(pc, n = length(pc$sdev)) {
  sdev <- pc$sdev[1:n]
  var_explained <- sdev^2 / sum(sdev^2)
  total_var <- cumsum(var_explained)
  df_var <- data.frame(
    n = seq(1, n),
    v = var_explained,
    t = total_var)
  ggplot(df_var) +
    geom_line(aes(x = n, y = v, color = "Per component")) +
    geom_point(aes(x = n, y = v, color = "Per component")) +
    geom_line(aes(x = n, y = t, color = "Cumulative")) +
    geom_point(aes(x = n, y = t, color = "Cumulative")) +
    ylim(c(0, 1)) +
    scale_color_manual(
      name = "Explained variance",
      breaks = c("Per component", "Cumulative"),
      values = c("Per component" = "cornflowerblue",
                 "Cumulative" = "chocolate")
    ) +
    scale_x_continuous(breaks = 1:n) +
    xlab("Principal component") +
    ylab("Explained variance (proportion)")
}
gg_screeplot(pc)
```

## Selecting the number of principal components to retain

:::: {.columns}
::: {.column width="50%"}
\vspace*{2cm}

Many heuristics exist for selecting the "optimal" number of PCs:

- Explain a fixed percentage (e.g. 80%) of the variance
- "Elbow" in the scree plot
- ...

Can also determine the number of PCs dynamically (e.g. if doing regression on the PCs, look at $R^2$)
:::
::: {.column width="50%"}
![](./images/02-pca/Yamnuska_bottom_cliff.jpg)

\scriptsize Image credit: https://en.wikipedia.org/wiki/Scree (Kevin Lenz, CC BY-SA 2.5)
:::
::::

## Biplot

- Biplot = loadings plot + score plot
- Numbers: data for the first two PCs
- Arrows: contribution of the variables to the first two PCs

```{r, echo=TRUE}
#| fig-align: center
#| fig-width: 4.5
#| fig-height: 4.5
#| out-width: 2in
#| out-height: 2in
biplot(pc)
```

# Principal component regression

## Principal component regression (PCR)

Idea:

- Do a PCA (usually with standardized features)
- Build a linear model on a reduced number of PCs

Why do we do this?

- PCs are uncorrelated, so this takes care of multicollinearity
- Results in a simpler model where only the important features play a role

Not always the right thing to do; alternatives exist (e.g. ridge regression, see later)

## PCR by hand: starting with PC 1

\scriptsize
```{r, echo=TRUE}
pc1 <- pc$x[, "PC1"]
model_1 <- lm(bodyfat$bodyfat ~ pc1)
summary(model_1)
```

## PCR by hand: adding PC 2

\scriptsize
```{r}
pc2 <- pc$x[, "PC2"]
model_12 <- lm(bodyfat$bodyfat ~ pc1 + pc2)
summary(model_12)
```

## Aside: variance inflation factors

Reminder: principal components are uncorrelated by definition, so all the VIFs will be equal to 1.

```{r}
library(faraway)
plot(vif(model_12))
```

## PCR by hand: putting together the final model

- We would probably select `model_1`:
$$
\texttt{bodyfat} = 20.195 + 0.614 \cdot \texttt{PC}_1
$$
- This model uses the principal components as predictors, but we typically want the original predictors. Recall
$$
\text{PC}_1 = 0.693 \cdot \texttt{triceps} + 0.699 \cdot \texttt{thigh} + 0.179 \cdot \texttt{midarm}
$$
- Putting these two together gives the final PCR model:
\begin{multline*}
\texttt{bodyfat} = 20.195 \ + \\ 0.426 \cdot \texttt{triceps} + 0.429 \cdot \texttt{thigh} + 0.110 \cdot \texttt{midarm}
\end{multline*}

Building a model stepwise and rewriting it back in terms of the original predictors is a lot of work (the rewriting step is sketched on the next slide). Is there a better way?
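## PCR by hand: automating the rewriting step

The rewriting can be automated by multiplying the fitted PC1 coefficient with the corresponding loadings vector. A minimal sketch, using the `pc` and `model_1` objects defined earlier (the names `b_pc1`, `beta`, and `intercept` are just local helpers):

```{r, echo=TRUE}
b_pc1 <- coef(model_1)["pc1"]          # coefficient of PC1
beta <- b_pc1 * pc$rotation[, "PC1"]   # coefficients of the (centered) predictors
beta

# The PC scores are computed from centered predictors, so on the original
# scale the intercept absorbs the predictor means:
intercept <- coef(model_1)["(Intercept)"] - sum(beta * pc$center)
intercept
```

The slope coefficients match the ones on the previous slide; the intercept needs the small correction above because the PC scores are computed from centered predictors.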
## PCR via the pls package

\scriptsize
```{r, echo=TRUE}
library(pls)
pcr_model <- pcr(bodyfat ~ ., data = bodyfat, validation = "CV")
summary(pcr_model)
```

## Selecting the optimal number of components

- "1-sigma rule": select the model with the smallest number of components whose cross-validation error is at most 1 standard deviation away from that of the optimal model
- In plain terms: 2 PCs give the lowest RMSEP, but we can go down to 1 PC without losing too much

```{r, echo=TRUE}
#| fig-height: 4
#| out-height: 3in
selectNcomp(pcr_model, method = "onesigma", plot = TRUE)
```

## Ridge regression (optional)

- PCR is not the only way to "regularize" a regression model
- Ridge regression: like ordinary regression, but penalize the model for coefficients that become too large.
- Objective function:
$$
J_{\text{ridge}}(\alpha, \beta) = J(\alpha, \beta) + \lambda (\alpha^2 + \beta^2).
$$
- The parameter $\lambda$ is set to a fixed value or determined via cross-validation.
  - $\lambda = 0$: ridge regression = ordinary regression
  - $\lambda \to \infty$: all coefficients become zero
- Can be done with the `glmnet` package (not very user-friendly)

## Example session

\scriptsize
```{r, warning=FALSE, echo=TRUE}
library(glmnet)

predictors <- data.matrix(bodyfat_predictors)
outcome <- bodyfat$bodyfat

lambdas <- 10^seq(2, -2, by = -.1)
ridge_cv <- cv.glmnet(predictors, outcome, alpha = 0, lambda = lambdas)
best <- ridge_cv$lambda.min
best

best_ridge <- glmnet(predictors, outcome, alpha = 0, lambda = best)
coef(best_ridge)
```
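## Ridge regression: coefficient paths (optional)

To see the shrinkage effect of $\lambda$ directly, we can fit the ridge model over the whole grid of $\lambda$ values and plot the coefficient paths (a short sketch using the objects from the example session):

```{r, echo=TRUE}
#| fig-height: 4
ridge_path <- glmnet(predictors, outcome, alpha = 0, lambda = lambdas)
# Each curve is one coefficient; the coefficients shrink towards zero
# as log(lambda) increases.
plot(ridge_path, xvar = "lambda", label = TRUE)
```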