From dad2ff3e2960263d360c2084b8556a43039e9907 Mon Sep 17 00:00:00 2001 From: "Daniel J. McDonald" Date: Wed, 3 Apr 2024 08:18:32 -0700 Subject: [PATCH] rebuild --- .../bootstrap/execute-results/html.json | 2 +- .../execute-results/html.json | 2 +- .../slides/git/execute-results/html.json | 2 +- .../grad-school/execute-results/html.json | 2 +- .../model-selection/execute-results/html.json | 20 ++ .../figure-revealjs/unnamed-chunk-1-1.svg | 276 ++++++++++++++++++ .../organization/execute-results/html.json | 2 +- .../pca-intro/execute-results/html.json | 2 +- .../presentations/execute-results/html.json | 2 +- .../execute-results/html.json | 2 +- .../slides/syllabus/execute-results/html.json | 2 +- .../time-series/execute-results/html.json | 2 +- .../unit-tests/execute-results/html.json | 2 +- 13 files changed, 307 insertions(+), 11 deletions(-) create mode 100644 _freeze/schedule/slides/model-selection/execute-results/html.json create mode 100644 _freeze/schedule/slides/model-selection/figure-revealjs/unnamed-chunk-1-1.svg diff --git a/_freeze/schedule/slides/bootstrap/execute-results/html.json b/_freeze/schedule/slides/bootstrap/execute-results/html.json index 140af83..55e2919 100644 --- a/_freeze/schedule/slides/bootstrap/execute-results/html.json +++ b/_freeze/schedule/slides/bootstrap/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "d2869649cdb959b50f0e8ab08e8f9e05", "result": { - "markdown": "---\nlecture: \"The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 
2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n\n## {background-image=\"https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg\" background-size=\"contain\"}\n\n\n## {background-image=\"http://rackjite.com/wp-content/uploads/rr11014aa.jpg\" background-size=\"contain\"}\n\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat y$, or a median, etc.\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n. . .\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n\n## Refresher on sampling distributions\n\n1. If $X_i$ are iid Normal $(0,\\sigma^2)$, then $\\Var{\\bar{X}} = \\sigma^2 / n$.\n1. 
If $X_i$ are iid and $n$ is big, then $\\Var{\\bar{X}} \\approx \\Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\\Var{\\bar{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate a LDA on some data.\n\nI get a new $x_0$ and produce $\\hat{Pr}(y_0 =1 \\given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \\given x_0)$?\n\n. . .\n\nThe bootstrap gives this to you.\n\n\n\n\n## Procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate a LDA on this sample.\n3. Produce a new prediction, call it $\\widehat{Pr}_b(y_0 =1 \\given x_0)$.\n4. Repeat 1-3 $b = 1,\\ldots,B$ times.\n5. CI: $\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]$\n\n\n\n$\\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Very basic example\n\n* Let $X_i\\sim Exponential(1/5)$. The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2 \\frac{s}{\\sqrt{n}}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Now what\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2022-11-01)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.669627\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nbootMed <- function() median(sample(x, replace = TRUE)) # 
resample, and get the median\nFhat <- replicate(B, bootMed()) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## {background-image=\"gfx/boot1.png\" background-size=\"contain\"}\n\n## {background-image=\"gfx/boot2.png\" background-size=\"contain\"}\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.9293 -1.0460 -0.1407 0.8298 16.2536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.81895 0.07678 49.74 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 2.549 on 143 degrees of freedom\nMultiple R-squared: 0.9454,\tAdjusted R-squared: 0.945 \nF-statistic: 2474 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## When we fit models, we examine diagnostics\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe tails are too fat, I don't believe that CI...\n:::\n\n::: {.column width=\"50%\"}\n\n\nWe bootstrap\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.654977 3.955927 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for most cases.\n\n. . 
.\n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n## Same data\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1-alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.673559 3.970251 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nnewcats <- fatcats\nfor (b in 1:B) {\n samp <- sample(residuals(cats.lm), replace = TRUE)\n newcats$Hwt <- predict(cats.lm) + samp # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.665531 3.961896 \n```\n:::\n:::\n\n\n:::\n::::\n\n## Bootstrap error sources\n\n\n[Simulation error]{.secondary}:\n\nusing only $B$ samples to estimate $F$ with $\\hat{F}$.\n\n[Statistical error]{.secondary}:\n\nour data depended on a sample from the population. 
We don't have the whole population so we make an error by using a sample \n\n(Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\n[Specification error]{.secondary}:\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n\n", + "markdown": "---\nlecture: \"The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n\n## {background-image=\"https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg\" background-size=\"contain\"}\n\n\n## {background-image=\"http://rackjite.com/wp-content/uploads/rr11014aa.jpg\" background-size=\"contain\"}\n\n\n## In statistics...\n\nThe \"bootstrap\" works. 
And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\hat\beta$, $\hat y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\Var{\hat\beta}$, a confidence interval for $\hat y$, or a median, etc.\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n. . .\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n\n## Refresher on sampling distributions\n\n1. If $X_i$ are iid Normal $(0,\sigma^2)$, then $\Var{\bar{X}} = \sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\Var{\bar{X}} \approx \Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\Var{\bar{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate an LDA on some data.\n\nI get a new $x_0$ and produce $\hat{Pr}(y_0 =1 \given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \given x_0)$?\n\n. . .\n\nThe bootstrap gives this to you.\n\n\n\n\n## Procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate an LDA on this sample.\n3. Produce a new prediction, call it $\widehat{Pr}_b(y_0 =1 \given x_0)$.\n4. Repeat 1-3 $b = 1,\ldots,B$ times.\n5. CI: $\left[2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}_{boot}(1-\alpha/2),\ 2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}_{boot}(\alpha/2)\right]$\n\n\n\n$\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Very basic example\n\n* Let $X_i\sim Exponential(1/5)$. 
The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2 \\frac{s}{\\sqrt{n}}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Now what\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2022-11-01)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.669627\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nbootMed <- function() median(sample(x, replace = TRUE)) # resample, and get the median\nFhat <- replicate(B, bootMed()) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## {background-image=\"gfx/boot1.png\" background-size=\"contain\"}\n\n## {background-image=\"gfx/boot2.png\" background-size=\"contain\"}\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, 
data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.9293 -1.0460 -0.1407 0.8298 16.2536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.81895 0.07678 49.74 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.549 on 143 degrees of freedom\nMultiple R-squared: 0.9454,\tAdjusted R-squared: 0.945 \nF-statistic: 2474 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## When we fit models, we examine diagnostics\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe tails are too fat, I don't believe that CI...\n:::\n\n::: {.column width=\"50%\"}\n\n\nWe bootstrap\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.654977 3.955927 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for most cases.\n\n. . 
.\n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n## Same data\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1-alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.673559 3.970251 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nnewcats <- fatcats\nfor (b in 1:B) {\n samp <- sample(residuals(cats.lm), replace = TRUE)\n newcats$Hwt <- predict(cats.lm) + samp # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.665531 3.961896 \n```\n:::\n:::\n\n\n:::\n::::\n\n## Bootstrap error sources\n\n\n[Simulation error]{.secondary}:\n\nusing only $B$ samples to estimate $F$ with $\\hat{F}$.\n\n[Statistical error]{.secondary}:\n\nour data depended on a sample from the population. 
We don't have the whole population so we make an error by using a sample \n\n(Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\n[Specification error]{.secondary}:\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n\n", "supporting": [ "bootstrap_files" ], diff --git a/_freeze/schedule/slides/cluster-computing/execute-results/html.json b/_freeze/schedule/slides/cluster-computing/execute-results/html.json index 74a1167..90bae3f 100644 --- a/_freeze/schedule/slides/cluster-computing/execute-results/html.json +++ b/_freeze/schedule/slides/cluster-computing/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "58a87718a72a159acbb6294711461399", "result": { - "markdown": "---\nlecture: \"Cluster computing (at UBC)\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 
\\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n## UBC HPC\n\n### 3 potentially useful systems:\n\n1. Department VM\n1. [UBC ARC Sockeye](https://arc.ubc.ca/ubc-arc-sockeye)\n1. [Digital Research Alliance of Canada](https://docs.alliancecan.ca/wiki/Technical_documentation)\n\n\nI've only used 1 and 3. I mainly use 3.\n\n\n### Accessing\n\nAs far as I know, access for students requires \"faculty\" support\n\n1. Email The/Binh. \n1. Possible you can access without a faculty PI.\n1. Email your advisor to ask for an account.\n\n\n### The rest of this will focus on 3.\n\n\n\n## Prerequisites\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n1. Command line interface (Terminal on Mac)\n\n2. (optional) helpful to have ftp client. (Cyberduck)\n\n3. [Globus Connect](https://www.globus.org/globus-connect-personal). File transfer approved by DRAC\n:::\n\n::: {.column width=\"50%\"}\nUseful CL commands\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\ncd ~/path/to/directory\n\ncp file/to/copy.txt copied/to/copy1.txt\n\nrm file/to/delete.txt\n\nrm -r dir/to/delete/\n\nls -a # list all files\n```\n:::\n\n:::\n::::\n\n\n\n## How to connect\n\nLogin to a system:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nssh dajmcdon@cedar.alliancecan.ca\n```\n:::\n\n\n* Upon login, you're on a \"head\" or \"login\" node. \n* Jobs > 30min will be killed. \n* You can continuously run short interactive jobs.\n\n\n## Rule 1\n\n::: {.callout-tip}\nIf you're doing work for school: run it on one of these machines. \n:::\n\n* Yes, there is overhead to push data over and pull results back.\n* But DRAC/Sockeye is much faster than your machine.\n* And this won't lock up your laptop for 4 hours while you run the job.\n* It's also a good experience.\n* You can log out and leave the job running. 
Just log back in to see if it's done (you should _always_ have some idea how long it will take)\n\n\n## Modules\n\n* Once you connect with `ssh`:\n\n* There are no Applications loaded.\n\n* You must tell the system what you want.\n\n* The command is `module load r` or `module load sas`\n\n* If you find yourself using the same [modules](https://docs.alliancecan.ca/wiki/Utiliser_des_modules/en) all the time:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nmodule load StdEnv/2023 r gurobi python # stuff I use\n\nmodule save my_modules # save loaded modules\n\nmodule restore my_modules # on login, load the usual set\n```\n:::\n\n\n\n## Running something interactively\n\n1. Login\n2. Load modules\n3. Request interactive compute\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nsalloc --time=1:0:0 --ntasks=1 --account=def-dajmcdon --mem-per-cpu=4096M\n# allocate 1 hour on 1 cpu with 4Gb RAM\n```\n:::\n\n\n* For the user `def-dajmcdon` (that's me, accounts start with `def-`)\n\nThen I would start R\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nr\n```\n:::\n\n\nAnd run whatever I want. If it takes more than an hour or needs more than 4GB of memory, it'll quit.\n\n\n\n## Interactive jobs\n\n* Once started they'll just go\n* You can do whatever else you want on your machine\n* But you can't kill the connection\n* So don't close your laptop and walk away\n* This is not typically the best use of this resource.\n* Better is likely [syzygy](http://syzygy.ca/).\n\nAlthough, syzygy has little memory and little storage, so it won't do intensive tasks \n\n> I think your home dir is limited to 1GB\n\n\n\n## Big memory jobs\n\n* Possible you can do this interactively, but discouraged\n\n\n\n::: {.callout-note}\n## Example\n\n* Neuroscience project\n* Dataset is about 10GB\n* Peak memory usage during analysis is about 24GB\n* Can't do this on my computer\n* Want to offload onto DRAC\n:::\n\n1. 
Write a `R` / `python` script that does the whole analysis and saves the output.\n\n2. You need to ask DRAC to run the script for you.\n\n\n## The scheduler\n\n* You *can* log in to DRAC and \"do stuff\"\n* But resources are limited.\n* There's a process that determines who gets resources when.\n* Technically the `salloc` command we used before requested some resources.\n* It may \"sit\" until the resources you want are available, but probably not long.\n* Anything else has to go through the scheduler.\n* DRAC uses the `slurm` scheduler\n\n\n## Example script\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash\n\n#SBATCH --account=def-dajmcdon\n#SBATCH --job-name=dlbcl-suffpcr\n#SBATCH --output=%x-%j.out\n#SBATCH --error=%x-%j.out\n#SBATCH --time=10:00:00\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=1\n#SBATCH --mem-per-cpu=32G\n\nRscript -e 'source(\"dlbcl-nocv.R\")'\n```\n:::\n\n\n* This asks for 10 hours of compute time with 32GB of memory\n* The `job-name` / `output` / `error` fields are for convenience. \n* If unspecified, I'll end up with files named things like `jobid60607-60650934.out`\n\n\n\n## Submitting and other useful commands\n\n* Suppose that `slurm` script is saved as `dlbcl-slurm.sh`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nsbatch dlbcl-slurm.sh # submit the job to the scheduler\n\nsqueue -u $USER # show status of your jobs ($USER is an env variable)\n\nscancel -u $USER # cancel all your jobs\n\nscancel -t PENDING -u $USER # cancel all your pending jobs\n```\n:::\n\n\n::: {.callout-important}\n1. Jobs inherit environment variables. So if you load modules, then submit, your modules are available to run.\n\n2. On Cedar, jobs cannot run from `~/`. It must be run from `~/scratch/` or `~/projects/`.\n:::\n\n# Really big jobs {background-color=\"#e98a15\" }\n\n\n## Types of jobs\n\n1. Big jobs (need lots of RAM)\n\n2. GPU jobs (you want deep learning, I don't know how to do this)\n\n3. 
Other jobs with *internal* parallelism (I almost never do this)\n\n4. [Embarrassingly parallel jobs (I do this all the time)]{.secondary}\n\n\n\n## Simple parallelization\n\n- Most of my major computing needs are \"embarrassingly parallel\"\n- I want to run a few algorithms on a bunch of different simulated datasets under different parameter configurations.\n- Perhaps run the algos on some real data too.\n- `R` has packages which are good for parallelization (`snow`, `snowfall`, `Rmpi`, `parallel`)\n- This is how I originally learned to do parallel computing. But these packages are not good for the cluster \n- They're fine for your machine, but we've already decided we're not going to do that anymore.\n\n\n\n## Example of the bad parallelism\n\n[Torque script]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash \n#PBS -l nodes=8:ppn=8,walltime=200:00:00\n#PBS -m abe\n#PBS -n ClusterPermute \n#PBS -j oe \nmpirun -np 64 -machinefile $PBS_NODEFILE R CMD BATCH ClusterPermute.R\n```\n:::\n\n\n* Torque is a different scheduler. UBC ARC Sockeye uses Torque. Looks much like Slurm.\n\n* Here, `ClusterPermute.R` uses `Rmpi` to do \"parallel `lapply`\"\n\n* So I asked for 8 processors on each of 8 nodes.\n\n## Example of the bad parallelism\n\n[Torque script]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash \n#PBS -l nodes=8:ppn=8,walltime=200:00:00\n#PBS -m abe\n#PBS -n ClusterPermute \n#PBS -j oe \nmpirun -np 64 -machinefile $PBS_NODEFILE R CMD BATCH ClusterPermute.R\n```\n:::\n\n\n\n[Problem]{.secondary}\n\n* The scheduler has to find 8 nodes with 8 available processors before this job will start. 
\n\n* This often takes a while, sometimes days.\n\n* But the jobs don't *need* those things to happen *at the same time*.\n\n\n## `{batchtools}`\n\n* Using `R` (or `python`) to parallelize is inefficient when there's a scheduler in the middle.\n* Better is to actually submit 64 different jobs each requiring 1 node\n* Then each can get out of the queue whenever a processor becomes available.\n* But that would seem to require writing 64 different `slurm` scripts\n\n- `{batchtools}` does this for you, all in `R`\n\n 1. It automates writing/submitting `slurm` / `torque` scripts.\n 2. It automatically stores output, and makes it easy to collect.\n 3. It generates lots of jobs.\n 4. All this from `R` directly.\n \n\nIt's easy to port across machines / schedulers.\n\nI can test parts (or even run) it on my machine without making changes for the cluster.\n\n\n## Setup `{batchtools}`\n\n1. Create a directory where all your jobs will live (in subdirectories). Mine is `~/`\n\n2. In that directory, you need a template file. Mine is `~/.batchtools.slurm.tmpl` (next slide)\n\n3. Create a configuration file which lives in your home directory. 
You must name it `~/.batchtools.conf.R`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# ~/.batchtools.conf.R\ncluster.functions <- makeClusterFunctionsSlurm()\n```\n:::\n\n\n\n\n## `~/.batchtools.slurm.tmpl`\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash\n\n## Job Resource Interface Definition\n##\n## ntasks [integer(1)]: Number of required tasks,\n## Set larger than 1 if you want to further parallelize\n## with MPI within your job.\n## ncpus [integer(1)]: Number of required cpus per task,\n## Set larger than 1 if you want to further parallelize\n## with multicore/parallel within each task.\n## walltime [integer(1)]: Walltime for this job, in seconds.\n## Must be at least 60 seconds for Slurm to work properly.\n## memory [integer(1)]: Memory in megabytes for each cpu.\n## Must be at least 100 (when I tried lower values my\n## jobs did not start at all).\n##\n## Default resources can be set in your .batchtools.conf.R by defining the variable\n## 'default.resources' as a named list.\n\n<%\n# relative paths are not handled well by Slurm\nlog.file = fs::path_expand(log.file)\n-%>\n\n#SBATCH --account=def-dajmcdon\n#SBATCH --mail-user=daniel@stat.ubc.ca\n#SBATCH --mail-type=ALL\n#SBATCH --job-name=<%= job.name %>\n#SBATCH --output=<%= log.file %>\n#SBATCH --error=<%= log.file %>\n#SBATCH --time=<%= resources$walltime %>\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=<%= resources$ncpus %>\n#SBATCH --mem-per-cpu=<%= resources$memory %>\n<%= if (array.jobs) sprintf(\"#SBATCH --array=1-%i\", nrow(jobs)) else \"\" %>\n\n## Run R:\n## we merge R output with stdout from SLURM, which gets then logged via --output option\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n```\n:::\n\n\n. . 
.\n\nWhen I'm ready to run, I'll call something like:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbatchtools::submitJobs(\n job.ids, \n resources = list(ncpus=1, walltime=\"24:00:00\", memory=\"32G\")\n)\n```\n:::\n\n\n\n\n## Workflow\n\n[See the vignette:]{.secondary} `vignette(\"batchtools\")`\n\nor the \n\n[website](https://mllg.github.io/batchtools/articles/batchtools.html)\n\n1. Create a folder to hold your code. Mine usually contains 2 files, one to set up/run the experiment, one to collect results. Code needed to run the experiment lives in an `R` package.\n\n2. Write a script to setup the experiment and submit.\n\n3. Wait.\n\n4. Collect your results. Copy back to your machine etc.\n\n\n\n# Do it {background-color=\"#e98a15\"}\n\n\n## Example 1: Use genetics data to predict viral load\n\n* An \"extra\" example in a methods paper to appease reviewers\n* Method is: \n \n 1. apply a special version of PCA to a big (wide) data set\n 1. Do OLS using the top few PCs\n \n* This is \"principle components regression\" with sparse principle components.\n* Got 413 COVID patients, measure \"viral load\" and gene expression\n* 9435 differentially expressed genes.\n* The method needs to form a 10K x 10K matrix multiple times and do an approximate SVD. Requires 32GB memory. Compute time is ~6 hours.\n* Two tuning parameters: $\\lambda$ and number of PCs\n* Want to do CV to choose, and then use those on the whole data, describe selected genes.\n\n\n## Example 1: Use genetics data to predict viral load\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(batchtools)\n\nreg <- makeExperimentRegistry(\"spcr-genes\", packages = c(\"tidyverse\", \"suffpcr\"))\nx <- readRDS(here::here(\"suffpcr-covid\", \"covid_x.rds\"))\ny <- readRDS(here::here(\"suffpcr-covid\", \"covid_y.rds\"))\n\nsubsample = function(data, job, ratio, ...) 
{\n n <- nrow(data$x)\n train <- sample(n, floor(n * ratio))\n test <- setdiff(seq_len(n), train)\n list(test = test, train = train)\n}\n\naddProblem(\"cv\", data = list(x = x, y = y), fun = subsample)\naddProblem(\"full\", data = list(x = x, y = y))\n\naddAlgorithm(\n \"spcr_cv\",\n fun = function(job, data, instance, ...) { # args are required\n fit <- suffpcr(\n data$x[instance$train, ], data$y[instance$train], \n lambda_min = 0, lambda_max = 1, ...\n )\n valid_err <- colMeans(\n (\n data$y[instance$test] - \n as.matrix(predict(fit, newdata = data$x[instance$test, ]))\n )^2,\n na.rm = TRUE\n )\n return(list(fit = fit, valid_err = valid_err))\n }\n)\n\naddAlgorithm(\n \"spcr_full\",\n fun = function(job, data, instance, ...) {\n suffpcr(data$x, data$y, lambda_max = 1, lambda_min = 0, ...)\n }\n)\n\n## Experimental design\npdes_cv <- list(cv = data.frame(ratio = .75))\npdes_full <- list(full = data.frame())\nades_cv <- list(spcr_cv = data.frame(d = c(3, 5, 15)))\nades_full <- list(spcr_full = data.frame(d = c(3, 5, 15)))\n\naddExperiments(pdes_cv, ades_cv, repls = 5L)\naddExperiments(pdes_full, ades_full)\n\nsubmitJobs(\n findJobs(), \n resources = list(ncpus = 1, walltime = \"8:00:00\", memory = \"32G\")\n)\n```\n:::\n\n\n\nEnd up with 18 jobs. 
\n\n\n## Example 2: Predicting future COVID cases\n\n* Take a few _very simple_ models and demonstrate that some choices make a big difference in accuracy.\n\n* At each time $t$, download COVID cases as observed on day $t$ for a bunch of locations\n\n* Estimate a few different models for predicting days $t+1,\\ldots,t+k$\n\n* Store point and interval forecasts.\n\n* Do this for $t$ every week over a year.\n\n\n## Example 2: Predicting future COVID cases\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfcasters <- list.files(here::here(\"code\", \"forecasters\"))\nfor (fcaster in fcasters) source(here::here(\"code\", \"forecasters\", fcaster))\nregistry_path <- here::here(\"data\", \"forecast-experiments\")\nsource(here::here(\"code\", \"common-pars.R\"))\n\n# Setup the data ----------------------------------------------------\nreg <- makeExperimentRegistry(\n registry_path,\n packages = c(\"tidyverse\", \"covidcast\"),\n source = c(\n here::here(\"code\", \"forecasters\", fcasters), \n here::here(\"code\", \"common-pars.R\")\n )\n)\n\ngrab_data <- function(data, job, forecast_date, ...) {\n dat <- covidcast_signals(\n data_sources, signals, as_of = forecast_date, \n geo_type = geo_type, start_day = \"2020-04-15\") %>% \n aggregate_signals(format = \"wide\") \n names(dat)[3:5] <- c(\"value\", \"num\", \"covariate\") # assumes 2 signals\n dat %>% \n filter(!(geo_value %in% drop_geos)) %>% \n group_by(geo_value) %>% \n arrange(time_value)\n}\naddProblem(\"covidcast_proper\", fun = grab_data, cache = TRUE)\n\n# Algorithm wrappers -----------------------------------------------------\nbaseline <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, value) %>% \n group_modify(prob_baseline, ...)\n}\nar <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n group_modify(prob_ar, ...)\n}\nqar <- function(data, job, instance, ...) 
{\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n group_modify(quant_ar, ...)\n}\ngam <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>%\n group_modify(safe_prob_gam_ar, ...)\n}\nar_cov <- function(data, job, instance, ...) {\n instance %>% \n group_modify(prob_ar_cov, ...)\n}\njoint <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n joint_ar(...)\n}\ncorrected_ar <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, num) %>% \n rename(value = num) %>% \n corrections_single_signal(cparams) %>% \n group_modify(prob_ar, ...)\n}\n\naddAlgorithm(\"baseline\", baseline)\naddAlgorithm(\"ar\", ar)\naddAlgorithm(\"qar\", qar)\naddAlgorithm(\"gam\", gam)\naddAlgorithm(\"ar_cov\", ar_cov)\naddAlgorithm(\"joint_ar\", joint)\naddAlgorithm(\"corrections\", corrected_ar)\n\n# Experimental design -----------------------------------------------------\nproblem_design <- list(covidcast_proper = data.frame(forecast_date = forecast_dates))\nalgorithm_design <- list(\n baseline = CJ(\n train_window = train_windows, min_train_window = min(train_windows), ahead = aheads\n ),\n ar = CJ(\n train_window = train_windows, min_train_window = min(train_windows), \n lags = lags_list, ahead = aheads\n ),\n qar = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = lags_list, ahead = aheads\n ),\n gam = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = lags_list, ahead = aheads, df = gam_df\n ),\n ar_cov = CJ(\n train_window = train_windows, min_train_window = min(train_windows), \n lags = lags_list, ahead = aheads\n ),\n joint_ar = CJ(\n train_window = joint_train_windows, min_train_window = min(joint_train_windows), \n lags = lags_list, ahead = aheads\n ),\n corrections = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = 
lags_list, ahead = aheads\n )\n)\n\naddExperiments(problem_design, algorithm_design)\nids <- unwrap(getJobPars()) %>% \n select(job.id, forecast_date) %>% \n mutate(chunk = as.integer(as.factor(forecast_date))) %>% \n select(-forecast_date)\n\n## ~13000 jobs, we don't want to submit that many since they run fast\n## Chunk them into groups by forecast_date (to download once for the group)\n## Results in 68 chunks\n\nsubmitJobs(\n ids, \n resources = list(ncpus = 1, walltime = \"4:00:00\", memory = \"16G\")\n)\n```\n:::\n\n\n## Takeaways\n\n::: flex\n::: w-50\n\n### Benefits of this workflow:\n\n* Don't lock up your computer\n* Stuff runs much faster\n* Can easily scale up to many jobs\n* Logs are stored for debugging\n* Forces you to think about the [design]{.secondary}\n* No overhead to store results\n* Easy to add more experiments later, adjust parameters, etc.\n\n:::\n\n::: w-50\n### Costs:\n\n* I only know how to do *this* in `R`\n* Overhead of moving between machines\n* Some headaches to understand the syntax\n\n:::\n:::", + "markdown": "---\nlecture: \"Cluster computing (at UBC)\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ 
#2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n\renewcommand{\hat}{\widehat}\n$$\n\n\n\n\n\n## UBC HPC\n\n### 3 potentially useful systems:\n\n1. Department VM\n1. [UBC ARC Sockeye](https://arc.ubc.ca/ubc-arc-sockeye)\n1. [Digital Research Alliance of Canada](https://docs.alliancecan.ca/wiki/Technical_documentation)\n\n\nI've only used 1 and 3. I mainly use 3.\n\n\n### Accessing\n\nAs far as I know, access for students requires \"faculty\" support\n\n1. Email The/Binh. \n1. It's possible you can get access without a faculty PI.\n1. Email your advisor to ask for an account.\n\n\n### The rest of this will focus on 3.\n\n\n\n## Prerequisites\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n1. Command line interface (Terminal on Mac)\n\n2. (optional) An FTP client is helpful (e.g., Cyberduck)\n\n3. [Globus Connect](https://www.globus.org/globus-connect-personal). File transfer approved by DRAC\n:::\n\n::: {.column width=\"50%\"}\nUseful CL commands\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\ncd ~/path/to/directory\n\ncp file/to/copy.txt copied/to/copy1.txt\n\nrm file/to/delete.txt\n\nrm -r dir/to/delete/\n\nls -a # list all files\n```\n:::\n\n:::\n::::\n\n\n\n## How to connect\n\nLog in to a system:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nssh dajmcdon@cedar.alliancecan.ca\n```\n:::\n\n\n* Upon login, you're on a \"head\" or \"login\" node. \n* Jobs > 30min will be killed. \n* You can continuously run short interactive jobs.\n\n\n## Rule 1\n\n::: {.callout-tip}\nIf you're doing work for school: run it on one of these machines. 
\n:::\n\n* Yes, there is overhead to push data over and pull results back.\n* But DRAC/Sockeye is much faster than your machine.\n* And this won't lock up your laptop for 4 hours while you run the job.\n* It's also a good experience.\n* You can log out and leave the job running. Just log back in to see if it's done (you should _always_ have some idea how long it will take)\n\n\n## Modules\n\n* Once you connect with `ssh`:\n\n* There are no applications loaded.\n\n* You must tell the system what you want.\n\n* The command is `module load r` or `module load sas`\n\n* If you find yourself using the same [modules](https://docs.alliancecan.ca/wiki/Utiliser_des_modules/en) all the time:\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nmodule load StdEnv/2023 r gurobi python # stuff I use\n\nmodule save my_modules # save loaded modules\n\nmodule restore my_modules # on login, load the usual set\n```\n:::\n\n\n\n## Running something interactively\n\n1. Login\n2. Load modules\n3. Request interactive compute\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nsalloc --time=1:0:0 --ntasks=1 --account=def-dajmcdon --mem-per-cpu=4096M\n# allocate 1 hour on 1 cpu with 4GB RAM\n```\n:::\n\n\n* For the account `def-dajmcdon` (that's me; account names start with `def-`)\n\nThen I would start R\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nR\n```\n:::\n\n\nAnd run whatever I want. 
If it takes more than an hour or needs more than 4GB of memory, it'll quit.\n\n\n\n## Interactive jobs\n\n* Once started, they'll just go\n* You can do whatever else you want on your machine\n* But you can't kill the connection\n* So don't close your laptop and walk away\n* This is not typically the best use of this resource.\n* Better is likely [syzygy](http://syzygy.ca/).\n\nHowever, syzygy has little memory and little storage, so it won't handle intensive tasks\n\n> I think your home dir is limited to 1GB\n\n\n\n## Big memory jobs\n\n* It's possible to do this interactively, but it's discouraged\n\n\n\n::: {.callout-note}\n## Example\n\n* Neuroscience project\n* Dataset is about 10GB\n* Peak memory usage during analysis is about 24GB\n* Can't do this on my computer\n* Want to offload onto DRAC\n:::\n\n1. Write an `R` / `python` script that does the whole analysis and saves the output.\n\n2. You need to ask DRAC to run the script for you.\n\n\n## The scheduler\n\n* You *can* log in to DRAC and \"do stuff\"\n* But resources are limited.\n* There's a process that determines who gets resources when.\n* Technically the `salloc` command we used before requested some resources.\n* It may \"sit\" until the resources you want are available, but probably not long.\n* Anything else has to go through the scheduler.\n* DRAC uses the `slurm` scheduler\n\n\n## Example script\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash\n\n#SBATCH --account=def-dajmcdon\n#SBATCH --job-name=dlbcl-suffpcr\n#SBATCH --output=%x-%j.out\n#SBATCH --error=%x-%j.out\n#SBATCH --time=10:00:00\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=1\n#SBATCH --mem-per-cpu=32G\n\nRscript -e 'source(\"dlbcl-nocv.R\")'\n```\n:::\n\n\n* This asks for 10 hours of compute time with 32GB of memory\n* The `job-name` / `output` / `error` fields are for convenience. 
\n* If unspecified, I'll end up with files named things like `jobid60607-60650934.out`\n\n\n\n## Submitting and other useful commands\n\n* Suppose that `slurm` script is saved as `dlbcl-slurm.sh`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\nsbatch dlbcl-slurm.sh # submit the job to the scheduler\n\nsqueue -u $USER # show status of your jobs ($USER is an env variable)\n\nscancel -u $USER # cancel all your jobs\n\nscancel -t PENDING -u $USER # cancel all your pending jobs\n```\n:::\n\n\n::: {.callout-important}\n1. Jobs inherit environment variables. So if you load modules, then submit, your modules are available to run.\n\n2. On Cedar, jobs cannot run from `~/`. They must be run from `~/scratch/` or `~/projects/`.\n:::\n\n# Really big jobs {background-color=\"#e98a15\" }\n\n\n## Types of jobs\n\n1. Big jobs (need lots of RAM)\n\n2. GPU jobs (you want deep learning, I don't know how to do this)\n\n3. Other jobs with *internal* parallelism (I almost never do this)\n\n4. [Embarrassingly parallel jobs (I do this all the time)]{.secondary}\n\n\n\n## Simple parallelization\n\n- Most of my major computing needs are \"embarrassingly parallel\"\n- I want to run a few algorithms on a bunch of different simulated datasets under different parameter configurations.\n- Perhaps run the algos on some real data too.\n- `R` has packages which are good for parallelization (`snow`, `snowfall`, `Rmpi`, `parallel`)\n- This is how I originally learned to do parallel computing. 
But these packages are not good for the cluster \n- They're fine for your machine, but we've already decided we're not going to do that anymore.\n\n\n\n## Example of the bad parallelism\n\n[Torque script]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash \n#PBS -l nodes=8:ppn=8,walltime=200:00:00\n#PBS -m abe\n#PBS -n ClusterPermute \n#PBS -j oe \nmpirun -np 64 -machinefile $PBS_NODEFILE R CMD BATCH ClusterPermute.R\n```\n:::\n\n\n* Torque is a different scheduler. UBC ARC Sockeye uses Torque. Looks much like Slurm.\n\n* Here, `ClusterPermute.R` uses `Rmpi` to do \"parallel `lapply`\"\n\n* So I asked for 8 processors on each of 8 nodes.\n\n## Example of the bad parallelism\n\n[Torque script]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash \n#PBS -l nodes=8:ppn=8,walltime=200:00:00\n#PBS -m abe\n#PBS -n ClusterPermute \n#PBS -j oe \nmpirun -np 64 -machinefile $PBS_NODEFILE R CMD BATCH ClusterPermute.R\n```\n:::\n\n\n\n[Problem]{.secondary}\n\n* The scheduler has to find 8 nodes with 8 available processors before this job will start. \n\n* This often takes a while, sometimes days.\n\n* But the jobs don't *need* those things to happen *at the same time*.\n\n\n## `{batchtools}`\n\n* Using `R` (or `python`) to parallelize is inefficient when there's a scheduler in the middle.\n* Better is to actually submit 64 different jobs each requiring 1 node\n* Then each can get out of the queue whenever a processor becomes available.\n* But that would seem to require writing 64 different `slurm` scripts\n\n- `{batchtools}` does this for you, all in `R`\n\n 1. It automates writing/submitting `slurm` / `torque` scripts.\n 2. It automatically stores output, and makes it easy to collect.\n 3. It generates lots of jobs.\n 4. 
All this from `R` directly.\n \n\nIt's easy to port across machines / schedulers.\n\nI can test parts (or even run) it on my machine without making changes for the cluster.\n\n\n## Setup `{batchtools}`\n\n1. Create a directory where all your jobs will live (in subdirectories). Mine is `~/`\n\n2. In that directory, you need a template file. Mine is `~/.batchtools.slurm.tmpl` (next slide)\n\n3. Create a configuration file which lives in your home directory. You must name it `~/.batchtools.conf.R`.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# ~/.batchtools.conf.R\ncluster.functions <- makeClusterFunctionsSlurm()\n```\n:::\n\n\n\n\n## `~/.batchtools.slurm.tmpl`\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.bash .cell-code}\n#!/bin/bash\n\n## Job Resource Interface Definition\n##\n## ntasks [integer(1)]: Number of required tasks,\n## Set larger than 1 if you want to further parallelize\n## with MPI within your job.\n## ncpus [integer(1)]: Number of required cpus per task,\n## Set larger than 1 if you want to further parallelize\n## with multicore/parallel within each task.\n## walltime [integer(1)]: Walltime for this job, in seconds.\n## Must be at least 60 seconds for Slurm to work properly.\n## memory [integer(1)]: Memory in megabytes for each cpu.\n## Must be at least 100 (when I tried lower values my\n## jobs did not start at all).\n##\n## Default resources can be set in your .batchtools.conf.R by defining the variable\n## 'default.resources' as a named list.\n\n<%\n# relative paths are not handled well by Slurm\nlog.file = fs::path_expand(log.file)\n-%>\n\n#SBATCH --account=def-dajmcdon\n#SBATCH --mail-user=daniel@stat.ubc.ca\n#SBATCH --mail-type=ALL\n#SBATCH --job-name=<%= job.name %>\n#SBATCH --output=<%= log.file %>\n#SBATCH --error=<%= log.file %>\n#SBATCH --time=<%= resources$walltime %>\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=<%= resources$ncpus %>\n#SBATCH --mem-per-cpu=<%= resources$memory %>\n<%= if (array.jobs) 
sprintf(\"#SBATCH --array=1-%i\", nrow(jobs)) else \"\" %>\n\n## Run R:\n## we merge R output with stdout from SLURM, which then gets logged via --output option\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n```\n:::\n\n\n. . .\n\nWhen I'm ready to run, I'll call something like:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbatchtools::submitJobs(\n job.ids, \n resources = list(ncpus=1, walltime=\"24:00:00\", memory=\"32G\")\n)\n```\n:::\n\n\n\n\n## Workflow\n\n[See the vignette:]{.secondary} `vignette(\"batchtools\")`\n\nor the \n\n[website](https://mllg.github.io/batchtools/articles/batchtools.html)\n\n1. Create a folder to hold your code. Mine usually contains 2 files, one to set up/run the experiment, one to collect results. Code needed to run the experiment lives in an `R` package.\n\n2. Write a script to set up the experiment and submit.\n\n3. Wait.\n\n4. Collect your results. Copy back to your machine etc.\n\n\n\n# Do it {background-color=\"#e98a15\"}\n\n\n## Example 1: Use genetics data to predict viral load\n\n* An \"extra\" example in a methods paper to appease reviewers\n* Method is: \n \n 1. apply a special version of PCA to a big (wide) data set\n 1. Do OLS using the top few PCs\n \n* This is \"principal components regression\" with sparse principal components.\n* Got 413 COVID patients, measured \"viral load\" and gene expression\n* 9435 differentially expressed genes.\n* The method needs to form a 10K x 10K matrix multiple times and do an approximate SVD. Requires 32GB memory. 
Compute time is ~6 hours.\n* Two tuning parameters: $\\lambda$ and number of PCs\n* Want to do CV to choose, and then use those on the whole data, describe selected genes.\n\n\n## Example 1: Use genetics data to predict viral load\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(batchtools)\n\nreg <- makeExperimentRegistry(\"spcr-genes\", packages = c(\"tidyverse\", \"suffpcr\"))\nx <- readRDS(here::here(\"suffpcr-covid\", \"covid_x.rds\"))\ny <- readRDS(here::here(\"suffpcr-covid\", \"covid_y.rds\"))\n\nsubsample = function(data, job, ratio, ...) {\n n <- nrow(data$x)\n train <- sample(n, floor(n * ratio))\n test <- setdiff(seq_len(n), train)\n list(test = test, train = train)\n}\n\naddProblem(\"cv\", data = list(x = x, y = y), fun = subsample)\naddProblem(\"full\", data = list(x = x, y = y))\n\naddAlgorithm(\n \"spcr_cv\",\n fun = function(job, data, instance, ...) { # args are required\n fit <- suffpcr(\n data$x[instance$train, ], data$y[instance$train], \n lambda_min = 0, lambda_max = 1, ...\n )\n valid_err <- colMeans(\n (\n data$y[instance$test] - \n as.matrix(predict(fit, newdata = data$x[instance$test, ]))\n )^2,\n na.rm = TRUE\n )\n return(list(fit = fit, valid_err = valid_err))\n }\n)\n\naddAlgorithm(\n \"spcr_full\",\n fun = function(job, data, instance, ...) {\n suffpcr(data$x, data$y, lambda_max = 1, lambda_min = 0, ...)\n }\n)\n\n## Experimental design\npdes_cv <- list(cv = data.frame(ratio = .75))\npdes_full <- list(full = data.frame())\nades_cv <- list(spcr_cv = data.frame(d = c(3, 5, 15)))\nades_full <- list(spcr_full = data.frame(d = c(3, 5, 15)))\n\naddExperiments(pdes_cv, ades_cv, repls = 5L)\naddExperiments(pdes_full, ades_full)\n\nsubmitJobs(\n findJobs(), \n resources = list(ncpus = 1, walltime = \"8:00:00\", memory = \"32G\")\n)\n```\n:::\n\n\n\nEnd up with 18 jobs. 
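\n\nAfter submitting, I can monitor the jobs and (once they finish) pull everything back, all from `R`. A sketch using `{batchtools}` helpers -- exactly how you reduce depends on what each job returns; here I assume the `list(fit, valid_err)` saved by `spcr_cv` above:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngetStatus() # counts of queued / running / done / errored jobs\nfindErrors() # ids of jobs that threw an error (logs are stored for debugging)\n\n# once everything is done, collect the CV validation errors\nerrs <- reduceResultsList(findDone(), function(res) res$valid_err)\n```\n:::\n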
\n\n\n## Example 2: Predicting future COVID cases\n\n* Take a few _very simple_ models and demonstrate that some choices make a big difference in accuracy.\n\n* At each time $t$, download COVID cases as observed on day $t$ for a bunch of locations\n\n* Estimate a few different models for predicting days $t+1,\\ldots,t+k$\n\n* Store point and interval forecasts.\n\n* Do this for $t$ every week over a year.\n\n\n## Example 2: Predicting future COVID cases\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfcasters <- list.files(here::here(\"code\", \"forecasters\"))\nfor (fcaster in fcasters) source(here::here(\"code\", \"forecasters\", fcaster))\nregistry_path <- here::here(\"data\", \"forecast-experiments\")\nsource(here::here(\"code\", \"common-pars.R\"))\n\n# Setup the data ----------------------------------------------------\nreg <- makeExperimentRegistry(\n registry_path,\n packages = c(\"tidyverse\", \"covidcast\"),\n source = c(\n here::here(\"code\", \"forecasters\", fcasters), \n here::here(\"code\", \"common-pars.R\")\n )\n)\n\ngrab_data <- function(data, job, forecast_date, ...) {\n dat <- covidcast_signals(\n data_sources, signals, as_of = forecast_date, \n geo_type = geo_type, start_day = \"2020-04-15\") %>% \n aggregate_signals(format = \"wide\") \n names(dat)[3:5] <- c(\"value\", \"num\", \"covariate\") # assumes 2 signals\n dat %>% \n filter(!(geo_value %in% drop_geos)) %>% \n group_by(geo_value) %>% \n arrange(time_value)\n}\naddProblem(\"covidcast_proper\", fun = grab_data, cache = TRUE)\n\n# Algorithm wrappers -----------------------------------------------------\nbaseline <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, value) %>% \n group_modify(prob_baseline, ...)\n}\nar <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n group_modify(prob_ar, ...)\n}\nqar <- function(data, job, instance, ...) 
{\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n group_modify(quant_ar, ...)\n}\ngam <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>%\n group_modify(safe_prob_gam_ar, ...)\n}\nar_cov <- function(data, job, instance, ...) {\n instance %>% \n group_modify(prob_ar_cov, ...)\n}\njoint <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, value) %>% \n joint_ar(...)\n}\ncorrected_ar <- function(data, job, instance, ...) {\n instance %>% \n dplyr::select(geo_value, time_value, num) %>% \n rename(value = num) %>% \n corrections_single_signal(cparams) %>% \n group_modify(prob_ar, ...)\n}\n\naddAlgorithm(\"baseline\", baseline)\naddAlgorithm(\"ar\", ar)\naddAlgorithm(\"qar\", qar)\naddAlgorithm(\"gam\", gam)\naddAlgorithm(\"ar_cov\", ar_cov)\naddAlgorithm(\"joint_ar\", joint)\naddAlgorithm(\"corrections\", corrected_ar)\n\n# Experimental design -----------------------------------------------------\nproblem_design <- list(covidcast_proper = data.frame(forecast_date = forecast_dates))\nalgorithm_design <- list(\n baseline = CJ(\n train_window = train_windows, min_train_window = min(train_windows), ahead = aheads\n ),\n ar = CJ(\n train_window = train_windows, min_train_window = min(train_windows), \n lags = lags_list, ahead = aheads\n ),\n qar = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = lags_list, ahead = aheads\n ),\n gam = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = lags_list, ahead = aheads, df = gam_df\n ),\n ar_cov = CJ(\n train_window = train_windows, min_train_window = min(train_windows), \n lags = lags_list, ahead = aheads\n ),\n joint_ar = CJ(\n train_window = joint_train_windows, min_train_window = min(joint_train_windows), \n lags = lags_list, ahead = aheads\n ),\n corrections = CJ(\n train_window = train_windows, min_train_window = min(train_windows),\n lags = 
lags_list, ahead = aheads\n )\n)\n\naddExperiments(problem_design, algorithm_design)\nids <- unwrap(getJobPars()) %>% \n select(job.id, forecast_date) %>% \n mutate(chunk = as.integer(as.factor(forecast_date))) %>% \n select(-forecast_date)\n\n## ~13000 jobs, we don't want to submit that many since they run fast\n## Chunk them into groups by forecast_date (to download once for the group)\n## Results in 68 chunks\n\nsubmitJobs(\n ids, \n resources = list(ncpus = 1, walltime = \"4:00:00\", memory = \"16G\")\n)\n```\n:::\n\n\n## Takeaways\n\n::: flex\n::: w-50\n\n### Benefits of this workflow:\n\n* Don't lock up your computer\n* Stuff runs much faster\n* Can easily scale up to many jobs\n* Logs are stored for debugging\n* Forces you to think about the [design]{.secondary}\n* No overhead to store results\n* Easy to add more experiments later, adjust parameters, etc.\n\n:::\n\n::: w-50\n### Costs:\n\n* I only know how to do *this* in `R`\n* Overhead of moving between machines\n* Some headaches to understand the syntax\n\n:::\n:::", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/git/execute-results/html.json b/_freeze/schedule/slides/git/execute-results/html.json index 6473cab..9844ea1 100644 --- a/_freeze/schedule/slides/git/execute-results/html.json +++ b/_freeze/schedule/slides/git/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "a694de2b18ced9e1000f78652f8de195", "result": { - "markdown": "---\nlecture: \"Version control\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 
2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Why version control?\n\n\n![](http://www.phdcomics.com/comics/archive/phd101212s.gif){fig-align=\"center\"}\n\n\n[Much of this lecture is based on material from Colin Rundel and Karl Broman]{.smallest}\n\n\n## Why version control?\n\n* Simple formal system for tracking all changes to a project\n* Time machine for your projects\n + Track blame and/or praise\n + Remove the fear of breaking things\n* Learning curve is steep, but when you need it you [REALLY]{.secondary} need it\n\n::: {.callout-tip icon=false}\n## Words of wisdom\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n\n-- _Paul Wilson_\n:::\n\n\n## Why Git\n\n::: flex\n::: w-60\n* You could use something like Box or Dropbox\n* These are poor-man's version control\n* Git is much more appropriate\n* It works with large groups\n* It's very fast\n* It's [much]{.secondary} better at fixing mistakes\n* Tech companies use it (so it's in your interest to have some experience)\n:::\n\n::: 
w-40\n![](https://imgs.xkcd.com/comics/git.png){fig-align=\"center\"}\n:::\n:::\n\n. . .\n\n::: {.callout-important appearance=\"simple\"}\nThis will hurt, but what doesn't kill you makes you stronger.\n:::\n\n## Overview\n\n* `git` is a command line program that lives on your machine\n* If you want to track changes in a directory, you type `git init`\n* This creates a (hidden) directory called `.git`\n* The `.git` directory contains a history of all changes made to \"versioned\" files\n* This top directory is referred to as a \"repository\" or \"repo\"\n* GitHub is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders `.ipynb` & `.md`\n* Some IDEs (pycharm, RStudio, VScode) have built-in `git`\n* `git`/GitHub is broad and complicated. Here, just what you [need]{.secondary}\n\n## Aside on \"Built-in\" & \"Command line\" {background-color=\"#97D4E9\"}\n\n:::{.callout-tip}\nFirst things first: RStudio and the Terminal\n:::\n\n\n* Command line is the \"old\" type of computing. You type commands at a prompt and the computer \"does stuff\". \n\n* You may not have seen where this is. RStudio has one built in called \"Terminal\"\n\n* The Mac System version is also called \"Terminal\". If you have a Linux machine, this should all be familiar.\n\n* Windows is not great at this.\n\n* To get the most out of Git, you have to use the command line.\n\n\n## Typical workflow {.smaller}\n\n\n1. Download a repo from GitHub\n```{.bash}\ngit clone https://github.com/stat550-2021/lecture-slides.git\n```\n2. Create a **branch**\n```{.bash}\ngit branch <branchname>\n```\n3. Make changes to your files.\n4. Add your changes to be tracked (\"stage\" them)\n```{.bash}\ngit add <filenames>\n```\n5. Commit your changes\n```{.bash}\ngit commit -m \"Some explanatory message\"\n```\n\n**Repeat 3--5 as needed. 
Once you're satisfied**\n\n* Push to GitHub\n```{.bash}\ngit push\ngit push -u origin <branchname>\n```\n\n---\n\n![](gfx/git-clone.png){fig-align=\"center\"}\n\n\n## What should be tracked?\n\n
\n\nDefinitely\n: code, markdown documentation, tex files, bash scripts/makefiles, ...\n\n
\n\nPossibly\n: logs, jupyter notebooks, images (that won’t change), ...\n\n
\n\nQuestionable\n: processed data, static pdfs, ...\n\n
\n\nDefinitely not\n: full data, continually updated pdfs, other things compiled from source code, ...\n\n\n\n## What things to track\n\n* You decide what is \"versioned\". \n\n* A file called `.gitignore` tells `git` files or types to never track\n\n```{.bash}\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n```\n\n* Shortcut to track everything (use carefully):\n\n```{.bash}\ngit add .\n```\n\n\n## What's a PR?\n\n* This exists on GitHub (not git)\n* Demonstration\n\n\n::: {.r-stack}\n![](gfx/pr1.png){.fragment height=\"550\"}\n\n![](gfx/pr2.png){.fragment height=\"550\"}\n:::\n\n## Some things to be aware of\n\n* `master` vs `main`\n* If you think you did something wrong, stop and ask for help\n* The hardest part is the initial setup. Then, this should all be rinse-and-repeat.\n* This book is great: [Happy Git with R](https://happygitwithr.com)\n 1. See Chapter 6 if you have install problems.\n 1. See Chapter 9 for credential caching (avoid typing a password all the time)\n 1. See Chapter 13 if RStudio can't find `git`\n \n## The `main/develop/branch` workflow\n\n* When working on your own\n 1. Don't NEED branches (but you should use them, really)\n 1. I make a branch if I want to try a modification without breaking what I have.\n \n \n* When working on a large team with production grade software\n 1. `main` is protected, released version of software (maybe renamed to `release`)\n 1. `develop` contains things not yet on `main`, but thoroughly tested\n 1. On a schedule (once a week, once a month) `develop` gets merged to `main`\n 1. You work on a `feature` branch off `develop` to build your new feature\n 1. You do a PR against `develop`. Supervisors review your contributions\n \n. . 
.\n\nI use this workflow with my lab, as do many DS/CS/Stat faculty.\n\n## Protection\n\n* Typical for your PR to trigger tests to make sure you don't break things\n\n* Typical for team members or supervisors to review your PR for compliance\n\n::: {.callout-tip}\nI suggest (require?) you adopt the \"production\" version for your HW 2\n:::\n\n\n## Operations in RStudio \n\n::: flex\n::: w-50\n\n1. Stage\n1. Commit\n1. Push\n1. Pull\n1. Create a branch\n\n\n\n[Covers:]{.secondary}\n\n* Everything to do your HW / Project if you're careful\n* Plus most other things you \"want to do\"\n\n:::\n\n::: w-50\n\n\nCommand line versions (of the same)\n\n```{.bash}\ngit add <filenames>\n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b <branchname>\n```\n\n:::\n:::\n\n\n## Other useful stuff (but command line only) {.smaller}\n\n::: flex\n::: w-50\nInitializing\n```{.bash}\ngit config --global user.name \"Daniel J. McDonald\"\ngit config --global user.email \"daniel@stat.ubc.ca\"\ngit config --global core.editor nano \n# or emacs or ... (default is vim)\n```\n\n\nStaging\n```{.bash}\ngit add name/of/file # stage 1 file\ngit add . 
# stage all\n```\n\nCommitting\n```{.bash}\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \n```\n\nPushing\n```{.bash}\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n```\n:::\n\n::: w-50\nBranching\n```{.bash}\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\n```\n\nCheck the status\n```{.bash}\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs\n```\n:::\n:::\n\n## Commit messages {.smaller}\n\n::: {.callout-tip appearance=\"simple\"}\n1. Write meaningful messages. Not `fixed stuff` or `oops? maybe done?`\n1. These appear in the log and help you determine what you've done.\n1. Think _imperative mood_: \"add cross validation to simulation\"\n1. 
Best to have each commit \"do one thing\"\n:::\n\n[Conventions:]{.secondary} (see [here](https://www.conventionalcommits.org/en/v1.0.0/) for details)\n\n* feat: – a new feature is introduced with the changes\n* fix: – a bug fix has occurred\n* chore: – changes that do not relate to a fix or feature (e.g., updating dependencies)\n* refactor: – refactored code that neither fixes a bug nor adds a feature\n* docs: – updates to documentation such as the README or other markdown files\n* style: – changes that do not affect the function of the code\n* test: – adding new tests or correcting previous tests\n* perf: – performance improvements\n* ci: – continuous integration related\n\n```{.bash}\ngit commit -m \"feat: add cross validation to simulation, closes #251\"\n```\n\n## Conflicts\n\n* Sometimes you merge things and \"conflicts\" happen.\n\n* This means that changes on one branch would overwrite changes on a different branch.\n\n::: flex\n::: w-50\n\n[They look like this:]{.secondary}\n\n```\nHere are lines that are either unchanged\nfrom the common ancestor, or cleanly\nresolved because only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n```\n\n:::\n\n::: w-50\n\n[You decide what to keep]{.secondary}\n\n1. Your changes (above `=======`)\n2. Their changes (below `=======`)\n3. Both.\n4. 
Neither.\n\nAlways delete the `<<<<<<<`, `=======`, and `>>>>>>>` lines.\n\nOnce you're satisfied, commit to resolve the conflict.\n\n:::\n:::\n\n## Some other pointers\n\n* Commits have long names: `32b252c854c45d2f8dfda1076078eae8d5d7c81f`\n * If you want to use it, you need \"enough to be unique\": `32b25`\n\n* Online help uses directed graphs in ways different from statistics:\n * In stats, arrows point from cause to effect, forward in time\n * In `git` docs, it's reversed: they point to the thing on which they depend\n \n \n### Cheat sheet\n\n\n\n\n## How to undo in 3 scenarios\n\n* Suppose we're concerned about a file named `README.md`\n* Often, `git status` will give some of these as suggestions\n\n::: flex\n::: w-50\n\n[1. Saved but not staged]{.secondary}\n\nIn RStudio, select the file, click the gear menu, then select Revert...\n```{.bash}\n# grab the old committed version\ngit checkout -- README.md \n```\n\n[2. Staged but not committed]{.secondary}\n\nIn RStudio, uncheck the box by the file, then use the method above.\n```{.bash}\n# first unstage, then same as 1\ngit reset HEAD README.md\ngit checkout -- README.md\n```\n:::\n\n::: w-50\n\n[3. Committed]{.secondary}\n\nNot easy to do in RStudio...\n```{.bash}\n# check the log to find the change \ngit log\n# go one step before that \n# (e.g., to commit 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n```\n\n
\n\n```{.bash}\n# alternatively, if it happens\n# to also be on another branch\ngit checkout otherbranch -- README.md\n```\n:::\n:::\n\n## Recovering from things\n\n1. Accidentally did work on main\n```{.bash}\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n```\n\n2. Made a branch, did lots of work, realized it's trash, and you want to burn it\n```{.bash}\ngit checkout main\n# -D force-deletes even if the branch isn't merged\ngit branch -D badbranch\n```\n\n3. Anything more complicated: either post to Slack or LMGTFY\n\n\n## Rules for HW 2\n\n* Each team has their own repo\n* Make a PR against `main` to submit\n* Tag me and all the assigned reviewers\n* Peer evaluations are done via PR review (also send to Estella)\n* YOU must make at [least 5 commits]{.secondary} (fewer will lead to deductions)\n* I review your work and merge the PR\n\n::: {.callout-important}\n☠️☠️ Read all the instructions in the repo! 
☠️☠️\n:::\n\n\n# Practice time...\n\n[dajmcdon/sugary-beverages](https://github.com/dajmcdon/sugary-beverages)\n", + "markdown": "---\nlecture: \"Version control\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Why version control?\n\n\n![](http://www.phdcomics.com/comics/archive/phd101212s.gif){fig-align=\"center\"}\n\n\n[Much of this lecture is based on material from Colin Rundel and Karl Broman]{.smallest}\n\n\n## Why version control?\n\n* Simple formal system for tracking all changes to a project\n* Time machine for your projects\n + Track blame and/or praise\n + Remove the fear of breaking things\n* Learning curve is steep, but when you need it you [REALLY]{.secondary} need it\n\n::: {.callout-tip icon=false}\n## Words of wisdom\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n\n-- _Paul 
Wilson_\n:::\n\n\n## Why Git\n\n::: flex\n::: w-60\n* You could use something like Box or Dropbox\n* These are poor-man's version control\n* Git is much more appropriate\n* It works with large groups\n* It's very fast\n* It's [much]{.secondary} better at fixing mistakes\n* Tech companies use it (so it's in your interest to have some experience)\n:::\n\n::: w-40\n![](https://imgs.xkcd.com/comics/git.png){fig-align=\"center\"}\n:::\n:::\n\n. . .\n\n::: {.callout-important appearance=\"simple\"}\nThis will hurt, but what doesn't kill you makes you stronger.\n:::\n\n## Overview\n\n* `git` is a command line program that lives on your machine\n* If you want to track changes in a directory, you type `git init`\n* This creates a (hidden) directory called `.git`\n* The `.git` directory contains a history of all changes made to \"versioned\" files\n* This top directory is referred to as a \"repository\" or \"repo\"\n* GitHub is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders `.ipynb` & `.md`\n* Some IDEs (PyCharm, RStudio, VS Code) have built-in `git`\n* `git`/GitHub is broad and complicated. Here, just what you [need]{.secondary}\n\n## Aside on \"Built-in\" & \"Command line\" {background-color=\"#97D4E9\"}\n\n:::{.callout-tip}\nFirst things first, RStudio and the Terminal\n:::\n\n\n* Command line is the \"old\" type of computing. You type commands at a prompt and the computer \"does stuff\". \n\n* You may not have seen where this is. RStudio has one built in called \"Terminal\"\n\n* The Mac System version is also called \"Terminal\". If you have a Linux machine, this should all be familiar.\n\n* Windows is not great at this.\n\n* To get the most out of Git, you have to use the command line.\n\n\n## Typical workflow {.smaller}\n\n\n1. Download a repo from GitHub\n```{.bash}\ngit clone https://github.com/stat550-2021/lecture-slides.git\n```\n2. Create a **branch**\n```{.bash}\ngit branch <branchname>\n```\n3. 
Make changes to your files.\n4. Add your changes to be tracked (\"stage\" them)\n```{.bash}\ngit add <filenames>\n```\n5. Commit your changes\n```{.bash}\ngit commit -m \"Some explanatory message\"\n```\n\n**Repeat 3--5 as needed. Once you're satisfied**\n\n* Push to GitHub\n```{.bash}\ngit push\ngit push -u origin <branchname>\n```\n\n---\n\n![](gfx/git-clone.png){fig-align=\"center\"}\n\n\n## What should be tracked?\n\n
\n\nDefinitely\n: code, markdown documentation, tex files, bash scripts/makefiles, ...\n\n
\n\nPossibly\n: logs, jupyter notebooks, images (that won’t change), ...\n\n
\n\nQuestionable\n: processed data, static pdfs, ...\n\n
\n\nDefinitely not\n: full data, continually updated pdfs, other things compiled from source code, ...\n\n\n\n## What things to track\n\n* You decide what is \"versioned\". \n\n* A file called `.gitignore` tells `git` files or types to never track\n\n```{.bash}\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n```\n\n* Shortcut to track everything (use carefully):\n\n```{.bash}\ngit add .\n```\n\n\n## What's a PR?\n\n* This exists on GitHub (not git)\n* Demonstration\n\n\n::: {.r-stack}\n![](gfx/pr1.png){.fragment height=\"550\"}\n\n![](gfx/pr2.png){.fragment height=\"550\"}\n:::\n\n## Some things to be aware of\n\n* `master` vs `main`\n* If you think you did something wrong, stop and ask for help\n* The hardest part is the initial setup. Then, this should all be rinse-and-repeat.\n* This book is great: [Happy Git with R](https://happygitwithr.com)\n 1. See Chapter 6 if you have install problems.\n 1. See Chapter 9 for credential caching (avoid typing a password all the time)\n 1. See Chapter 13 if RStudio can't find `git`\n \n## The `main/develop/branch` workflow\n\n* When working on your own\n 1. Don't NEED branches (but you should use them, really)\n 1. I make a branch if I want to try a modification without breaking what I have.\n \n \n* When working on a large team with production grade software\n 1. `main` is protected, released version of software (maybe renamed to `release`)\n 1. `develop` contains things not yet on `main`, but thoroughly tested\n 1. On a schedule (once a week, once a month) `develop` gets merged to `main`\n 1. You work on a `feature` branch off `develop` to build your new feature\n 1. You do a PR against `develop`. Supervisors review your contributions\n \n. . 
.\n\nI use this workflow with my lab, as do many DS/CS/Stat faculty.\n\n## Protection\n\n* Typical for your PR to trigger tests to make sure you don't break things\n\n* Typical for team members or supervisors to review your PR for compliance\n\n::: {.callout-tip}\nI suggest (require?) you adopt the \"production\" version for your HW 2\n:::\n\n\n## Operations in RStudio \n\n::: flex\n::: w-50\n\n1. Stage\n1. Commit\n1. Push\n1. Pull\n1. Create a branch\n\n\n\n[Covers:]{.secondary}\n\n* Everything to do your HW / Project if you're careful\n* Plus most other things you \"want to do\"\n\n:::\n\n::: w-50\n\n\nCommand line versions (of the same)\n\n```{.bash}\ngit add <filenames>\n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b <branchname>\n```\n\n:::\n:::\n\n\n## Other useful stuff (but command line only) {.smaller}\n\n::: flex\n::: w-50\nInitializing\n```{.bash}\ngit config --global user.name \"Daniel J. McDonald\"\ngit config --global user.email \"daniel@stat.ubc.ca\"\ngit config --global core.editor nano \n# or emacs or ... (default is vim)\n```\n\n\nStaging\n```{.bash}\ngit add name/of/file # stage 1 file\ngit add . 
# stage all\n```\n\nCommitting\n```{.bash}\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \n```\n\nPushing\n```{.bash}\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n```\n:::\n\n::: w-50\nBranching\n```{.bash}\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\n```\n\nCheck the status\n```{.bash}\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs\n```\n:::\n:::\n\n## Commit messages {.smaller}\n\n::: {.callout-tip appearance=\"simple\"}\n1. Write meaningful messages. Not `fixed stuff` or `oops? maybe done?`\n1. These appear in the log and help you determine what you've done.\n1. Think _imperative mood_: \"add cross validation to simulation\"\n1. 
Best to have each commit \"do one thing\"\n:::\n\n[Conventions:]{.secondary} (see [here](https://www.conventionalcommits.org/en/v1.0.0/) for details)\n\n* feat: – a new feature is introduced with the changes\n* fix: – a bug fix has occurred\n* chore: – changes that do not relate to a fix or feature (e.g., updating dependencies)\n* refactor: – refactored code that neither fixes a bug nor adds a feature\n* docs: – updates to documentation such as the README or other markdown files\n* style: – changes that do not affect the function of the code\n* test: – adding new tests or correcting previous tests\n* perf: – performance improvements\n* ci: – continuous integration related\n\n```{.bash}\ngit commit -m \"feat: add cross validation to simulation, closes #251\"\n```\n\n## Conflicts\n\n* Sometimes you merge things and \"conflicts\" happen.\n\n* This means that changes on one branch would overwrite changes on a different branch.\n\n::: flex\n::: w-50\n\n[They look like this:]{.secondary}\n\n```\nHere are lines that are either unchanged\nfrom the common ancestor, or cleanly\nresolved because only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n```\n\n:::\n\n::: w-50\n\n[You decide what to keep]{.secondary}\n\n1. Your changes (above `=======`)\n2. Their changes (below `=======`)\n3. Both.\n4. 
Neither.\n\nAlways delete the `<<<<<<<`, `=======`, and `>>>>>>>` lines.\n\nOnce you're satisfied, commit to resolve the conflict.\n\n:::\n:::\n\n## Some other pointers\n\n* Commits have long names: `32b252c854c45d2f8dfda1076078eae8d5d7c81f`\n * If you want to use it, you need \"enough to be unique\": `32b25`\n\n* Online help uses directed graphs in ways different from statistics:\n * In stats, arrows point from cause to effect, forward in time\n * In `git` docs, it's reversed: they point to the thing on which they depend\n \n \n### Cheat sheet\n\n\n\n\n## How to undo in 3 scenarios\n\n* Suppose we're concerned about a file named `README.md`\n* Often, `git status` will give some of these as suggestions\n\n::: flex\n::: w-50\n\n[1. Saved but not staged]{.secondary}\n\nIn RStudio, select the file, click the gear menu, then select Revert...\n```{.bash}\n# grab the old committed version\ngit checkout -- README.md \n```\n\n[2. Staged but not committed]{.secondary}\n\nIn RStudio, uncheck the box by the file, then use the method above.\n```{.bash}\n# first unstage, then same as 1\ngit reset HEAD README.md\ngit checkout -- README.md\n```\n:::\n\n::: w-50\n\n[3. Committed]{.secondary}\n\nNot easy to do in RStudio...\n```{.bash}\n# check the log to find the change \ngit log\n# go one step before that \n# (e.g., to commit 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n```\n\n
\n\n```{.bash}\n# alternatively, if it happens\n# to also be on another branch\ngit checkout otherbranch -- README.md\n```\n:::\n:::\n\n## Recovering from things\n\n1. Accidentally did work on main\n```{.bash}\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n```\n\n2. Made a branch, did lots of work, realized it's trash, and you want to burn it\n```{.bash}\ngit checkout main\n# -D force-deletes even if the branch isn't merged\ngit branch -D badbranch\n```\n\n3. Anything more complicated: either post to Slack or LMGTFY\n\n\n## Rules for HW 2\n\n* Each team has their own repo\n* Make a PR against `main` to submit\n* Tag me and all the assigned reviewers\n* Peer evaluations are done via PR review (also send to Estella)\n* YOU must make at [least 5 commits]{.secondary} (fewer will lead to deductions)\n* I review your work and merge the PR\n\n::: {.callout-important}\n☠️☠️ Read all the instructions in the repo! 
☠️☠️\n:::\n\n\n# Practice time...\n\n[dajmcdon/sugary-beverages](https://github.com/dajmcdon/sugary-beverages)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/grad-school/execute-results/html.json b/_freeze/schedule/slides/grad-school/execute-results/html.json index 32be4ee..3a455e2 100644 --- a/_freeze/schedule/slides/grad-school/execute-results/html.json +++ b/_freeze/schedule/slides/grad-school/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "3e9694c1b0946782432a1fba557f0a95", "result": { - "markdown": "---\nlecture: \"Skills for graduate students\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## Something happens in graduate school\n\n* As undergrads, you took lots of classes\n* You didn't care that much about all of them\n* Sure, you wanted good grades, but you may not have 
always wanted to [really learn]{.tertiary} the material\n* And you probably didn't try to go in depth beyond the requirements\n\n. . .\n\n* That has to change in grad school\n* Even if you don't want to be a professor, to get a PhD, or to do an MSc thesis.\n* This is the material that you have decided you will use for the rest of your life\n\n. . .\n\n* If you disagree, then we should talk\n\n\n## Side discussion on \"Reading for research\"\n\n\n* You should \"read\" regularly: set aside 2-3 hours every week\n* Stay up-to-date on recent research, determine what you find interesting\n* What do people care about? What does it take to write journal articles?\n\n\n## What is \"read\"?\n\n* Start with titles, then abstracts, then intro+conclusion\n* Each is a filter to determine how far to go\n* Pass 3 filters, [read]{.secondary} the paper (should take about 30 minutes)\n* Don't get bogged down in notation, proofs\n* Organize your documents somehow, make notes in the margins, etc\n* After you [read]{.secondary} it, you should be able to tell me what they show, why it's important, why it's novel\n* If you can, figure out [how]{.tertiary} they show something. This is hard.\n\n\n## How to find and organize papers\n\n* arXiv, AOS, JASA, JCGS have RSS feeds, email lists, etc\n* Find a statistician you like who filters\n* Follow reading groups\n* Conference proceedings\n* Become an IMS member, SSC member (ASA costs money :( )\n* BibDesk, Zotero\n\n## Ideal outcome\n\n* If you need to learn something, you can teach yourself\n* Know how to find the basics on the internet\n* Know how to go in depth with real sources\n* Collect a set of resources that you can turn to regularly\n* If you need to read a book, you can\n* If you need to pick up a new coding language, you can\n\n. . 
.\n\n::: {.callout-note}\n## What this doesn't mean\n\nYou are not expected to have all the answers at the tips of your fingers.\n:::\n\nBut you should get progressively better at finding them.", + "markdown": "---\nlecture: \"Skills for graduate students\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## Something happens in graduate school\n\n* As undergrads, you took lots of classes\n* You didn't care that much about all of them\n* Sure, you wanted good grades, but you may not have always wanted to [really learn]{.tertiary} the material\n* And you probably didn't try to go in depth beyond the requirements\n\n. . .\n\n* That has to change in grad school\n* Even if you don't want to be a professor, to get a PhD, or to do an MSc thesis.\n* This is the material that you have decided you will use for the rest of your life\n\n. . 
.\n\n* If you disagree, then we should talk\n\n\n## Side discussion on \"Reading for research\"\n\n\n* You should \"read\" regularly: set aside 2-3 hours every week\n* Stay up-to-date on recent research, determine what you find interesting\n* What do people care about? What does it take to write journal articles?\n\n\n## What is \"read\"?\n\n* Start with titles, then abstracts, then intro+conclusion\n* Each is a filter to determine how far to go\n* Pass 3 filters, [read]{.secondary} the paper (should take about 30 minutes)\n* Don't get bogged down in notation, proofs\n* Organize your documents somehow, make notes in the margins, etc\n* After you [read]{.secondary} it, you should be able to tell me what they show, why it's important, why it's novel\n* If you can, figure out [how]{.tertiary} they show something. This is hard.\n\n\n## How to find and organize papers\n\n* arXiv, AOS, JASA, JCGS have RSS feeds, email lists, etc\n* Find a statistician you like who filters\n* Follow reading groups\n* Conference proceedings\n* Become an IMS member, SSC member (ASA costs money :( )\n* BibDesk, Zotero\n\n## Ideal outcome\n\n* If you need to learn something, you can teach yourself\n* Know how to find the basics on the internet\n* Know how to go in depth with real sources\n* Collect a set of resources that you can turn to regularly\n* If you need to read a book, you can\n* If you need to pick up a new coding language, you can\n\n. . 
.\n\n::: {.callout-note}\n## What this doesn't mean\n\nYou are not expected to have all the answers at the tips of your fingers.\n:::\n\nBut you should get progressively better at finding them.", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/model-selection/execute-results/html.json b/_freeze/schedule/slides/model-selection/execute-results/html.json new file mode 100644 index 0000000..a015712 --- /dev/null +++ b/_freeze/schedule/slides/model-selection/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "5a7a59a46e373a7910c302177681b262", + "result": { + "markdown": "---\nlecture: \"Statistical models and model selection\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\nbibliography: refs.bib\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## What is a model?\n\nIn statistics, \"model\" has a mathematical meaning.\n\nDistinct from 
\"algorithm\" or \"procedure\".\n\nDefining a model often leads to a procedure/algorithm with good properties.\n\nSometimes procedure/algorithm $\\Rightarrow$ a specific model.\n\n> Statistics (the field) tells me how to understand when different procedures\n> are desirable and the mathematical guarantees that they satisfy.\n\nWhen are certain models appropriate?\n\n> One definition of \"Statistical Learning\" is the \"statistics behind the procedure\".\n\n## Statistical models 101\n\nWe observe data $Z_1,\\ Z_2,\\ \\ldots,\\ Z_n$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n> A [statistical model]{.secondary} is a set of distributions $\\mathcal{P}$.\n\n\nSome examples:\n\n 1. $\\P = \\{ 0 < p < 1 : P(z=1)=p,\\ P(z=0)=1-p\\}$.\n 2. $\\P = \\{ \\beta \\in \\R^p,\\ \\sigma>0 : Y \\sim N(X^\\top\\beta,\\sigma^2),\\ X\\mbox{ fixed}\\}$.\n 2. $\\P = \\{\\mbox{all CDF's }F\\}$.\n 3. $\\P = \\{\\mbox{all smooth functions } f: \\R^p \\rightarrow \\R : Z_i = (X_i, Y_i),\\ E[Y_i] = f(X_i) \\}$\n \n## Statistical models \n\nWe want to use the data to [select]{.secondary} a distribution $P$ that probably \ngenerated the data.\n\n. . . \n\n#### My model:\n\n$$\n\\P = \\{ P(z=1)=p,\\ P(z=0)=1-p,\\ 0 < p < 1 \\}\n$$\n \n* To completely characterize $P$, I just need to estimate $p$.\n\n* Need to assume that $P \\in \\P$. \n\n* This assumption is mostly empty: _need independent, can't see $z=12$._\n\n## Statistical models \n\nWe observe data $Z_i=(Y_i,X_i)$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n. . . 
\n\n#### My model\n\n$$\n\\P = \\{ \\beta \\in \\R^p, \\sigma>0 : Y_i \\given X_i=x_i \\sim N(x_i^\\top\\beta,\\ \\sigma^2) \\}.\n$$\n\n \n* To completely characterize $P$, I just need to estimate $\\beta$ and $\\sigma$.\n\n* Need to assume that $P\\in\\P$.\n\n* This time, I have to assume a lot more: \n_(conditional) Linearity, independence, conditional Gaussian noise,_\n_no ignored variables, no collinearity, etc._\n\n\n## Statistical models, unfamiliar example\n\nWe observe data $Z_i \\in \\R$ generated by some probability\ndistribution $P$. We want to use the data to learn about $P$. \n\n#### My model\n\n$$\n\\P = \\{ Z_i \\textrm{ has a density function } f \\}.\n$$\n\n \n* To completely characterize $P$, I need to estimate $f$.\n\n* In fact, we can't hope to do this.\n\n\n[Revised Model 1]{.secondary} - $\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < M \\}$\n\n[Revised Model 2]{.secondary} - $\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < K < M \\}$\n\n[Revised Model 3]{.secondary} - $\\P=\\{ Z_i \\textrm{ has a density function } f : \\int |f'| dx < M \\}$\n\n* Each of these suggests different ways of estimating $f$\n\n\n## Assumption Lean Regression\n\nImagine $Z = (Y, \\mathbf{X}) \\sim P$ with $Y \\in \\R$ and $\\mathbf{X} = (1, X_1, \\ldots, X_p)^\\top$.\n\nWe are interested in the _conditional_ distribution $P_{Y|\\mathbf{X}}$\n\nSuppose we think that there is _some_ function of interest which relates $Y$ and $X$.\n\nLet's call this function $\\mu(\\mathbf{X})$ for the moment. How do we estimate $\\mu$? What is $\\mu$?\n\n::: aside\nSee [Berk et al. _Assumption Lean Regression_](https://doi.org/10.1080/00031305.2019.1592781).\n:::\n\n\n. . . 
\n\nTo make this precise, we \n\n* Have a model $\\P$.\n* Need to define a \"good\" functional $\\mu$.\n* Let's loosely define \"good\" as\n\n> Given a new (random) $Z$, $\\mu(\\mathbf{X})$ is \"close\" to $Y$.\n\n## Evaluating \"close\"\n\nWe need more functions.\n \nChoose some _loss function_ $\\ell$ that measures how close $\\mu$ and $Y$ are.\n\n\n::: flex\n\n::: w-50\n\n* _Squared-error:_ \n$\\ell(y,\\ \\mu) = (y-\\mu)^2$\n\n* _Absolute-error:_ \n$\\ell(y,\\ \\mu) = |y-\\mu|$\n\n* _Zero-One:_ \n$\\ell(y,\\ \\mu) = I(y\\neq\\mu)=\\begin{cases} 0 & y=\\mu\\\\1 & \\mbox{else}\\end{cases}$\n\n* _Cauchy:_ \n$\\ell(y,\\ \\mu) = \\log(1 + (y - \\mu)^2)$\n\n:::\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](model-selection_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n:::\n\n\n## Start with (Expected) Squared Error\n\nLet's try to minimize the _expected_ squared error (MSE).\n\nClaim: $\\mu(X) = \\Expect{Y\\ \\vert\\ X}$ minimizes MSE.\n\nThat is, for any $r(X)$, $\\Expect{(Y - \\mu(X))^2} \\leq \\Expect{(Y-r(X))^2}$.\n\n\n. . .\n\nProof of Claim:\n\n\n\\begin{aligned}\n\\Expect{(Y-r(X))^2} \n&= \\Expect{(Y- \\mu(X) + \\mu(X) - r(X))^2}\\\\\n&= \\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(Y- \\mu(X))(\\mu(X) - r(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(\\mu(X) - r(X))\\Expect{(Y- \\mu(X)) \\given X}}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} + 0\\\\\n&\\geq \\Expect{(Y- \\mu(X))^2}\n\\end{aligned}\n\n\n\n\n## The regression function\n\nSometimes people call this solution:\n\n\n$$\\mu(X) = \\Expect{Y \\ \\vert\\ X}$$\n\n\nthe regression function. (But don't forget that it depended on $\\ell$.)\n\nIf we [assume]{.secondary} that $\\mu(x) = \\Expect{Y \\ \\vert\\ X=x} = x^\\top \\beta$, then we get back exactly OLS.\n\n. . 
.\n\nBut why should we assume $\\mu(x) = x^\\top \\beta$?\n\n\n## Brief aside {background-color=\"#97D4E9\"}\n\nSome notation / terminology\n\n* \"Hats\" on things mean \"estimates\", so $\\widehat{\\mu}$ is an estimate of $\\mu$\n\n* Parameters are \"properties of the model\", so $f_X(x)$ or $\\mu$ or $\\Var{Y}$\n\n* Random variables like $X$, $Y$, $Z$ may eventually become data, $x$, $y$, $z$, once observed.\n\n* \"Estimating\" means \"using observations to estimate _parameters_\"\n\n* \"Predicting\" means \"using observations to predict _future data_\"\n\n* Often, there is a parameter whose estimate will provide a prediction.\n\n* \"Non-parametric\" means \"we don’t assume a parametric form\" for the regression function (or density)\n\n## Estimation vs. Prediction\n\n* In consulting, you're usually interested in estimating parameters accurately.\n\n* This is a departure from machine learning, when you want to predict new data.\n\n* But to \"select a model\", we may have to choose between plausible alternatives.\n\n* This can be challenging to understand.\n\n## Prediction risk for regression\n\nGiven the _training data_ $\\mathcal{D}$, we\nwant to predict some independent _test data_\n$Z = (X,Y)$\n\nThis means forming a $\\hat f$, which is a function of both the range of\n$X$ and the training data $\\mathcal{D}$, which provides predictions\n$\\hat Y = \\hat f(X)$.\n\n\nThe quality of this prediction is measured via the prediction risk\n$$R(\\hat{f}) = \\Expect{(Y - \\hat{f}(X))^2}.$$\n\nWe know that the _regression function_,\n$\\mu(X) = \\mathbb{E}[Y \\mid X]$, is the best possible predictor.\n\n## Model selection and tuning parameters\n\n* Often \"model selection\" means \"choosing a set of predictors/variables\"\n - E.g. 
Lasso performs model selection by setting many $\\widehat\\beta=0$\n* \"Model selection\" [really]{.secondary} means\n> making any necessary decisions to arrive at a final model\n* Sometimes this means \"choosing predictors\"\n* It could also mean \"selecting a tuning parameter\"\n* Or \"deciding whether to use LASSO or Ridge\" (and picking tuning parameters)\n* Model selection means \"choose $\\mathcal{P}$\"\n \n\n## My pet peeve\n\n* Often people talk about \"using LASSO\" or \"using an SVM\"\n* This isn't quite right.\n* LASSO is a regularized procedure that depends on $\\lambda$\n* To \"use LASSO\", you must pick a particular $\\lambda$\n* Different ways to pick $\\lambda$ (today's topic) produce different final \nestimators\n* Thus we should say \"I used LASSO + CV\" or \"I used Ridge + GCV\"\n* Probably also indicate \"how\" (I used the CV minimum.)\n\n## Bias and variance\n\nRecall that $\\mathcal{D}$ is the training data.\n\n\n$$R_n(f) := \\Expect{L(Y,f(X))} = \\Expect{\\Expect{L(Y,f(X)) \\given \\mathcal{D}}}$$\n\n\n* Note the difference between \n$R_n(f)\\;\\;\\textrm{and}\\;\\;\\Expect{L(Y,f(X)) \\given \\mathcal{D}}$\n* If you use $\\mathcal{D}$ to choose $f$, then these are different.\n* If you use $\\mathcal{D}$ to choose $f$, then both depend on how much data you have seen.\n\n\n\n## Risk estimates\n\n![[@HastieTibshirani2009]](gfx/bias-var.jpg)\n\n* We can use risk estimates for 2 different goals\n\n1. Choosing between different potential models.\n2. 
Characterizing the out-of-sample performance of the chosen model.\n\n* I am not generally aware of other methods of accomplishing (1).\n\n## A model selection picture\n\n![[@HastieTibshirani2009]](gfx/model-space.jpg)\n\n## Why?\n\nWe want to do model selection for at least three reasons:\n\nPrediction accuracy\n: Can essentially *always* be improved by introducing some bias\n\nInterpretation\n: A large number of features can sometimes be reduced to an interpretable subset\n\nComputation\n: A large $p$ can create a huge computational bottleneck.\n\n## Things you shouldn't do\n\n* Estimate $R_n$ with $\\widehat{R}_n(f) = \\sum_{i=1}^n L(Y_i,\\widehat{f}(X_i))$.\n* Throw away variables with small $p$-values.\n* Use $F$-tests\n* Compare the log-likelihood between different models\n\n> These last two can occasionally be ok, but aren't in general. You should investigate the assumptions that are implicit in them.\n\n# Risk estimators\n\n## Unbiased risk estimation\n\n* It is very hard (impossible?) 
to estimate $R_n$.\n* Instead we focus on \n\n$$\\overline{R}_n(f) = \\E_{Y_1,\\ldots,Y_n}\\left[\\E_{Y^0}\\left[\\frac{1}{n}\\sum_{i=1}^n L(Y^0_i,\\hat{f}(x_i))\\given \\mathcal{D}\\right]\\right].$$\n\n* The difference is that $\\overline{R}_n(f)$ averages over the observed $x_i$ rather than taking the expected value over the distribution of $X$.\n* In the \"fixed design\" setting, these are equal.\n\n## Unbiased risk estimation\n\nFor many $L$ and some predictor $\\hat{f}$, one can show\n\n$$\\overline{R}_n(\\hat{f}) = \\Expect{\\hat{R}_n(\\hat{f})} + \\frac{2}{n} \\sum_{i=1}^n \\Cov{Y_i}{\\hat{f}(x_i)}.$$\n\n\nThis suggests estimating $\\overline{R}_n(\\hat{f})$ with\n\n$$\\hat{R}_{\\textrm{gic}} := \\hat{R}_n(\\hat{f}) + \\textrm{pen}.$$\n\n\nIf $\\Expect{\\textrm{pen}} = \\frac{2}{n}\\sum_{i=1}^n \\Cov{Y_i}{\\hat{f}(x_i)}$, we have an unbiased estimator of $\\overline{R}_n(\\hat{f})$.\n\n\n## Normal means model\n\n\nSuppose we observe the following data:\n\n$$Y_i = \\beta_i + \\epsilon_i, \\quad\\quad i=1,\\ldots,n$$\n\nwhere $\\epsilon_i\\overset{iid}{\\sim} \\mbox{N}(0,1)$.\n \n \nWe want to estimate \n$$\\boldsymbol{\\beta} = (\\beta_1,\\ldots,\\beta_n).$$\n\n\nThe usual estimator (MLE) is $$\\widehat{\\boldsymbol{\\beta}}^{MLE} = (Y_1,\\ldots,Y_n).$$\n\n\nThis estimator has lots of nice properties: __consistent, unbiased, UMVUE, (asymptotic) normality...__\n\n## MLEs are bad\n\n \nBut, the standard estimator __STINKS!__ It's a bad estimator. \n \nIt has no bias, but big variance.\n\n\n$$R_n(\\widehat{\\boldsymbol{\\beta}}^{MLE}) = \\mbox{bias}^2 + \\mbox{var} = 0\n+ n\\cdot 1= n$$\n\n\nWhat if we use a biased estimator?\n\nConsider the following estimator instead:\n$$\\widehat{\\beta}_i^S = \\begin{cases} Y_i & i \\in S\\\\ 0 & \\mbox{else}. \\end{cases}$$\n\n \nHere $S \\subseteq \\{1,\\ldots,n\\}$. 
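A quick simulation makes the comparison concrete. This sketch is not from the slides: the true mean vector, the number of replications, and the oracle choice of $S$ are all made-up illustrative values.

```r
# Sketch (illustrative values): Monte Carlo risk of the MLE vs the
# keep-only-S estimator in the normal means model Y_i = beta_i + N(0,1).
set.seed(550)
n <- 100
beta <- c(rep(3, 10), rep(0.1, 90))  # 10 big signals, 90 tiny ones
S <- which(abs(beta) >= 1)           # oracle: keep only the big ones
reps <- 5000
risks <- replicate(reps, {
  y <- beta + rnorm(n)
  mle <- sum((y - beta)^2)                  # MLE keeps every coordinate
  bS <- ifelse(seq_len(n) %in% S, y, 0)    # zero out everything not in S
  c(mle = mle, sparse = sum((bS - beta)^2))
})
rowMeans(risks)  # MLE risk is about n = 100; keeping only S does far better
```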
\n\n\n## Biased normal means\n\n\nWhat is the risk of this estimator?\n\n$$\nR_n(\\widehat{\\boldsymbol{\\beta}}^S) = \\sum_{i\\not\\in S} \\beta_i^2 + |S|.\n$$\n\nIn other words, if some $|\\beta_i| < 1$, then don't bother estimating them!\n\nIn general, introduced parameters like $S$ will be called __tuning parameters__.\n\nOf course we don't know which $|\\beta_i| < 1$.\n\nBut we could try to estimate $R_n(\\widehat{\\boldsymbol{\\beta}}^S)$, and choose $S$ to minimize our estimate.\n\n\n\n\n## Dangers of using the training error\n\n\nAlthough\n\n$$\\widehat{R}_n(\\widehat{\\boldsymbol{\\beta}}) \\approx R_n(\\widehat{\\boldsymbol{\\beta}}),$$\nthis approximation can be very bad. In fact:\n\n\nTraining Error\n: $\\widehat{R}_n(\\widehat{\\boldsymbol{\\beta}}^{MLE}) = 0$\n\nRisk\n: $R_n(\\widehat{\\boldsymbol{\\beta}}^{MLE}) = n$\n\nIn this case, the __optimism__ of the training error is $n$. \n\n\n## Normal means\n\nWhat about $\\widehat{\\boldsymbol{\\beta}}^S$?\n\n$$\\widehat{R}_n(\\widehat{\\boldsymbol{\\beta}}^S) = \\sum_{i=1}^n (\\widehat{\\beta_i}-\n Y_i)^2 = \\sum_{i \\notin S} Y_i^2 %+ |S|\\sigma^2$$\n\nWell\n $$\\E\\left[\\widehat{R}_n(\\widehat{\\boldsymbol{\\beta}}^S)\\right] =\n R_n(\\widehat{\\boldsymbol{\\beta}}^S) - 2|S| +n.$$\n\nSo I can choose $S$ by minimizing $\\widehat{R}_n(\\widehat{\\boldsymbol{\\beta}}^S) + 2|S|$. \n \n\n$$\\mbox{Estimate of Risk} = \\mbox{training error} + \\mbox{penalty}.$$\n \nThe penalty term corrects for the optimism.\n\n\n\n\n## `pen()` in the nice cases\n\n__Result:__ \nSuppose $\\hat{f}(x_i) = HY$ for some matrix $H$, and $Y_i$'s are IID. 
Then \n\n$$\\frac{2}{n} \\sum_{i=1}^n \\Cov{Y_i}{\\hat{f}(x_i)} = \\frac{2}{n} \\sum_{i=1}^n H_{ii} \\Cov{Y_i}{Y_i} = \\frac{2\\Var{Y}}{n} \\tr{H}.$$\n\n* Such estimators are called \"linear smoothers\".\n* Obvious extension to the heteroskedastic case.\n* We call $\\frac{1}{\\Var{Y}}\\sum_{i=1}^n \\Cov{Y_i}{\\hat{f}(x_i)}$ the [degrees of freedom]{.secondary} of $\\hat{f}$.\n* Linear smoothers are ubiquitous.\n* Examples: OLS, ridge regression, KNN, dictionary regression, smoothing splines, kernel regression, etc.\n\n\n## Examples of DF\n\n* OLS\n\n$$H = X (X^\\top X)^{-1} X^\\top \\Rightarrow \\tr{H} = \\textrm{rank}(X) = p$$\n\n* Ridge (decompose $X=UDV^\\top$)\n\n$$H = X (X^\\top X + \\lambda I_p)^{-1} X^\\top \\Rightarrow \\tr{H} = \\sum_{i=1}^p \\frac{d_i^2}{d_i^2 + \\lambda} < \\min\\{p,n\\}$$\n\n\n* KNN $\\textrm{df} = n/K$ (each point is its own nearest neighbor, so it gets weight $1/K$)\n\n\n## Finding risk estimators\n\nThis isn't the way everyone introduces/conceptualizes prediction risk.\n\nFor me, thinking of $\\hat{R}_n$ as\noverly optimistic and correcting for that\noptimism is conceptually appealing.\n\nAn alternative approach is to discuss [information criteria]{.secondary}.\n\nIn this case one forms a (pseudo)-metric on probability measures.\n\n\n# Comparing probability measures\n\n## Kullback–Leibler\n\nSuppose we have data $Y$ that comes from the probability density\nfunction $f$.\n\n\nWhat happens if we use the probability density function $g$ instead?\n\n\nExample\n: Suppose\n$Y \\sim N(\\mu,\\sigma^2) =: f$. We want to predict a new $Y_*$, but we\nmodel it as $Y_* \\sim N(\\mu_*,\\sigma^2) =: g$.\n\nHow far away are we? 
We can either compare\n$\\mu$ to $\\mu_*$ or $Y$ to $Y_*$.\n\nOr, we can compute how far $f$ is from $g$.\n\nWe need a notion of distance.\n\n## Kullback–Leibler\n\n[Kullback–Leibler]{.secondary} divergence (or discrepancy)\n\n$$\\begin{aligned}\nKL(f\\;\\Vert\\; g) & = \\int \\log\\left( \\frac{f(y)}{g(y)} \\right)f(y) dy \\\\\n& \\propto\n-\\int \\log (g(y)) f(y) dy \\qquad \\textrm{(ignore term without $g$)}\\\\\n& = \n-\\mathbb{E}_f [\\log (g(Y))] \\end{aligned}$$\n\n* Measures the loss incurred\nby (incorrectly) using $g$ instead of $f$.\n\n* KL is not symmetric: $KL(f\\;\\Vert\\; g) \\neq KL(g\\;\\Vert\\; f)$, nor does it satisfy the triangle inequality, so it's not a distance. But it is non-negative, and equals zero only when $f = g$ (almost everywhere).\n\n* Usually, $f,\\ g$ will depend on some parameters, call them $\\theta$.\n\n\n## KL example\n\n\n* In regression, we can specify $f = N(X^{\\top} \\beta_*, \\sigma^2)$ \n* for a fixed (true) $\\beta_*$, \n* let $g_\\theta = N(X^{\\top}\\beta,\\sigma^2)$ over all $\\theta = (\\beta,\\sigma^2) \\in \\mathbb{R}^p\\times\\mathbb{R}^+$\n* $KL(f\\;\\Vert\\; g_\\theta) \\propto -\\mathbb{E}_f [\\log (g_\\theta(Y))]$, we want to minimize\nthis over $\\theta$.\n* But $f$ is unknown, so we minimize $-\\log (g_\\theta(Y))$\ninstead. 
\n* This is the maximum likelihood estimator\n$$\\hat{\\theta}_{ML} = \\argmax_\\theta g_\\theta(Y)$$\n* We don't actually need to assume things about a true model nor have it be nested in\nthe alternative models to make this work.\n\n## Operationalizing\n\n* Now, to get an operational characterization of the KL divergence at the\nML solution $$-\\mathbb{E}_f [\\log (g_{\\hat\\theta_{ML}}(Y))]$$ we need an\napproximation (don't know $f$, still).\n\n\nResult\n: If you maximize the likelihood for a finite dimensional parameter vector $\\theta$ of length $p$, then as $n\\rightarrow \\infty$, $$-\\mathbb{E}_f [\\log (g_{\\hat\\theta_{ML}}(Y))] \\approx -\\log (g_{\\hat\\theta_{ML}}(Y)) + p.$$\n\n* This is AIC (originally \"an information criterion\", now \"Akaike's information criterion\").\n\n## AIC warnings\n\n* Choose the model with smallest AIC.\n* Often multiplied by 2 \"for historical reasons\". \n* Sometimes by $-2$ \"to be extra annoying\".\n* Your estimator for $\\theta$ needs to be the MLE (or the asymptotics may be wrong).\n* $p$ includes all estimated parameters.\n\n## Back to the OLS example\n\nSuppose $Y$ comes from the standard normal linear regression model with [known]{.secondary} variance $\\sigma^2$. \n\n$$\n\\begin{aligned}\n-\\log(g_{\\hat{\\theta}}) &\\propto \\log(\\sigma^2) + \\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i - x_i^\\top \\hat{\\beta}_{MLE})^2\\\\ \\Rightarrow AIC &= \\frac{n}{\\sigma^2}\\hat{R}_n + 2p \\propto \\hat{R}_n + \\frac{2\\sigma^2}{n} p.\n\\end{aligned}\n$$\n\n\n## Back to the OLS example\n\nSuppose $Y$ comes from the standard normal linear regression model with [unknown]{.secondary} variance $\\sigma^2$. \n\nNote that $\\hat{\\sigma}_{MLE}^2 = \\frac{1}{n} \\sum_{i=1}^n (y_i-x_i^\\top\\hat{\\beta}_{MLE})^2$. 
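The unknown-variance AIC can be checked numerically against R's built-in `AIC()` on a toy OLS fit. This is a sketch: the data-generating model below is invented, and the by-hand formula uses $-2\log$-likelihood plus $2\times$(number of estimated parameters, counting $\sigma^2$).

```r
# Sketch: AIC for OLS by hand vs AIC() on invented data.
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
sig2_mle <- mean(residuals(fit)^2)                 # sigma-hat^2 (MLE, divide by n)
loglik <- -n / 2 * (log(2 * pi) + log(sig2_mle) + 1)
by_hand <- -2 * loglik + 2 * (2 + 1)               # p = 2 coefficients, +1 for sigma^2
c(by_hand = by_hand, builtin = AIC(fit))           # the two agree
```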
\n\n$$\n\\begin{aligned}\n-\\log(g_{\\hat{\\theta}}) &\\propto \\frac{n}{2}\\log(\\hat{\\sigma}^2) + \\frac{1}{2\\hat{\\sigma^2}}\\sum_{i=1}^n (y_i - x_i^\\top \\hat{\\beta}_{MLE})^2\\\\ \\Rightarrow AIC &\\propto 2 n\\log(\\hat{\\sigma}^2)/2 + 2(p+1) \\propto \\log(\\hat{R}_n) + \\frac{2(p+1)}{n}.\n\\end{aligned}\n$$\n\n\n## Mallow's Cp\n\n* Defined for linear regression.\n* No likelihood assumptions.\n* Variance is known\n\n$$C_p = \\hat{R}_n + 2\\sigma^2 \\frac{\\textrm{df}}{n} = AIC$$\n\n\n## Bayes factor\n\nFor Bayesian Analysis, we want the posterior. Suppose we have two models A and B.\n\n$$\n\\begin{aligned}\nP(B\\given \\mathcal{D}) &= \\frac{P(\\mathcal{D}\\given B)P(B)}{P(\\mathcal{D})} \n\\propto P(\\mathcal{D}\\given B)P(B)\\\\\nP(A\\given \\mathcal{D}) &= \\frac{P(\\mathcal{D}\\given A)P(A)}{P(\\mathcal{D})} \n\\propto P(\\mathcal{D}\\given A)P(A)\n\\end{aligned}\n$$\nWe assume that $P(A) = P(B)$. Then to compare, \n$$\n\\frac{P(B\\given \\mathcal{D})}{P(A\\given \\mathcal{D})} = \\frac{P(\\mathcal{D}\\given B)} {P(\\mathcal{D}\\given A)}.\n$$\n\n* Called the [Bayes Factor]{.secondary}.\n* This is the ratio of marginal likelihoods under the different models. \n\n## Bayes Factor\n\n* Not easy to calculate generally. ()\n* Use the Laplace approximation, some simplifications, assumptions:\n\n$$\\log P(\\mathcal{D}\\given B) = \\log P(\\mathcal{D} \\given \\hat{\\theta},\\ B) -\\frac{p\\log(n)}{2} + O(1)\n$$\n\n* Multiply through by $-2$:\n$$\nBIC = -\\log (g_\\theta(Y)) + p\\log(n) = \\log(\\hat{R}_n) + \\frac{p\\log(n)}{n}\n$$\n\n* Also called Schwarz IC. Compare to AIC (variance unknown case)\n\n## SURE\n\n\n$$\\hat{R}_{gic} := \\hat{R}_n(\\hat{f}) + \\textrm{pen}.$$\n\nIf $\\Expect{\\textrm{pen}} = \\frac{2}{n}\\sum_{i=1}^n \\Cov{Y_i}{\\hat{f}(x_i)}$, we have an unbiased estimator of $\\overline{R}_n(\\hat{f})$.\n\nResult (Stein's Lemma) \n: Suppose $Y_i\\sim N(\\mu_i,\\sigma^2)$ and suppose $f$ is weakly differentiable. 
Then\n\n$$\\frac{1}{\\sigma^2} \\sum_{i=1}^n\\Cov{Y_i}{\\hat{f}_i(Y)} = \\Expect{\\sum_{i=1}^n \\frac{\\partial \\hat{f}_i}{\\partial y_i}(Y)}.$$\n\n* Note: Here I'm writing $\\hat{f}$ as a function of $Y$ rather than $x$. \n\n## SURE\n\n* This gives \"Stein's Unbiased Risk Estimator\"\n\n$$SURE = \\hat{R}_n(\\hat{f}) + 2\\sigma^2 \\sum_{i=1}^n \\frac{\\partial \\hat{f}_i}{\\partial y_i}(Y) - n\\sigma^2.$$\n\n* If $\\hat{f}(Y) = HY$ is linear, we're back to AIC (variance known case)\n\n* If $\\sigma^2$ is unknown, may not be unbiased anymore. May not care.\n\n# CV\n\n\n## Intuition for CV\n\n\nOne reason that $\\widehat{R}_n(\\widehat{f})$ is bad is that we are using the same data to pick $\\widehat{f}$ __AND__ to estimate $R_n$.\n\n\"Validation set\" fixes this, but holds out a particular, fixed block of data we pretend mimics the \"test data\".\n\n\nWhat if we set aside one observation, say the first one $(y_1, x_1)$.\n\nWe estimate $\\widehat{f}^{(1)}$ without using the first observation.\n\nThen we test our prediction:\n\n$$\\widetilde{R}_1(\\widehat{f}^{(1)}) = (y_1 -\\widehat{f}^{(1)}(x_1))^2.$$\n\n\n(Why the notation $\\widetilde{R}_1$? Because we're estimating the risk with 1 observation.)\n\n\n## Keep going\n\nBut that was only one data point $(y_1, x_1)$. Why stop there?\n\nDo the same with $(y_2, x_2)$! Get an estimate $\\widehat{f}^{(2)}$ \nwithout using it, then\n\n$$\\widetilde{R}_1(\\widehat{f}^{(2)}) = (y_2 -\\widehat{f}^{(2)}(x_2))^2.$$\n\nWe can keep doing this until we try it for every data point.\n\nAnd then average them! (Averages are good)\n\n\n$$\\mbox{LOO-CV} = \\frac{1}{n}\\sum_{i=1}^n \\widetilde{R}_1(\\widehat{f}^{(i)}) = \\frac{1}{n}\\sum_{i=1}^n \n(y_i - \\widehat{f}^{(i)}(x_i))^2$$\n\n\nThis is [__leave-one-out cross validation__]{.secondary}\n\n\n## Problems with LOO-CV\n\n🤮 Each held-out set is small $(n=1)$. Therefore, the variance of the squared error of each prediction is high.\n\n🤮 The training sets overlap. This is bad. 
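The LOO-CV recipe above is a few lines of R. This sketch uses an invented toy regression; it also checks the exact linear-smoother shortcut $(y_i - \hat{y}_i)^2/(1 - H_{ii})^2$, which gives LOO-CV for OLS without refitting $n$ times.

```r
# Sketch: LOO-CV for a toy OLS fit (invented data), plus the exact
# leave-one-out shortcut for linear smoothers.
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
loo <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, subset = -i)                    # refit without obs i
  (y[i] - predict(fit_i, data.frame(x = x[i])))^2    # test on obs i
}, numeric(1))
fit <- lm(y ~ x)
shortcut <- (residuals(fit) / (1 - hatvalues(fit)))^2
c(mean(loo), mean(shortcut))  # identical for linear smoothers like OLS
```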
\n\n- Usually, averaging reduces variance: $\\Var{\\overline{X}} = \\frac{1}{n^2}\\sum_{i=1}^n \\Var{X_i} = \\frac{1}{n}\\Var{X_1}.$\n- But only if the variables are independent. If not, then $\\Var{\\overline{X}} = \\frac{1}{n^2}\\Var{ \\sum_{i=1}^n X_i} = \\frac{1}{n}\\Var{X_1} + \\frac{1}{n^2}\\sum_{i\\neq j} \\Cov{X_i}{X_j}.$\n- Since the training sets overlap a lot, that covariance can be pretty big.\n \n🤮 We have to estimate this model $n$ times.\n\n🎉 Bias is low because we used almost all the data to fit the model: $E[\\mbox{LOO-CV}] = R_{n-1}$ \n\n \n## K-fold CV\n\n::: flex\n::: w-50\nTo alleviate some of these problems, people usually use $K$-fold cross validation.\n\nThe idea of $K$-fold is \n\n1. Divide the data into $K$ groups. \n1. Leave a group out and estimate with the rest.\n1. Test on the held-out group. Calculate an average risk over these $\\sim n/K$ data.\n1. Repeat for all $K$ groups.\n1. Average the average risks.\n\n\n:::\n\n\n::: w-50\n🎉 Less overlap, smaller covariance.\n\n🎉 Larger hold-out sets, smaller variance.\n\n🎉 Less computations (only need to estimate $K$ times)\n\n🤮 LOO-CV is (nearly) unbiased for $R_n$\n\n🤮 K-fold CV is unbiased for $R_{n(1-1/K)}$\n\nThe risk depends on how much data you use to estimate the model. 
$R_n$ depends on $n$.\n\n:::\n:::\n\n\n## Comparison\n\n* LOO-CV and AIC are asymptotically equivalent $p\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/schedule/slides/model-selection/figure-revealjs/unnamed-chunk-1-1.svg b/_freeze/schedule/slides/model-selection/figure-revealjs/unnamed-chunk-1-1.svg new file mode 100644 index 0000000..e3e4249 --- /dev/null +++ b/_freeze/schedule/slides/model-selection/figure-revealjs/unnamed-chunk-1-1.svg @@ -0,0 +1,276 @@
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/_freeze/schedule/slides/organization/execute-results/html.json b/_freeze/schedule/slides/organization/execute-results/html.json index bd87e3c..e2126d9 100644 --- a/_freeze/schedule/slides/organization/execute-results/html.json +++ b/_freeze/schedule/slides/organization/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "46509e1b87037017090fcbb152d46295", "result": { - "markdown": "---\nlecture: \"Organization and reports\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Topics for today\n\n1. Organizing your file system\n2. Writing reports that mix output and text\n3. 
(Avoiding buggy code)\n\n## The guiding theme\n\n![](https://imgs.xkcd.com/comics/is_it_worth_the_time.png){.center}\n\n\n# Organization {background-color=\"#e98a15\"}\n\n* Students come to my office\n* All their stuff is on their Desktop\n* This is 🤮\n\n## I urge you to consult:\n\n[Karl Broman's Notes](https://kbroman.org/Tools4RR/assets/lectures/06_org_eda_withnotes.pdf)\n\n\n## Some guiding principles\n\n1. Avoid naming by date. \n - Your file system already knows the date.\n - Sometimes projects take a while.\n - You can add this inside a particular report: `Last updated: 2022-01-07`\n1. If you're going to use a date anywhere, do `YYYY-MM-DD` or `YYYYMMDD` not `DD-MMM-YY`\n1. This is a process\n1. Don't get tied down\n1. But don't reorganize every time you find a better system\n1. Customize to your needs, preferences\n \n\n## Organizing your stuff\n\n\n```{.bash}\n├── Advising\n│ ├── arash\n│ ├── gian-carlo\n├── CV\n├── Computing\n│ ├── batchtools.slurm.tmpl\n│ ├── computecanada_notes.md\n│ ├── FKF\n│ └── ghclass\n├── Grants\n│ ├── B&E JSM 2010\n│ ├── CANSSI RRP 2020\n│ ├── NSERC 2020\n├── LettersofRec\n├── Manuscripts\n| ├── learning-white-matter\n| ├── rt-est\n│ ├── zzzz Old\n├── Referee reports\n├── Talks\n│ ├── JobTalk2020\n│ ├── ubc-stat-covid-talk\n│ └── utoronto-grad-advice\n├── Teaching\n│ ├── stat-406\n│ ├── stat-550\n│ ├── zzzz CMU TA\n│ └── zzzz booth\n└── Website\n```\n\n\n\n## Inside a project\n\n```{.bash}\n.\n├── README.md\n├── Summary of Goals.rtf\n├── cluster_output\n├── code\n├── data\n├── dsges-github.Rproj\n├── manuscript\n└── waldman-triage\n```\n\n* Include a README\n* Ideally have a MAKEFILE\n* Under version control, shared with collaborator\n\n\n## Basic principles\n\n* Be consistent\n – directory structure; names\n - all project files in 1 directory, not multiples\n* Always separate raw from processed data\n* Always separate code from data\n* It should be obvious what code created what files, and what the dependencies are. 
(MAKEFILE forces this)\n* [No hand-editing of data files]{.secondary}\n* Don’t use spaces in file names\n* In code, use relative paths, not absolute paths\n - `../blah` not `~/blah` or `/users/dajmcdon/Documents/Work/proj-1/blah`\n - The `{here}` package in `R` is great for this\n \n## Problem: Coordinating with collaborators\n\n* Where to put data that multiple people will work with?\n* Where to put intermediate/processed data?\n* Where to indicate the code that created those processed data files?\n* How to divvy up tasks and know who did what?\n* Need to agree on directory structure and file naming conventions\n\n[GitHub is (I think) the ideal solution, but not always feasible.]{.secondary}\n\n## Problem: Collaborators who don’t use GitHub\n\n* Use GitHub yourself\n* Copy files to/from some shared space\n - Ideally, in an automated way (Dropbox, S3 Bucket)\n - Avoid Word at all costs. Google Docs if needed.\n - Word and Git do not mix\n - [Last resort:]{.secondary} Word file in Dropbox. Everything else nicely organized on your end. Rmd file with similar structure to Manuscript that does the analysis.\n* Commit their changes.\n\n. . .\n\nOverleaf has Git built in (paid tier). I don't like Overleaf. Costs money, the viewer is crap and so is the editor. I suggest you avoid it.\n\n# Reports that mix output and text {background-color=\"#e98a15\"}\n\n## Using Rmarkdown/Quarto/Jupyter for most things\n\n### Your goal is to [Avoid at all costs]{.secondary}:\n\n* \"How did I create this plot?\"\n* \"Why did I decide to omit those six samples?\"\n* \"Where (on the web) did I find these data?\"\n* \"What was that interesting gene/feature/predictor?\"\n\n
\n \nReally useful resource:\n\n* Emily Reiderer [RmdDD](https://emilyriederer.netlify.app/post/rmarkdown-driven-development/)\n* Talk [Slides](https://www.slideshare.net/EmilyRiederer/rmarkdown-driven-development-rstudioconf-2020)\n\n## When I begin a new project\n\n1. Create a directory structure\n - `code/`\n - `papers/`\n - `notes/` (maybe?)\n - `README.md`\n - `data/` (maybe?)\n1. Write scripts in the `code/` directory\n1. TODO items in the README\n1. Use Rmarkdown/Quarto/Jupyter for reports, render to `.pdf`\n\n## As the project progresses...\n\nReorganize\n\n* Some script files go to a package (thorougly tested), all that remains is for the paper\n* These now load the package and run simulations or analyses (that take a while)\n* Maybe add a directory that contains dead-ends (code or text or ...)\n* Add `manuscript/`. I try to go for `main.tex` and `Supplement.Rmd`\n* `Supplement.Rmd` runs anything necessary in `code/` and creates all figures in the main doc and the supplement. Also generates any online supplementary material\n* Sometimes, just `manuscript/main.Rmd` \n* Sometimes `main.tex` just inputs `intro.tex`, `methods.tex`, etc.\n\n## The old manuscript (starting in School, persisting too long)\n\n1. Write lots of LaTeX, `R` code in separate files\n1. Need a figure. Run `R` code, get figure, save as `.pdf`.\n1. Recompile LaTeX. Axes are unreadable. Back to `R`, rerun `R` code, ...\n1. Recompile LaTeX. Can't distinguish lines. Back to `R`, rerun `R` code, ...\n1. Collaborator wants changes to the simulation. Edit the code. Rerun figure script, doesn't work. More edits....Finally Recompile.\n1. Reviewer \"what if `n` is bigger\". Hope I can find the right location. But the code isn't functions. Something breaks ...\n1. Etc, etc.\n\n## Now: \n\n\n1. `R` package with documented code, available on GitHub. \n1. One script to run the analysis, one to gather the results. \n1. One `.Rmd` file to take in the results, do preprocessing, generate all figures. \n1. 
LaTeX file on Journal style.\n\n### The optimal\n\nSame as above but with a MAKEFILE to automatically run parts of 1--4 as needed\n\n\n\n\n## Evolution of presentations\n\n1. LaTeX + Beamer (similar to the manuscript):\n a. Write lots of LaTeX, `R` code in separate files\n a. Need a figure. Run `R` code, get figure, save as `.pdf`.\n a. Rinse and repeat.\n1. Course slides in Rmarkdown + Slidy\n1. Seminars in Rmarkdown + Beamer (with lots of customization)\n1. Seminars in Rmarkdown + Xaringan\n1. Everything in Quarto\n\n::: {.callout-tip appearance=\"simple\"}\n* Easy to use.\n* Easy to customize (defaults are not great)\n* WELL DOCUMENTED\n:::\n\n\n## Takeaways", + "markdown": "---\nlecture: \"Organization and reports\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Topics for today\n\n1. Organizing your file system\n2. 
Writing reports that mix output and text\n3. (Avoiding buggy code)\n\n## The guiding theme\n\n![](https://imgs.xkcd.com/comics/is_it_worth_the_time.png){.center}\n\n\n# Organization {background-color=\"#e98a15\"}\n\n* Students come to my office\n* All their stuff is on their Desktop\n* This is 🤮\n\n## I urge you to consult:\n\n[Karl Broman's Notes](https://kbroman.org/Tools4RR/assets/lectures/06_org_eda_withnotes.pdf)\n\n\n## Some guiding principles\n\n1. Avoid naming by date. \n - Your file system already knows the date.\n - Sometimes projects take a while.\n - You can add this inside a particular report: `Last updated: 2022-01-07`\n1. If you're going to use a date anywhere, do `YYYY-MM-DD` or `YYYYMMDD` not `DD-MMM-YY`\n1. This is a process\n1. Don't get tied down\n1. But don't reorganize every time you find a better system\n1. Customize to your needs, preferences\n \n\n## Organizing your stuff\n\n\n```{.bash}\n├── Advising\n│ ├── arash\n│ ├── gian-carlo\n├── CV\n├── Computing\n│ ├── batchtools.slurm.tmpl\n│ ├── computecanada_notes.md\n│ ├── FKF\n│ └── ghclass\n├── Grants\n│ ├── B&E JSM 2010\n│ ├── CANSSI RRP 2020\n│ ├── NSERC 2020\n├── LettersofRec\n├── Manuscripts\n| ├── learning-white-matter\n| ├── rt-est\n│ ├── zzzz Old\n├── Referee reports\n├── Talks\n│ ├── JobTalk2020\n│ ├── ubc-stat-covid-talk\n│ └── utoronto-grad-advice\n├── Teaching\n│ ├── stat-406\n│ ├── stat-550\n│ ├── zzzz CMU TA\n│ └── zzzz booth\n└── Website\n```\n\n\n\n## Inside a project\n\n```{.bash}\n.\n├── README.md\n├── Summary of Goals.rtf\n├── cluster_output\n├── code\n├── data\n├── dsges-github.Rproj\n├── manuscript\n└── waldman-triage\n```\n\n* Include a README\n* Ideally have a MAKEFILE\n* Under version control, shared with collaborator\n\n\n## Basic principles\n\n* Be consistent\n – directory structure; names\n - all project files in 1 directory, not multiples\n* Always separate raw from processed data\n* Always separate code from data\n* It should be obvious what code created what 
files, and what the dependencies are. (MAKEFILE forces this)\n* [No hand-editing of data files]{.secondary}\n* Don’t use spaces in file names\n* In code, use relative paths, not absolute paths\n - `../blah` not `~/blah` or `/users/dajmcdon/Documents/Work/proj-1/blah`\n - The `{here}` package in `R` is great for this\n \n## Problem: Coordinating with collaborators\n\n* Where to put data that multiple people will work with?\n* Where to put intermediate/processed data?\n* Where to indicate the code that created those processed data files?\n* How to divvy up tasks and know who did what?\n* Need to agree on directory structure and file naming conventions\n\n[GitHub is (I think) the ideal solution, but not always feasible.]{.secondary}\n\n## Problem: Collaborators who don’t use GitHub\n\n* Use GitHub yourself\n* Copy files to/from some shared space\n - Ideally, in an automated way (Dropbox, S3 Bucket)\n - Avoid Word at all costs. Google Docs if needed.\n - Word and Git do not mix\n - [Last resort:]{.secondary} Word file in Dropbox. Everything else nicely organized on your end. Rmd file with similar structure to Manuscript that does the analysis.\n* Commit their changes.\n\n. . .\n\nOverleaf has Git built in (paid tier). I don't like Overleaf. Costs money, the viewer is crap and so is the editor. I suggest you avoid it.\n\n# Reports that mix output and text {background-color=\"#e98a15\"}\n\n## Using Rmarkdown/Quarto/Jupyter for most things\n\n### Your goal is to [Avoid at all costs]{.secondary}:\n\n* \"How did I create this plot?\"\n* \"Why did I decide to omit those six samples?\"\n* \"Where (on the web) did I find these data?\"\n* \"What was that interesting gene/feature/predictor?\"\n\n
\n \nReally useful resource:\n\n* Emily Riederer [RmdDD](https://emilyriederer.netlify.app/post/rmarkdown-driven-development/)\n* Talk [Slides](https://www.slideshare.net/EmilyRiederer/rmarkdown-driven-development-rstudioconf-2020)\n\n## When I begin a new project\n\n1. Create a directory structure\n - `code/`\n - `papers/`\n - `notes/` (maybe?)\n - `README.md`\n - `data/` (maybe?)\n1. Write scripts in the `code/` directory\n1. TODO items in the README\n1. Use Rmarkdown/Quarto/Jupyter for reports, render to `.pdf`\n\n## As the project progresses...\n\nReorganize\n\n* Some script files go to a package (thoroughly tested), all that remains is for the paper\n* These now load the package and run simulations or analyses (that take a while)\n* Maybe add a directory that contains dead-ends (code or text or ...)\n* Add `manuscript/`. I try to go for `main.tex` and `Supplement.Rmd`\n* `Supplement.Rmd` runs anything necessary in `code/` and creates all figures in the main doc and the supplement. Also generates any online supplementary material\n* Sometimes, just `manuscript/main.Rmd` \n* Sometimes `main.tex` just inputs `intro.tex`, `methods.tex`, etc.\n\n## The old manuscript (starting in School, persisting too long)\n\n1. Write lots of LaTeX, `R` code in separate files\n1. Need a figure. Run `R` code, get figure, save as `.pdf`.\n1. Recompile LaTeX. Axes are unreadable. Back to `R`, rerun `R` code, ...\n1. Recompile LaTeX. Can't distinguish lines. Back to `R`, rerun `R` code, ...\n1. Collaborator wants changes to the simulation. Edit the code. Rerun figure script, doesn't work. More edits.... Finally recompile.\n1. Reviewer "what if `n` is bigger". Hope I can find the right location. But the code isn't functions. Something breaks ...\n1. Etc, etc.\n\n## Now: \n\n\n1. `R` package with documented code, available on GitHub. \n1. One script to run the analysis, one to gather the results. \n1. One `.Rmd` file to take in the results, do preprocessing, generate all figures. \n1. 
LaTeX file on Journal style.\n\n### The optimal\n\nSame as above but with a MAKEFILE to automatically run parts of 1--4 as needed\n\n\n\n\n## Evolution of presentations\n\n1. LaTeX + Beamer (similar to the manuscript):\n a. Write lots of LaTeX, `R` code in separate files\n a. Need a figure. Run `R` code, get figure, save as `.pdf`.\n a. Rinse and repeat.\n1. Course slides in Rmarkdown + Slidy\n1. Seminars in Rmarkdown + Beamer (with lots of customization)\n1. Seminars in Rmarkdown + Xaringan\n1. Everything in Quarto\n\n::: {.callout-tip appearance=\"simple\"}\n* Easy to use.\n* Easy to customize (defaults are not great)\n* WELL DOCUMENTED\n:::\n\n\n## Takeaways", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/pca-intro/execute-results/html.json b/_freeze/schedule/slides/pca-intro/execute-results/html.json index d488b7e..e670fa0 100644 --- a/_freeze/schedule/slides/pca-intro/execute-results/html.json +++ b/_freeze/schedule/slides/pca-intro/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "b43f7074da22613bdd9c0a2aefd2fc3d", "result": { - "markdown": "---\nlecture: \"Principal components analysis\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ 
#2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Representation learning\n\nRepresentation learning is the idea that performance of ML methods is\nhighly dependent on the choice of representation\n\n\nFor this reason, much of ML is geared towards transforming the data into\nthe relevant features and then using these as inputs\n\n\nThis idea is as old as statistics itself, really,\n\nHowever, the idea is constantly revisited in a variety of fields and\ncontexts\n\n\nCommonly, these learned representations capture low-level information\nlike overall shapes\n\n\n\nIt is possible to quantify this intuition for PCA at least\n\n. . 
.\n\nGoal\n: Transform $\\mathbf{X}\\in \\R^{n\\times p}$ into $\\mathbf{Z} \\in \\R^{n \\times ?}$\n\n?-dimension can be bigger (feature creation) or smaller (dimension reduction) than $p$\n\n\n\n\n\n## PCA\n\nPrincipal components analysis (PCA) is a dimension\nreduction technique\n\n\nIt solves various equivalent optimization problems\n\n(Maximize variance, minimize $\\ell_2$ distortions, find closest subspace of a given rank, $\\ldots$)\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-1-1.png){fig-align='center'}\n:::\n:::\n\n\n\n\n## Lower dimensional embeddings\n\nSuppose we have predictors $\\x_1$ and $\\x_2$ (columns / features / measurements)\n\n- We more faithfully preserve the structure of this data by keeping\n $\\x_1$ and setting $\\x_2$ to zero than the opposite\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Lower dimensional embeddings\n\nAn important feature of the previous example is that $\\x_1$ and $\\x_2$\naren't correlated\n\nWhat if they are?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\nWe lose a lot of structure by setting either $\\x_1$ or $\\x_2$ to zero\n\n\n\n## Lower dimensional embeddings\n\n\nThe only difference is the first is a rotation of the second\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## PCA\n\nIf we knew how to rotate our data, then we could more \neasily retain the structure.\n\n[PCA]{.secondary} gives us exactly this rotation\n\n1. Center (+scale?) the data matrix $\\X$\n2. 
Compute the SVD of $\\X = \\U\\D \\V^\\top$ (always exists)\n3. Return $\\U_M\\D_M$, where $\\D_M$ is the largest $M$\n singular values of $\\X$\n\n\n## PCA\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n## PCA on some pop music data\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,269 × 15\n artist danceability energy key loudness mode speechiness acousticness\n \n 1 Taylor Swi… 0.781 0.357 0 -16.4 1 0.912 0.717 \n 2 Taylor Swi… 0.627 0.266 9 -15.4 1 0.929 0.796 \n 3 Taylor Swi… 0.516 0.917 11 -3.19 0 0.0827 0.0139 \n 4 Taylor Swi… 0.629 0.757 1 -8.37 0 0.0512 0.00384\n 5 Taylor Swi… 0.686 0.705 9 -10.8 1 0.249 0.832 \n 6 Taylor Swi… 0.522 0.691 2 -4.82 1 0.0307 0.00609\n 7 Taylor Swi… 0.31 0.374 6 -8.46 1 0.0275 0.761 \n 8 Taylor Swi… 0.705 0.621 2 -8.09 1 0.0334 0.101 \n 9 Taylor Swi… 0.553 0.604 1 -5.30 0 0.0258 0.202 \n10 Taylor Swi… 0.419 0.908 9 -5.16 1 0.0651 0.00048\n# ℹ 1,259 more rows\n# ℹ 7 more variables: instrumentalness , liveness , valence ,\n# tempo , time_signature , duration_ms , explicit \n```\n:::\n:::\n\n\n## PCA on some pop music data\n\n* 15 dimensions to 2\n* coloured by artist\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/pca-music-plot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Plotting the weights, $\\alpha_j,\\ j=1,2$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n# Mathematical details\n\n## Matrix decompositions\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\nThis is expressed via the SVD: $\\X = \\U\\D\\V^{\\top}$.\n\n\n::: {.callout-important}\nWe assume throughout that we have centered the data\n:::\n\nThen our 
new features are\n\n$$\\mathbf{Z} = \\X \\V = \\U\\D$$\n\n\n\n## Short SVD aside \n\n* Any $n\\times p$ matrix can be decomposed into $\\mathbf{UDV}^\\top$.\n\n* This is a computational procedure, like inverting a matrix, `svd()`\n\n* These have properties:\n\n1. $\\mathbf{U}^\\top \\mathbf{U} = \\mathbf{I}_n$\n2. $\\mathbf{V}^\\top \\mathbf{V} = \\mathbf{I}_p$\n3. $\\mathbf{D}$ is diagonal (0 off the diagonal)\n\n\n[Many]{.secondary} methods for dimension reduction use the SVD of some matrix.\n\n\n\n## Why? {.smaller}\n\n1. Given $\\X$, find a projection $\\mathbf{P}$ onto $\\R^M$ with $M \\leq p$ \nthat minimizes the reconstruction error\n$$\n\\begin{aligned}\n\\min_{\\mathbf{P}} &\\,\\, \\lVert \\mathbf{X} - \\mathbf{X}\\mathbf{P} \\rVert^2_F \\,\\,\\, \\textrm{(sum all the elements)}\\\\\n\\textrm{subject to} &\\,\\, \\textrm{rank}(\\mathbf{P}) = M,\\, \\mathbf{P} = \\mathbf{P}^T,\\, \\mathbf{P} = \\mathbf{P}^2\n\\end{aligned}\n$$\nThe conditions ensure that $\\mathbf{P}$ is a projection matrix onto $M$ dimensions.\n\n2. 
Maximize the variance explained by an orthogonal transformation $\\mathbf{A} \\in \\R^{p\\times M}$\n$$\n\\begin{aligned}\n\\max_{\\mathbf{A}} &\\,\\, \\textrm{trace}\\left(\\frac{1}{n}\\mathbf{A}^\\top \\X^\\top \\X \\mathbf{A}\\right)\\\\\n\\textrm{subject to} &\\,\\, \\mathbf{A}^\\top\\mathbf{A} = \\mathbf{I}_M\n\\end{aligned}\n$$\n\n* In case one, the minimizer is $\\mathbf{P} = \\mathbf{V}_M\\mathbf{V}_M^\\top$\n* In case two, the maximizer is $\\mathbf{A} = \\mathbf{V}_M$.\n\n## Code output to look at\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n", + "markdown": "---\nlecture: \"Principal components analysis\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Representation learning\n\nRepresentation learning is the idea that performance of ML methods is\nhighly dependent on the choice of representation\n\n\nFor this reason, 
much of ML is geared towards transforming the data into\nthe relevant features and then using these as inputs\n\n\nThis idea is as old as statistics itself, really.\n\nHowever, the idea is constantly revisited in a variety of fields and\ncontexts\n\n\nCommonly, these learned representations capture low-level information\nlike overall shapes\n\n\n\nIt is possible to quantify this intuition for PCA at least\n\n. . .\n\nGoal\n: Transform $\\mathbf{X}\\in \\R^{n\\times p}$ into $\\mathbf{Z} \\in \\R^{n \\times ?}$\n\n?-dimension can be bigger (feature creation) or smaller (dimension reduction) than $p$\n\n\n\n\n\n## PCA\n\nPrincipal components analysis (PCA) is a dimension\nreduction technique\n\n\nIt solves various equivalent optimization problems\n\n(Maximize variance, minimize $\\ell_2$ distortions, find closest subspace of a given rank, $\\ldots$)\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-1-1.png){fig-align='center'}\n:::\n:::\n\n\n\n\n## Lower dimensional embeddings\n\nSuppose we have predictors $\\x_1$ and $\\x_2$ (columns / features / measurements)\n\n- We more faithfully preserve the structure of this data by keeping\n $\\x_1$ and setting $\\x_2$ to zero than the opposite\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Lower dimensional embeddings\n\nAn important feature of the previous example is that $\\x_1$ and $\\x_2$\naren't correlated\n\nWhat if they are?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\nWe lose a lot of structure by setting either $\\x_1$ or $\\x_2$ to zero\n\n\n\n## Lower dimensional embeddings\n\n\nThe only 
difference is the first is a rotation of the second\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## PCA\n\nIf we knew how to rotate our data, then we could more \neasily retain the structure.\n\n[PCA]{.secondary} gives us exactly this rotation\n\n1. Center (+scale?) the data matrix $\\X$\n2. Compute the SVD of $\\X = \\U\\D \\V^\\top$ (always exists)\n3. Return $\\U_M\\D_M$, where $\\D_M$ is the largest $M$\n singular values of $\\X$\n\n\n## PCA\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n## PCA on some pop music data\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,269 × 15\n artist danceability energy key loudness mode speechiness acousticness\n \n 1 Taylor Swi… 0.781 0.357 0 -16.4 1 0.912 0.717 \n 2 Taylor Swi… 0.627 0.266 9 -15.4 1 0.929 0.796 \n 3 Taylor Swi… 0.516 0.917 11 -3.19 0 0.0827 0.0139 \n 4 Taylor Swi… 0.629 0.757 1 -8.37 0 0.0512 0.00384\n 5 Taylor Swi… 0.686 0.705 9 -10.8 1 0.249 0.832 \n 6 Taylor Swi… 0.522 0.691 2 -4.82 1 0.0307 0.00609\n 7 Taylor Swi… 0.31 0.374 6 -8.46 1 0.0275 0.761 \n 8 Taylor Swi… 0.705 0.621 2 -8.09 1 0.0334 0.101 \n 9 Taylor Swi… 0.553 0.604 1 -5.30 0 0.0258 0.202 \n10 Taylor Swi… 0.419 0.908 9 -5.16 1 0.0651 0.00048\n# ℹ 1,259 more rows\n# ℹ 7 more variables: instrumentalness , liveness , valence ,\n# tempo , time_signature , duration_ms , explicit \n```\n:::\n:::\n\n\n## PCA on some pop music data\n\n* 15 dimensions to 2\n* coloured by artist\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/pca-music-plot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Plotting the weights, $\\alpha_j,\\ j=1,2$\n\n\n::: {.cell layout-align=\"center\"}\n::: 
{.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n# Mathematical details\n\n## Matrix decompositions\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\nThis is expressed via the SVD: $\\X = \\U\\D\\V^{\\top}$.\n\n\n::: {.callout-important}\nWe assume throughout that we have centered the data\n:::\n\nThen our new features are\n\n$$\\mathbf{Z} = \\X \\V = \\U\\D$$\n\n\n\n## Short SVD aside \n\n* Any $n\\times p$ matrix can be decomposed into $\\mathbf{UDV}^\\top$.\n\n* This is a computational procedure, like inverting a matrix, `svd()`\n\n* These have properties:\n\n1. $\\mathbf{U}^\\top \\mathbf{U} = \\mathbf{I}_n$\n2. $\\mathbf{V}^\\top \\mathbf{V} = \\mathbf{I}_p$\n3. $\\mathbf{D}$ is diagonal (0 off the diagonal)\n\n\n[Many]{.secondary} methods for dimension reduction use the SVD of some matrix.\n\n\n\n## Why? {.smaller}\n\n1. Given $\\X$, find a projection $\\mathbf{P}$ onto $\\R^M$ with $M \\leq p$ \nthat minimizes the reconstruction error\n$$\n\\begin{aligned}\n\\min_{\\mathbf{P}} &\\,\\, \\lVert \\mathbf{X} - \\mathbf{X}\\mathbf{P} \\rVert^2_F \\,\\,\\, \\textrm{(sum all the elements)}\\\\\n\\textrm{subject to} &\\,\\, \\textrm{rank}(\\mathbf{P}) = M,\\, \\mathbf{P} = \\mathbf{P}^T,\\, \\mathbf{P} = \\mathbf{P}^2\n\\end{aligned}\n$$\nThe conditions ensure that $\\mathbf{P}$ is a projection matrix onto $M$ dimensions.\n\n2. 
Maximize the variance explained by an orthogonal transformation $\\mathbf{A} \\in \\R^{p\\times M}$\n$$\n\\begin{aligned}\n\\max_{\\mathbf{A}} &\\,\\, \\textrm{trace}\\left(\\frac{1}{n}\\mathbf{A}^\\top \\X^\\top \\X \\mathbf{A}\\right)\\\\\n\\textrm{subject to} &\\,\\, \\mathbf{A}^\\top\\mathbf{A} = \\mathbf{I}_M\n\\end{aligned}\n$$\n\n* In case one, the minimizer is $\\mathbf{P} = \\mathbf{V}_M\\mathbf{V}_M^\\top$\n* In case two, the maximizer is $\\mathbf{A} = \\mathbf{V}_M$.\n\n## Code output to look at\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n", "supporting": [ "pca-intro_files" ], diff --git a/_freeze/schedule/slides/presentations/execute-results/html.json b/_freeze/schedule/slides/presentations/execute-results/html.json index a82521d..2898efb 100644 --- a/_freeze/schedule/slides/presentations/execute-results/html.json +++ b/_freeze/schedule/slides/presentations/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "6cd1212fed22adff9e7e3817c3ad2b52", "result": { - "markdown": "---\nlecture: \"Giving presentations\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 
\\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Structure\n\n\n1. Strategy (applies to papers too)\n\n1. Dos and don'ts\n\n1. Personal preferences\n\n\n\n# Strategy\n\n\n## Genre\n\nTalks take many forms (like papers)\n\n* Department seminar\n\n* Short conference presentation\n\n* Class lecture\n\n* `...`\n\nCalibrate your talk to the [Genre]{.secondary} and the [Audience]{.tertiary}\n\n. . .\n\n* A job talk takes much more work than a class presentation\n* For context, after much practice, it takes me about 1 hour per minute of presentation length, depending on the amount of polish.\n* My course lectures take about 4x the target duration.\n* General ideas are the same for all styles.\n\n## Audience\n\n* Think about who you are talking to\n - Statisticians?\n - Students?\n - Potential employer?\n - People with PhD's but in other disciplines?\n - Your grandma?\n \n* Regardless of the audience, I think of dividing the talk roughly in 3rds.\n\n\n## (Your audience for your in-class talk) {background-color=\"#86D4FE\"}\n\n* 2/3 of the time, [the client]{.secondary}. \n* You're teaching them this topic.\n* Think \"someone who took 1 or 2 classes in statistics\"\n* 1/3 of the time, [your classmates]{.secondary}. \n* What are the details they need to know that they don't know?\n* Think carefully how to structure for that breakdown.\n\n\n\n---\n\n### 1. \nTalk to your grandma. Why are you listening to me? Why is what I'm saying interesting?\n\n### 2. \nTalk to your audience. What have I done that you should be able to do at the end?\n\n### 3. \nTalk [slightly]{.secondary} over your audience. Why should you be impressed by me?\n\n. . .\n\nPart 3 is shorter depending on the Audience.\n\n\n\n## Content \n\nEach part is a little mini-talk\n\n1. Starts with the general idea\n2. 
Develops a few details. [Strategy:]{.secondary} problem/solution or question/answer\n3. Ends with a takeaway\n\nBut these parts are [recalibrated]{.fourth-colour} to the audience.\n\n* Your Grandma doesn't want to see math.\n* Your employer might, but doesn't want to hear about $\\sigma$-fields. \n* Statisticians don't want to see proofs (but might want a sketch).\n* `...`\n\n\n## Story structure\n\n::: {.callout-note}\n## What I often see...\nOnce upon a time, a young MSc student went into the woods of theory and found some trees. \n\nFirst they looked at one tree, it was oak. \n\nThen they looked at the next tree, it was maple. \n\nThen they wondered if trees could talk. \n\nAfter three months of wandering, they saw a house...\n:::\n\n. . .\n\n::: {.callout-important}\n## The attention grabber\nAxe-wielding woodsman saves student from wolf attack!\n:::\n\n## Better structure\n\n1. (Enough details to give the headline.)\n1. Headline result.\n1. How do we know the result is real? What are the details of \ncomputation, inference, methodology.\n1. Demonstration with empirics.\n\n## You should consider...\n\n* Attention span diminishes quickly. \n* What are the 3-5 takeaways?\n* Hit your main result at the beginning: this is what I can do that I couldn't before.\n\n## The ideal map\n\nMap out what you've done.\n\n* What did you find?\n* What are the implications? \n* Why does the audience care?\n* How do we do it?\n\n. . .\n\nAvoid wandering in the wilderness:\n\n1. First we did this;\n1. But that didn't work, so we tried ...\n1. But then we added ...\n1. Finally we got to the beach ...\n1. And the water was nice ...\n\n\n# Good resource\n\nProf. 
Trevor Campbell's [\"How to Explain Things\"](https://docs.google.com/presentation/d/13vwchlzQAZjjfiI3AiBC_kM-syI6GJKzbuZoLxgy1a4/edit?usp=sharing)\n\n# Dos and don'ts\n\n## Words\n\n:::: {.columns}\n::: {.column width=\"45%\"}\nToo many words on a slide is bad\n\n* Bullet points\n* Too densely concentrated are bad\n* Are bad\n* Are hard to focus on\n\n

\n\nEmpty space is your friend\n:::\n\n\n::: {.column width=\"55%\"}\n\n\nLorem markdownum et moras et ponendi odores, neu magna per! Tyria meo iungitur\nvidet, frigore terras rogis Anienis poteram, dant. His vallem arma corpore\nvident nunc nivibus [avus](http://mirantes.org/), dea. Spatium luce certa\ncupiunt, lina. [Amabam](http://www.sub.com/satisego) opem, Iovis fecundaque et\nparum.\n\nAede virum annis audit modo: meus ramis videri: nec quod insidiisque Aonio\ntenuem, AI. Trames Iason: nocent hortatus lacteus praebita\n[paternos](http://ex.net/milledubitavit) petit, Paridis **aptus prius ut** origo\nfuriisque. Mercibus sis nullo aliudve Amathunta sufficit ululatibus,\npraevalidusque segnis *et* Dryopen. \n:::\n::::\n\n## Images\n\n:::: {.columns}\n::: {.column width=\"40%\"}\nPictures are good\n\n
\n\nFlow charts are good.\n\n
\n\n[Careful]{.fourth-colour} use of [colour]{.secondary} is [good]{.tertiary}.\n\n
\n\n### Size is good.\n\n
\n\n_too much variation is distracting_\n:::\n\n::: {.column width=\"60%\"}\n\n![](https://cdn.stocksnap.io/img-thumbs/960w/cat-kitten_LH77KFTY76.jpg)\n:::\n::::\n\n. . .\n\nHow long did you stare at the cat?\n\n\n# Personal preferences\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours1-1.png){fig-align='center'}\n:::\n:::\n\n\n. . .\n\n::: {.callout-important}\nDefaults are almost always terrible.\n:::\n\n\n## Issues with the preceding\n\n* Colours are awful\n* Grey background is distracting\n* Text size is too small\n* Legend position on the side is strange?\n* Numbers on the y-axis are nonsense\n* With a barchart, the y-axis should start at 0.\n* `.png` vs `.svg`\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours2-1.png){fig-align='center'}\n:::\n:::\n\n\n## Again\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours3-1.png){fig-align='center'}\n:::\n:::\n\n\n::: {.callout-tip}\nI like this, but ~10% of men are colour blind (including some faculty in this department). \n:::\n\n## Simulation\n\n![](gfx/colour-blind.png)\n\n## Jargon\n\n* Be wary of acronyms (MLE, BLUP, RKHS)\n* Again, think of your audience. MLE is fine for any statistician.\n* Others need definitions in words and written on the slide\n* Same for math notation $\\bar{X},\\ \\mu,\\ \\sigma,\\ \\mathbf{UDV}^\\top$\n* And for applied work e.g. SNP\n* [B]{.secondary}est [L]{.secondary}inear [U]{.secondary}nbiased [P]{.secondary}redictor\n\n---\n\n## {background-image=\"gfx/jargon.png\"}\n\n## Things I hate\n\n``{=html} Saying \"I'm not going to talk about ...\" ``{=html} \"I'm happy to discuss ... later if you'd like\".\n\n``{=html} Wiggling your laser pointer at every word. Highlight important things with pretty colours. 
Use pointer sparingly.\n\n``{=html} Playing with your collar, your pockets, your water bottle...\n\n``{=html} Staring at your slides ...\n\n``{=html} Displaying the total number of slides as in 6/85 in the lower right hand corner ...\n\n``{=html} Running over time. Skipping 6 slides to desperately make the time limit.\n\n``{=html} Using the default themes:\n\n## {background-image=\"gfx/beamer-crud.png\"}\n\n## Never use tables of numbers\n\n* Economists do this all the time for inexplicable reasons\n* I rarely put these in papers either\n* If I'm not going to talk about it, it doesn't go on the slide\n* There's no way I'm going to read off the number, certainly not to 4 decimal places\n* Use a graph\n\n## Use graphs, but\n\n* A graph with 3 dots should be a table of 3 numbers.\n* But why do you have only 3 numbers?\n* Any table can be a better graph.\n\n::: {.callout-tip}\n## Ask yourself:\nIs this the best way to display the data? Have I summarized too much? \n:::\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Things you should do\n\n``{=html} Number your slides\n\n``{=html} Have lots of prepared backup slides (details, answers to potential questions, further analysis)\n\n``{=html} Practice a lot. Practice in front of others. Practice the beginning more than the rest.\n\n``{=html} BE EXCITED. You worked hard on this. 
All results are cool. Play them up. You did something good and you want to tell everyone about how awesome you are. Own it. \n\nTake credit. Say \"I showed this\", not \"It can be shown\".\n\n\n## Things that are debatable\n\n* Math talks tend to be \"chalkboard\"\n* CS talks tend to be \"sales pitch\"\n* Stats is in the middle.\n* I lean toward details with elements of salesmanship\n* If I hear your talk, I want to be able to \"do\" what you created. This is hard without some math. \n* This also colours my decisions about software.\n\n. . .\n\n::: {.callout-note}\nJeff Bezos banned PowerPoint from Amazon presentations\n:::\n\n## Closing suggestions\n\n### 1. Slow down\n\n* Get a bottle of water before the talk. \n* Drink it to pause on (pre-planned) key slides.\n* This will help you relax. \n* It will also give the audience a few seconds to get the hard stuff into their head.\n\n### 2. Cut back\n\n* Most of your slides probably have too many words.\n* And too much \"filler\" --> Kill the filler\n\n## Closing suggestions\n\n### 3. Try to move\n\n* It's good to move physically, engage the audience\n* Try to make eye contact with the whole room\n* Record yourself once to see if you do anything extraneous\n\n### 4. Have fun.\n\n\n## Example talks:\n\n1. [Teaching PCA](pca-intro.qmd)\n2. 
[Short research talk about Latent Infections](https://cmu-delphi.github.io/cdcflu-latent-infections/)\n", + "markdown": "---\nlecture: \"Giving presentations\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n## Structure\n\n\n1. Strategy (applies to papers too)\n\n1. Dos and don'ts\n\n1. Personal preferences\n\n\n\n# Strategy\n\n\n## Genre\n\nTalks take many forms (like papers)\n\n* Department seminar\n\n* Short conference presentation\n\n* Class lecture\n\n* `...`\n\nCalibrate your talk to the [Genre]{.secondary} and the [Audience]{.tertiary}\n\n. . 
.\n\n* A job talk takes much more work than a class presentation\n* For context, after much practice, it takes me about 1 hour of preparation per minute of presentation length, depending on the amount of polish.\n* My course lectures take about 4x the target duration.\n* General ideas are the same for all styles.\n\n## Audience\n\n* Think about who you are talking to\n - Statisticians?\n - Students?\n - Potential employer?\n - People with PhDs but in other disciplines?\n - Your grandma?\n \n* Regardless of the audience, I think of dividing the talk roughly into thirds.\n\n\n## (Your audience for your in-class talk) {background-color=\"#86D4FE\"}\n\n* 2/3 of the time, [the client]{.secondary}. \n* You're teaching them this topic.\n* Think \"someone who took 1 or 2 classes in statistics\"\n* 1/3 of the time, [your classmates]{.secondary}. \n* What are the details they need to know that they don't know?\n* Think carefully about how to structure for that breakdown.\n\n\n\n---\n\n### 1. \nTalk to your grandma. Why are you listening to me? Why is what I'm saying interesting?\n\n### 2. \nTalk to your audience. What have I done that you should be able to do at the end?\n\n### 3. \nTalk [slightly]{.secondary} over your audience. Why should you be impressed by me?\n\n. . .\n\nPart 3 is shorter depending on the Audience.\n\n\n\n## Content \n\nEach part is a little mini-talk\n\n1. Starts with the general idea\n2. Develops a few details. [Strategy:]{.secondary} problem/solution or question/answer\n3. Ends with a takeaway\n\nBut these parts are [recalibrated]{.fourth-colour} to the audience.\n\n* Your Grandma doesn't want to see math.\n* Your employer might, but doesn't want to hear about $\\sigma$-fields. \n* Statisticians don't want to see proofs (but might want a sketch).\n* `...`\n\n\n## Story structure\n\n::: {.callout-note}\n## What I often see...\nOnce upon a time, a young MSc student went into the woods of theory and found some trees. \n\nFirst they looked at one tree, it was oak. 
\n\nThen they looked at the next tree, it was maple. \n\nThen they wondered if trees could talk. \n\nAfter three months of wandering, they saw a house...\n:::\n\n. . .\n\n::: {.callout-important}\n## The attention grabber\nAxe-wielding woodsman saves student from wolf attack!\n:::\n\n## Better structure\n\n1. (Enough details to give the headline.)\n1. Headline result.\n1. How do we know the result is real? What are the details of \ncomputation, inference, methodology?\n1. Demonstration with empirics.\n\n## You should consider...\n\n* Attention span diminishes quickly. \n* What are the 3-5 takeaways?\n* Hit your main result at the beginning: this is what I can do that I couldn't before.\n\n## The ideal map\n\nMap out what you've done.\n\n* What did you find?\n* What are the implications? \n* Why does the audience care?\n* How do we do it?\n\n. . .\n\nAvoid wandering in the wilderness:\n\n1. First we did this;\n1. But that didn't work, so we tried ...\n1. But then we added ...\n1. Finally we got to the beach ...\n1. And the water was nice ...\n\n\n# Good resource\n\nProf. Trevor Campbell's [\"How to Explain Things\"](https://docs.google.com/presentation/d/13vwchlzQAZjjfiI3AiBC_kM-syI6GJKzbuZoLxgy1a4/edit?usp=sharing)\n\n# Dos and don'ts\n\n## Words\n\n:::: {.columns}\n::: {.column width=\"45%\"}\nToo many words on a slide is bad\n\n* Bullet points\n* Too densely concentrated are bad\n* Are bad\n* Are hard to focus on\n\n

\n\nEmpty space is your friend\n:::\n\n\n::: {.column width=\"55%\"}\n\n\nLorem markdownum et moras et ponendi odores, neu magna per! Tyria meo iungitur\nvidet, frigore terras rogis Anienis poteram, dant. His vallem arma corpore\nvident nunc nivibus [avus](http://mirantes.org/), dea. Spatium luce certa\ncupiunt, lina. [Amabam](http://www.sub.com/satisego) opem, Iovis fecundaque et\nparum.\n\nAede virum annis audit modo: meus ramis videri: nec quod insidiisque Aonio\ntenuem, AI. Trames Iason: nocent hortatus lacteus praebita\n[paternos](http://ex.net/milledubitavit) petit, Paridis **aptus prius ut** origo\nfuriisque. Mercibus sis nullo aliudve Amathunta sufficit ululatibus,\npraevalidusque segnis *et* Dryopen. \n:::\n::::\n\n## Images\n\n:::: {.columns}\n::: {.column width=\"40%\"}\nPictures are good\n\n
\n\nFlow charts are good.\n\n
\n\n[Careful]{.fourth-colour} use of [colour]{.secondary} is [good]{.tertiary}.\n\n
\n\n### Size is good.\n\n
\n\n_too much variation is distracting_\n:::\n\n::: {.column width=\"60%\"}\n\n![](https://cdn.stocksnap.io/img-thumbs/960w/cat-kitten_LH77KFTY76.jpg)\n:::\n::::\n\n. . .\n\nHow long did you stare at the cat?\n\n\n# Personal preferences\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours1-1.png){fig-align='center'}\n:::\n:::\n\n\n. . .\n\n::: {.callout-important}\nDefaults are almost always terrible.\n:::\n\n\n## Issues with the preceding\n\n* Colours are awful\n* Grey background is distracting\n* Text size is too small\n* Legend position on the side is strange?\n* Numbers on the y-axis are nonsense\n* With a barchart, the y-axis should start at 0.\n* `.png` vs `.svg`\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours2-1.png){fig-align='center'}\n:::\n:::\n\n\n## Again\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours3-1.png){fig-align='center'}\n:::\n:::\n\n\n::: {.callout-tip}\nI like this, but ~10% of men are colour blind (including some faculty in this department). \n:::\n\n## Simulation\n\n![](gfx/colour-blind.png)\n\n## Jargon\n\n* Be wary of acronyms (MLE, BLUP, RKHS)\n* Again, think of your audience. MLE is fine for any statistician.\n* Others need definitions in words and written on the slide\n* Same for math notation $\\bar{X},\\ \\mu,\\ \\sigma,\\ \\mathbf{UDV}^\\top$\n* And for applied work, e.g., SNP\n* [B]{.secondary}est [L]{.secondary}inear [U]{.secondary}nbiased [P]{.secondary}redictor\n\n---\n\n## {background-image=\"gfx/jargon.png\"}\n\n## Things I hate\n\nSaying \"I'm not going to talk about ...\". Instead: \"I'm happy to discuss ... later if you'd like\".\n\nWiggling your laser pointer at every word. Highlight important things with pretty colours. 
Use the pointer sparingly.\n\nPlaying with your collar, your pockets, your water bottle...\n\nStaring at your slides ...\n\nDisplaying the total number of slides, as in 6/85, in the lower right-hand corner ...\n\nRunning over time. Skipping 6 slides to desperately make the time limit.\n\nUsing the default themes:\n\n## {background-image=\"gfx/beamer-crud.png\"}\n\n## Never use tables of numbers\n\n* Economists do this all the time for inexplicable reasons\n* I rarely put these in papers either\n* If I'm not going to talk about it, it doesn't go on the slide\n* There's no way I'm going to read off the number, certainly not to 4 decimal places\n* Use a graph\n\n## Use graphs, but\n\n* A graph with 3 dots should be a table of 3 numbers.\n* But why do you have only 3 numbers?\n* Any table can be a better graph.\n\n::: {.callout-tip}\n## Ask yourself:\nIs this the best way to display the data? Have I summarized too much? \n:::\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Things you should do\n\nNumber your slides.\n\nHave lots of prepared backup slides (details, answers to potential questions, further analysis).\n\nPractice a lot. Practice in front of others. Practice the beginning more than the rest.\n\nBE EXCITED. You worked hard on this. 
All results are cool. Play them up. You did something good and you want to tell everyone about how awesome you are. Own it. \n\nTake credit. Say \"I showed this\", not \"It can be shown\".\n\n\n## Things that are debatable\n\n* Math talks tend to be \"chalkboard\"\n* CS talks tend to be \"sales pitch\"\n* Stats is in the middle.\n* I lean toward details with elements of salesmanship\n* If I hear your talk, I want to be able to \"do\" what you created. This is hard without some math. \n* This also colours my decisions about software.\n\n. . .\n\n::: {.callout-note}\nJeff Bezos banned PowerPoint from Amazon presentations\n:::\n\n## Closing suggestions\n\n### 1. Slow down\n\n* Get a bottle of water before the talk. \n* Drink it to pause on (pre-planned) key slides.\n* This will help you relax. \n* It will also give the audience a few seconds to get the hard stuff into their head.\n\n### 2. Cut back\n\n* Most of your slides probably have too many words.\n* And too much \"filler\" --> Kill the filler\n\n## Closing suggestions\n\n### 3. Try to move\n\n* It's good to move physically, engage the audience\n* Try to make eye contact with the whole room\n* Record yourself once to see if you do anything extraneous\n\n### 4. Have fun.\n\n\n## Example talks:\n\n1. [Teaching PCA](pca-intro.qmd)\n2. 
[Short research talk about Latent Infections](https://cmu-delphi.github.io/cdcflu-latent-infections/)\n", "supporting": [ "presentations_files" ], diff --git a/_freeze/schedule/slides/regularization-lm/execute-results/html.json b/_freeze/schedule/slides/regularization-lm/execute-results/html.json index 42f5b72..903fdd8 100644 --- a/_freeze/schedule/slides/regularization-lm/execute-results/html.json +++ b/_freeze/schedule/slides/regularization-lm/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "da189046925c9ca229d4d2ef3c3ee2f1", "result": { - "markdown": "---\nlecture: \"Linear models, selection, regularization, and inference\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\nbibliography: refs.bib\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## Recap\n\nModel Selection means [select a family of distributions for your data]{.secondary}.\n\nIdeally, we'd do this by comparing the 
$R_n$ for one family with that for\nanother.\n\nWe'd use whichever has smaller $R_n$.\n\nBut $R_n$ depends on the truth, so we estimate it with $\\widehat{R}$.\n\nThen we use whichever has smaller $\\widehat{R}$.\n\n## Example\n\nThe truth:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndat <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100),\n y = 3 + x1 - 5 * x2 + sin(x1 * x2 / (2 * pi)) + rnorm(100, sd = 5)\n)\n```\n:::\n\n\nModel 1: $y_i = \\beta_0 + \\beta_1 x_{i1} + \\beta_2 x_{i2} + \\epsilon_i$, $\\quad\\epsilon_i \\overset{iid}{\\sim} N(0, \\sigma^2)$\n\nModel 2: `y ~ x1 + x2 + x1*x2` (what's the math version?)\n\nModel 3: `y ~ x2 + sin(x1 * x2)`\n\n\n## Fit each model and estimate $R_n$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlist(\"y ~ x1 + x2\", \"y ~ x1 * x2\", \"y ~ x2 + sin(x1*x2)\") |>\n map(~ {\n fits <- lm(as.formula(.x), data = dat)\n tibble(\n R2 = summary(fits)$r.sq,\n training_error = mean(residuals(fits)^2),\n loocv = mean( (residuals(fits) / (1 - hatvalues(fits)))^2 ),\n AIC = AIC(fits),\n BIC = BIC(fits)\n )\n }) |> list_rbind()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n R2 training_error loocv AIC BIC\n <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.589 21.3 22.9 598. 608.\n2 0.595 21.0 23.4 598. 611.\n3 0.586 21.4 23.0 598. 609.\n```\n:::\n:::\n\n\n# Greedy selection\n\n::: {.callout-note}\nI'm doing everything for linear models, but it applies to generalized linear models as well.\n:::\n\n## Model Selection vs. Variable Selection\n\nModel selection is very comprehensive.\n\nYou choose a full statistical model (probability distribution) that will be hypothesized to have generated the data.\n\nVariable selection is a subset of this. It means \n\n> choosing which predictors to include in a predictive model\n\nEliminating a predictor means removing it from the model.\n\nSome [procedures]{.hand} automatically search predictors and eliminate some.\n\nWe call this variable selection. 
But the procedure is implicitly selecting a model\nas well.\n\n\nMaking this all the more complicated, with lots of effort, we can map procedures/algorithms to larger classes of probability models, and analyze them.\n\n## Selecting variables / predictors with linear methods\n\n\nSuppose we have a pile of predictors.\n\nWe estimate models with different subsets of predictors and use CV / Cp / AIC \n/ BIC to decide which is preferred.\n\nSometimes you might have a few plausible subsets. Easy enough to choose with our criterion.\n\nSometimes you might just have a bunch of predictors, then what do you do?\n\n## Best subsets\n\nIf we imagine that only a few predictors are relevant, we could solve\n\n$$\\min_{\\beta\\in\\R^p} \\frac{1}{2n}\\norm{Y-\\X\\beta}_2^2 + \\lambda\\norm{\\beta}_0$$\n\n\nThe $\\ell_0$-norm counts the number of non-zero coefficients.\n\nThis may or may not be a good thing to do.\n\nIt is computationally infeasible if $p$ is more than about 20.\n\nTechnically NP-hard (you must find the error of each of the $2^p$ models)\n\nThough see [@BertsimasKing2016] for a method of solving reasonably large cases via mixed integer programming.\n\n## Greedy methods\n\nBecause this is an NP-hard problem, we fall back on greedy algorithms.\n\nAll are implemented by the `regsubsets` function in the `leaps` package. 
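For concreteness, a minimal sketch of calling it (my own illustration, not taken from the slides; it assumes the `leaps` package is installed and uses a small simulated data frame):

```r
# Greedy variable selection with leaps::regsubsets().
# method = "exhaustive" is all subsets; "forward" / "backward" give the greedy versions.
library(leaps)

set.seed(550)
dat <- data.frame(matrix(rnorm(100 * 5), 100, 5))
dat$y <- 2 * dat$X1 - dat$X2 + rnorm(100) # only X1 and X2 matter

fits <- regsubsets(y ~ ., data = dat, method = "forward")
ss <- summary(fits)
ss$which # which predictors enter the best model of each size
ss$bic   # BIC for each size; pick the size that minimizes it
```

The `summary()` object also carries `$cp`, `$rsq`, and `$adjr2`, so any of the risk estimates above can be used to choose the size.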
\n\nAll subsets\n: estimate model based on every possible subset of size $|\\mathcal{S}| \\leq \\min\\{n, p\\}$, use one with \nlowest risk estimate\n\nForward selection\n: start with $\\mathcal{S}=\\varnothing$, add predictors greedily\n\nBackward selection\n: start with $\\mathcal{S}=\\{1,\\ldots,p\\}$, remove greedily\n\nHybrid\n: combine forward and backward smartly\n\n##\n\n::: {.callout-note}\nWithin each procedure, we're comparing _nested_ models.\n:::\n\n\n## Costs and benefits\n\n\nAll subsets\n: 👍 estimates each subset \n💣 takes $2^p$ model fits, infeasible once $p$ is much more than about 20\n\nForward selection\n: 👍 fits many fewer models \n💣 need not find the best subset\n\nBackward selection\n: 👍 fits many fewer models \n💣 need not find the best subset \n💣 doesn't work when $p>n$\n\nHybrid\n: 👍 visits more models than forward/backward \n💣 slower\n\n\n## Synthetic example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2024 - 550)\nn <- 550\ndf <- tibble( \n x1 = rnorm(n),\n x2 = rnorm(n, mean = 2, sd = 1),\n x3 = rexp(n, rate = 1),\n x4 = x2 + rnorm(n, sd = .1), # correlated with x2\n x5 = x1 + rnorm(n, sd = .1), # correlated with x1\n x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)\n x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)\n y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only\n)\n```\n:::\n\n\n$\\mathbf{x}_1$ and $\\mathbf{x}_2$ are the true predictors\n\nBut the rest are correlated with them\n\n\n## Full model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfull <- lm(y ~ ., data = df)\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.120 -1.386 -0.060 1.417 6.536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.17176 0.21823 -0.787 0.43158 \nx1 4.94560 1.62872 3.036 0.00251 **\nx2 1.88209 1.34057 1.404 0.16091 \nx3 0.10755 0.90835 0.118 0.90579 \nx4 -1.51043 0.97746 -1.545 0.12287 \nx5 -1.79872 0.94961 -1.894 0.05874 . \nx6 -0.08277 0.92535 -0.089 0.92876 \nx7 -0.05477 0.90159 -0.061 0.95159 \n---\nSignif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.176 on 542 degrees of freedom\nMultiple R-squared: 0.6538,\tAdjusted R-squared: 0.6494 \nF-statistic: 146.2 on 7 and 542 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## True model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntruth <- lm(y ~ x1 + x2, data = df)\nsummary(truth)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ x1 + x2, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.0630 -1.4199 -0.0654 1.3871 6.7382 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.12389 0.20060 -0.618 0.537 \nx1 2.99853 0.09434 31.783 < 2e-16 ***\nx2 0.44614 0.09257 4.820 1.87e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.179 on 547 degrees of freedom\nMultiple R-squared: 0.6498,\tAdjusted R-squared: 0.6485 \nF-statistic: 507.5 on 2 and 547 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## All subsets\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(leaps)\ntrythemall <- regsubsets(y ~ ., data = df)\nsummary(trythemall)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df)\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: exhaustive\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \"*\" \" \" \" \" \" \" \" \" \" \"\n3 ( 1 ) \"*\" \"*\" \" \" \" \" \"*\" \" \" \" \"\n4 ( 1 ) \"*\" \"*\" \" \" \"*\" \"*\" \" \" \" \"\n5 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \" \" \" \"\n6 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \" \"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\"\n```\n:::\n:::\n\n\n\n## BIC and Cp\n\n\n::: {.cell layout-align=\"center\"}\n::: 
{.cell-output-display}\n![](regularization-lm_files/figure-revealjs/more-all-subsets1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Theory\n\nThis result is due to @FosterGeorge1994.\n\n1. If the truth is linear.\n2. $\\lambda = C\\sigma^2\\log p.$\n3. $\\norm{\\beta_*}_0 = s$\n\n$$\\frac{\\Expect{\\norm{\\X\\beta_*-\\X\\hat\\beta}_2^2}/n}{s\\sigma^2/n} \\leq 4\\log p + 2 + o(1).$$\n\n\n$$\\inf_{\\hat\\beta}\\sup_{\\X,\\beta_*} \\frac{\\Expect{\\norm{\\X\\beta_*-\\X\\hat\\beta}_2^2}/n}{s\\sigma^2/n} \\geq 2\\log p - o(\\log p).$$\n\n\n##\n\n::: {.callout-important}\n\n- even if we could compute the subset selection estimator at scale, it’s not clear that we would want to\n- (Many people assume that we would.) \n- theory provides an understanding of the performance of various estimators under typically idealized conditions\n\n:::\n\n\n\n# Regularization\n\n## Regularization\n\n\n* Another way to control bias and variance is through [regularization]{.secondary} or\n[shrinkage]{.secondary}. \n\n\n* Rather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.\n\n\n* But, make your estimates of $\\beta$ \"smaller\"\n\n\n\n## Brief aside on optimization\n\n* An optimization problem has 2 components:\n\n 1. The \"Objective function\": e.g. $\\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$.\n 2. The \"constraint\": e.g. \"fewer than 5 non-zero entries in $\\beta$\".\n \n* A constrained minimization problem is written\n\n\n$$\\min_\\beta f(\\beta)\\;\\; \\mbox{ subject to }\\;\\; C(\\beta)$$\n\n* $f(\\beta)$ is the objective function\n* $C(\\beta)$ is the constraint\n\n\n## Ridge regression (constrained version)\n\nOne way to do this for regression is to solve (say):\n$$\n\\minimize_\\beta \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\sum_j \\beta^2_j < s\n$$\nfor some $s>0$.\n\n* This is called \"ridge regression\".\n* Write the minimizer as $\\hat{\\beta}_s$.\n\n. . 
.\n\nCompare this to ordinary least squares:\n\n$$\n\\minimize_\\beta \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 \n\\quad \\st \\beta \\in \\R^p\n$$\n\n\n\n## Geometry of ridge regression (contours)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/plotting-functions-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Reminder of norms we should remember\n\n$\\ell_q$-norm\n: $\\left(\\sum_{j=1}^p |z_j|^q\\right)^{1/q}$\n\n$\\ell_1$-norm (special case)\n: $\\sum_{j=1}^p |z_j|$\n\n$\\ell_0$-norm\n: $\\sum_{j=1}^p I(z_j \\neq 0 ) = \\lvert \\{j : z_j \\neq 0 \\}\\rvert$\n\n$\\ell_\\infty$-norm\n: $\\max_{1\\leq j \\leq p} |z_j|$\n\n::: aside\nRecall what a norm is: \n:::\n\n\n## Ridge regression\n\nAn equivalent way to write\n\n$$\\hat\\beta_s = \\argmin_{ \\Vert \\beta \\Vert_2^2 \\leq s} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$$\n\n\nis in the [Lagrangian]{.secondary} form\n\n\n$$\\hat\\beta_\\lambda = \\argmin_{ \\beta} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\frac{\\lambda}{2} \\Vert \\beta \\Vert_2^2.$$\n\n\n\n\nFor every $\\lambda$ there is a unique $s$ (and vice versa) that makes \n\n$$\\hat\\beta_s = \\hat\\beta_\\lambda$$\n\n## Ridge regression\n\n$\\hat\\beta_s = \\argmin_{ \\Vert \\beta \\Vert_2^2 \\leq s} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$\n\n$\\hat\\beta_\\lambda = \\argmin_{ \\beta} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\frac{\\lambda}{2} \\Vert \\beta \\Vert_2^2.$\n\nObserve:\n\n* $\\lambda = 0$ (or $s = \\infty$) makes $\\hat\\beta_\\lambda = \\hat\\beta_{ols}$\n* Any $\\lambda > 0$ (or $s <\\infty$) penalizes larger values of $\\beta$, effectively shrinking them.\n\n\n$\\lambda$ and $s$ are known as [tuning parameters]{.secondary}\n\n\n\n\n## Example data\n\n`prostate` data from [ESL]\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 97 × 10\n lcavol lweight age lbph svi lcp gleason pgg45 lpsa train\n \n 
 1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE \n 2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE \n 4 -1.20 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE \n 6 -1.05 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE \n 7 0.737 3.47 64 0.615 0 -1.39 6 0 0.765 FALSE\n 8 0.693 3.54 58 1.54 0 -1.39 6 0 0.854 TRUE \n 9 -0.777 3.54 47 -1.39 0 -1.39 6 0 1.05 FALSE\n10 0.223 3.24 63 -1.39 0 -1.39 6 0 1.05 FALSE\n# ℹ 87 more rows\n```\n:::\n:::\n\n\n::: notes\n\nUse `lpsa` as response.\n\n:::\n\n\n## Ridge regression path\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nY <- prostate$lpsa\nX <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))\nlibrary(glmnet)\nridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)\n```\n:::\n\n\n::: flex\n::: w-60\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n::: w-35\n\nModel selection here: \n\n* means [choosing]{.secondary} some $\\lambda$ \n\n* A value of $\\lambda$ is a vertical line.\n\n* This graphic is a \"path\" or \"coefficient trace\"\n\n* Coefficients for varying $\\lambda$\n:::\n:::\n\n\n## Solving the minimization\n\n* One nice thing about ridge regression is that it has a closed-form solution (like OLS)\n\n\n$$\\hat\\beta_\\lambda = (\\X^\\top\\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y$$\n\n* This is easy to calculate in `R` for any $\\lambda$.\n\n* However, computations and interpretation are simplified if we examine the \n[Singular Value Decomposition]{.secondary} of $\\X = \\mathbf{UDV}^\\top$.\n\n* Recall: any matrix has an SVD.\n\n* Here $\\mathbf{D}$ is diagonal and $\\mathbf{U}$ and $\\mathbf{V}$ are orthonormal: $\\mathbf{U}^\\top\\mathbf{U} = \\mathbf{I}$.\n\n## Solving the minimization\n\n$$\\hat\\beta_\\lambda = (\\X^\\top\\X + \\lambda
\\mathbf{I})^{-1}\\X^\\top \\y$$\n\n* Note that $\\mathbf{X}^\\top\\mathbf{X} = \\mathbf{VDU}^\\top\\mathbf{UDV}^\\top = \\mathbf{V}\\mathbf{D}^2\\mathbf{V}^\\top$.\n\n\n* Then,\n\n\n$$\\hat\\beta_\\lambda = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top + \\lambda \\mathbf{I})^{-1}\\mathbf{VDU}^\\top \\y\n= \\mathbf{V}(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1} \\mathbf{DU}^\\top \\y.$$\n\n* For computations, now we only need to invert $\\mathbf{D}$.\n\n\n## Comparing with OLS\n\n\n* $\\mathbf{D}$ is a diagonal matrix\n\n$$\\hat\\beta_{ols} = (\\X^\\top\\X)^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top)^{-1}\\mathbf{VDU}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-2}\\mathbf{D}}\\mathbf{U}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-1}}\\mathbf{U}^\\top \\y$$\n\n$$\\hat\\beta_\\lambda = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = \\mathbf{V}\\color{red}{(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1}} \\mathbf{DU}^\\top \\y.$$\n\n\n* Notice that $\\hat\\beta_{ols}$ depends on $d_j/d_j^2$ while $\\hat\\beta_\\lambda$ depends on $d_j/(d_j^2 + \\lambda)$.\n\n* Ridge regression makes the coefficients smaller relative to OLS.\n\n* But if $\\X$ has small singular values, ridge regression compensates with $\\lambda$ in the denominator.\n\n# Multicollinearity\n\n## Ridge regression and multicollinearity\n\n[Multicollinearity:]{.secondary} a linear combination of predictor variables is nearly equal to another predictor variable. \n\n## Multicollinearity questions\n\n1. Can I test `cor(x1, x2) == 0` to determine if these are collinear?\n2. What plots or summaries can I look at?\n3. If multivariate regression or logistic regression is applied on a data set with many explanatory variables, what in the regression output might indicate potential multicollinearity?\n4. Is there a test or diagnostic procedure for multicollinearity? \n\n\n::: notes\n1. No. \n2. Correlation matrix of continuous $x$. \n3. 
Large standard errors, estimated coefficients with opposite sign. `NA` estimates. Removing vars brings down SEs without much change in fit.\n4. Big VIF `summary(lm(xj ~ . - xj - y))$r.sq`\n:::\n\n\n\n## Multicollinearity thoughts\n\nSome comments:\n\n* A better phrase: $\\X$ is ill-conditioned\n\n* AKA \"(numerically) rank-deficient\".\n\n* $\\X = \\mathbf{U D V}^\\top$ ill-conditioned $\\Longleftrightarrow$ some elements of $\\mathbf{D} \\approx 0$\n\n* $\\hat\\beta_{ols}= \\mathbf{V D}^{-1} \\mathbf{U}^\\top \\y$. Small entries of $\\mathbf{D}$ $\\Longleftrightarrow$ huge elements of $\\mathbf{D}^{-1}$\n\n* Means huge variance: $\\Var{\\hat\\beta_{ols}} = \\sigma^2(\\X^\\top \\X)^{-1} = \\sigma^2 \\mathbf{V D}^{-2} \\mathbf{V}^\\top$\n\n* If you're doing prediction, this is a purely computational concern.\n\n\n## Ridge regression and ill-posed $\\X$\n\n\nRidge Regression fixes this problem by preventing the division by a near-zero number\n\nConclusion\n: $(\\X^{\\top}\\X)^{-1}$ can be really unstable, while $(\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1}$ is not.\n\nAside\n: Engineering approach to solving linear systems is to always do this with small $\\lambda$. The thinking is about the numerics rather than the statistics.\n\n### Which $\\lambda$ to use?\n\nComputational\n: Use CV and pick the $\\lambda$ that makes this smallest.\n\nIntuition (bias)\n: As $\\lambda\\rightarrow\\infty$, bias ⬆\n\nIntuition (variance)\n: As $\\lambda\\rightarrow\\infty$, variance ⬇\n\nYou should think about why.\n\n\n\n## Can we get the best of both worlds?\n\nTo recap:\n\n* Deciding which predictors to include, adding quadratic terms, or interactions is [model selection]{.secondary} (more precisely variable selection within a linear model).\n\n* Ridge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity. \n\n* If the LM is **true**, \n 1. OLS is unbiased, but Variance depends on $\\mathbf{D}^{-2}$. Can be big.\n 2. 
Ridge is biased (can you find the bias?). But its variance is smaller than that of OLS.\n\n* Ridge regression does not perform variable selection.\n\n* But [picking]{.hand} $\\lambda=3.7$ and thereby [deciding]{.hand} to predict with $\\widehat{\\beta}^R_{3.7}$ is [model selection]{.secondary}.\n\n\n\n## Can we get the best of both worlds?\n\nRidge regression \n: $\\minimize \\frac{1}{2n}\\Vert\\y-\\X\\beta\\Vert_2^2 \\ \\st\\ \\snorm{\\beta}_2^2 \\leq s$ \n\nBest (in-sample) linear regression model of size $s$\n: $\\minimize \\frac{1}{2n}\\snorm{\\y-\\X\\beta}_2^2 \\ \\st\\ \\snorm{\\beta}_0 \\leq s$\n\n\n$||\\beta||_0$ is the number of nonzero elements in $\\beta$\n\nFinding the best in-sample linear model (of size $s$, among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard)\n\nRidge regression is convex (easy to solve), but doesn't do __variable__ selection\n\nCan we somehow \"interpolate\" to get both?\n\n\nNote: selecting $\\lambda$ is still __model__ selection, but we've included __all__ the variables.\n\n\n## Ridge theory\n\nRecalling that $\\beta^\\top_*x$ is the best linear approximation to $f_*(x)$\n\nIf $\\norm{x}_\\infty< r$, [@HsuKakade2014],\n$$R(\\hat\\beta_\\lambda) - R(\\beta_*) \\leq \\left(1+ O\\left(\\frac{1+r^2/\\lambda}{n}\\right)\\right)\n\\frac{\\lambda\\norm{\\beta_*}_2^2}{2} + \\frac{\\sigma^2\\tr{\\Sigma}}{2n\\lambda}$$\n\n\nOptimizing over $\\lambda$, and setting $B=\\norm{\\beta_*}$ gives\n\n$$R(\\hat\\beta_\\lambda) - R(\\beta_*) \\leq \\sqrt{\\frac{\\sigma^2r^2B^2}{n}\\left(1+O(1/n)\\right)} + \nO\\left(\\frac{r^2B^2}{n}\\right)$$\n\n\n$$\\inf_{\\hat\\beta}\\sup_{\\beta_*} R(\\hat\\beta) - R(\\beta_*) \\geq C\\sqrt{\\frac{\\sigma^2r^2B^2}{n}}$$\n\n## Ridge theory\n\nWe call this behavior _rate minimax_: essentially meaning \n$$R(\\hat\\beta) - R(\\beta_*) = O\\left(\\inf_{\\hat\\beta}\\sup_{\\beta_*} R(\\hat\\beta) - R(\\beta_*)\\right)$$\n\nIn this setting, Ridge regression does as well as we could hope, up to 
constants.\n\n## Bayes interpretation\n\nIf \n\n1. $Y=X'\\beta + \\epsilon$, \n2. $\\epsilon\\sim N(0,\\sigma^2)$ \n3. $\\beta\\sim N(0,\\tau^2 I_p)$,\n\nThen, the posterior mean (median, mode) is the ridge estimator with $\\lambda=\\sigma^2/\\tau^2$.\n\n\n# Lasso\n\n## Geometry\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(mvtnorm)\nnormBall <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- data.frame(x = cos(tg)) %>%\n mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%\n gather(key = \"lab\", value = \"y\", -x)\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n df <- expand.grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)\n )\n df$z <- dmvnorm(df, mean, Sigma)\n df\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6) {\n ed <- filter(ed, x > 0, y > 0)\n for (i in 1:20) {\n ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol\n if (sum(ff) > 0) break\n tol <- 2 * tol\n }\n best <- ed[ff, ]\n best[which.max(best$z), ]\n}\n\nnb <- normBall(1)\ned <- ellipseData()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 1)\nggplot(nb, aes(x, y)) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +\n theme_bw(base_family = \"\", base_size = 24) +\n geom_label(\n data = bols, mapping = aes(label = bquote(\"hat(beta)[ols]\")), parse = TRUE,\n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n geom_label(\n data = bhat, mapping = aes(label = bquote(\"hat(beta)[s]^L\")), parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )\n```\n\n::: 
{.cell-output-display}\n![](regularization-lm_files/figure-revealjs/ball-plotting-functions-1.svg){fig-align='center'}\n:::\n:::\n\n\n## $\\ell_1$-regularized regression\n\nKnown as \n\n* \"lasso\"\n* \"basis pursuit\"\n\nThe estimator satisfies\n\n$$\\hat\\beta_s = \\argmin_{ \\snorm{\\beta}_1 \\leq s} \\frac{1}{2n}\\snorm{\\y-\\X\\beta}_2^2$$\n\n\nIn its corresponding Lagrangian dual form:\n\n$$\\hat\\beta_\\lambda = \\argmin_{\\beta} \\frac{1}{2n}\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1$$\n\n\n## Lasso\n\nWhile the ridge solution can be easily computed \n\n$$\\argmin_{\\beta} \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_2^2 = (\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1} \\X^{\\top}\\y$$\n\n\nthe lasso solution\n\n\n$$\\argmin_{\\beta} \\frac 1n\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1 = \\; ??$$\n\ndoesn't have a closed-form solution.\n\n\nHowever, because the optimization problem is convex, there exist efficient algorithms for computing it\n\n::: aside\nThe best are Iterative Soft Thresholding or Coordinate Descent. 
Gradient Descent doesn't work very well in practice.\n:::\n\n\n## Coefficient path: ridge vs lasso\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(glmnet)\ndata(prostate, package = \"ElemStatLearn\")\nX <- prostate |> dplyr::select(-train, -lpsa) |> as.matrix()\nY <- prostate$lpsa\nlasso <- glmnet(x = X, y = Y) # alpha = 1 by default\nridge <- glmnet(x = X, y = Y, alpha = 0)\nop <- par()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Additional intuition for why Lasso selects variables\n\nSuppose, for a particular $\\lambda$, I have solutions for $\\widehat{\\beta}_k$, $k = 1,\\ldots,j-1, j+1,\\ldots,p$.\n\nLet $\\widehat{\\y}_{-j} = \\X_{-j}\\widehat{\\beta}_{-j}$, and assume WLOG $\\overline{\\X}_k = 0$, $\\X_k^\\top\\X_k = 1\\ \\forall k$\n\nOne can show that:\n\n$$\n\\widehat{\\beta}_j = S\\left(\\mathbf{X}^\\top_j(\\y - \\widehat{\\y}_{-j}),\\ \\lambda\\right).\n$$\n\n$$\nS(z, \\gamma) = \\textrm{sign}(z)(|z| - \\gamma)_+ = \\begin{cases} z - \\gamma & z > \\gamma\\\\\nz + \\gamma & z < -\\gamma \\\\ 0 & |z| \\leq \\gamma \\end{cases}\n$$\n\n* Iterating over this is called [coordinate descent]{.secondary} and gives the solution\n\n::: aside\nSee for example, \n:::\n\n\n::: notes\n* If I were told all the other coefficient estimates.\n* Then to find this one, I'd shrink when the gradient is big, or set to 0 if it\ngets too small.\n:::\n\n## `{glmnet}` version (same procedure for lasso or ridge)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"1|2|3|4|5|\"}\nlasso <- cv.glmnet(X, Y) # estimate full model and CV no good reason to call glmnet() itself\n# 2. Look at the CV curve. 
If the dashed lines are at the boundaries, redo and adjust lambda\nlambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)\ncoeffs <- coefficients(lasso, s = \"lambda.min\") # s can be string or a number\npreds <- predict(lasso, newx = X, s = \"lambda.1se\") # must supply `newx`\n```\n:::\n\n\n* $\\widehat{R}_{CV}$ is an estimator of $R_n$, it has bias and variance\n* Because we did CV, we actually have 10 $\\widehat{R}$ values, 1 per split.\n* Calculate the mean (that's what we've been using), but what about SE?\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other flavours\n\nThe elastic net\n: generally used for correlated variables that\ncombines a ridge/lasso penalty. Use `glmnet(..., alpha = a)` (0 < `a` < 1). \n\nGrouped lasso\n: where variables are included or excluded in groups. Required for factors (1-hot encoding)\n\nRelaxed lasso\n: Takes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance). Use `glmnet(..., relax = TRUE)`.\n\nDantzig selector\n: a slightly modified version of the lasso\n\n## Lasso cinematic universe\n\n::: flex\n::: w-60\n\nSCAD\n: a non-convex version of lasso that adds a more severe variable selection penalty\n\n$\\sqrt{\\textrm{lasso}}$\n: claims to be tuning parameter free (but isn't). Uses $\\Vert\\cdot\\Vert_2$\ninstead of $\\Vert\\cdot\\Vert_1$ for the loss.\n\nGeneralized lasso\n: Adds various additional matrices to the penalty term (e.g. $\\Vert D\\beta\\Vert_1$). \n\nArbitrary combinations\n: combine the above penalties in your favourite combinations\n:::\n\n::: w-40\n\n![](https://sportshub.cbsistatic.com/i/2022/08/10/d348f903-585f-4aa6-aebc-d05173761065/brett-goldstein-hercules.jpg)\n\n:::\n:::\n\n## Warnings on regularized regression\n\n1. 
This isn't a method unless you say how to choose $\\lambda$.\n1. The intercept is never penalized. Adds an extra degree-of-freedom.\n1. Predictor scaling is [very]{.secondary} important.\n1. Discrete predictors need groupings.\n1. Centering the predictors may be necessary\n1. (These all work with other likelihoods.)\n\n. . .\n\nSoftware handles most of these automatically, but not always. (No Lasso with factor predictors.)\n\n## Lasso theory under strong conditions {.smaller}\n\n[Support recovery:]{.tertiary} [@Wainwright2009], see also [@MeinshausenBuhlmann2006; @ZhaoYu2006]\n\n1. The truth is linear.\n2. $\\norm{\\X'_{S^c}\\X_S (\\X'_S\\X_S)^{-1}}_\\infty < 1-\\epsilon.$\n3. $\\lambda_{\\min} (\\X'_S\\X_S) \\geq C_{\\min} > 0$.\n4. The columns of $\\X$ have 2-norm $n$.\n5. The noise is iid Normal.\n6. $\\lambda_n$ satisfies $\\frac{n\\lambda^2}{\\log(p-s)} \\rightarrow \\infty$.\n7. $\\min_j \\{ |\\beta_j| : j \\in S\\} \\geq \\rho_n > 0$ and \n$$\\rho_n^{-1} \\left( \\sqrt{\\frac{\\log s}{n}}+ \\lambda_n\\norm{(\\X'_S\\X_S)^{-1}}_\\infty \\right)\\rightarrow 0$$\n\n\nThen, $P(\\textrm{supp}(\\hat\\beta_\\lambda) = \\textrm{supp}(\\beta_*))\\rightarrow 1$.\n\n## Lasso theory under strong conditions {.smaller}\n\n[Estimation consistency:]{.tertiary} [@negahban2010unified] also [@MeinshausenYu2009]\n\n1. The truth is linear.\n2. $\\exists \\kappa$ such that for all vectors $\\theta\\in\\R^p$ that satisfy \n$\\norm{\\theta_{S^C}}_1 \\leq 3\\norm{\\theta_S}_1$, we have $\\norm{X\\theta}_2^2/n \\geq \\kappa\\norm{\\theta}_2^2$ (Compatibility)\n3. The columns of $\\X$ have 2-norm $n$.\n4. The noise is iid sub-Gaussian.\n5. $\\lambda_n >4\\sigma \\sqrt{\\log (p)/n}$.\n\nThen, with probability at least $1-c\\exp(-c'n\\lambda_n^2)$, \n$$\\norm{\\hat\\beta_\\lambda-\\beta_*}_2^2 \\leq \\frac{64\\sigma^2}{\\kappa^2}\\frac{s\\log p}{n}.$$\n\n::: {.callout-important}\nThese conditions are very strong, uncheckable in practice, unlikely to be true for real datasets. 
But theory of this type is the standard for these procedures.\n:::\n\n## Lasso under weak / no conditions\n\nIf $Y$ and $X$ are bounded by $B$, then with probability at least $1-\\delta^2$,\n$$R_n(\\hat\\beta_\\lambda) - R_n(\\beta_*) \\leq \\sqrt{\\frac{16(t+1)^4B^2}{n}\\log\\left(\\frac{\\sqrt{2}p}{\\delta}\\right)}.$$\n\n\nThis is a simple version of a result in [@GreenshteinRitov2004].\n\nNote that it applies to the constrained version.\n\n[@bartlett2012] derives the same rate for the Lagrangian version\n\nAgain, this rate is (nearly) optimal:\n$$c\\sqrt{\\frac{s}{n}} < R_n(\\hat\\beta_\\lambda) - R_n(\\beta_*) < C\\sqrt{\\frac{s\\log p}{n}}.$$\n\n\n$\\log p$ is the penalty you pay for selection.\n\n\n\n\n## References", + "markdown": "---\nlecture: \"Linear models, selection, regularization, and inference\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\nbibliography: refs.bib\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 
\\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## Recap\n\nModel Selection means [select a family of distributions for your data]{.secondary}.\n\nIdeally, we'd do this by comparing the $R_n$ for one family with that for\nanother.\n\nWe'd use whichever has smaller $R_n$.\n\nBut $R_n$ depends on the truth, so we estimate it with $\\widehat{R}$.\n\nThen we use whichever has smaller $\\widehat{R}$.\n\n## Example\n\nThe truth:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndat <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100),\n y = 3 + x1 - 5 * x2 + sin(x1 * x2 / (2 * pi)) + rnorm(100, sd = 5)\n)\n```\n:::\n\n\nModel 1: $y_i = \\beta_0 + \\beta_1 x_{i1} + \\beta_2 x_{i2} + \\epsilon_i$, $\\quad\\epsilon_i \\overset{iid}{\\sim} N(0, \\sigma^2)$\n\nModel 2: `y ~ x1 + x2 + x1*x2` (what's the math version?)\n\nModel 3: `y ~ x2 + sin(x1 * x2)`\n\n\n## Fit each model and estimate $R_n$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlist(\"y ~ x1 + x2\", \"y ~ x1 * x2\", \"y ~ x2 + sin(x1*x2)\") |>\n map(~ {\n fits <- lm(as.formula(.x), data = dat)\n tibble(\n R2 = summary(fits)$r.sq,\n training_error = mean(residuals(fits)^2),\n loocv = mean( (residuals(fits) / (1 - hatvalues(fits)))^2 ),\n AIC = AIC(fits),\n BIC = BIC(fits)\n )\n }) |> list_rbind()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 5\n R2 training_error loocv AIC BIC\n \n1 0.589 21.3 22.9 598. 608.\n2 0.595 21.0 23.4 598. 611.\n3 0.586 21.4 23.0 598. 609.\n```\n:::\n:::\n\n\n# Greedy selection\n\n::: {.callout-note}\nI'm doing everything for linear models, but applies to generalized linear models.\n:::\n\n## Model Selection vs. 
Variable Selection\n\nModel selection is very comprehensive\n\nYou choose a full statistical model (probability distribution) that will be hypothesized to have generated the data.\n\nVariable selection is a subset of this. It means \n\n> choosing which predictors to include in a predictive model\n\nEliminating a predictor means removing it from the model.\n\nSome [procedures]{.hand} automatically search predictors and eliminate some.\n\nWe call this variable selection. But the procedure is implicitly selecting a model\nas well.\n\n\nMaking this all the more complicated, with lots of effort, we can map procedures/algorithms to larger classes of probability models, and analyze them.\n\n## Selecting variables / predictors with linear methods\n\n\nSuppose we have a pile of predictors.\n\nWe estimate models with different subsets of predictors and use CV / Cp / AIC \n/ BIC to decide which is preferred.\n\nSometimes you might have a few plausible subsets. Easy enough to choose with our criterion.\n\nSometimes you might just have a bunch of predictors; then what do you do?\n\n## Best subsets\n\nIf we imagine that only a few predictors are relevant, we could solve\n\n$$\min_{\beta\in\R^p} \frac{1}{2n}\norm{Y-\X\beta}_2^2 + \lambda\norm{\beta}_0$$\n\n\nThe $\ell_0$-norm counts the number of non-zero coefficients.\n\nThis may or may not be a good thing to do.\n\nIt is computationally infeasible if $p$ is more than about 20.\n\nTechnically NP-hard (you must find the error of each of the $2^p$ models)\n\nThough see [@BertsimasKing2016] for a method of solving reasonably large cases via mixed integer programming.\n\n## Greedy methods\n\nBecause this is an NP-hard problem, we fall back on greedy algorithms.\n\nAll are implemented by the `regsubsets` function in the `leaps` package. 
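To make the greedy idea concrete, here is a hypothetical, minimal sketch of forward selection scored by BIC, written in Python/NumPy purely for illustration (the slides themselves use R's `leaps::regsubsets`; the helper names and the toy data below are invented):

```python
# Hypothetical sketch of greedy forward selection scored by BIC.
# Not leaps::regsubsets; just the idea: add the single best column, repeat.
import numpy as np

def bic(y, yhat, k):
    """BIC for a Gaussian linear model with k coefficients (up to constants)."""
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def forward_select(X, y):
    """Start from the intercept-only model; greedily add the column that
    most reduces BIC; stop when no candidate improves it."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    ones = np.ones((n, 1))
    best_score = bic(y, ones @ np.linalg.lstsq(ones, y, rcond=None)[0], 1)
    while remaining:
        scores = []
        for j in remaining:
            cols = np.column_stack([ones] + [X[:, k] for k in selected + [j]])
            beta = np.linalg.lstsq(cols, y, rcond=None)[0]
            scores.append(bic(y, cols @ beta, len(selected) + 2))
        j_best = int(np.argmin(scores))
        if scores[j_best] >= best_score:
            break  # no candidate improves BIC; stop adding
        best_score = scores[j_best]
        selected.append(remaining.pop(j_best))
    return selected

# Toy check: y depends on columns 0 and 1 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(size=200)
sel = forward_select(X, y)
print(sorted(sel))  # contains 0 and 1; spurious columns are rarely added
```

Stopping when BIC stops improving mirrors the "add predictors greedily" description above; `regsubsets` instead records the best model of each size and lets you apply the criterion afterwards.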
\n\nAll subsets\n: estimate model based on every possible subset of size $|\mathcal{S}| \leq \min\{n, p\}$, use one with \nlowest risk estimate\n\nForward selection\n: start with $\mathcal{S}=\varnothing$, add predictors greedily\n\nBackward selection\n: start with $\mathcal{S}=\{1,\ldots,p\}$, remove greedily\n\nHybrid\n: combine forward and backward smartly\n\n##\n\n::: {.callout-note}\nWithin each procedure, we're comparing _nested_ models.\n:::\n\n\n## Costs and benefits\n\n\nAll subsets\n: 👍 estimates each subset \n💣 takes $2^p$ model fits; infeasible unless $p < 30$ or so\n\nForward selection\n: 👍 only $O(p^2)$ model fits; works even if $p > n$ \n💣 greedy, so can miss the best subset\n\nBackward selection\n: 👍 also only $O(p^2)$ model fits \n💣 doesn't work if $p > n$\n\nHybrid\n: 👍 visits more models than forward/backward \n💣 slower\n\n\n## Synthetic example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2024 - 550)\nn <- 550\ndf <- tibble( \n x1 = rnorm(n),\n x2 = rnorm(n, mean = 2, sd = 1),\n x3 = rexp(n, rate = 1),\n x4 = x2 + rnorm(n, sd = .1), # correlated with x2\n x5 = x1 + rnorm(n, sd = .1), # correlated with x1\n x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)\n x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)\n y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only\n)\n```\n:::\n\n\n$\mathbf{x}_1$ and $\mathbf{x}_2$ are the true predictors\n\nBut the rest are correlated with them\n\n\n## Full model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfull <- lm(y ~ ., data = df)\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.120 -1.386 -0.060 1.417 6.536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.17176 0.21823 -0.787 0.43158 \nx1 4.94560 1.62872 3.036 0.00251 **\nx2 1.88209 1.34057 1.404 0.16091 \nx3 0.10755 0.90835 0.118 0.90579 \nx4 -1.51043 0.97746 -1.545 0.12287 \nx5 -1.79872 0.94961 -1.894 0.05874 . \nx6 -0.08277 0.92535 -0.089 0.92876 \nx7 -0.05477 0.90159 -0.061 0.95159 \n---\nSignif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.176 on 542 degrees of freedom\nMultiple R-squared: 0.6538,\tAdjusted R-squared: 0.6494 \nF-statistic: 146.2 on 7 and 542 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## True model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntruth <- lm(y ~ x1 + x2, data = df)\nsummary(truth)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ x1 + x2, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.0630 -1.4199 -0.0654 1.3871 6.7382 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.12389 0.20060 -0.618 0.537 \nx1 2.99853 0.09434 31.783 < 2e-16 ***\nx2 0.44614 0.09257 4.820 1.87e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.179 on 547 degrees of freedom\nMultiple R-squared: 0.6498,\tAdjusted R-squared: 0.6485 \nF-statistic: 507.5 on 2 and 547 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## All subsets\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(leaps)\ntrythemall <- regsubsets(y ~ ., data = df)\nsummary(trythemall)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df)\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: exhaustive\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \"*\" \" \" \" \" \" \" \" \" \" \"\n3 ( 1 ) \"*\" \"*\" \" \" \" \" \"*\" \" \" \" \"\n4 ( 1 ) \"*\" \"*\" \" \" \"*\" \"*\" \" \" \" \"\n5 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \" \" \" \"\n6 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \" \"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\"\n```\n:::\n:::\n\n\n\n## BIC and Cp\n\n\n::: {.cell layout-align=\"center\"}\n::: 
{.cell-output-display}\n![](regularization-lm_files/figure-revealjs/more-all-subsets1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Theory\n\nThis result is due to @FosterGeorge1994.\n\n1. If the truth is linear.\n2. $\\lambda = C\\sigma^2\\log p.$\n3. $\\norm{\\beta_*}_0 = s$\n\n$$\\frac{\\Expect{\\norm{\\X\\beta_*-\\X\\hat\\beta}_2^2}/n}{s\\sigma^2/n} \\leq 4\\log p + 2 + o(1).$$\n\n\n$$\\inf_{\\hat\\beta}\\sup_{\\X,\\beta_*} \\frac{\\Expect{\\norm{\\X\\beta_*-\\X\\hat\\beta}_2^2}/n}{s\\sigma^2/n} \\geq 2\\log p - o(\\log p).$$\n\n\n##\n\n::: {.callout-important}\n\n- even if we could compute the subset selection estimator at scale, it’s not clear that we would want to\n- (Many people assume that we would.) \n- theory provides an understanding of the performance of various estimators under typically idealized conditions\n\n:::\n\n\n\n# Regularization\n\n## Regularization\n\n\n* Another way to control bias and variance is through [regularization]{.secondary} or\n[shrinkage]{.secondary}. \n\n\n* Rather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.\n\n\n* But, make your estimates of $\\beta$ \"smaller\"\n\n\n\n## Brief aside on optimization\n\n* An optimization problem has 2 components:\n\n 1. The \"Objective function\": e.g. $\\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$.\n 2. The \"constraint\": e.g. \"fewer than 5 non-zero entries in $\\beta$\".\n \n* A constrained minimization problem is written\n\n\n$$\\min_\\beta f(\\beta)\\;\\; \\mbox{ subject to }\\;\\; C(\\beta)$$\n\n* $f(\\beta)$ is the objective function\n* $C(\\beta)$ is the constraint\n\n\n## Ridge regression (constrained version)\n\nOne way to do this for regression is to solve (say):\n$$\n\\minimize_\\beta \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\sum_j \\beta^2_j < s\n$$\nfor some $s>0$.\n\n* This is called \"ridge regression\".\n* Write the minimizer as $\\hat{\\beta}_s$.\n\n. . 
.\n\nCompare this to ordinary least squares:\n\n$$\n\\minimize_\\beta \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 \n\\quad \\st \\beta \\in \\R^p\n$$\n\n\n\n## Geometry of ridge regression (contours)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/plotting-functions-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Reminder of norms we should remember\n\n$\\ell_q$-norm\n: $\\left(\\sum_{j=1}^p |z_j|^q\\right)^{1/q}$\n\n$\\ell_1$-norm (special case)\n: $\\sum_{j=1}^p |z_j|$\n\n$\\ell_0$-norm\n: $\\sum_{j=1}^p I(z_j \\neq 0 ) = \\lvert \\{j : z_j \\neq 0 \\}\\rvert$\n\n$\\ell_\\infty$-norm\n: $\\max_{1\\leq j \\leq p} |z_j|$\n\n::: aside\nRecall what a norm is: \n:::\n\n\n## Ridge regression\n\nAn equivalent way to write\n\n$$\\hat\\beta_s = \\argmin_{ \\Vert \\beta \\Vert_2^2 \\leq s} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$$\n\n\nis in the [Lagrangian]{.secondary} form\n\n\n$$\\hat\\beta_\\lambda = \\argmin_{ \\beta} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\frac{\\lambda}{2} \\Vert \\beta \\Vert_2^2.$$\n\n\n\n\nFor every $\\lambda$ there is a unique $s$ (and vice versa) that makes \n\n$$\\hat\\beta_s = \\hat\\beta_\\lambda$$\n\n## Ridge regression\n\n$\\hat\\beta_s = \\argmin_{ \\Vert \\beta \\Vert_2^2 \\leq s} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2$\n\n$\\hat\\beta_\\lambda = \\argmin_{ \\beta} \\frac{1}{2n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\frac{\\lambda}{2} \\Vert \\beta \\Vert_2^2.$\n\nObserve:\n\n* $\\lambda = 0$ (or $s = \\infty$) makes $\\hat\\beta_\\lambda = \\hat\\beta_{ols}$\n* Any $\\lambda > 0$ (or $s <\\infty$) penalizes larger values of $\\beta$, effectively shrinking them.\n\n\n$\\lambda$ and $s$ are known as [tuning parameters]{.secondary}\n\n\n\n\n## Example data\n\n`prostate` data from [ESL]\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 97 × 10\n lcavol lweight age lbph svi lcp gleason pgg45 lpsa train\n \n 
1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE \n 2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE \n 4 -1.20 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE \n 6 -1.05 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE \n 7 0.737 3.47 64 0.615 0 -1.39 6 0 0.765 FALSE\n 8 0.693 3.54 58 1.54 0 -1.39 6 0 0.854 TRUE \n 9 -0.777 3.54 47 -1.39 0 -1.39 6 0 1.05 FALSE\n10 0.223 3.24 63 -1.39 0 -1.39 6 0 1.05 FALSE\n# ℹ 87 more rows\n```\n:::\n:::\n\n\n::: notes\n\nUse `lpsa` as response.\n\n:::\n\n\n## Ridge regression path\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nY <- prostate$lpsa\nX <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))\nlibrary(glmnet)\nridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)\n```\n:::\n\n\n::: flex\n::: w-60\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n::: w-35\n\nModel selection here: \n\n* means [choose]{.secondary} some $\lambda$ \n\n* A value of $\lambda$ is a vertical line.\n\n* This graphic is a \"path\" or \"coefficient trace\"\n\n* Coefficients for varying $\lambda$\n:::\n:::\n\n\n## Solving the minimization\n\n* One nice thing about ridge regression is that it has a closed-form solution (like OLS)\n\n\n$$\hat\beta_\lambda = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y$$\n\n* This is easy to calculate in `R` for any $\lambda$.\n\n* However, computations and interpretation are simplified if we examine the \n[Singular Value Decomposition]{.secondary} of $\X = \mathbf{UDV}^\top$.\n\n* Recall: any matrix has an SVD.\n\n* Here $\mathbf{D}$ is diagonal and $\mathbf{U}$ and $\mathbf{V}$ are orthonormal: $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$.\n\n## Solving the minimization\n\n$$\hat\beta_\lambda = (\X^\top\X + \lambda 
\\mathbf{I})^{-1}\\X^\\top \\y$$\n\n* Note that $\\mathbf{X}^\\top\\mathbf{X} = \\mathbf{VDU}^\\top\\mathbf{UDV}^\\top = \\mathbf{V}\\mathbf{D}^2\\mathbf{V}^\\top$.\n\n\n* Then,\n\n\n$$\\hat\\beta_\\lambda = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top + \\lambda \\mathbf{I})^{-1}\\mathbf{VDU}^\\top \\y\n= \\mathbf{V}(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1} \\mathbf{DU}^\\top \\y.$$\n\n* For computations, now we only need to invert $\\mathbf{D}$.\n\n\n## Comparing with OLS\n\n\n* $\\mathbf{D}$ is a diagonal matrix\n\n$$\\hat\\beta_{ols} = (\\X^\\top\\X)^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top)^{-1}\\mathbf{VDU}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-2}\\mathbf{D}}\\mathbf{U}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-1}}\\mathbf{U}^\\top \\y$$\n\n$$\\hat\\beta_\\lambda = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = \\mathbf{V}\\color{red}{(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1}} \\mathbf{DU}^\\top \\y.$$\n\n\n* Notice that $\\hat\\beta_{ols}$ depends on $d_j/d_j^2$ while $\\hat\\beta_\\lambda$ depends on $d_j/(d_j^2 + \\lambda)$.\n\n* Ridge regression makes the coefficients smaller relative to OLS.\n\n* But if $\\X$ has small singular values, ridge regression compensates with $\\lambda$ in the denominator.\n\n# Multicollinearity\n\n## Ridge regression and multicollinearity\n\n[Multicollinearity:]{.secondary} a linear combination of predictor variables is nearly equal to another predictor variable. \n\n## Multicollinearity questions\n\n1. Can I test `cor(x1, x2) == 0` to determine if these are collinear?\n2. What plots or summaries can I look at?\n3. If multivariate regression or logistic regression is applied on a data set with many explanatory variables, what in the regression output might indicate potential multicollinearity?\n4. Is there a test or diagnostic procedure for multicollinearity? \n\n\n::: notes\n1. No. \n2. Correlation matrix of continuous $x$. \n3. 
Large standard errors, estimated coefficients with opposite sign. `NA` estimates. Removing vars brings down SEs without much change in fit.\n4. Big VIF `summary(lm(xj ~ . - xj - y))$r.sq`\n:::\n\n\n\n## Multicollinearity thoughts\n\nSome comments:\n\n* A better phrase: $\X$ is ill-conditioned\n\n* AKA \"(numerically) rank-deficient\".\n\n* $\X = \mathbf{U D V}^\top$ ill-conditioned $\Longleftrightarrow$ some elements of $\mathbf{D} \approx 0$\n\n* $\hat\beta_{ols}= \mathbf{V D}^{-1} \mathbf{U}^\top \y$. Small entries of $\mathbf{D}$ $\Longleftrightarrow$ huge elements of $\mathbf{D}^{-1}$\n\n* Means huge variance: $\Var{\hat\beta_{ols}} = \sigma^2(\X^\top \X)^{-1} = \sigma^2 \mathbf{V D}^{-2} \mathbf{V}^\top$\n\n* If you're doing prediction, this is a purely computational concern.\n\n\n## Ridge regression and ill-posed $\X$\n\n\nRidge Regression fixes this problem by preventing the division by a near-zero number.\n\nConclusion\n: $(\X^{\top}\X)^{-1}$ can be really unstable, while $(\X^{\top}\X + \lambda \mathbf{I})^{-1}$ is not.\n\nAside\n: The engineering approach to solving linear systems is to always do this with small $\lambda$. The thinking is about the numerics rather than the statistics.\n\n### Which $\lambda$ to use?\n\nComputational\n: Use CV and pick the $\lambda$ that makes the estimated risk smallest.\n\nIntuition (bias)\n: As $\lambda\rightarrow\infty$, bias ⬆\n\nIntuition (variance)\n: As $\lambda\rightarrow\infty$, variance ⬇\n\nYou should think about why.\n\n\n\n## Can we get the best of both worlds?\n\nTo recap:\n\n* Deciding which predictors to include, adding quadratic terms, or interactions is [model selection]{.secondary} (more precisely variable selection within a linear model).\n\n* Ridge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity. \n\n* If the LM is **true**, \n 1. OLS is unbiased, but its variance depends on $\mathbf{D}^{-2}$. Can be big.\n 2. 
Ridge is biased (can you find the bias?). But its variance is smaller than that of OLS.\n\n* Ridge regression does not perform variable selection.\n\n* But [picking]{.hand} $\lambda=3.7$ and thereby [deciding]{.hand} to predict with $\widehat{\beta}^R_{3.7}$ is [model selection]{.secondary}.\n\n\n\n## Can we get the best of both worlds?\n\nRidge regression \n: $\minimize \frac{1}{2n}\Vert\y-\X\beta\Vert_2^2 \ \st\ \snorm{\beta}_2^2 \leq s$ \n\nBest (in-sample) linear regression model of size $s$\n: $\minimize \frac{1}{2n}\snorm{\y-\X\beta}_2^2 \ \st\ \snorm{\beta}_0 \leq s$\n\n\n$||\beta||_0$ is the number of nonzero elements in $\beta$\n\nFinding the best in-sample linear model (of size $s$, among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard)\n\nRidge regression is convex (easy to solve), but doesn't do __variable__ selection\n\nCan we somehow \"interpolate\" to get both?\n\n\nNote: selecting $\lambda$ is still __model__ selection, but we've included __all__ the variables.\n\n\n## Ridge theory\n\nRecalling that $\beta^\top_*x$ is the best linear approximation to $f_*(x)$\n\nIf $\norm{x}_\infty< r$, [@HsuKakade2014],\n$$R(\hat\beta_\lambda) - R(\beta_*) \leq \left(1+ O\left(\frac{1+r^2/\lambda}{n}\right)\right)\n\frac{\lambda\norm{\beta_*}_2^2}{2} + \frac{\sigma^2\tr{\Sigma}}{2n\lambda}$$\n\n\nOptimizing over $\lambda$, and setting $B=\norm{\beta_*}$ gives\n\n$$R(\hat\beta_\lambda) - R(\beta_*) \leq \sqrt{\frac{\sigma^2r^2B^2}{n}\left(1+O(1/n)\right)} + \nO\left(\frac{r^2B^2}{n}\right)$$\n\nMeanwhile, no estimator can do better than the minimax lower bound\n\n$$\inf_{\hat\beta}\sup_{\beta_*} R(\hat\beta) - R(\beta_*) \geq C\sqrt{\frac{\sigma^2r^2B^2}{n}}$$\n\n## Ridge theory\n\nWe call this behavior _rate minimax_: essentially meaning, \n$$R(\hat\beta) - R(\beta_*) = O\left(\inf_{\hat\beta}\sup_{\beta_*} R(\hat\beta) - R(\beta_*)\right)$$\n\nIn this setting, Ridge regression does as well as we could hope, up to 
constants.\n\n## Bayes interpretation\n\nIf \n\n1. $Y=X'\\beta + \\epsilon$, \n2. $\\epsilon\\sim N(0,\\sigma^2)$ \n3. $\\beta\\sim N(0,\\tau^2 I_p)$,\n\nThen, the posterior mean (median, mode) is the ridge estimator with $\\lambda=\\sigma^2/\\tau^2$.\n\n\n# Lasso\n\n## Geometry\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(mvtnorm)\nnormBall <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- data.frame(x = cos(tg)) %>%\n mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%\n gather(key = \"lab\", value = \"y\", -x)\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n df <- expand.grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)\n )\n df$z <- dmvnorm(df, mean, Sigma)\n df\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6) {\n ed <- filter(ed, x > 0, y > 0)\n for (i in 1:20) {\n ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol\n if (sum(ff) > 0) break\n tol <- 2 * tol\n }\n best <- ed[ff, ]\n best[which.max(best$z), ]\n}\n\nnb <- normBall(1)\ned <- ellipseData()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 1)\nggplot(nb, aes(x, y)) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +\n theme_bw(base_family = \"\", base_size = 24) +\n geom_label(\n data = bols, mapping = aes(label = bquote(\"hat(beta)[ols]\")), parse = TRUE,\n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n geom_label(\n data = bhat, mapping = aes(label = bquote(\"hat(beta)[s]^L\")), parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )\n```\n\n::: 
{.cell-output-display}\n![](regularization-lm_files/figure-revealjs/ball-plotting-functions-1.svg){fig-align='center'}\n:::\n:::\n\n\n## $\\ell_1$-regularized regression\n\nKnown as \n\n* \"lasso\"\n* \"basis pursuit\"\n\nThe estimator satisfies\n\n$$\\hat\\beta_s = \\argmin_{ \\snorm{\\beta}_1 \\leq s} \\frac{1}{2n}\\snorm{\\y-\\X\\beta}_2^2$$\n\n\nIn its corresponding Lagrangian dual form:\n\n$$\\hat\\beta_\\lambda = \\argmin_{\\beta} \\frac{1}{2n}\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1$$\n\n\n## Lasso\n\nWhile the ridge solution can be easily computed \n\n$$\\argmin_{\\beta} \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_2^2 = (\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1} \\X^{\\top}\\y$$\n\n\nthe lasso solution\n\n\n$$\\argmin_{\\beta} \\frac 1n\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1 = \\; ??$$\n\ndoesn't have a closed-form solution.\n\n\nHowever, because the optimization problem is convex, there exist efficient algorithms for computing it\n\n::: aside\nThe best are Iterative Soft Thresholding or Coordinate Descent. 
Gradient Descent doesn't work very well in practice.\n:::\n\n\n## Coefficient path: ridge vs lasso\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(glmnet)\ndata(prostate, package = \"ElemStatLearn\")\nX <- prostate |> dplyr::select(-train, -lpsa) |> as.matrix()\nY <- prostate$lpsa\nlasso <- glmnet(x = X, y = Y) # alpha = 1 by default\nridge <- glmnet(x = X, y = Y, alpha = 0)\nop <- par()\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Additional intuition for why Lasso selects variables\n\nSuppose, for a particular $\\lambda$, I have solutions for $\\widehat{\\beta}_k$, $k = 1,\\ldots,j-1, j+1,\\ldots,p$.\n\nLet $\\widehat{\\y}_{-j} = \\X_{-j}\\widehat{\\beta}_{-j}$, and assume WLOG $\\overline{\\X}_k = 0$, $\\X_k^\\top\\X_k = 1\\ \\forall k$\n\nOne can show that:\n\n$$\n\\widehat{\\beta}_j = S\\left(\\mathbf{X}^\\top_j(\\y - \\widehat{\\y}_{-j}),\\ \\lambda\\right).\n$$\n\n$$\nS(z, \\gamma) = \\textrm{sign}(z)(|z| - \\gamma)_+ = \\begin{cases} z - \\gamma & z > \\gamma\\\\\nz + \\gamma & z < -\\gamma \\\\ 0 & |z| \\leq \\gamma \\end{cases}\n$$\n\n* Iterating over this is called [coordinate descent]{.secondary} and gives the solution\n\n::: aside\nSee for example, \n:::\n\n\n::: notes\n* If I were told all the other coefficient estimates.\n* Then to find this one, I'd shrink when the gradient is big, or set to 0 if it\ngets too small.\n:::\n\n## `{glmnet}` version (same procedure for lasso or ridge)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"1|2|3|4|5|\"}\nlasso <- cv.glmnet(X, Y) # 1. Estimate the full model and CV; no good reason to call glmnet() itself\n# 2. Look at the CV curve. 
If the dashed lines are at the boundaries, redo and adjust lambda\nlambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)\ncoeffs <- coefficients(lasso, s = \"lambda.min\") # s can be string or a number\npreds <- predict(lasso, newx = X, s = \"lambda.1se\") # must supply `newx`\n```\n:::\n\n\n* $\\widehat{R}_{CV}$ is an estimator of $R_n$; it has bias and variance\n* Because we did CV, we actually have 10 $\\widehat{R}$ values, 1 per split.\n* Calculate the mean (that's what we've been using), but what about the SE?\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](regularization-lm_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other flavours\n\nThe elastic net\n: combines the ridge and lasso penalties; generally used for correlated variables. Use `glmnet(..., alpha = a)` (0 < `a` < 1). \n\nGrouped lasso\n: where variables are included or excluded in groups. Required for factors (1-hot encoding)\n\nRelaxed lasso\n: Takes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance). Use `glmnet(..., relax = TRUE)`.\n\nDantzig selector\n: a slightly modified version of the lasso\n\n## Lasso cinematic universe\n\n::: flex\n::: w-60\n\nSCAD\n: a non-convex version of lasso that adds a more severe variable selection penalty\n\n$\\sqrt{\\textrm{lasso}}$\n: claims to be tuning parameter free (but isn't). Uses $\\Vert\\cdot\\Vert_2$\ninstead of $\\Vert\\cdot\\Vert_1$ for the loss.\n\nGeneralized lasso\n: Adds various additional matrices to the penalty term (e.g. $\\Vert D\\beta\\Vert_1$). \n\nArbitrary combinations\n: combine the above penalties in your favourite combinations\n:::\n\n::: w-40\n\n![](https://sportshub.cbsistatic.com/i/2022/08/10/d348f903-585f-4aa6-aebc-d05173761065/brett-goldstein-hercules.jpg)\n\n:::\n:::\n\n## Warnings on regularized regression\n\n1. 
This isn't a method unless you say how to choose $\\lambda$.\n1. The intercept is never penalized. Adds an extra degree-of-freedom.\n1. Predictor scaling is [very]{.secondary} important.\n1. Discrete predictors need groupings.\n1. Centering the predictors may be necessary\n1. (These all work with other likelihoods.)\n\n. . .\n\nSoftware handles most of these automatically, but not always. (No Lasso with factor predictors.)\n\n## Lasso theory under strong conditions {.smaller}\n\n[Support recovery:]{.tertiary} [@Wainwright2009], see also [@MeinshausenBuhlmann2006; @ZhaoYu2006]\n\n1. The truth is linear.\n2. $\\norm{\\X'_{S^c}\\X_S (\\X'_S\\X_S)^{-1}}_\\infty < 1-\\epsilon.$\n3. $\\lambda_{\\min} (\\X'_S\\X_S) \\geq C_{\\min} > 0$.\n4. The columns of $\\X$ have 2-norm $n$.\n5. The noise is iid Normal.\n6. $\\lambda_n$ satisfies $\\frac{n\\lambda^2}{\\log(p-s)} \\rightarrow \\infty$.\n7. $\\min_j \\{ |\\beta_j| : j \\in S\\} \\geq \\rho_n > 0$ and \n$$\\rho_n^{-1} \\left( \\sqrt{\\frac{\\log s}{n}}+ \\lambda_n\\norm{(\\X'_S\\X_S)^{-1}}_\\infty \\right)\\rightarrow 0$$\n\n\nThen, $P(\\textrm{supp}(\\hat\\beta_\\lambda) = \\textrm{supp}(\\beta_*))\\rightarrow 1$.\n\n## Lasso theory under strong conditions {.smaller}\n\n[Estimation consistency:]{.tertiary} [@negahban2010unified] also [@MeinshausenYu2009]\n\n1. The truth is linear.\n2. $\\exists \\kappa$ such that for all vectors $\\theta\\in\\R^p$ that satisfy \n$\\norm{\\theta_{S^C}}_1 \\leq 3\\norm{\\theta_S}_1$, we have $\\norm{X\\theta}_2^2/n \\geq \\kappa\\norm{\\theta}_2^2$ (Compatibility)\n3. The columns of $\\X$ have 2-norm $n$.\n4. The noise is iid sub-Gaussian.\n5. $\\lambda_n >4\\sigma \\sqrt{\\log (p)/n}$.\n\nThen, with probability at least $1-c\\exp(-c'n\\lambda_n^2)$, \n$$\\norm{\\hat\\beta_\\lambda-\\beta_*}_2^2 \\leq \\frac{64\\sigma^2}{\\kappa^2}\\frac{s\\log p}{n}.$$\n\n::: {.callout-important}\nThese conditions are very strong, uncheckable in practice, unlikely to be true for real datasets. 
But theory of this type is the standard for these procedures.\n:::\n\n## Lasso under weak / no conditions\n\nIf $Y$ and $X$ are bounded by $B$, then with probability at least $1-\\delta^2$,\n$$R_n(\\hat\\beta_\\lambda) - R_n(\\beta_*) \\leq \\sqrt{\\frac{16(t+1)^4B^2}{n}\\log\\left(\\frac{\\sqrt{2}p}{\\delta}\\right)}.$$\n\n\nThis is a simple version of a result in [@GreenshteinRitov2004].\n\nNote that it applies to the constrained version.\n\n[@bartlett2012] derives the same rate for the Lagrangian version\n\nAgain, this rate is (nearly) optimal:\n$$c\\sqrt{\\frac{s}{n}} < R_n(\\hat\\beta_\\lambda) - R_n(\\beta_*) < C\\sqrt{\\frac{s\\log p}{n}}.$$\n\n\n$\\log p$ is the penalty you pay for selection.\n\n\n\n\n## References", "supporting": [ "regularization-lm_files" ], diff --git a/_freeze/schedule/slides/syllabus/execute-results/html.json b/_freeze/schedule/slides/syllabus/execute-results/html.json index 2b787ae..f75984a 100644 --- a/_freeze/schedule/slides/syllabus/execute-results/html.json +++ b/_freeze/schedule/slides/syllabus/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "2dbb1f6a639e04c9de75a1818bccb758", "result": { - "markdown": "---\nlecture: \"Introduction and Second half pivot\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ 
#2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n## About me\n\n\n::::{.columns}\n:::{.column width=\"50%\"}\n- Daniel J. McDonald\n- daniel@stat.ubc.ca \n- \n- Associate Professor, Department of Statistics \n:::\n:::{.column width=\"50%\"}\n* Moved to UBC in mid-March 2020, 2 days before the border closed\n* Previously a Stats Prof at Indiana University for 8 years\n:::\n::::\n\n. . .\n\n![](https://weareiu.com/wp-content/uploads/2018/12/map-1.png){fig-align=\"center\" width=\"50%\"}\n\n\n## No More Canvas!!\n\nSee the website: \n\n\n\n
\n
\n
\n\nYou'll find\n\n* announcements\n* schedule\n* lecture slides / notes\n\n(Grades still on Canvas)\n\n\n\n## Course communication\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n### Website:\n\n\n\n* Hosted on GitHub.\n\n* Links to slides and all materials\n\n* Syllabus is there. Be sure to read it. (same idea as before)\n\n\n### Slack:\n\n* This is our discussion board.\n\n* Note that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\n\n* We'll use a Channel in the UBC-Stat Workspace\n:::\n\n::: {.column width=\"50%\"}\n\n### Github organization\n\n* Linked from the website.\n\n* This is where you complete/submit assignments/projects/in-class-work\n\n* This is also hosted on US servers \n:::\n::::\n\n## Why these?\n\n1. Yes, some data is hosted on servers in the US.\n1. But in the real world, no one uses Canvas/Piazza, so why not learn things they do use?\n1. Canvas is dumb and hard to organize.\n1. GitHub is free and actually useful.\n1. Much easier to communicate, \"grade\" or comment on your work\n1. Much more DS friendly\n1. Note that MDS uses both of these, the department uses both, etc.\n1. More on all this later.\n\n. . .\n\nSlack help from MDS --- [features](https://ubc-mds.github.io/resources_pages/slack/) and [rules](https://ubc-mds.github.io/resources_pages/slack_asking_for_help/)\n\n\n## What are the goals of Stat 550?\n\n
\n\n### 1. Prepare you to do the consulting practicum (Stat 551)\n\n\n

\n\n### 2. You're a captive audience, so I can teach you some skills you'll need for \n \n* MSc Thesis/Project or PhD research\n* Employment in Data Science / Statistics.\n* These are often things that will help with the first as well\n\n## 1. Prepare you for the consulting practicum (Stat 551)\n\n* understand how the data was collected\n\n* implications of the collection process for analysis\n\n* organize data for analysis\n\n* determine appropriate methods for analysis that [answer's the client's questions]{.secondary}\n\n* interpret the results\n\n* present and communicate the results\n\n::: {.notes}\n* In most courses you get nice clean data. Getting to \"nice clean data\" is non-trivial\n* In most courses things are \"IID\", negligible missingness\n* Usually, the question is formed in statistical langauge, here, you are responsible for \"translating\"\n* Interpretation has to be \"translated back\"\n* Presentation skills --- important everywhere\n:::\n\n## 2. Some skills you'll need\n\n* Version control\n* Reproducible reports\n* Writing experience: genre is important\n* Presentation skills\n* Better coding practice\n* Documentation\n\n## Computing\n\n* All work done in R/RMarkdown. \n* No you can't use Python. Or Stata or SPSS.\n* No you can't use Jupyter Notebooks.\n* All materials on Github. \n* You will learn to use Git/GitHub/RStudio/Rmarkdown.\n* Slack for discussion/communication\n\n## Getting setup\n\n1. Add to Slack Channel: \n2. Github account: \n3. Add to the Github Org --- tell me your account\n4. 
RStudio synchronization\n\n\n", + "markdown": "---\nlecture: \"Introduction and Second half pivot\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n## About me\n\n\n::::{.columns}\n:::{.column width=\"50%\"}\n- Daniel J. McDonald\n- daniel@stat.ubc.ca \n- \n- Associate Professor, Department of Statistics \n:::\n:::{.column width=\"50%\"}\n* Moved to UBC in mid-March 2020, 2 days before the border closed\n* Previously a Stats Prof at Indiana University for 8 years\n:::\n::::\n\n. . .\n\n![](https://weareiu.com/wp-content/uploads/2018/12/map-1.png){fig-align=\"center\" width=\"50%\"}\n\n\n## No More Canvas!!\n\nSee the website: \n\n\n\n
\n
\n
\n\nYou'll find\n\n* announcements\n* schedule\n* lecture slides / notes\n\n(Grades still on Canvas)\n\n\n\n## Course communication\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n### Website:\n\n\n\n* Hosted on GitHub.\n\n* Links to slides and all materials\n\n* Syllabus is there. Be sure to read it. (same idea as before)\n\n\n### Slack:\n\n* This is our discussion board.\n\n* Note that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\n\n* We'll use a Channel in the UBC-Stat Workspace\n:::\n\n::: {.column width=\"50%\"}\n\n### Github organization\n\n* Linked from the website.\n\n* This is where you complete/submit assignments/projects/in-class-work\n\n* This is also hosted on US servers \n:::\n::::\n\n## Why these?\n\n1. Yes, some data is hosted on servers in the US.\n1. But in the real world, no one uses Canvas/Piazza, so why not learn things they do use?\n1. Canvas is dumb and hard to organize.\n1. GitHub is free and actually useful.\n1. Much easier to communicate, \"grade\" or comment on your work\n1. Much more DS friendly\n1. Note that MDS uses both of these, the department uses both, etc.\n1. More on all this later.\n\n. . .\n\nSlack help from MDS --- [features](https://ubc-mds.github.io/resources_pages/slack/) and [rules](https://ubc-mds.github.io/resources_pages/slack_asking_for_help/)\n\n\n## What are the goals of Stat 550?\n\n
\n\n### 1. Prepare you to do the consulting practicum (Stat 551)\n\n\n

\n\n### 2. You're a captive audience, so I can teach you some skills you'll need for \n \n* MSc Thesis/Project or PhD research\n* Employment in Data Science / Statistics.\n* These are often things that will help with the first as well\n\n## 1. Prepare you for the consulting practicum (Stat 551)\n\n* understand how the data was collected\n\n* implications of the collection process for analysis\n\n* organize data for analysis\n\n* determine appropriate methods for analysis that [answer the client's questions]{.secondary}\n\n* interpret the results\n\n* present and communicate the results\n\n::: {.notes}\n* In most courses you get nice clean data. Getting to \"nice clean data\" is non-trivial\n* In most courses things are \"IID\", negligible missingness\n* Usually, the question is formed in statistical language; here, you are responsible for \"translating\"\n* Interpretation has to be \"translated back\"\n* Presentation skills --- important everywhere\n:::\n\n## 2. Some skills you'll need\n\n* Version control\n* Reproducible reports\n* Writing experience: genre is important\n* Presentation skills\n* Better coding practice\n* Documentation\n\n## Computing\n\n* All work done in R/RMarkdown. \n* No, you can't use Python. Or Stata or SPSS.\n* No, you can't use Jupyter Notebooks.\n* All materials on Github. \n* You will learn to use Git/GitHub/RStudio/Rmarkdown.\n* Slack for discussion/communication\n\n## Getting set up\n\n1. Add to Slack Channel: \n2. Github account: \n3. Add to the Github Org --- tell me your account\n4. 
RStudio synchronization\n\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/time-series/execute-results/html.json b/_freeze/schedule/slides/time-series/execute-results/html.json index 91ae62c..9fecb7d 100644 --- a/_freeze/schedule/slides/time-series/execute-results/html.json +++ b/_freeze/schedule/slides/time-series/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "3b6dd1150d54fab497660b3d66b6df06", "result": { - "markdown": "---\nlecture: \"Time series, a whirlwind\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n## The general linear process\n\n* Imagine that there is a noise process\n\n$$\\epsilon_j \\sim \\textrm{N}(0, 1),\\ \\textrm{i.i.d.}$$\n\n* At time $i$, we observe the sum of all past noise\n\n$$y_i = \\sum_{j=-\\infty}^0 a_{i+j} \\epsilon_j$$\n\n* Without some conditions 
on $\\{a_k\\}_{k=-\\infty}^0$ this process will \"run away\"\n* The result is \"non-stationary\" and difficult to analyze.\n* Stationary means (roughly) that the marginal distribution of $y_i$ does not change with $i$.\n\n## Chasing stationarity\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 1000\nnseq <- 5\ngenerate_ar <- function(n, b) {\n y <- double(n)\n y[1] <- rnorm(1)\n for (i in 2:n) y[i] <- b * y[i - 1] + rnorm(1)\n tibble(time = 1:n, y = y)\n}\nstationary <- map(1:nseq, ~ generate_ar(n, .99)) |> list_rbind(names_to = \"id\")\nnon_stationary <- map(1:nseq, ~ generate_ar(n, 1.01)) |>\n list_rbind(names_to = \"id\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Uses of stationarity\n\n* Lots of types (weak, strong, in-mean, wide-sense,...)\n* not required for modelling / forecasting\n* But assuming stationarity gives some important guarantees\n* Usually work with stationary processes\n\n## Standard models {.smaller}\n\n### AR(p)\n\nSuppose $\\epsilon_i$ are i.i.d. N(0, 1) (distn is convenient, but not required)\n\n$$y_i = \\mu + a_1 y_{i-1} + \\cdots + a_p y_{i-p} + \\epsilon_i$$\n\n* This is a special case of the general linear process\n* You can recursively substitute this defn into itself to get that equation\n\nEasy to estimate the `a`'s given a realization. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ar = c(.7, -.1)), n = 1000)\nY <- y[3:1000]\nX <- cbind(lag1 = y[2:999], lag2 = y[1:998])\nsummary(lm(Y ~ X + 0))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Y ~ X + 0)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6164 -0.6638 0.0271 0.6456 3.8367 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nXlag1 0.66931 0.03167 21.134 <2e-16 ***\nXlag2 -0.04856 0.03167 -1.533 0.126 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.9899 on 996 degrees of freedom\nMultiple R-squared: 0.4085,\tAdjusted R-squared: 0.4073 \nF-statistic: 344 on 2 and 996 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n## AR(p)\n\n* The estimate isn't [that]{.secondary} accurate because the residuals (not the $\\epsilon$'s) are correlated. \n* (Usually, you get `1/n` convergence, here you don't.)\n* Also, this isn't the MLE. The likelihood includes $p(y_1)$, $p(y_2 | y_1)$ which `lm()` ignored.\n* The Std. Errors are unjustified.\n* But that was easy to do.\n* The correct way is\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(2, 0, 0), include.mean = FALSE)\n\nCoefficients:\n ar1 ar2\n 0.6686 -0.0485\ns.e. 0.0316 0.0316\n\nsigma^2 estimated as 0.9765: log likelihood = -1407.34, aic = 2820.67\n```\n:::\n:::\n\n\n* The resulting estimates and SEs are identical, AFAICS.\n\n## MA(q)\n\nStart with the general linear process, but truncate the infinite sum.\n\n$$y_i = \\sum_{j=-q}^0 a_{i+j} \\epsilon_j$$\n\n* This is termed a \"moving average\" process.\n* though $a_0 + \\cdots a_{-q}$ don't sum to 1.\n* Can't write this easily as a `lm()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ma = c(.9, .6, .1)), n = 1000)\narima(y, c(0, 0, 3), include.mean = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(0, 0, 3), include.mean = FALSE)\n\nCoefficients:\n ma1 ma2 ma3\n 0.9092 0.6069 0.1198\ns.e. 
0.0313 0.0380 0.0311\n\nsigma^2 estimated as 0.8763: log likelihood = -1353.41, aic = 2714.82\n```\n:::\n:::\n\n\n## MA(q) as an AR(1) hidden process\n\nLet $X_j = [\\epsilon_{j-1},\\ \\ldots,\\ \\epsilon_{j-q}]$ and write\n\n$$\n\\begin{aligned}\nX_i &= \\begin{bmatrix} a_{i-1} & a_{i-2} & \\cdots & a_{i-q}\\\\ 1 & 0 & \\cdots & 0\\\\ & & \\ddots \\\\ 0 & 0 & \\cdots & 1\\end{bmatrix} X_{i-1} + \n\\begin{bmatrix} a_{i}\\\\ 0 \\\\ \\vdots \\\\ 0 \\end{bmatrix} \\epsilon_i\\\\\ny_i &= \\begin{bmatrix} 1 & 0 & \\cdots 0 \\end{bmatrix} X_i\n\\end{aligned}\n$$\n\n* Now $X$ is a $q$-dimensional AR(1) (but we don't see it)\n* $y$ is deterministic conditional on $X$\n* This is the usual way these are estimated using a State-Space Model\n* Many time series models have multiple equivalent representations\n\n## AR[I]{.secondary}MA\n\n* We've been using `arima()` and `arima.sim()`, so what is left?\n* The \"I\" means \"integrated\"\n* If, for example, we can write $z_i = y_i - y_{i-1}$ and $z$ follows an ARMA(p, q), we say $y$ follows an ARIMA(p, 1, q).\n* The middle term is the degree of differencing\n\n## Other standard models\n\nSuppose we can write \n\n$$\ny_i = T_i + S_i + W_i\n$$\n\nThis is the \"classical\" decomposition of $y$ into a Trend + Seasonal + Noise.\n\nYou can estimate this with a \"Basic Structural Time Series Model\" using `StrucTS()`.\n\nA related, though slightly different model is called the STL decomposition, estimated with `stl()`.\n\nThis is \"Seasonal Decomposition of Time Series by Loess\"\n\n(LOESS is \"locally estimated scatterplot smoothing\" named/proposed independently by Bill Cleveland though originally proposed about 15 years earlier and called the Savitsky-Golay Filter)\n\n## Quick example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsts <- StructTS(AirPassengers)\nbc <- stl(AirPassengers, \"periodic\") # use sin/cos to represent the seasonal\ntibble(\n time = seq(as.Date(\"1949-01-01\"), as.Date(\"1960-12-31\"), by = 
\"month\"),\n AP = AirPassengers, \n StrucTS = fitted(sts)[, 1], \n STL = rowSums(bc$time.series[, 1:2])\n) |>\n pivot_longer(-time) |>\n ggplot(aes(time, value, color = name)) +\n geom_line() +\n theme_bw(base_size = 24) +\n scale_color_viridis_d(name = \"\") +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Generic state space model\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n


\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\n:::\n::: {.column width=\"50%\"}\n\n* $x_k$ is unobserved, dimension $n$\n\n* $y_k$ is observed, dimension $m$\n\n* $x$ process is the [transition]{.tertiary} or [process]{.tertiary} equation\n\n* $y$ is the [observation]{.tertiary} or [measurement]{.tertiary} equation\n\n* Both are probability distributions that can depend on parameters $\\theta$\n\n* For now, assume $\\theta$ is [KNOWN]{.secondary}\n\n* We can allow the densities to vary with time.\n\n:::\n::::\n\n\n\n## GOAL(s)\n\n1. Filtering: given observations, find $$p(x_k | y_1,\\ldots y_k)$$\n\n1. Smoothing: given observations, find $$p(x_k | y_1,\\ldots y_T), \\;\\;\\ k < T$$\n\n1. Forecasting: given observations, find $$p(y_{k+1} | y_1,\\ldots,y_k)$$\n\n\n## Using Bayes Rule\n\nAssume $p(x_0)$ is known\n\n$$\n\\begin{aligned}\np(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T) &= \\prod_{k=1}^T p(y_k | x_k)\\\\\np(x_0,\\ldots,x_T) &= p(x_0) \\prod_{k=1}^T p(x_k | x_{k-1})\\\\\np(x_0,\\ldots,x_T\\ |\\ y_1,\\ldots,y_T) &= \\frac{p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)}{p(y_1,\\ldots,y_T)}\\\\ &\\propto p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)\\end{aligned}\n$$\n\nIn principle, if things are nice, you can compute this posterior (thinking of $x$ as unknown parameters)\n\nBut in practice, computing a big multivariate posterior like this is computationally ill-advised.\n\n\n\n## Generic filtering\n\n* Recursively build up $p(x_k | y_1,\\ldots y_k)$.\n\n* Why? 
Because if we're collecting data in real time, this is all we need to make forecasts for future data.\n\n$$\\begin{aligned} &p(y_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1}, y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | x_T) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n* Can continue to iterate if I want to predict $h$ steps ahead\n\n$$\\begin{aligned} &p(y_{T+h} | y_1,\\ldots,y_T)= p(y_{T+h} | x_{T+h} )\\prod_{j=0}^{h-1} p(x_{T+j+1} | x_{T+j}) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n\n\n## The filtering recursion\n\n1. Initialization. Fix $p(x_0)$.\n\nIterate the following for $k=1,\\ldots,T$:\n\n2. Predict. $$p(x_k | y_{k-1}) = \\int p(x_k | x_{k-1}) p(x_{k-1} | y_1,\\ldots, y_{k-1})dx_{k-1}.$$\n\n3. Update. $$p(x_k | y_1,\\ldots,y_k) = \\frac{p(y_k | x_k)p(x_k | y_1,\\ldots,y_{k-1})}{p(y_1,\\ldots,y_k)}$$\n\n\nIn general, this is somewhat annoying because these integrals may be challenging to solve.\n\nBut with some creativity, we can use Monte Carlo for everything.\n\n\n\n## What if we make lots of assumptions?\n\nAssume that $$\\begin{aligned}p(x_0) &= N(m_0, P_0) \\\\ p_k(x_k\\ |\\ x_{k-1}) &= N(A_{k-1}x_{k-1},\\ Q_{k-1})\\\\ p_k(y_k\\ |\\ x_k) &= N(H_k x_k,\\ R_k)\\end{aligned}.$$\n\nThen [all the ugly integrals have closed-form representations]{.secondary} by properties of conditional Gaussian distributions.\n\n## Closed-form representations\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\nDistributions:\n\n$$\n\\begin{aligned}\np(x_k | y_1,\\ldots,y_{k-1}) &= N(m^{-}_k, P^{-}_k)\\\\\np(x_k | y_1,\\ldots,y_{k}) &= N(m_k, P_k)\\\\\np(y_{k} | y_1,\\ldots,y_{k-1}) &= N(H_k m^-_k, S_k)\\\\\n\\end{aligned}\n$$\nPrediction:\n$$\n\\begin{aligned}\nm^-_k &= A_{k-1}m_{k-1}\\\\\nP^-_k &= A_{k-1}P_{k-1}A^\\mathsf{T}_{k-1} + Q_{k-1}\n\\end{aligned}\n$$\n\n:::\n::: {.column width=\"50%\"}\n\nUpdate:\n$$\n\\begin{aligned}\nv_k &= y_k - H_k m_k^-\\\\\nS_k &= H_k P_k^- H_k^\\mathsf{T} + 
R_k\\\\\nK_k &= P^-_k H_k^\\mathsf{T} S_k^{-1}\\\\\nm_k &= m^-_k + K_{k}v_{k}\\\\\nP_k &= P^-_k - K_k S_k K_k^\\mathsf{T}\n\\end{aligned}\n$$\n\n:::\n::::\n\n\n\n## Code or it isn't real (Kalman Filter)\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nkalman <- function(y, m0, P0, A, Q, H, R) {\n n <- length(y)\n m <- double(n + 1)\n P <- double(n + 1)\n m[1] <- m0\n P[1] <- P0\n for (k in seq(n)) {\n mm <- A * m[k]\n Pm <- A * P[k] * A + Q\n v <- y[k] - H * mm\n S <- H * Pm * H + R\n K <- Pm * H / S\n m[k + 1] <- mm + K * v\n P[k + 1] <- Pm - K * S * K\n }\n tibble(t = 1:n, m = m[-1], P = P[-1])\n}\n\nset.seed(2022 - 06 - 01)\nx <- double(100)\nfor (k in 2:100) x[k] <- x[k - 1] + rnorm(1)\ny <- x + rnorm(100, sd = 1)\nkf <- kalman(y, 0, 5, 1, 1, 1, 1)\n```\n:::\n\n\n:::\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n::::\n\n\n## Important notes\n\n* So far, we assumed all parameters were known.\n* In reality, we had 6: `m0, P0, A, Q, H, R`\n* I sort of also think of `x` as \"parameters\" in the Bayesian sense\n* By that I mean, \"latent variables for which we have prior distributions\"\n* What if we want to estimate them?\n\n[Bayesian way]{.tertiary}: `m0` and `P0` are already the parameters of the prior on `x1`. Put priors on the other 4.\n\n[Frequentist way]{.tertiary}: Just maximize the likelihood. Can technically take \n`P0` $\\rightarrow\\infty$ to remove it and `m0`\n\n. . .\n\nThe likelihood is produced as a by-product of the Kalman filter. \n\n$$-\\ell(\\theta) = \\sum_{k=1}^T \\left(v_k^\\mathsf{T}S_k^{-1}v_k + \\log |S_k| + m \\log 2\\pi\\right)$$\n\n\n\n## Smoothing \n\n* We also want $p(x_k | y_1,\\ldots,y_{T})$\n* Filtering went \"forward\" in time. At the end we got \n$p(x_T | y_1,\\ldots,y_{T})$. 
Smoothing starts there and goes \"backward\"\n* For \"everything linear Gaussian\", this is again \"easy\"\n* Set $m_T^s = m_T$, $P_T^s = P_T$. \n* For $k = T-1,\\ldots,1$, \n\n\n$$\\begin{aligned}\nG_k &= P_k A_k^\\mathsf{T} [P_{k+1}^-]^{-1}\\\\\nm_k^s &= m_k + G_k(m_{k+1}^s - m_{k+1}^-)\\\\\nP_k^s &= P_k + G_k(P_{k+1}^s - P_{k+1}^-)G_k^\\mathsf{T}\\\\\nx_k | y_1,\\ldots,y_T &= N(m^s_k, P_k^s)\n\\end{aligned}$$\n\n\n## Comparing the filter and the smoother\n\n* Same data, different code (using a package)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(FKF)\nfilt <- fkf(\n a0 = 0, P0 = matrix(5), dt = matrix(0), ct = matrix(0),\n Tt = matrix(1), Zt = matrix(1), HHt = matrix(1), GGt = matrix(1),\n yt = matrix(y, ncol = length(y))\n)\nsmo <- fks(filt)\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-smooth-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## What about non-linear and/or non-Gaussian\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\nThen we need to solve integrals. This is a pain. We approximate them. 
\n\nThese all give [approximations to the filtering distribution]{.secondary}\n\n* Extended Kalman filter - basically do a Taylor approximation, then do Kalman like\n* Uncented Kalman filter - Approximate integrals with Sigma points\n* Particle filter - Sequential Monte Carlo\n* Bootstrap filter (simple version of SMC)\n* Laplace Gaussian filter - Do a Laplace approximation to the distributions\n\n\n## The bootstrap filter\n\n* Need to **simulate** from the transition distribution (`rtrans`)\n* Need to **evaluate** the observation distribution (`dobs`)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nboot_filter <-\n function(y, B = 1000, rtrans, dobs, a0 = 0, P0 = 1, perturb = function(x) x) {\n n <- length(y)\n filter_est <- matrix(0, n, B)\n predict_est <- matrix(0, n, B)\n init <- rnorm(B, a0, P0)\n filter_est[1, ] <- init\n for (i in seq(n)) {\n raw_w <- dobs(y[i], filter_est[i, ])\n w <- raw_w / sum(raw_w)\n selection <- sample.int(B, replace = TRUE, prob = w)\n filter_est[i, ] <- perturb(filter_est[i, selection])\n predict_est[i, ] <- rtrans(filter_est[i, ])\n if (i < n) filter_est[i + 1, ] <- predict_est[i, ]\n }\n list(filt = filter_est, pred = predict_est)\n }\n```\n:::\n", + "markdown": "---\nlecture: \"Time series, a whirlwind\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ 
#2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n\n\n## The general linear process\n\n* Imagine that there is a noise process\n\n$$\\epsilon_j \\sim \\textrm{N}(0, 1),\\ \\textrm{i.i.d.}$$\n\n* At time $i$, we observe the sum of all past noise\n\n$$y_i = \\sum_{j=-\\infty}^0 a_{i+j} \\epsilon_j$$\n\n* Without some conditions on $\\{a_k\\}_{k=-\\infty}^0$ this process will \"run away\"\n* The result is \"non-stationary\" and difficult to analyze.\n* Stationary means (roughly) that the marginal distribution of $y_i$ does not change with $i$.\n\n## Chasing stationarity\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 1000\nnseq <- 5\ngenerate_ar <- function(n, b) {\n y <- double(n)\n y[1] <- rnorm(1)\n for (i in 2:n) y[i] <- b * y[i - 1] + rnorm(1)\n tibble(time = 1:n, y = y)\n}\nstationary <- map(1:nseq, ~ generate_ar(n, .99)) |> list_rbind(names_to = \"id\")\nnon_stationary <- map(1:nseq, ~ generate_ar(n, 1.01)) |>\n list_rbind(names_to = \"id\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Uses of stationarity\n\n* Lots of types (weak, strong, in-mean, wide-sense,...)\n* not required for modelling / forecasting\n* But assuming stationarity gives some important guarantees\n* Usually work with stationary processes\n\n## Standard models {.smaller}\n\n### AR(p)\n\nSuppose $\\epsilon_i$ are i.i.d. 
N(0, 1) (distn is convenient, but not required)\n\n$$y_i = \\mu + a_1 y_{i-1} + \\cdots + a_p y_{i-p} + \\epsilon_i$$\n\n* This is a special case of the general linear process\n* You can recursively substitute this defn into itself to get that equation\n\nEasy to estimate the `a`'s given a realization. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ar = c(.7, -.1)), n = 1000)\nY <- y[3:1000]\nX <- cbind(lag1 = y[2:999], lag2 = y[1:998])\nsummary(lm(Y ~ X + 0))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Y ~ X + 0)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6164 -0.6638 0.0271 0.6456 3.8367 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nXlag1 0.66931 0.03167 21.134 <2e-16 ***\nXlag2 -0.04856 0.03167 -1.533 0.126 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.9899 on 996 degrees of freedom\nMultiple R-squared: 0.4085,\tAdjusted R-squared: 0.4073 \nF-statistic: 344 on 2 and 996 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n## AR(p)\n\n* The estimate isn't [that]{.secondary} accurate because the residuals (not the $\\epsilon$'s) are correlated. \n* (Usually, you get `1/n` convergence, here you don't.)\n* Also, this isn't the MLE. The likelihood includes $p(y_1)$, $p(y_2 | y_1)$ which `lm()` ignored.\n* The Std. Errors are unjustified.\n* But that was easy to do.\n* The correct way is\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(2, 0, 0), include.mean = FALSE)\n\nCoefficients:\n ar1 ar2\n 0.6686 -0.0485\ns.e. 
0.0316 0.0316\n\nsigma^2 estimated as 0.9765: log likelihood = -1407.34, aic = 2820.67\n```\n:::\n:::\n\n\n* The resulting estimates and SEs are identical, AFAICS.\n\n## MA(q)\n\nStart with the general linear process, but truncate the infinite sum.\n\n$$y_i = \\sum_{j=-q}^0 a_{i+j} \\epsilon_j$$\n\n* This is termed a \"moving average\" process.\n* though the weights $a_0 + \\cdots + a_{-q}$ need not sum to 1.\n* Can't write this easily as an `lm()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ma = c(.9, .6, .1)), n = 1000)\narima(y, c(0, 0, 3), include.mean = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(0, 0, 3), include.mean = FALSE)\n\nCoefficients:\n ma1 ma2 ma3\n 0.9092 0.6069 0.1198\ns.e. 0.0313 0.0380 0.0311\n\nsigma^2 estimated as 0.8763: log likelihood = -1353.41, aic = 2714.82\n```\n:::\n:::\n\n\n## MA(q) as an AR(1) hidden process\n\nLet $X_j = [\\epsilon_{j-1},\\ \\ldots,\\ \\epsilon_{j-q}]$ and write\n\n$$\n\\begin{aligned}\nX_i &= \\begin{bmatrix} a_{i-1} & a_{i-2} & \\cdots & a_{i-q}\\\\ 1 & 0 & \\cdots & 0\\\\ & & \\ddots \\\\ 0 & 0 & \\cdots & 1\\end{bmatrix} X_{i-1} + \n\\begin{bmatrix} a_{i}\\\\ 0 \\\\ \\vdots \\\\ 0 \\end{bmatrix} \\epsilon_i\\\\\ny_i &= \\begin{bmatrix} 1 & 0 & \\cdots & 0 \\end{bmatrix} X_i\n\\end{aligned}\n$$\n\n* Now $X$ is a $q$-dimensional AR(1) (but we don't see it)\n* $y$ is deterministic conditional on $X$\n* This is the usual way these are estimated using a State-Space Model\n* Many time series models have multiple equivalent representations\n\n## AR[I]{.secondary}MA\n\n* We've been using `arima()` and `arima.sim()`, so what is left?\n* The \"I\" means \"integrated\"\n* If, for example, we can write $z_i = y_i - y_{i-1}$ and $z$ follows an ARMA(p, q), we say $y$ follows an ARIMA(p, 1, q).\n* The middle term is the degree of differencing\n\n## Other standard models\n\nSuppose we can write \n\n$$\ny_i = T_i + S_i + W_i\n$$\n\nThis is the \"classical\" 
decomposition of $y$ into a Trend + Seasonal + Noise.\n\nYou can estimate this with a \"Basic Structural Time Series Model\" using `StructTS()`.\n\nA related, though slightly different, model is the STL decomposition, estimated with `stl()`.\n\nThis is \"Seasonal Decomposition of Time Series by Loess\".\n\n(LOESS is \"locally estimated scatterplot smoothing\", named and proposed independently by Bill Cleveland, though essentially the same idea appeared about 15 years earlier as the Savitzky-Golay filter)\n\n## Quick example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsts <- StructTS(AirPassengers)\nbc <- stl(AirPassengers, \"periodic\") # use sin/cos to represent the seasonal\ntibble(\n time = seq(as.Date(\"1949-01-01\"), as.Date(\"1960-12-31\"), by = \"month\"),\n AP = AirPassengers, \n StrucTS = fitted(sts)[, 1], \n STL = rowSums(bc$time.series[, 1:2])\n) |>\n pivot_longer(-time) |>\n ggplot(aes(time, value, color = name)) +\n geom_line() +\n theme_bw(base_size = 24) +\n scale_color_viridis_d(name = \"\") +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Generic state space model\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n
\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\n:::\n::: {.column width=\"50%\"}\n\n* $x_k$ is unobserved, dimension $n$\n\n* $y_k$ is observed, dimension $m$\n\n* $x$ process is the [transition]{.tertiary} or [process]{.tertiary} equation\n\n* $y$ is the [observation]{.tertiary} or [measurement]{.tertiary} equation\n\n* Both are probability distributions that can depend on parameters $\\theta$\n\n* For now, assume $\\theta$ is [KNOWN]{.secondary}\n\n* We can allow the densities to vary with time.\n\n:::\n::::\n\n\n\n## GOAL(s)\n\n1. Filtering: given observations, find $$p(x_k | y_1,\\ldots y_k)$$\n\n1. Smoothing: given observations, find $$p(x_k | y_1,\\ldots y_T), \\;\\;\\ k < T$$\n\n1. Forecasting: given observations, find $$p(y_{k+1} | y_1,\\ldots,y_k)$$\n\n\n## Using Bayes Rule\n\nAssume $p(x_0)$ is known\n\n$$\n\\begin{aligned}\np(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T) &= \\prod_{k=1}^T p(y_k | x_k)\\\\\np(x_0,\\ldots,x_T) &= p(x_0) \\prod_{k=1}^T p(x_k | x_{k-1})\\\\\np(x_0,\\ldots,x_T\\ |\\ y_1,\\ldots,y_T) &= \\frac{p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)}{p(y_1,\\ldots,y_T)}\\\\ &\\propto p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)\\end{aligned}\n$$\n\nIn principle, if things are nice, you can compute this posterior (thinking of $x$ as unknown parameters)\n\nBut in practice, computing a big multivariate posterior like this is computationally ill-advised.\n\n\n\n## Generic filtering\n\n* Recursively build up $p(x_k | y_1,\\ldots y_k)$.\n\n* Why? 
Because if we're collecting data in real time, this is all we need to make forecasts for future data.\n\n$$\\begin{aligned} &p(y_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1}, y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | x_T) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n* Can continue to iterate if I want to predict $h$ steps ahead\n\n$$\\begin{aligned} &p(y_{T+h} | y_1,\\ldots,y_T)= p(y_{T+h} | x_{T+h} )\\prod_{j=0}^{h-1} p(x_{T+j+1} | x_{T+j}) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n\n\n## The filtering recursion\n\n1. Initialization. Fix $p(x_0)$.\n\nIterate the following for $k=1,\\ldots,T$:\n\n2. Predict. $$p(x_k | y_1,\\ldots,y_{k-1}) = \\int p(x_k | x_{k-1}) p(x_{k-1} | y_1,\\ldots, y_{k-1})\\,dx_{k-1}.$$\n\n3. Update. $$p(x_k | y_1,\\ldots,y_k) = \\frac{p(y_k | x_k)p(x_k | y_1,\\ldots,y_{k-1})}{p(y_1,\\ldots,y_k)}$$\n\n\nIn general, this is somewhat annoying because these integrals may be challenging to solve.\n\nBut with some creativity, we can use Monte Carlo for everything.\n\n\n\n## What if we make lots of assumptions?\n\nAssume that $$\\begin{aligned}p(x_0) &= N(m_0, P_0) \\\\ p_k(x_k\\ |\\ x_{k-1}) &= N(A_{k-1}x_{k-1},\\ Q_{k-1})\\\\ p_k(y_k\\ |\\ x_k) &= N(H_k x_k,\\ R_k)\\end{aligned}.$$\n\nThen [all the ugly integrals have closed-form representations]{.secondary} by properties of conditional Gaussian distributions.\n\n## Closed-form representations\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\nDistributions:\n\n$$\n\\begin{aligned}\np(x_k | y_1,\\ldots,y_{k-1}) &= N(m^{-}_k, P^{-}_k)\\\\\np(x_k | y_1,\\ldots,y_{k}) &= N(m_k, P_k)\\\\\np(y_{k} | y_1,\\ldots,y_{k-1}) &= N(H_k m^-_k, S_k)\\\\\n\\end{aligned}\n$$\nPrediction:\n$$\n\\begin{aligned}\nm^-_k &= A_{k-1}m_{k-1}\\\\\nP^-_k &= A_{k-1}P_{k-1}A^\\mathsf{T}_{k-1} + Q_{k-1}\n\\end{aligned}\n$$\n\n:::\n::: {.column width=\"50%\"}\n\nUpdate:\n$$\n\\begin{aligned}\nv_k &= y_k - H_k m_k^-\\\\\nS_k &= H_k P_k^- H_k^\\mathsf{T} + 
R_k\\\\\nK_k &= P^-_k H_k^\\mathsf{T} S_k^{-1}\\\\\nm_k &= m^-_k + K_{k}v_{k}\\\\\nP_k &= P^-_k - K_k S_k K_k^\\mathsf{T}\n\\end{aligned}\n$$\n\n:::\n::::\n\n\n\n## Code or it isn't real (Kalman Filter)\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nkalman <- function(y, m0, P0, A, Q, H, R) {\n n <- length(y)\n m <- double(n + 1)\n P <- double(n + 1)\n m[1] <- m0\n P[1] <- P0\n for (k in seq(n)) {\n mm <- A * m[k]\n Pm <- A * P[k] * A + Q\n v <- y[k] - H * mm\n S <- H * Pm * H + R\n K <- Pm * H / S\n m[k + 1] <- mm + K * v\n P[k + 1] <- Pm - K * S * K\n }\n tibble(t = 1:n, m = m[-1], P = P[-1])\n}\n\nset.seed(2022 - 06 - 01)\nx <- double(100)\nfor (k in 2:100) x[k] <- x[k - 1] + rnorm(1)\ny <- x + rnorm(100, sd = 1)\nkf <- kalman(y, 0, 5, 1, 1, 1, 1)\n```\n:::\n\n\n:::\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n::::\n\n\n## Important notes\n\n* So far, we assumed all parameters were known.\n* In reality, we had 6: `m0, P0, A, Q, H, R`\n* I sort of also think of `x` as \"parameters\" in the Bayesian sense\n* By that I mean, \"latent variables for which we have prior distributions\"\n* What if we want to estimate them?\n\n[Bayesian way]{.tertiary}: `m0` and `P0` are already the parameters of the prior on `x1`. Put priors on the other 4.\n\n[Frequentist way]{.tertiary}: Just maximize the likelihood. Can technically take \n`P0` $\\rightarrow\\infty$ to remove it and `m0`.\n\n. . .\n\nThe likelihood is produced as a by-product of the Kalman filter. \n\n$$-\\ell(\\theta) = \\sum_{k=1}^T \\left(v_k^\\mathsf{T}S_k^{-1}v_k + \\log |S_k| + m \\log 2\\pi\\right)$$\n\n\n\n## Smoothing \n\n* We also want $p(x_k | y_1,\\ldots,y_{T})$\n* Filtering went \"forward\" in time. At the end we got \n$p(x_T | y_1,\\ldots,y_{T})$. 
Smoothing starts there and goes \"backward\"\n* For \"everything linear Gaussian\", this is again \"easy\"\n* Set $m_T^s = m_T$, $P_T^s = P_T$. \n* For $k = T-1,\\ldots,1$, \n\n\n$$\\begin{aligned}\nG_k &= P_k A_k^\\mathsf{T} [P_{k+1}^-]^{-1}\\\\\nm_k^s &= m_k + G_k(m_{k+1}^s - m_{k+1}^-)\\\\\nP_k^s &= P_k + G_k(P_{k+1}^s - P_{k+1}^-)G_k^\\mathsf{T}\\\\\nx_k | y_1,\\ldots,y_T &= N(m^s_k, P_k^s)\n\\end{aligned}$$\n\n\n## Comparing the filter and the smoother\n\n* Same data, different code (using a package)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(FKF)\nfilt <- fkf(\n a0 = 0, P0 = matrix(5), dt = matrix(0), ct = matrix(0),\n Tt = matrix(1), Zt = matrix(1), HHt = matrix(1), GGt = matrix(1),\n yt = matrix(y, ncol = length(y))\n)\nsmo <- fks(filt)\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-smooth-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## What about non-linear and/or non-Gaussian\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\nThen we need to solve integrals. This is a pain. We approximate them. 
\n\nThese all give [approximations to the filtering distribution]{.secondary}\n\n* Extended Kalman filter - basically do a Taylor approximation, then do Kalman-like updates\n* Unscented Kalman filter - Approximate integrals with Sigma points\n* Particle filter - Sequential Monte Carlo\n* Bootstrap filter (simple version of SMC)\n* Laplace Gaussian filter - Do a Laplace approximation to the distributions\n\n\n## The bootstrap filter\n\n* Need to **simulate** from the transition distribution (`rtrans`)\n* Need to **evaluate** the observation distribution (`dobs`)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nboot_filter <-\n function(y, B = 1000, rtrans, dobs, a0 = 0, P0 = 1, perturb = function(x) x) {\n n <- length(y)\n filter_est <- matrix(0, n, B)\n predict_est <- matrix(0, n, B)\n init <- rnorm(B, a0, P0)\n filter_est[1, ] <- init\n for (i in seq(n)) {\n raw_w <- dobs(y[i], filter_est[i, ])\n w <- raw_w / sum(raw_w)\n selection <- sample.int(B, replace = TRUE, prob = w)\n filter_est[i, ] <- perturb(filter_est[i, selection])\n predict_est[i, ] <- rtrans(filter_est[i, ])\n if (i < n) filter_est[i + 1, ] <- predict_est[i, ]\n }\n list(filt = filter_est, pred = predict_est)\n }\n```\n:::\n",
 "supporting": [
 "time-series_files"
 ],
 diff --git a/_freeze/schedule/slides/unit-tests/execute-results/html.json b/_freeze/schedule/slides/unit-tests/execute-results/html.json
index 13ae13c..08dfa06 100644
--- a/_freeze/schedule/slides/unit-tests/execute-results/html.json
+++ b/_freeze/schedule/slides/unit-tests/execute-results/html.json
@@ -1,7 +1,7 @@
 {
 "hash": "2657827e20283bbab6a8211c880d9b66",
 "result": {
- "markdown": "---\nlecture: \"Unit tests and avoiding 🪲🪲\"\nformat: \n revealjs:\n echo: true\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 
2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## I urge you to consult:\n\n[Carnegie Mellon's 36-750 Notes](https://36-750.github.io)\n\nThank you Alex and Chris for the heavy lifting.\n\n\n\n## Bugs happen. All. The. Time.\n\n* the crash of the [Mars Climate Orbiter](https://en.wikipedia.org/wiki/Mars%5FClimate%5FOrbiter) (1998),\n\n* a [failure of the national telephone network](https://telephoneworld.org/landline-telephone-history/the-crash-of-the-att-network-in-1990/) (1990),\n\n* a deadly medical device ([1985](https://en.wikipedia.org/wiki/Therac-25), 2000),\n\n* a massive [Northeastern blackout](https://en.wikipedia.org/wiki/Northeast%5Fblackout%5Fof%5F2003) (2003),\n\n* the [Heartbleed](http://heartbleed.com/), [Goto Fail](https://www.dwheeler.com/essays/apple-goto-fail.html), [Shellshock](https://en.wikipedia.org/wiki/Shellshock%5F(software%5Fbug)) exploits (2012–2014),\n\n* a 15-year-old [fMRI analysis software](http://www.pnas.org/content/113/28/7900.full) bug that inflated significance levels (2015),\n\n. . 
.\n\nIt is easy to write lots of code.\n\nBut are we sure it's doing the right things?\n\n::: {.callout-important}\nEffective testing tries to help.\n:::\n\n\n## A Common (Interactive) Workflow\n\n1. Write a function.\n1. Try some reasonable values at the REPL to check that it works.\n1. If there are problems, maybe insert some print statements, and modify the function.\n1. Repeat until things seem fine.\n\n(REPL == Read-Eval-Print-Loop, the console, or Jupyter NB)\n\n* This tends to result in lots of bugs.\n\n* Later on, you forget which values you tried, whether they failed, how you fixed them.\n\n* So you make a change and maybe or maybe not try some again.\n\n## Step 1 --- write functions\n\n::: {.callout-important appearance=\"simple\"}\nWrite functions.\n\nLots of them.\n:::\n\n👍 Functions are testable \n\n👎 Scripts are not\n\nIt's easy to alter the arguments and see \"what happens\"\n\nThere's less ability to screw up environments.\n\n. . .\n\nI'm going to mainly describe `R`, but the logic is very similar (if not the syntax) for `python`, `C++`, and `Julia`\n\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r 
.cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments\n* If you name them, you can pass them out of order (but you shouldn't).\n\n## Outputs vs. 
Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function does, outputs can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](unit-tests_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 83 138 191 191 182 74 43 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.166 0.276 0.382 0.382 0.364 0.148 0.086 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## Step 2 --- program defensively, ensure behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): 
is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n## Even better\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) cli::cli_abort(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) cli::cli_abort(\"`inc_by` must be numeric\")\n x + inc_by\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:6, \"b\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! 
`inc_by` must be numeric\n```\n:::\n:::\n\n\n\n## Step 3 --- Keep track of behaviour with tests\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) stop(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) stop(\"`inc_by` must be numeric\")\n x + inc_by\n}\ntest_that(\"incrementer validates arguments\", {\n expect_error(incrementer(\"a\"))\n expect_equal(incrementer(1:3), 2:4)\n expect_equal(incrementer(2, -3), -1)\n expect_error(incrementer(1, \"b\"))\n expect_identical(incrementer(1:3), 2:4)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n── Failure: incrementer validates arguments ────────────────────────────────────\nincrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError:\n! Test failed\n```\n:::\n:::\n\n\n\n## Integers are trouble\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\nexpect_equal(incrementer(1:3, 1), 2:4)\n```\n:::\n\n\n# Testing lingo\n\n## Unit testing\n\n* A **unit** is a small bit of code (function, class, module, group of classes)\n\n* A **test** calls the unit with a set of inputs, and checks if we get the expected output.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngcd <- function(x, na.rm = FALSE) {\n if (na.rm) x <- x[!is.na(x)]\n if (anyNA(x)) return(NA)\n stopifnot(is.numeric(x))\n if (!rlang::is_integerish(x)) cli_abort(\"`x` must contain only integers.\")\n if (length(x) == 1L) return(as.integer(x))\n x <- x[x != 0]\n compute_gcd(x) # dispatch to a C++ function\n}\n\ntest_that(\"gcd works\", {\n # corner cases\n expect_identical(gcd(c(1, NA)), 
NA)\n expect_identical(gcd(c(1, NA), TRUE), 1L)\n expect_identical(gcd(c(1, 2, 4)), 1L)\n # error\n expect_error(gcd(1.3))\n # function\n expect_identical(gcd(c(2, 4, 6)), 2L)\n expect_identical(gcd(c(2, 3, 7)), 1L)\n})\n```\n:::\n\n\n## Unit testing\n\nUnit testing consists of writing tests that are\n\n* focused on a small, low-level piece of code (a unit)\n* typically written by the programmer with standard tools\n* fast to run (so can be run often, i.e. before every commit).\n\n\n## Unit testing benefits\n\nAmong others:\n\n* Exposing problems early\n* Making it easy to change (refactor) code without forgetting pieces or breaking things\n* Simplifying integration of components\n* Providing natural documentation of what the code should do\n* Driving the design of new code.\n\n![](http://www.phdcomics.com/comics/archive/phd033114s.gif)\n\n\n## Components of a Unit Testing Framework\n\n::: flex\n::: w-70\n\n* Collection of **Assertions** executed in sequence. \n* Executed in a self-contained environment.\n* Any assertion fails ``{=html} Test fails.\n\nEach test focuses on a single component.\n\nNamed so that you know what it's doing.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## See https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life\ntest_that(\"Conway's rules are correct\", {\n # conway_rules(num_neighbors, alive?)\n expect_true(conway_rules(3, FALSE))\n expect_false(conway_rules(4, FALSE))\n expect_true(conway_rules(2, TRUE))\n ...\n})\n```\n:::\n\n:::\n\n::: w-30\n\n![](https://upload.wikimedia.org/wikipedia/commons/e/e5/Gospers_glider_gun.gif)\n\n:::\n:::\n\n\n## A test suite\n\n::: flex\n::: w-50\n* Collection of related tests in a common context.\n\n* Prepares the environment, cleans up after\n\n* (loads some data, connects to database, necessary library,...)\n\n* Test suites are run and the results reported, particularly failures, in a easy to parse and economical style. 
\n\n* For example, Python’s `{unittest}` can report like this\n\n::: \n\n::: w-50\n\n```{.bash}\n$ python test/trees_test.py -v\n\ntest_crime_counts (__main__.DataTreeTest)\nEnsure Ks are consistent with num_points. ... ok\ntest_indices_sorted (__main__.DataTreeTest)\nEnsure all node indices are sorted in increasing order. ... ok\ntest_no_bbox_overlap (__main__.DataTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.DataTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.DataTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.DataTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.DataTreeTest)\nCheck that child tranges are contained in parent tranges. ... ok\ntest_no_bbox_overlap (__main__.QueryTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.QueryTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.QueryTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.QueryTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.QueryTreeTest)\nCheck that child tranges are contained in parent tranges. ... 
ok\n\n---------------------------------------------------------\nRan 12 tests in 23.932s\n```\n\n:::\n:::\n\n\n\n## `R` example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntestthat::test_local(here::here(\"../../../../../../Delphi/smoothqr/\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n✔ | F W S OK | Context\n\n⠏ | 0 | smooth-rq \n✔ | 12 | smooth-rq\n\n══ Results ═════════════════════════════════════════════════════════════════════\n[ FAIL 0 | WARN 0 | SKIP 0 | PASS 12 ]\n```\n:::\n:::\n\n\n\n## What do I test?\n\n::: {.callout-tip icon=false}\n## Core Principle:\n\nTests should be passed by a correct function, but not by an incorrect function.\n:::\n\nThe tests must apply pressure to know if things break.\n\n* several specific inputs for which you _know_ the correct answer\n* \"edge\" cases, like a list of size zero or a matrix instead of a vector\n* special cases that the function must handle, but which you might forget about months from now\n* error cases that should throw an error instead of returning an invalid answer\n* previous bugs you’ve fixed, so those bugs never return.\n\n\n## What do I test?\n\nMake sure that incorrect functions won't pass (or at least, won't pass them all).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadd <- function(a, b) return(4)\nadd <- function(a, b) return(a * b)\n\ntest_that(\"Addition is commutative\", {\n expect_equal(add(1, 3), add(3, 1)) # both pass this !!\n expect_equal(add(2, 5), add(5, 2)) # neither passes this\n})\n```\n:::\n\n\n::: {.callout-tip}\n* Cover all branches. \n\n* Make sure there aren't branches you don't expect.\n:::\n\n## Assertions\n\n[Assertions]{.secondary} are things that must be true. Failure means \"Quit\". \n\n- There's no way to recover. 
\n- Think: passed in bad arguments.\n \n\n::: {.cell layout-align=\"center\"}\n\n```{.python .cell-code}\ndef fit(data, ...):\n\n for it in range(max_iterations):\n # iterative fitting code here\n ...\n\n # Plausibility check\n assert np.all(alpha >= 0), \"negative alpha\"\n assert np.all(theta >= 0), \"negative theta\"\n assert omega > 0, \"Nonpositive omega\"\n assert eta2 > 0, \"Nonpositive eta2\"\n assert sigma2 > 0, \"Nonpositive sigma2\"\n\n ...\n```\n:::\n\n\nThe parameters have to be positive. Negative is impossible. No way to recover.\n\n\n## Errors\n\n[Errors]{.secondary} are for unexpected conditions that _could_ be handled by the calling code.\n\n* You could perform some action to work around the error, fix it, or report it to the user.\n\n#### Example:\n\n- I give you directions to my house. You get lost. You could recover.\n- Maybe retrace your steps, see if you missed a signpost.\n- Maybe search on Google Maps to locate yourself in relation to a landmark.\n- If those fail, message me.\n- If I don't respond, get an Uber.\n- Finally, give up and go home.\n\n## Errors\n\nCode can also do this. It can `try` the function and `catch` errors to recover automatically.\n\nFor example:\n\n* Load some data from the internet. If the file doesn't exist, create some.\n\n* Run some iterative algorithm. If we haven't converged, restart from another place.\n\nCode can fix errors without user input. It can't fix assertions.\n\n* An input must be an integer. So round it, warn, and proceed rather than fail.\n\n\n## Test-driven development\n\nTest Driven Development (TDD) uses a short development cycle for each new feature or component:\n\n1. Write tests that specify the component’s desired behavior. \n The tests will initially fail because the component does not yet exist.\n\n1. Create the minimal implementation that passes the test.\n\n1. 
Refactor the code to meet design standards, running the tests with each change to ensure correctness.\n\n\n## Why work this way?\n\n* Writing the tests may help you realize \n a. what arguments the function must take, \n b. what other data it needs, \n c. and what kinds of errors it needs to handle. \n\n* The tests define a specific plan for what the function must do.\n\n* You will catch bugs at the beginning instead of at the end (or never).\n\n* Testing is part of design, instead of a lame afterthought you dread doing.\n\n\n## Rules of thumb\n\nKeep tests in separate files\n: from the code they test. This makes it easy to run them separately.\n\nGive tests names\n: Testing frameworks usually let you give the test functions names or descriptions. `test_1` doesn’t help you at all, but `test_tree_insert` makes it easy for you to remember what the test is for.\n\nMake tests replicable\n: If a test involves random data, what do you do when the test fails? You need some way to know what random values it used so you can figure out why the test fails.\n\n## Rules of thumb\n\nUse tests instead of the REPL\n: If you’re building a complicated function, write the tests in advance and use them to help you while you write the function. You'll waste time calling over and over at the REPL.\n\nAvoid testing against another's code/package\n: You don't know the ins and outs of what they do. If they change the code, your tests will fail.\n\nTest Units, not main functions\n: You should write small functions that do one thing. Test those. Don't write one huge 1000-line function and try to test that.\n\nAvoid random numbers\n: Seeds are not always portable.\n\n---\n\n::: {.callout-note}\n* `R`, use `{testthat}`. See the [Testing](http://r-pkgs.had.co.nz/tests.html) chapter from Hadley Wickham’s R Packages book.\n\n* `python` use `{pytest}`. 
A bit more user-friendly than `{unittest}`: [pytest](https://docs.pytest.org/en/latest/)\n:::\n\n\n\n## Other suggestions\n\n::: flex\n::: w-50\n[Do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n if (x < 0) stop(x, \" is not positive\")\n}\n\nfoo <- function(x) {\n if (x < 0) message(x, \" is not positive\")\n # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n if (x < 0) warning(x, \" is not positive\")\n # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n if (length(x) == 0)\n rlang::abort(\"no data\", class=\"no_input_data\")\n}\n```\n:::\n\n\nThese allow error handling.\n:::\n\n\n::: w-50\n\n[Don't do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n if (x < 0) {\n print(paste0(x, \" is not positive\"))\n return(NULL)\n }\n ...\n}\n\nfoo <- function(x) {\n if (x < 0) cat(\"uh oh.\")\n ...\n}\n```\n:::\n\n\nCan't recover.\n\nDon't know what went wrong.\n\n:::\n:::\n\n---\n\nSee [here](https://36-750.github.io/practices/errors-exceptions/) for more details.\n\nSeems like overkill, \n\nbut when you run a big simulation that takes 2 weeks, \n\nyou don't want it to die after 10 days. 
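For instance, here is a minimal Python sketch of the catch-and-retry idea, in the spirit of the `try`/`catch` discussion above. The `Flaky` worker and its every-third-call failure schedule are made-up stand-ins for one step of a long simulation:

```python
class Flaky:
    """Hypothetical stand-in for one step of a long simulation.

    Every third call raises a transient error."""

    def __init__(self):
        self.calls = 0

    def step(self):
        self.calls += 1
        if self.calls % 3 == 0:
            raise RuntimeError("transient failure")
        return 1


def run_with_retries(worker, n_steps, max_retries=3):
    """Catch errors per step and retry, instead of letting one
    transient failure kill the whole run."""
    done = 0
    for _ in range(n_steps):
        for _attempt in range(max_retries):
            try:
                done += worker.step()
                break  # this step succeeded; move on
            except RuntimeError:
                continue  # transient error: retry the step
        else:
            raise RuntimeError("step failed after all retries")
    return done


print(run_with_retries(Flaky(), 10))  # → 10
```

Each step gets a few chances before the whole run is abandoned, so a transient hiccup on day 10 doesn't throw away the first 9 days of work.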
\n\n\nYou want it to recover.\n\n\n\n# More coding details, if time.\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib)\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: 
`obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) \n\n* The convention is\nthat the specialized function is named `method.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `method.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n## Annoying example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n\n```{.r .cell-code}\nprint.lm <- function(x, ...) print(\"This is a linear model.\")\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"This is a linear model.\"\n```\n:::\n:::\n\n\n* Overwrote the method in the global environment.\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`, comes from a method\ninvented two centuries ago, or from a Batrachian Emphasis Machine\nwhich won't be invented for another five years. \n\n* The one drawback is that the help pages for the generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. 
\n\n\n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R`, code run in the Console produces objects in the \"Global environment\"\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside those package Environments.\n\nFor example `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available)\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But you can call those functions by prefixing the package name, e.g. `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::` like `dplyr:::as_across_fn_call()`\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\"\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to 
bottom, with code chunks depending on previous ones.\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 😱\n\nObjects in your local environment are not available in the `.Rmd`\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Global Environment that don't exist in the `.Rmd`\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Careful organization and relative paths will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n * Make sure you're calling the function correctly.\n * Try running the examples.\n * Paste the error into Google (this is my first step when you ask me)\n * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n * Did you make the mistake on the last slide?\n * Did it Knit before? Then the bug is in whatever you added.\n * Did you never Knit it? 
Why not?\n * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will stop where you added `browser()` and you'll have access to the local environment to play around\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get uncountably often that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just share a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that.\n\nAnd I can't copy-paste into Google.\n:::\n\nWhat you need is a Reproducible Example or `reprex`. This is a small chunk of code that \n\n1. runs in its own environment \n1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples, How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save)\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex()` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Share that.\n\n\n::: {.callout-note}\nBecause Reprex runs in its own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. You'll have to load these things in the script. 
That's the point\n:::\n\n\n\n\n\n\n## Practice\n\n#### Gradient ascent.\n\n* Suppose we want to find $\\max_x f(x)$.\n\n* We repeat the update $x \\leftarrow x + \\gamma f'(x)$ until convergence, for some $\\gamma > 0$.\n\n#### Poisson likelihood.\n\n* Recall the likelihood: $L(\\lambda; y_1,\\ldots,y_n) = \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in 0,1,\\ldots)$\n\n[Goal:]{.secondary} find the MLE for $\\lambda$ using gradient ascent\n\n---\n\n## Deliverables, 2 R scripts\n\n1. A function that evaluates the log likelihood. (think about sufficiency, ignorable constants)\n1. A function that evaluates the gradient of the log likelihood. \n1. A function that *does* the optimization. \n a. Should take in data, the log likelihood and the gradient.\n b. Use the loglikelihood to determine convergence. \n c. Pass in any other necessary parameters with reasonable defaults.\n1. A collection of tests that make sure your functions work.\n\n\n$$\n\\begin{aligned} \nL(\\lambda; y_1,\\ldots,y_n) &= \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in 0,1,\\ldots)\\\\\nx &\\leftarrow x + \\gamma f'(x)\n\\end{aligned}\n$$", + "markdown": "---\nlecture: \"Unit tests and avoiding 🪲🪲\"\nformat: \n revealjs:\n echo: true\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 03 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ 
#2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\renewcommand{\\hat}{\\widehat}\n$$\n\n\n\n\n## I urge you to consult:\n\n[Carnegie Mellon's 36-750 Notes](https://36-750.github.io)\n\nThank you Alex and Chris for the heavy lifting.\n\n\n\n## Bugs happen. All. The. Time.\n\n* the crash of the [Mars Climate Orbiter](https://en.wikipedia.org/wiki/Mars%5FClimate%5FOrbiter) (1998),\n\n* a [failure of the national telephone network](https://telephoneworld.org/landline-telephone-history/the-crash-of-the-att-network-in-1990/) (1990),\n\n* a deadly medical device ([1985](https://en.wikipedia.org/wiki/Therac-25), 2000),\n\n* a massive [Northeastern blackout](https://en.wikipedia.org/wiki/Northeast%5Fblackout%5Fof%5F2003) (2003),\n\n* the [Heartbleed](http://heartbleed.com/), [Goto Fail](https://www.dwheeler.com/essays/apple-goto-fail.html), [Shellshock](https://en.wikipedia.org/wiki/Shellshock%5F(software%5Fbug)) exploits (2012–2014),\n\n* a 15-year-old [fMRI analysis software](http://www.pnas.org/content/113/28/7900.full) bug that inflated significance levels (2015),\n\n. . .\n\nIt is easy to write lots of code.\n\nBut are we sure it's doing the right things?\n\n::: {.callout-important}\nEffective testing tries to help.\n:::\n\n\n## A Common (Interactive) Workflow\n\n1. Write a function.\n1. Try some reasonable values at the REPL to check that it works.\n1. If there are problems, maybe insert some print statements, and modify the function.\n1. 
Repeat until things seem fine.\n\n(REPL == Read-Eval-Print-Loop, the console, or Jupyter NB)\n\n* This tends to result in lots of bugs.\n\n* Later on, you forget which values you tried, whether they failed, how you fixed them.\n\n* So you make a change and maybe or maybe not try some again.\n\n## Step 1 --- write functions\n\n::: {.callout-important appearance=\"simple\"}\nWrite functions.\n\nLots of them.\n:::\n\n👍 Functions are testable \n\n👎 Scripts are not\n\nIt's easy to alter the arguments and see \"what happens\"\n\nThere's less ability to screw up environments.\n\n. . .\n\nI'm going to mainly describe `R`, but the logic is very similar (if not the syntax) for `python`, `C++`, and `Julia`\n\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r .cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r 
.cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments\n* If you name them, you can pass them out of order (but you shouldn't).\n\n## Outputs vs. Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function does, outputs can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](unit-tests_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 83 138 191 191 182 74 43 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.166 0.276 0.382 0.382 0.364 0.148 0.086 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## Step 2 --- program defensively, ensure behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- 
function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n## Even better\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) cli::cli_abort(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) cli::cli_abort(\"`inc_by` must be numeric\")\n x + inc_by\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:6, \"b\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! 
`inc_by` must be numeric\n```\n:::\n:::\n\n\n\n## Step 3 --- Keep track of behaviour with tests\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) stop(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) stop(\"`inc_by` must be numeric\")\n x + inc_by\n}\ntest_that(\"incrementer validates arguments\", {\n expect_error(incrementer(\"a\"))\n expect_equal(incrementer(1:3), 2:4)\n expect_equal(incrementer(2, -3), -1)\n expect_error(incrementer(1, \"b\"))\n expect_identical(incrementer(1:3), 2:4)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n── Failure: incrementer validates arguments ────────────────────────────────────\nincrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError:\n! Test failed\n```\n:::\n:::\n\n\n\n## Integers are trouble\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\nexpect_equal(incrementer(1:3, 1), 2:4)\n```\n:::\n\n\n# Testing lingo\n\n## Unit testing\n\n* A **unit** is a small bit of code (function, class, module, group of classes)\n\n* A **test** calls the unit with a set of inputs, and checks if we get the expected output.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngcd <- function(x, na.rm = FALSE) {\n if (na.rm) x <- x[!is.na(x)]\n if (anyNA(x)) return(NA)\n stopifnot(is.numeric(x))\n if (!rlang::is_integerish(x)) cli_abort(\"`x` must contain only integers.\")\n if (length(x) == 1L) return(as.integer(x))\n x <- x[x != 0]\n compute_gcd(x) # dispatch to a C++ function\n}\n\ntest_that(\"gcd works\", {\n # corner cases\n expect_identical(gcd(c(1, NA)), 
NA)\n expect_identical(gcd(c(1, NA), TRUE), 1L)\n expect_identical(gcd(c(1, 2, 4)), 1L)\n # error\n expect_error(gcd(1.3))\n # function\n expect_identical(gcd(c(2, 4, 6)), 2L)\n expect_identical(gcd(c(2, 3, 7)), 1L)\n})\n```\n:::\n\n\n## Unit testing\n\nUnit testing consists of writing tests that are\n\n* focused on a small, low-level piece of code (a unit)\n* typically written by the programmer with standard tools\n* fast to run (so can be run often, i.e. before every commit).\n\n\n## Unit testing benefits\n\nAmong others:\n\n* Exposing problems early\n* Making it easy to change (refactor) code without forgetting pieces or breaking things\n* Simplifying integration of components\n* Providing natural documentation of what the code should do\n* Driving the design of new code.\n\n![](http://www.phdcomics.com/comics/archive/phd033114s.gif)\n\n\n## Components of a Unit Testing Framework\n\n::: flex\n::: w-70\n\n* Collection of **Assertions** executed in sequence. \n* Executed in a self-contained environment.\n* Any assertion fails `&rArr;`{=html} Test fails.\n\nEach test focuses on a single component.\n\nNamed so that you know what it's doing.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## See https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life\ntest_that(\"Conway's rules are correct\", {\n # conway_rules(num_neighbors, alive?)\n expect_true(conway_rules(3, FALSE))\n expect_false(conway_rules(4, FALSE))\n expect_true(conway_rules(2, TRUE))\n ...\n})\n```\n:::\n\n:::\n\n::: w-30\n\n![](https://upload.wikimedia.org/wikipedia/commons/e/e5/Gospers_glider_gun.gif)\n\n:::\n:::\n\n\n## A test suite\n\n::: flex\n::: w-50\n* Collection of related tests in a common context.\n\n* Prepares the environment, cleans up after\n\n* (loads some data, connects to a database, necessary library,...)\n\n* Test suites are run and the results reported, particularly failures, in an easy-to-parse and economical style. 
\n\n* For example, Python’s `{unittest}` can report like this\n\n::: \n\n::: w-50\n\n```{.bash}\n$ python test/trees_test.py -v\n\ntest_crime_counts (__main__.DataTreeTest)\nEnsure Ks are consistent with num_points. ... ok\ntest_indices_sorted (__main__.DataTreeTest)\nEnsure all node indices are sorted in increasing order. ... ok\ntest_no_bbox_overlap (__main__.DataTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.DataTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.DataTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.DataTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.DataTreeTest)\nCheck that child tranges are contained in parent tranges. ... ok\ntest_no_bbox_overlap (__main__.QueryTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.QueryTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.QueryTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.QueryTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.QueryTreeTest)\nCheck that child tranges are contained in parent tranges. ... 
ok\n\n---------------------------------------------------------\nRan 12 tests in 23.932s\n```\n\n:::\n:::\n\n\n\n## `R` example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntestthat::test_local(here::here(\"../../../../../../Delphi/smoothqr/\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n✔ | F W S OK | Context\n\n⠏ | 0 | smooth-rq \n⠴ | 6 | smooth-rq \n✔ | 12 | smooth-rq\n\n══ Results ═════════════════════════════════════════════════════════════════════\n[ FAIL 0 | WARN 0 | SKIP 0 | PASS 12 ]\n```\n:::\n:::\n\n\n\n## What do I test?\n\n::: {.callout-tip icon=false}\n## Core Principle:\n\nTests should be passed by a correct function, but not by an incorrect function.\n:::\n\nThe tests must apply pressure to know if things break.\n\n* several specific inputs for which you _know_ the correct answer\n* \"edge\" cases, like a list of size zero or a matrix instead of a vector\n* special cases that the function must handle, but which you might forget about months from now\n* error cases that should throw an error instead of returning an invalid answer\n* previous bugs you’ve fixed, so those bugs never return.\n\n\n## What do I test?\n\nMake sure that incorrect functions won't pass (or at least, won't pass them all).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadd <- function(a, b) return(4)\nadd <- function(a, b) return(a * b)\n\ntest_that(\"Addition is commutative\", {\n expect_equal(add(1, 3), add(3, 1)) # both pass this !!\n expect_equal(add(2, 5), add(5, 2)) # neither passes this\n})\n```\n:::\n\n\n::: {.callout-tip}\n* Cover all branches. \n\n* Make sure there aren't branches you don't expect.\n:::\n\n## Assertions\n\n[Assertions]{.secondary} are things that must be true. Failure means \"Quit\". \n\n- There's no way to recover. 
\n- Think: passed in bad arguments.\n \n\n::: {.cell layout-align=\"center\"}\n\n```{.python .cell-code}\ndef fit(data, ...):\n\n for it in range(max_iterations):\n # iterative fitting code here\n ...\n\n # Plausibility check\n assert np.all(alpha >= 0), \"negative alpha\"\n assert np.all(theta >= 0), \"negative theta\"\n assert omega > 0, \"Nonpositive omega\"\n assert eta2 > 0, \"Nonpositive eta2\"\n assert sigma2 > 0, \"Nonpositive sigma2\"\n\n ...\n```\n:::\n\n\nThe parameters have to be positive. Negative is impossible. No way to recover.\n\n\n## Errors\n\n[Errors]{.secondary} are for unexpected conditions that _could_ be handled by the calling code.\n\n* You could perform some action to work around the error, fix it, or report it to the user.\n\n#### Example:\n\n- I give you directions to my house. You get lost. You could recover.\n- Maybe retrace your steps, see if you missed a signpost.\n- Maybe search on Google Maps to locate yourself in relation to a landmark.\n- If those fail, message me.\n- If I don't respond, get an Uber.\n- Finally, give up and go home.\n\n## Errors\n\nCode can also do this. It can `try` the function and `catch` errors to recover automatically.\n\nFor example:\n\n* Load some data from the internet. If the file doesn't exist, create some.\n\n* Run some iterative algorithm. If we haven't converged, restart from another place.\n\nCode can fix errors without user input. It can't fix assertions.\n\n* An input must be an integer. So round it, warn, and proceed rather than fail.\n\n\n## Test-driven development\n\nTest Driven Development (TDD) uses a short development cycle for each new feature or component:\n\n1. Write tests that specify the component’s desired behavior. \n The tests will initially fail because the component does not yet exist.\n\n1. Create the minimal implementation that passes the test.\n\n1. 
Refactor the code to meet design standards, running the tests with each change to ensure correctness.\n\n\n## Why work this way?\n\n* Writing the tests may help you realize \n a. what arguments the function must take, \n b. what other data it needs, \n c. and what kinds of errors it needs to handle. \n\n* The tests define a specific plan for what the function must do.\n\n* You will catch bugs at the beginning instead of at the end (or never).\n\n* Testing is part of design, instead of a lame afterthought you dread doing.\n\n\n## Rules of thumb\n\nKeep tests in separate files\n: from the code they test. This makes it easy to run them separately.\n\nGive tests names\n: Testing frameworks usually let you give the test functions names or descriptions. `test_1` doesn’t help you at all, but `test_tree_insert` makes it easy for you to remember what the test is for.\n\nMake tests replicable\n: If a test involves random data, what do you do when the test fails? You need some way to know what random values it used so you can figure out why the test failed.\n\n## Rules of thumb\n\nUse tests instead of the REPL\n: If you’re building a complicated function, write the tests in advance and use them to help you while you write the function. Otherwise you'll waste time calling the function over and over at the REPL.\n\nAvoid testing against another person's code/package\n: You don't know the ins and outs of what they do. If they change the code, your tests will fail.\n\nTest units, not main functions\n: You should write small functions that do one thing. Test those. Don't write one huge 1000-line function and try to test that.\n\nAvoid random numbers\n: Seeds are not always portable.\n\n---\n\n::: {.callout-note}\n* In `R`, use `{testthat}`. See the [Testing](http://r-pkgs.had.co.nz/tests.html) chapter from Hadley Wickham’s R Packages book.\n\n* In `python`, use `{pytest}`. 
A bit more user-friendly than `{unittest}`: [pytest](https://docs.pytest.org/en/latest/)\n:::\n\n\n\n## Other suggestions\n\n::: flex\n::: w-50\n[Do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n if (x < 0) stop(x, \" is not positive\")\n}\n\nfoo <- function(x) {\n if (x < 0) message(x, \" is not positive\")\n # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n if (x < 0) warning(x, \" is not positive\")\n # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n if (length(x) == 0)\n rlang::abort(\"no data\", class=\"no_input_data\")\n}\n```\n:::\n\n\nThese allow error handling.\n:::\n\n\n::: w-50\n\n[Don't do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n if (x < 0) {\n print(paste0(x, \" is not positive\"))\n return(NULL)\n }\n ...\n}\n\nfoo <- function(x) {\n if (x < 0) cat(\"uh oh.\")\n ...\n}\n```\n:::\n\n\nCan't recover.\n\nDon't know what went wrong.\n\n:::\n:::\n\n---\n\nSee [here](https://36-750.github.io/practices/errors-exceptions/) for more details.\n\nSeems like overkill, \n\nbut when you run a big simulation that takes 2 weeks, \n\nyou don't want it to die after 10 days. 
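This recovery idea can be sketched in code. Below is a minimal Python sketch (all names, such as `make_flaky_step` and `run_with_retries`, are hypothetical, invented for illustration, not from any package) of catching a transient error and retrying instead of dying:

```python
# Sketch: recover from transient errors by retrying, instead of dying.
# All names here are hypothetical, written only for illustration.

def make_flaky_step():
    """Return a step function whose every third call raises an error."""
    calls = {"n": 0}
    def step():
        calls["n"] += 1
        if calls["n"] % 3 == 0:
            raise RuntimeError("transient failure")
        return 1  # one unit of work completed
    return step

def run_with_retries(step, n_steps, max_retries=5):
    """Run n_steps steps; retry each step up to max_retries times."""
    total = 0
    for i in range(n_steps):
        for _attempt in range(max_retries):
            try:
                total += step()
                break  # this step succeeded; move on to the next
            except RuntimeError:
                continue  # recover: try this step again
        else:
            # only give up after repeated failures
            raise RuntimeError(f"step {i} failed {max_retries} times")
    return total

print(run_with_retries(make_flaky_step(), n_steps=5))  # prints 5
```

With `try`/`except` (or `tryCatch()` in `R`), the long-running job survives a transient failure; an assertion would have killed it on the spot.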
\n\n\nYou want it to recover.\n\n\n\n# More coding details, if time.\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib)\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: 
`obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) \n\n* The convention is\nthat the specialized function is named `method.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `method.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n## Annoying example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n\n```{.r .cell-code}\nprint.lm <- function(x, ...) print(\"This is a linear model.\")\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"This is a linear model.\"\n```\n:::\n:::\n\n\n* Overwrote the method in the global environment.\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`,\ncame from a method invented two centuries ago, or from a Batrachian Emphasis Machine\nwhich won't be invented for another five years. \n\n* The one drawback is that the help pages for the generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. 
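For comparison, Python's standard library offers generic-function-style dispatch via `functools.singledispatch`. This is a rough sketch of the idea (the `summary` generic below is hypothetical, written only for illustration), analogous in spirit to R's `method.class()` convention:

```python
from functools import singledispatch

@singledispatch
def summary(obj):
    # the "default method": used when no specialized version matches
    return f"object of type {type(obj).__name__}"

@summary.register(list)
def _(obj):
    # a "method" for lists, analogous in spirit to summary.lm() for lm objects
    return f"list of length {len(obj)}"

@summary.register(dict)
def _(obj):
    return f"dict with keys {sorted(obj)}"

print(summary([1, 2, 3]))  # dispatches to the list method
print(summary(3.14))       # falls back to the default
```

As with S3, the implementation is chosen by the class of the (first) argument, falling back to the default when nothing more specific is registered.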
\n\n\n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R`, code run in the Console produces objects in the \"Global environment\".\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside those package environments.\n\nFor example, `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available).\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But you can call those functions by prefixing the package name: `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::`, like `dplyr:::as_across_fn_call()`.\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\".\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to 
bottom, with code chunks depending on previous ones.\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 😱\n\nObjects in your local environment are not available in the `.Rmd`.\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Global Environment that don't exist in the `.Rmd`.\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Careful organization and relative paths will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n * Make sure you're calling the function correctly.\n * Try running the examples.\n * Paste the error into Google (this is my first step when you ask me)\n * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n * Did you make the mistake on the last slide?\n * Did it Knit before? Then the bug is in whatever you added.\n * Did you never Knit it? 
Why not?\n * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will stop where you added `browser()` and you'll have access to the local environment to play around\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get uncountably often that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just share a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that.\n\nAnd I can't copy-paste into Google.\n:::\n\nWhat you need is a Reproducible Example or `reprex`. This is a small chunk of code that \n\n1. runs in its own environment \n1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples: How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save).\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex()` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Share that.\n\n\n::: {.callout-note}\nBecause Reprex runs in its own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. You'll have to load these things in the script. 
That's the point.\n:::\n\n\n\n\n\n\n## Practice\n\n#### Gradient ascent.\n\n* Suppose we want to find $\\max_x f(x)$.\n\n* We repeat the update $x \\leftarrow x + \\gamma f'(x)$ until convergence, for some $\\gamma > 0$.\n\n#### Poisson likelihood.\n\n* Recall the likelihood: $L(\\lambda; y_1,\\ldots,y_n) = \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})$\n\n[Goal:]{.secondary} find the MLE for $\\lambda$ using gradient ascent\n\n---\n\n## Deliverables: 2 R scripts\n\n1. A function that evaluates the log likelihood. (Think about sufficiency and ignorable constants.)\n1. A function that evaluates the gradient of the log likelihood. \n1. A function that *does* the optimization. \n a. Should take in data, the log likelihood, and the gradient.\n b. Use the log likelihood to determine convergence. \n c. Pass in any other necessary parameters with reasonable defaults.\n1. A collection of tests that make sure your functions work.\n\n\n$$\n\\begin{aligned} \nL(\\lambda; y_1,\\ldots,y_n) &= \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})\\\\\nx &\\leftarrow x + \\gamma f'(x)\n\\end{aligned}\n$$", "supporting": [ "unit-tests_files" ],