diff --git a/_freeze/schedule/slides/bootstrap/execute-results/html.json b/_freeze/schedule/slides/bootstrap/execute-results/html.json index f7e3d9e..23aa65e 100644 --- a/_freeze/schedule/slides/bootstrap/execute-results/html.json +++ b/_freeze/schedule/slides/bootstrap/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "d2869649cdb959b50f0e8ab08e8f9e05", "result": { - "markdown": "---\nlecture: \"The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n\n## {background-image=\"https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg\" background-size=\"contain\"}\n\n\n## {background-image=\"http://rackjite.com/wp-content/uploads/rr11014aa.jpg\" background-size=\"contain\"}\n\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat y$, or a median, etc.\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n. . .\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n\n## Refresher on sampling distributions\n\n1. If $X_i$ are iid Normal $(0,\\sigma^2)$, then $\\Var{\\bar{X}} = \\sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\\Var{\\bar{X}} \\approx \\Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\\Var{\\bar{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate a LDA on some data.\n\nI get a new $x_0$ and produce $\\hat{Pr}(y_0 =1 \\given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \\given x_0)$?\n\n. . .\n\nThe bootstrap gives this to you.\n\n\n\n\n## Procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate a LDA on this sample.\n3. Produce a new prediction, call it $\\widehat{Pr}_b(y_0 =1 \\given x_0)$.\n4. Repeat 1-3 $b = 1,\\ldots,B$ times.\n5. CI: $\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]$\n\n\n\n$\\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Very basic example\n\n* Let $X_i\\sim Exponential(1/5)$. 
The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2 \\frac{s}{\\sqrt{n}}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Now what\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(2022-11-01)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.669627\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nbootMed <- function() median(sample(x, replace = TRUE)) # resample, and get the median\nFhat <- replicate(B, bootMed()) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n---\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## {background-image=\"gfx/boot1.png\" background-size=\"contain\"}\n\n## {background-image=\"gfx/boot2.png\" background-size=\"contain\"}\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.9293 -1.0460 -0.1407 0.8298 16.2536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.81895 0.07678 49.74 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 2.549 on 143 degrees of freedom\nMultiple R-squared: 0.9454,\tAdjusted R-squared: 0.945 \nF-statistic: 2474 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## When we fit models, we examine diagnostics\n\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe tails are too fat, I don't believe that CI...\n:::\n\n::: {.column width=\"50%\"}\n\n\nWe bootstrap\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.654977 3.955927 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for most cases.\n\n. . .\n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n## Same data\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1-alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.673559 3.970251 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nnewcats <- fatcats\nfor (b in 1:B) {\n samp <- sample(residuals(cats.lm), replace = TRUE)\n newcats$Hwt <- predict(cats.lm) + samp # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.665531 3.961896 \n```\n:::\n:::\n\n\n:::\n::::\n\n## Bootstrap error sources\n\n\n[Simulation error]{.secondary}:\n\nusing only $B$ samples to estimate $F$ with $\\hat{F}$.\n\n[Statistical error]{.secondary}:\n\nour data depended on a sample from the population. 
We don't have the whole population so we make an error by using a sample.\n\n(Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\n[Specification error]{.secondary}:\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n\n", + "markdown": "---\nlecture: \"The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\DeclareMathOperator*{\argmin}{argmin}\n\DeclareMathOperator*{\argmax}{argmax}\n\DeclareMathOperator*{\minimize}{minimize}\n\DeclareMathOperator*{\maximize}{maximize}\n\DeclareMathOperator*{\find}{find}\n\DeclareMathOperator{\st}{subject\,\,to}\n\newcommand{\E}{E}\n\newcommand{\Expect}[1]{\E\left[ #1 \right]}\n\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}\n\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n$$\n\n\n\n\n\n\n\n## {background-image="https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg" background-size="contain"}\n\n\n## {background-image="http://rackjite.com/wp-content/uploads/rr11014aa.jpg" background-size="contain"}\n\n\n## In statistics...\n\nThe "bootstrap" works. And well.\n\nIt's good for "second-level" analysis.\n\n* "First-level" analyses are things like $\hat\beta$, $\hat y$, an estimator of the center (a median), etc.\n\n* "Second-level" are things like $\Var{\hat\beta}$, a confidence interval for $\hat y$, or a median, etc.\n\nYou usually get these "second-level" properties from "the sampling distribution of an estimator"\n\n. . .\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n\n## Refresher on sampling distributions\n\n1. If $X_i$ are iid Normal $(0,\sigma^2)$, then $\Var{\bar{X}} = \sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\Var{\bar{X}} \approx \Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\Var{\bar{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate an LDA on some data.\n\nI get a new $x_0$ and produce $\hat{Pr}(y_0 =1 \given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \given x_0)$?\n\n. . .\n\nThe bootstrap gives this to you.\n\n\n\n\n## Procedure\n\n1. Resample your training data w/ replacement.\n2. Fit an LDA on this sample.\n3. Produce a new prediction, call it $\widehat{Pr}_b(y_0 =1 \given x_0)$.\n4. Repeat 1-3 $b = 1,\ldots,B$ times.\n5. CI: $\left[2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}_{boot}(1-\alpha/2),\ 2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}_{boot}(\alpha/2)\right]$\n\n\n\n$\widehat{F}_{boot}$ is the "empirical" distribution of the bootstrap predictions, and $\widehat{F}_{boot}(\alpha)$ here denotes its $\alpha$ quantile. 
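\n\nA minimal sketch of steps 1--5 in `R`, assuming a data frame `df` with a factor response `y`, a new point `x0` (a one-row data frame with the same features), and `MASS::lda()`; the names here are placeholders, not the deck's actual data:\n\n```{.r}\nlibrary(MASS)\nfit <- lda(y ~ ., data = df)\nphat <- predict(fit, newdata = x0)$posterior[, \"1\"] # Pr-hat(y0 = 1 | x0)\nB <- 1000\nphat_b <- replicate(B, {\n idx <- sample(nrow(df), replace = TRUE) # 1. resample rows w/ replacement\n fit_b <- lda(y ~ ., data = df[idx, ]) # 2. refit the LDA\n predict(fit_b, newdata = x0)$posterior[, \"1\"] # 3.-4. new prediction, B times\n})\nalpha <- 0.05\nCI <- 2 * phat - quantile(phat_b, probs = c(1 - alpha / 2, alpha / 2)) # 5.\n```\n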
\n\n## Very basic example\n\n* Let $X_i\sim Exponential(1/5)$. The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\sqrt{n}(\bar{X}-E[X])}{s} \approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\bar{X} \pm 2 \\frac{s}{\sqrt{n}}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n\n---\n\n\n::: {.cell layout-align="center"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Now what\n\n\n::: {.cell layout-align="center"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n::: {.cell layout-align="center"}\n\n```{.r .cell-code}\nset.seed(2022-11-01)\nx <- rexp(n, 1 / 5) # n = 500, set in an earlier (unshown) chunk\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.669627\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nbootMed <- function() median(sample(x, replace = TRUE)) # resample, and get the median\nFhat <- replicate(B, bootMed()) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::
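\n\nA quick sanity check, possible only because these data are simulated: the true median of an $Exponential(1/5)$ is $5\log 2 \approx 3.47$, which the interval above should (usually) cover.\n\n```{.r}\nqexp(0.5, rate = 1 / 5) # true median: 5 * log(2) = 3.4657...\nCI # does the bootstrap interval cover it?\n```\n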
\n\n\n---\n\n\n::: {.cell layout-align="center"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n## {background-image="gfx/boot1.png" background-size="contain"}\n\n## {background-image="gfx/boot2.png" background-size="contain"}\n\n## Slightly harder example\n\n\n::: {.cell layout-align="center"}\n\n:::\n\n\n:::: {.columns}\n::: {.column width="50%"}\n\n\n::: {.cell layout-align="center"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width="50%"}\n\n\n::: {.cell layout-align="center"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.9293 -1.0460 -0.1407 0.8298 16.2536 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.81895 0.07678 49.74 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.549 on 143 degrees of freedom\nMultiple R-squared: 0.9454,\tAdjusted R-squared: 0.945 \nF-statistic: 2474 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## When we fit models, we examine diagnostics\n\n\n:::: {.columns}\n::: {.column width="50%"}\n\n::: {.cell layout-align="center"}\n::: {.cell-output-display}\n![](bootstrap_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe tails are too fat, I don't believe that CI...\n:::\n\n::: {.column width="50%"}\n\n\nWe bootstrap\n\n\n::: {.cell layout-align="center"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.654977 3.955927 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n::::\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is the easiest and most common approach.\n\n. . .\n\n[But there's another version]{.secondary}\n\n* You could try a "parametric bootstrap"\n\n* This assumes knowledge about the DGP\n\n## Same data\n\n:::: {.columns}\n::: {.column width="50%"}\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align="center"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\nfor (b in 1:B) {\n samp <- sample(1:nrow(fatcats), replace = TRUE)\n newcats <- fatcats[samp, ] # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1-alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.673559 3.970251 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.667178 3.97073\n```\n:::\n:::\n\n:::\n\n::: {.column width="50%"}\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\texttt{Hwt}_i = \widehat{\beta}\times \texttt{Bwt}_i + \widehat{e}_i$, $\widehat{e}_i \approx \epsilon_i$\n3. The $\epsilon_i$ are random $\longrightarrow$ just resample $\widehat{e}_i$.\n\n\n::: {.cell layout-align="center"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\nalpha <- .05\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nnewcats <- fatcats\nfor (b in 1:B) {\n samp <- sample(residuals(cats.lm), replace = TRUE)\n newcats$Hwt <- predict(cats.lm) + samp # new data\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats)) \n}\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha/2, alpha/2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.665531 3.961896 \n```\n:::\n:::\n\n\n:::\n::::
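\n\nResampling the $\widehat{e}_i$ is sometimes called the [residual]{.secondary} bootstrap; a fully parametric variant draws new errors from a fitted distribution instead. A sketch, under the extra assumption $\epsilon_i \sim N(0, \sigma^2)$, reusing `cats.lm`, `B`, `alpha`, `bhats`, and `newcats` from the chunk above:\n\n```{.r}\nsigma.hat <- summary(cats.lm)$sigma # estimated residual sd\nfor (b in 1:B) {\n newcats$Hwt <- predict(cats.lm) + rnorm(nrow(fatcats), 0, sigma.hat) # simulate new errors\n bhats[b] <- coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n}\n2 * coef(cats.lm) - quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\nIf the Normal assumption is bad (recall the fat-tailed diagnostics), this version inherits that specification error.\n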
\n\n## Bootstrap error sources\n\n\n[Simulation error]{.secondary}:\n\nusing only $B$ samples to estimate $F$ with $\hat{F}$.\n\n[Statistical error]{.secondary}:\n\nour data depended on a sample from the population. We don't have the whole population so we make an error by using a sample.\n\n(Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\n[Specification error]{.secondary}:\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n\n", "supporting": [ "bootstrap_files" ], diff --git a/_freeze/schedule/slides/bootstrap/figure-revealjs/unnamed-chunk-1-1.svg b/_freeze/schedule/slides/bootstrap/figure-revealjs/unnamed-chunk-1-1.svg index 0ae11bf..77408a6 100644 diff --git a/_freeze/schedule/slides/bootstrap/figure-revealjs/unnamed-chunk-4-1.svg b/_freeze/schedule/slides/bootstrap/figure-revealjs/unnamed-chunk-4-1.svg index d105a38..ad68727 100644 diff --git a/_freeze/schedule/slides/git/execute-results/html.json b/_freeze/schedule/slides/git/execute-results/html.json index e308378..649f079 100644 --- a/_freeze/schedule/slides/git/execute-results/html.json +++ b/_freeze/schedule/slides/git/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "a694de2b18ced9e1000f78652f8de195", "result": { - "markdown": "---\nlecture: \"Version control\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\DeclareMathOperator*{\argmin}{argmin}\n\DeclareMathOperator*{\argmax}{argmax}\n\DeclareMathOperator*{\minimize}{minimize}\n\DeclareMathOperator*{\maximize}{maximize}\n\DeclareMathOperator*{\find}{find}\n\DeclareMathOperator{\st}{subject\,\,to}\n\newcommand{\E}{E}\n\newcommand{\Expect}[1]{\E\left[ #1 \right]}\n\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}\n\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n$$\n\n\n\n\n\n## Why version control?\n\n\n![](http://www.phdcomics.com/comics/archive/phd101212s.gif){fig-align=\"center\"}\n\n\n[Much of this lecture is based on material from Colin Rundel and Karl Broman]{.smallest}\n\n\n## Why version control?\n\n* Simple formal system for tracking all changes to a project\n* Time machine for your projects\n + Track blame and/or praise\n + Remove the fear of breaking things\n* Learning curve is steep, but when you need it you [REALLY]{.secondary} need it\n\n::: {.callout-tip icon=false}\n## Words of wisdom\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n\n-- _Paul Wilson_\n:::\n\n\n## Why Git\n\n::: flex\n::: w-60\n* You could use something like Box or Dropbox\n* These are poor-man's version control\n* Git is much more appropriate\n* It works with large groups\n* It's 
very fast\n* It's [much]{.secondary} better at fixing mistakes\n* Tech companies use it (so it's in your interest to have some experience)\n:::\n\n::: w-40\n![](https://imgs.xkcd.com/comics/git.png){fig-align=\"center\"}\n:::\n:::\n\n. . .\n\n::: {.callout-important appearance=\"simple\"}\nThis will hurt, but what doesn't kill you, makes you stronger.\n:::\n\n## Overview\n\n* `git` is a command line program that lives on your machine\n* If you want to track changes in a directory, you type `git init`\n* This creates a (hidden) directory called `.git`\n* The `.git` directory contains a history of all changes made to \"versioned\" files\n* This top directory is referred to as a \"repository\" or \"repo\"\n* GitHub is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders `.ipynb` & `.md`\n* Some IDEs (PyCharm, RStudio, VS Code) have built-in `git`\n* `git`/GitHub is broad and complicated. Here, just what you [need]{.secondary}\n\n## Aside on \"Built-in\" & \"Command line\" {background-color=\"#97D4E9\"}\n\n:::{.callout-tip}\nFirst things first, RStudio and the Terminal\n:::\n\n\n* Command line is the \"old\" type of computing. You type commands at a prompt and the computer \"does stuff\". \n\n* You may not have seen where this is. RStudio has one built in called \"Terminal\"\n\n* The Mac System version is also called \"Terminal\". If you have a Linux machine, this should all be familiar.\n\n* Windows is not great at this.\n\n* To get the most out of Git, you have to use the command line.\n\n\n## Typical workflow {.smaller}\n\n\n1. Download a repo from GitHub\n```{.bash}\ngit clone https://github.com/stat550-2021/lecture-slides.git\n```\n2. Create a **branch**\n```{.bash}\ngit branch <branchname>\n```\n3. Make changes to your files.\n4. Add your changes to be tracked (\"stage\" them)\n```{.bash}\ngit add <name/of/file>\n```\n5. Commit your changes\n```{.bash}\ngit commit -m \"Some explanatory message\"\n```\n\n**Repeat 3--5 as needed. Once you're satisfied**\n\n* Push to GitHub\n```{.bash}\ngit push\ngit push -u origin <branchname>\n```\n\n---\n\n![](gfx/git-clone.png){fig-align=\"center\"}\n\n\n## What should be tracked?\n\n
\n\nDefinitely\n: code, markdown documentation, tex files, bash scripts/makefiles, ...\n\n
\n\nPossibly\n: logs, jupyter notebooks, images (that won’t change), ...\n\n
\n\nQuestionable\n: processed data, static pdfs, ...\n\n
\n\nDefinitely not\n: full data, continually updated pdfs, other things compiled from source code, ...\n\n\n\n## What things to track\n\n* You decide what is \"versioned\". \n\n* A file called `.gitignore` tells `git` files or types to never track\n\n```{.bash}\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n```\n\n* Shortcut to track everything (use carefully):\n\n```{.bash}\ngit add .\n```\n\n\n## What's a PR?\n\n* This exists on GitHub (not git)\n* Demonstration\n\n\n::: {.r-stack}\n![](gfx/pr1.png){.fragment height=\"550\"}\n\n![](gfx/pr2.png){.fragment height=\"550\"}\n:::\n\n## Some things to be aware of\n\n* `master` vs `main`\n* If you think you did something wrong, stop and ask for help\n* The hardest part is the initial setup. Then, this should all be rinse-and-repeat.\n* This book is great: [Happy Git with R](https://happygitwithr.com)\n 1. See Chapter 6 if you have install problems.\n 1. See Chapter 9 for credential caching (avoid typing a password all the time)\n 1. See Chapter 13 if RStudio can't find `git`\n \n## The `main/develop/branch` workflow\n\n* When working on your own\n 1. Don't NEED branches (but you should use them, really)\n 1. I make a branch if I want to try a modification without breaking what I have.\n \n \n* When working on a large team with production grade software\n 1. `main` is protected, released version of software (maybe renamed to `release`)\n 1. `develop` contains things not yet on `main`, but thoroughly tested\n 1. On a schedule (once a week, once a month) `develop` gets merged to `main`\n 1. You work on a `feature` branch off `develop` to build your new feature\n 1. You do a PR against `develop`. Supervisors review your contributions\n \n. . .\n\nI and many DS/CS/Stat faculty use this workflow with my lab.\n\n## Protection\n\n* Typical for your PR to trigger tests to make sure you don't break things\n\n* Typical for team members or supervisors to review your PR for compliance\n\n::: {.callout-tip}\nI suggest (require?) you adopt the \"production\" version for your HW 2\n:::\n\n\n## Operations in Rstudio \n\n::: flex\n::: w-50\n\n1. Stage\n1. Commit\n1. Push\n1. Pull\n1. Create a branch\n\n\n\n[Covers:]{.secondary}\n\n* Everything to do your HW / Project if you're careful\n* Plus most other things you \"want to do\"\n\n:::\n\n::: w-50\n\n\nCommand line versions (of the same)\n\n```{.bash}\ngit add \n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b \n```\n\n:::\n:::\n\n\n## Other useful stuff (but command line only) {.smaller}\n\n::: flex\n::: w-50\nInitializing\n```{.bash}\ngit config user.name --global \"Daniel J. McDonald\"\ngit config user.email --global \"daniel@stat.ubc.ca\"\ngit config core.editor --global nano \n# or emacs or ... (default is vim)\n```\n\n\nStaging\n```{.bash}\ngit add name/of/file # stage 1 file\ngit add . 
# stage all\n```\n\nCommitting\n```{.bash}\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \n```\n\nPushing\n```{.bash}\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n```\n:::\n\n::: w-50\nBranching\n```{.bash}\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\n```\n\nCheck the status\n```{.bash}\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs\n```\n:::\n:::\n\n## Commit messages {.smaller}\n\n::: {.callout-tip appearance=\"simple\"}\n1. Write meaningful messages. Not `fixed stuff` or `oops? maybe done?`\n1. These appear in the log and help you determine what you've done.\n1. Think _imperative mood_: \"add cross validation to simulation\"\n1. Best to have each commit \"do one thing\"\n:::\n\n[Conventions:]{.secondary} (see [here](https://www.conventionalcommits.org/en/v1.0.0/) for details)\n\n* feat: – a new feature is introduced with the changes\n* fix: – a bug fix has occurred\n* chore: – changes that do not relate to a fix or feature (e.g., updating dependencies)\n* refactor: – refactored code that neither fixes a bug nor adds a feature\n* docs: – updates to documentation such as a the README or other markdown files\n* style: – changes that do not affect the function of the code\n* test – including new or correcting previous tests\n* perf – performance improvements\n* ci – continuous integration related\n\n```{.bash}\ngit commit -m \"feat: add cross validation to simulation, closes #251\"\n```\n\n## Conflicts\n\n* Sometimes you merge things and \"conflicts\" happen.\n\n* Meaning that changes on one branch would overwrite changes on a different branch.\n\n::: flex\n::: w-50\n\n[They look like this:]{.secondary}\n\n```\nHere are lines that are either unchanged\nfrom the common ancestor, or cleanly\nresolved because only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n```\n\n:::\n\n::: w-50\n\n[You decide what to keep]{.secondary}\n\n1. Your changes (above `======`)\n2. Their changes (below `======`)\n3. Both.\n4. Neither.\n\nAlways delete the `<<<<<`, `======`, and `>>>>>` lines.\n\nOnce you're satisfied, commit to resolve the conflict.\n\n:::\n:::\n\n## Some other pointers\n\n* Commits have long names: `32b252c854c45d2f8dfda1076078eae8d5d7c81f`\n * If you want to use it, you need \"enough to be unique\": `32b25`\n\n* Online help uses directed graphs in ways different from statistics:\n * In stats, arrows point from cause to effect, forward in time\n * In `git` docs, it's reversed, they point to the thing on which they depend\n \n \n### Cheat sheet\n\n\n\n\n## How to undo in 3 scenarios\n\n* Suppose we're concerned about a file named `README.md`\n* Often, `git status` will give some of these as suggestions\n\n::: flex\n::: w-50\n\n[1. 
Saved but not staged]{.secondary}\n\nIn RStudio, select the file and click the gear ([More]{.secondary}) menu, then select Revert...\n```{.bash}\n# grab the old committed version\ngit checkout -- README.md \n```\n\n[2. Staged but not committed]{.secondary}\n\nIn RStudio, uncheck the box by the file, then use the method above.\n```{.bash}\n# first unstage, then same as 1\ngit reset HEAD README.md\ngit checkout -- README.md\n```\n:::\n\n::: w-50\n\n[3. Committed]{.secondary}\n\nNot easy to do in RStudio...\n```{.bash}\n# check the log to find the chg \ngit log\n# go one step before that \n# (e.g., to commit 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n```\n\n
\n\n```{.bash}\n# alternatively, if it happens\n# to also be on another branch\ngit checkout otherbranch -- README.md\n```\n:::\n:::\n\n## Recovering from things\n\n1. Accidentally did work on main,\n```{.bash}\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n```\n\n2. Made a branch, did lots of work, realized it's trash, and you want to burn it\n```{.bash}\ngit checkout main\ngit branch -d badbranch\n```\n\n3. Anything more complicated, either post to Slack or LMGTFY\n\n\n## Rules for HW 2\n\n* Each team has their own repo\n* Make a PR against `main` to submit\n* Tag me and all the assigned reviewers\n* Peer evaluations are done via PR review (also send to Estella)\n* YOU must make at [least 5 commits]{.secondary} (fewer will lead to deductions)\n* I review your work and merge the PR\n\n::: {.callout-important}\n☠️☠️ Read all the instructions in the repo! ☠️☠️\n:::\n\n\n# Practice time...\n\n[dajmcdon/sugary-beverages](https://github.com/dajmcdon/sugary-beverages)\n", + "markdown": "---\nlecture: \"Version control\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Why version control?\n\n\n![](http://www.phdcomics.com/comics/archive/phd101212s.gif){fig-align=\"center\"}\n\n\n[Much of this lecture is based on material from Colin Rundel and Karl Broman]{.smallest}\n\n\n## Why version control?\n\n* Simple formal system for tracking all changes to a project\n* Time machine for your projects\n + Track blame and/or praise\n + Remove the fear of breaking things\n* Learning curve is steep, but when you need it you [REALLY]{.secondary} need it\n\n::: {.callout-tip icon=false}\n## Words of wisdom\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n\n-- _Paul Wilson_\n:::\n\n\n## Why Git\n\n::: flex\n::: w-60\n* You could use something like Box or Dropbox\n* These are poor-man's version control\n* Git is much more appropriate\n* It works with large groups\n* It's very fast\n* It's [much]{.secondary} better at fixing mistakes\n* Tech companies use it (so it's in your interest to have some experience)\n:::\n\n::: w-40\n![](https://imgs.xkcd.com/comics/git.png){fig-align=\"center\"}\n:::\n:::\n\n. . 
.\n\n::: {.callout-important appearance=\"simple\"}\nThis will hurt, but what doesn't kill you, makes you stronger.\n:::\n\n## Overview\n\n* `git` is a command line program that lives on your machine\n* If you want to track changes in a directory, you type `git init`\n* This creates a (hidden) directory called `.git`\n* The `.git` directory contains a history of all changes made to \"versioned\" files\n* This top directory is referred to as a \"repository\" or \"repo\"\n* GitHub is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders `.ipynb` & `.md`\n* Some IDEs (PyCharm, RStudio, VS Code) have built-in `git`\n* `git`/GitHub is broad and complicated. Here, just what you [need]{.secondary}\n\n## Aside on \"Built-in\" & \"Command line\" {background-color=\"#97D4E9\"}\n\n:::{.callout-tip}\nFirst things first, RStudio and the Terminal\n:::\n\n\n* Command line is the \"old\" type of computing. You type commands at a prompt and the computer \"does stuff\". \n\n* You may not have seen where this is. RStudio has one built in called \"Terminal\"\n\n* The Mac System version is also called \"Terminal\". If you have a Linux machine, this should all be familiar.\n\n* Windows is not great at this.\n\n* To get the most out of Git, you have to use the command line.\n\n\n## Typical workflow {.smaller}\n\n\n1. Download a repo from GitHub\n```{.bash}\ngit clone https://github.com/stat550-2021/lecture-slides.git\n```\n2. Create a **branch**\n```{.bash}\ngit branch <branchname>\n```\n3. Make changes to your files.\n4. Add your changes to be tracked (\"stage\" them)\n```{.bash}\ngit add <name/of/file>\n```\n5. Commit your changes\n```{.bash}\ngit commit -m \"Some explanatory message\"\n```\n\n**Repeat 3--5 as needed. Once you're satisfied**\n\n* Push to GitHub\n```{.bash}\ngit push\ngit push -u origin <branchname>\n```\n\n---\n\n![](gfx/git-clone.png){fig-align=\"center\"}\n\n\n## What should be tracked?\n\n
\n\nDefinitely\n: code, markdown documentation, tex files, bash scripts/makefiles, ...\n\n
\n\nPossibly\n: logs, jupyter notebooks, images (that won’t change), ...\n\n
\n\nQuestionable\n: processed data, static pdfs, ...\n\n
\n\nDefinitely not\n: full data, continually updated pdfs, other things compiled from source code, ...\n\n\n\n## What things to track\n\n* You decide what is \"versioned\". \n\n* A file called `.gitignore` tells `git` files or types to never track\n\n```{.bash}\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n```\n\n* Shortcut to track everything (use carefully):\n\n```{.bash}\ngit add .\n```\n\n\n## What's a PR?\n\n* This exists on GitHub (not git)\n* Demonstration\n\n\n::: {.r-stack}\n![](gfx/pr1.png){.fragment height=\"550\"}\n\n![](gfx/pr2.png){.fragment height=\"550\"}\n:::\n\n## Some things to be aware of\n\n* `master` vs `main`\n* If you think you did something wrong, stop and ask for help\n* The hardest part is the initial setup. Then, this should all be rinse-and-repeat.\n* This book is great: [Happy Git with R](https://happygitwithr.com)\n 1. See Chapter 6 if you have install problems.\n 1. See Chapter 9 for credential caching (avoid typing a password all the time)\n 1. See Chapter 13 if RStudio can't find `git`\n \n## The `main/develop/branch` workflow\n\n* When working on your own\n 1. Don't NEED branches (but you should use them, really)\n 1. I make a branch if I want to try a modification without breaking what I have.\n \n \n* When working on a large team with production grade software\n 1. `main` is protected, released version of software (maybe renamed to `release`)\n 1. `develop` contains things not yet on `main`, but thoroughly tested\n 1. On a schedule (once a week, once a month) `develop` gets merged to `main`\n 1. You work on a `feature` branch off `develop` to build your new feature\n 1. You do a PR against `develop`. Supervisors review your contributions\n \n. . .\n\nI and many DS/CS/Stat faculty use this workflow with my lab.\n\n## Protection\n\n* Typical for your PR to trigger tests to make sure you don't break things\n\n* Typical for team members or supervisors to review your PR for compliance\n\n::: {.callout-tip}\nI suggest (require?) you adopt the \"production\" version for your HW 2\n:::\n\n\n## Operations in Rstudio \n\n::: flex\n::: w-50\n\n1. Stage\n1. Commit\n1. Push\n1. Pull\n1. Create a branch\n\n\n\n[Covers:]{.secondary}\n\n* Everything to do your HW / Project if you're careful\n* Plus most other things you \"want to do\"\n\n:::\n\n::: w-50\n\n\nCommand line versions (of the same)\n\n```{.bash}\ngit add \n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b \n```\n\n:::\n:::\n\n\n## Other useful stuff (but command line only) {.smaller}\n\n::: flex\n::: w-50\nInitializing\n```{.bash}\ngit config user.name --global \"Daniel J. McDonald\"\ngit config user.email --global \"daniel@stat.ubc.ca\"\ngit config core.editor --global nano \n# or emacs or ... (default is vim)\n```\n\n\nStaging\n```{.bash}\ngit add name/of/file # stage 1 file\ngit add . 
# stage all\n```\n\nCommitting\n```{.bash}\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \n```\n\nPushing\n```{.bash}\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n```\n:::\n\n::: w-50\nBranching\n```{.bash}\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\n```\n\nCheck the status\n```{.bash}\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs\n```\n:::\n:::\n\n## Commit messages {.smaller}\n\n::: {.callout-tip appearance=\"simple\"}\n1. Write meaningful messages. Not `fixed stuff` or `oops? maybe done?`\n1. These appear in the log and help you determine what you've done.\n1. Think _imperative mood_: \"add cross validation to simulation\"\n1. Best to have each commit \"do one thing\"\n:::\n\n[Conventions:]{.secondary} (see [here](https://www.conventionalcommits.org/en/v1.0.0/) for details)\n\n* feat: – a new feature is introduced with the changes\n* fix: – a bug fix has occurred\n* chore: – changes that do not relate to a fix or feature (e.g., updating dependencies)\n* refactor: – refactored code that neither fixes a bug nor adds a feature\n* docs: – updates to documentation such as the README or other markdown files\n* style: – changes that do not affect the function of the code\n* test: – including new or correcting previous tests\n* perf: – performance improvements\n* ci: – continuous integration related\n\n```{.bash}\ngit commit -m \"feat: add cross validation to simulation, closes #251\"\n```\n\n## Conflicts\n\n* Sometimes you merge things and \"conflicts\" happen.\n\n* Meaning that changes on one branch would overwrite changes on a different branch.\n\n::: flex\n::: w-50\n\n[They look like this:]{.secondary}\n\n```\nHere are lines that are either unchanged\nfrom the common ancestor, or cleanly\nresolved because only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n```\n\n:::\n\n::: w-50\n\n[You decide what to keep]{.secondary}\n\n1. Your changes (above `======`)\n2. Their changes (below `======`)\n3. Both.\n4. Neither.\n\nAlways delete the `<<<<<`, `======`, and `>>>>>` lines.\n\nOnce you're satisfied, commit to resolve the conflict.\n\n:::\n:::\n\n## Some other pointers\n\n* Commits have long names: `32b252c854c45d2f8dfda1076078eae8d5d7c81f`\n * If you want to use it, you need \"enough to be unique\": `32b25`\n\n* Online help uses directed graphs in ways different from statistics:\n * In stats, arrows point from cause to effect, forward in time\n * In `git` docs, it's reversed, they point to the thing on which they depend\n \n \n### Cheat sheet\n\n\n\n\n## How to undo in 3 scenarios\n\n* Suppose we're concerned about a file named `README.md`\n* Often, `git status` will give some of these as suggestions\n\n::: flex\n::: w-50\n\n[1. 
Saved but not staged]{.secondary}\n\nIn RStudio, select the file and click the gear ([More]{.secondary}) menu, then select Revert...\n```{.bash}\n# grab the old committed version\ngit checkout -- README.md \n```\n\n[2. Staged but not committed]{.secondary}\n\nIn RStudio, uncheck the box by the file, then use the method above.\n```{.bash}\n# first unstage, then same as 1\ngit reset HEAD README.md\ngit checkout -- README.md\n```\n:::\n\n::: w-50\n\n[3. Committed]{.secondary}\n\nNot easy to do in RStudio...\n```{.bash}\n# check the log to find the chg \ngit log\n# go one step before that \n# (e.g., to commit 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n```\n\n
\n\n```{.bash}\n# alternatively, if it happens\n# to also be on another branch\ngit checkout otherbranch -- README.md\n```\n:::\n:::\n\n## Recovering from things\n\n1. Accidentally did work on main,\n```{.bash}\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n```\n\n2. Made a branch, did lots of work, realized it's trash, and you want to burn it\n```{.bash}\ngit checkout main\ngit branch -D badbranch # capital -D forces deletion of unmerged work\n```\n\n3. Anything more complicated, either post to Slack or LMGTFY\n\n\n## Rules for HW 2\n\n* Each team has their own repo\n* Make a PR against `main` to submit\n* Tag me and all the assigned reviewers\n* Peer evaluations are done via PR review (also send to Estella)\n* YOU must make at [least 5 commits]{.secondary} (fewer will lead to deductions)\n* I review your work and merge the PR\n\n::: {.callout-important}\n☠️☠️ Read all the instructions in the repo! ☠️☠️\n:::\n\n\n# Practice time...\n\n[dajmcdon/sugary-beverages](https://github.com/dajmcdon/sugary-beverages)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ] diff --git a/_freeze/schedule/slides/grad-school/execute-results/html.json b/_freeze/schedule/slides/grad-school/execute-results/html.json index ae8d8b5..5a72230 100644 --- a/_freeze/schedule/slides/grad-school/execute-results/html.json +++ b/_freeze/schedule/slides/grad-school/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "3e9694c1b0946782432a1fba557f0a95", "result": { - "markdown": "---\nlecture: \"Skills for graduate students\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\DeclareMathOperator*{\argmin}{argmin}\n\DeclareMathOperator*{\argmax}{argmax}\n\DeclareMathOperator*{\minimize}{minimize}\n\DeclareMathOperator*{\maximize}{maximize}\n\DeclareMathOperator*{\find}{find}\n\DeclareMathOperator{\st}{subject\,\,to}\n\newcommand{\E}{E}\n\newcommand{\Expect}[1]{\E\left[ #1 \right]}\n\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}\n\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n$$\n\n\n\n\n## Something happens in graduate school\n\n* As undergrads, you took lots of classes\n* You didn't care that much about all of them\n* Sure, you wanted good grades, but you may not have always wanted to [really learn]{.tertiary} the material\n* And you probably didn't try to go in depth beyond the requirements\n\n. . .\n\n* That has to change in grad school\n* Even if you don't want to be a professor, to get a PhD, to do an MSc thesis.\n* This is the material that you have decided you will use for the rest of your life\n\n. . 
.\n\n* If you disagree, then we should talk\n\n\n## Side discussion on \"Reading for research\"\n\n\n* You should \"read\" regularly: set aside 2-3 hours every week\n* Stay up-to-date on recent research, determine what you find interesting\n* What do people care about? What does it take to write journal articles?\n\n\n## What is \"read\"?\n\n* Start with titles, then abstracts, then intro+conclusion\n* Each is a filter to determine how far to go\n* Pass 3 filters, [read]{.secondary} the paper (should take about 30 minutes)\n* Don't get bogged down in notation, proofs\n* Organize your documents somehow, make notes in the margins, etc\n* After you [read]{.secondary} it, you should be able to tell me what they show, why it's important, why it's novel\n* If you can, figure out [how]{.tertiary} they show something. This is hard.\n\n\n## How to find and organize papers\n\n* arXiv, AOS, JASA, JCGS have RSS feeds, email lists etc\n* Find a statistician you like who filters\n* Follow reading groups\n* Conference proceedings\n* Become an IMS member, SSC member (ASA costs money:( )\n* BibDesk, Zotero\n\n## Ideal outcome\n\n* If you need to learn something, you can teach yourself\n* Know how to find the basics on the internet\n* Know how to go in depth with real sources\n* Collect a set of resources that you can turn to regularly\n* If you need to read a book, you can\n* If you need to pick up a new coding language, you can\n\n. . .\n\n::: {.callout-note}\n## What this doesn't mean\n\nYou are not expected to have all the answers at the tips of your fingers.\n:::\n\nBut you should get progressively good at finding them.", + "markdown": "---\nlecture: \"Skills for graduate students\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\DeclareMathOperator*{\argmin}{argmin}\n\DeclareMathOperator*{\argmax}{argmax}\n\DeclareMathOperator*{\minimize}{minimize}\n\DeclareMathOperator*{\maximize}{maximize}\n\DeclareMathOperator*{\find}{find}\n\DeclareMathOperator{\st}{subject\,\,to}\n\newcommand{\E}{E}\n\newcommand{\Expect}[1]{\E\left[ #1 \right]}\n\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}\n\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n$$\n\n\n\n\n## Something happens in graduate school\n\n* As undergrads, you took lots of classes\n* You didn't care that much about all of them\n* Sure, you wanted good grades, but you may not have always wanted to [really learn]{.tertiary} the material\n* And you probably didn't try to go in depth beyond the requirements\n\n. . .\n\n* That has to change in grad school\n* Even if you don't want to be a professor, to get a PhD, to do an MSc thesis.\n* This is the material that you have decided you will use for the rest of your life\n\n. . 
.\n\n* If you disagree, then we should talk\n\n\n## Side discussion on \"Reading for research\"\n\n\n* You should \"read\" regularly: set aside 2-3 hours every week\n* Stay up-to-date on recent research, determine what you find interesting\n* What do people care about? What does it take to write journal articles?\n\n\n## What is \"read\"?\n\n* Start with titles, then abstracts, then intro+conclusion\n* Each is a filter to determine how far to go\n* Pass 3 filters, [read]{.secondary} the paper (should take about 30 minutes)\n* Don't get bogged down in notation, proofs\n* Organize your documents somehow, make notes in the margins, etc\n* After you [read]{.secondary} it, you should be able to tell me what they show, why it's important, why it's novel\n* If you can, figure out [how]{.tertiary} they show something. This is hard.\n\n\n## How to find and organize papers\n\n* arXiv, AOS, JASA, JCGS have RSS feeds, email lists etc\n* Find a statistician you like who filters\n* Follow reading groups\n* Conference proceedings\n* Become an IMS member, SSC member (ASA costs money:( )\n* BibDesk, Zotero\n\n## Ideal outcome\n\n* If you need to learn something, you can teach yourself\n* Know how to find the basics on the internet\n* Know how to go in depth with real sources\n* Collect a set of resources that you can turn to regularly\n* If you need to read a book, you can\n* If you need to pick up a new coding language, you can\n\n. . .\n\n::: {.callout-note}\n## What this doesn't mean\n\nYou are not expected to have all the answers at the tips of your fingers.\n:::\n\nBut you should get progressively good at finding them.", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ] diff --git a/_freeze/schedule/slides/organization/execute-results/html.json b/_freeze/schedule/slides/organization/execute-results/html.json index 3cfdbe9..f4ba3b3 100644 --- a/_freeze/schedule/slides/organization/execute-results/html.json +++ b/_freeze/schedule/slides/organization/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "46509e1b87037017090fcbb152d46295", "result": { - "markdown": "---\nlecture: \"Organization and reports\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\DeclareMathOperator*{\argmin}{argmin}\n\DeclareMathOperator*{\argmax}{argmax}\n\DeclareMathOperator*{\minimize}{minimize}\n\DeclareMathOperator*{\maximize}{maximize}\n\DeclareMathOperator*{\find}{find}\n\DeclareMathOperator{\st}{subject\,\,to}\n\newcommand{\E}{E}\n\newcommand{\Expect}[1]{\E\left[ #1 \right]}\n\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}\n\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}\n\newcommand{\given}{\mid}\n\newcommand{\X}{\mathbf{X}}\n\newcommand{\x}{\mathbf{x}}\n\newcommand{\y}{\mathbf{y}}\n\newcommand{\P}{\mathcal{P}}\n\newcommand{\R}{\mathbb{R}}\n\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\n\newcommand{\snorm}[1]{\lVert #1 \rVert}\n\newcommand{\tr}[1]{\mbox{tr}(#1)}\n\newcommand{\U}{\mathbf{U}}\n\newcommand{\D}{\mathbf{D}}\n\newcommand{\V}{\mathbf{V}}\n$$\n\n\n\n\n\n## Topics for today\n\n1. Organizing your file system\n2. Writing reports that mix output and text\n3. 
(Avoiding buggy code)\n\n## The guiding theme\n\n![](https://imgs.xkcd.com/comics/is_it_worth_the_time.png){.center}\n\n\n# Organization {background-color=\"#e98a15\"}\n\n* Students come to my office\n* All their stuff is on their Desktop\n* This is 🤮\n\n## I urge you to consult:\n\n[Karl Broman's Notes](https://kbroman.org/Tools4RR/assets/lectures/06_org_eda_withnotes.pdf)\n\n\n## Some guiding principles\n\n1. Avoid naming by date. \n - Your file system already knows the date.\n - Sometimes projects take a while.\n - You can add this inside a particular report: `Last updated: 2022-01-07`\n1. If you're going to use a date anywhere, do `YYYY-MM-DD` or `YYYYMMDD` not `DD-MMM-YY`\n1. This is a process\n1. Don't get tied down\n1. But don't reorganize every time you find a better system\n1. Customize to your needs, preferences\n \n\n## Organizing your stuff\n\n\n```{.bash}\n├── Advising\n│ ├── arash\n│ ├── gian-carlo\n├── CV\n├── Computing\n│ ├── batchtools.slurm.tmpl\n│ ├── computecanada_notes.md\n│ ├── FKF\n│ └── ghclass\n├── Grants\n│ ├── B&E JSM 2010\n│ ├── CANSSI RRP 2020\n│ ├── NSERC 2020\n├── LettersofRec\n├── Manuscripts\n| ├── learning-white-matter\n| ├── rt-est\n│ ├── zzzz Old\n├── Referee reports\n├── Talks\n│ ├── JobTalk2020\n│ ├── ubc-stat-covid-talk\n│ └── utoronto-grad-advice\n├── Teaching\n│ ├── stat-406\n│ ├── stat-550\n│ ├── zzzz CMU TA\n│ └── zzzz booth\n└── Website\n```\n\n\n\n## Inside a project\n\n```{.bash}\n.\n├── README.md\n├── Summary of Goals.rtf\n├── cluster_output\n├── code\n├── data\n├── dsges-github.Rproj\n├── manuscript\n└── waldman-triage\n```\n\n* Include a README\n* Ideally have a MAKEFILE\n* Under version control, shared with collaborator\n\n\n## Basic principles\n\n* Be consistent\n – directory structure; names\n - all project files in 1 directory, not multiples\n* Always separate raw from processed data\n* Always separate code from data\n* It should be obvious what code created what files, and what the dependencies are. (MAKEFILE forces this)\n* [No hand-editing of data files]{.secondary}\n* Don’t use spaces in file names\n* In code, use relative paths, not absolute paths\n - `../blah` not `~/blah` or `/users/dajmcdon/Documents/Work/proj-1/blah`\n - The `{here}` package in `R` is great for this\n \n## Problem: Coordinating with collaborators\n\n* Where to put data that multiple people will work with?\n* Where to put intermediate/processed data?\n* Where to indicate the code that created those processed data files?\n* How to divvy up tasks and know who did what?\n* Need to agree on directory structure and file naming conventions\n\n[GitHub is (I think) the ideal solution, but not always feasible.]{.secondary}\n\n## Problem: Collaborators who don’t use GitHub\n\n* Use GitHub yourself\n* Copy files to/from some shared space\n - Ideally, in an automated way (Dropbox, S3 Bucket)\n - Avoid Word at all costs. Google Docs if needed.\n - Word and Git do not mix\n - [Last resort:]{.secondary} Word file in Dropbox. Everything else nicely organized on your end. Rmd file with similar structure to Manuscript that does the analysis.\n* Commit their changes.\n\n. . .\n\nOverleaf has Git built in (paid tier). I don't like Overleaf. Costs money, the viewer is crap and so is the editor. 
I suggest you avoid it.\n\n# Reports that mix output and text {background-color=\"#e98a15\"}\n\n## Using Rmarkdown/Quarto/Jupyter for most things\n\n### Your goal is to [Avoid at all costs]{.secondary}:\n\n* \"How did I create this plot?\"\n* \"Why did I decide to omit those six samples?\"\n* \"Where (on the web) did I find these data?\"\n* \"What was that interesting gene/feature/predictor?\"\n\n
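To keep those questions answerable, here is a minimal sketch of the idea (mine, not the slides' code; the URL, file name, and exclusion rule are all hypothetical). The provenance lives in the report chunk itself:

```{.r}
# Where did the data come from? Record it in code, not in memory.
# (hypothetical URL and path)
# download.file("https://example.org/measurements.csv", "data/raw.csv")
dat <- read.csv("data/raw.csv")

# Why did I omit those samples? The rule is code plus a comment.
# Hypothetical rule: drop physically impossible weights (data-entry errors).
dat <- subset(dat, weight > 0 & weight < 10)

# The figure is regenerated on every render, so "How did I create
# this plot?" always has an answer.
plot(height ~ weight, data = dat)
```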
\n \nReally useful resource:\n\n* Emily Riederer [RmdDD](https://emilyriederer.netlify.app/post/rmarkdown-driven-development/)\n* Talk [Slides](https://www.slideshare.net/EmilyRiederer/rmarkdown-driven-development-rstudioconf-2020)\n\n## When I begin a new project\n\n1. Create a directory structure\n - `code/`\n - `papers/`\n - `notes/` (maybe?)\n - `README.md`\n - `data/` (maybe?)\n1. Write scripts in the `code/` directory\n1. TODO items in the README\n1. Use Rmarkdown/Quarto/Jupyter for reports, render to `.pdf`\n\n## As the project progresses...\n\nReorganize\n\n* Some script files go to a package (thoroughly tested), all that remains is for the paper\n* These now load the package and run simulations or analyses (that take a while)\n* Maybe add a directory that contains dead-ends (code or text or ...)\n* Add `manuscript/`. I try to go for `main.tex` and `Supplement.Rmd`\n* `Supplement.Rmd` runs anything necessary in `code/` and creates all figures in the main doc and the supplement. Also generates any online supplementary material\n* Sometimes, just `manuscript/main.Rmd` \n* Sometimes `main.tex` just inputs `intro.tex`, `methods.tex`, etc.\n\n## The old manuscript (starting in school, persisting too long)\n\n1. Write lots of LaTeX, `R` code in separate files\n1. Need a figure. Run `R` code, get figure, save as `.pdf`.\n1. Recompile LaTeX. Axes are unreadable. Back to `R`, rerun `R` code, ...\n1. Recompile LaTeX. Can't distinguish lines. Back to `R`, rerun `R` code, ...\n1. Collaborator wants changes to the simulation. Edit the code. Rerun the figure script; it doesn't work. More edits... Finally recompile.\n1. Reviewer: \"what if `n` is bigger?\" Hope I can find the right location. But the code isn't functions. Something breaks ...\n1. Etc., etc.\n\n## Now: \n\n\n1. `R` package with documented code, available on GitHub. \n1. One script to run the analysis, one to gather the results. \n1. One `.Rmd` file to take in the results, do preprocessing, generate all figures. \n1. LaTeX file in the journal's style.\n\n### The optimal\n\nSame as above but with a MAKEFILE to automatically run parts of 1--4 as needed\n\n\n\n\n## Evolution of presentations\n\n1. LaTeX + Beamer (similar to the manuscript):\n a. Write lots of LaTeX, `R` code in separate files\n a. Need a figure. Run `R` code, get figure, save as `.pdf`.\n a. Rinse and repeat.\n1. Course slides in Rmarkdown + Slidy\n1. Seminars in Rmarkdown + Beamer (with lots of customization)\n1. Seminars in Rmarkdown + Xaringan\n1. 
Everything in Quarto\n\n::: {.callout-tip appearance=\"simple\"}\n* Easy to use.\n* Easy to customize (defaults are not great)\n* WELL DOCUMENTED\n:::\n\n\n## Takeaways", + "markdown": "---\nlecture: \"Organization and reports\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Topics for today\n\n1. Organizing your file system\n2. Writing reports that mix output and text\n3. (Avoiding buggy code)\n\n## The guiding theme\n\n![](https://imgs.xkcd.com/comics/is_it_worth_the_time.png){.center}\n\n\n# Organization {background-color=\"#e98a15\"}\n\n* Students come to my office\n* All their stuff is on their Desktop\n* This is 🤮\n\n## I urge you to consult:\n\n[Karl Broman's Notes](https://kbroman.org/Tools4RR/assets/lectures/06_org_eda_withnotes.pdf)\n\n\n## Some guiding principles\n\n1. Avoid naming by date. \n - Your file system already knows the date.\n - Sometimes projects take a while.\n - You can add this inside a particular report: `Last updated: 2022-01-07`\n1. If you're going to use a date anywhere, do `YYYY-MM-DD` or `YYYYMMDD` not `DD-MMM-YY`\n1. This is a process\n1. Don't get tied down\n1. But don't reorganize every time you find a better system\n1. Customize to your needs, preferences\n \n\n## Organizing your stuff\n\n\n```{.bash}\n├── Advising\n│ ├── arash\n│ ├── gian-carlo\n├── CV\n├── Computing\n│ ├── batchtools.slurm.tmpl\n│ ├── computecanada_notes.md\n│ ├── FKF\n│ └── ghclass\n├── Grants\n│ ├── B&E JSM 2010\n│ ├── CANSSI RRP 2020\n│ ├── NSERC 2020\n├── LettersofRec\n├── Manuscripts\n| ├── learning-white-matter\n| ├── rt-est\n│ ├── zzzz Old\n├── Referee reports\n├── Talks\n│ ├── JobTalk2020\n│ ├── ubc-stat-covid-talk\n│ └── utoronto-grad-advice\n├── Teaching\n│ ├── stat-406\n│ ├── stat-550\n│ ├── zzzz CMU TA\n│ └── zzzz booth\n└── Website\n```\n\n\n\n## Inside a project\n\n```{.bash}\n.\n├── README.md\n├── Summary of Goals.rtf\n├── cluster_output\n├── code\n├── data\n├── dsges-github.Rproj\n├── manuscript\n└── waldman-triage\n```\n\n* Include a README\n* Ideally have a MAKEFILE\n* Under version control, shared with collaborator\n\n\n## Basic principles\n\n* Be consistent\n – directory structure; names\n - all project files in 1 directory, not multiples\n* Always separate raw from processed data\n* Always separate code from data\n* It should be obvious what code created what files, and what the dependencies are. 
(MAKEFILE forces this)\n* [No hand-editing of data files]{.secondary}\n* Don’t use spaces in file names\n* In code, use relative paths, not absolute paths\n - `../blah` not `~/blah` or `/users/dajmcdon/Documents/Work/proj-1/blah`\n - The `{here}` package in `R` is great for this\n \n## Problem: Coordinating with collaborators\n\n* Where to put data that multiple people will work with?\n* Where to put intermediate/processed data?\n* Where to indicate the code that created those processed data files?\n* How to divvy up tasks and know who did what?\n* Need to agree on directory structure and file naming conventions\n\n[GitHub is (I think) the ideal solution, but not always feasible.]{.secondary}\n\n## Problem: Collaborators who don’t use GitHub\n\n* Use GitHub yourself\n* Copy files to/from some shared space\n - Ideally, in an automated way (Dropbox, S3 Bucket)\n - Avoid Word at all costs. Google Docs if needed.\n - Word and Git do not mix\n - [Last resort:]{.secondary} Word file in Dropbox. Everything else nicely organized on your end. Rmd file with similar structure to Manuscript that does the analysis.\n* Commit their changes.\n\n. . .\n\nOverleaf has Git built in (paid tier). I don't like Overleaf. Costs money, the viewer is crap and so is the editor. I suggest you avoid it.\n\n# Reports that mix output and text {background-color=\"#e98a15\"}\n\n## Using Rmarkdown/Quarto/Jupyter for most things\n\n### Your goal is to [Avoid at all costs]{.secondary}:\n\n* \"How did I create this plot?\"\n* \"Why did I decide to omit those six samples?\"\n* \"Where (on the web) did I find these data?\"\n* \"What was that interesting gene/feature/predictor?\"\n\n
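The relative-path advice under Basic principles, as a minimal sketch using the `{here}` package (file names hypothetical; `here()` resolves paths from the project root, e.g. where the `.Rproj` or `.git` directory lives):

```{.r}
library(here)

# portable: works for any collaborator, from any working directory
raw <- read.csv(here("data", "raw", "survey.csv"))

# not portable: breaks on every machine but yours
# raw <- read.csv("/users/dajmcdon/Documents/Work/proj-1/data/raw/survey.csv")
```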
\n \nReally useful resource:\n\n* Emily Riederer [RmdDD](https://emilyriederer.netlify.app/post/rmarkdown-driven-development/)\n* Talk [Slides](https://www.slideshare.net/EmilyRiederer/rmarkdown-driven-development-rstudioconf-2020)\n\n## When I begin a new project\n\n1. Create a directory structure\n - `code/`\n - `papers/`\n - `notes/` (maybe?)\n - `README.md`\n - `data/` (maybe?)\n1. Write scripts in the `code/` directory\n1. TODO items in the README\n1. Use Rmarkdown/Quarto/Jupyter for reports, render to `.pdf`\n\n## As the project progresses...\n\nReorganize\n\n* Some script files go to a package (thoroughly tested), all that remains is for the paper\n* These now load the package and run simulations or analyses (that take a while)\n* Maybe add a directory that contains dead-ends (code or text or ...)\n* Add `manuscript/`. I try to go for `main.tex` and `Supplement.Rmd`\n* `Supplement.Rmd` runs anything necessary in `code/` and creates all figures in the main doc and the supplement. Also generates any online supplementary material\n* Sometimes, just `manuscript/main.Rmd` \n* Sometimes `main.tex` just inputs `intro.tex`, `methods.tex`, etc.\n\n## The old manuscript (starting in school, persisting too long)\n\n1. Write lots of LaTeX, `R` code in separate files\n1. Need a figure. Run `R` code, get figure, save as `.pdf`.\n1. Recompile LaTeX. Axes are unreadable. Back to `R`, rerun `R` code, ...\n1. Recompile LaTeX. Can't distinguish lines. Back to `R`, rerun `R` code, ...\n1. Collaborator wants changes to the simulation. Edit the code. Rerun the figure script; it doesn't work. More edits... Finally recompile.\n1. Reviewer: \"what if `n` is bigger?\" Hope I can find the right location. But the code isn't functions. Something breaks ...\n1. Etc., etc.\n\n## Now: \n\n\n1. `R` package with documented code, available on GitHub. \n1. One script to run the analysis, one to gather the results. \n1. One `.Rmd` file to take in the results, do preprocessing, generate all figures. \n1. LaTeX file in the journal's style.\n\n### The optimal\n\nSame as above but with a MAKEFILE to automatically run parts of 1--4 as needed\n\n\n\n\n## Evolution of presentations\n\n1. LaTeX + Beamer (similar to the manuscript):\n a. Write lots of LaTeX, `R` code in separate files\n a. Need a figure. Run `R` code, get figure, save as `.pdf`.\n a. Rinse and repeat.\n1. Course slides in Rmarkdown + Slidy\n1. Seminars in Rmarkdown + Beamer (with lots of customization)\n1. Seminars in Rmarkdown + Xaringan\n1. 
Everything in Quarto\n\n::: {.callout-tip appearance=\"simple\"}\n* Easy to use.\n* Easy to customize (defaults are not great)\n* WELL DOCUMENTED\n:::\n\n\n## Takeaways", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/pca-intro/execute-results/html.json b/_freeze/schedule/slides/pca-intro/execute-results/html.json index df7effd..687c563 100644 --- a/_freeze/schedule/slides/pca-intro/execute-results/html.json +++ b/_freeze/schedule/slides/pca-intro/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "b43f7074da22613bdd9c0a2aefd2fc3d", "result": { - "markdown": "---\nlecture: \"Principal components analysis\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Representation learning\n\nRepresentation learning is the idea that performance of ML methods is\nhighly dependent on the choice of representation\n\n\nFor this reason, much of ML is geared towards transforming the data into\nthe relevant features and then using these as inputs\n\n\nThis idea is as old as statistics itself, really,\n\nHowever, the idea is constantly revisited in a variety of fields and\ncontexts\n\n\nCommonly, these learned representations capture low-level information\nlike overall shapes\n\n\n\nIt is possible to quantify this intuition for PCA at least\n\n. . 
.\n\nGoal\n: Transform $\\mathbf{X}\\in \\R^{n\\times p}$ into $\\mathbf{Z} \\in \\R^{n \\times ?}$\n\n?-dimension can be bigger (feature creation) or smaller (dimension reduction) than $p$\n\n\n\n\n\n## PCA\n\nPrincipal components analysis (PCA) is a dimension\nreduction technique\n\n\nIt solves various equivalent optimization problems\n\n(Maximize variance, minimize $\\ell_2$ distortions, find closest subspace of a given rank, $\\ldots$)\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-1-1.png){fig-align='center'}\n:::\n:::\n\n\n\n\n## Lower dimensional embeddings\n\nSuppose we have predictors $\\x_1$ and $\\x_2$ (columns / features / measurements)\n\n- We more faithfully preserve the structure of this data by keeping\n $\\x_1$ and setting $\\x_2$ to zero than the opposite\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Lower dimensional embeddings\n\nAn important feature of the previous example is that $\\x_1$ and $\\x_2$\naren't correlated\n\nWhat if they are?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\nWe lose a lot of structure by setting either $\\x_1$ or $\\x_2$ to zero\n\n\n\n## Lower dimensional embeddings\n\n\nThe only difference is the first is a rotation of the second\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## PCA\n\nIf we knew how to rotate our data, then we could more \neasily retain the structure.\n\n[PCA]{.secondary} gives us exactly this rotation\n\n1. Center (+scale?) the data matrix $\\X$\n2. Compute the SVD of $\\X = \\U\\D \\V^\\top$ (always exists)\n3. 
Return $\\U_M\\D_M$, where $\\D_M$ is the largest $M$\n singular values of $\\X$\n\n\n## PCA\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n## PCA on some pop music data\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,269 × 15\n artist danceability energy key loudness mode speechiness acousticness\n \n 1 Taylor Swi… 0.781 0.357 0 -16.4 1 0.912 0.717 \n 2 Taylor Swi… 0.627 0.266 9 -15.4 1 0.929 0.796 \n 3 Taylor Swi… 0.516 0.917 11 -3.19 0 0.0827 0.0139 \n 4 Taylor Swi… 0.629 0.757 1 -8.37 0 0.0512 0.00384\n 5 Taylor Swi… 0.686 0.705 9 -10.8 1 0.249 0.832 \n 6 Taylor Swi… 0.522 0.691 2 -4.82 1 0.0307 0.00609\n 7 Taylor Swi… 0.31 0.374 6 -8.46 1 0.0275 0.761 \n 8 Taylor Swi… 0.705 0.621 2 -8.09 1 0.0334 0.101 \n 9 Taylor Swi… 0.553 0.604 1 -5.30 0 0.0258 0.202 \n10 Taylor Swi… 0.419 0.908 9 -5.16 1 0.0651 0.00048\n# ℹ 1,259 more rows\n# ℹ 7 more variables: instrumentalness , liveness , valence ,\n# tempo , time_signature , duration_ms , explicit \n```\n:::\n:::\n\n\n## PCA on some pop music data\n\n* 15 dimensions to 2\n* coloured by artist\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/pca-music-plot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Plotting the weights, $\\alpha_j,\\ j=1,2$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n# Mathematical details\n\n## Matrix decompositions\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\nThis is expressed via the SVD: $\\X = \\U\\D\\V^{\\top}$.\n\n\n::: {.callout-important}\nWe assume throughout that we have centered the data\n:::\n\nThen our new features are\n\n$$\\mathbf{Z} = \\X \\V = \\U\\D$$\n\n\n\n## Short SVD aside \n\n* Any $n\\times p$ matrix can be decomposed into $\\mathbf{UDV}^\\top$.\n\n* This is a computational procedure, like inverting a matrix, `svd()`\n\n* These have properties:\n\n1. $\\mathbf{U}^\\top \\mathbf{U} = \\mathbf{I}_n$\n2. $\\mathbf{V}^\\top \\mathbf{V} = \\mathbf{I}_p$\n3. $\\mathbf{D}$ is diagonal (0 off the diagonal)\n\n\n[Many]{.secondary} methods for dimension reduction use the SVD of some matrix.\n\n\n\n## Why? {.smaller}\n\n1. Given $\\X$, find a projection $\\mathbf{P}$ onto $\\R^M$ with $M \\leq p$ \nthat minimizes the reconstruction error\n$$\n\\begin{aligned}\n\\min_{\\mathbf{P}} &\\,\\, \\lVert \\mathbf{X} - \\mathbf{X}\\mathbf{P} \\rVert^2_F \\,\\,\\, \\textrm{(sum all the elements)}\\\\\n\\textrm{subject to} &\\,\\, \\textrm{rank}(\\mathbf{P}) = M,\\, \\mathbf{P} = \\mathbf{P}^T,\\, \\mathbf{P} = \\mathbf{P}^2\n\\end{aligned}\n$$\nThe conditions ensure that $\\mathbf{P}$ is a projection matrix onto $M$ dimensions.\n\n2. 
Maximize the variance explained by an orthogonal transformation $\\mathbf{A} \\in \\R^{p\\times M}$\n$$\n\\begin{aligned}\n\\max_{\\mathbf{A}} &\\,\\, \\textrm{trace}\\left(\\frac{1}{n}\\mathbf{A}^\\top \\X^\\top \\X \\mathbf{A}\\right)\\\\\n\\textrm{subject to} &\\,\\, \\mathbf{A}^\\top\\mathbf{A} = \\mathbf{I}_M\n\\end{aligned}\n$$\n\n* In case one, the minimizer is $\\mathbf{P} = \\mathbf{V}_M\\mathbf{V}_M^\\top$\n* In case two, the maximizer is $\\mathbf{A} = \\mathbf{V}_M$.\n\n## Code output to look at\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n", + "markdown": "---\nlecture: \"Principal components analysis\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Representation learning\n\nRepresentation learning is the idea that performance of ML methods is\nhighly dependent on the choice of representation\n\n\nFor this reason, much of ML is geared towards transforming the data into\nthe relevant features and then using these as inputs\n\n\nThis idea is as old as statistics itself, really,\n\nHowever, the idea is constantly revisited in a variety of fields and\ncontexts\n\n\nCommonly, these learned representations capture low-level information\nlike overall shapes\n\n\n\nIt is possible to quantify this intuition for PCA at least\n\n. . 
.\n\nGoal\n: Transform $\\mathbf{X}\\in \\R^{n\\times p}$ into $\\mathbf{Z} \\in \\R^{n \\times ?}$\n\n?-dimension can be bigger (feature creation) or smaller (dimension reduction) than $p$\n\n\n\n\n\n## PCA\n\nPrincipal components analysis (PCA) is a dimension\nreduction technique\n\n\nIt solves various equivalent optimization problems\n\n(Maximize variance, minimize $\\ell_2$ distortions, find closest subspace of a given rank, $\\ldots$)\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-1-1.png){fig-align='center'}\n:::\n:::\n\n\n\n\n## Lower dimensional embeddings\n\nSuppose we have predictors $\\x_1$ and $\\x_2$ (columns / features / measurements)\n\n- We more faithfully preserve the structure of this data by keeping\n $\\x_1$ and setting $\\x_2$ to zero than the opposite\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Lower dimensional embeddings\n\nAn important feature of the previous example is that $\\x_1$ and $\\x_2$\naren't correlated\n\nWhat if they are?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\nWe lose a lot of structure by setting either $\\x_1$ or $\\x_2$ to zero\n\n\n\n## Lower dimensional embeddings\n\n\nThe only difference is the first is a rotation of the second\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## PCA\n\nIf we knew how to rotate our data, then we could more \neasily retain the structure.\n\n[PCA]{.secondary} gives us exactly this rotation\n\n1. Center (+scale?) the data matrix $\\X$\n2. Compute the SVD of $\\X = \\U\\D \\V^\\top$ (always exists)\n3. 
Return $\\U_M\\D_M$, where $\\D_M$ is the largest $M$\n singular values of $\\X$\n\n\n## PCA\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n## PCA on some pop music data\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,269 × 15\n artist danceability energy key loudness mode speechiness acousticness\n \n 1 Taylor Swi… 0.781 0.357 0 -16.4 1 0.912 0.717 \n 2 Taylor Swi… 0.627 0.266 9 -15.4 1 0.929 0.796 \n 3 Taylor Swi… 0.516 0.917 11 -3.19 0 0.0827 0.0139 \n 4 Taylor Swi… 0.629 0.757 1 -8.37 0 0.0512 0.00384\n 5 Taylor Swi… 0.686 0.705 9 -10.8 1 0.249 0.832 \n 6 Taylor Swi… 0.522 0.691 2 -4.82 1 0.0307 0.00609\n 7 Taylor Swi… 0.31 0.374 6 -8.46 1 0.0275 0.761 \n 8 Taylor Swi… 0.705 0.621 2 -8.09 1 0.0334 0.101 \n 9 Taylor Swi… 0.553 0.604 1 -5.30 0 0.0258 0.202 \n10 Taylor Swi… 0.419 0.908 9 -5.16 1 0.0651 0.00048\n# ℹ 1,259 more rows\n# ℹ 7 more variables: instrumentalness , liveness , valence ,\n# tempo , time_signature , duration_ms , explicit \n```\n:::\n:::\n\n\n## PCA on some pop music data\n\n* 15 dimensions to 2\n* coloured by artist\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/pca-music-plot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Plotting the weights, $\\alpha_j,\\ j=1,2$\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](pca-intro_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n# Mathematical details\n\n## Matrix decompositions\n\nAt its core, we are finding linear combinations of the original\n(centered) data $$z_{ij} = \\alpha_j^{\\top} x_i$$\n\n\nThis is expressed via the SVD: $\\X = \\U\\D\\V^{\\top}$.\n\n\n::: {.callout-important}\nWe assume throughout that we have centered the data\n:::\n\nThen our new features are\n\n$$\\mathbf{Z} = \\X \\V = \\U\\D$$\n\n\n\n## Short SVD aside \n\n* Any $n\\times p$ matrix can be decomposed into $\\mathbf{UDV}^\\top$.\n\n* This is a computational procedure, like inverting a matrix, `svd()`\n\n* These have properties:\n\n1. $\\mathbf{U}^\\top \\mathbf{U} = \\mathbf{I}_n$\n2. $\\mathbf{V}^\\top \\mathbf{V} = \\mathbf{I}_p$\n3. $\\mathbf{D}$ is diagonal (0 off the diagonal)\n\n\n[Many]{.secondary} methods for dimension reduction use the SVD of some matrix.\n\n\n\n## Why? {.smaller}\n\n1. Given $\\X$, find a projection $\\mathbf{P}$ onto $\\R^M$ with $M \\leq p$ \nthat minimizes the reconstruction error\n$$\n\\begin{aligned}\n\\min_{\\mathbf{P}} &\\,\\, \\lVert \\mathbf{X} - \\mathbf{X}\\mathbf{P} \\rVert^2_F \\,\\,\\, \\textrm{(sum all the elements)}\\\\\n\\textrm{subject to} &\\,\\, \\textrm{rank}(\\mathbf{P}) = M,\\, \\mathbf{P} = \\mathbf{P}^T,\\, \\mathbf{P} = \\mathbf{P}^2\n\\end{aligned}\n$$\nThe conditions ensure that $\\mathbf{P}$ is a projection matrix onto $M$ dimensions.\n\n2. 
Maximize the variance explained by an orthogonal transformation $\mathbf{A} \in \R^{p\times M}$\n$$\n\\begin{aligned}\n\\max_{\\mathbf{A}} &\\,\\, \\textrm{trace}\\left(\\frac{1}{n}\\mathbf{A}^\\top \\X^\\top \\X \\mathbf{A}\\right)\\\\\n\\textrm{subject to} &\\,\\, \\mathbf{A}^\\top\\mathbf{A} = \\mathbf{I}_M\n\\end{aligned}\n$$\n\n* In case one, the minimizer is $\\mathbf{P} = \\mathbf{V}_M\\mathbf{V}_M^\\top$\n* In case two, the maximizer is $\\mathbf{A} = \\mathbf{V}_M$.\n\n## Code output to look at\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n", "supporting": [ "pca-intro_files" ],
[SVG figure diff omitted: pca-intro/figure-revealjs/pca-music-plot-1.svg (index 8f99a71..000c6f7) re-rendered; only point coordinates changed]
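The PCA recipe in the slides above (center, take the SVD, keep $\U_M\D_M$) is three lines of `R`. A minimal sketch on simulated data (not the slides' code), checking the scores against `prcomp()`, whose columns agree up to sign flips:

```{.r}
set.seed(550)
X <- matrix(rnorm(100 * 5), 100, 5)
X <- scale(X, center = TRUE, scale = FALSE)  # 1. center the data matrix

s <- svd(X)                                  # 2. X = U D V'
M <- 2
Z <- s$u[, 1:M] %*% diag(s$d[1:M])           # 3. scores U_M D_M

max(abs(abs(Z) - abs(prcomp(X)$x[, 1:M])))   # ~0: matches prcomp() up to sign
```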
[SVG figure diffs omitted: pca-intro/figure-revealjs/unnamed-chunk-4-1.svg (index 2ca9847..948f24b) and unnamed-chunk-5-1.svg (index 6a4a6a1..a7e9f73) re-rendered; only point coordinates changed]
diff --git a/_freeze/schedule/slides/presentations/execute-results/html.json b/_freeze/schedule/slides/presentations/execute-results/html.json
index 17bc4e1..c900713 100644
--- a/_freeze/schedule/slides/presentations/execute-results/html.json
+++ b/_freeze/schedule/slides/presentations/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "6cd1212fed22adff9e7e3817c3ad2b52",
"result": {
- "markdown": "---\nlecture: \"Giving presentations\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Structure\n\n\n1. 
Strategy (applies to papers too)\n\n1. Dos and don'ts\n\n1. Personal preferences\n\n\n\n# Strategy\n\n\n## Genre\n\nTalks take many forms (like papers)\n\n* Department seminar\n\n* Short conference presentation\n\n* Class lecture\n\n* `...`\n\nCalibrate your talk to the [Genre]{.secondary} and the [Audience]{.tertiary}\n\n. . .\n\n* A job talk takes much more work than a class presentation\n* For context, after much practice, it takes me about 1 hour per minute of presentation length, depending on the amount of polish.\n* My course lectures take about 4x the target duration.\n* General ideas are the same for all styles.\n\n## Audience\n\n* Think about who you are talking to\n - Statisticians?\n - Students?\n - Potential employer?\n - People with PhD's but in other disciplines?\n - Your grandma?\n \n* Regardless of the audience, I think of dividing the talk roughly in 3rds.\n\n\n## (Your audience for your in-class talk) {background-color=\"#86D4FE\"}\n\n* 2/3 of the time, [the client]{.secondary}. \n* You're teaching them this topic.\n* Think \"someone who took 1 or 2 classes in statistics\"\n* 1/3 of the time, [your classmates]{.secondary}. \n* What are the details they need to know that they don't know?\n* Think carefully how to structure for that breakdown.\n\n\n\n---\n\n### 1. \nTalk to your grandma. Why are you listening to me? Why is what I'm saying interesting?\n\n### 2. \nTalk to your audience. What have I done that you should be able to do at the end?\n\n### 3. \nTalk [slightly]{.secondary} over your audience. Why should you be impressed by me?\n\n. . .\n\nPart 3 is shorter depending on the Audience.\n\n\n\n## Content \n\nEach part is a little mini-talk\n\n1. Starts with the general idea\n2. Develops a few details. [Strategy:]{.secondary} problem/solution or question/answer\n3. Ends with a takeaway\n\nBut these parts are [recalibrated]{.fourth-colour} to the audience.\n\n* Your Grandma doesn't want to see math.\n* Your employer might, but doesn't want to hear about $\\sigma$-fields. \n* Statisticians don't want to see proofs (but might want a sketch).\n* `...`\n\n\n## Story structure\n\n::: {.callout-note}\n## What I often see...\nOnce upon a time, a young MSc student went into the woods of theory and found some trees. \n\nFirst they looked at one tree, it was oak. \n\nThen the looked at the next tree, it was maple. \n\nThen they wondered if trees could talk. \n\nAfter three months of wandering, they saw a house...\n:::\n\n. . .\n\n::: {.callout-important}\n## The attention grabber\nAxe-wielding woodsman saves student from wolf attack!\n:::\n\n## Better structure\n\n1. (Enough details to give the headline.)\n1. Headline result.\n1. How do we know the result is real. What are the details of \ncomputation, inference, methodology.\n1. Demonstration with empirics.\n\n## You should consider...\n\n* Attention span diminishes quickly. \n* What are the 3-5 takeaways?\n* Hit your main result at the beginning: this is what I can do that I couldn't before.\n\n## The ideal map\n\nMap out what you've done.\n\n* What did you find?\n* What are the implications? \n* Why does audience care?\n* How do we do it?\n\n. . .\n\nAvoid wandering in the wilderness:\n\n1. First we did this;\n1. But that didn't work, so we tried ...\n1. But then we added ...\n1. Finally we got to the beach ...\n1. And the water was nice ...\n\n\n# Good resource\n\nProf. 
Trevor Campbell's [\"How to Explain Things\"](https://docs.google.com/presentation/d/13vwchlzQAZjjfiI3AiBC_kM-syI6GJKzbuZoLxgy1a4/edit?usp=sharing)\n\n# Dos and don'ts\n\n## Words\n\n:::: {.columns}\n::: {.column width=\"45%\"}\nToo many words on a slide is bad\n\n* Bullet points\n* Too densely concentrated are bad\n* Are bad\n* Are hard to focus on\n\n

\n\nEmpty space is your friend\n:::\n\n\n::: {.column width=\"55%\"}\n\n\nLorem markdownum et moras et ponendi odores, neu magna per! Tyria meo iungitur\nvidet, frigore terras rogis Anienis poteram, dant. His vallem arma corpore\nvident nunc nivibus [avus](http://mirantes.org/), dea. Spatium luce certa\ncupiunt, lina. [Amabam](http://www.sub.com/satisego) opem, Iovis fecundaque et\nparum.\n\nAede virum annis audit modo: meus ramis videri: nec quod insidiisque Aonio\ntenuem, AI. Trames Iason: nocent hortatus lacteus praebita\n[paternos](http://ex.net/milledubitavit) petit, Paridis **aptus prius ut** origo\nfuriisque. Mercibus sis nullo aliudve Amathunta sufficit ululatibus,\npraevalidusque segnis *et* Dryopen. \n:::\n::::\n\n## Images\n\n:::: {.columns}\n::: {.column width=\"40%\"}\nPictures are good\n\n
\n\nFlow charts are good.\n\n
\n\n[Careful]{.fourth-colour} use of [colour]{.secondary} is [good]{.tertiary}.\n\n
\n\n### Size is good.\n\n
\n\n_too much variation is distracting_\n:::\n\n::: {.column width=\"60%\"}\n\n![](https://cdn.stocksnap.io/img-thumbs/960w/cat-kitten_LH77KFTY76.jpg)\n:::\n::::\n\n. . .\n\nHow long did you stare at the cat?\n\n\n# Personal preferences\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours1-1.png){fig-align='center'}\n:::\n:::\n\n\n. . .\n\n::: {.callout-important}\nDefaults are almost always terrible.\n:::\n\n\n## Issues with the preceding\n\n* Colours are awful\n* Grey background is distracting\n* Text size is too small\n* Legend position on the side is strange?\n* Numbers on the y-axis are nonesense\n* With barchart, y-axis should start at 0.\n* `.png` vs `.svg`\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours2-1.png){fig-align='center'}\n:::\n:::\n\n\n## Again\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours3-1.png){fig-align='center'}\n:::\n:::\n\n\n::: {.callout-tip}\nI like this, but ~10% of men are colour blind (including some faculty in this department). \n:::\n\n## Simulation\n\n![](gfx/colour-blind.png)\n\n## Jargon\n\n* Be wary of acronyms (MLE, BLUP, RKHS)\n* Again, think of your audience. MLE is fine for any statistician.\n* Others need definitions in words and written on the slide\n* Same for math notation $\\bar{X},\\ \\mu,\\ \\sigma,\\ \\mathbf{UDV}^\\top$\n* And for applied work e.g. SNP\n* [B]{.secondary}est [L]{.secondary}inear [U]{.secondary}nbiased [P]{.secondary}redictor\n\n---\n\n## {background-image=\"gfx/jargon.png\"}\n\n## Things I hate\n\n``{=html} Saying \"I'm not going to talk about ...\" ``{=html} \"I'm happy to discuss ... later if you'd like\".\n\n``{=html} Wiggling your laser pointer at every word. Highlight important things with pretty colours. Use pointer sparingly.\n\n``{=html} Playing with your collar, your pockets, your water bottle...\n\n``{=html} Staring at your slides ...\n\n``{=html} Displaying the total number of slides as in 6/85 in the lower right hand corner ...\n\n``{=html} Running over time. Skipping 6 slides to desperately make the time limit.\n\n``{=html} Using the default themes:\n\n## {background-image=\"gfx/beamer-crud.png\"}\n\n## Never use tables of numbers\n\n* Economists do this all the time for inexplicable reasons\n* I rarely put these in papers either\n* If I'm not going to talk about it, it doesn't go on the slide\n* There's no way I'm going to read off the number, certainly not to 4 decimal places\n* Use a graph\n\n## Use graphs, but\n\n* A graph with 3 dots should be a table of 3 numbers.\n* But why do you have only 3 numbers?\n* Any table can be a better graph.\n\n::: {.callout-tip}\n## Ask yourself:\nIs this the best way to display the data? Have I summarized too much? 
\n:::\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Things you should do\n\n``{=html} Number your slides\n\n``{=html} Have lots of prepared backup slides (details, answers to potential questions, further analysis)\n\n``{=html} Practice a lot. Practice in front of others. Practice the beginning more than the rest.\n\n``{=html} BE EXCITED. You worked hard on this. All results are cool. Play them up. You did something good and you want to tell everyone about how awesome you are. Own it. \n\n``{=html} Take credit. Say \"I showed this\" not \"It can be shown\".\n\n\n## Things that are debatable\n\n* Math talks tend to be \"chalkboard\"\n* CS talks tend to be \"sales pitch\"\n* Stats is in the middle.\n* I lean toward details with elements of salesmanship\n* If I hear your talk, I want to be able to \"do\" what you created. This is hard without some math. \n* This also colours my decisions about software.\n\n. . .\n\n::: {.callout-note}\nJeff Bezos banned Powerpoint from Amazon presentations\n:::\n\n## Closing suggestions\n\n### 1. Slow down\n\n* Get a bottle of water before the talk. \n* Drink it to pause on (pre-planned) key slides.\n* This will help you relax. \n* It will also give the audience a few seconds to get the hard stuff into their head.\n\n### 2. Cut back\n\n* Most of your slides probably have too many words.\n* And too much \"filler\" --> Kill the filler\n\n## Closing suggestions\n\n### 3. Try to move\n\n* It's good to move physically, engage the audience\n* Try to make eye contact with the whole room\n* Record yourself once to see if you do anything extraneous\n\n### 4. Have fun.\n\n\n## Example talks:\n\n1. [Teaching PCA](pca-intro.qmd)\n2. 
[Short research talk about Latent Infections](https://cmu-delphi.github.io/cdcflu-latent-infections/)\n", + "markdown": "---\nlecture: \"Giving presentations\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Structure\n\n\n1. Strategy (applies to papers too)\n\n1. Dos and don'ts\n\n1. Personal preferences\n\n\n\n# Strategy\n\n\n## Genre\n\nTalks take many forms (like papers)\n\n* Department seminar\n\n* Short conference presentation\n\n* Class lecture\n\n* `...`\n\nCalibrate your talk to the [Genre]{.secondary} and the [Audience]{.tertiary}\n\n. . .\n\n* A job talk takes much more work than a class presentation\n* For context, after much practice, it takes me about 1 hour per minute of presentation length, depending on the amount of polish.\n* My course lectures take about 4x the target duration.\n* General ideas are the same for all styles.\n\n## Audience\n\n* Think about who you are talking to\n - Statisticians?\n - Students?\n - Potential employer?\n - People with PhD's but in other disciplines?\n - Your grandma?\n \n* Regardless of the audience, I think of dividing the talk roughly in 3rds.\n\n\n## (Your audience for your in-class talk) {background-color=\"#86D4FE\"}\n\n* 2/3 of the time, [the client]{.secondary}. \n* You're teaching them this topic.\n* Think \"someone who took 1 or 2 classes in statistics\"\n* 1/3 of the time, [your classmates]{.secondary}. \n* What are the details they need to know that they don't know?\n* Think carefully how to structure for that breakdown.\n\n\n\n---\n\n### 1. \nTalk to your grandma. Why are you listening to me? Why is what I'm saying interesting?\n\n### 2. \nTalk to your audience. What have I done that you should be able to do at the end?\n\n### 3. \nTalk [slightly]{.secondary} over your audience. Why should you be impressed by me?\n\n. . .\n\nPart 3 is shorter depending on the Audience.\n\n\n\n## Content \n\nEach part is a little mini-talk\n\n1. Starts with the general idea\n2. Develops a few details. [Strategy:]{.secondary} problem/solution or question/answer\n3. Ends with a takeaway\n\nBut these parts are [recalibrated]{.fourth-colour} to the audience.\n\n* Your Grandma doesn't want to see math.\n* Your employer might, but doesn't want to hear about $\\sigma$-fields. 
\n* Statisticians don't want to see proofs (but might want a sketch).\n* `...`\n\n\n## Story structure\n\n::: {.callout-note}\n## What I often see...\nOnce upon a time, a young MSc student went into the woods of theory and found some trees. \n\nFirst they looked at one tree, it was oak. \n\nThen they looked at the next tree, it was maple. \n\nThen they wondered if trees could talk. \n\nAfter three months of wandering, they saw a house...\n:::\n\n. . .\n\n::: {.callout-important}\n## The attention grabber\nAxe-wielding woodsman saves student from wolf attack!\n:::\n\n## Better structure\n\n1. (Enough details to give the headline.)\n1. Headline result.\n1. How do we know the result is real? What are the details of \ncomputation, inference, methodology?\n1. Demonstration with empirics.\n\n## You should consider...\n\n* Attention span diminishes quickly. \n* What are the 3-5 takeaways?\n* Hit your main result at the beginning: this is what I can do that I couldn't before.\n\n## The ideal map\n\nMap out what you've done.\n\n* What did you find?\n* What are the implications? \n* Why does the audience care?\n* How do we do it?\n\n. . .\n\nAvoid wandering in the wilderness:\n\n1. First we did this;\n1. But that didn't work, so we tried ...\n1. But then we added ...\n1. Finally we got to the beach ...\n1. And the water was nice ...\n\n\n# Good resource\n\nProf. Trevor Campbell's [\"How to Explain Things\"](https://docs.google.com/presentation/d/13vwchlzQAZjjfiI3AiBC_kM-syI6GJKzbuZoLxgy1a4/edit?usp=sharing)\n\n# Dos and don'ts\n\n## Words\n\n:::: {.columns}\n::: {.column width=\"45%\"}\nToo many words on a slide is bad\n\n* Bullet points\n* Too densely concentrated are bad\n* Are bad\n* Are hard to focus on\n\n

\n\nEmpty space is your friend\n:::\n\n\n::: {.column width=\"55%\"}\n\n\nLorem markdownum et moras et ponendi odores, neu magna per! Tyria meo iungitur\nvidet, frigore terras rogis Anienis poteram, dant. His vallem arma corpore\nvident nunc nivibus [avus](http://mirantes.org/), dea. Spatium luce certa\ncupiunt, lina. [Amabam](http://www.sub.com/satisego) opem, Iovis fecundaque et\nparum.\n\nAede virum annis audit modo: meus ramis videri: nec quod insidiisque Aonio\ntenuem, AI. Trames Iason: nocent hortatus lacteus praebita\n[paternos](http://ex.net/milledubitavit) petit, Paridis **aptus prius ut** origo\nfuriisque. Mercibus sis nullo aliudve Amathunta sufficit ululatibus,\npraevalidusque segnis *et* Dryopen. \n:::\n::::\n\n## Images\n\n:::: {.columns}\n::: {.column width=\"40%\"}\nPictures are good\n\n
\n\nFlow charts are good.\n\n
\n\n[Careful]{.fourth-colour} use of [colour]{.secondary} is [good]{.tertiary}.\n\n
\n\n### Size is good.\n\n
\n\n_too much variation is distracting_\n:::\n\n::: {.column width=\"60%\"}\n\n![](https://cdn.stocksnap.io/img-thumbs/960w/cat-kitten_LH77KFTY76.jpg)\n:::\n::::\n\n. . .\n\nHow long did you stare at the cat?\n\n\n# Personal preferences\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours1-1.png){fig-align='center'}\n:::\n:::\n\n\n. . .\n\n::: {.callout-important}\nDefaults are almost always terrible.\n:::\n\n\n## Issues with the preceding\n\n* Colours are awful\n* Grey background is distracting\n* Text size is too small\n* Legend position on the side is strange?\n* Numbers on the y-axis are nonsense\n* With a barchart, the y-axis should start at 0.\n* `.png` vs `.svg`\n\n## Graphics\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours2-1.png){fig-align='center'}\n:::\n:::\n\n\n## Again\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/colours3-1.png){fig-align='center'}\n:::\n:::\n\n\n::: {.callout-tip}\nI like this, but ~10% of men are colour blind (including some faculty in this department). \n:::\n\n## Simulation\n\n![](gfx/colour-blind.png)\n\n## Jargon\n\n* Be wary of acronyms (MLE, BLUP, RKHS)\n* Again, think of your audience. MLE is fine for any statistician.\n* Others need definitions in words and written on the slide\n* Same for math notation $\\bar{X},\\ \\mu,\\ \\sigma,\\ \\mathbf{UDV}^\\top$\n* And for applied work, e.g. SNP\n* [B]{.secondary}est [L]{.secondary}inear [U]{.secondary}nbiased [P]{.secondary}redictor\n\n---\n\n## {background-image=\"gfx/jargon.png\"}\n\n## Things I hate\n\n* Saying \"I'm not going to talk about ...\" followed by \"I'm happy to discuss ... later if you'd like\".\n\n* Wiggling your laser pointer at every word. Highlight important things with pretty colours. Use the pointer sparingly.\n\n* Playing with your collar, your pockets, your water bottle...\n\n* Staring at your slides ...\n\n* Displaying the total number of slides as in 6/85 in the lower right hand corner ...\n\n* Running over time. Skipping 6 slides to desperately make the time limit.\n\n* Using the default themes:\n\n## {background-image=\"gfx/beamer-crud.png\"}\n\n## Never use tables of numbers\n\n* Economists do this all the time for inexplicable reasons\n* I rarely put these in papers either\n* If I'm not going to talk about it, it doesn't go on the slide\n* There's no way I'm going to read off the number, certainly not to 4 decimal places\n* Use a graph\n\n## Use graphs, but\n\n* A graph with 3 dots should be a table of 3 numbers.\n* But why do you have only 3 numbers?\n* Any table can be a better graph.\n\n::: {.callout-tip}\n## Ask yourself:\nIs this the best way to display the data? Have I summarized too much?\n:::\n
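In that spirit, a minimal sketch (made-up numbers, hypothetical method names) of replacing a small table of numbers with a graph:

```{.r}
library(ggplot2)

# a typical "table of numbers" slide, values hypothetical
res <- data.frame(
  method = c("lasso", "ridge", "OLS"),
  rmse   = c(1.02, 1.11, 1.34)
)

# the same information, but comparisons are immediate
ggplot(res, aes(rmse, reorder(method, rmse))) +
  geom_point(size = 3) +
  labs(x = "RMSE", y = NULL)
```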
\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Example: Made up simulation results\n\nRan 50 simulations.\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](presentations_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Things you should do\n\n* Number your slides\n\n* Have lots of prepared backup slides (details, answers to potential questions, further analysis)\n\n* Practice a lot. Practice in front of others. Practice the beginning more than the rest.\n\n* BE EXCITED. You worked hard on this. All results are cool. Play them up. You did something good and you want to tell everyone about how awesome you are. Own it. \n\n* Take credit. Say \"I showed this\" not \"It can be shown\".\n\n\n## Things that are debatable\n\n* Math talks tend to be \"chalkboard\"\n* CS talks tend to be \"sales pitch\"\n* Stats is in the middle.\n* I lean toward details with elements of salesmanship\n* If I hear your talk, I want to be able to \"do\" what you created. This is hard without some math. \n* This also colours my decisions about software.\n\n. . .\n\n::: {.callout-note}\nJeff Bezos banned PowerPoint from Amazon presentations\n:::\n\n## Closing suggestions\n\n### 1. Slow down\n\n* Get a bottle of water before the talk. \n* Drink it to pause on (pre-planned) key slides.\n* This will help you relax. \n* It will also give the audience a few seconds to get the hard stuff into their head.\n\n### 2. Cut back\n\n* Most of your slides probably have too many words.\n* And too much \"filler\" --> Kill the filler\n\n## Closing suggestions\n\n### 3. Try to move\n\n* It's good to move physically, engage the audience\n* Try to make eye contact with the whole room\n* Record yourself once to see if you do anything extraneous\n\n### 4. Have fun.\n\n\n## Example talks:\n\n1. [Teaching PCA](pca-intro.qmd)\n2. 
[Short research talk about Latent Infections](https://cmu-delphi.github.io/cdcflu-latent-infections/)\n", "supporting": [ "presentations_files" ], diff --git a/_freeze/schedule/slides/presentations/figure-revealjs/colours1-1.png b/_freeze/schedule/slides/presentations/figure-revealjs/colours1-1.png index 6a433e5..87e998d 100644 Binary files a/_freeze/schedule/slides/presentations/figure-revealjs/colours1-1.png and b/_freeze/schedule/slides/presentations/figure-revealjs/colours1-1.png differ diff --git a/_freeze/schedule/slides/presentations/figure-revealjs/colours2-1.png b/_freeze/schedule/slides/presentations/figure-revealjs/colours2-1.png index 14b691b..b9a71b2 100644 Binary files a/_freeze/schedule/slides/presentations/figure-revealjs/colours2-1.png and b/_freeze/schedule/slides/presentations/figure-revealjs/colours2-1.png differ diff --git a/_freeze/schedule/slides/syllabus/execute-results/html.json b/_freeze/schedule/slides/syllabus/execute-results/html.json index 6f5543b..fd66146 100644 --- a/_freeze/schedule/slides/syllabus/execute-results/html.json +++ b/_freeze/schedule/slides/syllabus/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "2dbb1f6a639e04c9de75a1818bccb758", "result": { - "markdown": "---\nlecture: \"Introduction and Second half pivot\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n## About me\n\n\n::::{.columns}\n:::{.column width=\"50%\"}\n- Daniel J. McDonald\n- daniel@stat.ubc.ca \n- \n- Associate Professor, Department of Statistics \n:::\n:::{.column width=\"50%\"}\n* Moved to UBC in mid-March 2020, 2 days before the border closed\n* Previously a Stats Prof at Indiana University for 8 years\n:::\n::::\n\n. . .\n\n![](https://weareiu.com/wp-content/uploads/2018/12/map-1.png){fig-align=\"center\" width=\"50%\"}\n\n\n## No More Canvas!!\n\nSee the website: \n\n\n\n
\n
\n
\n\nYou'll find\n\n* announcements\n* schedule\n* lecture slides / notes\n\n(Grades still on Canvas)\n\n\n\n## Course communication\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n### Website:\n\n\n\n* Hosted on GitHub.\n\n* Links to slides and all materials\n\n* Syllabus is there. Be sure to read it. (same idea as before)\n\n\n### Slack:\n\n* This is our discussion board.\n\n* Note that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\n\n* We'll use a Channel in the UBC-Stat Workspace\n:::\n\n::: {.column width=\"50%\"}\n\n### Github organization\n\n* Linked from the website.\n\n* This is where you complete/submit assignments/projects/in-class-work\n\n* This is also hosted on US servers \n:::\n::::\n\n## Why these?\n\n1. Yes, some data is hosted on servers in the US.\n1. But in the real world, no one uses Canvas/Piazza, so why not learn things they do use?\n1. Canvas is dumb and hard to organize.\n1. GitHub is free and actually useful.\n1. Much easier to communicate, \"grade\" or comment on your work\n1. Much more DS friendly\n1. Note that MDS uses both of these, the department uses both, etc.\n1. More on all this later.\n\n. . .\n\nSlack help from MDS --- [features](https://ubc-mds.github.io/resources_pages/slack/) and [rules](https://ubc-mds.github.io/resources_pages/slack_asking_for_help/)\n\n\n## What are the goals of Stat 550?\n\n
\n\n### 1. Prepare you to do the consulting practicum (Stat 551)\n\n\n

\n\n### 2. You're a captive audience, so I can teach you some skills you'll need for \n \n* MSc Thesis/Project or PhD research\n* Employment in Data Science / Statistics.\n* These are often things that will help with the first as well\n\n## 1. Prepare you for the consulting practicum (Stat 551)\n\n* understand how the data was collected\n\n* implications of the collection process for analysis\n\n* organize data for analysis\n\n* determine appropriate methods for analysis that [answer's the client's questions]{.secondary}\n\n* interpret the results\n\n* present and communicate the results\n\n::: {.notes}\n* In most courses you get nice clean data. Getting to \"nice clean data\" is non-trivial\n* In most courses things are \"IID\", negligible missingness\n* Usually, the question is formed in statistical langauge, here, you are responsible for \"translating\"\n* Interpretation has to be \"translated back\"\n* Presentation skills --- important everywhere\n:::\n\n## 2. Some skills you'll need\n\n* Version control\n* Reproducible reports\n* Writing experience: genre is important\n* Presentation skills\n* Better coding practice\n* Documentation\n\n## Computing\n\n* All work done in R/RMarkdown. \n* No you can't use Python. Or Stata or SPSS.\n* No you can't use Jupyter Notebooks.\n* All materials on Github. \n* You will learn to use Git/GitHub/RStudio/Rmarkdown.\n* Slack for discussion/communication\n\n## Getting setup\n\n1. Add to Slack Channel: \n2. Github account: \n3. Add to the Github Org --- tell me your account\n4. RStudio synchronization\n\n\n", + "markdown": "---\nlecture: \"Introduction and Second half pivot\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n## About me\n\n\n::::{.columns}\n:::{.column width=\"50%\"}\n- Daniel J. McDonald\n- daniel@stat.ubc.ca \n- \n- Associate Professor, Department of Statistics \n:::\n:::{.column width=\"50%\"}\n* Moved to UBC in mid-March 2020, 2 days before the border closed\n* Previously a Stats Prof at Indiana University for 8 years\n:::\n::::\n\n. . .\n\n![](https://weareiu.com/wp-content/uploads/2018/12/map-1.png){fig-align=\"center\" width=\"50%\"}\n\n\n## No More Canvas!!\n\nSee the website: \n\n\n\n
\n
\n
\n\nYou'll find\n\n* announcements\n* schedule\n* lecture slides / notes\n\n(Grades still on Canvas)\n\n\n\n## Course communication\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n### Website:\n\n\n\n* Hosted on GitHub.\n\n* Links to slides and all materials\n\n* Syllabus is there. Be sure to read it. (same idea as before)\n\n\n### Slack:\n\n* This is our discussion board.\n\n* Note that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\n\n* We'll use a Channel in the UBC-Stat Workspace\n:::\n\n::: {.column width=\"50%\"}\n\n### Github organization\n\n* Linked from the website.\n\n* This is where you complete/submit assignments/projects/in-class-work\n\n* This is also hosted on US servers \n:::\n::::\n\n## Why these?\n\n1. Yes, some data is hosted on servers in the US.\n1. But in the real world, no one uses Canvas/Piazza, so why not learn things they do use?\n1. Canvas is dumb and hard to organize.\n1. GitHub is free and actually useful.\n1. Much easier to communicate, \"grade\" or comment on your work\n1. Much more DS friendly\n1. Note that MDS uses both of these, the department uses both, etc.\n1. More on all this later.\n\n. . .\n\nSlack help from MDS --- [features](https://ubc-mds.github.io/resources_pages/slack/) and [rules](https://ubc-mds.github.io/resources_pages/slack_asking_for_help/)\n\n\n## What are the goals of Stat 550?\n\n
\n\n### 1. Prepare you to do the consulting practicum (Stat 551)\n\n\n

\n\n### 2. You're a captive audience, so I can teach you some skills you'll need for \n \n* MSc Thesis/Project or PhD research\n* Employment in Data Science / Statistics.\n* These are often things that will help with the first as well\n\n## 1. Prepare you for the consulting practicum (Stat 551)\n\n* understand how the data was collected\n\n* implications of the collection process for analysis\n\n* organize data for analysis\n\n* determine appropriate methods for analysis that [answer the client's questions]{.secondary}\n\n* interpret the results\n\n* present and communicate the results\n\n::: {.notes}\n* In most courses you get nice clean data. Getting to \"nice clean data\" is non-trivial\n* In most courses things are \"IID\", negligible missingness\n* Usually, the question is formed in statistical language; here, you are responsible for \"translating\"\n* Interpretation has to be \"translated back\"\n* Presentation skills --- important everywhere\n:::\n\n## 2. Some skills you'll need\n\n* Version control\n* Reproducible reports\n* Writing experience: genre is important\n* Presentation skills\n* Better coding practice\n* Documentation\n\n## Computing\n\n* All work done in R/RMarkdown. \n* No, you can't use Python. Or Stata or SPSS.\n* No, you can't use Jupyter Notebooks.\n* All materials on Github. \n* You will learn to use Git/GitHub/RStudio/Rmarkdown.\n* Slack for discussion/communication\n\n## Getting set up\n\n1. Add to Slack Channel: \n2. Github account: \n3. Add to the Github Org --- tell me your account\n4. RStudio synchronization\n\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/schedule/slides/time-series/execute-results/html.json b/_freeze/schedule/slides/time-series/execute-results/html.json index d6c3839..22e0280 100644 --- a/_freeze/schedule/slides/time-series/execute-results/html.json +++ b/_freeze/schedule/slides/time-series/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "3b6dd1150d54fab497660b3d66b6df06", "result": { - "markdown": "---\nlecture: \"Time series, a whirlwind\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n## The general linear process\n\n* Imagine that there is a noise process\n\n$$\\epsilon_j \\sim \\textrm{N}(0, 1),\\ \\textrm{i.i.d.}$$\n\n* At time $i$, we observe the sum of all past noise\n\n$$y_i = \\sum_{j=-\\infty}^0 a_{i+j} \\epsilon_j$$\n\n* Without some conditions on $\\{a_k\\}_{k=-\\infty}^0$ this process 
will \"run away\"\n* The result is \"non-stationary\" and difficult to analyze.\n* Stationary means (roughly) that the marginal distribution of $y_i$ does not change with $i$.\n\n## Chasing stationarity\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 1000\nnseq <- 5\ngenerate_ar <- function(n, b) {\n y <- double(n)\n y[1] <- rnorm(1)\n for (i in 2:n) y[i] <- b * y[i - 1] + rnorm(1)\n tibble(time = 1:n, y = y)\n}\nstationary <- map(1:nseq, ~ generate_ar(n, .99)) |> list_rbind(names_to = \"id\")\nnon_stationary <- map(1:nseq, ~ generate_ar(n, 1.01)) |>\n list_rbind(names_to = \"id\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Uses of stationarity\n\n* Lots of types (weak, strong, in-mean, wide-sense,...)\n* not required for modelling / forecasting\n* But assuming stationarity gives some important guarantees\n* Usually work with stationary processes\n\n## Standard models {.smaller}\n\n### AR(p)\n\nSuppose $\\epsilon_i$ are i.i.d. N(0, 1) (distn is convenient, but not required)\n\n$$y_i = \\mu + a_1 y_{i-1} + \\cdots + a_p y_{i-p} + \\epsilon_i$$\n\n* This is a special case of the general linear process\n* You can recursively substitute this defn into itself to get that equation\n\nEasy to estimate the `a`'s given a realization. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ar = c(.7, -.1)), n = 1000)\nY <- y[3:1000]\nX <- cbind(lag1 = y[2:999], lag2 = y[1:998])\nsummary(lm(Y ~ X + 0))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Y ~ X + 0)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6164 -0.6638 0.0271 0.6456 3.8367 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nXlag1 0.66931 0.03167 21.134 <2e-16 ***\nXlag2 -0.04856 0.03167 -1.533 0.126 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.9899 on 996 degrees of freedom\nMultiple R-squared: 0.4085,\tAdjusted R-squared: 0.4073 \nF-statistic: 344 on 2 and 996 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n## AR(p)\n\n* The estimate isn't [that]{.secondary} accurate because the residuals (not the $\\epsilon$'s) are correlated. \n* (Usually, you get `1/n` convergence, here you don't.)\n* Also, this isn't the MLE. The likelihood includes $p(y_1)$, $p(y_2 | y_1)$ which `lm()` ignored.\n* The Std. Errors are unjustified.\n* But that was easy to do.\n* The correct way is\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(2, 0, 0), include.mean = FALSE)\n\nCoefficients:\n ar1 ar2\n 0.6686 -0.0485\ns.e. 0.0316 0.0316\n\nsigma^2 estimated as 0.9765: log likelihood = -1407.34, aic = 2820.67\n```\n:::\n:::\n\n\n* The resulting estimates and SEs are identical, AFAICS.\n\n## MA(q)\n\nStart with the general linear process, but truncate the infinite sum.\n\n$$y_i = \\sum_{j=-q}^0 a_{i+j} \\epsilon_j$$\n\n* This is termed a \"moving average\" process.\n* though $a_0 + \\cdots a_{-q}$ don't sum to 1.\n* Can't write this easily as a `lm()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ma = c(.9, .6, .1)), n = 1000)\narima(y, c(0, 0, 3), include.mean = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(0, 0, 3), include.mean = FALSE)\n\nCoefficients:\n ma1 ma2 ma3\n 0.9092 0.6069 0.1198\ns.e. 
0.0313 0.0380 0.0311\n\nsigma^2 estimated as 0.8763: log likelihood = -1353.41, aic = 2714.82\n```\n:::\n:::\n\n\n## MA(q) as an AR(1) hidden process\n\nLet $X_j = [\\epsilon_{j-1},\\ \\ldots,\\ \\epsilon_{j-q}]$ and write\n\n$$\n\\begin{aligned}\nX_i &= \\begin{bmatrix} a_{i-1} & a_{i-2} & \\cdots & a_{i-q}\\\\ 1 & 0 & \\cdots & 0\\\\ & & \\ddots \\\\ 0 & 0 & \\cdots & 1\\end{bmatrix} X_{i-1} + \n\\begin{bmatrix} a_{i}\\\\ 0 \\\\ \\vdots \\\\ 0 \\end{bmatrix} \\epsilon_i\\\\\ny_i &= \\begin{bmatrix} 1 & 0 & \\cdots 0 \\end{bmatrix} X_i\n\\end{aligned}\n$$\n\n* Now $X$ is a $q$-dimensional AR(1) (but we don't see it)\n* $y$ is deterministic conditional on $X$\n* This is the usual way these are estimated using a State-Space Model\n* Many time series models have multiple equivalent representations\n\n## AR[I]{.secondary}MA\n\n* We've been using `arima()` and `arima.sim()`, so what is left?\n* The \"I\" means \"integrated\"\n* If, for example, we can write $z_i = y_i - y_{i-1}$ and $z$ follows an ARMA(p, q), we say $y$ follows an ARIMA(p, 1, q).\n* The middle term is the degree of differencing\n\n## Other standard models\n\nSuppose we can write \n\n$$\ny_i = T_i + S_i + W_i\n$$\n\nThis is the \"classical\" decomposition of $y$ into a Trend + Seasonal + Noise.\n\nYou can estimate this with a \"Basic Structural Time Series Model\" using `StrucTS()`.\n\nA related, though slightly different model is called the STL decomposition, estimated with `stl()`.\n\nThis is \"Seasonal Decomposition of Time Series by Loess\"\n\n(LOESS is \"locally estimated scatterplot smoothing\" named/proposed independently by Bill Cleveland though originally proposed about 15 years earlier and called the Savitsky-Golay Filter)\n\n## Quick example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsts <- StructTS(AirPassengers)\nbc <- stl(AirPassengers, \"periodic\") # use sin/cos to represent the seasonal\ntibble(\n time = seq(as.Date(\"1949-01-01\"), as.Date(\"1960-12-31\"), by = \"month\"),\n AP = AirPassengers, \n StrucTS = fitted(sts)[, 1], \n STL = rowSums(bc$time.series[, 1:2])\n) |>\n pivot_longer(-time) |>\n ggplot(aes(time, value, color = name)) +\n geom_line() +\n theme_bw(base_size = 24) +\n scale_color_viridis_d(name = \"\") +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Generic state space model\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n


\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\n:::\n::: {.column width=\"50%\"}\n\n* $x_k$ is unobserved, dimension $n$\n\n* $y_k$ is observed, dimension $m$\n\n* $x$ process is the [transition]{.tertiary} or [process]{.tertiary} equation\n\n* $y$ is the [observation]{.tertiary} or [measurement]{.tertiary} equation\n\n* Both are probability distributions that can depend on parameters $\\theta$\n\n* For now, assume $\\theta$ is [KNOWN]{.secondary}\n\n* We can allow the densities to vary with time.\n\n:::\n::::\n\n\n\n## GOAL(s)\n\n1. Filtering: given observations, find $$p(x_k | y_1,\\ldots y_k)$$\n\n1. Smoothing: given observations, find $$p(x_k | y_1,\\ldots y_T), \\;\\;\\ k < T$$\n\n1. Forecasting: given observations, find $$p(y_{k+1} | y_1,\\ldots,y_k)$$\n\n\n## Using Bayes Rule\n\nAssume $p(x_0)$ is known\n\n$$\n\\begin{aligned}\np(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T) &= \\prod_{k=1}^T p(y_k | x_k)\\\\\np(x_0,\\ldots,x_T) &= p(x_0) \\prod_{k=1}^T p(x_k | x_{k-1})\\\\\np(x_0,\\ldots,x_T\\ |\\ y_1,\\ldots,y_T) &= \\frac{p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)}{p(y_1,\\ldots,y_T)}\\\\ &\\propto p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)\\end{aligned}\n$$\n\nIn principle, if things are nice, you can compute this posterior (thinking of $x$ as unknown parameters)\n\nBut in practice, computing a big multivariate posterior like this is computationally ill-advised.\n\n\n\n## Generic filtering\n\n* Recursively build up $p(x_k | y_1,\\ldots y_k)$.\n\n* Why? Because if we're collecting data in real time, this is all we need to make forecasts for future data.\n\n$$\\begin{aligned} &p(y_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1}, y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | x_T) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n* Can continue to iterate if I want to predict $h$ steps ahead\n\n$$\\begin{aligned} &p(y_{T+h} | y_1,\\ldots,y_T)= p(y_{T+h} | x_{T+h} )\\prod_{j=0}^{h-1} p(x_{T+j+1} | x_{T+j}) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n\n\n## The filtering recursion\n\n1. Initialization. Fix $p(x_0)$.\n\nIterate the following for $k=1,\\ldots,T$:\n\n2. Predict. $$p(x_k | y_{k-1}) = \\int p(x_k | x_{k-1}) p(x_{k-1} | y_1,\\ldots, y_{k-1})dx_{k-1}.$$\n\n3. Update. 
$$p(x_k | y_1,\\ldots,y_k) = \\frac{p(y_k | x_k)p(x_k | y_1,\\ldots,y_{k-1})}{p(y_1,\\ldots,y_k)}$$\n\n\nIn general, this is somewhat annoying because these integrals may be challenging to solve.\n\nBut with some creativity, we can use Monte Carlo for everything.\n\n\n\n## What if we make lots of assumptions?\n\nAssume that $$\\begin{aligned}p(x_0) &= N(m_0, P_0) \\\\ p_k(x_k\\ |\\ x_{k-1}) &= N(A_{k-1}x_{k-1},\\ Q_{k-1})\\\\ p_k(y_k\\ |\\ x_k) &= N(H_k x_k,\\ R_k)\\end{aligned}.$$\n\nThen [all the ugly integrals have closed-form representations]{.secondary} by properties of conditional Gaussian distributions.\n\n## Closed-form representations\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\nDistributions:\n\n$$\n\\begin{aligned}\np(x_k | y_1,\\ldots,y_{k-1}) &= N(m^{-}_k, P^{-}_k)\\\\\np(x_k | y_1,\\ldots,y_{k}) &= N(m_k, P_k)\\\\\np(y_{k} | y_1,\\ldots,y_{k-1}) &= N(H_k m^-_k, S_k)\\\\\n\\end{aligned}\n$$\nPrediction:\n$$\n\\begin{aligned}\nm^-_k &= A_{k-1}m_{k-1}\\\\\nP^-_k &= A_{k-1}P_{k-1}A^\\mathsf{T}_{k-1} + Q_{k-1}\n\\end{aligned}\n$$\n\n:::\n::: {.column width=\"50%\"}\n\nUpdate:\n$$\n\\begin{aligned}\nv_k &= y_k - H_k m_k^-\\\\\nS_k &= H_k P_k^- H_k^\\mathsf{T} + R_k\\\\\nK_k &= P^-_k H_k^\\mathsf{T} S_k^{-1}\\\\\nm_k &= m^-_k + K_{k}v_{k}\\\\\nP_k &= P^-_k - K_k S_k K_k^\\mathsf{T}\n\\end{aligned}\n$$\n\n:::\n::::\n\n\n\n## Code or it isn't real (Kalman Filter)\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nkalman <- function(y, m0, P0, A, Q, H, R) {\n n <- length(y)\n m <- double(n + 1)\n P <- double(n + 1)\n m[1] <- m0\n P[1] <- P0\n for (k in seq(n)) {\n mm <- A * m[k]\n Pm <- A * P[k] * A + Q\n v <- y[k] - H * mm\n S <- H * Pm * H + R\n K <- Pm * H / S\n m[k + 1] <- mm + K * v\n P[k + 1] <- Pm - K * S * K\n }\n tibble(t = 1:n, m = m[-1], P = P[-1])\n}\n\nset.seed(2022 - 06 - 01)\nx <- double(100)\nfor (k in 2:100) x[k] <- x[k - 1] + rnorm(1)\ny <- x + rnorm(100, sd = 1)\nkf <- kalman(y, 0, 5, 1, 1, 1, 1)\n```\n:::\n\n\n:::\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n::::\n\n\n## Important notes\n\n* So far, we assumed all parameters were known.\n* In reality, we had 6: `m0, P0, A, Q, H, R`\n* I sort of also think of `x` as \"parameters\" in the Bayesian sense\n* By that I mean, \"latent variables for which we have prior distributions\"\n* What if we want to estimate them?\n\n[Bayesian way]{.tertiary}: `m0` and `P0` are already the parameters of for the prior on `x1`. Put priors on the other 4.\n\n[Frequentist way]{.tertiary}: Just maximize the likelihood. Can technically take \n`P0` $\\rightarrow\\infty$ to remove it and `m0`\n\n. . .\n\nThe Likelihood is produced as a by-product of the Kalman Filter. \n\n$$-\\ell(\\theta) = \\sum_{k=1}^T \\left(v_k^\\mathsf{T}S_k^{-1}v_k + \\log |S_k| + m \\log 2\\pi\\right)$$\n\n\n\n## Smoothing \n\n* We also want $p(x_k | y_1,\\ldots,y_{T})$\n* Filtering went \"forward\" in time. At the end we got, \n$p(x_T | y_1,\\ldots,y_{T})$. Smoothing starts there and goes \"backward\"\n* For \"everything linear Gaussian\", this is again \"easy\"\n* Set $m_T^s = m_T$, $P_T^s = P_T$. 
\n* For $k = T-1,\\ldots,1$, \n\n\n$$\\begin{aligned}\nG_k &= P_k A_k^\\mathsf{T} [P_{k+1}^-]^{-1}\\\\\nm_k^s &= m_k + G_k(m_{k+1}^s - m_{k+1}^-)\\\\\nP_k^s &= P_k + G_k(P_{k+1}^s - P_{k+1}^-)G_k^\\mathsf{T}\\\\\nx_k | y_1,\\ldots,y_T &= N(m^s_k, P_k^s)\n\\end{aligned}$$\n\n\n## Comparing the filter and the smoother\n\n* Same data, different code (using a package)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(FKF)\nfilt <- fkf(\n a0 = 0, P0 = matrix(5), dt = matrix(0), ct = matrix(0),\n Tt = matrix(1), Zt = matrix(1), HHt = matrix(1), GGt = matrix(1),\n yt = matrix(y, ncol = length(y))\n)\nsmo <- fks(filt)\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-smooth-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## What about non-linear and/or non-Gaussian\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\nThen we need to solve integrals. This is a pain. We approximate them. \n\nThese all give [approximations to the filtering distribution]{.secondary}\n\n* Extended Kalman filter - basically do a Taylor approximation, then do Kalman like\n* Uncented Kalman filter - Approximate integrals with Sigma points\n* Particle filter - Sequential Monte Carlo\n* Bootstrap filter (simple version of SMC)\n* Laplace Gaussian filter - Do a Laplace approximation to the distributions\n\n\n## The bootstrap filter\n\n* Need to **simulate** from the transition distribution (`rtrans`)\n* Need to **evaluate** the observation distribution (`dobs`)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nboot_filter <-\n function(y, B = 1000, rtrans, dobs, a0 = 0, P0 = 1, perturb = function(x) x) {\n n <- length(y)\n filter_est <- matrix(0, n, B)\n predict_est <- matrix(0, n, B)\n init <- rnorm(B, a0, P0)\n filter_est[1, ] <- init\n for (i in seq(n)) {\n raw_w <- dobs(y[i], filter_est[i, ])\n w <- raw_w / sum(raw_w)\n selection <- sample.int(B, replace = TRUE, prob = w)\n filter_est[i, ] <- perturb(filter_est[i, selection])\n predict_est[i, ] <- rtrans(filter_est[i, ])\n if (i < n) filter_est[i + 1, ] <- predict_est[i, ]\n }\n list(filt = filter_est, pred = predict_est)\n }\n```\n:::\n", + "markdown": "---\nlecture: \"Time series, a whirlwind\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n## The general linear process\n\n* Imagine that there is a noise 
process\n\n$$\\epsilon_j \\sim \\textrm{N}(0, 1),\\ \\textrm{i.i.d.}$$\n\n* At time $i$, we observe the sum of all past noise\n\n$$y_i = \\sum_{j=-\\infty}^0 a_{i+j} \\epsilon_j$$\n\n* Without some conditions on $\\{a_k\\}_{k=-\\infty}^0$ this process will \"run away\"\n* The result is \"non-stationary\" and difficult to analyze.\n* Stationary means (roughly) that the marginal distribution of $y_i$ does not change with $i$.\n\n## Chasing stationarity\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 1000\nnseq <- 5\ngenerate_ar <- function(n, b) {\n y <- double(n)\n y[1] <- rnorm(1)\n for (i in 2:n) y[i] <- b * y[i - 1] + rnorm(1)\n tibble(time = 1:n, y = y)\n}\nstationary <- map(1:nseq, ~ generate_ar(n, .99)) |> list_rbind(names_to = \"id\")\nnon_stationary <- map(1:nseq, ~ generate_ar(n, 1.01)) |>\n list_rbind(names_to = \"id\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Uses of stationarity\n\n* Lots of types (weak, strong, in-mean, wide-sense,...)\n* not required for modelling / forecasting\n* But assuming stationarity gives some important guarantees\n* Usually work with stationary processes\n\n## Standard models {.smaller}\n\n### AR(p)\n\nSuppose $\\epsilon_i$ are i.i.d. N(0, 1) (distn is convenient, but not required)\n\n$$y_i = \\mu + a_1 y_{i-1} + \\cdots + a_p y_{i-p} + \\epsilon_i$$\n\n* This is a special case of the general linear process\n* You can recursively substitute this defn into itself to get that equation\n\nEasy to estimate the `a`'s given a realization. \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ar = c(.7, -.1)), n = 1000)\nY <- y[3:1000]\nX <- cbind(lag1 = y[2:999], lag2 = y[1:998])\nsummary(lm(Y ~ X + 0))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Y ~ X + 0)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6164 -0.6638 0.0271 0.6456 3.8367 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nXlag1 0.66931 0.03167 21.134 <2e-16 ***\nXlag2 -0.04856 0.03167 -1.533 0.126 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.9899 on 996 degrees of freedom\nMultiple R-squared: 0.4085,\tAdjusted R-squared: 0.4073 \nF-statistic: 344 on 2 and 996 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n## AR(p)\n\n* The estimate isn't [that]{.secondary} accurate because the residuals (not the $\\epsilon$'s) are correlated. \n* (Usually, you get `1/n` convergence, here you don't.)\n* Also, this isn't the MLE. The likelihood includes $p(y_1)$, $p(y_2 | y_1)$ which `lm()` ignored.\n* The Std. Errors are unjustified.\n* But that was easy to do.\n* The correct way is\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(2, 0, 0), include.mean = FALSE)\n\nCoefficients:\n ar1 ar2\n 0.6686 -0.0485\ns.e. 
0.0316 0.0316\n\nsigma^2 estimated as 0.9765: log likelihood = -1407.34, aic = 2820.67\n```\n:::\n:::\n\n\n* The resulting estimates and SEs are identical, AFAICS.\n\n## MA(q)\n\nStart with the general linear process, but truncate the infinite sum.\n\n$$y_i = \\sum_{j=-q}^0 a_{i+j} \\epsilon_j$$\n\n* This is termed a \"moving average\" process.\n* though $a_0 + \\cdots + a_{-q}$ needn't sum to 1.\n* Can't write this easily as an `lm()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny <- arima.sim(list(ma = c(.9, .6, .1)), n = 1000)\narima(y, c(0, 0, 3), include.mean = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\narima(x = y, order = c(0, 0, 3), include.mean = FALSE)\n\nCoefficients:\n ma1 ma2 ma3\n 0.9092 0.6069 0.1198\ns.e. 0.0313 0.0380 0.0311\n\nsigma^2 estimated as 0.8763: log likelihood = -1353.41, aic = 2714.82\n```\n:::\n:::\n\n\n## MA(q) as an AR(1) hidden process\n\nLet $X_j = [\\epsilon_{j-1},\\ \\ldots,\\ \\epsilon_{j-q}]$ and write\n\n$$\n\\begin{aligned}\nX_i &= \\begin{bmatrix} a_{i-1} & a_{i-2} & \\cdots & a_{i-q}\\\\ 1 & 0 & \\cdots & 0\\\\ & & \\ddots \\\\ 0 & 0 & \\cdots & 1\\end{bmatrix} X_{i-1} + \n\\begin{bmatrix} a_{i}\\\\ 0 \\\\ \\vdots \\\\ 0 \\end{bmatrix} \\epsilon_i\\\\\ny_i &= \\begin{bmatrix} 1 & 0 & \\cdots 0 \\end{bmatrix} X_i\n\\end{aligned}\n$$\n\n* Now $X$ is a $q$-dimensional AR(1) (but we don't see it)\n* $y$ is deterministic conditional on $X$\n* This is the usual way these are estimated using a State-Space Model\n* Many time series models have multiple equivalent representations\n\n## AR[I]{.secondary}MA\n\n* We've been using `arima()` and `arima.sim()`, so what is left?\n* The \"I\" means \"integrated\"\n* If, for example, we can write $z_i = y_i - y_{i-1}$ and $z$ follows an ARMA(p, q), we say $y$ follows an ARIMA(p, 1, q).\n* The middle term is the degree of differencing\n\n## Other standard models\n\nSuppose we can write \n\n$$\ny_i = T_i + S_i + W_i\n$$\n\nThis is the \"classical\" decomposition of $y$ into a Trend + Seasonal + Noise.\n\nYou can estimate this with a \"Basic Structural Time Series Model\" using `StructTS()`.\n\nA related, though slightly different model is called the STL decomposition, estimated with `stl()`.\n\nThis is \"Seasonal Decomposition of Time Series by Loess\"\n\n(LOESS is \"locally estimated scatterplot smoothing\", named/proposed independently by Bill Cleveland, though originally proposed about 15 years earlier and called the Savitzky-Golay filter)\n\n## Quick example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsts <- StructTS(AirPassengers)\nbc <- stl(AirPassengers, \"periodic\") # use sin/cos to represent the seasonal\ntibble(\n  time = seq(as.Date(\"1949-01-01\"), as.Date(\"1960-12-31\"), by = \"month\"),\n  AP = AirPassengers, \n  StrucTS = fitted(sts)[, 1], \n  STL = rowSums(bc$time.series[, 1:2])\n) |>\n  pivot_longer(-time) |>\n  ggplot(aes(time, value, color = name)) +\n  geom_line() +\n  theme_bw(base_size = 24) +\n  scale_color_viridis_d(name = \"\") +\n  theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Generic state space model\n\n
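To make the general notation below concrete, here is a minimal simulated instance (a sketch, not from the original slides): a Gaussian random walk state observed with noise.\n\n```{.r}\n# sketch: one draw from the simplest state space model\n# state (unobserved):  x_k = x_{k-1} + N(0, 1)\n# observation:         y_k = x_k + N(0, 1)\nset.seed(550)\nx <- cumsum(rnorm(250))\ny <- x + rnorm(250)\n```\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n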


\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\n:::\n::: {.column width=\"50%\"}\n\n* $x_k$ is unobserved, dimension $n$\n\n* $y_k$ is observed, dimension $m$\n\n* $x$ process is the [transition]{.tertiary} or [process]{.tertiary} equation\n\n* $y$ is the [observation]{.tertiary} or [measurement]{.tertiary} equation\n\n* Both are probability distributions that can depend on parameters $\\theta$\n\n* For now, assume $\\theta$ is [KNOWN]{.secondary}\n\n* We can allow the densities to vary with time.\n\n:::\n::::\n\n\n\n## GOAL(s)\n\n1. Filtering: given observations, find $$p(x_k | y_1,\\ldots y_k)$$\n\n1. Smoothing: given observations, find $$p(x_k | y_1,\\ldots y_T), \\;\\;\\ k < T$$\n\n1. Forecasting: given observations, find $$p(y_{k+1} | y_1,\\ldots,y_k)$$\n\n\n## Using Bayes Rule\n\nAssume $p(x_0)$ is known\n\n$$\n\\begin{aligned}\np(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T) &= \\prod_{k=1}^T p(y_k | x_k)\\\\\np(x_0,\\ldots,x_T) &= p(x_0) \\prod_{k=1}^T p(x_k | x_{k-1})\\\\\np(x_0,\\ldots,x_T\\ |\\ y_1,\\ldots,y_T) &= \\frac{p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)}{p(y_1,\\ldots,y_T)}\\\\ &\\propto p(y_1,\\ldots,y_T\\ |\\ x_1, \\ldots, x_T)p(x_0,\\ldots,x_T)\\end{aligned}\n$$\n\nIn principle, if things are nice, you can compute this posterior (thinking of $x$ as unknown parameters)\n\nBut in practice, computing a big multivariate posterior like this is computationally ill-advised.\n\n\n\n## Generic filtering\n\n* Recursively build up $p(x_k | y_1,\\ldots y_k)$.\n\n* Why? Because if we're collecting data in real time, this is all we need to make forecasts for future data.\n\n$$\\begin{aligned} &p(y_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1}, y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | y_1,\\ldots,y_T)\\\\ &= p(y_{T+1} | x_{T+1} )p(x_{T+1} | x_T) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n* Can continue to iterate if I want to predict $h$ steps ahead\n\n$$\\begin{aligned} &p(y_{T+h} | y_1,\\ldots,y_T)= p(y_{T+h} | x_{T+h} )\\prod_{j=0}^{h-1} p(x_{T+j+1} | x_{T+j}) p(x_T | y_1,\\ldots,y_T)\\end{aligned}$$\n\n\n\n## The filtering recursion\n\n1. Initialization. Fix $p(x_0)$.\n\nIterate the following for $k=1,\\ldots,T$:\n\n2. Predict. $$p(x_k | y_{k-1}) = \\int p(x_k | x_{k-1}) p(x_{k-1} | y_1,\\ldots, y_{k-1})dx_{k-1}.$$\n\n3. Update. 
$$p(x_k | y_1,\ldots,y_k) = \frac{p(y_k | x_k)p(x_k | y_1,\ldots,y_{k-1})}{p(y_k | y_1,\ldots,y_{k-1})}$$\n\n\nIn general, this is somewhat annoying because these integrals may be challenging to solve.\n\nBut with some creativity, we can use Monte Carlo for everything.\n\n\n\n## What if we make lots of assumptions?\n\nAssume that $$\begin{aligned}p(x_0) &= N(m_0, P_0) \\ p_k(x_k\ |\ x_{k-1}) &= N(A_{k-1}x_{k-1},\ Q_{k-1})\\ p_k(y_k\ |\ x_k) &= N(H_k x_k,\ R_k)\end{aligned}.$$\n\nThen [all the ugly integrals have closed-form representations]{.secondary} by properties of conditional Gaussian distributions.\n\n## Closed-form representations\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\nDistributions:\n\n$$\n\begin{aligned}\np(x_k | y_1,\ldots,y_{k-1}) &= N(m^{-}_k, P^{-}_k)\\\np(x_k | y_1,\ldots,y_{k}) &= N(m_k, P_k)\\\np(y_{k} | y_1,\ldots,y_{k-1}) &= N(H_k m^-_k, S_k)\\\n\end{aligned}\n$$\nPrediction:\n$$\n\begin{aligned}\nm^-_k &= A_{k-1}m_{k-1}\\\nP^-_k &= A_{k-1}P_{k-1}A^\mathsf{T}_{k-1} + Q_{k-1}\n\end{aligned}\n$$\n\n:::\n::: {.column width=\"50%\"}\n\nUpdate:\n$$\n\begin{aligned}\nv_k &= y_k - H_k m_k^-\\\nS_k &= H_k P_k^- H_k^\mathsf{T} + R_k\\\nK_k &= P^-_k H_k^\mathsf{T} S_k^{-1}\\\nm_k &= m^-_k + K_{k}v_{k}\\\nP_k &= P^-_k - K_k S_k K_k^\mathsf{T}\n\end{aligned}\n$$\n\n:::\n::::\n\n\n\n## Code or it isn't real (Kalman Filter)\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nkalman <- function(y, m0, P0, A, Q, H, R) {\n  n <- length(y)\n  m <- double(n + 1)\n  P <- double(n + 1)\n  m[1] <- m0\n  P[1] <- P0\n  for (k in seq(n)) {\n    mm <- A * m[k]\n    Pm <- A * P[k] * A + Q\n    v <- y[k] - H * mm\n    S <- H * Pm * H + R\n    K <- Pm * H / S\n    m[k + 1] <- mm + K * v\n    P[k + 1] <- Pm - K * S * K\n  }\n  tibble(t = 1:n, m = m[-1], P = P[-1])\n}\n\nset.seed(2022 - 06 - 01)\nx <- double(100)\nfor (k in 2:100) x[k] <- x[k - 1] + rnorm(1)\ny <- x + rnorm(100, sd = 1)\nkf <- kalman(y, 0, 5, 1, 1, 1, 1)\n```\n:::\n\n\n:::\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n::::\n\n\n## Important notes\n\n* So far, we assumed all parameters were known.\n* In reality, we had 6: `m0, P0, A, Q, H, R`\n* I sort of also think of `x` as \"parameters\" in the Bayesian sense\n* By that I mean, \"latent variables for which we have prior distributions\"\n* What if we want to estimate them?\n\n[Bayesian way]{.tertiary}: `m0` and `P0` are already the parameters for the prior on `x1`. Put priors on the other 4.\n\n[Frequentist way]{.tertiary}: Just maximize the likelihood. Can technically take \n`P0` $\rightarrow\infty$ to remove it and `m0`\n\n. . .\n\nThe Likelihood is produced as a by-product of the Kalman Filter. \n\n$$-\ell(\theta) = \sum_{k=1}^T \left(v_k^\mathsf{T}S_k^{-1}v_k + \log |S_k| + m \log 2\pi\right)$$\n\n
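As a sketch of how you might use this (scalar case, mirroring the `kalman()` code above; the estimation step itself is not in the original slides), the same loop can accumulate $-\ell(\theta)$ for `optimize()`:\n\n```{.r}\n# sketch: the filtering loop, returning the negative log-likelihood\nneg_log_lik <- function(Q, y, m0 = 0, P0 = 5, A = 1, H = 1, R = 1) {\n  m <- m0\n  P <- P0\n  nll <- 0\n  for (k in seq_along(y)) {\n    mm <- A * m                 # predict\n    Pm <- A * P * A + Q\n    v <- y[k] - H * mm          # innovation v_k\n    S <- H * Pm * H + R         # innovation variance S_k\n    nll <- nll + v^2 / S + log(S) + log(2 * pi)\n    K <- Pm * H / S             # update\n    m <- mm + K * v\n    P <- Pm - K * S * K\n  }\n  nll\n}\noptimize(neg_log_lik, c(0.01, 10), y = y) # estimate Q, all else held fixed\n```\n\n\n## Smoothing \n\n* We also want $p(x_k | y_1,\ldots,y_{T})$\n* Filtering went \"forward\" in time. At the end we got \n$p(x_T | y_1,\ldots,y_{T})$. Smoothing starts there and goes \"backward\"\n* For \"everything linear Gaussian\", this is again \"easy\"\n* Set $m_T^s = m_T$, $P_T^s = P_T$. 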
\n* For $k = T-1,\\ldots,1$, \n\n\n$$\\begin{aligned}\nG_k &= P_k A_k^\\mathsf{T} [P_{k+1}^-]^{-1}\\\\\nm_k^s &= m_k + G_k(m_{k+1}^s - m_{k+1}^-)\\\\\nP_k^s &= P_k + G_k(P_{k+1}^s - P_{k+1}^-)G_k^\\mathsf{T}\\\\\nx_k | y_1,\\ldots,y_T &= N(m^s_k, P_k^s)\n\\end{aligned}$$\n\n\n## Comparing the filter and the smoother\n\n* Same data, different code (using a package)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(FKF)\nfilt <- fkf(\n a0 = 0, P0 = matrix(5), dt = matrix(0), ct = matrix(0),\n Tt = matrix(1), Zt = matrix(1), HHt = matrix(1), GGt = matrix(1),\n yt = matrix(y, ncol = length(y))\n)\nsmo <- fks(filt)\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](time-series_files/figure-revealjs/plot-smooth-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## What about non-linear and/or non-Gaussian\n\n$$\\begin{aligned} x_k &\\sim p(x_k | x_{k-1}) \\\\ y_k &\\sim p(y_k | x_k)\\end{aligned}$$\n\nThen we need to solve integrals. This is a pain. We approximate them. \n\nThese all give [approximations to the filtering distribution]{.secondary}\n\n* Extended Kalman filter - basically do a Taylor approximation, then do Kalman like\n* Uncented Kalman filter - Approximate integrals with Sigma points\n* Particle filter - Sequential Monte Carlo\n* Bootstrap filter (simple version of SMC)\n* Laplace Gaussian filter - Do a Laplace approximation to the distributions\n\n\n## The bootstrap filter\n\n* Need to **simulate** from the transition distribution (`rtrans`)\n* Need to **evaluate** the observation distribution (`dobs`)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nboot_filter <-\n function(y, B = 1000, rtrans, dobs, a0 = 0, P0 = 1, perturb = function(x) x) {\n n <- length(y)\n filter_est <- matrix(0, n, B)\n predict_est <- matrix(0, n, B)\n init <- rnorm(B, a0, P0)\n filter_est[1, ] <- init\n for (i in seq(n)) {\n raw_w <- dobs(y[i], filter_est[i, ])\n w <- raw_w / sum(raw_w)\n selection <- sample.int(B, replace = TRUE, prob = w)\n filter_est[i, ] <- perturb(filter_est[i, selection])\n predict_est[i, ] <- rtrans(filter_est[i, ])\n if (i < n) filter_est[i + 1, ] <- predict_est[i, ]\n }\n list(filt = filter_est, pred = predict_est)\n }\n```\n:::\n", "supporting": [ "time-series_files" ], diff --git a/_freeze/schedule/slides/time-series/figure-revealjs/plot-it-1.svg b/_freeze/schedule/slides/time-series/figure-revealjs/plot-it-1.svg index 872dc07..f82e1e6 100644 --- a/_freeze/schedule/slides/time-series/figure-revealjs/plot-it-1.svg +++ b/_freeze/schedule/slides/time-series/figure-revealjs/plot-it-1.svg @@ -305,31 +305,25 @@ - - - - - - - - - - - - + + + + + + - - - - - - - + + + + + + + - + - + diff --git a/_freeze/schedule/slides/time-series/figure-revealjs/plot-smooth-1.svg b/_freeze/schedule/slides/time-series/figure-revealjs/plot-smooth-1.svg index 71ef7ac..c3d6348 100644 --- a/_freeze/schedule/slides/time-series/figure-revealjs/plot-smooth-1.svg +++ b/_freeze/schedule/slides/time-series/figure-revealjs/plot-smooth-1.svg @@ -210,29 +210,29 @@ - - - - - - - + + + + + + + - - - - - + + + + + - - - - - - + + + + + + - + diff --git a/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-2-1.svg b/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-2-1.svg index bde4118..af9c07d 100644 --- a/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-2-1.svg +++ b/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-2-1.svg @@ -513,29 +513,29 @@ - - - - - - - - - - + + + + + + + + + + - + - 
+ - + - + - + diff --git a/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-6-1.svg b/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-6-1.svg index f44eb40..10872d8 100644 --- a/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-6-1.svg +++ b/_freeze/schedule/slides/time-series/figure-revealjs/unnamed-chunk-6-1.svg @@ -225,28 +225,28 @@ - - - - - - + + + + + + - - + + - - - + + + - - - - - - - + + + + + + + diff --git a/_freeze/schedule/slides/unit-tests/execute-results/html.json b/_freeze/schedule/slides/unit-tests/execute-results/html.json index a8e98c7..02faf9d 100644 --- a/_freeze/schedule/slides/unit-tests/execute-results/html.json +++ b/_freeze/schedule/slides/unit-tests/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "2657827e20283bbab6a8211c880d9b66", "result": { - "markdown": "---\nlecture: \"Unit tests and avoiding 🪲🪲\"\nformat: \n revealjs:\n echo: true\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 February 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## I urge you to consult:\n\n[Carnegie Mellon's 36-750 Notes](https://36-750.github.io)\n\nThank you Alex and Chris for the heavy lifting.\n\n\n\n## Bugs happen. All. The. Time.\n\n* the crash of the [Mars Climate Orbiter](https://en.wikipedia.org/wiki/Mars%5FClimate%5FOrbiter) (1998),\n\n* a [failure of the national telephone network](https://telephoneworld.org/landline-telephone-history/the-crash-of-the-att-network-in-1990/) (1990),\n\n* a deadly medical device ([1985](https://en.wikipedia.org/wiki/Therac-25), 2000),\n\n* a massive [Northeastern blackout](https://en.wikipedia.org/wiki/Northeast%5Fblackout%5Fof%5F2003) (2003),\n\n* the [Heartbleed](http://heartbleed.com/), [Goto Fail](https://www.dwheeler.com/essays/apple-goto-fail.html), [Shellshock](https://en.wikipedia.org/wiki/Shellshock%5F(software%5Fbug)) exploits (2012–2014),\n\n* a 15-year-old [fMRI analysis software](http://www.pnas.org/content/113/28/7900.full) bug that inflated significance levels (2015),\n\n. . .\n\nIt is easy to write lots of code.\n\nBut are we sure it's doing the right things?\n\n::: {.callout-important}\nEffective testing tries to help.\n:::\n\n\n## A Common (Interactive) Workflow\n\n1. Write a function.\n1. Try some reasonable values at the REPL to check that it works.\n1. If there are problems, maybe insert some print statements, and modify the function.\n1. 
Repeat until things seem fine.\n\n(REPL == Read-Eval-Print-Loop, the console, or Jupyter NB)\n\n* This tends to result in lots of bugs.\n\n* Later on, you forget which values you tried, whether they failed, how you fixed them.\n\n* So you make a change and maybe or maybe not try some again.\n\n## Step 1 --- write functions\n\n::: {.callout-important appearance=\"simple\"}\nWrite functions.\n\nLots of them.\n:::\n\n👍 Functions are testable \n\n👎 Scripts are not\n\nIt's easy to alter the arguments and see \"what happens\"\n\nThere's less ability to screw up environments.\n\n. . .\n\nI'm going to mainly describe `R`, but the logic is very similar (if not the syntax) for `python`, `C++`, and `Julia`\n\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r .cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments\n* If you name them, you can pass them out of order (but you shouldn't).\n\n## Outputs vs. 
Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function does, outputs can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](unit-tests_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 83 138 191 191 182 74 43 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.166 0.276 0.382 0.382 0.364 0.148 0.086 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## Step 2 --- program defensively, ensure behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n## Even better\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) cli::cli_abort(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) cli::cli_abort(\"`inc_by` must be numeric\")\n x + inc_by\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:6, \"b\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! 
`inc_by` must be numeric\n```\n:::\n:::\n\n\n\n## Step 3 --- Keep track of behaviour with tests\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) stop(\"`x` must be numeric\")\n if (!is.numeric(inc_by)) stop(\"`inc_by` must be numeric\")\n x + inc_by\n}\ntest_that(\"incrementer validates arguments\", {\n expect_error(incrementer(\"a\"))\n expect_equal(incrementer(1:3), 2:4)\n expect_equal(incrementer(2, -3), -1)\n expect_error(incrementer(1, \"b\"))\n expect_identical(incrementer(1:3), 2:4)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n── Failure: incrementer validates arguments ────────────────────────────────────\nincrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError:\n! Test failed\n```\n:::\n:::\n\n\n\n## Integers are trouble\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\nexpect_equal(incrementer(1:3, 1), 2:4)\n```\n:::\n\n\n# Testing lingo\n\n## Unit testing\n\n* A **unit** is a small bit of code (function, class, module, group of classes)\n\n* A **test** calls the unit with a set of inputs, and checks if we get the expected output.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngcd <- function(x, na.rm = FALSE) {\n if (na.rm) x <- x[!is.na(x)]\n if (anyNA(x)) return(NA)\n stopifnot(is.numeric(x))\n if (!rlang::is_integerish(x)) cli_abort(\"`x` must contain only integers.\")\n if (length(x) == 1L) return(as.integer(x))\n x <- x[x != 0]\n compute_gcd(x) # dispatch to a C++ function\n}\n\ntest_that(\"gcd works\", {\n # corner cases\n expect_identical(gcd(c(1, NA)), NA)\n expect_identical(gcd(c(1, NA), TRUE), 1L)\n expect_identical(gcd(c(1, 2, 4)), 1L)\n # error\n expect_error(gcd(1.3))\n # function\n expect_identical(gcd(c(2, 4, 6)), 2L)\n expect_identical(gcd(c(2, 3, 7)), 1L)\n})\n```\n:::\n\n\n## Unit testing\n\nUnit testing consists of writing tests that are\n\n* focused on a small, low-level piece of code (a unit)\n* typically written by the programmer with standard tools\n* fast to run (so can be run often, i.e. before every commit).\n\n\n## Unit testing benefits\n\nAmong others:\n\n* Exposing problems early\n* Making it easy to change (refactor) code without forgetting pieces or breaking things\n* Simplifying integration of components\n* Providing natural documentation of what the code should do\n* Driving the design of new code.\n\n![](http://www.phdcomics.com/comics/archive/phd033114s.gif)\n\n\n## Components of a Unit Testing Framework\n\n::: flex\n::: w-70\n\n* Collection of **Assertions** executed in sequence. 
\n* Executed in a self-contained environment.\n* Any assertion fails ``{=html} Test fails.\n\nEach test focuses on a single component.\n\nNamed so that you know what it's doing.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## See https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life\ntest_that(\"Conway's rules are correct\", {\n  # conway_rules(num_neighbors, alive?)\n  expect_true(conway_rules(3, FALSE))\n  expect_false(conway_rules(4, FALSE))\n  expect_true(conway_rules(2, TRUE))\n  ...\n})\n```\n:::\n\n:::\n\n::: w-30\n\n![](https://upload.wikimedia.org/wikipedia/commons/e/e5/Gospers_glider_gun.gif)\n\n:::\n:::\n\n\n## A test suite\n\n::: flex\n::: w-50\n* Collection of related tests in a common context.\n\n* Prepares the environment, cleans up after\n\n* (loads some data, connects to a database, attaches a necessary library, ...)\n\n* Test suites are run and the results reported, particularly failures, in an easy-to-parse and economical style. \n\n* For example, Python’s `{unittest}` can report like this\n\n::: \n\n::: w-50\n\n```{.bash}\n$ python test/trees_test.py -v\n\ntest_crime_counts (__main__.DataTreeTest)\nEnsure Ks are consistent with num_points. ... ok\ntest_indices_sorted (__main__.DataTreeTest)\nEnsure all node indices are sorted in increasing order. ... ok\ntest_no_bbox_overlap (__main__.DataTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.DataTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.DataTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.DataTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.DataTreeTest)\nCheck that child tranges are contained in parent tranges. ... ok\ntest_no_bbox_overlap (__main__.QueryTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.QueryTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.QueryTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.QueryTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.QueryTreeTest)\nCheck that child tranges are contained in parent tranges. ... 
ok\n\n---------------------------------------------------------\nRan 12 tests in 23.932s\n```\n\n:::\n:::\n\n\n\n## `R` example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntestthat::test_local(here::here(\"../../../../../../Delphi/smoothqr/\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n✔ | F W S OK | Context\n\n⠏ | 0 | smooth-rq \n⠴ | 6 | smooth-rq \n✔ | 12 | smooth-rq\n\n══ Results ═════════════════════════════════════════════════════════════════════\n[ FAIL 0 | WARN 0 | SKIP 0 | PASS 12 ]\n```\n:::\n:::\n\n\n\n## What do I test?\n\n::: {.callout-tip icon=false}\n## Core Principle:\n\nTests should be passed by a correct function, but not by an incorrect function.\n:::\n\nThe tests must apply pressure to know if things break.\n\n* several specific inputs for which you _know_ the correct answer\n* \"edge\" cases, like a list of size zero or a matrix instead of a vector\n* special cases that the function must handle, but which you might forget about months from now\n* error cases that should throw an error instead of returning an invalid answer\n* previous bugs you’ve fixed, so those bugs never return.\n\n\n## What do I test?\n\nMake sure that incorrect functions won't pass (or at least, won't pass them all).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadd <- function(a, b) return(4)\nadd <- function(a, b) return(a * b)\n\ntest_that(\"Addition is commutative\", {\n  expect_equal(add(1, 3), add(3, 1)) # both pass this !!\n  expect_equal(add(2, 5), add(5, 2)) # neither passes this\n})\n```\n:::\n\n\n::: {.callout-tip}\n* Cover all branches. \n\n* Make sure there aren't branches you don't expect.\n:::\n\n## Assertions\n\n[Assertions]{.secondary} are things that must be true. Failure means \"Quit\". \n\n- There's no way to recover. \n- Think: passed in bad arguments.\n \n\n::: {.cell layout-align=\"center\"}\n\n```{.python .cell-code}\ndef fit(data, ...):\n\n    for it in range(max_iterations):\n        # iterative fitting code here\n        ...\n\n        # Plausibility check\n        assert np.all(alpha >= 0), \"negative alpha\"\n        assert np.all(theta >= 0), \"negative theta\"\n        assert omega > 0, \"Nonpositive omega\"\n        assert eta2 > 0, \"Nonpositive eta2\"\n        assert sigma2 > 0, \"Nonpositive sigma2\"\n\n    ...\n```\n:::\n\n\nThe parameters have to be positive. Negative is impossible. No way to recover.\n\n\n## Errors\n\n[Errors]{.secondary} are for unexpected conditions that _could_ be handled by the calling code.\n\n* You could perform some action to work around the error, fix it, or report it to the user.\n\n#### Example:\n\n- I give you directions to my house. You get lost. You could recover.\n- Maybe retrace your steps, see if you missed a signpost.\n- Maybe search on Google Maps to locate yourself in relation to a landmark.\n- If those fail, message me.\n- If I don't respond, get an Uber.\n- Finally, give up and go home.\n\n## Errors\n\nCode can also do this. It can `try` the function and `catch` errors to recover automatically.\n\nFor example:\n\n* Load some data from the internet. If the file doesn't exist, create some.\n\n* Run some iterative algorithm. If we haven't converged, restart from another place.\n\nCode can fix errors without user input. It can't fix assertions.\n\n* If an input must be an integer, round it, warn, and proceed rather than failing.\n\n\n## Test-driven development\n\nTest Driven Development (TDD) uses a short development cycle for each new feature or component:\n\n1. Write tests that specify the component’s desired behavior. \n   The tests will initially fail because the component does not yet exist.\n\n1. Create the minimal implementation that passes the test.\n\n1. Refactor the code to meet design standards, running the tests with each change to ensure correctness.
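\n\nFor instance, one red-green cycle in `{testthat}` might look like the following sketch (the `clamp()` helper and its behaviour are invented for illustration):\n\n```{.r}\nlibrary(testthat)\n\n# 1. Specify the behaviour first. This test fails at first:\n#    clamp() does not exist yet.\ntest_that(\"clamp() restricts values to [lo, hi]\", {\n  expect_equal(clamp(c(-1, 0.5, 2), 0, 1), c(0, 0.5, 1))\n  expect_error(clamp(\"a\", 0, 1))\n})\n\n# 2. Write the minimal implementation that passes.\nclamp <- function(x, lo, hi) {\n  if (!is.numeric(x)) stop(\"`x` must be numeric\")\n  pmin(pmax(x, lo), hi)\n}\n\n# 3. Refactor as needed, re-running the test after every change.\n```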
\n\n\n## Why work this way?\n\n* Writing the tests may help you realize \n a. what arguments the function must take, \n b. what other data it needs, \n c. and what kinds of errors it needs to handle. \n\n* The tests define a specific plan for what the function must do.\n\n* You will catch bugs at the beginning instead of at the end (or never).\n\n* Testing is part of design, instead of a lame afterthought you dread doing.\n\n\n## Rules of thumb\n\nKeep tests in separate files\n: from the code they test. This makes it easy to run them separately.\n\nGive tests names\n: Testing frameworks usually let you give the test functions names or descriptions. `test_1` doesn’t help you at all, but `test_tree_insert` makes it easy for you to remember what the test is for.\n\nMake tests replicable\n: If a test involves random data, what do you do when the test fails? You need some way to know what random values it used so you can figure out why the test fails.\n\n## Rules of thumb\n\nUse tests instead of the REPL\n: If you’re building a complicated function, write the tests in advance and use them to help you while you write the function. Otherwise you'll waste time calling it over and over at the REPL.\n\nAvoid testing against another's code/package\n: You don't know the ins and outs of what they do. If they change the code, your tests will fail.\n\nTest Units, not main functions\n: You should write small functions that do one thing. Test those. Don't write one huge 1000-line function and try to test that.\n\nAvoid random numbers\n: Seeds are not always portable.\n\n---\n\n::: {.callout-note}\n* In `R`, use `{testthat}`. See the [Testing](http://r-pkgs.had.co.nz/tests.html) chapter from Hadley Wickham’s R Packages book.\n\n* In `python`, use `{pytest}`. A bit more user-friendly than `{unittest}`: [pytest](https://docs.pytest.org/en/latest/)\n:::\n\n\n\n## Other suggestions\n\n::: flex\n::: w-50\n[Do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n  if (x < 0) stop(x, \" is not positive\")\n}\n\nfoo <- function(x) {\n  if (x < 0) message(x, \" is not positive\")\n  # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n  if (x < 0) warning(x, \" is not positive\")\n  # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n  if (length(x) == 0)\n    rlang::abort(\"no data\", class = \"no_input_data\")\n}\n```\n:::\n\n\nThese allow error handling.\n:::\n\n\n::: w-50\n\n[Don't do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n  if (x < 0) {\n    print(paste0(x, \" is not positive\"))\n    return(NULL)\n  }\n  ...\n}\n\nfoo <- function(x) {\n  if (x < 0) cat(\"uh oh.\")\n  ...\n}\n```\n:::\n\n\nCan't recover.\n\nDon't know what went wrong.\n\n:::\n:::\n\n---\n\nSee [here](https://36-750.github.io/practices/errors-exceptions/) for more details.\n\nSeems like overkill, \n\nbut when you run a big simulation that takes 2 weeks, \n\nyou don't want it to die after 10 days. \n\n\nYou want it to recover.
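\n\nA minimal sketch of that pattern with `tryCatch()` (the cache file and `run_simulation()` are invented for illustration):\n\n```{.r}\n# Try to reuse cached results; if anything goes wrong, recompute instead.\nresults <- tryCatch(\n  readRDS(\"cache/sim-results.rds\"),\n  error = function(e) {\n    message(\"cache unavailable, recomputing: \", conditionMessage(e))\n    run_simulation() # hypothetical long-running step\n  }\n)\n```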
\n\n\n\n# More coding details, if time.\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n  x1 = rnorm(100), \n  x2 = rnorm(100), \n  y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib)\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: `obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) \n\n* The convention is\nthat the specialized function is named `method.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `method.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n## Annoying example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n\n```{.r .cell-code}\nprint.lm <- function(x, ...) print(\"This is a linear model.\")\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"This is a linear model.\"\n```\n:::\n:::\n\n\n* Overwrote the method in the global environment.\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`, was fit two centuries ago, or comes from a Batrachian Emphasis Machine that won't be invented for another five years. \n\n* The one drawback is that the help pages for the generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. 
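\n\nWriting your own method is just defining a function named `generic.class()`. A minimal sketch (the `temperature` class is invented for illustration):\n\n```{.r}\n# Constructor: tag a numeric vector with a new class.\nnew_temperature <- function(x, units = \"C\") {\n  structure(x, units = units, class = \"temperature\")\n}\n\n# A print method, dispatched whenever print() sees class \"temperature\".\nprint.temperature <- function(x, ...) {\n  cat(format(unclass(x)), \"degrees\", attr(x, \"units\"), \"\\n\")\n  invisible(x)\n}\n\nprint(new_temperature(37)) # dispatches to print.temperature()\n```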
\n\n\n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R`, code run in the Console produces objects in the \"Global environment\"\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside those package Environments.\n\nFor example `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available)\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But, you can call those functions by prefixing the package name, e.g. `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::` like `dplyr:::as_across_fn_call()`\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\"\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n  z <- x\n  print(z)\n  invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to bottom, with code chunks depending on previous ones.\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 😱\n\nObjects in your local environment are not available in the `.Rmd`\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Global Environment that don't exist in the `.Rmd`\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Careful organization and relative paths will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n  * Make sure you're calling the function correctly.\n  * Try running the examples.\n  * Paste the error into Google (this is my first step when you ask me)\n  * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n  * Did you make the mistake on the last slide?\n  * Did it Knit before? Then the bug is in whatever you added.\n  * Did you never Knit it? 
Why not?\n  * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will stop where you added `browser()` and you'll have access to the local environment to play around.\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get uncountably often that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just share a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that.\n\nAnd I can't copy-paste into Google.\n:::\n\nWhat you need is a Reproducible Example or `reprex`. This is a small chunk of code that \n\n1. runs in its own environment \n1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples, How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save)\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex()` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Share that.\n\n\n::: {.callout-note}\nBecause Reprex runs in its own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. You'll have to load these things in the script. That's the point.\n:::\n\n\n\n\n\n\n## Practice\n\n#### Gradient ascent.\n\n* Suppose we want to find $\\max_x f(x)$.\n\n* We repeat the update $x \\leftarrow x + \\gamma f'(x)$ until convergence, for some $\\gamma > 0$.\n\n#### Poisson likelihood.\n\n* Recall the likelihood: $L(\\lambda; y_1,\\ldots,y_n) = \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})$\n\n[Goal:]{.secondary} find the MLE for $\\lambda$ using gradient ascent\n\n---\n\n## Deliverables, 2 R scripts\n\n1. A function that evaluates the log likelihood. (think about sufficiency, ignorable constants)\n1. A function that evaluates the gradient of the log likelihood. \n1. A function that *does* the optimization. \n a. Should take in data, the log likelihood, and the gradient.\n b. Use the log likelihood to determine convergence. \n c. Pass in any other necessary parameters with reasonable defaults.\n1. 
A collection of tests that make sure your functions work.\n\n\n$$\n\\begin{aligned} \nL(\\lambda; y_1,\\ldots,y_n) &= \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})\\\\\nx &\\leftarrow x + \\gamma f'(x)\n\\end{aligned}\n$$", + "markdown": "---\nlecture: \"Unit tests and avoiding 🪲🪲\"\nformat: \n revealjs:\n echo: true\nmetadata-files: \n - _metadata.yml\n---\n## {{< meta lecture >}} {.large background-image=\"img/consult.jpeg\" background-opacity=\"0.3\"}\n\n[Stat 550]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 01 April 2024\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\mid}\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## I urge you to consult:\n\n[Carnegie Mellon's 36-750 Notes](https://36-750.github.io)\n\nThank you Alex and Chris for the heavy lifting.\n\n\n\n## Bugs happen. All. The. Time.\n\n* the crash of the [Mars Climate Orbiter](https://en.wikipedia.org/wiki/Mars%5FClimate%5FOrbiter) (1999),\n\n* a [failure of the national telephone network](https://telephoneworld.org/landline-telephone-history/the-crash-of-the-att-network-in-1990/) (1990),\n\n* a deadly medical device ([1985](https://en.wikipedia.org/wiki/Therac-25), 2000),\n\n* a massive [Northeastern blackout](https://en.wikipedia.org/wiki/Northeast%5Fblackout%5Fof%5F2003) (2003),\n\n* the [Heartbleed](http://heartbleed.com/), [Goto Fail](https://www.dwheeler.com/essays/apple-goto-fail.html), [Shellshock](https://en.wikipedia.org/wiki/Shellshock%5F(software%5Fbug)) exploits (2012–2014),\n\n* a 15-year-old [fMRI analysis software](http://www.pnas.org/content/113/28/7900.full) bug that inflated significance levels (2015),\n\n. . .\n\nIt is easy to write lots of code.\n\nBut are we sure it's doing the right things?\n\n::: {.callout-important}\nEffective testing tries to help.\n:::\n\n\n## A Common (Interactive) Workflow\n\n1. Write a function.\n1. Try some reasonable values at the REPL to check that it works.\n1. If there are problems, maybe insert some print statements, and modify the function.\n1. Repeat until things seem fine.\n\n(REPL == Read-Eval-Print-Loop, the console, or Jupyter NB)\n\n* This tends to result in lots of bugs.\n\n* Later on, you forget which values you tried, whether they failed, how you fixed them.\n\n* So you make a change, and maybe (or maybe not) try some again.\n\n## Step 1 --- write functions\n\n::: {.callout-important appearance=\"simple\"}\nWrite functions.\n\nLots of them.\n:::\n\n👍 Functions are testable \n\n👎 Scripts are not\n\nIt's easy to alter the arguments and see \"what happens\"\n\nThere's less ability to screw up environments.\n\n. . 
.\n\nI'm going to mainly describe `R`, but the logic is very similar (if not the syntax) for `python`, `C++`, and `Julia`.\n\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r .cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments.\n* If you name them, you can pass them out of order (but you shouldn't).\n\n## Outputs vs. 
Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function *does*; outputs are values it returns, which can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect, which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](unit-tests_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 83 138 191 191 182 74 43 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.166 0.276 0.382 0.382 0.364 0.148 0.086 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## Step 2 --- program defensively, ensure behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n  x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n  stopifnot(is.numeric(x))\n  return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n  if (!is.numeric(x)) {\n    stop(\"`x` must be numeric\")\n  }\n  x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n  if (!is.numeric(x)) {\n    stop(\"`x` must be numeric\")\n  }\n  x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n## Even better\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n  if (!is.numeric(x)) cli::cli_abort(\"`x` must be numeric\")\n  if (!is.numeric(inc_by)) cli::cli_abort(\"`inc_by` must be numeric\")\n  x + inc_by\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:6, \"b\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in `incrementer()`:\n! 
`inc_by` must be numeric\n```\n:::\n:::\n\n\n\n## Step 3 --- Keep track of behaviour with tests\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n  if (!is.numeric(x)) stop(\"`x` must be numeric\")\n  if (!is.numeric(inc_by)) stop(\"`inc_by` must be numeric\")\n  x + inc_by\n}\ntest_that(\"incrementer validates arguments\", {\n  expect_error(incrementer(\"a\"))\n  expect_equal(incrementer(1:3), 2:4)\n  expect_equal(incrementer(2, -3), -1)\n  expect_error(incrementer(1, \"b\"))\n  expect_identical(incrementer(1:3), 2:4)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n── Failure: incrementer validates arguments ────────────────────────────────────\nincrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nError:\n! Test failed\n```\n:::\n:::\n\n\n\n## Integers are trouble\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\nexpect_equal(incrementer(1:3, 1), 2:4)\n```\n:::\n\n\n# Testing lingo\n\n## Unit testing\n\n* A **unit** is a small bit of code (function, class, module, group of classes)\n\n* A **test** calls the unit with a set of inputs, and checks if we get the expected output.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngcd <- function(x, na.rm = FALSE) {\n  if (na.rm) x <- x[!is.na(x)]\n  if (anyNA(x)) return(NA)\n  stopifnot(is.numeric(x))\n  if (!rlang::is_integerish(x)) cli::cli_abort(\"`x` must contain only integers.\")\n  if (length(x) == 1L) return(as.integer(x))\n  x <- x[x != 0]\n  compute_gcd(x) # dispatch to a C++ function\n}\n\ntest_that(\"gcd works\", {\n  # corner cases\n  expect_identical(gcd(c(1, NA)), NA)\n  expect_identical(gcd(c(1, NA), TRUE), 1L)\n  expect_identical(gcd(c(1, 2, 4)), 1L)\n  # error\n  expect_error(gcd(1.3))\n  # function\n  expect_identical(gcd(c(2, 4, 6)), 2L)\n  expect_identical(gcd(c(2, 3, 7)), 1L)\n})\n```\n:::\n\n\n## Unit testing\n\nUnit testing consists of writing tests that are\n\n* focused on a small, low-level piece of code (a unit)\n* typically written by the programmer with standard tools\n* fast to run (so can be run often, i.e. before every commit).\n\n\n## Unit testing benefits\n\nAmong others:\n\n* Exposing problems early\n* Making it easy to change (refactor) code without forgetting pieces or breaking things\n* Simplifying integration of components\n* Providing natural documentation of what the code should do\n* Driving the design of new code.\n\n![](http://www.phdcomics.com/comics/archive/phd033114s.gif)\n\n\n## Components of a Unit Testing Framework\n\n::: flex\n::: w-70\n\n* Collection of **Assertions** executed in sequence. 
\n* Executed in a self-contained environment.\n* Any assertion fails ⟹ Test fails.\n\nEach test focuses on a single component.\n\nNamed so that you know what it's doing.\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n## See https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life\ntest_that(\"Conway's rules are correct\", {\n  # conway_rules(num_neighbors, alive?)\n  expect_true(conway_rules(3, FALSE))\n  expect_false(conway_rules(4, FALSE))\n  expect_true(conway_rules(2, TRUE))\n  ...\n})\n```\n:::\n\n:::\n\n::: w-30\n\n![](https://upload.wikimedia.org/wikipedia/commons/e/e5/Gospers_glider_gun.gif)\n\n:::\n:::\n\n\n## A test suite\n\n::: flex\n::: w-50\n* Collection of related tests in a common context.\n\n* Prepares the environment, cleans up after\n\n* (loads some data, connects to a database, attaches necessary libraries, ...)\n\n* Test suites are run and the results reported, particularly failures, in an easy-to-parse and economical style. \n\n* For example, Python’s `{unittest}` can report like this:\n\n::: \n\n::: w-50\n\n```{.bash}\n$ python test/trees_test.py -v\n\ntest_crime_counts (__main__.DataTreeTest)\nEnsure Ks are consistent with num_points. ... ok\ntest_indices_sorted (__main__.DataTreeTest)\nEnsure all node indices are sorted in increasing order. ... ok\ntest_no_bbox_overlap (__main__.DataTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.DataTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.DataTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.DataTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.DataTreeTest)\nCheck that child tranges are contained in parent tranges. ... ok\ntest_no_bbox_overlap (__main__.QueryTreeTest)\nCheck that child bounding boxes do not overlap. ... ok\ntest_node_counts (__main__.QueryTreeTest)\nEnsure that each node's point count is accurate. ... ok\ntest_oversized_leaf (__main__.QueryTreeTest)\nDon't recurse infinitely on duplicate points. ... ok\ntest_split_parity (__main__.QueryTreeTest)\nCheck that each tree level has the right split axis. ... ok\ntest_trange_contained (__main__.QueryTreeTest)\nCheck that child tranges are contained in parent tranges. ... 
ok\n\n---------------------------------------------------------\nRan 12 tests in 23.932s\n```\n\n:::\n:::\n\n\n\n## `R` example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntestthat::test_local(here::here(\"../../../../../../Delphi/smoothqr/\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n✔ | F W S OK | Context\n\n⠏ | 0 | smooth-rq \n✔ | 12 | smooth-rq\n\n══ Results ═════════════════════════════════════════════════════════════════════\n[ FAIL 0 | WARN 0 | SKIP 0 | PASS 12 ]\n```\n:::\n:::\n\n\n\n## What do I test?\n\n::: {.callout-tip icon=false}\n## Core Principle:\n\nTests should be passed by a correct function, but not by an incorrect function.\n:::\n\nThe tests must apply pressure to know if things break.\n\n* several specific inputs for which you _know_ the correct answer\n* \"edge\" cases, like a list of size zero or a matrix instead of a vector\n* special cases that the function must handle, but which you might forget about months from now\n* error cases that should throw an error instead of returning an invalid answer\n* previous bugs you’ve fixed, so those bugs never return.\n\n\n## What do I test?\n\nMake sure that incorrect functions won't pass (or at least, won't pass them all).\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nadd <- function(a, b) return(4)\nadd <- function(a, b) return(a * b)\n\ntest_that(\"Addition is commutative\", {\n  expect_equal(add(1, 3), add(3, 1)) # both pass this !!\n  expect_equal(add(2, 5), add(5, 2)) # neither passes this\n})\n```\n:::\n\n\n::: {.callout-tip}\n* Cover all branches. \n\n* Make sure there aren't branches you don't expect.\n:::\n\n## Assertions\n\n[Assertions]{.secondary} are things that must be true. Failure means \"Quit\". \n\n- There's no way to recover. \n- Think: passed in bad arguments.\n \n\n::: {.cell layout-align=\"center\"}\n\n```{.python .cell-code}\ndef fit(data, ...):\n\n    for it in range(max_iterations):\n        # iterative fitting code here\n        ...\n\n        # Plausibility check\n        assert np.all(alpha >= 0), \"negative alpha\"\n        assert np.all(theta >= 0), \"negative theta\"\n        assert omega > 0, \"Nonpositive omega\"\n        assert eta2 > 0, \"Nonpositive eta2\"\n        assert sigma2 > 0, \"Nonpositive sigma2\"\n\n    ...\n```\n:::\n\n\nThe parameters have to be positive. Negative is impossible. No way to recover.\n\n\n## Errors\n\n[Errors]{.secondary} are for unexpected conditions that _could_ be handled by the calling code.\n\n* You could perform some action to work around the error, fix it, or report it to the user.\n\n#### Example:\n\n- I give you directions to my house. You get lost. You could recover.\n- Maybe retrace your steps, see if you missed a signpost.\n- Maybe search on Google Maps to locate yourself in relation to a landmark.\n- If those fail, message me.\n- If I don't respond, get an Uber.\n- Finally, give up and go home.\n\n## Errors\n\nCode can also do this. It can `try` the function and `catch` errors to recover automatically.\n\nFor example:\n\n* Load some data from the internet. If the file doesn't exist, create some.\n\n* Run some iterative algorithm. If we haven't converged, restart from another place.\n\nCode can fix errors without user input. It can't fix assertions.\n\n* If an input must be an integer, round it, warn, and proceed rather than failing.\n\n\n## Test-driven development\n\nTest Driven Development (TDD) uses a short development cycle for each new feature or component:\n\n1. Write tests that specify the component’s desired behavior. \n   The tests will initially fail because the component does not yet exist.\n\n1. Create the minimal implementation that passes the test.\n\n1. Refactor the code to meet design standards, running the tests with each change to ensure correctness.
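\n\nFor instance, one red-green cycle in `{testthat}` might look like the following sketch (the `clamp()` helper and its behaviour are invented for illustration):\n\n```{.r}\nlibrary(testthat)\n\n# 1. Specify the behaviour first. This test fails at first:\n#    clamp() does not exist yet.\ntest_that(\"clamp() restricts values to [lo, hi]\", {\n  expect_equal(clamp(c(-1, 0.5, 2), 0, 1), c(0, 0.5, 1))\n  expect_error(clamp(\"a\", 0, 1))\n})\n\n# 2. Write the minimal implementation that passes.\nclamp <- function(x, lo, hi) {\n  if (!is.numeric(x)) stop(\"`x` must be numeric\")\n  pmin(pmax(x, lo), hi)\n}\n\n# 3. Refactor as needed, re-running the test after every change.\n```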
\n\n\n## Why work this way?\n\n* Writing the tests may help you realize \n a. what arguments the function must take, \n b. what other data it needs, \n c. and what kinds of errors it needs to handle. \n\n* The tests define a specific plan for what the function must do.\n\n* You will catch bugs at the beginning instead of at the end (or never).\n\n* Testing is part of design, instead of a lame afterthought you dread doing.\n\n\n## Rules of thumb\n\nKeep tests in separate files\n: from the code they test. This makes it easy to run them separately.\n\nGive tests names\n: Testing frameworks usually let you give the test functions names or descriptions. `test_1` doesn’t help you at all, but `test_tree_insert` makes it easy for you to remember what the test is for.\n\nMake tests replicable\n: If a test involves random data, what do you do when the test fails? You need some way to know what random values it used so you can figure out why the test fails.\n\n## Rules of thumb\n\nUse tests instead of the REPL\n: If you’re building a complicated function, write the tests in advance and use them to help you while you write the function. Otherwise you'll waste time calling it over and over at the REPL.\n\nAvoid testing against another's code/package\n: You don't know the ins and outs of what they do. If they change the code, your tests will fail.\n\nTest Units, not main functions\n: You should write small functions that do one thing. Test those. Don't write one huge 1000-line function and try to test that.\n\nAvoid random numbers\n: Seeds are not always portable.\n\n---\n\n::: {.callout-note}\n* In `R`, use `{testthat}`. See the [Testing](http://r-pkgs.had.co.nz/tests.html) chapter from Hadley Wickham’s R Packages book.\n\n* In `python`, use `{pytest}`. A bit more user-friendly than `{unittest}`: [pytest](https://docs.pytest.org/en/latest/)\n:::\n\n\n\n## Other suggestions\n\n::: flex\n::: w-50\n[Do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n  if (x < 0) stop(x, \" is not positive\")\n}\n\nfoo <- function(x) {\n  if (x < 0) message(x, \" is not positive\")\n  # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n  if (x < 0) warning(x, \" is not positive\")\n  # not useful unless we fix it too...\n}\n\nfoo <- function(x) {\n  if (length(x) == 0)\n    rlang::abort(\"no data\", class = \"no_input_data\")\n}\n```\n:::\n\n\nThese allow error handling.\n:::\n\n\n::: w-50\n\n[Don't do this]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfoo <- function(x) {\n  if (x < 0) {\n    print(paste0(x, \" is not positive\"))\n    return(NULL)\n  }\n  ...\n}\n\nfoo <- function(x) {\n  if (x < 0) cat(\"uh oh.\")\n  ...\n}\n```\n:::\n\n\nCan't recover.\n\nDon't know what went wrong.\n\n:::\n:::\n\n---\n\nSee [here](https://36-750.github.io/practices/errors-exceptions/) for more details.\n\nSeems like overkill, \n\nbut when you run a big simulation that takes 2 weeks, \n\nyou don't want it to die after 10 days. \n\n\nYou want it to recover.
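\n\nA minimal sketch of that pattern with `tryCatch()` (the cache file and `run_simulation()` are invented for illustration):\n\n```{.r}\n# Try to reuse cached results; if anything goes wrong, recompute instead.\nresults <- tryCatch(\n  readRDS(\"cache/sim-results.rds\"),\n  error = function(e) {\n    message(\"cache unavailable, recomputing: \", conditionMessage(e))\n    run_simulation() # hypothetical long-running step\n  }\n)\n```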
\n\n\n\n# More coding details, if time.\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n  x1 = rnorm(100), \n  x2 = rnorm(100), \n  y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib)\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: `obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) \n\n* The convention is\nthat the specialized function is named `method.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `method.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n## Annoying example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n 0.1216 1.0803 2.1038 \n```\n:::\n\n```{.r .cell-code}\nprint.lm <- function(x, ...) print(\"This is a linear model.\")\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"This is a linear model.\"\n```\n:::\n:::\n\n\n* Overwrote the method in the global environment.\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`, was fit two centuries ago, or comes from a Batrachian Emphasis Machine that won't be invented for another five years. \n\n* The one drawback is that the help pages for the generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. 
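\n\nWriting your own method is just defining a function named `generic.class()`. A minimal sketch (the `temperature` class is invented for illustration):\n\n```{.r}\n# Constructor: tag a numeric vector with a new class.\nnew_temperature <- function(x, units = \"C\") {\n  structure(x, units = units, class = \"temperature\")\n}\n\n# A print method, dispatched whenever print() sees class \"temperature\".\nprint.temperature <- function(x, ...) {\n  cat(format(unclass(x)), \"degrees\", attr(x, \"units\"), \"\\n\")\n  invisible(x)\n}\n\nprint(new_temperature(37)) # dispatches to print.temperature()\n```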
\n\n\n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R`, code run in the Console produces objects in the \"Global environment\"\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside those package Environments.\n\nFor example `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available)\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But, you can call those functions by prefixing the package name, e.g. `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::` like `dplyr:::as_across_fn_call()`\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\"\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n  z <- x\n  print(z)\n  invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to bottom, with code chunks depending on previous ones.\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 😱\n\nObjects in your local environment are not available in the `.Rmd`\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Global Environment that don't exist in the `.Rmd`\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Careful organization and relative paths will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n  * Make sure you're calling the function correctly.\n  * Try running the examples.\n  * Paste the error into Google (this is my first step when you ask me)\n  * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n  * Did you make the mistake on the last slide?\n  * Did it Knit before? Then the bug is in whatever you added.\n  * Did you never Knit it? 
Why not?\n  * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will stop where you added `browser()` and you'll have access to the local environment to play around.\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get uncountably often that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just share a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that.\n\nAnd I can't copy-paste into Google.\n:::\n\nWhat you need is a Reproducible Example or `reprex`. This is a small chunk of code that \n\n1. runs in its own environment \n1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples, How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save)\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex()` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Share that.\n\n\n::: {.callout-note}\nBecause Reprex runs in its own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. You'll have to load these things in the script. That's the point.\n:::\n\n\n\n\n\n\n## Practice\n\n#### Gradient ascent.\n\n* Suppose we want to find $\\max_x f(x)$.\n\n* We repeat the update $x \\leftarrow x + \\gamma f'(x)$ until convergence, for some $\\gamma > 0$.\n\n#### Poisson likelihood.\n\n* Recall the likelihood: $L(\\lambda; y_1,\\ldots,y_n) = \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})$\n\n[Goal:]{.secondary} find the MLE for $\\lambda$ using gradient ascent\n\n---\n\n## Deliverables, 2 R scripts\n\n1. A function that evaluates the log likelihood. (think about sufficiency, ignorable constants)\n1. A function that evaluates the gradient of the log likelihood. \n1. A function that *does* the optimization. \n a. Should take in data, the log likelihood, and the gradient.\n b. Use the log likelihood to determine convergence. \n c. Pass in any other necessary parameters with reasonable defaults.\n1. A collection of tests that make sure your functions work.\n\n\n$$\n\\begin{aligned} \nL(\\lambda; y_1,\\ldots,y_n) &= \\prod_{i=1}^n \\frac{\\lambda^{y_i} \\exp(-\\lambda)}{y_i!}I(y_i \\in \\{0,1,\\ldots\\})\\\\\nx &\\leftarrow x + \\gamma f'(x)\n\\end{aligned}\n$$", "supporting": [ "unit-tests_files" ],