diff --git a/_freeze/schedule/slides/18-the-bootstrap/execute-results/html.json b/_freeze/schedule/slides/18-the-bootstrap/execute-results/html.json index 67473cf..2198513 100644 --- a/_freeze/schedule/slides/18-the-bootstrap/execute-results/html.json +++ b/_freeze/schedule/slides/18-the-bootstrap/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "30df895c7711276a8f7829ed46e75efa", + "hash": "a8b0ecde2f795d3d2b8a59b3595c74ce", "result": { - "markdown": "---\nlecture: \"18 The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 30 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n\n## A small detour...\n\n\n![](https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg)\n\n##\n\n![](http://rackjite.com/wp-content/uploads/rr11014aa.jpg)\n\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\n[Sampling distributions]{.primary}\n\n1. If $X_i$ are iid Normal $(0,\\sigma^2)$, then $\\Var{\\overline{X}} = \\sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\\Var{\\overline{X}} \\approx \\Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\\Var{\\overline{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate a LDA on some data.\n\nI get a new $x_0$ and produce $\\widehat{Pr}(y_0 =1 \\given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \\given x_0)$?\n\n. . .\n\n[The bootstrap gives this to you.]{.secondary}\n\n## Bootstrap procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate LDA on this sample.\n3. Produce a new prediction, call it $\\widehat{Pr}_b(y_0 =1 \\given x_0)$.\n4. Repeat 1-3 $b = 1,\\ldots,B$ times.\n5. CI: $\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]$\n\n. . .\n\n$\\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Empirical distribution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nr <- rexp(50, 1 / 5)\nggplot(tibble(r = r), aes(r)) + \n stat_ecdf(colour = orange) +\n geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +\n geom_hline(yintercept = c(.05, .95), linetype = \"dashed\") +\n annotate(\n \"label\", x = c(5, 12), y = c(.25, .75), \n label = c(\"hat(F)[boot](.05)\", \"hat(F)[boot](.95)\"), \n parse = TRUE\n )\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Very basic example\n\n* Let $X_i\\sim \\textrm{Exponential}(1/5)$. The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2s/\\sqrt{n}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(data.frame(x = c(0, 12)), aes(x)) +\n stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +\n geom_vline(xintercept = 5, color = blue) + # mean\n geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median\n annotate(\"label\",\n x = c(2.5, 5.5, 10), y = c(.15, .15, .05),\n label = c(\"median\", \"bar(x)\", \"pdf\"), parse = TRUE,\n color = c(red, blue, orange), size = 6\n )\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Now what...\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(406406406)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.611615\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nFhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(data.frame(Fhat), aes(Fhat)) +\n geom_density(color = orange) +\n geom_vline(xintercept = CI, color = orange, linetype = 2) +\n geom_vline(xintercept = med, col = blue) +\n geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +\n annotate(\"label\",\n x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),\n color = c(orange, red, blue),\n label = c(\"widehat(F)\", \"true~median\", \"widehat(median)\"),\n parse = TRUE\n ) +\n xlab(\"x\") +\n geom_rug(aes(2 * med - Fhat))\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## How does this work?\n\n![](gfx/boot1.png)\n\n\n\n## Approximations\n\n![](gfx/boot2.png)\n\n\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(fatcats, aes(Bwt, Hwt)) +\n geom_point(color = blue) +\n xlab(\"Cat body weight (Kg)\") +\n ylab(\"Cat heart weight (g)\")\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nsummary(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-11.2353 -0.7932 -0.1407 0.5968 11.1026 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.95424 0.06294 62.83 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.089 on 143 degrees of freedom\nMultiple R-squared: 0.965,\tAdjusted R-squared: 0.9648 \nF-statistic: 3947 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## When we fit models, we examine diagnostics\n\n\n::: flex\n::: w-50\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqqnorm(residuals(cats.lm), pch = 16, col = blue)\nqqline(residuals(cats.lm), col = orange, lwd = 2)\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n\nThe tails are too fat. So I don't believe that CI...\n\n:::\n\n::: w-50\n\n**We bootstrap**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.826735 4.084322 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for the methods in this module\n\n. . . \n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n\n\n## Same data\n\n::: flex\n::: w-50\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.832907 4.070232 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n:::\n\n::: w-50\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nr <- residuals(cats.lm)\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |> mutate(\n Hwt = predict(cats.lm) +\n sample(r, n(), replace = TRUE)\n )\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.815162 4.065045 \n```\n:::\n:::\n\n\n:::\n\n:::\n\n\n\n## Bootstrap error sources\n\n\n![](gfx/boot2.png){.center}\n\n\n\nSimulation error\n: using only $B$ samples to estimate $F$ with $\\hat{F}$.\n\nStatistical error\n: our data depended on a sample from the population. We don't have the whole population so we make an error by using a sample (Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\nSpecification error\n: If we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\nOur interval is\n\n$$\n[2\\hat{\\theta} - \\theta^*_{1-\\alpha/2},\\ 2\\hat{\\theta} - \\theta^*_{\\alpha/2}]\n$$\n\nwhere $\\theta^*_q$ is the $q$ quantile of $\\hat{\\Theta}$.\n\n
\n\n* Called the \"Pivotal Interval\"\n* Has the correct $1-\\alpha$% coverage under very mild conditions on $\\hat{\\theta}$\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\n$$\n[\\hat{\\theta} - t_{\\alpha/2}s,\\ \\hat{\\theta} - t_{\\alpha/2}s]\n$$\n\nwhere $\\hat{s} = \\sqrt{\\Var{\\hat{\\Theta}}}$\n\n
\n\n* Called the \"Normal Interval\"\n* Only works if the distribution of $\\hat{\\Theta}$ is approximately Normal.\n* Unlikely to work well\n* Don't do this\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\n$$\n[\\theta^*_{\\alpha/2},\\ \\theta^*_{1-\\alpha/2}]\n$$\n\nwhere $\\theta^*_q$ is the $q$ quantile of $\\hat{\\Theta}$.\n\n
\n\n* Called the \"Percentile Interval\"\n* Works if $\\exists$ monotone $m$ so that $m(\\hat\\Theta) \\sim N(m(\\theta), c^2)$\n* Better than the Normal Interval\n* More assumptions than the Pivotal Interval\n\n## More details\n\n> See \"All of Statistics\" by Larry Wasserman, Chapter 8.3\n\nThere's a handout with the proofs on Canvas (under Modules)\n\n\n# Next time...\n\nBootstrap for bagging and random forests\n", + "markdown": "---\nlecture: \"18 The bootstrap\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n\n## A small detour...\n\n\n![](https://www.azquotes.com/picture-quotes/quote-i-believe-in-pulling-yourself-up-by-your-own-bootstraps-i-believe-it-is-possible-i-saw-stephen-colbert-62-38-03.jpg)\n\n##\n\n![](http://rackjite.com/wp-content/uploads/rr11014aa.jpg)\n\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\nYou usually get these \"second-level\" properties from \"the sampling distribution of an estimator\"\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\nBut what if you don't know the sampling distribution? Or you're skeptical of the CLT argument?\n\n## In statistics...\n\nThe \"bootstrap\" works. And well.\n\nIt's good for \"second-level\" analysis.\n\n* \"First-level\" analyses are things like $\\hat\\beta$, $\\hat \\y$, an estimator of the center (a median), etc.\n\n* \"Second-level\" are things like $\\Var{\\hat\\beta}$, a confidence interval for $\\hat \\y$, or a median, etc.\n\n
\n\n[Sampling distributions]{.primary}\n\n1. If $X_i$ are iid Normal $(0,\\sigma^2)$, then $\\Var{\\overline{X}} = \\sigma^2 / n$.\n1. If $X_i$ are iid and $n$ is big, then $\\Var{\\overline{X}} \\approx \\Var{X_1} / n$.\n1. If $X_i$ are iid Binomial $(m, p)$, then $\\Var{\\overline{X}} = mp(1-p) / n$\n\n\n\n## Example of unknown sampling distribution\n\nI estimate a LDA on some data.\n\nI get a new $x_0$ and produce $\\widehat{Pr}(y_0 =1 \\given x_0)$.\n\nCan I get a 95% confidence interval for $Pr(y_0=1 \\given x_0)$?\n\n. . .\n\n[The bootstrap gives this to you.]{.secondary}\n\n## Bootstrap procedure\n\n1. Resample your training data w/ replacement.\n2. Calculate LDA on this sample.\n3. Produce a new prediction, call it $\\widehat{Pr}_b(y_0 =1 \\given x_0)$.\n4. Repeat 1-3 $b = 1,\\ldots,B$ times.\n5. CI: $\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]$\n\n. . .\n\n$\\hat{F}$ is the \"empirical\" distribution of the bootstraps. \n\n\n## Empirical distribution\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nr <- rexp(50, 1 / 5)\nggplot(tibble(r = r), aes(r)) + \n stat_ecdf(colour = orange) +\n geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +\n geom_hline(yintercept = c(.05, .95), linetype = \"dashed\") +\n annotate(\n \"label\", x = c(5, 12), y = c(.25, .75), \n label = c(\"hat(F)[boot](.05)\", \"hat(F)[boot](.95)\"), \n parse = TRUE\n )\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Very basic example\n\n* Let $X_i\\sim \\textrm{Exponential}(1/5)$. The pdf is $f(x) = \\frac{1}{5}e^{-x/5}$\n\n\n* I know if I estimate the mean with $\\bar{X}$, then by the CLT (if $n$ is big), \n\n$$\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).$$\n\n\n* This gives me a 95% confidence interval like\n$$\\bar{X} \\pm 2s/\\sqrt{n}$$\n\n\n* But I don't want to estimate the mean, I want to estimate the median.\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(data.frame(x = c(0, 12)), aes(x)) +\n stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +\n geom_vline(xintercept = 5, color = blue) + # mean\n geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median\n annotate(\"label\",\n x = c(2.5, 5.5, 10), y = c(.15, .15, .05),\n label = c(\"median\", \"bar(x)\", \"pdf\"), parse = TRUE,\n color = c(red, blue, orange), size = 6\n )\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Now what...\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n* I give you a sample of size 500, you give me the sample median.\n\n* How do you get a CI?\n\n* You can use the bootstrap!\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(406406406)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.611615\n```\n:::\n\n```{.r .cell-code}\nB <- 100\nalpha <- 0.05\nFhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))\n```\n:::\n\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(data.frame(Fhat), aes(Fhat)) +\n geom_density(color = orange) +\n geom_vline(xintercept = CI, color = orange, linetype = 2) +\n geom_vline(xintercept = med, col = blue) +\n geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +\n annotate(\"label\",\n x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),\n color = c(orange, red, blue),\n label = c(\"widehat(F)\", \"true~median\", \"widehat(median)\"),\n parse = TRUE\n ) +\n xlab(\"x\") +\n geom_rug(aes(2 * med - Fhat))\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## How does this work?\n\n![](gfx/boot1.png)\n\n\n\n## Approximations\n\n![](gfx/boot2.png)\n\n\n\n## Slightly harder example\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggplot(fatcats, aes(Bwt, Hwt)) +\n geom_point(color = blue) +\n xlab(\"Cat body weight (Kg)\") +\n ylab(\"Cat heart weight (g)\")\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nsummary(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-11.2353 -0.7932 -0.1407 0.5968 11.1026 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.95424 0.06294 62.83 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.089 on 143 degrees of freedom\nMultiple R-squared: 0.965,\tAdjusted R-squared: 0.9648 \nF-statistic: 3947 on 1 and 143 DF, p-value: < 2.2e-16\n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## When we fit models, we examine diagnostics\n\n\n::: flex\n::: w-50\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nqqnorm(residuals(cats.lm), pch = 16, col = blue)\nqqline(residuals(cats.lm), col = orange, lwd = 2)\n```\n\n::: {.cell-output-display}\n![](18-the-bootstrap_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n\nThe tails are too fat. So I don't believe that CI...\n\n:::\n\n::: w-50\n\n**We bootstrap**\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.826735 4.084322 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## An alternative\n\n* So far, I didn't use any information about the data-generating process. \n\n* We've done the [non-parametric bootstrap]{.secondary}\n\n* This is easiest, and most common for the methods in this module\n\n. . . \n\n[But there's another version]{.secondary}\n\n* You could try a \"parametric bootstrap\"\n\n* This assumes knowledge about the DGP\n\n\n\n## Same data\n\n::: flex\n::: w-50\n\n[Non-parametric bootstrap]{.secondary}\n\nSame as before\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.832907 4.070232 \n```\n:::\n\n```{.r .cell-code}\nconfint(cats.lm) # Original CI\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n```\n:::\n:::\n\n:::\n\n::: w-50\n[Parametric bootstrap]{.secondary}\n\n1. Assume that the linear model is TRUE.\n2. Then, $\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i$, $\\widehat{e}_i \\approx \\epsilon_i$\n3. The $\\epsilon_i$ is random $\\longrightarrow$ just resample $\\widehat{e}_i$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nB <- 500\nbhats <- double(B)\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nr <- residuals(cats.lm)\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |> mutate(\n Hwt = predict(cats.lm) +\n sample(r, n(), replace = TRUE)\n )\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n 97.5% 2.5% \n3.815162 4.065045 \n```\n:::\n:::\n\n\n:::\n\n:::\n\n\n\n## Bootstrap error sources\n\n\n![](gfx/boot2.png){.center}\n\n\n\nSimulation error\n: using only $B$ samples to estimate $F$ with $\\hat{F}$.\n\nStatistical error\n: our data depended on a sample from the population. We don't have the whole population so we make an error by using a sample (Note: this part is what __always__ happens with data, and what the science of statistics analyzes.)\n\nSpecification error\n: If we use the parametric bootstrap, and our model is wrong, then we are overconfident.\n\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\nOur interval is\n\n$$\n[2\\hat{\\theta} - \\theta^*_{1-\\alpha/2},\\ 2\\hat{\\theta} - \\theta^*_{\\alpha/2}]\n$$\n\nwhere $\\theta^*_q$ is the $q$ quantile of $\\hat{\\Theta}$.\n\n
\n\n* Called the \"Pivotal Interval\"\n* Has the correct $1-\\alpha$% coverage under very mild conditions on $\\hat{\\theta}$\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\n$$\n[\\hat{\\theta} - z_{\\alpha/2}\\hat{s},\\ \\hat{\\theta} + z_{\\alpha/2}\\hat{s}]\n$$\n\nwhere $\\hat{s} = \\sqrt{\\Var{\\hat{\\Theta}}}$\n\n
\n\n* Called the \"Normal Interval\"\n* Only works if the distribution of $\\hat{\\Theta}$ is approximately Normal.\n* Unlikely to work well\n* Don't do this\n\n## Types of intervals\n\nLet $\\hat{\\theta}$ be our sample statistic, $\\hat{\\Theta}$ be the resamples\n\n$$\n[\\theta^*_{\\alpha/2},\\ \\theta^*_{1-\\alpha/2}]\n$$\n\nwhere $\\theta^*_q$ is the $q$ quantile of $\\hat{\\Theta}$.\n\n
\n\n* Called the \"Percentile Interval\"\n* Works if $\\exists$ monotone $m$ so that $m(\\hat\\Theta) \\sim N(m(\\theta), c^2)$\n* Better than the Normal Interval\n* More assumptions than the Pivotal Interval\n\n## More details\n\n> See \"All of Statistics\" by Larry Wasserman, Chapter 8.3\n\nThere's a handout with the proofs on Canvas (under Modules)\n\n\n# Next time...\n\nBootstrap for bagging and random forests\n", "supporting": [ "18-the-bootstrap_files" ], diff --git a/schedule/slides/20-boosting-speaker.html b/schedule/slides/20-boosting-speaker.html index c82b596..4f704e5 100644 --- a/schedule/slides/20-boosting-speaker.html +++ b/schedule/slides/20-boosting-speaker.html @@ -8,7 +8,7 @@ - + UBC Stat406 2023W – boosting