diff --git a/_freeze/schedule/slides/11-kernel-smoothers/execute-results/html.json b/_freeze/schedule/slides/11-kernel-smoothers/execute-results/html.json index 443c508..061bb89 100644 --- a/_freeze/schedule/slides/11-kernel-smoothers/execute-results/html.json +++ b/_freeze/schedule/slides/11-kernel-smoothers/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "8c9479c260e07b3f045f15e51ea63a7c", + "hash": "705133ca6bd22eef7712667d691269e0", "result": { "engine": "knitr", - "markdown": "---\nlecture: \"11 Local methods\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n# Two Great Questions from Last Lecture\n\n## 1) The Bias of OLS\n\nIn last class' clicker question, we were trying to predict income from\na $p=2$, $n=50000$ dataset. I claimed that the *OLS predictor is high bias* in this example.\\\n\n\\\nBut Trevor proved that the *OLS predictor is unbiased* (via the following proof):\n\nA. Assume that $y = x^\\top \\beta + \\epsilon, \\quad \\epsilon \\sim \\mathcal N(0, \\sigma^2).$\n\nB. Then $E[\\hat \\beta_\\mathrm{ols}] = E[ E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X] ]$\n\nC. $E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X] = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top E[ \\mathbf y \\mid \\mathbf X]$\n\nD. $E[ \\mathbf y \\mid \\mathbf X] = \\mathbf X^\\top \\beta$\n\nE. So $E[\\hat \\beta_\\mathrm{ols}] - \\beta = E[(\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf X \\beta] - \\beta = \\beta - \\beta = 0$.\n\n\\\n[Why did this proof not apply to the clicker question?]{.secondary} \\\n[(Which step of this proof breaks down?)]{.small}\n\n\n## 1) The Bias of OLS\n\nWhich step did the proof break down?\n\nA. Assume that $y = x^\\top \\beta + \\epsilon, \\quad \\epsilon \\sim \\mathcal N(0, \\sigma^2).$\n\n\\\nThis assumption does not hold.\n\nIt is (almost certainly) not the case that `Income ~ Age + Education`.\n\n. . .\n\n\\\nIn reality, $y \\sim f(x) + \\epsilon$ where $f(x)$ is some potentially nonlinear function. So\n\n$$\n\\begin{align}\nE[\\hat \\beta_\\mathrm{ols}] = E[E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X ]]\n= E[ (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X f(\\mathbf X) ] \\ne \\beta\n\\end{align}\n$$\n\nIn statistics speak, our model is *misspecified*.\\\n[Ridge/lasso will still increase bias and decrease variance even under misspecification.]{.small}\n\n\n## 2) Why does ridge regression shrink varinace?\n\n### Mathematically\n\n- Variance of OLS: $\\mathrm{Cov}[\\hat \\beta_\\mathrm{ols}] = \\sigma^2 E[ (\\mathbf X^\\top \\mathbf X)^{-1} ]$\n- Variance of ridge regression: $\\mathrm{Cov}[\\hat \\beta_\\mathrm{ols}] = \\sigma^2 E[ (\\mathbf X^\\top \\mathbf X + \\lambda \\mathbf I)^{-1} ]$\n\n::: fragment\n\n### Intuitively...\n\nThink about the constrained optimization problem pictorally. 
\\\nWhat happens if we had a different dataset?\n\n:::\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 07 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Last time...\n\nWe looked at [feature maps]{.secondary} as a way to do nonlinear regression.\n\nWe used new \"features\" $\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)$\n\n\nNow we examine a *nonparametric* alternative\n\nSuppose I just look at the \"neighbours\" of some point (based on the $x$-values)\n\nI just average the $y$'s at those locations together\n\n## Let's use 3 neighbours\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(cowplot)\ndata(arcuate, package = \"Stat406\")\nset.seed(406406)\narcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)\npt <- 15\nnn <- 3\nseq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)\nneibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]\narcuate_unif$neighbours = seq_len(40) %in% neibs\ng1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + \n geom_point() +\n scale_colour_manual(values = c(blue, red)) + \n geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + \n annotate(\"rect\", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,\n xmin = min(arcuate_unif$position[neibs]), \n xmax = max(arcuate_unif$position[neibs])\n ) +\n theme(legend.position = \"none\")\ng2 <- ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa\n )$pred\n ),\n colour = orange, linewidth = 2\n )\nplot_grid(g1, g2, ncol = 2)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/load-lidar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## KNN \n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/small-lidar-again-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(arcuate, package = 
\"Stat406\")\nlibrary(FNN)\narcuate_unif <- arcuate |> \n slice_sample(n = 40) |> \n arrange(position) \n\nnew_position <- seq(\n min(arcuate_unif$position), \n max(arcuate_unif$position),\n length.out = 101\n)\n\nknn3 <- knn.reg(\n train = arcuate_unif$position, \n test = matrix(arcuate_unif$position, ncol = 1), \n y = arcuate_unif$fa, \n k = 3\n)\n```\n:::\n\n\n:::\n:::\n\n\n\n## This method is $K$-nearest neighbours.\n\nIt's a [linear smoother]{.secondary} just like in previous lectures: \n$\\widehat{\\mathbf{y}} = \\mathbf{S} \\mathbf{y}$ for some matrix $S$.\n\nYou should imagine what $\\mathbf{S}$ looks like.\n\n1. What is the degrees of freedom of KNN, and how does it depend on $k$?\n2. How does $k$ affect the bias/variance?\n\n. . .\n\n- $\\mathrm{df} = \\tr{\\mathbf S} = n/k$.\n- $k = n$ produces a constant predictor (highest bias, lowest variance).\n- $k = 1$ produces a low bias but extremely high variance predictor.\n\n---\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nset.seed(406406)\n\nplot_knn <- function(k) {\n ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa,\n k = k\n )$pred\n ),\n colour = orange, linewidth = 2\n ) + ggtitle(paste(\"k =\", k))\n}\n\ng1 <- plot_knn(1)\ng2 <- plot_knn(5)\ng3 <- plot_knn(length(arcuate_unif$position))\nplot_grid(g1, g2, g3, ncol = 3)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Local averages (soft KNN)\n\nKNN averages the neighbours with equal weight.\n\nBut some neighbours are \"closer\" than other neighbours.\n\nInstead of choosing the number of neighbours to average, we can average \nany observations within a certain distance.\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe boxes have width 30. \n\n\n## What is a \"kernel\" smoother?\n\n* The mathematics:\n\n> A kernel is any function $K$ such that for any $u$,\n>\n> - $K(u) \\geq 0$,\n> - $\\int K(u) du = 1$,\n> - $\\int uK(u) du = 0$.\n\n* The idea: a kernel takes weighted averages. The kernel function gives the weights.\n\n* The previous example is called the [boxcar]{.secondary} kernel. \n\n## Smoothing with the boxcar\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntestpts <- seq(0, 200, length.out = 101)\ndmat <- abs(outer(testpts, arcuate_unif$position, \"-\"))\nS <- (dmat < 15)\nS <- S / rowSums(S)\nboxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(data = boxcar, colour = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/boxcar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThis one gives the same non-zero weight to all points within $\\pm 15$ range.\n\n\n\n## Other kernels\n\nMost of the time, we don't use the boxcar because the weights are weird.\\\n[Ideally we would like closer points to have more weight.]{.small}\n\n::: flex\n::: w-60\nA more common kernel: the Gaussian kernel\n\n$$\nK(u) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} \\exp\\left(-\\frac{u^2}{2\\sigma^2}\\right)\n$$\n\nFor the plot, I made $\\sigma=7.5$. 
\n\nNow the weights \"die away\" for points farther from where we're predicting.\\\n[(but all nonzero!!)]{.small}\n\n:::\n\n::: w-40\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n:::\n:::\n\n## Other kernels\n\nWhat if I made $\\sigma=15$?\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nBefore, points far from $x_{15}$ got very small weights, now they have more influence.\n\nFor the Gaussian kernel, $\\sigma$ determines something like the \"range\" of the smoother.\n\n\n\n## Many Gaussians\n\nThe following code creates $\\mathbf{S}$ for Gaussian kernel smoothers with different $\\sigma$\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndmat <- as.matrix(dist(x))\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg), \"/\") # make the rows sum to 1.\n}\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg),'/') # make the rows sum to 1.\n}\nboxcar$S15 = with(arcuate_unif, Sgauss(15) %*% fa)\nboxcar$S08 = with(arcuate_unif, Sgauss(8) %*% fa)\nboxcar$S30 = with(arcuate_unif, Sgauss(30) %*% fa)\nbc = boxcar %>% select(position, S15, S08, S30) %>% \n pivot_longer(-position, names_to = \"Sigma\")\nggplot(arcuate_unif, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +\n scale_colour_brewer(palette = \"Set1\")\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## The bandwidth\n\n* Choosing $\\sigma$ is [very]{.secondary} important.\n\n* This \"range\" parameter is called the [bandwidth]{.secondary}.\n\n* It is way more important than which kernel you use.\n\n\n\n\n\n\n\n\n\n## Choosing the bandwidth\n\nAs we have discussed, kernel smoothing (and KNN) are linear smoothers\n\n$$\\widehat{\\mathbf{y}} = \\mathbf{S}\\mathbf{y}$$\n\n\n\nThe [degrees of freedom]{.secondary} is $\\textrm{tr}(\\mathbf{S})$\n\nTherefore we can use our model selection criteria from before \n\n. . 
.\n\nUnfortunately, these don't satisfy the \"technical condition\", so\n`cv_nice()` doesn't give LOO-CV\n\n\n## Smoothing the full Lidar data\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nar <- arcuate |> slice_sample(n = 200)\n\ngcv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 )\n}\n\nfake_loocv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - diag(S))^2 )\n}\n\ndmat <- as.matrix(dist(ar$position))\nsigmas <- 10^(seq(log10(300), log10(.3), length = 100))\n\ngcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))\nflcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))\nbest_s <- sigmas[which.min(gcvs)]\nother_s <- sigmas[which.min(flcvs)]\n\nar$smoothed <- Sgauss(best_s) %*% ar$fa\nar$other <- Sgauss(other_s) %*% ar$fa\n```\n:::\n\n\n\n## Smoothing the full Lidar data\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ng3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +\n geom_point(colour = blue) +\n geom_vline(xintercept = best_s, colour = red) +\n scale_x_log10() +\n xlab(sprintf(\"Sigma, best is sig = %.2f\", best_s))\ng4 <- ggplot(ar, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(aes(y = smoothed), colour = orange, linewidth = 2)\nplot_grid(g3, g4, nrow = 1)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/smoothed-lidar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nI considered $\\sigma \\in [0.3,\\ 300]$ and used $3.97$. \n\nIt's too wiggly, to my eye. Typical for GCV.\n\n\n## Smoothing manually\n\nI did Kernel Smoothing \"manually\"\n\n1. For a fixed bandwidth\n\n2. Compute the smoothing matrix\n\n3. Make the predictions\n\n4. Repeat and compute GCV\n\nThe point is to \"show how it works\". It's also really easy.\n\n## `R` functions / packages\n\nThere are a number of other ways to do this in R\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nloess()\nksmooth()\nKernSmooth::locpoly()\nmgcv::gam()\nnp::npreg()\n```\n:::\n\n\n\nThese have tricks and ways of doing CV and other things automatically.\n\nNote\n: All I needed was the distance matrix `dist(x)`. \n: Given ANY distance function \n: say, $d(\\mathbf{x}_i, \\mathbf{x}_j) = \\Vert\\mathbf{x}_i - \\mathbf{x}_j\\Vert_2 + I(x_{i,3} = x_{j,3})$\n: I can use these methods.\n\n\n# Next time...\n\nWhy don't we just smooth everything all the time?\n", + "markdown": "---\nlecture: \"11 Local methods\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n# Two Great Questions from Last Lecture\n\n## 1) The Bias of OLS\n\nIn last class' clicker question, we were trying to predict income from\na $p=2$, $n=50000$ dataset. I claimed that the *OLS predictor is high bias* in this example.\\\n\n\\\nBut Trevor proved that the *OLS predictor is unbiased* (via the following proof):\n\nA. Assume that $y = x^\\top \\beta + \\epsilon, \\quad \\epsilon \\sim \\mathcal N(0, \\sigma^2).$\n\nB. Then $E[\\hat \\beta_\\mathrm{ols}] = E[ E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X] ]$\n\nC. $E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X] = (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top E[ \\mathbf y \\mid \\mathbf X]$\n\nD. $E[ \\mathbf y \\mid \\mathbf X] = \\mathbf X \\beta$\n\nE. 
So $E[\\hat \\beta_\\mathrm{ols}] - \\beta = E[(\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X^\\top \\mathbf X \\beta] - \\beta = \\beta - \\beta = 0$.\n\n\\\n[Why did this proof not apply to the clicker question?]{.secondary} \\\n[(Which step of this proof breaks down?)]{.small}\n\n\n## 1) The Bias of OLS\n\nWhich step did the proof break down?\n\nA. Assume that $y = x^\\top \\beta + \\epsilon, \\quad \\epsilon \\sim \\mathcal N(0, \\sigma^2).$\n\n\\\nThis assumption does not hold.\n\nIt is (almost certainly) not the case that `Income ~ Age + Education`.\n\n. . .\n\n\\\nIn reality, $y \\sim f(x) + \\epsilon$ where $f(x)$ is some potentially nonlinear function. So\n\n$$\n\\begin{align}\nE[\\hat \\beta_\\mathrm{ols}] = E[E[\\hat \\beta_\\mathrm{ols} \\mid \\mathbf X ]]\n= E[ (\\mathbf X^\\top \\mathbf X)^{-1} \\mathbf X f(\\mathbf X) ] \\ne \\beta\n\\end{align}\n$$\n\nIn statistics speak, our model is *misspecified*.\\\n[Ridge/lasso will always increase bias and decrease variance, even under misspecification.]{.small}\n\n\n## 2) Why does ridge regression shrink varinace?\n\n### Mathematically\n\n- Variance of OLS: $\\mathrm{Cov}[\\hat \\beta_\\mathrm{ols}] = \\sigma^2 E[ (\\mathbf X^\\top \\mathbf X)^{-1} ]$\n- Variance of ridge regression: $\\mathrm{Cov}[\\hat \\beta_\\mathrm{ols}] = \\sigma^2 E[ (\\mathbf X^\\top \\mathbf X + \\lambda \\mathbf I)^{-1} ]$\n\n::: fragment\n\n### Intuitively...\n\nThink about the constrained optimization problem pictorally. \\\nWhat happens if we had a different dataset?\n\n:::\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Last time...\n\nWe looked at [feature maps]{.secondary} as a way to do nonlinear regression.\n\nWe used new \"features\" $\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)$\n\n\nNow we examine a *nonparametric* alternative\n\nSuppose I just look at the \"neighbours\" of some point (based on the $x$-values)\n\nI just average the $y$'s at those locations together\n\n## Let's use 3 neighbours\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(cowplot)\ndata(arcuate, package = \"Stat406\")\nset.seed(406406)\narcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)\npt <- 15\nnn 
<- 3\nseq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)\nneibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]\narcuate_unif$neighbours = seq_len(40) %in% neibs\ng1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + \n geom_point() +\n scale_colour_manual(values = c(blue, red)) + \n geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + \n annotate(\"rect\", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,\n xmin = min(arcuate_unif$position[neibs]), \n xmax = max(arcuate_unif$position[neibs])\n ) +\n theme(legend.position = \"none\")\ng2 <- ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa\n )$pred\n ),\n colour = orange, linewidth = 2\n )\nplot_grid(g1, g2, ncol = 2)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/load-lidar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## KNN \n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/small-lidar-again-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(arcuate, package = \"Stat406\")\nlibrary(FNN)\narcuate_unif <- arcuate |> \n slice_sample(n = 40) |> \n arrange(position) \n\nnew_position <- seq(\n min(arcuate_unif$position), \n max(arcuate_unif$position),\n length.out = 101\n)\n\nknn3 <- knn.reg(\n train = arcuate_unif$position, \n test = matrix(arcuate_unif$position, ncol = 1), \n y = arcuate_unif$fa, \n k = 3\n)\n```\n:::\n\n\n:::\n:::\n\n\n\n## This method is $K$-nearest neighbours.\n\nIt's a [linear smoother]{.secondary} just like in previous lectures: \n$\\widehat{\\mathbf{y}} = \\mathbf{S} \\mathbf{y}$ for some matrix $S$.\n\nYou should imagine what $\\mathbf{S}$ looks like.\n\n1. What is the degrees of freedom of KNN, and how does it depend on $k$?\n2. How does $k$ affect the bias/variance?\n\n. . 
.\n\n- $\\mathrm{df} = \\tr{\\mathbf S} = n/k$.\n- $k = n$ produces a constant predictor (highest bias, lowest variance).\n- $k = 1$ produces a low bias but extremely high variance predictor.\n\n---\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nset.seed(406406)\n\nplot_knn <- function(k) {\n ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa,\n k = k\n )$pred\n ),\n colour = orange, linewidth = 2\n ) + ggtitle(paste(\"k =\", k))\n}\n\ng1 <- plot_knn(1)\ng2 <- plot_knn(5)\ng3 <- plot_knn(length(arcuate_unif$position))\nplot_grid(g1, g2, g3, ncol = 3)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Local averages (soft KNN)\n\nKNN averages the neighbours with equal weight.\n\nBut some neighbours are \"closer\" than other neighbours.\n\nInstead of choosing the number of neighbours to average, we can average \nany observations within a certain distance.\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThe boxes have width 30. \n\n\n## What is a \"kernel\" smoother?\n\n* The mathematics:\n\n> A kernel is any function $K$ such that for any $u$,\n>\n> - $K(u) \\geq 0$,\n> - $\\int K(u) du = 1$,\n> - $\\int uK(u) du = 0$.\n\n* The idea: a kernel takes weighted averages. The kernel function gives the weights.\n\n* The previous example is called the [boxcar]{.secondary} kernel. \n\n## Smoothing with the boxcar\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntestpts <- seq(0, 200, length.out = 101)\ndmat <- abs(outer(testpts, arcuate_unif$position, \"-\"))\nS <- (dmat < 15)\nS <- S / rowSums(S)\nboxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(data = boxcar, colour = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/boxcar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nThis one gives the same non-zero weight to all points within $\\pm 15$ range.\n\n\n\n## Other kernels\n\nMost of the time, we don't use the boxcar because the weights are weird.\\\n[Ideally we would like closer points to have more weight.]{.small}\n\n::: flex\n::: w-60\nA more common kernel: the Gaussian kernel\n\n$$\nK(u) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} \\exp\\left(-\\frac{u^2}{2\\sigma^2}\\right)\n$$\n\nFor the plot, I made $\\sigma=7.5$. 
\n\nNow the weights \"die away\" for points farther from where we're predicting.\\\n[(but all nonzero!!)]{.small}\n\n:::\n\n::: w-40\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n:::\n:::\n\n## Other kernels\n\nWhat if I made $\\sigma=15$?\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nBefore, points far from $x_{15}$ got very small weights, now they have more influence.\n\nFor the Gaussian kernel, $\\sigma$ determines something like the \"range\" of the smoother.\n\n\n\n## Many Gaussians\n\nThe following code creates $\\mathbf{S}$ for Gaussian kernel smoothers with different $\\sigma$\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndmat <- as.matrix(dist(x))\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg), \"/\") # make the rows sum to 1.\n}\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg),'/') # make the rows sum to 1.\n}\nboxcar$S15 = with(arcuate_unif, Sgauss(15) %*% fa)\nboxcar$S08 = with(arcuate_unif, Sgauss(8) %*% fa)\nboxcar$S30 = with(arcuate_unif, Sgauss(30) %*% fa)\nbc = boxcar %>% select(position, S15, S08, S30) %>% \n pivot_longer(-position, names_to = \"Sigma\")\nggplot(arcuate_unif, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +\n scale_colour_brewer(palette = \"Set1\")\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## The bandwidth\n\n* Choosing $\\sigma$ is [very]{.secondary} important.\n\n* This \"range\" parameter is called the [bandwidth]{.secondary}.\n\n* It is way more important than which kernel you use.\n\n\n\n\n\n\n\n\n\n## Choosing the bandwidth\n\nAs we have discussed, kernel smoothing (and KNN) are linear smoothers\n\n$$\\widehat{\\mathbf{y}} = \\mathbf{S}\\mathbf{y}$$\n\n\n\nThe [degrees of freedom]{.secondary} is $\\textrm{tr}(\\mathbf{S})$\n\nTherefore we can use our model selection criteria from before \n\n. . 
.\n\nUnfortunately, these don't satisfy the \"technical condition\", so\n`cv_nice()` doesn't give LOO-CV\n\n\n## Smoothing the full Lidar data\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nar <- arcuate |> slice_sample(n = 200)\n\ngcv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 )\n}\n\nfake_loocv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - diag(S))^2 )\n}\n\ndmat <- as.matrix(dist(ar$position))\nsigmas <- 10^(seq(log10(300), log10(.3), length = 100))\n\ngcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))\nflcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))\nbest_s <- sigmas[which.min(gcvs)]\nother_s <- sigmas[which.min(flcvs)]\n\nar$smoothed <- Sgauss(best_s) %*% ar$fa\nar$other <- Sgauss(other_s) %*% ar$fa\n```\n:::\n\n\n\n## Smoothing the full Lidar data\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ng3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +\n geom_point(colour = blue) +\n geom_vline(xintercept = best_s, colour = red) +\n scale_x_log10() +\n xlab(sprintf(\"Sigma, best is sig = %.2f\", best_s))\ng4 <- ggplot(ar, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(aes(y = smoothed), colour = orange, linewidth = 2)\nplot_grid(g3, g4, nrow = 1)\n```\n\n::: {.cell-output-display}\n![](11-kernel-smoothers_files/figure-revealjs/smoothed-lidar-1.svg){fig-align='center'}\n:::\n:::\n\n\n\nI considered $\\sigma \\in [0.3,\\ 300]$ and used $3.97$. \n\nIt's too wiggly, to my eye. Typical for GCV.\n\n\n## Smoothing manually\n\nI did Kernel Smoothing \"manually\"\n\n1. For a fixed bandwidth\n\n2. Compute the smoothing matrix\n\n3. Make the predictions\n\n4. Repeat and compute GCV\n\nThe point is to \"show how it works\". It's also really easy.\n\n## `R` functions / packages\n\nThere are a number of other ways to do this in R\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nloess()\nksmooth()\nKernSmooth::locpoly()\nmgcv::gam()\nnp::npreg()\n```\n:::\n\n\n\nThese have tricks and ways of doing CV and other things automatically.\n\nNote\n: All I needed was the distance matrix `dist(x)`. 
\n: Given ANY distance function \n: say, $d(\\mathbf{x}_i, \\mathbf{x}_j) = \\Vert\\mathbf{x}_i - \\mathbf{x}_j\\Vert_2 + I(x_{i,3} = x_{j,3})$\n: I can use these methods.\n\n\n# Next time...\n\nWhy don't we just smooth everything all the time?\n", "supporting": [ "11-kernel-smoothers_files" ], diff --git a/_freeze/schedule/slides/12-why-smooth/execute-results/html.json b/_freeze/schedule/slides/12-why-smooth/execute-results/html.json index a0b4ff7..d27af91 100644 --- a/_freeze/schedule/slides/12-why-smooth/execute-results/html.json +++ b/_freeze/schedule/slides/12-why-smooth/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "cf38a603cc72a3e044e41c61503362c2", + "hash": "768f01ee2bd72df68cf3ebd8a8d9a8a0", "result": { "engine": "knitr", - "markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming $k$ is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. 
\n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- As $n \\to \\infty$, $R_n^{(\\mathrm{basis})} \\to C_1^{(\\mathrm{basis})} + \\sigma^2$\n- As $n \\to \\infty$, $R_n^{(\\mathrm{local})} \\to \\sigma^2$\n:::\n\n:::\n:::\n\n\n\n\n\n\n\n\n\n\n## Takeaway\n\n1. Local methods are *consistent* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . \n\n::: flex\n::: w-60\nLet $r$ the the average distance between $x$ and its $k^\\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \\approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \\approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \\approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \\approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . 
.\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\\mathrm{OLS})} =\n \\underbrace{C_1^{(\\mathrm{lin})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{lin})}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\\mathrm{local})} + C_2^{(\\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\\mathrm{lin})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\\mathrm{lin})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensinality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions versus local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n", + "markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. 
Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming num. basis features is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. \n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- $R_n^{(\\mathrm{basis})} \\overset{n \\to \\infty}{\\longrightarrow} C_1^{(\\mathrm{basis})} + \\sigma^2$\n- $R_n^{(\\mathrm{local})} \\overset{n \\to \\infty}{\\longrightarrow} \\sigma^2$\n:::\n\n:::\n:::\n\n\n\n\n\n\n\n\n\n\n## Takeaway\n\n1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . 
\n\n::: flex\n::: w-60\nLet $r$ be the average distance between $x$ and its $k^\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . .\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\mathrm{OLS})} =\n  \\underbrace{C_1^{(\mathrm{OLS})}}_{\\mathrm{bias}^2} +\n  \\underbrace{\\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\\mathrm{var}} +\n  \\sigma^2,\n\\qquad\nR_n^{(\mathrm{local})} =\n  \\underbrace{\\tfrac{C_1^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n  \\underbrace{\\tfrac{C_2^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n  \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{OLS})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\mathrm{OLS})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensionality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions and local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n", "supporting": [ "12-why-smooth_files" ], diff --git a/schedule/slides/11-kernel-smoothers.qmd b/schedule/slides/11-kernel-smoothers.qmd index b8612f8..52b5f5a 100644 --- a/schedule/slides/11-kernel-smoothers.qmd +++ b/schedule/slides/11-kernel-smoothers.qmd @@ -21,7 +21,7 @@ B. Then $E[\hat \beta_\mathrm{ols}] = E[ E[\hat \beta_\mathrm{ols} \mid \mathbf C. $E[\hat \beta_\mathrm{ols} \mid \mathbf X] = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top E[ \mathbf y \mid \mathbf X]$ -D. $E[ \mathbf y \mid \mathbf X] = \mathbf X^\top \beta$ +D. $E[ \mathbf y \mid \mathbf X] = \mathbf X \beta$ E. So $E[\hat \beta_\mathrm{ols}] - \beta = E[(\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf X \beta] - \beta = \beta - \beta = 0$. 
@@ -54,7 +54,7 @@ E[\hat \beta_\mathrm{ols}] = E[E[\hat \beta_\mathrm{ols} \mid \mathbf X ]] $$ In statistics speak, our model is *misspecified*.\ -[Ridge/lasso will still increase bias and decrease variance even under misspecification.]{.small} +[Ridge/lasso will always increase bias and decrease variance, even under misspecification.]{.small} ## 2) Why does ridge regression shrink varinace? diff --git a/schedule/slides/12-why-smooth.qmd b/schedule/slides/12-why-smooth.qmd index 2ffef76..49b2908 100644 --- a/schedule/slides/12-why-smooth.qmd +++ b/schedule/slides/12-why-smooth.qmd @@ -47,7 +47,7 @@ How does $R_n^{(\mathrm{basis})}$ compare to $R_n^{(\mathrm{local})}$ as we chan ### Bias - Basis: bias is *fixed*\ - [Assuming $k$ is fixed]{.small} + [Assuming num. basis features is fixed]{.small} - Local: bias depends on choice of bandwidth $\sigma$. ::: @@ -92,8 +92,8 @@ _The constants_ don't matter for the intuition, but they matter for a particular ### What do you notice? ::: fragment - As $n$ increases, the optimal bandwidth $\sigma$ decreases -- As $n \to \infty$, $R_n^{(\mathrm{basis})} \to C_1^{(\mathrm{basis})} + \sigma^2$ -- As $n \to \infty$, $R_n^{(\mathrm{local})} \to \sigma^2$ +- $R_n^{(\mathrm{basis})} \overset{n \to \infty}{\longrightarrow} C_1^{(\mathrm{basis})} + \sigma^2$ +- $R_n^{(\mathrm{local})} \overset{n \to \infty}{\longrightarrow} \sigma^2$ ::: ::: @@ -110,7 +110,7 @@ _The constants_ don't matter for the intuition, but they matter for a particular ## Takeaway -1. Local methods are *consistent* (bias and variance go to 0 as $n \to \infty$) +1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \to \infty$) 2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\ [$\underbrace{O(1/n)}_{\text{basis var.}} < \underbrace{O(1/n^{4/5})}_{\text{local var.}}$]{.small} @@ -146,7 +146,7 @@ a $p$-dimensional ball of radius 1. For a test point $x$ at the center of the ball, how far away are its $k = n/10$ nearest neighbours? -[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small} +[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small} ::: ::: w-30 @@ -221,8 +221,8 @@ Distance becomes (exponentially) meaningless in high dimensions.*\ $$ R_n^{(\mathrm{OLS})} = - \underbrace{C_1^{(\mathrm{lin})}}_{\mathrm{bias}^2} + - \underbrace{\tfrac{C_2^{(\mathrm{lin})}}{n/p}}_{\mathrm{var}} + + \underbrace{C_1^{(\mathrm{OLS})}}_{\mathrm{bias}^2} + + \underbrace{\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\mathrm{var}} + \sigma^2, \qquad R_n^{(\mathrm{local})} = @@ -234,8 +234,8 @@ $$ ::: fragment ### Observations -- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{lin})} / (n/p)$ is relatively small. -- So unless $C_1^{(\mathrm{lin})}$ is big, we should use the linear model.*\ +- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{OLS})} / (n/p)$ is relatively small. +- So unless $C_1^{(\mathrm{OLS})}$ is big, we should use the linear model.*\ ::: ## In practice