
Commit

Typo fixes
gpleiss committed Oct 8, 2024
1 parent f0e700b commit c4205b4
Showing 4 changed files with 15 additions and 15 deletions.

Large diffs are not rendered by default.

@@ -1,8 +1,8 @@
{
"hash": "cf38a603cc72a3e044e41c61503362c2",
"hash": "768f01ee2bd72df68cf3ebd8a8d9a8a0",
"result": {
"engine": "knitr",
"markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming $k$ is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. \n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- As $n \\to \\infty$, $R_n^{(\\mathrm{basis})} \\to C_1^{(\\mathrm{basis})} + \\sigma^2$\n- As $n \\to \\infty$, $R_n^{(\\mathrm{local})} \\to \\sigma^2$\n:::\n\n:::\n:::\n\n<!-- . . . 
-->\n\n<!-- What if $x \\in \\R^p$ and $p>1$? -->\n\n<!-- ::: aside -->\n<!-- Note that $p$ means the dimension of $x$, not the dimension of the space of the polynomial basis or something else. That's why I put $k$ above. -->\n<!-- ::: -->\n\n\n## Takeaway\n\n1. Local methods are *consistent* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . \n\n::: flex\n::: w-60\nLet $r$ the the average distance between $x$ and its $k^\\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \\approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \\approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \\approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \\approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . 
.\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\\mathrm{OLS})} =\n \\underbrace{C_1^{(\\mathrm{lin})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{lin})}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\\mathrm{local})} + C_2^{(\\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\\mathrm{lin})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\\mathrm{lin})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensinality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions versus local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n",
"markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming num. basis features is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. \n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. 
You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- $R_n^{(\\mathrm{basis})} \\overset{n \\to \\infty}{\\longrightarrow} C_1^{(\\mathrm{basis})} + \\sigma^2$\n- $R_n^{(\\mathrm{local})} \\overset{n \\to \\infty}{\\longrightarrow} \\sigma^2$\n:::\n\n:::\n:::\n\n<!-- . . . -->\n\n<!-- What if $x \\in \\R^p$ and $p>1$? -->\n\n<!-- ::: aside -->\n<!-- Note that $p$ means the dimension of $x$, not the dimension of the space of the polynomial basis or something else. That's why I put $k$ above. -->\n<!-- ::: -->\n\n\n## Takeaway\n\n1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . \n\n::: flex\n::: w-60\nLet $r$ the the average distance between $x$ and its $k^\\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \\approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \\approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \\approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \\approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . 
.\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\\mathrm{OLS})} =\n \\underbrace{C_1^{(\\mathrm{OLS})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{OLS})}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\\mathrm{local})} + C_2^{(\\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\\mathrm{OLS})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\\mathrm{OLS})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensinality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions versus local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n",
"supporting": [
"12-why-smooth_files"
],
4 changes: 2 additions & 2 deletions schedule/slides/11-kernel-smoothers.qmd
@@ -21,7 +21,7 @@ B. Then $E[\hat \beta_\mathrm{ols}] = E[ E[\hat \beta_\mathrm{ols} \mid \mathbf

C. $E[\hat \beta_\mathrm{ols} \mid \mathbf X] = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top E[ \mathbf y \mid \mathbf X]$

D. $E[ \mathbf y \mid \mathbf X] = \mathbf X^\top \beta$
D. $E[ \mathbf y \mid \mathbf X] = \mathbf X \beta$

E. So $E[\hat \beta_\mathrm{ols}] - \beta = E[(\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf X \beta] - \beta = \beta - \beta = 0$.
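
(Illustrative aside, not part of the commit.) Steps B–E above show why $\hat\beta_\mathrm{ols}$ is unbiased when $E[\mathbf y \mid \mathbf X] = \mathbf X \beta$, which is exactly the line this commit corrects. A minimal R sketch, assuming a small simulated design and an arbitrary true $\beta$, checks the conclusion by Monte Carlo:

```r
# Illustrative sketch only (not from the repo): Monte Carlo check that
# E[beta_hat_ols] = beta when E[y | X] = X beta holds exactly.
set.seed(406)
n <- 100
beta <- c(1, -2, 0.5)                 # arbitrary "true" coefficients (assumed)
X <- matrix(rnorm(n * length(beta)), n, length(beta))  # fixed design
beta_hat <- replicate(5000, {
  y <- X %*% beta + rnorm(n)          # correctly specified: E[y | X] = X beta
  coef(lm(y ~ X - 1))                 # OLS fit, no intercept
})
rowMeans(beta_hat) - beta             # approximately zero, up to Monte Carlo error
```

With the mean correctly specified, the averaged estimates sit on top of $\beta$ up to simulation noise.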

@@ -54,7 +54,7 @@ E[\hat \beta_\mathrm{ols}] = E[E[\hat \beta_\mathrm{ols} \mid \mathbf X ]]
$$

In statistics speak, our model is *misspecified*.\
[Ridge/lasso will still increase bias and decrease variance even under misspecification.]{.small}
[Ridge/lasso will always increase bias and decrease variance, even under misspecification.]{.small}


## 2) Why does ridge regression shrink variance?
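
(Another illustrative aside, not from the commit.) The changed line above claims that ridge/lasso still trade extra bias for lower variance even when the linear model is misspecified. A rough R sketch of that claim, with an invented nonlinear mean function and a hand-rolled ridge estimator (the intercept is penalized too, just to keep the code short):

```r
# Rough sketch only: under a misspecified mean, ridge coefficients still have
# lower variance than OLS, at the cost of extra bias relative to the best
# linear approximation on this fixed design.
set.seed(406)
n <- 50; lambda <- 10
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
f_true <- sin(2 * X[, 2]) + X[, 3]^2                    # nonlinear E[y | X]
best_lin <- solve(crossprod(X), crossprod(X, f_true))   # OLS target (best linear approx.)
fits <- replicate(2000, {
  y <- f_true + rnorm(n)
  ols   <- solve(crossprod(X), crossprod(X, y))
  ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  c(ols, ridge)
})
apply(fits, 1, var)                  # ridge rows (4-6) are smaller than OLS rows (1-3)
rowMeans(fits) - rep(best_lin, 2)    # first three are ~0 (OLS unbiased for best_lin);
                                     # last three are not (ridge bias)
```
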
18 changes: 9 additions & 9 deletions schedule/slides/12-why-smooth.qmd
@@ -47,7 +47,7 @@ How does $R_n^{(\mathrm{basis})}$ compare to $R_n^{(\mathrm{local})}$ as we chan
### Bias

- Basis: bias is *fixed*\
[Assuming $k$ is fixed]{.small}
[Assuming num. basis features is fixed]{.small}
- Local: bias depends on choice of bandwidth $\sigma$.

:::
@@ -92,8 +92,8 @@ _The constants_ don't matter for the intuition, but they matter for a particular
### What do you notice?
::: fragment
- As $n$ increases, the optimal bandwidth $\sigma$ decreases
- As $n \to \infty$, $R_n^{(\mathrm{basis})} \to C_1^{(\mathrm{basis})} + \sigma^2$
- As $n \to \infty$, $R_n^{(\mathrm{local})} \to \sigma^2$
- $R_n^{(\mathrm{basis})} \overset{n \to \infty}{\longrightarrow} C_1^{(\mathrm{basis})} + \sigma^2$
- $R_n^{(\mathrm{local})} \overset{n \to \infty}{\longrightarrow} \sigma^2$
:::

:::
@@ -110,7 +110,7 @@ _The constants_ don't matter for the intuition, but they matter for a particular

## Takeaway

1. Local methods are *consistent* (bias and variance go to 0 as $n \to \infty$)
1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \to \infty$)
2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\
[$\underbrace{O(1/n)}_{\text{basis var.}} < \underbrace{O(1/n^{4/5})}_{\text{local var.}}$]{.small}
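
(Illustrative aside, not part of the commit.) To see the takeaway above numerically — the fixed basis wins at small $n$, the local smoother wins eventually — plug made-up constants into the two risk rates from the earlier slide; only the rates $1/n$ versus $1/n^{4/5}$ and the non-vanishing bias term matter:

```r
# Illustrative only: invented constants, real rates.
n <- 10^(1:6)
basis_risk <- 0.05 + 1 / n        # bias^2 stays at 0.05; variance is O(1/n)
local_risk <- 2 / n^(4 / 5)       # bias^2 + variance both shrink, but only at n^{-4/5}
cbind(n, basis_risk, local_risk)  # basis is smaller for small n, local for large n
```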

@@ -146,7 +146,7 @@ a $p$-dimensional ball of radius 1.
For a test point $x$ at the center of the ball,
how far away are its $k = n/10$ nearest neighbours?

[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small}
[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small}
:::

::: w-30
@@ -221,8 +221,8 @@ Distance becomes (exponentially) meaningless in high dimensions.*\

$$
R_n^{(\mathrm{OLS})} =
\underbrace{C_1^{(\mathrm{lin})}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{lin})}}{n/p}}_{\mathrm{var}} +
\underbrace{C_1^{(\mathrm{OLS})}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\mathrm{var}} +
\sigma^2,
\qquad
R_n^{(\mathrm{local})} =
@@ -234,8 +234,8 @@ $$
::: fragment
### Observations

- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{lin})} / (n/p)$ is relatively small.
- So unless $C_1^{(\mathrm{lin})}$ is big, we should use the linear model.*\
- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{OLS})} / (n/p)$ is relatively small.
- So unless $C_1^{(\mathrm{OLS})}$ is big, we should use the linear model.*\
:::
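
(One more illustrative aside, not part of the commit.) The "Mathematically" slide earlier in this file asks how far the $k = n/10$ nearest neighbours of the centre of a unit ball in $\mathbb{R}^p$ are. The closed form is $(0.1)^{1/p}$, and a short R simulation, assuming points drawn uniformly in the ball, reproduces the numbers quoted on the slide:

```r
# Illustrative only: distance from the centre of a unit p-ball to the
# k = n/10 nearest neighbour, closed form vs a quick simulation.
p_vals <- c(2, 10, 100, 1000)
(0.1)^(1 / p_vals)              # ~ 0.316, 0.794, 0.977, 0.998

knn_dist <- function(p, n = 1000, k = n / 10) {
  # Uniform points in the unit ball: Gaussian direction, radius ~ U^(1/p)
  z <- matrix(rnorm(n * p), n, p)
  x <- z / sqrt(rowSums(z^2)) * runif(n)^(1 / p)
  sort(sqrt(rowSums(x^2)))[k]   # one draw of the distance to the k-th nearest neighbour
}
set.seed(406)
sapply(p_vals, knn_dist)        # close to the closed-form values above
```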

## In practice
