
Commit

Typo fixes
gpleiss committed Oct 8, 2024
1 parent f0e700b commit c4205b4
Showing 4 changed files with 15 additions and 15 deletions.

Large diffs are not rendered by default.

@@ -1,8 +1,8 @@
{
"hash": "cf38a603cc72a3e044e41c61503362c2",
"hash": "768f01ee2bd72df68cf3ebd8a8d9a8a0",
"result": {
"engine": "knitr",
"markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming $k$ is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. \n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- As $n \\to \\infty$, $R_n^{(\\mathrm{basis})} \\to C_1^{(\\mathrm{basis})} + \\sigma^2$\n- As $n \\to \\infty$, $R_n^{(\\mathrm{local})} \\to \\sigma^2$\n:::\n\n:::\n:::\n\n<!-- . . . 
-->\n\n<!-- What if $x \\in \\R^p$ and $p>1$? -->\n\n<!-- ::: aside -->\n<!-- Note that $p$ means the dimension of $x$, not the dimension of the space of the polynomial basis or something else. That's why I put $k$ above. -->\n<!-- ::: -->\n\n\n## Takeaway\n\n1. Local methods are *consistent* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . \n\n::: flex\n::: w-60\nLet $r$ the the average distance between $x$ and its $k^\\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \\approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \\approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \\approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \\approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . 
.\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\\mathrm{OLS})} =\n \\underbrace{C_1^{(\\mathrm{lin})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{lin})}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\\mathrm{local})} + C_2^{(\\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\\mathrm{lin})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\\mathrm{lin})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensinality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions versus local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n",
"markdown": "---\nlecture: \"12 To(o) smooth or not to(o) smooth?\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 08 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Smooting vs Linear Models\n\nWe've been discussing nonlinear methods in 1-dimension:\n\n$$\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R$$\n\n1. Basis expansions, e.g.:\n\n$$\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k$$ \n\n2. Local methods, e.g.:\n\n$$\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y$$\n\nWhich should we choose? \\\n[Of course, we can do model selection. But can we analyze the risk mathematically?]{.small}\n\n\n## Risk Decomposition\n\n$$\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n$$\n\nHow does $R_n^{(\\mathrm{basis})}$ compare to $R_n^{(\\mathrm{local})}$ as we change $n$?\\\n\n::: fragment\n### Variance\n\n- Basis: variance decreases as $n$ increases\n- Local: variance decreases as $n$ increases\\\n [But at what rate?]{.small}\n\n:::\n\n::: fragment\n### Bias\n\n- Basis: bias is *fixed*\\\n [Assuming num. basis features is fixed]{.small}\n- Local: bias depends on choice of bandwidth $\\sigma$. \n\n:::\n\n\n## Risk Decomposition\n\n\n::: flex\n\n::: w-60\n### Basis\n\n$$\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(\\mathrm{basis})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{basis})}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n$$\n\n### Local\n\n*With the optimal bandwidth* ($\\propto n^{-1/5}$), we have\n\n$$\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(\\mathrm{local})}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n$$ \n:::\n\n::: w-40\n::: callout-important\n\n_you don't need to memorize these formulas_ but you should know the intuition\n\n_The constants_ don't matter for the intuition, but they matter for a particular data set. 
You have to estimate them.\n\n:::\n\n### What do you notice?\n::: fragment\n- As $n$ increases, the optimal bandwidth $\\sigma$ decreases\n- $R_n^{(\\mathrm{basis})} \\overset{n \\to \\infty}{\\longrightarrow} C_1^{(\\mathrm{basis})} + \\sigma^2$\n- $R_n^{(\\mathrm{local})} \\overset{n \\to \\infty}{\\longrightarrow} \\sigma^2$\n:::\n\n:::\n:::\n\n<!-- . . . -->\n\n<!-- What if $x \\in \\R^p$ and $p>1$? -->\n\n<!-- ::: aside -->\n<!-- Note that $p$ means the dimension of $x$, not the dimension of the space of the polynomial basis or something else. That's why I put $k$ above. -->\n<!-- ::: -->\n\n\n## Takeaway\n\n1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \\to \\infty$)\n2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\\\n [$\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}$]{.small}\n\n\n# The Curse of Dimensionality\n\nHow do local methods perform when $p > 1$?\n\n\n## Intuitively\n\n*Parametric* multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\\\n[e.g. $x^{(1)} x^{(2)}$, $\\cos( x^{(1)} + x^{(2)})$, etc.]{.small}\n\n\\\n*Nonparametric* multivariate regressors (e.g. KNN, local methods)\nautomatically handle interactions.\\\n[The distance function (e.g. $d(x,x') = \\Vert x - x' \\Vert_2$) used by kernels implicitly defines *infinitely many* interactions!]{.small}\n\n\n\\\n[This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.]{.secondary}\n\n\n\n## Mathematically\n\n::: flex\n\n::: w-70\nConsider $x_1, x_2, \\ldots, x_n$ distributed *uniformly* within\na $p$-dimensional ball of radius 1.\nFor a test point $x$ at the center of the ball,\nhow far away are its $k = n/10$ nearest neighbours?\n\n[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small}\n:::\n\n::: w-30\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](12-why-smooth_files/figure-revealjs/unnamed-chunk-1-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n:::\n\n. . . \n\n::: flex\n::: w-60\nLet $r$ the the average distance between $x$ and its $k^\\mathrm{th}$ nearest neighbour.\n\n- When $p=2$, $r = (0.1)^{1/2} \\approx 0.316$\n- When $p=10$, $r = (0.1)^{1/10} \\approx 0.794$(!)\n- When $p=100$, $r = (0.1)^{1/100} \\approx 0.977$(!!)\n- When $p=1000$, $r = (0.1)^{1/1000} \\approx 0.999$(!!!)\n:::\n\n::: w-35\n::: fragment\n### Why is this problematic?\n\n- All points are maximally far apart\n- Can't distinguish between \"similar\" and \"different\" inputs\n:::\n:::\n:::\n\n## Curse of Dimensionality\n\nDistance becomes (exponentially) meaningless in high dimensions.*\\\n[*(Unless our data has \"low dimensional structure.\")]{.small}\n\n. . 
.\n\n### Risk decomposition ($p > 1$)\n[Assuming optimal bandwidth of $n^{-1/(4+p)}$...]{.small}\n\n$$\nR_n^{(\\mathrm{OLS})} =\n \\underbrace{C_1^{(\\mathrm{OLS})}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{OLS})}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(\\mathrm{local})}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n$$\n\n::: fragment\n### Observations\n\n- $(C_1^{(\\mathrm{local})} + C_2^{(\\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\\mathrm{OLS})} / (n/p)$ is relatively small.\n- So unless $C_1^{(\\mathrm{OLS})}$ is big, we should use the linear model.*\\\n:::\n\n## In practice\n\n[The previous math assumes that our data are \"densely\" distributed throughout $\\R^p$.]{.small}\n\nHowever, if our data lie on a low-dimensional manifold within $\\R^p$, then local methods can work well!\n\n[We generally won't know the \"intrinsic dimensinality\" of our data though...]{.small}\n\n:::fragment\n### How to decide between basis expansions versus local kernel smoothers:\n1. Model selection\n2. Using a [very, very]{.secondary} questionable rule of thumb: if $p>\\log(n)$, don't do smoothing.\n:::\n\n# ☠️☠️ Danger ☠️☠️\n\nYou can't just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.\n\nYou used GCV/CV/etc. to select the tuning parameter, so we're back to the usual problem of using the data twice. You have to do [another]{.hand} CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.\n\n\n\n# Next time...\n\nCompromises if _p_ is big\n\nAdditive models and trees\n",
"supporting": [
"12-why-smooth_files"
],
4 changes: 2 additions & 2 deletions schedule/slides/11-kernel-smoothers.qmd
@@ -21,7 +21,7 @@ B. Then $E[\hat \beta_\mathrm{ols}] = E[ E[\hat \beta_\mathrm{ols} \mid \mathbf

C. $E[\hat \beta_\mathrm{ols} \mid \mathbf X] = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top E[ \mathbf y \mid \mathbf X]$

D. $E[ \mathbf y \mid \mathbf X] = \mathbf X^\top \beta$
D. $E[ \mathbf y \mid \mathbf X] = \mathbf X \beta$

E. So $E[\hat \beta_\mathrm{ols}] - \beta = E[(\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf X \beta] - \beta = \beta - \beta = 0$.
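
(Illustrative aside, not part of the commit.) Steps B–E above show why $\hat\beta_\mathrm{ols}$ is unbiased when $E[\mathbf y \mid \mathbf X] = \mathbf X \beta$, which is exactly the line this commit corrects. A minimal R sketch, assuming a small simulated design and an arbitrary true $\beta$, checks the conclusion by Monte Carlo:

```r
# Illustrative sketch only (not from the repo): Monte Carlo check that
# E[beta_hat_ols] = beta when E[y | X] = X beta holds exactly.
set.seed(406)
n <- 100
beta <- c(1, -2, 0.5)                 # arbitrary "true" coefficients (assumed)
X <- matrix(rnorm(n * length(beta)), n, length(beta))  # fixed design
beta_hat <- replicate(5000, {
  y <- X %*% beta + rnorm(n)          # correctly specified: E[y | X] = X beta
  coef(lm(y ~ X - 1))                 # OLS fit, no intercept
})
rowMeans(beta_hat) - beta             # approximately zero, up to Monte Carlo error
```

With the mean correctly specified, the averaged estimates sit on top of $\beta$ up to simulation noise.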

@@ -54,7 +54,7 @@ E[\hat \beta_\mathrm{ols}] = E[E[\hat \beta_\mathrm{ols} \mid \mathbf X ]]
$$

In statistics speak, our model is *misspecified*.\
[Ridge/lasso will still increase bias and decrease variance even under misspecification.]{.small}
[Ridge/lasso will always increase bias and decrease variance, even under misspecification.]{.small}


## 2) Why does ridge regression shrink variance?
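
(Another illustrative aside, not from the commit.) The changed line above claims that ridge/lasso still trade extra bias for lower variance even when the linear model is misspecified. A rough R sketch of that claim, with an invented nonlinear mean function and a hand-rolled ridge estimator (the intercept is penalized too, just to keep the code short):

```r
# Rough sketch only: under a misspecified mean, ridge coefficients still have
# lower variance than OLS, at the cost of extra bias relative to the best
# linear approximation on this fixed design.
set.seed(406)
n <- 50; lambda <- 10
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
f_true <- sin(2 * X[, 2]) + X[, 3]^2                    # nonlinear E[y | X]
best_lin <- solve(crossprod(X), crossprod(X, f_true))   # OLS target (best linear approx.)
fits <- replicate(2000, {
  y <- f_true + rnorm(n)
  ols   <- solve(crossprod(X), crossprod(X, y))
  ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  c(ols, ridge)
})
apply(fits, 1, var)                  # ridge rows (4-6) are smaller than OLS rows (1-3)
rowMeans(fits) - rep(best_lin, 2)    # first three are ~0 (OLS unbiased for best_lin);
                                     # last three are not (ridge bias)
```
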
18 changes: 9 additions & 9 deletions schedule/slides/12-why-smooth.qmd
@@ -47,7 +47,7 @@ How does $R_n^{(\mathrm{basis})}$ compare to $R_n^{(\mathrm{local})}$ as we chan
### Bias

- Basis: bias is *fixed*\
[Assuming $k$ is fixed]{.small}
[Assuming num. basis features is fixed]{.small}
- Local: bias depends on choice of bandwidth $\sigma$.

:::
@@ -92,8 +92,8 @@ _The constants_ don't matter for the intuition, but they matter for a particular
### What do you notice?
::: fragment
- As $n$ increases, the optimal bandwidth $\sigma$ decreases
- As $n \to \infty$, $R_n^{(\mathrm{basis})} \to C_1^{(\mathrm{basis})} + \sigma^2$
- As $n \to \infty$, $R_n^{(\mathrm{local})} \to \sigma^2$
- $R_n^{(\mathrm{basis})} \overset{n \to \infty}{\longrightarrow} C_1^{(\mathrm{basis})} + \sigma^2$
- $R_n^{(\mathrm{local})} \overset{n \to \infty}{\longrightarrow} \sigma^2$
:::

:::
@@ -110,7 +110,7 @@ _The constants_ don't matter for the intuition, but they matter for a particular

## Takeaway

1. Local methods are *consistent* (bias and variance go to 0 as $n \to \infty$)
1. Local methods are *consistent universal approximators* (bias and variance go to 0 as $n \to \infty$)
2. Fixed basis expansions are *biased* but have lower variance when $n$ is relatively small.\
[$\underbrace{O(1/n)}_{\text{basis var.}} < \underbrace{O(1/n^{4/5})}_{\text{local var.}}$]{.small}
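
(Illustrative aside, not part of the commit.) To see the takeaway above numerically — the fixed basis wins at small $n$, the local smoother wins eventually — plug made-up constants into the two risk rates from the earlier slide; only the rates $1/n$ versus $1/n^{4/5}$ and the non-vanishing bias term matter:

```r
# Illustrative only: invented constants, real rates.
n <- 10^(1:6)
basis_risk <- 0.05 + 1 / n        # bias^2 stays at 0.05; variance is O(1/n)
local_risk <- 2 / n^(4 / 5)       # bias^2 + variance both shrink, but only at n^{-4/5}
cbind(n, basis_risk, local_risk)  # basis is smaller for small n, local for large n
```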

@@ -146,7 +146,7 @@ a $p$-dimensional ball of radius 1.
For a test point $x$ at the center of the ball,
how far away are its $k = n/10$ nearest neighbours?

[(The picture on the right makes sense in 2D. It gives the wrong intuitions for higher dimensions!)]{.small}
[(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)]{.small}
:::

::: w-30
@@ -221,8 +221,8 @@ Distance becomes (exponentially) meaningless in high dimensions.*\

$$
R_n^{(\mathrm{OLS})} =
\underbrace{C_1^{(\mathrm{lin})}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{lin})}}{n/p}}_{\mathrm{var}} +
\underbrace{C_1^{(\mathrm{OLS})}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\mathrm{var}} +
\sigma^2,
\qquad
R_n^{(\mathrm{local})} =
@@ -234,8 +234,8 @@ $$
::: fragment
### Observations

- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{lin})} / (n/p)$ is relatively small.
- So unless $C_1^{(\mathrm{lin})}$ is big, we should use the linear model.*\
- $(C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}$ is relatively big, but $C_2^{(\mathrm{OLS})} / (n/p)$ is relatively small.
- So unless $C_1^{(\mathrm{OLS})}$ is big, we should use the linear model.*\
:::
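
(One more illustrative aside, not part of the commit.) The "Mathematically" slide earlier in this file asks how far the $k = n/10$ nearest neighbours of the centre of a unit ball in $\mathbb{R}^p$ are. The closed form is $(0.1)^{1/p}$, and a short R simulation, assuming points drawn uniformly in the ball, reproduces the numbers quoted on the slide:

```r
# Illustrative only: distance from the centre of a unit p-ball to the
# k = n/10 nearest neighbour, closed form vs a quick simulation.
p_vals <- c(2, 10, 100, 1000)
(0.1)^(1 / p_vals)              # ~ 0.316, 0.794, 0.977, 0.998

knn_dist <- function(p, n = 1000, k = n / 10) {
  # Uniform points in the unit ball: Gaussian direction, radius ~ U^(1/p)
  z <- matrix(rnorm(n * p), n, p)
  x <- z / sqrt(rowSums(z^2)) * runif(n)^(1 / p)
  sort(sqrt(rowSums(x^2)))[k]   # one draw of the distance to the k-th nearest neighbour
}
set.seed(406)
sapply(p_vals, knn_dist)        # close to the closed-form values above
```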

## In practice
