From f8fcf28d91f32bfc9a5d294a3b4538fae3ba21ad Mon Sep 17 00:00:00 2001
From: Quarto GHA Workflow Runner
Date: Mon, 7 Oct 2024 20:07:36 +0000
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll                                |   2 +-
 schedule/slides/11-kernel-smoothers.html |   2 +-
 schedule/slides/12-why-smooth.html       | 285 ++++++++++++++++-------
 search.json                              |  76 +++---
 sitemap.xml                              |  96 ++++----
 5 files changed, 284 insertions(+), 177 deletions(-)

diff --git a/.nojekyll b/.nojekyll
index 87e5c5d..44832bc 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-55738651
\ No newline at end of file
+88cc06e1
\ No newline at end of file

diff --git a/schedule/slides/11-kernel-smoothers.html b/schedule/slides/11-kernel-smoothers.html
index 6d2d438..0b219dc 100644
--- a/schedule/slides/11-kernel-smoothers.html
+++ b/schedule/slides/11-kernel-smoothers.html
@@ -487,7 +487,7 @@

11 Local methods

Last time…

We looked at feature maps as a way to do nonlinear regression.

We used new “features” \(\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)\)

-

Now we examine an alternative

+

Now we examine a nonparametric alternative

Suppose I just look at the “neighbours” of some point (based on the \(x\)-values)

I just average the \(y\)’s at those locations together
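A minimal sketch of that idea in Python (my own illustration, not the course code; the function name, data, and choice of \(k\) are made up for the example): predict at a point by averaging the \(y\)'s of the \(k\) nearest neighbours.

```python
import numpy as np

def knn_smooth(x0, x, y, k=3):
    """Average the y values of the k training points whose x is closest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest x's
    return y[nearest].mean()

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, size=50)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
print(knn_smooth(1.0, x, y, k=3))   # roughly sin(1.0) ≈ 0.84, up to noise
```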

diff --git a/schedule/slides/12-why-smooth.html b/schedule/slides/12-why-smooth.html
index 596a3f2..dce7bc3 100644
--- a/schedule/slides/12-why-smooth.html
+++ b/schedule/slides/12-why-smooth.html
@@ -333,7 +333,7 @@

12 To(o) smooth or not to(o) smooth?

Stat 406

Geoff Pleiss, Trevor Campbell

-

Last modified – 09 October 2023

+

Last modified – 07 October 2024

\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
@@ -359,38 +359,73 @@

12 To(o) smooth or not to(o) smooth?

\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
+\newcommand{\U}{\mathbf{U}}
+\newcommand{\D}{\mathbf{D}}
+\newcommand{\V}{\mathbf{V}}
\]

-
-

Last time…

-

We’ve been discussing smoothing methods in 1-dimension:

+
+

Smoothing vs Linear Models

+

We’ve been discussing nonlinear methods in 1-dimension:

\[\Expect{Y\given X=x} = f(x),\quad x\in\R\]

-

We looked at basis expansions, e.g.:

-

\[f(x) \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

-

We looked at local methods, e.g.:

-

\[f(x_i) \approx s_i^\top \y\]

+
1. Basis expansions, e.g.:

\[\hat f_\mathrm{basis}(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

2. Local methods, e.g.:

\[\hat f_\mathrm{local}(x_i) = s_i^\top \y\]
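To make the two estimators concrete, here is a small Python sketch (my own illustration; the degree, bandwidth, and data are arbitrary choices, not taken from the slides). The basis fit is an ordinary polynomial least-squares fit; the local fit is a linear smoother, so each fitted value is a weighted average \(s_i^\top \y\) of the responses.

```python
import numpy as np

rng = np.random.default_rng(406)
n = 100
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Basis expansion: degree-4 polynomial fit by least squares
beta = np.polyfit(x, y, deg=4)
f_basis = np.polyval(beta, x)

# Local method: Gaussian kernel smoother.  Row i of S holds the weights s_i,
# so the vector of fitted values is S @ y, i.e. f_local[i] = s_i^T y.
h = 0.05
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
S = K / K.sum(axis=1, keepdims=True)
f_local = S @ y

print(np.mean((f_basis - np.sin(2 * np.pi * x)) ** 2),
      np.mean((f_local - np.sin(2 * np.pi * x)) ** 2))
```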

+

Which should we choose?
+Of course, we can do model selection. But can we analyze the risk mathematically?

+
+
+

Risk Decomposition

+

\[
R_n = \mathrm{Bias}^2 + \mathrm{Var} + \sigma^2
\]

+

How does \(R_n^{(\mathrm{basis})}\) compare to \(R_n^{(\mathrm{local})}\) as we change \(n\)?
+

-

What if \(x \in \R^p\) and \(p>1\)?


Variance

+
    +
• Basis: variance decreases as \(n\) increases
• Local: variance decreases as \(n\) increases
  (but at what rate?)
-
-
-

Kernels and interactions

-

In multivariate nonparametric regression, you estimate a surface over the input variables.

-

This is trying to find \(\widehat{f}(x_1,\ldots,x_p)\).

-

Therefore, this function by construction includes interactions, handles categorical data, etc. etc.

-

This is in contrast with explicit linear models which need you to specify these things.

-

This extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.

-

More complicated functions (smooth Kernel regressions vs. linear models) tend to have lower bias but higher variance.


Bias

+
    +
• Basis: bias is fixed
  (assuming \(k\) is fixed)
• Local: bias depends on the choice of bandwidth \(h\).
-
-

Issue 1

-

For \(p=1\), one can show that for kernels (with the correct bandwidth)

-

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

+
+

Risk Decomposition

+
+
+

Basis

+

\[
R_n^{(\mathrm{basis})} =
  \underbrace{C_1^{(b)}}_{\mathrm{bias}^2} +
  \underbrace{\frac{C_2^{(b)}}{n}}_{\mathrm{var}} +
  \sigma^2
\]

+

Local

+

With the optimal bandwidth (\(\propto n^{-1/5}\)), we have

+

\[
R_n^{(\mathrm{local})} =
  \underbrace{\frac{C_1^{(l)}}{n^{4/5}}}_{\mathrm{bias}^2} +
  \underbrace{\frac{C_2^{(l)}}{n^{4/5}}}_{\mathrm{var}} +
  \sigma^2
\]
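Where the \(n^{-1/5}\) bandwidth and the \(n^{-4/5}\) rate come from (a short derivation using the kernel-smoother decomposition \(C_1 h^4 + C_2/(nh) + \sigma^2\) that appears further down): minimize the bias/variance part over \(h\),

\[
\frac{d}{dh}\left( C_1 h^4 + \frac{C_2}{nh} \right)
  = 4 C_1 h^3 - \frac{C_2}{n h^2} = 0
\;\Longrightarrow\;
h^5 = \frac{C_2}{4 C_1 n}
\;\Longrightarrow\;
h \propto n^{-1/5},
\]

and plugging this \(h\) back in makes both terms shrink at the same rate:

\[
C_1 h^4 \propto n^{-4/5},
\qquad
\frac{C_2}{n h} \propto \frac{1}{n \cdot n^{-1/5}} = n^{-4/5}.
\]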

+
+
@@ -401,86 +436,158 @@

Issue 1

you don’t need to memorize these formulas but you should know the intuition

-

the constants don’t matter for the intuition, but they matter for a particular data set. We don’t know them. So you estimate this.

+

The constants don’t matter for the intuition, but they matter for a particular data set. You have to estimate them.

-
-
-

Issue 1

-

For \(p=1\), one can show that for kernels (with the correct bandwidth)

-

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

-

Recall, this decomposition is squared bias + variance + irreducible error

-
    -
  • It depends on the choice of \(h\)
  • -
-

\[\textrm{MSE}(\hat{f}) = C_1 h^4 + \frac{C_2}{nh} + \sigma^2\]

+

What do you notice?

+
    -
• Using \(h = cn^{-1/5}\) balances squared bias and variance, which leads to the above rate. (That balance minimizes the MSE.)
• As \(n\) increases, the optimal bandwidth \(h\) decreases.
• As \(n \to \infty\), \(R_n^{(\mathrm{basis})} \to C_1^{(b)} + \sigma^2\).
• As \(n \to \infty\), \(R_n^{(\mathrm{local})} \to \sigma^2\) (see the simulation sketch below).
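A small simulation sketch of these limits (my own Python illustration; the sine truth, the noise level, the fixed cubic basis, and the use of a \(k\)-nearest-neighbour average with an oracle-chosen \(k\) as a stand-in for a well-tuned smoother are all assumptions of the example, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(406)
sigma = 0.3
x_test = np.linspace(0, 1, 400)

def f(x):
    """The nonlinear 'truth' used in this toy example."""
    return np.sin(2 * np.pi * x)

def knn_fit(x, y, xgrid, k):
    """k-nearest-neighbour average of y, evaluated at each point of xgrid."""
    idx = np.argsort(np.abs(xgrid[:, None] - x[None, :]), axis=1)[:, :k]
    return y[idx].mean(axis=1)

for n in [50, 200, 800, 3200]:
    basis_err, local_err = [], []
    for _ in range(20):                           # average over replicates
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        # fixed cubic basis: its bias does not vanish as n grows
        fb = np.polyval(np.polyfit(x, y, deg=3), x_test)
        basis_err.append(np.mean((fb - f(x_test)) ** 2))
        # local method: best k on a small grid (an oracle stand-in for CV)
        local_err.append(min(
            np.mean((knn_fit(x, y, x_test, k) - f(x_test)) ** 2)
            for k in (5, 10, 20, 40, 80) if k < n
        ))
    print(f"n={n:5d}   basis risk ≈ {np.mean(basis_err):.4f}   "
          f"local risk ≈ {np.mean(local_err):.4f}")
# The basis risk flattens out near its squared bias; the local risk keeps falling.
```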
+
-
-

Issue 1

-

For \(p=1\), one can show that for kernels (with the correct bandwidth)

-

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

-

Intuition:

-

as you collect data, use a smaller bandwidth and the MSE (on future data) decreases

-
-
-

Issue 1

-

For \(p=1\), one can show that for kernels (with the correct bandwidth)

-

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

-

How does this compare to just using a linear model?

-

Bias

-
    -
1. The bias of using a linear model when the truth is nonlinear is a number \(b > 0\) which doesn’t depend on \(n\).
2. The bias of using kernel regression is \(C_1/n^{4/5}\). This goes to 0 as \(n\rightarrow\infty\).
-

Variance

+
+

1. The variance of using a linear model is \(C/n\) no matter what.
2. The variance of using kernel regression is \(C_2/n^{4/5}\).

Takeaway

1. Local methods are consistent (bias and variance go to 0 as \(n \to \infty\)).
2. Fixed basis expansions are biased, but have lower variance when \(n\) is relatively small:
   \(\underbrace{O(1/n)}_{\text{basis var.}} < \underbrace{O(1/n^{4/5})}_{\text{local var.}}\)
-
-

Issue 1

-

For \(p=1\), one can show that for kernels (with the correct bandwidth)

-

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

-

To conclude:

+
+
+

The Curse of Dimensionality

+

How do local methods perform when \(p > 1\)?

+
+
+

Intuitively

+

Parametric multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms
+e.g. \(x^{(1)} x^{(2)}\), \(\cos( x^{(1)} + x^{(2)})\), etc.

+


+Nonparametric multivariate regressors (e.g. KNN, local methods) automatically handle interactions.
+The distance function (e.g. \(d(x,x') = \Vert x - x' \Vert_2\)) used by kernels implicitly defines infinitely many interactions!
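A toy Python check of that claim (my own example; the target \(y = x^{(1)} x^{(2)}\), the bandwidth, and the sample sizes are made up): an additive basis with no interaction term specified misses the structure, while a kernel smoother that only sees distances recovers it.

```python
import numpy as np

rng = np.random.default_rng(406)
X = rng.uniform(-1, 1, size=(400, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=400)   # pure interaction

Xt = rng.uniform(-1, 1, size=(200, 2))
yt = Xt[:, 0] * Xt[:, 1]

# Additive basis (1, x1, x2): no interaction term was specified, so it fails.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred_basis = np.column_stack([np.ones(len(Xt)), Xt]) @ coef

# Kernel smoother: the weights depend only on the distance ||x - x'||,
# yet the fit picks up the interaction automatically.
h = 0.25
d2 = ((Xt[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / (2 * h**2))
pred_local = (W / W.sum(axis=1, keepdims=True)) @ y

print("additive basis MSE:", round(np.mean((pred_basis - yt) ** 2), 4))   # ~0.11
print("kernel smoother MSE:", round(np.mean((pred_local - yt) ** 2), 4))  # much smaller
```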

+


+This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.

+
+
+

Mathematically

+

Let’s say \(x_1, \ldots, x_n\) are distributed uniformly over the space \(\mathcal B_1(p)\)
+\(\mathcal B_1(p)\) is the “unit ball,” or the set of all \(x\) such that \(\Vert x \Vert_2 \leq 1\).

+
+


+What is the maximum distance between any two points in \(\mathcal B_1(p)\)?

+
+
+

\(\Vert x - x' \Vert_2 \leq \Vert x \Vert_2 + \Vert x' \Vert_2 \leq 1 + 1 = 2.\)

+
+
+


+What about the average distance?

+
+
+
+

The average (sq.) distance between points in \(\mathcal B_1(p)\)

+

\[
\begin{align}
E\left[ \Vert x - x' \Vert_2^2 \right]
&= E\left[ \textstyle \sum_{k=1}^p (x_k - x_k')^2 \right]
\\
&= \textstyle{
  E[ \sum_{k=1}^p x_k^2 ]
  - 2 \sum_{k=1}^p \underbrace{E[ x_k x'_k ]}_{=0}
  + E[ \sum_{k=1}^p x_k^{\prime 2} ]
}
\\
&= 2 E[ \textstyle{\sum_{k=1}^p} x_k^2 ]
 = 2 E[ \Vert x \Vert_2^2 ]
\end{align}
\]

+
+

\(2 E[ \Vert x \Vert_2^2 ] = 2^{1 - 1/p}.\)

+
+
+
    -
• When \(p=2\), \(\frac{\text{avg dist}}{\text{max dist}} = 0.707\)
• When \(p=5\), \(\frac{\text{avg dist}}{\text{max dist}} = 0.871\)!
• When \(p=10\), \(\frac{\text{avg dist}}{\text{max dist}} = 0.933\)!!
• When \(p=100\), \(\frac{\text{avg dist}}{\text{max dist}} = 0.993\)!!!

• bias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).
• but variance of lines goes to zero faster than for kernels.
-

If the linear model is right, you win.

-

But if it’s wrong, you (eventually) lose as \(n\) grows.

-

How do you know if you have enough data?

-

Compare the CV risk estimate of the kernel version (with CV-selected tuning parameter) with the estimate of the risk for the linear model.

-
-
-
-

☠️☠️ Danger ☠️☠️

-

You can’t just compare the CVM for the kernel version to the CVM for the LM. This is because you used CVM to select the tuning parameter, so we’re back to the usual problem of using the data twice. You have to do another CV to estimate the risk of the kernel version at the CV-selected tuning parameter.

+ + +
+
+ +

Why is this problematic?

+
    +
• All points are nearly maximally far apart from all other points
• Can’t distinguish between “similar” and “different” inputs (a quick check is sketched below)
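A quick Monte Carlo sketch of this concentration effect (my own Python illustration; it does not reproduce the exact ratios quoted above, it just shows the relative spread of pairwise distances collapsing as \(p\) grows, so "near" and "far" points look increasingly alike):

```python
import numpy as np

rng = np.random.default_rng(406)

def sample_unit_ball(n, p):
    """n points drawn uniformly from the unit ball B_1(p)."""
    z = rng.standard_normal((n, p))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform direction
    r = rng.uniform(size=(n, 1)) ** (1.0 / p)       # radius for a uniform ball
    return r * z

for p in [2, 5, 10, 100]:
    x = sample_unit_ball(300, p)
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T    # squared pairwise distances
    d = np.sqrt(np.clip(d2, 0, None))
    d = d[np.triu_indices_from(d, k=1)]
    print(f"p={p:3d}  mean dist={d.mean():.2f}  relative spread={d.std() / d.mean():.2f}")
```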
  • +
+
+
+ +
-
-

Issue 2

-

For \(p>1\), there is more trouble.

-

First, let’s look again at \[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

-

That is for \(p=1\). It’s not that much slower than \(C/n\), the variance for linear models.

-

If \(p>1\) similar calculations show,

-

\[\textrm{MSE}(\hat f) = \frac{C_1+C_2}{n^{4/(4+p)}} + \sigma^2 \hspace{2em} \textrm{MSE}(\hat \beta) = b + \frac{Cp}{n} + \sigma^2 .\]

+
+

Curse of Dimensionality

+

Distance becomes (exponentially) meaningless in high dimensions.*
+*(Unless our data has “low dimensional structure.”)

+
+

Risk decomposition (\(p > 1\))

+

Assuming optimal bandwidth of \(n^{-1/(4+p)}\)

+

\[
R_n^{(\mathrm{basis})} =
  \underbrace{C_1^{(b)}}_{\mathrm{bias}^2} +
  \underbrace{\tfrac{C_2^{(b)}}{n/p}}_{\mathrm{var}} +
  \sigma^2,
\qquad
R_n^{(\mathrm{local})} =
  \underbrace{\tfrac{C_1^{(l)}}{n^{4/(4+p)}}}_{\mathrm{bias}^2} +
  \underbrace{\tfrac{C_2^{(l)}}{n^{4/(4+p)}}}_{\mathrm{var}} +
  \sigma^2.
\]

+
+ +

Observations

+
    +
• \((C_1^{(l)} + C_2^{(l)}) / n^{4/(4+p)}\) is relatively big, but \(C_2^{(b)} / (n/p)\) is relatively small.
• So unless \(C_1^{(b)}\) is big, we should use the linear model.* (See the sketch below.)
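A rough back-of-the-envelope sketch of that gap (my own Python arithmetic, not from the slides): how large must \(n\) be for the local term \(n^{-4/(4+p)}\) to match its value at \(p=1\), \(n=1000\)?

```python
import numpy as np

target = 1000 ** (-4 / 5)                 # the p = 1 value, about 0.004

for p in [1, 2, 5, 10, 20]:
    n_needed = target ** (-(4 + p) / 4)   # solve n^{-4/(4+p)} = target for n
    print(f"p = {p:2d}:  n ≈ {n_needed:,.0f}")
# p = 1 needs 1,000 points; p = 10 needs ~250 million; p = 20 needs ~2.5e14.
```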
    +
  • +
+
+
-
-

Issue 2

-

\[\textrm{MSE}(\hat f) = \frac{C_1+C_2}{n^{4/(4+p)}} + \sigma^2 \hspace{2em} \textrm{MSE}(\hat \beta) = b + \frac{Cp}{n} + \sigma^2 .\]

-

What if \(p\) is big (and \(n\) is really big)?

+
+

In practice

+

The previous math assumes that our data are “densely” distributed throughout \(\R^p\).

+

However, if our data lie on a low-dimensional manifold within \(\R^p\), then local methods can work well!

+

We generally won’t know the “intrinsic dimensionality” of our data though…

+
+ +

How to decide between basis expansions versus local kernel smoothers:

    -
1. Model selection
2. Using a very, very questionable rule of thumb: if \(p>\log(n)\), don’t do smoothing.

1. Then \((C_1 + C_2) / n^{4/(4+p)}\) is still big.
2. But \(Cp / n\) is small.
3. So unless \(b\) is big, we should use the linear model.
-

How do you tell? Do model selection to decide.

-

A very, very questionable rule of thumb: if \(p>\log(n)\), don’t do smoothing.

+
+
+

☠️☠️ Danger ☠️☠️

+

You can’t just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.

+

You used GCV/CV/etc. to select the tuning parameter, so we’re back to the usual problem of using the data twice. You have to do another CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.
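A sketch of the honest comparison in Python (my own illustration using scikit-learn, with kernel ridge regression standing in for the kernel smoother; the data, parameter grids, and fold counts are arbitrary): the inner CV chooses the bandwidth, and the outer CV scores the whole select-then-fit procedure, so the risk estimate is not contaminated by the selection step.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(406)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=300)

# Inner CV: selects the kernel's bandwidth (gamma) and ridge penalty.
kernel_cv = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"gamma": np.logspace(-2, 2, 10), "alpha": [1e-3, 1e-2, 1e-1]},
    scoring="neg_mean_squared_error",
    cv=5,
)

# Outer CV: estimates the risk of each full procedure on held-out folds.
risk_kernel = -cross_val_score(kernel_cv, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
risk_linear = -cross_val_score(LinearRegression(), X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()

print(f"kernel (nested CV risk estimate): {risk_kernel:.3f}")
print(f"linear (CV risk estimate):        {risk_linear:.3f}")
```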

+
+

Next time…

Compromises if p is big

diff --git a/search.json b/search.json index 8d9d4fc..23dc0cd 100644 --- a/search.json +++ b/search.json @@ -354,7 +354,7 @@ "href": "schedule/slides/11-kernel-smoothers.html#last-time", "title": "UBC Stat406 2024W", "section": "Last time…", - "text": "Last time…\nWe looked at feature maps as a way to do nonlinear regression.\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNow we examine an alternative\nSuppose I just look at the “neighbours” of some point (based on the \\(x\\)-values)\nI just average the \\(y\\)’s at those locations together" + "text": "Last time…\nWe looked at feature maps as a way to do nonlinear regression.\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNow we examine a nonparametric alternative\nSuppose I just look at the “neighbours” of some point (based on the \\(x\\)-values)\nI just average the \\(y\\)’s at those locations together" }, { "objectID": "schedule/slides/11-kernel-smoothers.html#lets-use-3-neighbours", @@ -2230,70 +2230,70 @@ "href": "schedule/slides/12-why-smooth.html#section", "title": "UBC Stat406 2024W", "section": "12 To(o) smooth or not to(o) smooth?", - "text": "12 To(o) smooth or not to(o) smooth?\nStat 406\nGeoff Pleiss, Trevor Campbell\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]" + "text": "12 To(o) smooth or not to(o) smooth?\nStat 406\nGeoff Pleiss, Trevor Campbell\nLast modified – 07 October 2024\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 
\\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]" }, { - "objectID": "schedule/slides/12-why-smooth.html#last-time", - "href": "schedule/slides/12-why-smooth.html#last-time", + "objectID": "schedule/slides/12-why-smooth.html#smooting-vs-linear-models", + "href": "schedule/slides/12-why-smooth.html#smooting-vs-linear-models", "title": "UBC Stat406 2024W", - "section": "Last time…", - "text": "Last time…\nWe’ve been discussing smoothing methods in 1-dimension:\n\\[\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R\\]\nWe looked at basis expansions, e.g.:\n\\[f(x) \\approx \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k\\]\nWe looked at local methods, e.g.:\n\\[f(x_i) \\approx s_i^\\top \\y\\]\n\nWhat if \\(x \\in \\R^p\\) and \\(p>1\\)?\n\n\n\nNote that \\(p\\) means the dimension of \\(x\\), not the dimension of the space of the polynomial basis or something else. That’s why I put \\(k\\) above." + "section": "Smooting vs Linear Models", + "text": "Smooting vs Linear Models\nWe’ve been discussing nonlinear methods in 1-dimension:\n\\[\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R\\]\n\nBasis expansions, e.g.:\n\n\\[\\hat f_\\mathrm{basis}(x) = \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k\\]\n\nLocal methods, e.g.:\n\n\\[\\hat f_\\mathrm{local}(x_i) = s_i^\\top \\y\\]\nWhich should we choose?\nOf course, we can do model selection. But can we analyze the risk mathematically?" }, { - "objectID": "schedule/slides/12-why-smooth.html#kernels-and-interactions", - "href": "schedule/slides/12-why-smooth.html#kernels-and-interactions", + "objectID": "schedule/slides/12-why-smooth.html#risk-decomposition", + "href": "schedule/slides/12-why-smooth.html#risk-decomposition", "title": "UBC Stat406 2024W", - "section": "Kernels and interactions", - "text": "Kernels and interactions\nIn multivariate nonparametric regression, you estimate a surface over the input variables.\nThis is trying to find \\(\\widehat{f}(x_1,\\ldots,x_p)\\).\nTherefore, this function by construction includes interactions, handles categorical data, etc. etc.\nThis is in contrast with explicit linear models which need you to specify these things.\nThis extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.\n\nMore complicated functions (smooth Kernel regressions vs. linear models) tend to have lower bias but higher variance." + "section": "Risk Decomposition", + "text": "Risk Decomposition\n\\[\nR_n = \\mathrm{Bias}^2 + \\mathrm{Var} + \\sigma^2\n\\]\nHow does \\(R_n^{(\\mathrm{basis})}\\) compare to \\(R_n^{(\\mathrm{local})}\\) as we change \\(n\\)?\n\n\n\nVariance\n\nBasis: variance decreases as \\(n\\) increases\nLocal: variance decreases as \\(n\\) increases\nBut at what rate?\n\n\n\n\nBias\n\nBasis: bias is fixed\nAssuming \\(k\\) is fixed\nLocal: bias depends on choice of bandwidth \\(\\sigma\\)." 
}, { - "objectID": "schedule/slides/12-why-smooth.html#issue-1", - "href": "schedule/slides/12-why-smooth.html#issue-1", + "objectID": "schedule/slides/12-why-smooth.html#risk-decomposition-1", + "href": "schedule/slides/12-why-smooth.html#risk-decomposition-1", "title": "UBC Stat406 2024W", - "section": "Issue 1", - "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\n\n\n\n\n\n\nImportant\n\n\nyou don’t need to memorize these formulas but you should know the intuition\nthe constants don’t matter for the intuition, but they matter for a particular data set. We don’t know them. So you estimate this." + "section": "Risk Decomposition", + "text": "Risk Decomposition\n\n\nBasis\n\\[\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(b)}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(b)}}{n}}_{\\mathrm{var}} +\n \\sigma^2\n\\]\nLocal\nWith the optimal bandwidth (\\(\\propto n^{-1/5}\\)), we have\n\\[\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\frac{C_1^{(l)}}{n^{4/5}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\frac{C_2^{(l)}}{n^{4/5}}}_{\\mathrm{var}} +\n \\sigma^2\n\\]\n\n\n\n\n\n\n\n\nImportant\n\n\nyou don’t need to memorize these formulas but you should know the intuition\nThe constants don’t matter for the intuition, but they matter for a particular data set. You have to estimate them.\n\n\n\nWhat do you notice?\n\n\nAs \\(n\\) increases, the optimal bandwidth \\(\\sigma\\) decreases\nAs \\(n \\to \\infty\\), \\(R_n^{(\\mathrm{basis})} \\to C_1^{(b)} + \\sigma^2\\)\nAs \\(n \\to \\infty\\), \\(R_n^{(\\mathrm{local})} \\to \\sigma^2\\)" }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-1-1", - "href": "schedule/slides/12-why-smooth.html#issue-1-1", + "objectID": "schedule/slides/12-why-smooth.html#takeaway", + "href": "schedule/slides/12-why-smooth.html#takeaway", "title": "UBC Stat406 2024W", - "section": "Issue 1", - "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nRecall, this decomposition is squared bias + variance + irreducible error\n\nIt depends on the choice of \\(h\\)\n\n\\[\\textrm{MSE}(\\hat{f}) = C_1 h^4 + \\frac{C_2}{nh} + \\sigma^2\\]\n\nUsing \\(h = cn^{-1/5}\\) balances squared bias and variance, leads to the above rate. (That balance minimizes the MSE)" + "section": "Takeaway", + "text": "Takeaway\n\nLocal methods are consistent (bias and variance go to 0 as \\(n \\to \\infty\\))\nFixed basis expansions are biased but have lower variance when \\(n\\) is relatively small.\n\\(\\underbrace{O(1/n)}_{\\text{basis var.}} < \\underbrace{O(1/n^{4/5})}_{\\text{local var.}}\\)" }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-1-2", - "href": "schedule/slides/12-why-smooth.html#issue-1-2", + "objectID": "schedule/slides/12-why-smooth.html#intuitively", + "href": "schedule/slides/12-why-smooth.html#intuitively", "title": "UBC Stat406 2024W", - "section": "Issue 1", - "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nIntuition:\nas you collect data, use a smaller bandwidth and the MSE (on future data) decreases" + "section": "Intuitively", + "text": "Intuitively\nParametric multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms\ne.g. 
\\(x^{(1)} x^{(2)}\\), \\(\\cos( x^{(1)} + x^{(2)})\\), etc.\n\nNonparametric multivariate regressors (e.g. KNN, local methods) automatically handle interactions.\nThe distance function (e.g. \\(d(x,x') = \\Vert x - x' \\Vert_2\\)) used by kernels implicitly defines infinitely many interactions!\n\nThis extra complexity (automatically including interactions, as well as other things) comes with a tradeoff." }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-1-3", - "href": "schedule/slides/12-why-smooth.html#issue-1-3", + "objectID": "schedule/slides/12-why-smooth.html#mathematically", + "href": "schedule/slides/12-why-smooth.html#mathematically", "title": "UBC Stat406 2024W", - "section": "Issue 1", - "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nHow does this compare to just using a linear model?\nBias\n\nThe bias of using a linear model when the truth nonlinear is a number \\(b > 0\\) which doesn’t depend on \\(n\\).\nThe bias of using kernel regression is \\(C_1/n^{4/5}\\). This goes to 0 as \\(n\\rightarrow\\infty\\).\n\nVariance\n\nThe variance of using a linear model is \\(C/n\\) no matter what\nThe variance of using kernel regression is \\(C_2/n^{4/5}\\)." + "section": "Mathematically", + "text": "Mathematically\nLet’s say \\(x_1, \\ldots, x_n\\) are distributed uniformly over the space \\(\\mathcal B_1(p)\\)\n\\(B_1(p)\\) is the “unit ball,” or the set of all \\(x\\) such that \\(\\Vert x \\Vert_2 \\leq 1\\).\n\n\nWhat is the maximum distance between any two points in \\(\\mathcal B_1(p)\\)?\n\n\n\\(\\Vert x - x' \\Vert_2 \\leq \\Vert x \\Vert_2 + \\Vert x' \\Vert_2 \\leq 1 + 1 = 2.\\)\n\n\n\nWhat about the average distance?" }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-1-4", - "href": "schedule/slides/12-why-smooth.html#issue-1-4", + "objectID": "schedule/slides/12-why-smooth.html#the-average-sq.-distance-between-points-in-mathcal-b_1p", + "href": "schedule/slides/12-why-smooth.html#the-average-sq.-distance-between-points-in-mathcal-b_1p", "title": "UBC Stat406 2024W", - "section": "Issue 1", - "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nTo conclude:\n\nbias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).\nbut variance of lines goes to zero faster than for kernels.\n\nIf the linear model is right, you win.\nBut if it’s wrong, you (eventually) lose as \\(n\\) grows.\nHow do you know if you have enough data?\nCompare of the kernel version with CV-selected tuning parameter with the estimate of the risk for the linear model." + "section": "The average (sq.) distance between points in \\(\\mathcal B_1(p)\\)", + "text": "The average (sq.) 
distance between points in \\(\\mathcal B_1(p)\\)\n\\[\n\\begin{align}\nE\\left[ \\Vert x - x' \\Vert_2^2 \\right]\n&=\nE\\left[ \\textstyle \\sum_{k=1}^p (x_k - x_k')^2 \\right]\n\\\\\n&= \\textstyle{\n E[ \\sum_{k=1}^p x_k^2 ]\n + 2 \\sum_{k=1}^p \\sum_{\\ell=1}^p \\underbrace{E[ x_l x'_k ]}_{=0}\n + E[ \\sum_{k=1}^p x_k^{\\prime 2} ]\n}\n\\\\\n&= 2 E[ \\textstyle{\\sum_{k=1}^p} x_k^2 ]\n= 2 E[ \\Vert x \\Vert_2^2 ]\n\\end{align}\n\\]\n\n\\(2 E[ \\Vert x \\Vert_2^2 ] = 2^{1 - 1/p}.\\)\n\n\n\n\nWhen \\(p=2\\), \\(\\frac{\\text{avg dist}}{\\text{max dist}} = 0.707\\)\nWhen \\(p=5\\), \\(\\frac{\\text{avg dist}}{\\text{max dist}} = 0.871\\)!\nWhen \\(p=10\\), \\(\\frac{\\text{avg dist}}{\\text{max dist}} = 0.933\\)!!\nWhen \\(p=100\\), \\(\\frac{\\text{avg dist}}{\\text{max dist}} = 0.993\\)!!!\n\n\n\n\n\n\nWhy is this problematic?\n\nAll points are maximally far apart from all other points\nCan’t distinguish between “similar” and “different” inputs" }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-2", - "href": "schedule/slides/12-why-smooth.html#issue-2", + "objectID": "schedule/slides/12-why-smooth.html#curse-of-dimensionality", + "href": "schedule/slides/12-why-smooth.html#curse-of-dimensionality", "title": "UBC Stat406 2024W", - "section": "Issue 2", - "text": "Issue 2\nFor \\(p>1\\), there is more trouble.\nFirst, lets look again at \\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nThat is for \\(p=1\\). It’s not that much slower than \\(C/n\\), the variance for linear models.\nIf \\(p>1\\) similar calculations show,\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]" + "section": "Curse of Dimensionality", + "text": "Curse of Dimensionality\nDistance becomes (exponentially) meaningless in high dimensions.*\n*(Unless our data has “low dimensional structure.”)\n\nRisk decomposition (\\(p > 1\\))\nAssuming optimal bandwidth of \\(n^{-1/(4+p)}\\)…\n\\[\nR_n^{(\\mathrm{basis})} =\n \\underbrace{C_1^{(b)}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(b)}}{n/p}}_{\\mathrm{var}} +\n \\sigma^2,\n\\qquad\nR_n^{(\\mathrm{local})} =\n \\underbrace{\\tfrac{C_1^{(l)}}{n^{4/(4+p)}}}_{\\mathrm{bias}^2} +\n \\underbrace{\\tfrac{C_2^{(l)}}{n^{4/(4+p)}}}_{\\mathrm{var}} +\n \\sigma^2.\n\\]\n\n\nObservations\n\n\\((C_1 + C_2) / n^{4/(4+p)}\\) is relatively big, but \\(C_2^{(b)} / (n/p)\\) is relatively small.\nSo unless \\(C_1^{(b)}\\) is big, we should use the linear model.*" }, { - "objectID": "schedule/slides/12-why-smooth.html#issue-2-1", - "href": "schedule/slides/12-why-smooth.html#issue-2-1", + "objectID": "schedule/slides/12-why-smooth.html#in-practice", + "href": "schedule/slides/12-why-smooth.html#in-practice", "title": "UBC Stat406 2024W", - "section": "Issue 2", - "text": "Issue 2\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]\nWhat if \\(p\\) is big (and \\(n\\) is really big)?\n\nThen \\((C_1 + C_2) / n^{4/(4+p)}\\) is still big.\nBut \\(Cp / n\\) is small.\nSo unless \\(b\\) is big, we should use the linear model.\n\nHow do you tell? Do model selection to decide.\nA very, very questionable rule of thumb: if \\(p>\\log(n)\\), don’t do smoothing." 
+ "section": "In practice", + "text": "In practice\nThe previous math assumes that our data are “densely” distributed throughout \\(\\R^p\\).\nHowever, if our data lie on a low-dimensional manifold within \\(\\R^p\\), then local methods can work well!\nWe generally won’t know the “intrinsic dimensinality” of our data though…\n\n\nHow to decide between basis expansions versus local kernel smoothers:\n\nModel selection\nUsing a very, very questionable rule of thumb: if \\(p>\\log(n)\\), don’t do smoothing." }, { "objectID": "schedule/slides/00-intro-to-class.html#section", diff --git a/sitemap.xml b/sitemap.xml index 72bd7a0..00e9b31 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,194 +2,194 @@ https://UBC-STAT.github.io/stat-406/schedule/slides/00-r-review.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/handouts/keras-nnet.html - 2024-10-07T17:57:46.538Z + 2024-10-07T20:06:45.718Z https://UBC-STAT.github.io/stat-406/schedule/slides/11-kernel-smoothers.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/09-l1-penalties.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/18-the-bootstrap.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/23-nnets-other.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/05-estimating-test-mse.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/13-gams-trees.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/26-pca-v-kpca.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-classification-losses.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/slides/20-boosting.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/27-kmeans.html - 2024-10-07T17:57:46.547Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/14-classification-intro.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/04-bias-variance.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/06-information-criteria.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/03-regression-function.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/21-nnets-intro.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/12-why-smooth.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-intro-to-class.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/handouts/lab00-git.html - 2024-10-07T17:57:46.538Z + 2024-10-07T20:06:45.718Z https://UBC-STAT.github.io/stat-406/course-setup.html - 2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/computing/windows.html - 2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/computing/mac_x86.html - 
2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/computing/index.html - 2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/index.html - 2024-10-07T17:57:46.522Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/computing/mac_arm.html - 2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/computing/ubuntu.html - 2024-10-07T17:57:46.521Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/syllabus.html - 2024-10-07T17:57:46.568Z + 2024-10-07T20:06:45.748Z https://UBC-STAT.github.io/stat-406/schedule/index.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-course-review.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-version-control.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/faq.html - 2024-10-07T17:57:46.522Z + 2024-10-07T20:06:45.701Z https://UBC-STAT.github.io/stat-406/schedule/slides/01-lm-review.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-cv-for-many-models.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/slides/19-bagging-and-rf.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/22-nnets-estimation.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/16-logistic-regression.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/08-ridge-regression.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-quiz-0-wrap.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z https://UBC-STAT.github.io/stat-406/schedule/slides/15-LDA-and-QDA.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/24-pca-intro.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/25-pca-issues.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/10-basis-expansions.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/28-hclust.html - 2024-10-07T17:57:46.547Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/07-greedy-selection.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/02-lm-example.html - 2024-10-07T17:57:46.545Z + 2024-10-07T20:06:45.725Z https://UBC-STAT.github.io/stat-406/schedule/slides/17-nonlinear-classifiers.html - 2024-10-07T17:57:46.546Z + 2024-10-07T20:06:45.726Z https://UBC-STAT.github.io/stat-406/schedule/slides/00-gradient-descent.html - 2024-10-07T17:57:46.544Z + 2024-10-07T20:06:45.724Z