Commit

Merge pull request #11 from UBC-STAT/djm/nnet
NNet Intro updates
dajmcdon authored Nov 3, 2023
2 parents a612af2 + 01d6c53 commit ec8a139
Showing 2 changed files with 23 additions and 22 deletions.
@@ -1,7 +1,7 @@
{
"hash": "f3fc08f00287583f6b5da27e5f8904c8",
"hash": "1764e2e9b2555daeb42521f17b1bf76f",
"result": {
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 12 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or 
\"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\nIn the mid 2000's, new approaches for\n[initializing]{.tertiary} neural networks became\navailable\n\n \nThese approaches are collectively known as [deep learning]{.secondary}\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall nonparametric regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k h_k(x)$ some basis $h_1,h_2,\\ldots$ (using $h$ instead of $\\phi$ to match ISLR)\n\n2. Truncate this expansion at $K$: \n $f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k h_k(x)$\n\n3. 
Estimate $\\beta_k$ with least squares\n\n\n## Recall nonparametric regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If $p$ is large, then nonparametrics doesn't work well at all (recall the Curse of Dimensionality)\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ g(w_{k0} + w_k^{\\top}x)\\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_{k0} + w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_0},{\\beta_k},{w_{k0}},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n- $\\beta_0$ and $w_{k0}$ are usually called [biases]{.secondary} (I'm going to set them to 0 and ignore them in future formulas. Just for space. It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n$\\beta \\in \\R^k$. 
\n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability: $\\hat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n## Two observations\n\n\n1. The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. 
If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the 
brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or \"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall nonparametric regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k \\phi_k(x)$ some basis \n$\\phi_1,\\phi_2,\\ldots$\n\n2. Truncate this expansion at $K$: \n$f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k \\phi_k(x)$\n\n3. Estimate $\\beta_k$ with least squares\n\n\n## Recall nonparametric regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If $p$ is large, then nonparametrics doesn't work well at all (recall the Curse of Dimensionality)\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? 
Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\sum_{k=1}^K \\beta_k \\ g(w_k^{\\top}x)\\\\\n&= \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g( w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_k},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n \n$$f(x) = \\sum_{k=1}^{{K}} \\beta_0 + {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n- Could add $\\beta_0$ and $w_{k0}$. Called [biases]{.secondary} \n(I'm going to ignore them. 
It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g(w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n<br/>\n\n$\\beta \\in \\R^k$ \n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability \n$\\longrightarrow\\ \\widehat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n## Two observations\n\n\n1. 
The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"supporting": [
"21-nnets-intro_files"
],
41 changes: 21 additions & 20 deletions schedule/slides/21-nnets-intro.qmd
@@ -70,13 +70,6 @@ the main one)

. . .

In the mid 2000's, new approaches for
[initializing]{.tertiary} neural networks became
available


These approaches are collectively known as [deep learning]{.secondary}


State-of-the-art performance on various classification
tasks has been accomplished via neural networks
@@ -105,10 +98,11 @@ In Module 2, we discussed basis expansion,



1. We know $f_*(x) =\sum_{k=1}^\infty \beta_k h_k(x)$ some basis $h_1,h_2,\ldots$ (using $h$ instead of $\phi$ to match ISLR)
1. We know $f_*(x) =\sum_{k=1}^\infty \beta_k \phi_k(x)$ some basis
$\phi_1,\phi_2,\ldots$

2. Truncate this expansion at $K$:
$f_*^K(x) \approx \sum_{k=1}^K \beta_k h_k(x)$
$f_*^K(x) \approx \sum_{k=1}^K \beta_k \phi_k(x)$

3. Estimate $\beta_k$ with least squares

@@ -135,14 +129,14 @@ An alternative would be to have the data
A single layer neural network model is
$$
\begin{aligned}
&f(x) = \beta_0 + \sum_{k=1}^K \beta_k h_k(x) \\
&= \beta_0 + \sum_{k=1}^K \beta_k \ g(w_{k0} + w_k^{\top}x)\\
&= \beta_0 + \sum_{k=1}^K \beta_k \ A_k\\
&f(x) = \sum_{k=1}^K \beta_k h_k(x) \\
&= \sum_{k=1}^K \beta_k \ g(w_k^{\top}x)\\
&= \sum_{k=1}^K \beta_k \ A_k\\
\end{aligned}
$$

[Compare:]{.secondary} A nonparametric regression
$$f(x) = \beta_0 + \sum_{k=1}^K \beta_k {\phi_k(x)}$$
$$f(x) = \sum_{k=1}^K \beta_k {\phi_k(x)}$$

:::

@@ -158,26 +152,32 @@ $$f(x) = \beta_0 + \sum_{k=1}^K \beta_k {\phi_k(x)}$$

## Terminology

$$f(x) = {\beta_0} + \sum_{k=1}^{{K}} {\beta_k} {g(w_{k0} + w_k^{\top}x)}$$
$$f(x) = \sum_{k=1}^{{K}} {\beta_k} {g( w_k^{\top}x)}$$
The main components are

- The derived features ${A_k = g(w_{k0} + w_k^{\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}
- The derived features ${A_k = g(w_k^{\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}
- The function $g$ is called the [activation function]{.secondary} (more on this later)
- The parameters
${\beta_0},{\beta_k},{w_{k0}},{w_k}$ are estimated from the data for all $k = 1,\ldots, K$.
${\beta_k},{w_k}$ are estimated from the data for all $k = 1,\ldots, K$.
- The number of hidden units ${K}$ is a tuning
parameter
- $\beta_0$ and $w_{k0}$ are usually called [biases]{.secondary} (I'm going to set them to 0 and ignore them in future formulas. Just for space. It's just an intercept)

$$f(x) = \sum_{k=1}^{{K}} \beta_0 + {\beta_k} {g(w_{k0} + w_k^{\top}x)}$$

- Could add $\beta_0$ and $w_{k0}$. Called [biases]{.secondary}
(I'm going to ignore them. It's just an intercept)


## Terminology

$$f(x) = {\beta_0} + \sum_{k=1}^{{K}} {\beta_k} {g(w_{k0} + w_k^{\top}x)}$$
$$f(x) = \sum_{k=1}^{{K}} {\beta_k} {g(w_k^{\top}x)}$$


Notes (no biases):

$\beta \in \R^k$.
<br/>

$\beta \in \R^k$

$w_k \in \R^p,\ k = 1,\ldots,K$

@@ -207,7 +207,8 @@ $$
:::
:::

Predict class with largest probability: $\hat{Y} = \argmax_{m} f_m(x)$
Predict class with largest probability
$\longrightarrow\ \widehat{Y} = \argmax_{m} f_m(x)$

## What about classification (10 classes, 2 layers)

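
For readers of this diff, here is a minimal sketch of the bias-free single-layer model that the revised Terminology slide writes as $f(x) = \sum_{k=1}^{K} \beta_k\, g(w_k^{\top}x)$. It is not part of the commit. The function names, the choice of ReLU for $g$, and the numbers are illustrative assumptions; the slides do not include code for this.

```python
def relu(u):
    """ReLU activation, one possible choice for g (assumed here for illustration)."""
    return max(0.0, u)


def single_layer(x, W, beta, g=relu):
    """Evaluate f(x) = sum_k beta_k * g(w_k^T x) with no biases.

    x    : list of p covariates
    W    : K rows, each a list of p weights (row k is w_k)
    beta : list of K output weights
    g    : activation function applied to each linear combination
    """
    if len(W) != len(beta):
        raise ValueError("W must have one row per entry of beta")
    # Hidden units (activations) A_k = g(w_k^T x)
    activations = [
        g(sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)))
        for w_k in W
    ]
    # Output is a linear combination of the activations
    return sum(b_k * a_k for b_k, a_k in zip(beta, activations))


# Hypothetical example: K = 2 hidden units, p = 3 covariates
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
beta = [2.0, -1.0]
x = [1.0, 2.0, 3.0]

print(single_layer(x, W, beta))                   # ReLU activations
print(single_layer(x, W, beta, g=lambda u: u))    # identity g: purely linear in x
```

Passing the identity for `g` makes the output a plain linear function of `x`, which is the case the deck's second observation asks students to work through.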
