Commit
Showing 18 changed files with 7,042 additions and 1,407 deletions.
20 changes: 20 additions & 0 deletions
_freeze/schedule/slides/16-logistic-regression/execute-results/html.json
@@ -0,0 +1,20 @@
{
"hash": "321f23ca5b468e97b2588b152a875222",
"result": {
"markdown": "---\nlecture: \"16 Logistic regression\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 09 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Last time\n\n\n* We showed that with two classes, the [Bayes' classifier]{.secondary} is\n\n$$g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}$$\n\nwhere $p_1(X) = Pr(X \\given Y=1)$ and $p_0(X) = Pr(X \\given Y=0)$\n\n* We then looked at what happens if we **assume** $Pr(X \\given Y=y)$ is Normally distributed.\n\nWe then used this distribution and the class prior $\\pi$ to find the __posterior__ $Pr(Y=1 \\given X=x)$.\n\n## Direct model\n\nInstead, let's directly model the posterior\n\n$$\n\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\n\\P(Y = 0 | X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\n\\end{aligned}\n$$\n\nThis is logistic regression.\n\n\n## Why this?\n\n$$Pr(Y = 1 \\given X=x) = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}$$\n\n* There are lots of ways to map $\\R \\mapsto [0,1]$.\n\n* The \"logistic\" function $z\\mapsto (1 + \\exp(-z))^{-1} = \\exp(z) / (1+\\exp(z)) =:h(z)$ is nice.\n\n* It's symmetric: $1 - h(z) = h(-z)$\n\n* Has a nice derivative: $h'(z) = \\frac{\\exp(z)}{(1 + \\exp(z))^2} = h(z)(1-h(z))$.\n\n* It's the inverse of the \"log-odds\" (logit): $\\log(p / (1-p))$.\n\n\n\n## Another linear classifier\n\nLike LDA, logistic regression is a linear classifier\n\nThe _logit_ (i.e.: log odds) transformation\ngives a linear decision boundary\n$$\\log\\left( \\frac{\\P(Y = 1 \\given X=x)}{\\P(Y = 0 \\given X=x) } \\right) = \\beta_0 + \\beta^{\\top} x$$\nThe decision boundary is the hyperplane\n$\\{x : \\beta_0 + \\beta^{\\top} x = 0\\}$\n\nIf the log-odds are below 0, classify as 0, above 0 classify as a 1.\n\n## Logistic regression is also easy in R\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogistic <- glm(y ~ ., dat, family = \"binomial\")\n```\n:::\n\n\nOr we can use lasso or ridge regression or a GAM as before\n\n\n::: {.cell 
layout-align=\"center\"}\n\n```{.r .cell-code}\nlasso_logit <- cv.glmnet(x, y, family = \"binomial\")\nridge_logit <- cv.glmnet(x, y, alpha = 0, family = \"binomial\")\ngam_logit <- gam(y ~ s(x), data = dat, family = \"binomial\")\n```\n:::\n\n\n\n::: aside\nglm means generalized linear model\n:::\n\n\n## Baby example (continued from last time)\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlogit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = \"binomial\")\nsummary(logit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nglm(formula = y ~ ., family = \"binomial\", data = mutate(dat1, \n y = y - 1))\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -2.6649 0.6281 -4.243 2.21e-05 ***\nx1 2.5305 0.5995 4.221 2.43e-05 ***\nx2 1.6610 0.4365 3.805 0.000142 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.469 on 99 degrees of freedom\nResidual deviance: 68.681 on 97 degrees of freedom\nAIC: 74.681\n\nNumber of Fisher Scoring iterations: 6\n```\n:::\n:::\n\n\n\n## Visualizing the classification boundary\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), \n x2 = seq(-2.5, 3, length.out = 100))\npts <- predict(logit, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_steps2(n.breaks = 6, name = \"log odds\") \ng0\n```\n\n::: {.cell-output-display}\n![](16-logistic-regression_files/figure-revealjs/plot-d1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Calculation\n\nWhile the `R` formula for logistic regression is straightforward, it's not as easy to compute as OLS or LDA or QDA.\n\n\nLogistic regression for two classes simplifies to a likelihood:\n\nWrite $p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)$\n\n* $P(Y_i = y_i \\given X = x_i, \\beta) = p_i^{y_i}(1-p_i)^{1-y_i}$ (...Bernoulli distribution)\n\n* $P(\\mathbf{Y} \\given \\mathbf{X}, \\beta) = \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$. 
\n\n\n## Calculation\n\n\nWrite $p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)$\n\n\n$$\n\\begin{aligned}\n\\ell(\\beta) \n& = \\log \\left( \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \\right)\\\\\n&=\\sum_{i=1}^n \\left( y_i\\log(p_i(\\beta)) + (1-y_i)\\log(1-p_i(\\beta))\\right) \\\\\n& = \n\\sum_{i=1}^n \\left( y_i\\log(e^{\\beta^{\\top}x_i}/(1+e^{\\beta^{\\top}x_i})) - (1-y_i)\\log(1+e^{\\beta^{\\top}x_i})\\right) \\\\\n& = \n\\sum_{i=1}^n \\left( y_i\\beta^{\\top}x_i -\\log(1 + e^{\\beta^{\\top} x_i})\\right)\n\\end{aligned}\n$$\n\nThis gets optimized via Newton-Raphson updates and iteratively reweighted\nleast squares.\n\n\n## IRWLS for logistic regression (skip for now)\n\n(This is preparation for Neural Networks.)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {\n p <- ncol(x)\n beta <- double(p) # initialize coefficients\n beta0 <- 0\n conv <- FALSE # hasn't converged\n iter <- 1 # first iteration\n while (!conv && (iter < maxit)) { # check loops\n iter <- iter + 1 # update first thing (so as not to forget)\n eta <- beta0 + x %*% beta\n mu <- 1 / (1 + exp(-eta))\n gp <- 1 / (mu * (1 - mu)) # inverse of derivative of logistic\n z <- eta + (y - mu) * gp # effective transformed response\n beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do Weighted Least Squares\n conv <- mean(abs(c(beta0, beta) - beta_new)) < tol # check if the betas are \"moving\"\n beta0 <- beta_new[1] # update betas\n beta <- beta_new[-1]\n }\n return(c(beta0, beta))\n}\n```\n:::\n\n\n\n## Comparing LDA and Logistic regression\n\nBoth decision boundaries are linear in $x$: \n\n- LDA $\\longrightarrow \\alpha_0 + \\alpha_1^\\top x$ \n- Logit $\\longrightarrow \\beta_0 + \\beta_1^\\top x$.\n\nBut the parameters are estimated differently.\n\n## Comparing LDA and Logistic regression\n\nExamine the joint distribution of $(X,\\ Y)$ [(not the posterior)]{.f3}: \n\n- LDA: $f(X_i,\\ Y_i) = \\underbrace{ f(X_i \\given Y_i)}_{\\textrm{Gaussian}}\\underbrace{ f(Y_i)}_{\\textrm{Bernoulli}}$\n- Logistic Regression: $f(X_i,Y_i) = \\underbrace{ f(Y_i\\given X_i)}_{\\textrm{Logistic}}\\underbrace{ f(X_i)}_{\\textrm{Ignored}}$\n \n* LDA estimates the joint, but Logistic estimates only the conditional (posterior) distribution. 
[But this is really all we need.]{.hand}\n\n* So logistic requires fewer assumptions.\n\n* But if the two classes are perfectly separable, logistic crashes (and the MLE is undefined, too many solutions)\n\n* LDA \"works\" even if the conditional isn't normal, but works very poorly if any X is qualitative\n\n\n## Comparing with QDA (2 classes)\n\n\n* Recall: this gives a \"quadratic\" decision boundary (it's a curve).\n\n* If we have $p$ columns in $X$\n - Logistic estimates $p+1$ parameters\n - LDA estimates $2p + p(p+1)/2 + 1$\n - QDA estimates $2p + p(p+1) + 1$\n \n* If $p=50$,\n - Logistic: 51\n - LDA: 1376\n - QDA: 2651\n \n* QDA doesn't get used much: there are better nonlinear versions with way fewer parameters\n\n## Bad parameter counting\n\nI've motivated LDA as needing $\\Sigma$, $\\pi$ and $\\mu_0$, $\\mu_1$\n\nIn fact, we don't _need_ all of this to get the decision boundary.\n\nSo the \"degrees of freedom\" is much lower if we only want the _classes_ and not\nthe _probabilities_.\n\nThe decision boundary only really depends on\n\n* $\\Sigma^{-1}(\\mu_1-\\mu_0)$ \n* $(\\mu_1+\\mu_0)$, \n* so appropriate algorithms estimate $<2p$ parameters.\n\n## Note again:\n\nwhile logistic regression and LDA produce linear decision boundaries, they are **not** linear smoothers\n\nAIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly\n\nMost people use either a test set or CV\n\n\n# Next time:\n\nNonlinear classifiers\n",
"supporting": [ | ||
"16-logistic-regression_files" | ||
], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": { | ||
"include-after-body": [ | ||
"\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n" | ||
] | ||
}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
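For reference, the IRWLS routine embedded in the slide source above (the `logit_irwls` function) can be sanity-checked against `glm()`: IRWLS is the same Fisher-scoring loop that `glm()` runs internally, so the two fits should agree. Below is a minimal, slightly restructured R sketch of that loop on simulated data; the data, the sample size, and all variable names here are illustrative and not part of the commit.

```r
# Minimal sketch (not from the commit): IRWLS for logistic regression vs glm().
set.seed(406)
n <- 250
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(-1 + 2 * x[, 1] + x[, 2]))  # Bernoulli responses

logit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {
  beta <- double(ncol(x))  # slope coefficients start at 0
  beta0 <- 0               # intercept starts at 0
  for (iter in seq_len(maxit)) {
    eta <- drop(beta0 + x %*% beta)            # current linear predictor
    mu <- plogis(eta)                          # current fitted probabilities
    w <- mu * (1 - mu)                         # IRWLS weights = Var(Y_i | x_i)
    z <- eta + (y - mu) / w                    # working (transformed) response
    beta_new <- coef(lm(z ~ x, weights = w))   # one weighted least squares step
    conv <- mean(abs(c(beta0, beta) - beta_new)) < tol
    beta0 <- beta_new[1]
    beta <- beta_new[-1]
    if (conv) break
  }
  c(beta0, beta)
}

# The two columns should match to several decimal places.
cbind(irwls = logit_irwls(y, x), glm = coef(glm(y ~ x, family = "binomial")))
```

The only difference from the slides' version is that the weights are written directly as `w = mu * (1 - mu)` rather than as the reciprocal of `gp`; the weighted least squares step is identical.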
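The slides' remark that logistic regression "crashes" when the two classes are perfectly separable is also easy to see numerically. This hypothetical snippet (again, not from the commit) builds a one-dimensional dataset with no class overlap; `glm()` typically warns that fitted probabilities are numerically 0 or 1, and the slope estimate and its standard error blow up because no finite MLE exists.

```r
# Illustration (assumed data, not from the repository): perfect separation.
x <- c(seq(-2, -1, length.out = 20), seq(1, 2, length.out = 20))  # no overlap
y <- rep(0:1, each = 20)                 # class 0 on the left, class 1 on the right
fit <- glm(y ~ x, family = "binomial")   # expect convergence / probability warnings
coef(summary(fit))                       # huge slope, even larger standard error
```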