Commit
Showing 18 changed files with 7,042 additions and 1,407 deletions.
20 changes: 20 additions & 0 deletions
_freeze/schedule/slides/16-logistic-regression/execute-results/html.json
@@ -0,0 +1,20 @@
{
"hash": "321f23ca5b468e97b2588b152a875222",
"result": {
"markdown": "---\nlecture: \"16 Logistic regression\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 09 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Last time\n\n\n* We showed that with two classes, the [Bayes' classifier]{.secondary} is\n\n$$g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}$$\n\nwhere $p_1(X) = Pr(X \\given Y=1)$ and $p_0(X) = Pr(X \\given Y=0)$\n\n* We then looked at what happens if we **assume** $Pr(X \\given Y=y)$ is Normally distributed.\n\nWe then used this distribution and the class prior $\\pi$ to find the __posterior__ $Pr(Y=1 \\given X=x)$.\n\n## Direct model\n\nInstead, let's directly model the posterior\n\n$$\n\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\n\\P(Y = 0 | X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\n\\end{aligned}\n$$\n\nThis is logistic regression.\n\n\n## Why this?\n\n$$Pr(Y = 1 \\given X=x) = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}$$\n\n* There are lots of ways to map $\\R \\mapsto [0,1]$.\n\n* The \"logistic\" function $z\\mapsto (1 + \\exp(-z))^{-1} = \\exp(z) / (1+\\exp(z)) =:h(z)$ is nice.\n\n* It's symmetric: $1 - h(z) = h(-z)$\n\n* Has a nice derivative: $h'(z) = \\frac{\\exp(z)}{(1 + \\exp(z))^2} = h(z)(1-h(z))$.\n\n* It's the inverse of the \"log-odds\" (logit): $\\log(p / (1-p))$.\n\n\n\n## Another linear classifier\n\nLike LDA, logistic regression is a linear classifier\n\nThe _logit_ (i.e.: log odds) transformation\ngives a linear decision boundary\n$$\\log\\left( \\frac{\\P(Y = 1 \\given X=x)}{\\P(Y = 0 \\given X=x) } \\right) = \\beta_0 + \\beta^{\\top} x$$\nThe decision boundary is the hyperplane\n$\\{x : \\beta_0 + \\beta^{\\top} x = 0\\}$\n\nIf the log-odds are below 0, classify as 0, above 0 classify as a 1.\n\n## Logistic regression is also easy in R\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogistic <- glm(y ~ ., dat, family = \"binomial\")\n```\n:::\n\n\nOr we can use lasso or ridge regression or a GAM as before\n\n\n::: {.cell 
layout-align=\"center\"}\n\n```{.r .cell-code}\nlasso_logit <- cv.glmnet(x, y, family = \"binomial\")\nridge_logit <- cv.glmnet(x, y, alpha = 0, family = \"binomial\")\ngam_logit <- gam(y ~ s(x), data = dat, family = \"binomial\")\n```\n:::\n\n\n\n::: aside\nglm means generalized linear model\n:::\n\n\n## Baby example (continued from last time)\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlogit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = \"binomial\")\nsummary(logit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nglm(formula = y ~ ., family = \"binomial\", data = mutate(dat1, \n y = y - 1))\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -2.6649 0.6281 -4.243 2.21e-05 ***\nx1 2.5305 0.5995 4.221 2.43e-05 ***\nx2 1.6610 0.4365 3.805 0.000142 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.469 on 99 degrees of freedom\nResidual deviance: 68.681 on 97 degrees of freedom\nAIC: 74.681\n\nNumber of Fisher Scoring iterations: 6\n```\n:::\n:::\n\n\n\n## Visualizing the classification boundary\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), \n x2 = seq(-2.5, 3, length.out = 100))\npts <- predict(logit, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_steps2(n.breaks = 6, name = \"log odds\") \ng0\n```\n\n::: {.cell-output-display}\n![](16-logistic-regression_files/figure-revealjs/plot-d1-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Calculation\n\nWhile the `R` formula for logistic regression is straightforward, it's not as easy to compute as OLS or LDA or QDA.\n\n\nLogistic regression for two classes simplifies to a likelihood:\n\nWrite $p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)$\n\n* $P(Y_i = y_i \\given X = x_i, \\beta) = p_i^{y_i}(1-p_i)^{1-y_i}$ (...Bernoulli distribution)\n\n* $P(\\mathbf{Y} \\given \\mathbf{X}, \\beta) = \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$. 
\n\n\n## Calculation\n\n\nWrite $p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)$\n\n\n$$\n\\begin{aligned}\n\\ell(\\beta) \n& = \\log \\left( \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \\right)\\\\\n&=\\sum_{i=1}^n \\left( y_i\\log(p_i(\\beta)) + (1-y_i)\\log(1-p_i(\\beta))\\right) \\\\\n& = \n\\sum_{i=1}^n \\left( y_i\\log(e^{\\beta^{\\top}x_i}/(1+e^{\\beta^{\\top}x_i})) - (1-y_i)\\log(1+e^{\\beta^{\\top}x_i})\\right) \\\\\n& = \n\\sum_{i=1}^n \\left( y_i\\beta^{\\top}x_i -\\log(1 + e^{\\beta^{\\top} x_i})\\right)\n\\end{aligned}\n$$\n\nThis gets optimized via Newton-Raphson updates and iteratively reweighted\nleast squares.\n\n\n## IRWLS for logistic regression (skip for now)\n\n(This is preparation for Neural Networks.)\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlogit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {\n p <- ncol(x)\n beta <- double(p) # initialize coefficients\n beta0 <- 0\n conv <- FALSE # hasn't converged\n iter <- 1 # first iteration\n while (!conv && (iter < maxit)) { # check loops\n iter <- iter + 1 # update first thing (so as not to forget)\n eta <- beta0 + x %*% beta\n mu <- 1 / (1 + exp(-eta))\n gp <- 1 / (mu * (1 - mu)) # inverse of derivative of logistic\n z <- eta + (y - mu) * gp # effective transformed response\n beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do Weighted Least Squares\n conv <- mean(abs(c(beta0, beta) - beta_new)) < tol # check if the betas are \"moving\"\n beta0 <- beta_new[1] # update betas\n beta <- beta_new[-1]\n }\n return(c(beta0, beta))\n}\n```\n:::\n\n\n\n## Comparing LDA and Logistic regression\n\nBoth decision boundaries are linear in $x$: \n\n- LDA $\\longrightarrow \\alpha_0 + \\alpha_1^\\top x$ \n- Logit $\\longrightarrow \\beta_0 + \\beta_1^\\top x$.\n\nBut the parameters are estimated differently.\n\n## Comparing LDA and Logistic regression\n\nExamine the joint distribution of $(X,\\ Y)$ [(not the posterior)]{.f3}: \n\n- LDA: $f(X_i,\\ Y_i) = \\underbrace{ f(X_i \\given Y_i)}_{\\textrm{Gaussian}}\\underbrace{ f(Y_i)}_{\\textrm{Bernoulli}}$\n- Logistic Regression: $f(X_i,Y_i) = \\underbrace{ f(Y_i\\given X_i)}_{\\textrm{Logistic}}\\underbrace{ f(X_i)}_{\\textrm{Ignored}}$\n \n* LDA estimates the joint, but Logistic estimates only the conditional (posterior) distribution. 
[But this is really all we need.]{.hand}\n\n* So logistic requires fewer assumptions.\n\n* But if the two classes are perfectly separable, logistic crashes (and the MLE is undefined, too many solutions)\n\n* LDA \"works\" even if the conditional isn't normal, but works very poorly if any X is qualitative\n\n\n## Comparing with QDA (2 classes)\n\n\n* Recall: this gives a \"quadratic\" decision boundary (it's a curve).\n\n* If we have $p$ columns in $X$\n - Logistic estimates $p+1$ parameters\n - LDA estimates $2p + p(p+1)/2 + 1$\n - QDA estimates $2p + p(p+1) + 1$\n \n* If $p=50$,\n - Logistic: 51\n - LDA: 1376\n - QDA: 2651\n \n* QDA doesn't get used much: there are better nonlinear versions with way fewer parameters\n\n## Bad parameter counting\n\nI've motivated LDA as needing $\\Sigma$, $\\pi$ and $\\mu_0$, $\\mu_1$\n\nIn fact, we don't _need_ all of this to get the decision boundary.\n\nSo the \"degrees of freedom\" is much lower if we only want the _classes_ and not\nthe _probabilities_.\n\nThe decision boundary only really depends on\n\n* $\\Sigma^{-1}(\\mu_1-\\mu_0)$ \n* $(\\mu_1+\\mu_0)$, \n* so appropriate algorithms estimate $<2p$ parameters.\n\n## Note again:\n\nwhile logistic regression and LDA produce linear decision boundaries, they are **not** linear smoothers\n\nAIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly\n\nMost people use either a test set or CV\n\n\n# Next time:\n\nNonlinear classifiers\n",
"supporting": [ | ||
"16-logistic-regression_files" | ||
], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": { | ||
"include-after-body": [ | ||
"\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n" | ||
] | ||
}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
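For reference, the IRWLS routine embedded in the slide source above (the `logit_irwls` function) can be sanity-checked against `glm()`: IRWLS is the same Fisher-scoring loop that `glm()` runs internally, so the two fits should agree. Below is a minimal, slightly restructured R sketch of that loop on simulated data; the data, the sample size, and all variable names here are illustrative and not part of the commit.

```r
# Minimal sketch (not from the commit): IRWLS for logistic regression vs glm().
set.seed(406)
n <- 250
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(-1 + 2 * x[, 1] + x[, 2]))  # Bernoulli responses

logit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {
  beta <- double(ncol(x))  # slope coefficients start at 0
  beta0 <- 0               # intercept starts at 0
  for (iter in seq_len(maxit)) {
    eta <- drop(beta0 + x %*% beta)            # current linear predictor
    mu <- plogis(eta)                          # current fitted probabilities
    w <- mu * (1 - mu)                         # IRWLS weights = Var(Y_i | x_i)
    z <- eta + (y - mu) / w                    # working (transformed) response
    beta_new <- coef(lm(z ~ x, weights = w))   # one weighted least squares step
    conv <- mean(abs(c(beta0, beta) - beta_new)) < tol
    beta0 <- beta_new[1]
    beta <- beta_new[-1]
    if (conv) break
  }
  c(beta0, beta)
}

# The two columns should match to several decimal places.
cbind(irwls = logit_irwls(y, x), glm = coef(glm(y ~ x, family = "binomial")))
```

The only difference from the slides' version is that the weights are written directly as `w = mu * (1 - mu)` rather than as the reciprocal of `gp`; the weighted least squares step is identical.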
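The slides' remark that logistic regression "crashes" when the two classes are perfectly separable is also easy to see numerically. This hypothetical snippet (again, not from the commit) builds a one-dimensional dataset with no class overlap; `glm()` typically warns that fitted probabilities are numerically 0 or 1, and the slope estimate and its standard error blow up because no finite MLE exists.

```r
# Illustration (assumed data, not from the repository): perfect separation.
x <- c(seq(-2, -1, length.out = 20), seq(1, 2, length.out = 20))  # no overlap
y <- rep(0:1, each = 20)                 # class 0 on the left, class 1 on the right
fit <- glm(y ~ x, family = "binomial")   # expect convergence / probability warnings
coef(summary(fit))                       # huge slope, even larger standard error
```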