From e543c285841c92e7c3b7d7881e12cf87ede493f2 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Mon, 16 Sep 2024 11:38:37 -0700 Subject: [PATCH] add comment about qq to lm example --- .../schedule/slides/02-lm-example/execute-results/html.json | 4 ++-- schedule/slides/02-lm-example.qmd | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/_freeze/schedule/slides/02-lm-example/execute-results/html.json b/_freeze/schedule/slides/02-lm-example/execute-results/html.json index a94f2ef..36ea7ed 100644 --- a/_freeze/schedule/slides/02-lm-example/execute-results/html.json +++ b/_freeze/schedule/slides/02-lm-example/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "ce6713c37b36f5c632c81bc34955b4cf", + "hash": "58bfae2046ecfa446faf34d45bb39fc6", "result": { "engine": "knitr", - "markdown": "---\nlecture: \"02 Linear model example\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 13 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Economic mobility\n\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"mobility\", package = \"Stat406\")\nmobility\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n \n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty , Seg_affluence , Commute ,\n# Income , Gini , Share01 , Gini_99 , Middle_class ,\n# Local_tax_rate , Local_gov_spending , Progressivity ,\n# EITC , School_spending , Student_teacher_ratio ,\n# Test_scores , HS_dropout , Colleges , Tuition ,\n# Graduation , Labor_force_participation , Manufacturing , …\n```\n\n\n:::\n:::\n\n\n\n\n::: {.notes}\n\nNote how many 
observations and predictors it has.\n\nWe'll use `Mobility` as the response\n\n:::\n\n\n\n## A linear model\n\n\n$$\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i$$ \n \nor equivalently\n\n$$E \\left[ \\biggl. \\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban}, \n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} + \n \\beta_2 \\, \\mbox{Urban} + \\cdots$$\n\n\n\n## Analysis\n\n\n- Randomly split into a training (say 3/4) and a test set (1/4)\n\n- Use training set to fit a model \n\n- Fit the \"full\" model\n\n- \"Look\" at the fit\n\n. . .\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ] |>\n\tselect(-Name, -ID, -State)\nn <- nrow(mob)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n```\n:::\n\n\n\n\n::: {.notes}\n\nWhy don't we include `Name` or `ID`?\n\n:::\n\n\n## Results\n\n(dispatch happening here!)\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808,\tAdjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n\n\n\n## Diagnostic plots\n\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nstuff <- tibble(\n residuals = residuals(full),\n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point() +\n geom_smooth(\n se = FALSE,\n colour = \"steelblue\",\n linewidth = 2\n ) +\n ggtitle(\"Residuals vs Fitted\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(size = 2) +\n geom_qq_line(linewidth = 2, color = \"steelblue\") +\n labs(\n x = \"Theoretical quantiles\",\n y = \"Standardized residuals\",\n title = \"Normal Q-Q\"\n )\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Can also just use dispatched `plot`\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n\n## Fit a reduced model\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers,\n data = train\n)\n\nsummary(reduced)$coefficients \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.1663344179 0.017769995 9.360409 1.829270e-18\nCommute 0.0637329409 0.014926382 4.269819 2.618234e-05\nGini_99 -0.1086058241 0.038958986 -2.787696 5.642726e-03\nTest_scores 0.0004997645 0.000256038 1.951915 5.186618e-02\nHS_dropout -0.2162067301 0.082003195 -2.636565 8.805228e-03\nManufacturing -0.1594229237 0.020215791 -7.886059 5.647668e-14\nMigration_in -0.3891567027 0.171839168 -2.264657 2.423771e-02\nReligious 0.0435673365 0.010463920 4.163577 4.084854e-05\nSingle_mothers -0.2864269552 0.046578928 -6.149282 2.444903e-09\n```\n\n\n:::\n\n```{.r .cell-code}\nreduced |>\n broom::glance() |>\n print(width = 120)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. 
-1429.\n deviance df.residual nobs\n \n1 0.159 304 313\n```\n\n\n:::\n:::\n\n\n\n\n\n## Diagnostic plots for reduced model\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-10-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n\n\n## How do we decide which model is better?\n\n::: flex\n\n::: w-50\n\n* Goodness of fit versus prediction power\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmap( # smaller AIC is better\n list(full = full, reduced = reduced),\n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n```\n\n\n:::\n:::\n\n\n\n\n* Use both models to predict `Mobility` \n\n* Compare both sets of predictions\n\n:::\n\n::: w-50\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmses <- function(preds, obs) round(mean((obs - preds)^2), 5)\nc(\n full = mses(\n predict(full, newdata = test),\n test$Mobility\n ),\n reduced = mses(\n predict(reduced, newdata = test),\n test$Mobility\n )\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n full reduced \n0.00072 0.00084 \n```\n\n\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |>\n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) +\n geom_point(color = \"orange\") +\n facet_wrap(~name, 2) +\n xlab(\"observed mobility\") +\n ylab(\"predicted mobility\") +\n geom_abline(slope = 1, intercept = 0, colour = \"darkblue\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-13-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n:::\n:::\n\n\n# Next time... 
\n\n\n[Module 1]{.secondary .large}\n\nSelecting models\n", + "markdown": "---\nlecture: \"02 Linear model example\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 September 2024\n\n\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Economic mobility\n\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"mobility\", package = \"Stat406\")\nmobility\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n \n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty , Seg_affluence , Commute ,\n# Income , Gini , Share01 , Gini_99 , Middle_class ,\n# Local_tax_rate , Local_gov_spending , Progressivity ,\n# EITC , School_spending , Student_teacher_ratio ,\n# Test_scores , HS_dropout , Colleges , Tuition ,\n# Graduation , Labor_force_participation , Manufacturing , …\n```\n\n\n:::\n:::\n\n\n\n\n::: {.notes}\n\nNote how many observations and predictors it has.\n\nWe'll use `Mobility` as the response\n\n:::\n\n\n\n## A linear model\n\n\n$$\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i$$ \n \nor equivalently\n\n$$E \\left[ \\biggl. \\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban}, \n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} + \n \\beta_2 \\, \\mbox{Urban} + \\cdots$$\n\n\n\n## Analysis\n\n\n- Randomly split into a training (say 3/4) and a test set (1/4)\n\n- Use training set to fit a model \n\n- Fit the \"full\" model\n\n- \"Look\" at the fit\n\n. . 
.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ] |>\n\tselect(-Name, -ID, -State)\nn <- nrow(mob)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n```\n:::\n\n\n\n\n::: {.notes}\n\nWhy don't we include `Name` or `ID`?\n\n:::\n\n\n## Results\n\n(dispatch happening here!)\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808,\tAdjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16\n```\n\n\n:::\n:::\n\n\n\n\n## Diagnostic plots\n\nNB: the line in the QQ plot isn't right for either `geom_qq_line` or `plot.lm`...\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nstuff <- tibble(\n residuals = residuals(full),\n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point() +\n geom_smooth(\n se = FALSE,\n colour = \"steelblue\",\n linewidth = 2\n ) +\n ggtitle(\"Residuals vs Fitted\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(size = 2) +\n geom_qq_line(linewidth = 2, color = \"steelblue\") +\n labs(\n x = \"Theoretical quantiles\",\n y = \"Standardized residuals\",\n title = \"Normal Q-Q\"\n )\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Can also just use dispatched `plot`\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n\n## Fit a reduced model\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers,\n data = train\n)\n\nsummary(reduced)$coefficients \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.1663344179 0.017769995 9.360409 1.829270e-18\nCommute 0.0637329409 0.014926382 4.269819 2.618234e-05\nGini_99 -0.1086058241 0.038958986 -2.787696 5.642726e-03\nTest_scores 0.0004997645 0.000256038 1.951915 5.186618e-02\nHS_dropout -0.2162067301 0.082003195 -2.636565 8.805228e-03\nManufacturing -0.1594229237 0.020215791 -7.886059 5.647668e-14\nMigration_in -0.3891567027 0.171839168 -2.264657 2.423771e-02\nReligious 0.0435673365 0.010463920 4.163577 4.084854e-05\nSingle_mothers -0.2864269552 0.046578928 -6.149282 2.444903e-09\n```\n\n\n:::\n\n```{.r .cell-code}\nreduced |>\n broom::glance() |>\n print(width = 120)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. 
-1429.\n deviance df.residual nobs\n \n1 0.159 304 313\n```\n\n\n:::\n:::\n\n\n\n\n\n## Diagnostic plots for reduced model\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-10-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n\n\n## How do we decide which model is better?\n\n::: flex\n\n::: w-50\n\n* Goodness of fit versus prediction power\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmap( # smaller AIC is better\n list(full = full, reduced = reduced),\n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n```\n\n\n:::\n:::\n\n\n\n\n* Use both models to predict `Mobility` \n\n* Compare both sets of predictions\n\n:::\n\n::: w-50\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmses <- function(preds, obs) round(mean((obs - preds)^2), 5)\nc(\n full = mses(\n predict(full, newdata = test),\n test$Mobility\n ),\n reduced = mses(\n predict(reduced, newdata = test),\n test$Mobility\n )\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n full reduced \n0.00072 0.00084 \n```\n\n\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |>\n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) +\n geom_point(color = \"orange\") +\n facet_wrap(~name, 2) +\n xlab(\"observed mobility\") +\n ylab(\"predicted mobility\") +\n geom_abline(slope = 1, intercept = 0, colour = \"darkblue\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-13-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n:::\n:::\n\n\n# Next time... \n\n\n[Module 1]{.secondary .large}\n\nSelecting models\n", "supporting": [ "02-lm-example_files" ], diff --git a/schedule/slides/02-lm-example.qmd b/schedule/slides/02-lm-example.qmd index ca2fc19..987c2ba 100644 --- a/schedule/slides/02-lm-example.qmd +++ b/schedule/slides/02-lm-example.qmd @@ -80,6 +80,7 @@ summary(full) ## Diagnostic plots +NB: the line in the QQ plot isn't right for either `geom_qq_line` or `plot.lm`... ```{r} #| output-location: column
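

A possible companion to the `NB` this patch adds (an illustrative sketch, not part of the patch itself): `geom_qq_line()` defaults to a reference line through the 25th/75th sample quantiles, and `plot(fit, 2)` appears to use a similar quartile-based `qqline()`, whereas standardized residuals should track the identity line y = x under exact normality, which seems to be the point of the new comment. The snippet below overlays both lines. It uses `mtcars` as a stand-in model so it runs without the course's `Stat406` data package, and the object name `fit` is only illustrative; swap in the `full` fit from the slides if the mobility data are available.

```r
library(ggplot2)

## Stand-in model so the snippet is self-contained; replace with the
## `full` model from the slides if the mobility data are loaded.
fit <- lm(mpg ~ ., data = mtcars)

qq_dat <- data.frame(stdresiduals = rstandard(fit))

ggplot(qq_dat, aes(sample = stdresiduals)) +
  geom_qq(size = 2) +
  ## default reference line: drawn through the 25th and 75th sample quantiles
  geom_qq_line(linewidth = 2, colour = "steelblue") +
  ## line standardized residuals should follow under exact normality: y = x
  geom_abline(slope = 1, intercept = 0, colour = "darkblue", linetype = "dashed") +
  labs(
    x = "Theoretical quantiles",
    y = "Standardized residuals",
    title = "Normal Q-Q with both reference lines"
  )
```

When the residuals are close to normal the two lines nearly coincide, so the overlay mostly matters for judging tail behaviour.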