Commit

Merge pull request #1 from BowenZhang2001/main
Fix the coding problem.
BowenZhang2001 authored Jun 18, 2024
2 parents 460f400 + d53af3a commit 7d536e7
Showing 6 changed files with 1,062 additions and 2,144 deletions.
1,416 changes: 496 additions & 920 deletions data-science-tutorials/03-logit/logit.html

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions data-science-tutorials/03-logit/logit.qmd
@@ -215,23 +215,23 @@ fss_21_data <- censusapi::getCensus(name = "cps/foodsec/dec",
HHSUPWGT = as.numeric(HHSUPWGT),
# Combining some categories
- HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 3, 6, 9, 10),
+ HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 2, 5, 8, 10),
labels = c("MarriedFamily", "UnmarriedFamily",
"Individual", "GroupQuarters")),
GEREG = factor(GEREG, levels = c(1, 2, 3, 4),
labels = c("Northeast", "Midwest", "South", "West")),
- PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 39, 43, 46),
+ PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 38, 42, 46),
labels = c("LessThanHighSchool",
"HighSchoolOrAssociateDegree",
"CollegeOrHigher")),
- PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 3, 5, 7),
+ PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 2, 4, 7),
labels = c("Employed", "NotEmployed",
"NotInLaborForce")),
PRTAGE = as.numeric(PRTAGE),
PEHSPNON = factor(PEHSPNON, levels = c(1, 2),
labels = c("Hispanic", "Non-Hispanic")),
- PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 2, 3, 26),
+ PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 1, 2, 26),
labels = c("White", "Black", "Others")),
RACE = ifelse(PEHSPNON == "Hispanic",
"Hispanic", str_c(PTDTRACE, " non-Hispanic")),
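
A note on the `breaks` changes in this hunk: base R's `cut()` defaults to `right = TRUE`, so each bin is a half-open interval `(lower, upper]` and an integer code equal to a break falls into the bin on its left. The sketch below uses hypothetical integer codes 1 to 10 (an assumption standing in for a variable such as `HRHTYPE`, not data from the tutorial) to show which codes are labelled differently under the old and new breaks.

```r
# Hypothetical integer codes 1..10 (illustrative only, not the survey data)
codes <- 1:10
labels <- c("MarriedFamily", "UnmarriedFamily", "Individual", "GroupQuarters")

# cut() defaults to right = TRUE: bins are (lower, upper], so a code equal
# to a break value is assigned to the bin on the left of that break.
old_bins <- cut(codes, breaks = c(0, 3, 6, 9, 10), labels = labels)
new_bins <- cut(codes, breaks = c(0, 2, 5, 8, 10), labels = labels)

# Side-by-side view of how each code is labelled
comparison <- data.frame(code = codes, old = old_bins, new = new_bins)
print(comparison)

# Codes whose label differs between the two sets of breaks
changed <- codes[as.character(old_bins) != as.character(new_bins)]
print(changed)
```

Tabulating raw codes against the resulting labels in this way is one possible check for off-by-one boundaries before fitting a model.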
@@ -336,7 +336,7 @@ data.frame(threshold50 = metrics50, threshold10 = metrics10) |>
```
- If we set the threshold to 0.5, the accuracy is 0.90, which is quite higher than setting threshold to 0.1. However, the sensitivity is only 0.02, which means the model can only capture 2% of the households in low food security. In contrast, if we set the threshold to 0.1, the sensitivity is 0.69, which means the model can capture 68% of the households in low food security. However, the specificity decreases to 0.76. This trade-off is common in classification models.
+ If we set the threshold to 0.5, the accuracy is 0.90, which is quite higher than setting threshold to 0.1. However, the sensitivity is only 0.02, which means the model can only capture 2% of the households in low food security. In contrast, if we set the threshold to 0.1, the sensitivity is 0.71, which means the model can capture 71% of the households in low food security. However, the specificity decreases to 0.74. This trade-off is common in classification models.
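
The threshold trade-off described in this paragraph can be illustrated with a minimal sketch. The toy vectors `truth` and `prob` and the helper `classification_metrics()` are hypothetical (they are not taken from the tutorial), and the `>=` rule for assigning the positive class is an assumption about the convention.

```r
# Toy data (illustrative only): 1 = low food security, 0 = otherwise
truth <- c(1, 1, 0, 0, 0)
prob  <- c(0.6, 0.2, 0.3, 0.05, 0.4)

# Hypothetical helper: classify as positive when prob >= threshold,
# then compute accuracy, sensitivity, and specificity.
classification_metrics <- function(truth, prob, threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)  # true positives
  tn <- sum(pred == 0 & truth == 0)  # true negatives
  fp <- sum(pred == 1 & truth == 0)  # false positives
  fn <- sum(pred == 0 & truth == 1)  # false negatives
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

m50 <- classification_metrics(truth, prob, 0.5)
m10 <- classification_metrics(truth, prob, 0.1)
print(round(rbind(threshold50 = m50, threshold10 = m10), 2))
```

On this toy data, lowering the threshold raises sensitivity and lowers specificity, which is the same direction of change the paragraph reports for the fitted model.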
Therefore, we need a metric that can evaluate the model's performance under different thresholds. The ROC curve is a good choice. The **ROC** curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. The name “ROC” is historic, and comes from communications theory. It is an acronym for receiver operating characteristics.
@@ -363,7 +363,7 @@ ggroc(roc_data, legacy.axes = TRUE) +
label = paste("AUC =", round(auc(roc_data), 3)))
```
- The logistic regression model has an AUC of 0.788, which indicates that the model has a good discrimination ability. However, if we want to evaluate the model's predictive performance, simply fitting models and calculating AUC is not enough.
+ The logistic regression model has an AUC of 0.792, which indicates that the model has a good discrimination ability. However, if we want to evaluate the model's predictive performance, simply fitting models and calculating AUC is not enough.
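
To make the AUC figure concrete, here is a small base-R sketch of how an ROC curve and its area can be computed by hand. It is a swapped-in illustration, not the `pROC` code the tutorial uses; the toy data and the helpers `roc_points()` and `auc_rank()` are hypothetical names introduced here.

```r
# Toy data (illustrative only)
truth <- c(1, 1, 0, 0, 0)
prob  <- c(0.6, 0.2, 0.3, 0.05, 0.4)

# Hypothetical helper: one (FPR, TPR) point per candidate threshold,
# from the strictest threshold (nothing positive) to the loosest.
roc_points <- function(truth, prob) {
  thresholds <- sort(unique(c(-Inf, prob, Inf)), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) sum(prob >= t & truth == 1) / sum(truth == 1))
  fpr <- sapply(thresholds, function(t) sum(prob >= t & truth == 0) / sum(truth == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

roc_df <- roc_points(truth, prob)

# Area under the curve by the trapezoidal rule
auc_trapezoid <- sum(diff(roc_df$fpr) *
                     (head(roc_df$tpr, -1) + tail(roc_df$tpr, -1)) / 2)

# Equivalent rank-based (Mann-Whitney) form: the share of
# positive/negative pairs in which the positive case scores higher.
auc_rank <- function(truth, prob) {
  r <- rank(prob)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

print(c(trapezoid = auc_trapezoid, rank = auc_rank(truth, prob)))
```

The rank-based form gives a reading of AUC as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case.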
## Assessing model accuracy
465 changes: 110 additions & 355 deletions data-science-tutorials/04-rf/rf.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions data-science-tutorials/04-rf/rf.qmd
@@ -85,23 +85,23 @@ fss_21_data <- censusapi::getCensus(name = "cps/foodsec/dec",
HHSUPWGT = as.numeric(HHSUPWGT),
# Combining some categories
- HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 3, 6, 9, 10),
+ HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 2, 5, 8, 10),
labels = c("MarriedFamily", "UnmarriedFamily",
"Individual", "GroupQuarters")),
GEREG = factor(GEREG, levels = c(1, 2, 3, 4),
labels = c("Northeast", "Midwest", "South", "West")),
- PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 39, 43, 46),
+ PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 38, 42, 46),
labels = c("LessThanHighSchool",
"HighSchoolOrAssociateDegree",
"CollegeOrHigher")),
- PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 3, 5, 7),
+ PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 2, 4, 7),
labels = c("Employed", "NotEmployed",
"NotInLaborForce")),
PRTAGE = as.numeric(PRTAGE),
PEHSPNON = factor(PEHSPNON, levels = c(1, 2),
labels = c("Hispanic", "Non-Hispanic")),
- PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 2, 3, 26),
+ PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 1, 2, 26),
labels = c("White", "Black", "Others")),
RACE = ifelse(PEHSPNON == "Hispanic",
"Hispanic", str_c(PTDTRACE, " non-Hispanic")),
@@ -385,4 +385,4 @@ final_fit |>
collect_metrics()
```

- Random forest model has a better performance than logistic regression model in terms of AUC and accuracy.
+ Random forest model has a better performance than logistic regression model in terms of AUC.
1,295 changes: 441 additions & 854 deletions data-science-tutorials/05-dml/dml.html

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions data-science-tutorials/05-dml/dml.qmd
@@ -197,22 +197,22 @@ fss_20_data <- censusapi::getCensus(name = "cps/foodsec/dec",
HESP8 = ifelse(HESP8 == 1, 1, 0),
HRNUMHOU = as.numeric(HRNUMHOU),
HHSUPWGT = as.numeric(HHSUPWGT),
- HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 3, 6, 9, 10),
+ HRHTYPE = cut(as.numeric(HRHTYPE), breaks = c(0, 2, 5, 8, 10),
labels = c("MarriedFamily", "UnmarriedFamily",
"Individual", "GroupQuarters")),
GEREG = factor(GEREG, levels = c(1, 2, 3, 4),
labels = c("Northeast", "Midwest", "South", "West")),
- PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 39, 43, 46),
+ PEEDUCA = cut(as.numeric(PEEDUCA), breaks = c(30, 38, 42, 46),
labels = c("LessThanHighSchool",
"HighSchoolOrAssociateDegree",
"CollegeOrHigher")),
- PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 3, 5, 7),
+ PEMLR = cut(as.numeric(PEMLR), breaks = c(0, 2, 4, 7),
labels = c("Employed", "NotEmployed",
"NotInLaborForce")),
PRTAGE = as.numeric(PRTAGE),
PEHSPNON = factor(PEHSPNON, levels = c(1, 2),
labels = c("Hispanic", "Non-Hispanic")),
- PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 2, 3, 26),
+ PTDTRACE = cut(as.numeric(PTDTRACE), breaks = c(0, 1, 2, 26),
labels = c("White", "Black", "Others")),
RACE = ifelse(PEHSPNON == "Hispanic",
"Hispanic", str_c(PTDTRACE, " non-Hispanic")),

0 comments on commit 7d536e7
