---
title: "INFSCI 2595 Spring 2023 Homework: 03"
subtitle: "Assigned January 25, 2023; Due: February 01, 2023"
author: "Navodita Mathur"
date: "Submission time: February 01, 2023 at 11:00PM EST"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#### Collaborators
Include the names of your collaborators here.
## Overview
This assignment is focused on binary classification performance metrics. You will calculate the Accuracy, the confusion matrix, and a simple ROC curve manually on a simple example. Afterwards, you will assess the performance of multiple binary classifiers using `caret` to manage the resampling process and the `yardstick` package to calculate several performance metrics in a more realistic application. You will use Accuracy, the ROC curve, and the ROC AUC to compare multiple models and select the best one. Lastly, you will compare the models using the calibration curve.
**IMPORTANT**: The RMarkdown assumes you have downloaded the 3 data sets (CSV files) to the same directory you saved the template Rmarkdown file. If you do not have the 3 CSV files in the correct location, the data will not be loaded correctly.
### IMPORTANT!!!
Certain code chunks are created for you. Each code chunk has `eval=FALSE` set in the chunk options. You **MUST** change it to be `eval=TRUE` in order for the code chunks to be evaluated when rendering the document.
You are free to add more code chunks if you would like.
## Load packages
The tidyverse is loaded for you in the code chunk below.
```{r, load_tidyverse_pkg}
library(tidyverse)
```
The other packages, `caret` and `yardstick`, will be loaded as needed later in the assignment.
## Problem 01
The code chunk below reads in a data set that you will work with in Problems 1, 2 and 3. A glimpse is printed for you which shows three variables, an input `x`, a model predicted event probability, `pred_prob`, and the observed output class, `obs_class`. A binary classifier has already been trained for you. That model's predicted probability (`pred_prob`) is provided with the observed binary outcome (`obs_class`). You will use this data set to get experience with binary classification performance metrics.
```{r, read_binary_class_data}
example_data_path <- "hw03_example_binary_data.csv"
model_pred_df <- readr::read_csv(example_data_path, col_names = TRUE)
```
```{r, glimpse_read_in_data}
model_pred_df %>% glimpse()
```
### 1a)
**Pipe the `model_pred_df` data set into the `count()` function to display the number of unique values of the `obs_class` variable.**
#### SOLUTION
```{r, solution_01a}
### your code here
model_pred_df %>%
count(obs_class)
```
### 1b)
You should see that one of the values of `obs_class` is the event of interest and is named `"event"`.
**Use the `mean()` function to determine the fraction (or proportion) of the observations that correspond to the event of interest. Is the data set a balanced data set?**
#### SOLUTION
```{r, solution_01b}
### your code here
mean(model_pred_df$obs_class == "event")
```
Is the data set balanced?
As the mean is about 0.5, the data set can be considered balanced, or at least roughly balanced.
### 1c)
In lecture we discussed that regardless of the labels or classes associated with the binary response, we can encode the outcome as `y = 1` if the `"event"` is observed and `y = 0` if the `"non_event"` is observed. You will encode the output with this 0/1 encoding.
The `ifelse()` function can help you perform this operation. The `ifelse()` function is a one-line if-statement which operates similar to the IF function in Excel. The basic syntax is:
`ifelse(<conditional statement to check>, <value if TRUE>, <value if FALSE>)`
Thus, the user must specify a condition to check as the first argument to the `ifelse()` function. The second argument is the value to return if the conditional statement is TRUE, and the third argument is the value to return if the conditional statement is FALSE.
You can use the `ifelse()` statement within a `mutate()` call to create a new column in the `model_pred_df` data set.
The code chunk below provides an example using the first 10 rows from the `iris` data set which is loaded with base R. The `Sepal.Width` variable is compared to a value of 3.5. If `Sepal.Width` is greater than 3.5 the new variable, `width_factor`, is set equal to `"greater than"`. However, if it is less than 3.5 the new variable is set to `"less than"`.
```{r, show_ifelse_iris}
iris %>%
slice(1:10) %>%
select(starts_with("Sepal"), Species) %>%
mutate(width_factor = ifelse(Sepal.Width > 3.5,
"greater than",
"less than"))
```
You will use the `ifelse()` function combined with `mutate()` to add a column to the `model_pred_df` tibble.
**Pipe `model_pred_df` into a `mutate()` call in order to create a new column (variable) named `y`. The new variable, `y`, will equal the result of the `ifelse()` function. The conditional statement will be if `obs_class` is equal to the `"event"`. If TRUE assign `y` to equal the value 1. If FALSE, assign `y` to equal the value 0. Assign the result to the variable `model_pred_df` which overwrites the existing value.**
#### SOLUTION
```{r, solution_01c, eval=TRUE}
model_pred_df <- model_pred_df%>%
mutate(y = ifelse(obs_class == "event",
1,
0))
```
### 1d)
You will now visualize the observed binary outcome as encoded by 0 and 1.
**Pipe the `model_pred_df` object into `ggplot()`. Create a scatter plot between the encoded output `y` and the input `x`. Set the marker `size` to be 3.5 and the transparency (`alpha`) to be 0.5.**
#### SOLUTION
```{r, solution_01d}
### your code here
model_pred_df %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(size = 3.5, alpha = 0.5)
```
### 1e)
The `model_pred_df` includes a column (variable) for a model predicted event probability.
**Use the `summary()` function to confirm that the lower and upper bounds on `pred_prob` are in fact between 0 and 1.**
#### SOLUTION
```{r, solution_01e}
### your code here
summary(model_pred_df)
```
### 1f)
With the binary outcome encoded as 0/1 within the `y` variable we can overlay the model predicted probability on top of the observed binary response.
**Use a `geom_line()` to plot the predicted event probability, `pred_prob`, with respect to the input `x`. Set the line `color` to be `"red"` within the `geom_line()` call. Overlay the binary response with the encoded response `y` as a scatter plot with `geom_point()`. Use the same marker size and transparency that you used for Problem 1d).**
#### SOLUTION
```{r, solution_01f}
### your code here
model_pred_df %>%
  ggplot(mapping = aes(x = x)) +
  geom_point(mapping = aes(y = y), size = 3.5, alpha = 0.5) +
  geom_line(mapping = aes(y = pred_prob), color = "red")
```
### 1g)
**Does the observed binary response "follow" the model predicted probability?**
#### SOLUTION
What do you think?
Yes, the observed binary response follows the model predicted probability: observations with `y = 1` are concentrated where the predicted probability is high, and observations with `y = 0` where it is low.
## Problem 02
As you can see from the `model_pred_df` tibble, we have a model predicted probability but we do not have a corresponding classification.
### 2a)
In order to classify our predictions we must compare the predicted probability against a threshold. You will use `ifelse()` combined with `mutate()` to create a new variable `pred_class`. If the predicted probability, `pred_prob`, is greater than the threshold set the predicted class equal to `"event"`. If the predicted probability, `pred_prob`, is less than the threshold set the predicted class equal to the `"non_event"`.
**Use a threshold value of 0.5 and create the new variable `pred_class` such that the classification is `"event"` if the predicted probability is greater than the threshold and `"non_event"` if the predicted probability is less than the threshold. Assign the result to the new object `model_class_0.5`.**
#### SOLUTION
```{r, solution_02a, eval=TRUE}
model_class_0.5 <- model_pred_df %>%
  # use the same 0/1 encoding as the y column (1 = "event", 0 = "non_event")
  # so the classifications can be compared directly to y in later questions
  mutate(pred_class = ifelse(pred_prob > 0.5, 1, 0))
```
### 2b)
You should now have a tibble that has a model classification and the observed binary outcome.
**Calculate the Accuracy, the fraction of observations where the model classification is correct.**
#### SOLUTION
```{r, solution_02b}
### your code here
mean(model_class_0.5$y == model_class_0.5$pred_class)
```
### 2c)
We discussed in lecture how there are additional metrics we can consider with binary classification. Specifically, we can consider how a classification is correct, and how a classification is incorrect. A simple way to determine the counts per combination of `pred_class` and `obs_class` is with the `count()` function.
**Pipe `model_class_0.5` into `count()` with `pred_class` as the first argument and `obs_class` as the second argument. You should see 4 combinations and the number of rows in the data set associated with each combination (the number or count is given by the `n` variable).**
**How many observations are associated with False-Positives? How many observations are associated with True-Negatives?**
#### SOLUTION
```{r, solution_02c}
### your code here
model_class_0.5 %>%
count(pred_class,obs_class)
```
From the confusion counts above: TP = 35, FP = 6, TN = 53, FN = 31.
6 observations are associated with False-Positives and 53 observations are associated with True-Negatives.
### 2d)
**You will now calculate the Sensitivity and False Positive Rate (FPR) associated with the model predicted classifications based on a threshold of 0.5. This question is left open ended. It is your choice as to how you calculate the Sensitivity and FPR. However, you CANNOT use an existing function from a library which performs the calculations automatically for you. You are permitted to use `dplyr` data manipulation functions. Include as many code chunks as you feel are necessary.**
#### SOLUTION
```{r, solution_02d-1}
### add more code chunks if you need to
tp_0.5 = model_class_0.5 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "event") %>%
select(n)
tp_0.5 = tp_0.5[[1]]
```
```{r, solution_02d-2}
tn_0.5 = model_class_0.5 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "non_event") %>%
select(n)
tn_0.5 = tn_0.5[[1]]
```
```{r, solution_02d-3}
fp_0.5 = model_class_0.5 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "non_event") %>%
select(n)
fp_0.5 = fp_0.5[[1]]
```
```{r, solution_02d-4}
fn_0.5 = model_class_0.5 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "event") %>%
select(n)
fn_0.5 = fn_0.5[[1]]
```
```{r, solution_02d}
Sensitivity_0.5 = tp_0.5/(tp_0.5+fn_0.5)
Sensitivity_0.5
```
```{r}
# FPR = 1 - Specificity = FP / (FP + TN)
fpr_0.5 = 1 - (tn_0.5/(tn_0.5+fp_0.5))
fpr_0.5
```
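As an aside, the four confusion-matrix cells (and the two metrics) could also be computed in a single `dplyr` pass. The sketch below is an equivalent alternative to the step-by-step chunks above and is not evaluated here (`eval=FALSE`).
```{r, sketch_confusion_cells, eval=FALSE}
# Sketch (equivalent alternative): compute all four confusion-matrix cells at once.
model_class_0.5 %>%
  summarise(tp = sum(pred_class == 1 & obs_class == "event"),
            fp = sum(pred_class == 1 & obs_class == "non_event"),
            tn = sum(pred_class == 0 & obs_class == "non_event"),
            fn = sum(pred_class == 0 & obs_class == "event")) %>%
  mutate(sensitivity = tp / (tp + fn),
         fpr = fp / (fp + tn))
```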
### 2e)
We also discussed the ROC curve in addition to the confusion matrix. You will not have to calculate the ROC curve for a large number of threshold values. You will go through several calculations in order to demonstrate an understanding of the steps necessary to create an ROC curve.
The first action you must perform is to make classifications based on a different threshold compared to the default value of 0.5, which we used previously.
**Pipe the `model_pred_df` tibble into a `mutate()` function again, but this time determine the classifications based on a threshold value of 0.7 instead of 0.5. Assign the result to the object `model_class_0.7`.**
#### SOLUTION
```{r, solution_02e, eval=TRUE}
model_class_0.7 <- model_pred_df%>%
mutate(pred_class = ifelse(pred_prob > 0.7,
1,
0))
```
### 2f)
**Perform the same action as in Problem 2e), but this time for a threshold value of 0.3. Assign the result to the object `model_class_0.3`.**
#### SOLUTION
```{r, solution_02f, eval=TRUE}
model_class_0.3 <- model_pred_df%>%
mutate(pred_class = ifelse(pred_prob > 0.3,
1,
0))
```
## Problem 3
You will continue with the binary classification application in this problem.
### 3a)
**Calculate the Accuracy of the model classifications based on the 0.7 threshold. You CANNOT use an existing function that calculates Accuracy automatically for you. You are permitted to use `dplyr` data manipulation functions.**
#### SOLUTION
```{r, solution_03a}
### your code here
mean(model_class_0.7$y == model_class_0.7$pred_class)
```
### 3b)
**Calculate the Sensitivity and Specificity of the model classifications based on the 0.7 threshold. Again you can calculate these however you wish. Except you cannot use a model function library that performs the calculations automatically for you.**
#### SOLUTION
```{r, solution_03b-1}
### add more code chunks if you need to
tp_0.7 = model_class_0.7 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "event") %>%
select(n)
tp_0.7 = tp_0.7[[1]]
```
```{r, solution_03b-2}
tn_0.7 = model_class_0.7 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "non_event") %>%
select(n)
tn_0.7 = tn_0.7[[1]]
```
```{r, solution_03b-3}
fp_0.7 = model_class_0.7 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "non_event") %>%
select(n)
fp_0.7 = fp_0.7[[1]]
```
```{r, solution_03b-4}
fn_0.7 = model_class_0.7 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "event") %>%
select(n)
fn_0.7 = fn_0.7[[1]]
```
```{r, solution_03b}
Sensitivity_0.7 = tp_0.7/(tp_0.7+fn_0.7)
Sensitivity_0.7
```
```{r}
Specificity_0.7 = tn_0.7/(tn_0.7+fp_0.7)
Specificity_0.7
# FPR = 1 - Specificity (used later for the ROC points)
fpr_0.7 = 1 - Specificity_0.7
fpr_0.7
```
### 3c)
**Calculate the Accuracy of the model classifications based on the 0.3 threshold.**
#### SOLUTION
```{r, solution_03c}
### your code here
mean(model_class_0.3$y == model_class_0.3$pred_class)
```
### 3d)
**Calculate the Sensitivity and Specificity of the model classifications based on the 0.3 threshold. Again you can calculate these however you wish. Except you cannot use a model function library that performs the calculations automatically for you.**
#### SOLUTION
```{r, solution_03d-1}
### add more code chunks if you need to
tp_0.3 = model_class_0.3 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "event") %>%
select(n)
tp_0.3 = tp_0.3[[1]]
```
```{r, solution_03d-2}
tn_0.3 = model_class_0.3 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "non_event") %>%
select(n)
tn_0.3 = tn_0.3[[1]]
```
```{r, solution_03d-3}
fp_0.3 = model_class_0.3 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 1 , obs_class == "non_event") %>%
select(n)
fp_0.3 = fp_0.3[[1]]
```
```{r, solution_03d-4}
fn_0.3 = model_class_0.3 %>%
count(pred_class, obs_class) %>%
filter (pred_class == 0 , obs_class == "event") %>%
select(n)
fn_0.3 = fn_0.3[[1]]
```
```{r, solution_03d}
Sensitivity_0.3 = tp_0.3/(tp_0.3+fn_0.3)
Sensitivity_0.3
```
```{r}
Specificity_0.3 = tn_0.3/(tn_0.3+fp_0.3)
Specificity_0.3
# FPR = 1 - Specificity (used later for the ROC points)
fpr_0.3 = 1 - Specificity_0.3
fpr_0.3
```
### 3e)
You have calculated the Sensitivity and FPR at three different threshold values. You will plot your simple 3 point ROC curve and include a "45-degree" line as reference.
**Use `ggplot2` to plot your simple 3 point ROC curve. You must compile the necessary values into a data.frame or tibble. You must use `geom_point()` to show the markers, `geom_abline()` with `slope=1` and `intercept=0` to show the reference "45-degree" line. And you must use `coord_equal(xlim=c(0,1), ylim=c(0,1))` with your graphic. This way both axes are plotted between 0 and 1 and the axes are equal.**
#### SOLUTION
```{r, solution_03e}
### you may add more code chunks if you need to
t <- tibble(
  y = c(Sensitivity_0.3, Sensitivity_0.5, Sensitivity_0.7),
  x = c(fpr_0.3, fpr_0.5, fpr_0.7),
  threshold = as.factor(c(0.3, 0.5, 0.7))
)
t %>% ggplot(mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = threshold), size = 3) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal(xlim = c(0,1), ylim = c(0,1))
```
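For reference, the same Sensitivity/FPR calculation can be swept over many thresholds to trace out a full ROC curve. The sketch below uses only `dplyr`, `purrr`, and `ggplot2` on the existing `model_pred_df` columns and is not evaluated here (`eval=FALSE`); the `yardstick` package used later in the assignment automates this.
```{r, sketch_full_roc_sweep, eval=FALSE}
# Sketch: generalize the 3-point ROC by sweeping a fine grid of thresholds.
roc_sketch <- purrr::map_dfr(seq(0.01, 0.99, by = 0.01), function(thresh) {
  model_pred_df %>%
    mutate(pred_class = ifelse(pred_prob > thresh, 1, 0)) %>%
    summarise(threshold = thresh,
              sensitivity = sum(pred_class == 1 & y == 1) / sum(y == 1),
              fpr = sum(pred_class == 1 & y == 0) / sum(y == 0))
})
roc_sketch %>%
  ggplot(mapping = aes(x = fpr, y = sensitivity)) +
  geom_path() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1))
```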
## Problem 04
You have practiced calculating several important binary classification performance metrics. Let's use those metrics in a more realistic application to compare multiple models. The code chunk below reads in data that you will use for model training. The data consists of 4 inputs, `x1` through `x4`, and a binary outcome, `y`. The binary outcome is converted to a factor for you with the appropriate level order for modeling with `caret`.
```{r, read_train_data_set}
train_data_path <- 'hw03_train_data.csv'
df <- readr::read_csv(train_data_path, col_names = TRUE)
df <- df %>%
mutate(y = factor(y, levels = c("event", "non_event")))
```
A glimpse of the data is provided for you below.
```{r, show_train_data_glimpse}
df %>% glimpse()
```
### 4a)
**Are the levels of the binary outcome, `y`, balanced?**
#### SOLUTION
Add as many code chunks as you feel are necessary.
```{r}
mean(df$y == "event")
```
Yes, the two levels are balanced: the fraction of `event` observations computed above is roughly one half.
### 4b)
Although it is best to explore the data in greater detail when we start a data analysis project, we will jump straight to modeling in this assignment.
Download and install `yardstick` if you have not done so already.
**Load in the `caret` package and the `yardstick` packages below. Use a separate code chunk for each package.**
#### SOLUTION
Add the code chunks here.
```{r}
library(caret)
library(yardstick)
```
### 4c)
Just as with regression problems, we must first specify the resampling scheme and primary performance metric when we use `caret` for classification problems. All students will use the same primary performance metric in this assignment. We will begin by focusing on the Accuracy. That said, you are free to decide the kind of resampling scheme you wish to use.
The resampling scheme is controlled by the `trainControl()` function, just as it was with regression problems. You must specify the arguments to the `trainControl()` function accordingly in this problem.
**Specify the resampling scheme you wish to use and assign the result to the `ctrl_acc` object. Specify the primary performance metric to be Accuracy by assigning `'Accuracy'` to the `metric_acc` object.**
```{r, solution_04c, eval=TRUE}
ctrl_acc <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
metric_acc <- "Accuracy"
```
### 4d)
You are going to train 8 binary classifiers in this problem. The different models will use different features derived from the 4 inputs. The 8 models must have the following features:
* model 1: linear additive features all inputs
* model 2: linear features and quadratic features all inputs
* model 3: linear features and quadratic features just inputs 1 and 2
* model 4: linear features and quadratic features just inputs 1 and 3
* model 5: linear features and quadratic features just inputs 1 and 4
* model 6: linear features and quadratic features just inputs 2 and 3
* model 7: linear features and quadratic features just inputs 2 and 4
* model 8: linear features and quadratic features just inputs 3 and 4
Model 1 is the conventional "linear method" for binary classification. All other models have linear and quadratic terms to allow capturing non-linear relationships with the event probability (just how that works will be discussed later in the semester). Model 2 creates the features from all four inputs. The remaining 6 models use the combinations of just two of the inputs. This scheme is trying to identify the best possible set of inputs to model the binary outcome in a step-wise like fashion. This is **not** an efficient way to identify the best set of features to use. Later in the semester, we will learn a much more efficient *modeling approach* that performs this operation for us!
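As an aside (not required by the assignment), the six pairwise linear-plus-quadratic formulas listed above could be generated programmatically rather than typed by hand. A minimal sketch, not evaluated here (`eval=FALSE`):
```{r, sketch_pairwise_formulas, eval=FALSE}
# Sketch: build the six pairwise "linear + quadratic" formulas with combn().
input_pairs <- combn(paste0("x", 1:4), 2, simplify = FALSE)
pair_formulas <- purrr::map(input_pairs, function(pair) {
  rhs <- paste0(pair, " + I(", pair, "^2)", collapse = " + ")
  as.formula(paste("y ~", rhs))
})
pair_formulas[[1]]  # y ~ x1 + I(x1^2) + x2 + I(x2^2)
```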
**You must complete the 8 code chunks below. Use the formula interface to create the features in the model, analogous to the approach used in the previous assignment. You must specify the `method` argument in the `train()` function to be `"glm"`. You must specify the remaining arguments to `train()` accordingly.**
**The variable names and comments within the code chunks specify which model you are working with.**
*NOTE*: The models are trained in separate code chunks that way you can run each model separately from the others.
#### SOLUTION
```{r, solution_04d_01, eval=TRUE}
### model 1
set.seed(2021)
mod_1_acc <- train(y~x1+x2+x3+x4, method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_02, eval=TRUE}
### model 2
set.seed(2021)
mod_2_acc <- train(y~x1+I(x1^2)+x2+I(x2^2)+x3+I(x3^2)+x4+I(x4^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_03, eval=TRUE}
### model 3
set.seed(2021)
mod_3_acc <- train(y~x1+I(x1^2)+x2+I(x2^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_04, eval=TRUE}
### model 4
set.seed(2021)
mod_4_acc <- train(y~x1+I(x1^2)+x3+I(x3^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_05, eval=TRUE}
### model 5
set.seed(2021)
mod_5_acc <- train(y~x1+I(x1^2)+x4+I(x4^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_06, eval=TRUE}
### model 6
set.seed(2021)
mod_6_acc <- train(y~x2+I(x2^2)+x3+I(x3^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_07, eval=TRUE}
### model 7
set.seed(2021)
mod_7_acc <- train(y~x2+I(x2^2)+x4+I(x4^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
```{r, solution_04d_08, eval=TRUE}
### model 8
set.seed(2021)
mod_8_acc <- train(y~x3+I(x3^2)+x4+I(x4^2), method="glm", metric = metric_acc, trControl = ctrl_acc, data=df, family="binomial")
```
### 4e)
You will now compile all resample results together and compare the models based on their Accuracy.
**Complete the first code chunk below which assigns the models to the appropriate field within the `resamples()` function.**
**Then use the `summary()` function to summarize the Accuracy across the resamples and visualize the resample averaged performance with the `dotplot()` function from `caret`. In the function calls to both `summary()` and `dotplot()`, set the `metric` argument equal to `'Accuracy'`.**
**Which model is the best based on Accuracy? Are you confident it's the best?**
*HINT*: The field names within the list contained in the `resamples()` call correspond to the model object you should use.
#### SOLUTION
```{r, solution_04e_01, eval=TRUE}
acc_results <- resamples(list(mod_1 = mod_1_acc,
mod_2 = mod_2_acc,
mod_3 = mod_3_acc,
mod_4 = mod_4_acc,
mod_5 = mod_5_acc,
mod_6 = mod_6_acc,
mod_7 = mod_7_acc,
mod_8 = mod_8_acc))
```
Summarize the results across the resamples.
```{r, solution_04e_02}
### add your code here
summary(acc_results, metric = metric_acc)
```
Visualize the resample averaged Accuracy per model.
```{r, solution_04e_03}
### add your code here
dotplot(acc_results, metric = metric_acc)
```
Which model is the best?
Model 3 has the highest resample-averaged Accuracy, so it appears to be the best. Its margin over the next-best models should, however, be judged against the resample-to-resample variability shown in the `dotplot()` before being fully confident.
### 4f)
Next, you will consider how a model was correct and how a model was wrong via the confusion matrix. You are allowed to use the `confusionMatrix()` function from the `caret` package in this assignment to create the confusion matrix. A `caret` model object can be passed as the argument to the `confusionMatrix()` function. The function will then calculate the average confusion matrix across all resample test-sets. The resulting confusion matrix is displayed with percentages instead of counts, as shown in the lecture slides. The interpretations however are the same.
**Use the `confusionMatrix()` function to display the confusion matrix for the top two and worst two models according to Accuracy.**
**How do the False-Positive and False-Negative behavior compare between these four models?**
#### SOLUTION
Add as many code chunks as you feel are necessary.
```{r}
#Top 1
confusionMatrix(mod_3_acc)
```
```{r}
#Top 2
confusionMatrix(mod_2_acc)
```
```{r}
# Worst
confusionMatrix(mod_8_acc)
```
```{r}
#2nd Worst
confusionMatrix(mod_4_acc)
```
The proportions of False-Positives and False-Negatives increase as Accuracy decreases: the two worst models (8 and 4) show larger off-diagonal percentages in their confusion matrices than the two best models (3 and 2).
## Problem 05
Now that you have compared the models based on Accuracy, it is time to consider another performance metric. The Accuracy is calculated using a single threshold value. You will now examine the model performance across all possible thresholds by studying the ROC curve.
### 5a)
You will ultimately visually compare the ROC curves for the different models. Unfortunately with `caret`, we need to make several changes to the `trainControl()` function in order to support such comparisons. The code chunk below is started for you by including the necessary arguments you will need to visualize the ROC curves later.
**You must complete the code chunk by specifying the same resampling scheme you used in Problem 4c). You must also specify the primary performance metric as `'ROC'`. That is how `caret` knows it must calculate the Area Under the Curve (AUC) for the ROC curve.**
#### SOLUTION
```{r, solution_05a, eval=TRUE}
ctrl_roc <- trainControl( method = "repeatedcv", number = 5, repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = TRUE)
metric_roc <- "ROC"
```
### 5b)
You will retrain the same set of 8 models that you trained in Problem 4d), but this time using the ROC AUC as the primary performance metric.
**Complete the code chunks below so that you train the 8 models again, but this time focusing on the ROC AUC. The object name and comments within the code chunks specify the model you should use.**
#### SOLUTION
```{r, solution_05b_01, eval=TRUE}
### model 1
set.seed(2021)
mod_1_roc <- train(y~x1+x2+x3+x4, method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_02, eval=TRUE}
### model 2
set.seed(2021)
mod_2_roc <- train(y~x1+I(x1^2)+x2+I(x2^2)+x3+I(x3^2)+x4+I(x4^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_03, eval=TRUE}
### model 3
set.seed(2021)
mod_3_roc <- train(y~x1+I(x1^2)+x2+I(x2^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_04, eval=TRUE}
### model 4
set.seed(2021)
mod_4_roc <- train(y~x1+I(x1^2)+x3+I(x3^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_05, eval=TRUE}
### model 5
set.seed(2021)
mod_5_roc <- train(y~x1+I(x1^2)+x4+I(x4^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_06, eval=TRUE}
### model 6
set.seed(2021)
mod_6_roc <- train(y~x2+I(x2^2)+x3+I(x3^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_07, eval=TRUE}
### model 7
set.seed(2021)
mod_7_roc <- train(y~x2+I(x2^2)+x4+I(x4^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
```{r, solution_05b_08, eval=TRUE}
### model 8
set.seed(2021)
mod_8_roc <- train(y~x3+I(x3^2)+x4+I(x4^2), method="glm", metric = metric_roc, trControl = ctrl_roc, data=df, family="binomial")
```
### 5c)
You will now compile all resample results together and compare the models based on their area under the ROC curve.
**Complete the first code chunk below which assigns the models to the appropriate field within the `resamples()` function.**
**Then use the `summary()` function to summarize the ROC AUC across the resamples and visualize the resample averaged performance with the `dotplot()` function from `caret`. In the function calls to both `summary()` and `dotplot()`, set the `metric` argument equal to `'ROC'`.**
**Which model is the best based on ROC AUC? Are you confident it's the best?**
*HINT*: The field names within the list contained in the `resamples()` call correspond to the model object you should use.
#### SOLUTION
```{r, solution_05c_01, eval=TRUE}
roc_results <- resamples(list(mod_1 = mod_1_roc,
mod_2 = mod_2_roc,
mod_3 = mod_3_roc,
mod_4 = mod_4_roc,
mod_5 = mod_5_roc,
mod_6 = mod_6_roc,
mod_7 = mod_7_roc,
mod_8 = mod_8_roc))
```
Summarize the results across the resamples.
```{r, solution_05c_02}
### add your code here
summary(roc_results, metric = metric_roc)
```
Visualize the resample averaged ROC AUC per model.
```{r, solution_05c_03}
### add your code here
dotplot(roc_results, metric = metric_roc)
```
Which model is the best?
Model 3 has the highest resample-averaged ROC AUC and is therefore considered the best among the eight models.
### 5d)
By default, two other metrics are calculated by `caret` when we use the ROC AUC as the primary performance metric. Unlike the ROC AUC, these two metrics are calculated with the default threshold. `caret` labels the Sensitivity as the `Sens` metric and the Specificity as the `Spec` metric.
**Use the `summary()` and `dotplot()` functions again, but do not specify a metric. Just provide the `roc_results` as the input argument to the functions.**
**Which model has the highest True-Positive Rate at the default threshold? Which model has the lowest False-Positive Rate at the default threshold?**
#### SOLUTION
Add as many code chunks and as much text as you feel are necessary.
```{r}
summary(roc_results)
```
```{r}
dotplot(roc_results)
```
Model 3 has the highest True-Positive Rate at the default threshold. It also has the lowest False-Positive Rate at the default threshold.
### 5e)
In order to visualize the ROC curve we need to understand how the resample hold-out test predictions are stored within the `caret` model objects. By default, hold-out test set predictions are not retained, in order to conserve memory. However, the `ctrl_roc` object set `savePredictions = TRUE` which overrides the default behavior and stores each resample test-set predictions.
The predictions are contained within the `$pred` field of the `caret` model object. The code chunk below displays the first few rows of the predictions for the `mod_1_roc` result for you. Note that the code chunk below is not evaluated by default; when you execute it, you will see 7 columns. The column `obs` is the observed outcome and the column `event` is the predicted probability of the `event`. The `pred` column is the model classified outcome based on the default threshold of 50%. The `rowIndex` is the row from the original data set and serves to identify the row correctly. The `Resample` column tells us which resample fold the test point was associated with.
```{r, show_results_2e, eval=TRUE}
mod_1_roc$pred %>% tibble::as_tibble()
```
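As a quick sanity check (a sketch, not required and not evaluated here with `eval=FALSE`), the per-resample hold-out Accuracy can be recomputed directly from these saved predictions; it should be close to the resample summaries from Problem 4 since the same seed and resampling scheme were used.
```{r, sketch_resample_accuracy, eval=FALSE}
# Sketch: recompute per-resample Accuracy from the saved hold-out predictions.
mod_1_roc$pred %>%
  tibble::as_tibble() %>%
  group_by(Resample) %>%
  summarise(accuracy = mean(pred == obs), .groups = "drop") %>%
  summarise(mean_accuracy = mean(accuracy), sd_accuracy = sd(accuracy))
```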
The ROC curve is calculated by comparing the model predicted probability to all possible thresholds to create many different classifications. Those different classifications are used to calculate many different confusion matrices. Thus, the columns of primary interest in the prediction object displayed above are the `obs` and `event` columns.
You manually created 3 points on the ROC curve previously in this assignment. Do NOT worry, you do NOT need to repeat those actions here! Instead you will use the `roc_curve()` function from the `yardstick` package to create the ROC curve data for you. The `roc_curve()` function has three primary arguments. The first is a data object which contains the predictions in a "tidy" format. The second is the name of the column that corresponds to the observed outcome (the truth or reference). The third is the name of the column in the data set that corresponds to the model predicted event probability.
**Pipe the prediction data object for the `mod_1_roc` `caret` object to the `roc_curve()`. The `obs` column is the observed outcome and the `event` column is the model predicted event probability. Display the result to the screen to confirm the `roc_curve()` function worked. If it did the first few rows should correspond to very low threshold values.**
**Why does the `sensitivity` have values at or near 1 when the `.threshold` is so low?**
#### SOLUTION
```{r, solution_05e}
### add your code here
mod_1_roc$pred %>%
roc_curve(obs, event)
```
What do you think?
At a very low threshold, nearly every observation is classified as the `event`, so almost all true events are captured (few or no False-Negatives) and the sensitivity is at or near 1. As the threshold increases, fewer observations are classified as the `event`, so the sensitivity decreases; at a very high threshold it approaches 0.
### 5f)
You will now visualize the ROC curve associated with `mod_1_roc`.
**Repeat the same steps you performed in 5e) above, except pipe the result to the `autoplot()` method.**
#### SOLUTION
```{r, solution_05f}
### add your code here
mod_1_roc$pred %>%
roc_curve(obs, event) %>%
autoplot()
```
### 5g)
The ROC curve displayed in 5f) is the resample averaged ROC curve. You can examine the individual resample hold-out test set ROC curves by specifying a grouping structure with the `group_by()` function. This can help you get a sense of the variability in the ROC curve.
**Pipe the prediction object associated with `mod_1_roc` to the `group_by()` function where you specify the grouping variable to be `Resample`. Pipe the result to `roc_curve()` where you specify the same arguments as in the previous questions. Finally, pipe the result to `autoplot()`.**
#### SOLUTION
```{r, solution_05g}
### add your code here
mod_1_roc$pred %>%
group_by(Resample) %>%
roc_curve(obs, event) %>%
autoplot()
```
### 5h)
A function is defined for you in the code chunk below. This function compiles all model results together to enable comparing their ROC curves.
```{r, make_roc_compile_function}
compile_all_model_preds <- function(m1, m2, m3, m4, m5, m6, m7, m8)
{
purrr::map2_dfr(list(m1, m2, m3, m4, m5, m6, m7, m8),
as.character(seq_along(list(m1, m2, m3, m4, m5, m6, m7, m8))),
function(ll, lm){
ll$pred %>% tibble::as_tibble() %>%
select(obs, event, Resample) %>%
mutate(model_name = lm)
})
}
```
The code chunk below is also completed for you. It passes the `caret` model objects with the saved predictions to the `compile_all_model_preds()` function. The result is printed for you below so you can see the column names. Notice there is a new column `model_name` which stores the name of the model associated with the resample hold-out test set predictions. By default the code chunk below is not executed.
```{r, run_compile_all_model_preds, eval=TRUE}
all_model_preds <- compile_all_model_preds(mod_1_roc, mod_2_roc, mod_3_roc,
mod_4_roc, mod_5_roc,
mod_6_roc, mod_7_roc, mod_8_roc)
all_model_preds
```
You will now create a figure which displays the resample averaged ROC curve for each model.
**Pipe the `all_model_preds` object to `group_by()` and specify the grouping variable as `model_name`. Pipe the result to `roc_curve()` and specify the arguments accordingly. Pipe the result to `autoplot()` to generate the figure.**
**Which model is the best? Is the result consistent with the ROC AUC summary metrics you calculated previously? Which model is closest to a "completely ineffective model" and why is that so?**
#### SOLUTION
```{r, solution_05h}
### add your code here
all_model_preds %>%
group_by(model_name) %>%
roc_curve(obs, event) %>%
autoplot()
```
What do you think?
From the plot, model 3 appears to be the best, and this result is consistent with the ROC AUC summary metrics calculated previously. Model 8 is closest to a "completely ineffective model" because its ROC curve nearly overlaps the 45-degree reference line; a completely ineffective model is one that simply "swaps" one type of error for the other as the threshold changes.
## Problem 06
The performance metrics you have focused on up to this point are point-wise comparison metrics which evaluate the classification performance of a model. As discussed in lecture, binary classifier performance can also be measured based on the **calibration** between the predicted probability and the observed event proportion. The performance is represented graphically via the **calibration curve** which visualizes the correlation between the model predictions and the observed proportion of the event.
Regardless of your rankings in the previous questions, you will compare the performance of model 1, model 3, and model 8 with the calibration curve. Although multiple functions exist to create the calibration curve, you **must** create the calibration curve manually in this problem. You are only allowed to use functions from the `dplyr` and `ggplot2` packages. You are **not** allowed to use any third party function to calculate the calibration curve in this problem.
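For orientation, the general recipe is to bin the predicted event probabilities and then compare each bin's average predicted probability to the observed fraction of events in that bin. The sketch below illustrates this recipe with synthetic `pred_prob`/`y` columns (hypothetical data, only `dplyr` and `ggplot2`) and is not evaluated here (`eval=FALSE`); the actual solution should apply the same steps to the model predictions on the hold-out set.
```{r, sketch_manual_calibration, eval=FALSE}
# Sketch with synthetic data: bin predicted probabilities and compare each bin's
# average predicted probability to the observed event fraction.
tibble::tibble(pred_prob = runif(200), y = rbinom(200, 1, prob = 0.5)) %>%
  mutate(bin = cut(pred_prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(mean_pred = mean(pred_prob),
            obs_event_frac = mean(y),
            n = n(), .groups = "drop") %>%
  ggplot(mapping = aes(x = mean_pred, y = obs_event_frac)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_point(mapping = aes(size = n)) +
  geom_line() +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1))
```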
In the two previous problems, you used resampling to assess model performance. We could create the calibration curve based on the resample fold test sets, but we will instead start simpler and use a dedicated hold-out set that is different from the training data. The hold-out set is read in for you in the code chunk below.
```{r, read_test_set}
test_data_path <- 'hw03_test_data.csv'
df_test <- readr::read_csv(test_data_path, col_names = TRUE)
df_test <- df_test %>%
mutate(y = factor(y, levels = c("event", "non_event")))
```
The code chunk below shows a glimpse of the test set, which demonstrates the test set has the same column names as the training set.
```{r, show_test_set_glimpse}
df_test %>% glimpse()
```
### 6a)
The first step in creating the calibration curve is making predictions. The `predict()` function for a `caret` trained object has two main arguments. The first is the model object we will use to make predictions and the second, `newdata`, is the data set to predict. The input columns must have the same names as the inputs in the training data that trained the model.
By default, a binary classifier will return the classifications assuming a 50% threshold. As discussed in lecture, the calibration curve does not work with classifications. Instead we need the predicted probability! We can override the default prediction behavior by setting additional arguments to the `predict()` function. Specifically, the `type` argument instructs the model to return a "type" of prediction. We want the predicted probability and so you must set `type = 'prob'` in the `predict()` function call.
**Predict the hold-out test set using model 1 and assign the result to the variable `pred_test_1`. Return the predicted probability by setting the `type` argument to `'prob'`.**
**Print the data type of the `pred_test_1` object to the screen and use the `head()` function to display the "top" of the object.**
*HINT*: It does not matter whether you use the model 1 assessed based on Accuracy or ROC in this problem.
#### SOLUTION
```{r, solution_06a}
### add more code chunks if you like
pred_test_1 <- predict(mod_1_roc, newdata = df_test, type = 'prob')
class(pred_test_1)
```
```{r}
head(pred_test_1)