# Regression {#mlregression}
In this chapter, we will use machine learning to predict *continuous values* that are associated with text data. Like in all predictive modeling tasks, this chapter demonstrates how to use learning algorithms to find and model relationships between an outcome or target variable and other input features. What is unique about the focus of this book is that our features are created from text data following the techniques laid out in Chapters \@ref(language) through \@ref(embeddings), and what is unique about the focus of this particular chapter is that our outcome is numeric and continuous. For example, let's consider a sample of opinions from the United States Supreme Court, available in the **scotus** [@R-scotus] package.
```{r scotussample}
library(tidyverse)
library(scotus)
scotus_filtered %>%
as_tibble()
```
This data set contains the entire text of each opinion in the `text` column, along with the `case_name` and `docket_number`. Notice that we also have the `year` that each case was decided by the Supreme Court; this is basically a continuous variable (rather than a group membership or discrete label).
```{block, type = "rmdnote"}
If we want to build a model to predict which court opinions were written in which years, we would build a regression model.
```
- A **classification model** predicts a class label or group membership.
- A **regression model** predicts a numeric or continuous value.
In text modeling, we use text data (such as the text of the court opinions), sometimes combined with other structured, non-text data, to predict the continuous value of interest (such as year of the court opinion). The goal of predictive modeling with text input features and a continuous outcome is to learn and model the relationship between the input features and the numeric target (outcome).
## A first regression model {#firstmlregression}
Let's build our first regression model using this sample of Supreme Court opinions. Before we start, let's check out how many opinions we have for each decade in Figure \@ref(fig:scotushist).
```{r scotushist, dependson="scotussample", fig.cap="Supreme Court opinions per decade in sample"}
scotus_filtered %>%
mutate(year = as.numeric(year),
year = 10 * (year %/% 10)) %>%
count(year) %>%
ggplot(aes(year, n)) +
geom_col() +
labs(x = "Year", y = "Number of opinions per decade")
```
This sample of opinions reflects the distribution over time of available opinions for analysis; there are many more opinions per year in this data set after about 1850 than before. This is an example of bias already in our data, as we discussed in the overview to these chapters, and we will need to account for that in choosing a model and understanding our results.
### Building our first regression model {#firstregression}
Our first step in building a model is to split our data into training and testing sets. We use functions from **tidymodels** for this; we use `initial_split()` to set up *how* to split the data, and then we use the functions `training()` and `testing()` to create the data sets we need. Let's also convert the year to a numeric value since it was originally stored as a character, and remove the `'` character because of its effect on one of the models^[The random forest implementation in the **ranger** package, demonstrated in Section \@ref(comparerf), does not handle special characters in column names well.] we want to try out.
```{r scotussplit, dependson="scotussample"}
library(tidymodels)
set.seed(1234)
scotus_split <- scotus_filtered %>%
mutate(year = as.numeric(year),
text = str_remove_all(text, "'")) %>%
initial_split()
scotus_train <- training(scotus_split)
scotus_test <- testing(scotus_split)
```
Next, let's \index{preprocessing}preprocess our data to get it ready for modeling using a recipe. We'll use both general preprocessing functions from **tidymodels** and specialized functions just for text from **textrecipes** in this preprocessing.
```{block2, type = "rmdpackage"}
The **recipes** package [@R-recipes] is part of **tidymodels** and provides functions for data preprocessing and feature engineering. The **textrecipes** package [@textrecipes] extends **recipes** by providing steps that create features for modeling from text, as we explored in the first five chapters of this book.
```
\index{preprocessing}
\index{feature engineering}
What are the steps in creating this recipe?
- First, we must specify in our initial `recipe()` statement the form of our model (with the formula `year ~ text`, meaning we will predict the year of each opinion from the text of that opinion) and what our training data is.
- Then, we tokenize (Chapter \@ref(tokenization)) the text of the court opinions.
- Next, we filter to only keep the top 1000 tokens by term frequency. We filter out those less frequent words because we expect them to be too rare to be reliable, at least for our first attempt. (We are _not_ removing stop words yet; we'll explore removing them in Section \@ref(casestudystopwords).)
- The recipe step `step_tfidf()`, used with defaults here, weights each token frequency by the inverse document frequency.\index{tf-idf}
- As a last step, we normalize (center and scale) these tf-idf values. This centering and scaling is needed because we're going to use a support vector machine model.
```{r scotusrec, dependson="scotussplit"}
library(textrecipes)
scotus_rec <- recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text) %>%
step_tokenfilter(text, max_tokens = 1e3) %>%
step_tfidf(text) %>%
step_normalize(all_predictors())
scotus_rec
```
Now that we have a full specification of the preprocessing recipe, we can `prep()` this recipe to estimate all the necessary parameters for each step using the training data and `bake()` it to apply the steps to data, like the training data (with `new_data = NULL`), testing data, or new data at prediction time.
```{r scotusprep, dependson="scotusrec"}
scotus_prep <- prep(scotus_rec)
scotus_bake <- bake(scotus_prep, new_data = NULL)
dim(scotus_bake)
```
For most modeling tasks, you will not need to `prep()` or `bake()` your recipe directly; instead you can build up a tidymodels `workflow()` to bundle together your modeling components.
```{block2, type = "rmdpackage"}
In **tidymodels**, the **workflows** package [@R-workflows] offers infrastructure for bundling model components. A _model workflow_ is a convenient way to combine different modeling components (a preprocessor plus a model specification); when these are bundled explicitly, it can be easier to keep track of your modeling plan, as well as fit your model and predict on new data.
```
Let's create a `workflow()` to bundle together our recipe with any model specifications we may want to create later. First, let's create an empty `workflow()` and then only add the data preprocessor `scotus_rec` to it.
```{r scotuswf, dependson="scotusrec"}
scotus_wf <- workflow() %>%
add_recipe(scotus_rec)
scotus_wf
```
Notice that there is no model yet: `Model: None`. It's time to specify the model we will use! Let's build a support vector machine (SVM) model. While SVMs don't see widespread use in cutting-edge machine learning research today, they are frequently used in practice, have properties that make them well-suited for text classification [@Joachims1998], and can give good performance [@Vantu2016].
```{block2, type = "rmdnote"}
An SVM model can be used for either regression or classification, and linear SVMs often work well with text data. Even better, linear SVMs typically do not need to be tuned (see Section \@ref(tunelasso) for tuning model hyperparameters).
```
Before fitting, we set up a model specification. There are three components to specifying a model using tidymodels: the model algorithm (a linear SVM here), the mode (typically either classification or regression), and the computational engine we are choosing to use. For our linear SVM, let's use the **LiblineaR** engine [@R-LiblineaR].
```{r svmspec}
svm_spec <- svm_linear() %>%
set_mode("regression") %>%
set_engine("LiblineaR")
```
Everything is now ready for us to fit our model. Let's add our model to the workflow with `add_model()` and fit to our training data `scotus_train`.
```{r svmfit, dependson=c("svmspec", "scotuswf")}
svm_fit <- scotus_wf %>%
add_model(svm_spec) %>%
fit(data = scotus_train)
```
We have successfully fit an SVM model to this data set of Supreme Court opinions. What does the result look like? We can access the fit using `extract_fit_parsnip()`, and even `tidy()` the model coefficient results into a convenient dataframe format.
```{r dependson="svmfit"}
svm_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(-estimate)
```
```{r echo=FALSE, dependson="svmfit"}
test_words1 <-
svm_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
slice_max(estimate, n = 10) %>%
mutate(term = str_remove(term, "tfidf_text_")) %>%
pull(term)
if (!all(c("appeals", "petitioner") %in% test_words1)) {
rlang::abort('In Chapter 6 on regression, tidied SVM terms (more recent) do not match previous results.')
}
```
The term `Bias` here means the same thing as an intercept. We see here what terms contribute to a Supreme Court opinion being written more recently, like "appeals" and "petitioner".
What terms contribute to a Supreme Court opinion being written further in the past, for this first attempt at a model?
```{r dependson="svmfit"}
svm_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(estimate)
```
```{r echo=FALSE, dependson="svmfit"}
test_words2 <-
svm_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
slice_max(-estimate, n = 10) %>%
mutate(term = str_remove(term, "tfidf_text_")) %>%
pull(term)
if (!all(c("ought", "therefore") %in% test_words2)) {
rlang::abort('In Chapter 6 on regression, tidied SVM terms (older) do not match previous results.')
}
```
Here we see words like "ought" and "therefore".
### Evaluation {#firstregressionevaluation}
One option for evaluating our model is to predict one time on the testing set to measure performance.
```{block, type = "rmdwarning"}
The testing set is extremely valuable data, however, and in real-world situations, we advise that you only use this precious resource one time (or at most, twice).
```
The purpose of the testing data is to estimate how your final model will perform on new data; we set aside a proportion of the data available and pretend that it is not available to us for training the model so we can use it to estimate model performance on strictly out-of-sample data. Often during the process of modeling, we want to compare models or different model parameters. If we use the test set for these kinds of tasks, we risk fooling ourselves that we are doing better than we really are.
Another option for evaluating models is to predict one time on the training set to measure performance. This is the _same data_ that was used to train the model, however, and evaluating on the training data often results in performance estimates that are too optimistic. This is especially true for powerful machine learning algorithms that can learn subtle patterns from data; we risk overfitting to the training set.\index{models!comparing}
Yet another option for evaluating or comparing models is to use a separate validation set. In this situation, we split our data _not_ into two sets (training and testing) but into three sets (testing, training, and validation). The validation set is used for computing performance metrics to compare models or model parameters. This can be a great option if you have enough data for it, but often we as machine learning practitioners are not so lucky.
What are we to do, then, if we want to train multiple models and find the best one? Or compute a reliable estimate for how our model has performed without wasting the valuable testing set? We can use **resampling**. When we resample, we create new simulated data sets from the training set for the purpose of, for example, measuring model performance.
Let's estimate the performance of the linear SVM regression model we just fit. We can do this using resampled data sets built from the training set.
```{block2, type = "rmdpackage"}
In **tidymodels**, the package for data splitting and resampling is **rsample** [@R-rsample].
```
Let's create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
```{r scotusfolds, dependson="scotussplit"}
set.seed(123)
scotus_folds <- vfold_cv(scotus_train)
scotus_folds
```
Each of these "splits" contains information about how to create cross-validation folds from the original training data. In this example, 90% of the training data is included in each fold for analysis and the other 10% is held out for assessment. Since we used cross-validation, each Supreme Court opinion appears in only one of these held-out assessment sets.
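If you want to see this split for yourself, you can pull out a single fold and check the size of its analysis and assessment portions. This is just a quick sanity check, not a required modeling step.

```{r, dependson="scotusfolds"}
## Look at the first fold: most of the training data is used for analysis,
## and the remaining ~10% is held out for assessment
first_fold <- scotus_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))
```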
In Section \@ref(firstregression), we fit one time to the training data as a whole. Now, to estimate how well that model performs, let's fit many times, once to each of these resampled folds, and then evaluate on the held-out part of each resampled fold.
```{r svmrs, dependson=c("scotuswf", "scotusfolds", "svmspec")}
set.seed(123)
svm_rs <- fit_resamples(
scotus_wf %>% add_model(svm_spec),
scotus_folds,
control = control_resamples(save_pred = TRUE)
)
svm_rs
```
These results look a lot like the resamples, but they have some additional columns, like the `.metrics` that we can use to measure how well this model performed and the `.predictions` we can use to explore that performance more deeply. What results do we see, in terms of performance metrics?
```{r, dependson="svmrs"}
collect_metrics(svm_rs)
```
```{r firstattemprmse, dependson="svmrs", echo=FALSE}
first_attemp_rmse <- collect_metrics(svm_rs) %>%
filter(.metric == "rmse") %>%
pull(mean) %>%
round(1)
```
The default performance metrics to be computed for regression models are RMSE (root mean squared error) and $R^2$ (coefficient of determination). RMSE is a metric that is in the same units as the original data, so in units of _years_, in our case; the RMSE of this first regression model is `r first_attemp_rmse` years.
\index{root mean squared error|see {RMSE}}
\index{RMSE}
\index{coefficient of determination}
```{block, type = "rmdnote"}
RMSE and $R^2$ are performance metrics used for regression models.
RMSE is a measure of the difference between the predicted and observed values; if the model fits the data well, RMSE is lower. To compute RMSE, you take the mean of the squared differences between the predicted and observed values, then take the square root.
$R^2$ is the squared correlation between the predicted and observed values. When the model fits the data well, the predicted and observed values are closer together with a higher correlation between them. The correlation between two variables is bounded between −1 and 1, so the closer $R^2$ is to one, the better.
```
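To make these definitions concrete, here is a small sketch that computes both quantities directly from the resampled predictions. Since this pools predictions across all folds, the values will differ slightly from the fold-averaged estimates reported by `collect_metrics()`.

```{r, dependson="svmrs"}
## Compute RMSE and R squared by hand from the pooled out-of-sample predictions
svm_rs %>%
  collect_predictions() %>%
  summarize(
    rmse_by_hand = sqrt(mean((year - .pred) ^ 2)),
    rsq_by_hand  = cor(year, .pred) ^ 2
  )
```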
These values are quantitative estimates for how well our model performed, and can be compared across different kinds of models. Figure \@ref(fig:firstregpredict) shows the predicted years for these Supreme Court opinions plotted against the true years when they were published, for all the resampled data sets.
```{r firstregpredict, dependson="svmrs", opts.label = "fig.large", fig.cap="Most Supreme Court opinions are near the dashed line, indicating good agreement between our SVM regression predictions and the real years"}
svm_rs %>%
collect_predictions() %>%
ggplot(aes(year, .pred, color = id)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_point(alpha = 0.3) +
labs(
x = "Truth",
y = "Predicted year",
color = NULL,
title = "Predicted and true years for Supreme Court opinions",
subtitle = "Each cross-validation fold is shown in a different color"
)
```
The average spread of points in this plot above and below the dashed line corresponds to RMSE, which is `r first_attemp_rmse` years for this model. When RMSE is better (lower), the points will be closer to the dashed line. This first model we have tried did not do a great job for Supreme Court opinions from before 1850, but for opinions after 1850, this looks pretty good!
```{block, type = "rmdwarning"}
Hopefully you are convinced that using resampled data sets for measuring performance is the right choice, but it can be computationally expensive. Instead of fitting once, we must fit the model one time for _each_ resample. The resamples are independent of each other, so this is a great fit for parallel processing. The tidymodels framework is designed to work fluently with parallel processing in R, using multiple cores or multiple machines. The implementation details of parallel processing are operating system specific, so [look at tidymodels' documentation for how to get started](https://tune.tidymodels.org/articles/extras/optimizations.html).
```
## Compare to the null model {#regnull}
One way to assess a model like this one is to compare its performance to a "null model".
```{block2, type = "rmdnote"}
A null model is a simple, non-informative model that always predicts the largest class (for classification) or the mean (such as the mean year of Supreme Court opinions, in our specific regression case)^[This is sometimes called a "baseline model".].
```
We can use the same function `fit_resamples()` and the same preprocessing recipe as before, switching out our SVM model specification for the `null_model()` specification.
```{r nullrs, dependson=c("scotuswf", "scotusfolds")}
null_regression <- null_model() %>%
set_engine("parsnip") %>%
set_mode("regression")
null_rs <- fit_resamples(
scotus_wf %>% add_model(null_regression),
scotus_folds,
metrics = metric_set(rmse)
)
null_rs
```
What results do we obtain from the null model, in terms of performance metrics?
```{r, dependson="nullrs"}
collect_metrics(null_rs)
```
The RMSE indicates that this null model is dramatically worse than our first model. Even our very first attempt at a regression model (using only unigrams and very little specialized preprocessing)\index{preprocessing} did much better than the null model; the text of the Supreme Court opinions has enough information in it related to the year the opinions were published that we can build successful models.
## Compare to a random forest model {#comparerf}
Random forest models are broadly used in predictive modeling contexts because they are low-maintenance and perform well. For example, see @Caruana2008 and @Olson2017 for comparisons of the performance of common models such as random forest, decision tree, support vector machines, etc. trained on benchmark data sets; random forest models were one of the best overall. Let's see how a random forest model performs with our data set of Supreme Court opinions.
First, let's build a random forest model specification, using the ranger implementation. Random forest models are known for performing well without hyperparameter tuning, so we will just make sure we have enough `trees`.
```{r scotusrfspec}
rf_spec <- rand_forest(trees = 1000) %>%
set_engine("ranger") %>%
set_mode("regression")
rf_spec
```
Now we can fit this random forest model. Let's use `fit_resamples()` again, so we can evaluate the model performance. We will use three arguments to this function:
- Our modeling `workflow()`, with the same preprocessing recipe we have been using so far in this chapter plus our new random forest model specification
- Our cross-validation resamples of the Supreme Court opinions
- A `control` argument to specify that we want to keep the predictions, to explore after fitting
```{r scotusrfrs, dependson=c("scotuswf", "scotusfolds", "scotusrfspec")}
rf_rs <- fit_resamples(
scotus_wf %>% add_model(rf_spec),
scotus_folds,
control = control_resamples(save_pred = TRUE)
)
```
We can use `collect_metrics()` to obtain and format the performance metrics for this random forest model.
```{r rfmetrics, dependson="scotusrfrs"}
collect_metrics(rf_rs)
```
This looks pretty promising, so let's explore the predictions for this random forest model.
```{r rfpredict, dependson="scotusrfrs", opts.label = "fig.large", fig.cap="The random forest model did not perform very sensibly across years, compared to our first attempt using a linear SVM model"}
collect_predictions(rf_rs) %>%
ggplot(aes(year, .pred, color = id)) +
geom_abline(lty = 2, color = "gray80", size = 1.5) +
geom_point(alpha = 0.3) +
labs(
x = "Truth",
y = "Predicted year",
color = NULL,
title = paste("Predicted and true years for Supreme Court opinions using",
"a random forest model", sep = "\n"),
subtitle = "Each cross-validation fold is shown in a different color"
)
```
Figure \@ref(fig:rfpredict) shows some of the strange behavior from our fitted model. The overall performance metrics look pretty good, but predictions are too high and too low around certain threshold years. \index{models!challenges}
It is very common to run into problems when using tree-based models like random forests with text data. One of the defining characteristics of text data is that it is _sparse_, with many features but most features not occurring in most observations. Tree-based models such as random forests are often not well-suited to sparse data because of how decision trees model outcomes [@Tang2018].
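As a rough illustration of this sparsity (an extra check, not part of our modeling pipeline), we can bake a version of our recipe without the normalization step and count what proportion of the tf-idf values are exactly zero.

```{r, dependson="scotussplit"}
## Proportion of zero tf-idf values in a non-normalized version of our features
sparse_check <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  prep() %>%
  bake(new_data = NULL)

mean(as.matrix(select(sparse_check, -year)) == 0)
```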
```{block, type = "rmdnote"}
Models that work best with text tend to be models designed for or otherwise appropriate for sparse data.
```
Algorithms that work well with sparse data are less important when text has been transformed to a non-sparse representation, such as with word embeddings (Chapter \@ref(embeddings)).
## Case study: removing stop words {#casestudystopwords}
We did not remove stop words (Chapter \@ref(stopwords)) in any of our models so far in this chapter. What impact will removing stop words have, and how do we know which stop word list is the best to use? The best way to answer these questions is with experimentation.
Removing stop words is part of \index{preprocessing}data preprocessing, so we define this step as part of our preprocessing recipe. Let's use the best model we've found so far (the linear SVM model from Section \@ref(firstregressionevaluation)) and switch in a different recipe in our modeling workflow.
Let's build a small recipe wrapper helper function so we can pass a value `stopword_name` to `step_stopwords()`.
```{r stopwordrec, dependson="scotussplit"}
stopword_rec <- function(stopword_name) {
recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text) %>%
step_stopwords(text, stopword_source = stopword_name) %>%
step_tokenfilter(text, max_tokens = 1e3) %>%
step_tfidf(text) %>%
step_normalize(all_predictors())
}
```
For example, now we can create a recipe that removes the Snowball stop words list\index{stop word lists!Snowball} by calling this function.
```{r dependson="stopwordrec"}
stopword_rec("snowball")
```
Next, let's set up a new workflow that has a model only, using `add_model()`. We start with the empty `workflow()` and then add our linear SVM regression model.
```{r svmwf, dependson="svmspec"}
svm_wf <- workflow() %>%
add_model(svm_spec)
svm_wf
```
Notice that for this workflow, there is no preprocessor yet: `Preprocessor: None`. This workflow uses the same linear SVM specification that we used in Section \@ref(firstmlregression), but we are going to combine several different preprocessing recipes with it, one for each stop word lexicon we want to try.
Now we can put this all together and fit these models that include stop word removal. We could create a little helper function for fitting like we did for the recipe, but we have printed out all three calls to `fit_resamples()` for extra clarity. Notice for each one that there are two arguments:
- A workflow, which consists of the linear SVM model specification and a data preprocessing recipe with stop word removal
- The same cross-validation folds we created earlier
```{r stopwordsres, dependson=c("svmwf", "stopwordrec", "scotusfolds")}
set.seed(123)
snowball_rs <- fit_resamples(
svm_wf %>% add_recipe(stopword_rec("snowball")),
scotus_folds
)
set.seed(234)
smart_rs <- fit_resamples(
svm_wf %>% add_recipe(stopword_rec("smart")),
scotus_folds
)
set.seed(345)
stopwords_iso_rs <- fit_resamples(
svm_wf %>% add_recipe(stopword_rec("stopwords-iso")),
scotus_folds
)
```
After fitting models to each of the cross-validation folds, these sets of results contain metrics computed for removing that set of stop words.
```{r dependson="stopwordsres"}
collect_metrics(smart_rs)
```
We can explore whether one of these sets of stop words performed better than the others by comparing the performance, for example in terms of RMSE as shown in Figure \@ref(fig:snowballrmse). This plot shows the five best models for each set of stop words, using `show_best()` applied to each via `purrr::map_dfr()`.
```{r snowballrmse, dependson="stopwordsres", fig.cap="Comparing model performance for predicting the year of Supreme Court opinions with three different stop word lexicons"}
word_counts <- tibble(name = c("snowball", "smart", "stopwords-iso")) %>%
mutate(words = map_int(name, ~length(stopwords::stopwords(source = .))))
list(snowball = snowball_rs,
smart = smart_rs,
`stopwords-iso` = stopwords_iso_rs) %>%
map_dfr(show_best, "rmse", .id = "name") %>%
left_join(word_counts, by = "name") %>%
mutate(name = paste0(name, " (", words, " words)"),
name = fct_reorder(name, words)) %>%
ggplot(aes(name, mean, color = name)) +
geom_crossbar(aes(ymin = mean - std_err, ymax = mean + std_err), alpha = 0.6) +
geom_point(size = 3, alpha = 0.8) +
theme(legend.position = "none") +
labs(x = NULL, y = "RMSE",
title = "Model performance for three stop word lexicons",
subtitle = "For this data set, the Snowball lexicon performed best")
```
The \index{stop word lists!Snowball}Snowball lexicon contains the smallest number of words (see Figure \@ref(fig:stopwordoverlap)) and, in this case, results in the best performance; removing fewer stop words works better here.
```{block, type = "rmdwarning"}
This result is not generalizable to all data sets and contexts, but the approach outlined in this section **is** generalizable.
```
This approach can be used to compare different lexicons and find the best one for a specific data set and model. Notice how the results for all stop word lexicons are worse than removing no stop words at all (remember that the RMSE was `r first_attemp_rmse` years in Section \@ref(firstregressionevaluation)). This indicates that, for this particular data set, removing even a small stop word list is not a great choice.
When removing stop words does appear to help a model, it's good to know that the cost of this improvement is low, since removing stop words is neither computationally slow nor difficult.\index{computational speed}
## Case study: varying n-grams {#casestudyngrams}
Each model trained so far in this chapter has involved single words or _unigrams_, but using \index{tokenization!n-gram}n-grams (Section \@ref(tokenizingngrams)) can integrate different kinds of information into a model. Bigrams and trigrams (or even higher-order n-grams) capture concepts that span single words, as well as effects from word order, that can be predictive.
This is another part of data preprocessing\index{preprocessing}, so we again define this step as part of our preprocessing recipe. Let's build another small recipe wrapper helper function so we can pass a list of options `ngram_options` to `step_tokenize()`. We'll use it with the same model as the previous section.
```{r ngramrec, dependson="scotussplit"}
ngram_rec <- function(ngram_options) {
recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text, token = "ngrams", options = ngram_options) %>%
step_tokenfilter(text, max_tokens = 1e3) %>%
step_tfidf(text) %>%
step_normalize(all_predictors())
}
```
There are two options we can specify, `n` and `n_min`, when we are using `engine = "tokenizers"`. We can set up a recipe with only `n = 1` to tokenize and only extract the unigrams.
```{r eval=FALSE}
ngram_rec(list(n = 1))
```
We can use `n = 3, n_min = 1` to identify the set of all trigrams, bigrams, _and_ unigrams.\index{tokenization!n-gram}
```{r eval=FALSE}
ngram_rec(list(n = 3, n_min = 1))
```
```{block, type = "rmdnote"}
Including n-grams of different orders in a model (such as trigrams, bigrams, plus unigrams) allows the model to learn at different levels of linguistic organization and context.
```
We can reuse the same workflow `svm_wf` from our earlier case study; these types of modular components are a benefit to adopting this approach to supervised machine learning. This workflow provides the linear SVM specification. Let's put it all together and create a helper function to use `fit_resamples()` with this model plus our helper recipe function.
```{r fitngram, dependson=c("svmwf", "ngramrec", "scotusfolds")}
fit_ngram <- function(ngram_options) {
fit_resamples(
svm_wf %>% add_recipe(ngram_rec(ngram_options)),
scotus_folds
)
}
```
```{block2, type = "rmdwarning"}
We could have created this type of small function for trying out different stop word lexicons in Section \@ref(casestudystopwords), but there we showed each call to `fit_resamples()` for extra clarity.
```
With this helper function, let's try out predicting the year of Supreme Court opinions using:
- only unigrams
- bigrams and unigrams
- trigrams, bigrams, and unigrams
```{r ngramrs, dependson=c("fitngram"), eval=FALSE}
set.seed(123)
unigram_rs <- fit_ngram(list(n = 1))
set.seed(234)
bigram_rs <- fit_ngram(list(n = 2, n_min = 1))
set.seed(345)
trigram_rs <- fit_ngram(list(n = 3, n_min = 1))
```
These sets of results contain metrics computed for the model with that tokenization strategy.\index{tokenization!n-gram}
```{r dependson="ngramrs", eval=FALSE}
collect_metrics(bigram_rs)
```
```{r, echo=FALSE, message=FALSE}
readr::read_csv("inst/collect_metrics_bigram_rs.csv")
```
We can compare the performance of these models in terms of RMSE as shown in Figure \@ref(fig:ngramrmse).
```{r, dependson="ngramrs", fig.cap="Comparing model performance for predicting the year of Supreme Court opinions with three different degrees of n-grams", eval=FALSE}
list(`1` = unigram_rs,
`1 and 2` = bigram_rs,
`1, 2, and 3` = trigram_rs) %>%
map_dfr(collect_metrics, .id = "name") %>%
filter(.metric == "rmse") %>%
ggplot(aes(name, mean, color = name)) +
geom_crossbar(aes(ymin = mean - std_err, ymax = mean + std_err), alpha = 0.6) +
geom_point(size = 3, alpha = 0.8) +
theme(legend.position = "none") +
labs(
x = "Degree of n-grams",
y = "RMSE",
title = "Model performance for different degrees of n-gram tokenization",
subtitle = "For the same number of tokens, unigrams performed best"
)
```
```{r ngramrmse, dependson="ngramrs", fig.cap="Comparing model performance for predicting the year of Supreme Court opinions with three different degrees of n-grams", echo=FALSE, message=FALSE}
readr::read_csv("inst/collect_metrics_all_ngram.csv") %>%
ggplot(aes(name, mean, color = name)) +
geom_crossbar(aes(ymin = mean - std_err, ymax = mean + std_err), alpha = 0.6) +
geom_point(size = 3, alpha = 0.8) +
theme(legend.position = "none") +
labs(
x = "Degree of n-grams",
y = "RMSE",
title = "Model performance for different degrees of n-gram tokenization",
subtitle = "For the same number of tokens, unigrams performed best"
)
```
Each of these models was trained with `max_tokens = 1e3`, i.e., including only the top 1000 tokens for each tokenization strategy. Holding the number of tokens constant, using unigrams alone performs best for this corpus of Supreme Court opinions. To be able to incorporate the more complex information in bigrams or trigrams, we would need to increase the number of tokens in the model considerably.
Keep in mind that adding n-grams\index{tokenization!n-gram} is computationally expensive\index{computational speed} to start with, especially compared to the typical improvement in model performance gained. We can benchmark the whole model workflow, including preprocessing\index{preprocessing} and modeling. Using bigrams plus unigrams takes more than twice as long to train as using only unigrams (with the number of tokens held constant), and adding trigrams as well takes almost five times as long as training on unigrams alone.
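If you want to measure this on your own machine, a minimal sketch using `system.time()` looks like the following; exact timings will vary, and packages like **bench** provide more detailed measurements.

```{r, eval=FALSE}
## Rough timing comparison: unigrams only vs. bigrams plus unigrams
system.time(
  fit_resamples(svm_wf %>% add_recipe(ngram_rec(list(n = 1))), scotus_folds)
)

system.time(
  fit_resamples(svm_wf %>% add_recipe(ngram_rec(list(n = 2, n_min = 1))), scotus_folds)
)
```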
## Case study: lemmatization {#mlregressionlemmatization}
As we discussed in Section \@ref(lemmatization), we can normalize words to their roots or \index{lemma}**lemmas** based on each word's context and the structure of a language. Table \@ref(tab:lemmatb) shows both the original words and the lemmas for one sentence from a Supreme Court opinion, using lemmatization implemented via the [spaCy](https://spacy.io/) library as made available through the **spacyr** R package [@Benoit19].
```{r lemmatb, echo=FALSE}
spacyr::spacy_initialize(entity = FALSE)
paste(
"However, the Court of Appeals disagreed with the District Court's",
"construction of the state statute, concluding that it did authorize",
"issuance of the orders to withhold to the Postal Service.") %>%
spacyr::spacy_parse() %>%
select(`original word` = token, lemma) %>%
knitr::kable(
caption = "Lemmatization of one sentence from a Supreme Court opinion",
booktabs = TRUE
)
```
Notice several things about lemmatization\index{lemma} that are different from the kind of default tokenization (Chapter \@ref(tokenization)) you may be more familiar with.
- Words are converted to lowercase except for proper nouns.
- The lemma for pronouns is `-PRON-`.
- Words are converted from their existing form in the text to their canonical roots, like "disagree" and "conclude".
- Irregular verbs are converted to their canonical form ("did" to "do").
Using lemmatization\index{lemma} instead of a more straightforward tokenization strategy is slower because of the increased complexity involved, but it can be worth it. Let's explore how to train a model using _lemmas_ instead of _words_.
Lemmatization is, like choices around n-grams and stop words, part of data preprocessing\index{preprocessing} so we define how to set up lemmatization as part of our preprocessing recipe. We use `engine = "spacyr"` for tokenization (instead of the default) and add `step_lemma()` to our preprocessing. This step extracts the lemmas from the parsing done by the tokenization engine.
```{r lemmarecspacy, dependson=c("lemmatb", "scotussplit"), results='hide'}
spacyr::spacy_initialize(entity = FALSE)
```
```{r lemmarec, dependson=c("lemmatb", "scotussplit"), message=FALSE}
lemma_rec <- recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text, engine = "spacyr") %>%
step_lemma(text) %>%
step_tokenfilter(text, max_tokens = 1e3) %>%
step_tfidf(text) %>%
step_normalize(all_predictors())
lemma_rec
```
Let's combine this lemmatized\index{lemma} text with our linear SVM workflow. We can then fit our workflow to our resampled data sets and estimate performance using lemmatization.
```{r lemmars, dependson=c("svmwf", "lemmarec", "scotusfolds"), eval=FALSE}
lemma_rs <- fit_resamples(
svm_wf %>% add_recipe(lemma_rec),
scotus_folds
)
```
How did this model perform?
```{r dependson="lemmars", eval=FALSE}
collect_metrics(lemma_rs)
```
```{r, echo=FALSE, message=FALSE}
collect_metrics_lemma_rs <- readr::read_csv("inst/collect_metrics_lemma_rs.csv")
collect_metrics_lemma_rs
```
The best value for RMSE at `r collect_metrics_lemma_rs %>% filter(.metric == "rmse") %>% pull(mean) %>% round(1)` shows us that using lemmatization\index{lemma} can have a significant benefit for model performance, compared to `r first_attemp_rmse` from fitting a non-lemmatized linear SVM model in Section \@ref(firstregressionevaluation). The best model using lemmatization is better than the best model without. However, this comes at a cost of much slower training because of the procedure involved in identifying lemmas; adding `step_lemma()` to our preprocessing increases the overall time to train the workflow by over 10-fold.\index{computational speed}
```{block, type="rmdnote"}
We can use `engine = "spacyr"` to assign part-of-speech tags to the tokens during tokenization, and this information can be used in various useful ways in text modeling. One approach is to filter tokens to only retain a certain part of speech, like nouns. An example of how to do this is illustrated in this [**textrecipes** blogpost](https://www.emilhvitfeldt.com/post/tidytuesday-pos-textrecipes-the-office/) and can be performed with `step_pos_filter()`.
```
\index{part of speech}
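As a sketch of that idea (we are assuming the `keep_tags` argument of `step_pos_filter()` here; check the current **textrecipes** documentation before relying on it), a recipe that keeps only nouns might look like this.

```{r, eval=FALSE}
## Keep only tokens tagged as nouns by spacyr before computing tf-idf features
noun_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_pos_filter(text, keep_tags = "NOUN") %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text)
```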
## Case study: feature hashing
The models we have created so far have used tokenization (Chapter \@ref(tokenization)) to split apart text data into tokens that are meaningful to us as human beings (words, bigrams) and then weighted these tokens by simple counts with word frequencies or weighted counts with tf-idf.\index{tf-idf}
A problem with these methods is that the output space can be vast and dynamic.
We have limited ourselves to 1000 tokens so far in this chapter, but we could easily have more than 10,000 features in our training set.
We may run into computational problems with memory or long processing times; deciding how many tokens to include can become a trade-off between computational time and information.
This style of approach also doesn't let us take advantage of new tokens we didn't see in our training data.
One method that has gained popularity in the machine learning field is the **hashing trick**.
This method addresses many of the challenges outlined above and is very fast with a low memory footprint.
Let's start with the basics of feature hashing.\index{hashing function}
First proposed by @Weinberger2009, feature hashing was introduced as a dimensionality reduction method with a simple premise.
We begin with a hashing function that we then apply to our tokens.
```{block, type = "rmdwarning"}
A hashing function takes input of variable size and maps it to output of a fixed size. Hashing functions are commonly used in cryptography.
```
We will use the `hash()` function from the **rlang** package to illustrate the behavior of hashing functions.
The `rlang::hash()` function uses the XXH128 hash algorithm of the xxHash library, which generates a 128-bit hash. This is a more complex hashing function than what is normally used for the hashing trick. The 32-bit version of MurmurHash3 [@appleby2008] is often used for its speed and good properties.
```{block, type = "rmdnote"}
Hashing functions are typically very fast and have certain properties. For example, the output of a hash function is expected to be uniform, with the whole output space filled evenly. The "avalanche effect" describes how similar strings are hashed in such a way that their hashes are not similar in the output space.
```
Suppose we have many country names in a character vector.
We can apply the hashing function\index{hashing function} to each of the country names to project them into an integer space defined by the hashing function.
Since `hash()` creates hashes that are very long, let's create `small_hash()` for demonstration purposes here that generates slightly smaller hashes. (The specific details of what hashes are generated are not important here.)
```{r}
library(rlang)
countries <- c("Palau", "Luxembourg", "Vietnam", "Guam", "Argentina",
"Mayotte", "Bouvet Island", "South Korea", "San Marino",
"American Samoa")
small_hash <- function(x) {
strtoi(substr(hash(x), 26, 32), 16)
}
map_int(countries, small_hash)
```
Our `small_hash()` function uses `7 * 4 = 28` bits, so the number of possible values is `2^28 = 268435456`. This is admittedly not much of an improvement over 10 country names.
Let's take the modulo of these big integer values to project them down to a more manageable space.
```{r}
map_int(countries, small_hash) %% 24
```
Now we can use these values as indices when creating a matrix.\index{matrix!sparse}
```{r, echo=FALSE}
Matrix::sparseMatrix(1:10, map_int(countries, small_hash) %% 24,
dims = c(10, 24),
dimnames = list(countries, NULL))
```
This method is very fast\index{computational speed}; both the hashing and modulo can be performed independently for each input since neither needs information about the full corpus.
Since we are reducing the space, there is a chance that multiple words are hashed to the same value.
This is called a collision and, at first glance, it seems like it would be a big problem for a model.
However, research finds that using feature hashing has roughly the same accuracy as a simple bag-of-words model, and the effect of collisions is quite minor [@Forman2008].
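We can see this at a small scale with our country names by counting how many of the 10 names end up sharing a bucket, first with 24 buckets and then with only 4 buckets, where collisions are unavoidable.

```{r}
## Count how many names land in a bucket that is already occupied
buckets_24 <- map_int(countries, small_hash) %% 24
sum(duplicated(buckets_24))

buckets_4 <- map_int(countries, small_hash) %% 4
sum(duplicated(buckets_4))
```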
```{block, type = "rmdnote"}
Another step that is taken to avoid the negative effects of hash collisions is to use a _second_ hashing function that returns 1 and −1. This determines whether we add or subtract at the index we get from the first hashing function. Suppose both the words "outdoor" and "pleasant" hash to the integer value 583. Without the second hashing function, both would add 1 to the feature at index 583, giving a count of 2. Using signed hashing, we have a 50% chance that they will cancel each other out, which keeps any one feature from growing too much.
```
There are downsides to using feature hashing.\index{hashing function}\index{hashing function!challenges} Feature hashing:
- still has one tuning parameter, and
- cannot be reversed.
The number of buckets you have correlates with computation speed and collision rate, which in turn affects performance.
It is your job to find the number of buckets that best suits your needs.
Increasing the number of buckets will decrease the collision rate but will, in turn, return a larger output data set, which increases model fitting time.
The number of buckets is tunable in tidymodels using the **tune** package.
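Here is a sketch of what that could look like, assuming `num_terms` is registered as tunable in your version of **textrecipes**; we use a hand-made grid of bucket counts rather than a dials parameter object.

```{r, eval=FALSE}
## Tune the number of hash buckets instead of fixing it at a single value
hash_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_texthash(text, signed = TRUE, num_terms = tune())

hash_grid <- tibble(num_terms = 2 ^ (8:12))

hash_rs <- tune_grid(
  svm_wf %>% add_recipe(hash_rec),
  scotus_folds,
  grid = hash_grid
)
```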
Perhaps the more important downside to using feature hashing is that the operation can't be reversed.
We are not able to detect if a collision occurs and it is difficult to understand the effect of any word in the model.
Remember that we are left with `n` columns of _hashes_ (not tokens), so if we find that the 274th column is a highly predictive feature, we cannot know in general which tokens contribute to that column.
We cannot directly connect model values to words or tokens at all.
We could go back to our training set and create a paired list of the tokens and what hashes they map to. Sometimes only one token maps to a given hash value, but a hash may have two (or three, or four, or more!) different tokens contributing to it.
This feature hashing method is used because of its speed and scalability, not because it is interpretable.
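To get a feel for what such a paired list looks like, here is a toy version using `small_hash()` and a handful of tokens; the hashing used by `step_texthash()` is different, so treat this purely as an illustration.

```{r}
## A toy paired list of tokens and the bucket each one maps to; any bucket
## could end up receiving more than one token
tokens <- c("court", "appeal", "petitioner", "state", "order", "statute")
tibble(token = tokens, bucket = map_int(tokens, small_hash) %% 24)
```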
Feature hashing on tokens\index{hashing function} is available in tidymodels using the `step_texthash()` step from **textrecipes**. Let's `prep()` and `bake()` this recipe for demonstration purposes.
```{r scotushash, dependson="scotussplit"}
scotus_hash <- recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text) %>%
step_texthash(text, signed = TRUE, num_terms = 512) %>%
prep() %>%
bake(new_data = NULL)
dim(scotus_hash)
```
There are many columns in the results. Let's take a `glimpse()` at the first 10 columns.
```{r dependson="scotushash"}
scotus_hash %>%
select(num_range("text_hash00", 1:9)) %>%
glimpse()
```
By using `step_texthash()` we can quickly generate machine-ready data with a consistent number of variables.
This typically results in a slight loss of performance compared to using a traditional bag-of-words representation. An example of this loss is illustrated in this [**textrecipes** blogpost](https://www.emilhvitfeldt.com/post/textrecipes-series-featurehashing/).
### Text normalization
\index{preprocessing!challenges}When working with text, you will inevitably run into problems with encodings and related irregularities.
These kinds of problems have a significant influence on feature hashing\index{hashing function}, as well as other preprocessing steps.
Consider the German word "schön".\index{language!Non-English}
The o with an umlaut (two dots over it) is a fairly simple character, but it can be represented in a couple of different ways.
We can use a single character [\\U00f6](https://www.fileformat.info/info/unicode/char/00f6/index.htm) to represent the letter with an umlaut.
Alternatively, we can use two characters: one for the o, and one combining character [\\U0308](https://www.fileformat.info/info/unicode/char/0308/index.htm) to denote the two dots over the previous character.
```{r}
s1 <- "sch\U00f6n"
s2 <- "scho\U0308n"
```
These two strings will print the same for us as human readers.
```{r}
s1
s2
```
However, they are not equal.
```{r}
s1 == s2
```
This interacts badly with the avalanche effect, which is otherwise needed for feature hashing to perform correctly: because of the avalanche effect, these two encodings of the same word will hash to completely different values.
```{r}
small_hash(s1)
small_hash(s2)
```
We can deal with this problem by performing **text normalization** on our text before feeding it into our preprocessing\index{preprocessing} engine.
One library to perform text normalization is the **stringi** package, which includes many different text normalization methods.
How these methods work is beyond the scope of this book, but know that the text normalization functions make text like our two versions of "schön"\index{language!Non-English} equivalent. We will use `stri_trans_nfc()` for this example, which performs canonical decomposition, followed by canonical composition, but we could also use `textrecipes::step_text_normalize()` within a tidymodels recipe for the same task.
```{r}
library(stringi)
stri_trans_nfc(s1) == stri_trans_nfc(s2)
small_hash(stri_trans_nfc(s1))
small_hash(stri_trans_nfc(s2))
```
Now we see that the strings are equal after normalization.
```{block, type = "rmdwarning"}
This issue of text normalization can be important even if you don't use feature hashing in your machine learning.
```
\index{hashing function}
Since these words are encoded in different ways, they will be counted separately when we are counting token frequencies.
Representing what should be a single token in multiple ways will split the counts. This will introduce noise in the best case, and in worse cases, some tokens will fall below the cutoff when we select tokens, leading to a loss of potentially informative words.
Luckily this is easily addressed by using `stri_trans_nfc()` on our text columns _before_ starting preprocessing, or perhaps more conveniently, by using `textrecipes::step_text_normalize()` _within_ a preprocessing recipe.
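A minimal sketch of the recipe-based approach could look like the following; we rely on the step's default normalization form, so check the **textrecipes** documentation for the available options.

```{r, eval=FALSE}
## Normalize the text before tokenizing so equivalent encodings are counted together
normalized_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_text_normalize(text) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text)
```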
## What evaluation metrics are appropriate?
We have focused on using RMSE and $R^2$ as metrics for our models in this chapter, the defaults in the tidymodels framework. Other metrics can also be appropriate for regression models. Another common set of regression metric options is the various flavors of mean absolute error.
If you know before you fit your model that you want to compute one or more of these metrics, you can specify them in a call to `metric_set()`. Let's set up a metric set with mean absolute error (`mae`) and mean absolute percent error (`mape`).
```{r eval=FALSE}
lemma_rs <- fit_resamples(
svm_wf %>% add_recipe(lemma_rec),
scotus_folds,
metrics = metric_set(mae, mape)
)
```
If you have already fit your model, you can still compute and explore non-default metrics as long as you saved the predictions for your resampled data sets using `control_resamples(save_pred = TRUE)`.
Let's go back to the first linear SVM model we evaluated in Section \@ref(firstregressionevaluation), with results in `svm_rs`. We can compute the overall mean absolute percent error.
```{r dependson="scotustuners"}
svm_rs %>%
collect_predictions() %>%
mape(year, .pred)
```
We can also compute the mean absolute percent error for each resample.
```{r dependson="scotustuners"}
svm_rs %>%
collect_predictions() %>%
group_by(id) %>%
mape(year, .pred)
```
Similarly, we can do the same for the mean absolute error, which gives a result in units of the original data (years, in this case) instead of relative units.
```{r dependson="scotustuners"}
svm_rs %>%
collect_predictions() %>%
group_by(id) %>%
mae(year, .pred)
```
```{block, type = "rmdnote"}
For the full set of regression metric options, see the [yardstick documentation](https://yardstick.tidymodels.org/reference/).
```
## The full game: regression {#mlregressionfull}
In this chapter, we started from the beginning and then explored both different types of models and different data preprocessing steps. Let's take a step back and build one final model, using everything we've learned. For our final model, let's again use a linear SVM regression model, since it performed better than the other options we looked at. We will:
- train on the same set of cross-validation resamples used throughout this chapter,
- _tune_ the number of tokens used in the model to find a value that fits our needs,
- include both unigrams and bigrams\index{tokenization!n-gram},
- choose not to use lemmatization\index{lemma}, to demonstrate what is possible for situations when training time makes lemmatization an impractical choice, and
- finally evaluate on the testing set, which we have not touched at all yet.
We will include a much larger number of tokens than before, which should give us the latitude to include both unigrams and bigrams, despite the result we saw in Section \@ref(casestudyngrams).
```{block, type = "rmdnote"}
Be aware that the tuning calculations we demonstrate here are computationally expensive, and take a long time to complete.
```
### Preprocess the data
First, let's create the data preprocessing recipe. By setting the tokenization options to `list(n = 2, n_min = 1)`, we will include both unigrams and bigrams in our model.
When we set `max_tokens = tune()`, we can train multiple models with different numbers of maximum tokens and then compare these models' performance to choose the best value. Previously, we set `max_tokens = 1e3` to choose a specific value for the number of tokens included in our model, but here we are going to try multiple different values.
```{r finalscotusrec, dependson="scotussplit"}
final_rec <- recipe(year ~ text, data = scotus_train) %>%
step_tokenize(text, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
step_tokenfilter(text, max_tokens = tune()) %>%
step_tfidf(text) %>%
step_normalize(all_predictors())
final_rec
```
### Specify the model
Let's use the same linear SVM regression model specification we have used multiple times in this chapter, and set it up here again to remind ourselves.
```{r scotussvmspec2}
svm_spec <- svm_linear() %>%
set_mode("regression") %>%
set_engine("LiblineaR")
svm_spec
```
We can combine the preprocessing recipe and the model specification in a tunable workflow. We can't fit this workflow right away to training data, because the value for `max_tokens` hasn't been chosen yet.
```{r tunescotuswf, dependson=c("finalscotusrec", "scotussvmspec2")}
tune_wf <- workflow() %>%
add_recipe(final_rec) %>%
add_model(svm_spec)
tune_wf
```
### Tune the model
\index{models!tuning}Before we tune the model, we need to set up a set of possible parameter values to try.
```{block, type = "rmdwarning"}
There is _one_ tunable parameter in this model, the maximum number of tokens included in the model.
```
Let's include different possible values for this parameter starting from the value we've already tried, for a combination of six models.
```{r finalscotusgrid}
final_grid <- grid_regular(
max_tokens(range = c(1e3, 6e3)),
levels = 6
)
final_grid
```
Now it's time for tuning. Instead of using `fit_resamples()` as we have throughout this chapter, we are going to use `tune_grid()`, a function that has a very similar set of arguments. We pass this function our workflow (which holds our preprocessing recipe and SVM model), our resampling folds, and also the grid of possible parameter values to try. Let's save the predictions so we can explore them in more detail, and let's also set custom metrics instead of using the defaults. Let's compute RMSE, mean absolute error, and mean absolute percent error during tuning.
```{r finalscotusrs, dependson=c("finalscotusgrid", "tunescotuswf", "finalscotusrec", "scotusfolds")}
final_rs <- tune_grid(
tune_wf,
scotus_folds,
grid = final_grid,
metrics = metric_set(rmse, mae, mape),
control = control_resamples(save_pred = TRUE)
)
final_rs
```
We trained all these models!
### Evaluate the modeling {#regression-final-evaluation}
Now that all of the models with possible parameter values have been trained, we can compare their performance. Figure \@ref(fig:scotusfinaltunevis) shows us the relationship between performance (as measured by the metrics we chose) and the number of tokens.
```{r scotusfinaltunevis, dependson="finalscotusrs", opts.label = "fig.square", fig.cap="Performance improves significantly at about 4000 tokens"}
final_rs %>%
collect_metrics() %>%
ggplot(aes(max_tokens, mean, color = .metric)) +
geom_line(size = 1.5, alpha = 0.5) +
geom_point(size = 2, alpha = 0.9) +
facet_wrap(~.metric, scales = "free_y", ncol = 1) +
theme(legend.position = "none") +
labs(
x = "Number of tokens",
title = "Linear SVM performance across number of tokens",
subtitle = "Performance improves as we include more tokens"
)
```
Since this is our final version of this model, we want to choose final parameters and update our model object so we can use it with new data. We have several options for choosing our final parameters, such as selecting the numerically best model (which would be one of the ones with the most tokens in our situation here) or the simplest model within some limit around the numerically best result. In this situation, we likely want to choose a simpler model with fewer tokens that gives close-to-best performance.
Let's choose by percent loss compared to the best model, with a limit of 3% loss.
```{r chosenmae, dependson="finalscotusrs"}
chosen_mae <- final_rs %>%