---
title: "INFSCI 2595 Spring 2023 Homework: 07"
subtitle: "Assigned March 4, 2023; Due: March 19, 2023"
author: "Navodita Mathur"
date: "Submission time: March 19, 2023 at 11:00PM EST"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#### Collaborators
Include the names of your collaborators here.
## Overview
This homework assignment is focused on working with linear models. You will use formulas discussed in lecture to study the important concepts associated with fitting linear models. You will practice fitting, predicting, and assessing linear model performance on a simple example. You will then apply those concepts to a more realistic application. You will train numerous linear basis function models ranging from simple to very complex. You will identify the best model considering training and hold-out test set performance. You will contrast the model selection based on a train/test split with an information metric via the log-Evidence (log marginal likelihood).
**IMPORTANT**: The RMarkdown assumes you have downloaded the data sets (CSV files) to the same directory you saved the template Rmarkdown file. If you do not have the CSV files in the correct location, the data will not be loaded correctly.
### IMPORTANT!!!
Certain code chunks are created for you. Each code chunk has `eval=FALSE` set in the chunk options. You **MUST** change it to be `eval=TRUE` in order for the code chunks to be evaluated when rendering the document.
You are free to add more code chunks if you would like.
## Load packages
You will use the `tidyverse` in this assignment, as you have done in the previous assignments.
```{r, load_packages}
library(tidyverse)
```
This assignment also uses the `splines` and `MASS` packages. Both are installed with base `R` and so you do not need to download any additional packages to complete the assignment.
## Problem 01
We introduced many important regression topics at the beginning of the semester. Our primary example back then involved a continuous output $y$ and a single continuous input $x$. The mean trend, $\mu_n$, was related to the input via a quadratic relationship:
$$
\mu_n = \beta_0 + \beta_1 x_n + \beta_2 x_{n}^{2}
$$
We did not go into the mathematical details of the regression problem earlier in the semester. Instead, we conceptually introduced uncertainty and confidence intervals. We also discussed their impact on selecting the best model. However, we have recently covered the necessary mathematical and statistical fundamentals to calculate the uncertainty! You will begin by assuming the likelihood noise, $\sigma$, is known in order to focus on the regression coefficients (the $\boldsymbol{\beta}$ parameters). You will therefore study the *scaled* or *relative* coefficient uncertainty. This is a useful starting point because it allows you to focus on the impact the **features** have on learning the regression coefficients.
You will study the *scaled* regression coefficient uncertainty several ways. You will compare the posterior uncertainty based on informative, vague, and no prior specifications with "big" and "small" data applications.
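Before working with the homework data, the "big" versus "small" data comparison can be previewed with a quick self-contained sketch. The inputs below are synthetic (they are NOT the homework CSV): with an infinitely diffuse prior and $\sigma = 1$, the scaled posterior standard deviations come from the diagonal of $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$, and they shrink as observations are added.

```{r, uncertainty_preview_sketch}
### illustrative sketch with SYNTHETIC inputs (not the homework data):
### scaled posterior standard deviations shrink as observations are added
x_few  <- seq(-2, 2, length.out = 5)
x_many <- seq(-2, 2, length.out = 30)

### quadratic design matrices: intercept, linear, and squared features
X_few  <- cbind(1, x_few, x_few^2)
X_many <- cbind(1, x_many, x_many^2)

### scaled posterior standard deviations (sigma = 1, diffuse prior)
sd_few  <- sqrt(diag(solve(t(X_few) %*% X_few)))
sd_many <- sqrt(diag(solve(t(X_many) %*% X_many)))

sd_few
sd_many
```

Every coefficient's scaled standard deviation is smaller with 30 points than with 5, which is the behavior you will study with the actual data below.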
The data from the "quadratic demo" are read in the code chunk below and assigned to the `quad_data_big` object.
```{r, read_quad_data}
quad_data_big <- readr::read_csv('quad_demo_data.csv', col_names = TRUE)
```
The glimpse below shows that there are 30 rows for the two continuous variables, the input `x` and the output `y`.
```{r, show_quad_data_glimpse}
quad_data_big %>% glimpse()
```
The data are labeled as "big" because the 30 observations will serve the role as the "big" or "large" data set in this example. The "small" data version is created for you in the code chunk below by slicing the first 5 rows.
```{r, make_small_quad_data}
quad_data_small <- quad_data_big %>% slice(1:5)
```
The entire small data set is displayed for you below.
```{r, show_quad_data_small}
quad_data_small
```
### 1a)
Let's start by visualizing the output to input relationship. You will use the `color` aesthetic to visually identify the first five rows which were separated out to create the small version of the data. You can add a row identifier column via the `tibble::rowid_to_column()` function. By default the new column will be named, `rowid`.
**Create a scatter plot using `ggplot2` and the "full" or "big" data set to show the relationship between the output `y` and the input `x`. You should use `tibble::rowid_to_column()` to add the row identifier column. Map the `color` aesthetic within `geom_point()` to `rowid > 5` to identify the first five rows. Set the marker size to 3.**
*HINT*: Don't forget the importance of the `aes()` function!!!!!!!
#### SOLUTION
Insert your code here.
```{r}
quad_data_big %>%
  tibble::rowid_to_column() %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = rowid > 5), size = 3)
```
### 1b)
As stated in lecture, we need to create the **design matrix** before we can calculate anything associated with linear models. The design matrix contains the features. Those features can be input variables as well as quantities derived from the inputs. Features can therefore be *functions of* the inputs!
You must create **TWO** design matrices in this problem. Both design matrices will use a **quadratic** relationship between the input `x` and the mean trend. The first design matrix, `Xmat_quad_small`, corresponds to the sliced "small" data. The second design matrix, `Xmat_quad_big`, corresponds to the original "big" data.
You are allowed to use the `model.matrix()` function shown in lecture to create the design matrices.
**Create two design matrices. Both must use a quadratic relationship between the input and the trend. The comments and variable names in the code chunks below state which data set to use in creating the design matrix.**
*HINT*: You created polynomial features in previous assignments. Please review the syntax from those earlier assignments if you forget how to create the polynomial feature via the formula interface.
#### SOLUTION
```{r, solution_for_1b_a, eval=TRUE}
### make the design matrix for the SMALL data
Xmat_quad_small <- model.matrix(y ~ x + I(x^2), data = quad_data_small)
Xmat_quad_small
```
```{r, solution_for_1b_b, eval=TRUE}
### make the design matrix for the BIG data
Xmat_quad_big <- model.matrix(y ~ x + I(x^2), data = quad_data_big)
Xmat_quad_big
```
### 1c)
**How many rows and columns does EACH design matrix have?**
**Display the entire small data design matrix to the screen. What values are contained in the FIRST column and why?**
#### SOLUTION
Insert code chunks here.
```{r}
#Number of columns in design matrix of small dataset
nc_Xmat_quad_small = ncol(Xmat_quad_small)
nc_Xmat_quad_small
```
```{r}
#Number of rows in design matrix of small dataset
nr_Xmat_quad_small = nrow(Xmat_quad_small)
nr_Xmat_quad_small
```
```{r}
#Number of columns in design matrix of big dataset
nc_Xmat_quad_big = ncol(Xmat_quad_big)
nc_Xmat_quad_big
```
```{r}
#Number of rows in design matrix of big dataset
nr_Xmat_quad_big = nrow(Xmat_quad_big)
nr_Xmat_quad_big
```
```{r}
Xmat_quad_small
```
The first column contains all 1s. It is the intercept column: the constant feature that multiplies $\beta_0$ for every observation.
### 1d)
Let's now begin to study the posterior coefficient uncertainty. As stated previously, you will assume $\sigma$ is known. Let's keep things simple with an assumed value of $\sigma = 1$. If this bothers you, you can also think of this as examining the *scaled* uncertainty. We discussed in lecture that $\sigma$ acts as a scaling term on the posterior uncertainty. Thus, the "real" posterior uncertainty will be a multiple of what you are about to calculate. You are thus getting a rough idea on the behavior of the uncertainty due entirely to the inputs!
You will begin by assuming that the prior does not matter and thus are assuming an infinitely diffuse prior or a prior with infinite uncertainty.
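As a reminder from lecture, with an infinitely diffuse prior (and $\sigma$ assumed known), the posterior covariance matrix on the regression coefficients reduces to:

$$
\mathrm{cov}\left( \boldsymbol{\beta} \mid \mathbf{y} \right) = \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1}
$$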
**Calculate the posterior covariance matrix for the regression coefficients assuming $\sigma=1$. You must calculate the posterior covariance matrix for both design matrices.**
**The comments and variable names below state which data set to use.**
#### SOLUTION
```{r}
sigma = 1
```
```{r, solution_for_1d_a, eval=TRUE}
### posterior covariance matrix for the SMALL data
covmat_quad_small_noprior <- (sigma^2) * solve(t(Xmat_quad_small)%*%Xmat_quad_small)
covmat_quad_small_noprior
```
```{r, solution_for_1d_b, eval=TRUE}
### posterior covariance matrix for the BIG data
covmat_quad_big_noprior <- (sigma^2) * solve(t(Xmat_quad_big)%*%Xmat_quad_big)
covmat_quad_big_noprior
```
### 1e)
**How many rows and columns does EACH posterior covariance matrix have?**
#### SOLUTION
Insert code chunks here.
```{r}
nc_covmat_quad_small_noprior = ncol(covmat_quad_small_noprior)
nc_covmat_quad_small_noprior
```
```{r}
nr_covmat_quad_small_noprior = nrow(covmat_quad_small_noprior)
nr_covmat_quad_small_noprior
```
```{r}
nc_covmat_quad_big_noprior = ncol(covmat_quad_big_noprior)
nc_covmat_quad_big_noprior
```
```{r}
nr_covmat_quad_big_noprior = nrow(covmat_quad_big_noprior)
nr_covmat_quad_big_noprior
```
### 1f)
**What is the posterior standard deviation on the quadratic feature for the small data case? What is the posterior standard deviation on the quadratic feature for the big data case?**
#### SOLUTION
Insert code chunks here.
```{r}
sd_quad_small_noprior = sqrt(diag(covmat_quad_small_noprior))
sd_quad_small_noprior
```
```{r}
sd_quad_big_noprior = sqrt(diag(covmat_quad_big_noprior))
sd_quad_big_noprior
```
### 1g)
The uncertainty displayed in the previous question does not tell us if a feature is statistically significant or not. We need to know where the coefficient is "located" to answer that. We thus need an *estimate* for the coefficient! We need output measurements to estimate coefficients. Thus, uncertainty can be studied *before* output data are collected but *significance* can only be examined **after** output data are collected.
You will now estimate the coefficients by calculating their posterior mean assuming the prior does not matter. To help you with this question, the output variables are separated from the dataframes and converted to the correct data type for you in the code chunk below. The output column vector for the small data case is displayed below.
```{r, separate_output_columns_quad}
y_quad_big_col <- quad_data_big %>% pull(y) %>% as.matrix()
y_quad_small_col <- quad_data_small %>% pull(y) %>% as.matrix()
y_quad_small_col
```
You **MUST** continue to assume that the prior does not matter and thus are assuming an infinitely diffuse prior or a prior with infinite uncertainty.
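As a reminder, with an infinitely diffuse prior the posterior mean vector equals the ordinary least squares estimate:

$$
\boldsymbol{\beta}_{post} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{y}
$$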
**Calculate the regression coefficient posterior mean assuming $\sigma=1$ and the prior does not matter.**
**The comments and variable names below state which data set to use.**
**Display the posterior means for both data cases to the screen.**
**Would the values change if $\sigma$ was assumed to equal 5 instead of 1?**
#### SOLUTION
```{r, solution_for_1g_a, eval=TRUE}
### posterior mean vector for the SMALL data
postmeans_quad_small_noprior <- solve(t(Xmat_quad_small)%*%Xmat_quad_small)%*%(t(Xmat_quad_small)%*%y_quad_small_col)
```
```{r, solution_for_1g_b, eval=TRUE}
### posterior mean vector for the BIG data
postmeans_quad_big_noprior <- solve(t(Xmat_quad_big)%*%Xmat_quad_big)%*%(t(Xmat_quad_big)%*%y_quad_big_col)
```
Display the posterior means and answer the conceptual question.
```{r}
postmeans_quad_small_noprior
```
```{r}
postmeans_quad_big_noprior
```
No, the values would not change if $\sigma$ were assumed to equal 5 instead of 1. With the infinitely diffuse prior, the posterior mean is the ordinary least squares estimate, and $\sigma$ cancels out of that calculation.
### 1h)
**Calculate the posterior 95% uncertainty interval on the quadratic feature.**
**Assuming $\sigma=1$, is the quadratic feature considered to be statistically significant in the small data case? Assuming $\sigma=1$, is the quadratic feature considered to be statistically significant in the big data case?**
#### SOLUTION
Insert your code chunks.
```{r}
qnorm(c(0.025, 0.975),postmeans_quad_small_noprior[[3]], sd_quad_small_noprior[3])
```
```{r}
qnorm(c(0.025, 0.975),postmeans_quad_big_noprior[[3]], sd_quad_big_noprior[3])
```
What do you think?
Assuming $\sigma=1$, the quadratic feature is considered to be statistically significant in both the small and big data cases, because neither posterior 95% uncertainty interval contains zero.
## Problem 02
Let's now consider the influence of a prior on the posterior uncertainty. You will work with two priors in this problem. The first is an informative prior and the second is a vague or diffuse prior. Both priors are Multivariate Normal (MVN) distributions. Both assume the prior means are zero and the regression coefficients are independent. However, the prior standard deviations are different between the two priors.
**The informative prior assumes a prior standard deviation of 1.4 for each coefficient, while the vague prior assumes a prior standard deviation of 30 for each coefficient.**
### 2a)
You will specify the prior by creating the prior covariance matrix.
**Create 2 prior covariance matrices. The first is the prior covariance matrix for the informative prior and the second is the prior covariance matrix for the vague prior. The variable names and comments state which prior you are working with.**
**Display each prior covariance matrix to the screen.**
#### SOLUTION
```{r, solution_for_2a_a, eval=TRUE}
### prior covariance matrix for the INFORMATIVE prior
B0_inform <- (1.4^2)*diag(3)
```
```{r, solution_for_2a_b, eval=TRUE}
### prior covariance matrix for the VAGUE prior
B0_vague <- (30^2)*diag(3)
```
Insert code chunks to display the covariance matrices.
```{r}
B0_inform
```
```{r}
B0_vague
```
### 2b)
You will now calculate the posterior covariance matrix assuming $\sigma=1$ for the small and "big" data cases.
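As a reminder, with a zero-mean MVN prior with prior covariance matrix $\mathbf{B}_0$ and known likelihood noise $\sigma$, the posterior covariance matrix on the regression coefficients is:

$$
\mathrm{cov}\left( \boldsymbol{\beta} \mid \mathbf{y} \right) = \left( \mathbf{B}_{0}^{-1} + \frac{1}{\sigma^2} \mathbf{X}^{T} \mathbf{X} \right)^{-1}
$$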
First, let's focus on the results based on the **informative** prior.
**Calculate the posterior covariance matrix based on the informative prior for both the small and big data cases. The variable names and comments state which data set to use.**
#### SOLUTION
```{r}
sigma=1
```
```{r, solution_for_2b_a, eval=TRUE}
### posterior covariance matrix INFORMATIVE prior and SMALL data
covmat_quad_small_inform <- solve(solve(B0_inform) + (t(Xmat_quad_small) %*% Xmat_quad_small)/sigma^2)
covmat_quad_small_inform
```
```{r, solution_for_2b_b, eval=TRUE}
### posterior covariance matrix INFORMATIVE prior and BIG data
covmat_quad_big_inform <- solve(solve(B0_inform) + (t(Xmat_quad_big) %*% Xmat_quad_big)/sigma^2)
covmat_quad_big_inform
```
### 2c)
Next, let's consider the results based on the **vague** prior.
**Calculate the posterior covariance matrix based on the vague prior for both the small and big data cases. The variable names and comments state which data set to use.**
#### SOLUTION
```{r, solution_for_2c_a, eval=TRUE}
### posterior covariance matrix VAGUE prior and SMALL data
covmat_quad_small_vague <- solve(solve(B0_vague) + (t(Xmat_quad_small) %*% Xmat_quad_small)/sigma^2)
covmat_quad_small_vague
```
```{r, solution_for_2c_b, eval=TRUE}
### posterior covariance matrix VAGUE prior and BIG data
covmat_quad_big_vague <- solve(solve(B0_vague) + (t(Xmat_quad_big) %*% Xmat_quad_big)/sigma^2)
covmat_quad_big_vague
```
### 2d)
**What is the posterior standard deviation on the quadratic feature based on the INFORMATIVE prior for both data cases? Are the posterior standard deviations similar to the result from 1f)?**
#### SOLUTION
Insert your code chunks here.
```{r}
sd_quad_small_inform = sqrt(diag(covmat_quad_small_inform))
sd_quad_small_inform
```
```{r}
sd_quad_big_inform = sqrt(diag(covmat_quad_big_inform))
sd_quad_big_inform
```
### 2e)
**What is the posterior standard deviation on the quadratic feature based on the VAGUE prior for both data cases? Are the posterior standard deviations similar to the result from 1f)?**
#### SOLUTION
Insert your code chunks here.
```{r}
sd_quad_small_vague = sqrt(diag(covmat_quad_small_vague))
sd_quad_small_vague
```
```{r}
sd_quad_big_vague = sqrt(diag(covmat_quad_big_vague))
sd_quad_big_vague
```
### 2f)
**What is the influence of the prior on the linear model regression coefficients?**
#### SOLUTION
What do you think?
The prior shrinks the coefficients toward the prior mean and reduces the posterior uncertainty. The effect is noticeable for the small data set, but it is negligible for the large data set, where the likelihood dominates the prior.
## Problem 03
Now that you have worked through interpreting linear model results, it's time to fit a linear model! You will learn the unknown regression coefficients **and** the unknown likelihood noise, $\sigma$. From a Bayesian perspective, we no longer have simple formulas for the posterior distribution. We will use the Laplace Approximation to execute the Bayesian analysis.
This problem is focused on setting up the log-posterior function for a linear model. You will program the function using matrix math, such that you can easily scale your code from a linear relationship with a single input up to complex linear basis function models. You will assume independent Gaussian priors on all $\boldsymbol{\beta}$-parameters with a shared prior mean $\mu_{\beta}$ and shared prior standard deviation, $\tau_{\beta}$. An Exponential prior with rate parameter $\lambda$ will be assumed for the likelihood noise, $\sigma$. The complete probability model for the response, $y_n$, is shown below using the linear basis notation. The $n$-th row of the basis design matrix, $\boldsymbol{\Phi}$, is denoted as $\boldsymbol{\phi}_{n,:}$. It is assumed that the basis has $J$ degrees-of-freedom.
$$
y_n \mid \mu_n, \sigma \sim \mathrm{normal}\left(y_n \mid \mu_n, \sigma\right)
$$
$$
\mu_n = \boldsymbol{\phi}_{n,:}\boldsymbol{\beta}
$$
$$
\boldsymbol{\beta} \mid \mu_{\beta}, \tau_{\beta} \sim \prod_{j=0}^{J}\left( \mathrm{normal}\left( \beta_j \mid \mu_{\beta}, \tau_{\beta} \right) \right)
$$
$$
\sigma \mid \lambda \sim \mathrm{Exp}\left(\sigma \mid \lambda \right)
$$
### 3a)
You will fit the Bayesian linear model with the Laplace Approximation by programming the log-posterior function. However, before doing so you will create a list of required information, analogous to those you created in the previous homework assignments. The list will contain the data and the prior parameter specifications. The data consist of two major components: the output vector and the design matrix. You will need these details to execute the log-posterior function.
You will start out by fitting the quadratic trend function which you worked with in the previous two problems. As a reminder, the trend is:
$$
\mu_n = \beta_0 + \beta_1 x_{n} + \beta_2 x_{n}^{2}
$$
You will create the list of required information for the "big" data case which uses all observations.
**Complete the code chunk below by filling in the fields of the `info_quad` list. Assign the output vector to `yobs` and the "big" data design matrix to `design_matrix`. Specify the shared prior mean, `mu_beta`, to be 0, the shared prior standard deviation, `tau_beta`, as 3. The prior rate parameter on the noise, `sigma_rate`, should be assigned to 1.**
**Please make sure that you assign `yobs` as a "regular" vector and NOT as a matrix data type.**
#### SOLUTION
```{r, solution_03a, eval=TRUE}
info_quad <- list(
yobs = quad_data_big$y,
design_matrix = Xmat_quad_big,
mu_beta = 0,
tau_beta = 3,
sigma_rate = 1
)
```
### 3b)
You will now define the log-posterior function `lm_logpost()`. You will use the log-transformation on $\sigma$, and so you will actually define the log-posterior in terms of the regression coefficients, $\boldsymbol{\beta}$, and the unbounded noise parameter, $\varphi = \log\left[\sigma\right]$.
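The log-transformation introduces a change-of-variables (Jacobian) adjustment to the log-posterior. Since $\sigma = \exp\left(\varphi\right)$:

$$
\log \left| \frac{d\sigma}{d\varphi} \right| = \log\left( \exp\left(\varphi\right) \right) = \varphi
$$

This is why an additive term equal to $\varphi$ must be included in the log-posterior.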
The comments in the code chunk below tell you what you need to fill in. The unknown parameters to learn are contained within the first input argument, `unknowns`. You **must** assume that the unknown $\boldsymbol{\beta}$-parameters are before the unknown $\varphi$ parameter in the `unknowns` vector. You must specify the number of $\boldsymbol{\beta}$ parameters programmatically to allow scaling up your function to an arbitrary number of unknowns. You will assume that all variables contained in the `my_info` list (the second argument to `lm_logpost()`) are the same fields in the `info_quad` list you defined in Problem 3a).
**Define the log-posterior function by completing the code chunk below. You must calculate the mean trend, `mu`, using MATRIX MATH between the design matrix and the unknown $\boldsymbol{\beta}$ column vector. After you complete the function, test that it works by evaluating the log-posterior at two different sets of parameter values. Try values of -1 for all parameters, and then try out values of 1 for all parameters.**
*HINT*: If you have successfully completed the log-posterior function, you should get a value of `-1161.363` for the -1 guess values, and a value of `-137.0484` for the +1 guess values.
*HINT*: Don't forget about useful data type conversion functions such as `as.matrix()` and `as.vector()` (or `as.numeric()`).
#### SOLUTION
```{r, solution_03b, eval=TRUE}
lm_logpost <- function(unknowns, my_info)
{
# specify the number of unknown beta parameters
length_beta <- ncol(my_info$design_matrix)
# extract the beta parameters from the `unknowns` vector
beta_v <- unknowns[1:length_beta]
# extract the unbounded noise parameter, varphi
lik_varphi <- unknowns[length_beta+1]
# back-transform from varphi to sigma
lik_sigma <- exp(lik_varphi)
# extract design matrix
X <- my_info$design_matrix
# calculate the linear predictor
mu <- as.vector(X %*% as.matrix(beta_v))
# evaluate the log-likelihood
log_lik <- sum(dnorm(x= my_info$yobs, mean = as.numeric(mu), sd = lik_sigma, log = TRUE))
# evaluate the log-prior
log_prior_beta <- sum(dnorm(x = beta_v, mean = my_info$mu_beta, sd = my_info$tau_beta, log = TRUE))
log_prior_sigma <- dexp(x= lik_sigma, rate = my_info$sigma_rate, log = TRUE)
# add the mean trend prior and noise prior together
log_prior <- log_prior_beta + log_prior_sigma
# account for the transformation
log_derive_adjust <- lik_varphi
# sum together
return(log_lik+log_prior+log_derive_adjust)
}
```
Test out `lm_logpost()` with guess values of -1 for all parameters.
```{r, solution_03b_c}
###
lm_logpost(c(-1,-1,-1,-1), info_quad)
```
Test out `lm_logpost()` with guess values of 1 for all parameters.
```{r, solution_03b_b}
###
lm_logpost(c(1,1,1,1), info_quad)
```
### 3c)
The `my_laplace()` function is started for you in the code chunk below. You must fill in the portion after the optimization is executed. Although you have filled in these elements before it is good to review to make sure you remember how the posterior covariance matrix is calculated! Also, you must complete the calculation of the `int` variable which stands for "integration" and equals the Laplace Approximation's estimate to the Evidence.
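As a reminder, with $p$ unknown parameters, posterior mode $\boldsymbol{\theta}_{MAP}$, and posterior covariance matrix $\boldsymbol{\Sigma}$ (the negative inverse of the Hessian at the mode), the Laplace Approximation estimates the log-Evidence as:

$$
\log\left[\mathrm{Evidence}\right] \approx \frac{p}{2} \log\left(2\pi\right) + \frac{1}{2} \log\left( \det\left( \boldsymbol{\Sigma} \right) \right) + \log\left[ p\left( \mathbf{y}, \boldsymbol{\theta}_{MAP} \right) \right]
$$

The last term is the un-normalized log-posterior evaluated at the mode, which is exactly what `logpost_func(mode, ...)` returns.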
**Complete the `my_laplace()` function below and then fit the Bayesian linear model using a starting guess of zero for all parameters. Assign your result to the `laplace_quad` object. Print the posterior mode and posterior standard deviations to the screen.**
**Should you be concerned about the initial guess impacting the posterior results?**
#### SOLUTION
```{r, solution_03c, eval=TRUE}
my_laplace <- function(start_guess, logpost_func, ...)
{
# code adapted from the `LearnBayes` function `laplace()`
fit <- optim(start_guess,
logpost_func,
gr = NULL,
...,
method = "BFGS",
hessian = TRUE,
control = list(fnscale = -1, maxit = 1001))
mode <- fit$par
post_var_matrix <- -solve(fit$hessian)
p <- length(mode) # number of unknown parameters
int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
# package all of the results into a list
list(mode = mode,
var_matrix = post_var_matrix,
log_evidence = int,
converge = ifelse(fit$convergence == 0,
"YES",
"NO"),
iter_counts = as.numeric(fit$counts[1]))
}
```
Fit the Bayesian linear model.
```{r, solution_03c_b, eval=TRUE}
laplace_quad <- my_laplace(c(0,0,0,0), lm_logpost, info_quad)
```
Display the posterior mode and posterior standard deviations.
```{r, solution_03c_c}
###
laplace_quad$mode
```
```{r}
sqrt(diag(laplace_quad$var_matrix))
```
Should you be worried about the initial guess impacting the results?
No, the initial guess should not impact the posterior results. The log-posterior for this linear model is well behaved with a single mode, so the optimizer converges to the same mode from any reasonable starting guess.
### 3d)
The `generate_lm_post_samples()` function is started for you in the code chunk below. The first argument, `mvn_result`, is the Laplace Approximation result object returned from the `my_laplace()` function. The second argument, `length_beta`, specifies the number of mean trend $\boldsymbol{\beta}$-parameters in the model. The naming of the variables is taken care of for you. This function should look quite similar to the one from the previous assignment...
After completing the function, generate 2500 posterior samples of the parameters in your model and assign the result to the `post_samples_quad` object. You will then use the posterior samples to study the posterior distribution on the slope on the quadratic feature, $\beta_2$. The naming convention provided by the `generate_lm_post_samples()` function names $\beta_2$ as the `beta_02` column.
**Complete the `generate_lm_post_samples()` function below. After completing the function, generate 2500 posterior samples and assign the result to the `post_samples_quad` object. Create a histogram with 55 bins using `ggplot2` for the slope `beta_02`. Calculate the probability that the slope is positive.**
#### SOLUTION
```{r, solution_03d, eval=TRUE}
generate_lm_post_samples <- function(mvn_result, length_beta, num_samples)
{
MASS::mvrnorm(n = num_samples,
mu = mvn_result$mode ,
Sigma = mvn_result$var_matrix ) %>%
as.data.frame() %>% tibble::as_tibble() %>%
purrr::set_names(c(sprintf("beta_%02d", 0:(length_beta-1)), "varphi")) %>%
# back transform varphi to sigma!!!
mutate(sigma = exp(varphi) )
}
```
Generate posterior samples.
```{r, solution_03d_b, eval=TRUE}
set.seed(87123)
N=2500
post_samples_quad <- generate_lm_post_samples(laplace_quad, ncol(Xmat_quad_big), N)
```
Create the posterior histogram on $\beta_2$, calculate the posterior 95% uncertainty interval, and calculate the probability that $\beta_2$ is greater than zero.
Add as many code chunks as you feel are necessary.
```{r}
post_samples_quad %>%
ggplot(mapping = aes(x = beta_02)) +
geom_histogram(bins = 55)
```
```{r}
### empirical 95% uncertainty interval from the posterior samples
quantile(post_samples_quad$beta_02, c(0.025, 0.975))
```
```{r}
mean(post_samples_quad$beta_02>0)
```
## Problem 04
Now that you can fit a Bayesian linear model, it's time to work with making posterior predictions from the model. You will use those predictions to calculate and summarize the errors of the model relative to observations. Since we have discussed RMSE and R-squared throughout the semester, you will work with the Mean Absolute Error (MAE) metric.
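As a reminder, with $N$ observations $y_n$ and corresponding model predictions $\hat{y}_n$, the Mean Absolute Error is:

$$
\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - \hat{y}_n \right|
$$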
### 4a)
The `post_lm_pred_trend_samples()` function is started in the code chunk below. This function generates posterior mean trend predictions. The first argument, `Xnew`, is a potentially new or test design matrix that we wish to make predictions for. The second argument, `Bmat`, is a matrix of posterior samples of the $\boldsymbol{\beta}$-parameters. The `Xnew` matrix has rows equal to the number of predictions points, `M`, and the `Bmat` matrix has rows equal to the number of posterior samples `S`.
You must complete the function by performing the necessary matrix math to calculate the matrix of posterior mean trend predictions, `Umat`.
The `post_lm_pred_trend_samples()` returns the `Umat` and `Ymat` matrices contained within a list.
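The required matrix math follows from the dimensions: `Xnew` is $M \times \left(J+1\right)$ and `Bmat` is $S \times \left(J+1\right)$, so the $M \times S$ matrix of posterior mean trend predictions is:

$$
\mathbf{U} = \mathbf{X}_{new} \mathbf{B}^{T}
$$

Each column of $\mathbf{U}$ is the mean trend evaluated at all $M$ prediction points using one posterior sample of the $\boldsymbol{\beta}$-parameters.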
**Perform the necessary matrix math to calculate the matrix of posterior predicted mean trends `Umat`.**
#### SOLUTION
```{r, solution_02a, eval=TRUE}
post_lm_pred_trend_samples <- function(Xnew, Bmat, sigma_vector)
{
# matrix of posterior trend predictions
Umat <- Xnew %*% t(Bmat)
# create a matrix of random values from an arbitrary Gaussian by generating random values from a standard normal and rescaling
Rmat <- matrix(rep(sigma_vector, nrow(Xnew)), nrow(Xnew), byrow = TRUE)
Zmat <- matrix(rnorm(nrow(Xnew)*nrow(Bmat)), nrow(Xnew), byrow = TRUE)
#calculate $\mu$ + $\sigma$*z
Ymat <- Umat + Rmat * Zmat
# return Umat, completed for you
return(list(Umat = Umat, Ymat=Ymat))
}
```
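To sanity check the matrix math, here is a minimal self-contained sketch with toy sizes (`M = 4` prediction points, `P = 3` $\beta$-parameters, `S = 5` posterior samples); the `_toy` objects are illustrative stand-ins for the real design matrix and posterior samples:

```{r}
set.seed(4321)
M <- 4; P <- 3; S <- 5
Xnew_toy <- matrix(rnorm(M * P), nrow = M)   # M x P design matrix
Bmat_toy <- matrix(rnorm(S * P), nrow = S)   # S x P posterior beta samples
sigma_toy <- rexp(S)                         # S posterior sigma samples

Umat_toy <- Xnew_toy %*% t(Bmat_toy)         # (M x P) %*% (P x S) -> M x S
Rmat_toy <- matrix(rep(sigma_toy, M), M, byrow = TRUE)  # column j repeats sigma_j
Zmat_toy <- matrix(rnorm(M * S), M)          # M x S standard normal draws
Ymat_toy <- Umat_toy + Rmat_toy * Zmat_toy

dim(Umat_toy)   # M rows (prediction points) by S columns (posterior samples)
dim(Ymat_toy)
```

Filling `Rmat_toy` by row means each row equals `sigma_toy`, so every column is constant at the corresponding posterior $\sigma$ sample, which is exactly what the rescaling step needs.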
### 4b)
The code chunk below is completed for you. The function `make_post_lm_pred()` is a wrapper which calls the `post_lm_pred_trend_samples()` function. It contains two arguments. The first, `Xnew`, is a test design matrix. The second, `post`, is a data.frame of posterior samples. The function extracts the $\boldsymbol{\beta}$-parameter posterior samples and converts the object to a matrix. It also extracts the posterior samples on $\sigma$ and converts them to a vector. We will use the posterior samples on $\sigma$ in the next assignment, but the line of code is included below for demonstration purposes.
```{r, create_lm_pred_wrapper}
make_post_lm_pred <- function(Xnew, post)
{
Bmat <- post %>% select(starts_with("beta_")) %>% as.matrix()
sigma_vector <- post %>% pull(sigma)
post_lm_pred_trend_samples(Xnew, Bmat, sigma_vector)
}
```
You now have enough pieces in place to generate posterior predictions from your model.
**Make posterior predictions of the trend on the "big" quadratic demo training set. Assign the posterior trend predictions to the `post_trend_samples_quad` object. What are the dimensions of `post_trend_samples_quad`? What do the rows and columns correspond to?**
#### SOLUTION
Make posterior predictions on the training set.
```{r, solution_04b, eval=TRUE}
post_trend_samples_quad <- make_post_lm_pred(Xmat_quad_big, post_samples_quad)
```
The dimensionality of the posterior predicted mean trend matrix is:
```{r, solution_04b_b, eval=TRUE}
###
dim(post_trend_samples_quad$Umat)
```
What do the rows and columns correspond to?
The rows correspond to the prediction points (here, the training set observations), while the columns correspond to the posterior samples of the $\boldsymbol{\beta}$-parameters. Each column is therefore one complete set of mean trend predictions generated by a single posterior sample.
### 4c)
You will now use the model predictions to calculate the error between the model and the training set observations. Since you generated 2500 posterior samples, you have 2500 different sets of predictions! However, to get started you will focus on the first 3 posterior samples.
**Calculate the error between the predicted mean trend and the training set observations for each of the first 3 posterior predicted samples. Assign the errors to separate vectors, as indicated in the code chunk below.**
#### SOLUTION
```{r, solution_04c, eval=TRUE}
### error of the first posterior sample
error_quad_post_01 <- c(post_trend_samples_quad$Umat[,1] - quad_data_big$y)
### error of the second posterior sample
error_quad_post_02 <- c(post_trend_samples_quad$Umat[,2] - quad_data_big$y)
### error of the third posterior sample
error_quad_post_03 <- c(post_trend_samples_quad$Umat[,3] - quad_data_big$y)
```
### 4d)
You will now calculate the Mean Absolute Error (MAE) associated with each of the three error samples calculated in Problem 4c). However, before calculating the MAE, let's first consider the dimensions of the `error_quad_post_01`. What is the length of the `error_quad_post_01` vector? When you take the absolute value and then average across all elements in that vector, what are you averaging over?
**What is the length of the `error_quad_post_01` vector? Calculate the MAE associated with each of the 3 error vectors you calculated in Problem 4c).**
**What are you averaging over when you calculate the mean absolute error? Are the three MAE values the same? If not, why would they be different?**
*HINT*: The absolute value can be calculated with the `abs()` function.
#### SOLUTION
The length of the `error_quad_post_01` vector is:
```{r, solution_04d}
###
length(error_quad_post_01)
```
Now calculate the MAE associated with each of the first three posterior samples.
```{r, solution_04d_b, eval=TRUE}
mae_quad_post_01 <- mean(abs(error_quad_post_01))
mae_quad_post_02 <- mean(abs(error_quad_post_02))
mae_quad_post_03 <- mean(abs(error_quad_post_03))
```
Are the three MAE values the same?
```{r}
mae_quad_post_01 == mae_quad_post_02
```
```{r}
mae_quad_post_02 == mae_quad_post_03
```
When we calculate the mean absolute error, we are averaging over the training set observations for a single posterior sample. No, the three MAE values are not the same: each posterior sample has different values of the $\boldsymbol{\beta}$-parameters, so each sample produces a different mean trend prediction and therefore different errors relative to the observations.
### 4e)
In Problem 4d), you calculated the MAE associated with the first 3 posterior samples. You will now work through calculating the MAE associated with every posterior sample. Although it might seem like you need to use a for-loop to do so, `R` will simplify the operation for you. If you perform an addition or subtraction between a matrix and a vector, `R` will find the dimension that matches between the two and repeat the action over the other dimension (this repetition is referred to as *broadcasting*; in `R` the mechanism is called *recycling*). Consider the code below, which has a vector, `a_numeric`, subtracted from a matrix `a_matrix`:
`a_matrix - a_numeric`
Assuming that `a_matrix` has 10 rows and 25 columns and `a_numeric` is length 10, `R` will subtract `a_numeric` from each column in `a_matrix`. The result will be another matrix with the same dimensionality as `a_matrix`. To confirm this is the case, consider the example below where a vector of length 2 is subtracted from a matrix of 2 rows and 4 columns. The resulting dimensionality is 2 rows by 4 columns.
```{r, example_subtraction}
### a 2 x 4 matrix
matrix(1:8, nrow = 2, byrow = TRUE)
### a vector length 2
c(1, 2)
### subtracting the two yields a matrix
matrix(1:8, nrow = 2, byrow = TRUE) - c(1, 2)
```
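For completeness, the same operation can be written explicitly with `sweep()`, which documents the margin being operated over rather than relying on recycling; a small self-contained sketch:

```{r}
a_matrix <- matrix(1:8, nrow = 2, byrow = TRUE)
a_numeric <- c(1, 2)

# recycling-based subtraction: the vector is subtracted from each column
recycled <- a_matrix - a_numeric

# the explicit equivalent: sweep over MARGIN = 1 (the rows)
swept <- sweep(a_matrix, MARGIN = 1, STATS = a_numeric, FUN = "-")

all(recycled == swept)
```

`sweep()` is easier to read and fails loudly if the dimensions do not line up, so it can be a safer choice in longer scripts.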
You will use this fact to calculate the error associated with each training point and each posterior sample.
**Calculate the absolute value of the error between the mean trend matrix and the training set response. Assign the result to the `absError_quad_mat` object. Print the dimensions of the `absError_quad_mat` matrix to screen.**
#### SOLUTION
```{r, solution_04e, eval=TRUE}
absError_quad_mat <- abs(post_trend_samples_quad$Umat - quad_data_big$y)
### dimensions?
dim(absError_quad_mat)
```
### 4f)
You must now summarize the absolute error values by averaging them appropriately. Should you average across the rows or down the columns? In `R`, the `colMeans()` function calculates the average value associated with each column in a matrix and returns a vector. Likewise, the `rowMeans()` function calculates the average value along each row and returns a vector.
**Which function, `colMeans()` or `rowMeans()`, should you use to calculate the MAE associated with each posterior sample?**
**Calculate the MAE associated with each posterior sample and assign the result to the `MAE_quad` object. Print the data type (the class) of the `MAE_quad` to the screen and display its length. Check your result is consistent with the MAEs you previously calculated in Problem 4d).**
#### SOLUTION
Which function? Since each row corresponds to a training observation and each column corresponds to a posterior sample, `colMeans()` should be used: it averages down the rows within each column, producing one MAE per posterior sample.
```{r, solution_04f, eval=TRUE}
MAE_quad <- colMeans(absError_quad_mat)
### data type?
class(MAE_quad)
### length?
length(MAE_quad)
```
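To see concretely that `colMeans()` averages down the rows within each column (one MAE per posterior sample), here is a small self-contained illustration with a toy error matrix; the `toy_abs_err` values are made up for illustration:

```{r}
# toy M x S absolute error matrix: 3 observations (rows), 4 posterior samples (columns)
toy_abs_err <- matrix(c(1, 2, 3,
                        2, 4, 6,
                        1, 1, 1,
                        0, 3, 3), nrow = 3)

colMeans(toy_abs_err)        # one MAE per posterior sample (length 4)
apply(toy_abs_err, 2, mean)  # the same result, written explicitly
```

`rowMeans()` would instead average across the posterior samples for each observation, which is not the quantity asked for here.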
Check with the results you calculated previously.
```{r, solution_04f_b, eval=TRUE}
### your code here
MAE_quad[1] == mae_quad_post_01
```
```{r}
MAE_quad[2] == mae_quad_post_02
```
```{r}
MAE_quad[3] == mae_quad_post_03
```
### 4g)
You have calculated the MAE associated with each posterior sample, and thus represented the uncertainty in the MAE!
**Use the `quantile()` function to print out summary statistics associated with the MAE. You can use the default arguments, and thus pass in `MAE_quad` into `quantile()` without setting any other argument.**
**Why is the MAE uncertain? Or put another way, what causes the MAE to be uncertain?**
#### SOLUTION
Calculate the quantiles of the MAE below.
```{r, solution_04g, eval=TRUE}
###
quantile(MAE_quad)
```
Why is the MAE uncertain?
The MAE depends on the $\boldsymbol{\beta}$-parameters (the regression coefficients), which are themselves uncertain. Each posterior sample of the $\boldsymbol{\beta}$-parameters yields a different mean trend prediction, and therefore a different MAE, so the posterior uncertainty in the parameters carries through to uncertainty in the MAE.
## Problem 05
You will now make use of the model fitting and prediction functions you created in the previous problems to study the behavior of a more complicated non-linear modeling task. The code chunk below reads in two data sets. Both consist of two continuous variables, an input `x` and a response `y`. The first, `train_df`, will serve as the training set, and the second, `test_df`, will serve as a hold-out test set. You will only fit models based on the training set.
```{r, read_in_train_test_02, eval=TRUE}
file_for_train <- "hw07_train_data.csv"
train_df <- readr::read_csv(file_for_train, col_names = TRUE)
file_for_test <- "hw07_test_data.csv"
test_df <- readr::read_csv(file_for_test, col_names = TRUE)
```
Glimpses are provided for each data set below.
```{r, show_train_data_glimpse}
train_df %>% glimpse()
```
```{r, show_test_data_glimpse}
test_df %>% glimpse()
```
### 5a)
It's always a good idea to start out by visualizing the data before modeling.
**Create a scatter plot between the response and the input with `ggplot2`. Include both the training and test sets together in one graph. Use the marker color to distinguish between the two.**
#### SOLUTION
```{r solution_05a}
###
train_df %>%
  ggplot(mapping = aes(x = x, y = y, color = "train")) +
  geom_point() +
  geom_point(data = test_df, mapping = aes(color = "test"))
```
### 5b)
Hopefully you can tell from the previous visualization that the response is non-linearly related to the input. You will use basis functions to try and model this non-linear relationship. You could manually try out various models of different levels of complexity. However, let's build off the functions you created in the previous questions to programmatically train and assess many models of varying complexity.
Specifically, you will fit 25 different models to the training set. You will consider natural splines from 1 degree of freedom up to 49 degrees of freedom, using only odd values for the degrees of freedom. Your goal will be to find which spline is the "best" in terms of generalizing from the training set to the test set. To do so, you will calculate the MAE on the training set and on the test set for each model. It would be tedious to define the necessary information by hand, manually train each model, generate the posterior samples, and make predictions from each model. Therefore, you will work through completing functions that will enable you to programmatically loop over each candidate model.
You will begin by completing a function to create the training set and test set for a desired spline basis. The function `make_spline_basis_mats()` is started for you in the first code chunk below. The first argument is the desired spline basis degrees of freedom, `J`. The second argument is the training set, `train_data`, and the third argument is the hold-out test set, `test_data`.
The second through fifth code chunks below are provided to check that you completed the function correctly. The second code chunk uses `purrr::map_dfr()` to create the training and test matrices for all 25 models. A glimpse of the resulting object, `spline_matrices`, is displayed to the screen in the third code chunk and is printed to the screen in the fourth code chunk. You should see a `tibble` consisting of two variables, `design_matrix` and `test_matrix`. Both variables are lists containing matrices. The matrices contained in the `spline_matrices$design_matrix` variable are the training design matrices, while the matrices contained in `spline_matrices$test_matrix` are the corresponding hold-out test basis matrices.
The fifth code chunk below prints the dimensionality of the 1st and 2nd order spline basis matrices to the screen. It shows that to access a specific matrix, you need to use the `[[]]` notation.
**Complete the first code chunk below. You must specify the `splines::ns()` function call correctly such that the degrees-of-freedom, `df`, argument equals the user specified degrees of freedom, `J`, and that the basis is applied to the `x` variable within the user supplied `train_data` argument. The interior and boundary knots are extracted and saved to the `interior_knots` and `boundary_knots` objects, respectively. Create the training design matrix by calling the `model.matrix()` function with the `splines::ns()` function to create the basis for the `x` variable. The `interior_knots` object must be assigned to the `knots` argument and the `boundary_knots` object must be assigned to the `Boundary.knots` argument. Make sure you assign the data sets correctly to the `data` argument of `model.matrix()`.**
**How many rows are in the training matrices and how many rows are in the test matrices?**
*NOTE*: The `make_spline_basis_mats()` function **ASSUMES** the `train_data` and `test_data` arguments contain a column named `x`.
#### SOLUTION
Define the `make_spline_basis_mats()` function.
```{r, solution_05b, eval=TRUE}
make_spline_basis_mats <- function(J, train_data, test_data)
{
train_basis <- splines::ns(train_data$x, df = J)
interior_knots <- as.vector(attributes(train_basis)$knots)
boundary_knots <- as.vector(attributes(train_basis)$Boundary.knots)
train_matrix <- model.matrix(~ splines::ns(x, knots = interior_knots, Boundary.knots = boundary_knots), data = train_data)
test_matrix <- model.matrix(~ splines::ns(x, knots = interior_knots, Boundary.knots = boundary_knots), data = test_data)
tibble::tibble(
design_matrix = list(train_matrix),
test_matrix = list(test_matrix)
)
}
```
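As a side check (not required by the assignment), a `J` degree-of-freedom natural spline basis from `splines::ns()` has `J` columns, so the `model.matrix()` result, which prepends an intercept, has `J + 1` columns. A minimal self-contained sketch with toy inputs (`x_toy` and the `J` value here are illustrative):

```{r}
# toy input values standing in for the training input
x_toy <- seq(-3, 3, length.out = 50)
J <- 5

# the raw natural spline basis has J columns
basis_toy <- splines::ns(x_toy, df = J)
ncol(basis_toy)

# model.matrix() adds the intercept, giving J + 1 columns
Xmat_toy <- model.matrix(~ splines::ns(x, df = J), data = data.frame(x = x_toy))
ncol(Xmat_toy)
```

This is why extracting and reusing the training set's interior and boundary knots matters: the test basis must be built from the *training* knot locations, not recomputed from the test inputs.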
Create each of the training and test basis matrices.
```{r, solution_05b_b, eval=TRUE}
spline_matrices <- purrr::map_dfr(seq(1, 50, by = 2),
make_spline_basis_mats,
train_data = train_df,
test_data = test_df)
```
Get a glimpse of the structure of `spline_matrices`.
```{r, solution_05b_c, eval=TRUE}
glimpse(spline_matrices)
```
Display the elements of `spline_matrices` to the screen.
```{r, solution_05b_d, eval=TRUE}
spline_matrices
```
The code chunk below is created for you. It shows how to check the dimensions of several training and test matrices.
```{r, solution_05b_e, eval=TRUE}
dim(spline_matrices$design_matrix[[1]])
dim(spline_matrices$test_matrix[[1]])
dim(spline_matrices$design_matrix[[2]])
dim(spline_matrices$test_matrix[[2]])
```
**How many rows are the training design matrix and how many rows are in the test matrix?**
There are 150 rows in each training design matrix and 50 rows in each test matrix.
### 5c)
Each element in the `spline_matrices$design_matrix` object is a separate design matrix. You will use this structure to programmatically train the models. The first code chunk creates a list of information which stores the training set responses and defines the prior parameters. The second code chunk below defines the `manage_spline_fit()` function. The first argument is a design matrix `Xtrain`, the second argument is the log-posterior function, `logpost_func`, and the third argument is `my_settings`. `manage_spline_fit()` sets the initial starting values for the $\boldsymbol{\beta}$-parameters by generating random values from a standard normal. The initial value for the unbounded $\varphi$ parameter is set by log-transforming a random draw from the prior on $\sigma$. After creating the initial guess values, the `my_laplace()` function is called to fit the model.
You will complete both code chunks in order to programmatically train all 25 spline models. After completing the first two code chunks, the third code chunk performs the training for you. The fourth code chunk below shows how to access training results associated with the second trained natural spline which has 3 degrees-of-freedom by using the `[[]]` operator. The fifth code chunk checks that each model converged.
**Complete the first two code chunks below. In the first code chunk, assign the training responses to the `yobs` variable within the `info_splines_train` list. Specify the prior mean and prior standard deviation on the $\boldsymbol{\beta}$-parameters to be 0 and 20, respectively. Specify the rate parameter on the unknown $\sigma$ to be 1.**
**Complete the second code chunk by generating a random starting guess for all $\boldsymbol{\beta}$-parameters from a standard normal. Create the random initial guess for $\varphi$ by generating a random number from the Exponential prior distribution on $\sigma$ and log-transforming the variable. Complete the call to the `my_laplace()` function by passing in the initial values as a vector of correct size.**
*HINT*: How can you determine the number of unknown $\boldsymbol{\beta}$-parameters if you know the training design matrix?
**Please make sure that you assign `yobs` as a "regular" vector and NOT as a matrix data type.**
#### SOLUTION
Assemble the list of required information.
```{r, solution_05c_a, eval=TRUE}
info_splines_train <- list(
yobs = train_df$y,
mu_beta = 0,
tau_beta = 20,