---
title: "INFSCI 2595 Spring 2023 Homework: 07"
subtitle: "Assigned March 4, 2023; Due: March 19, 2023"
author: "Navodita Mathur"
date: "Submission time: March 19, 2023 at 11:00PM EST"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#### Collaborators
Include the names of your collaborators here.
## Overview
This homework assignment is focused on working with linear models. You will use formulas discussed in lecture to study the important concepts associated with fitting linear models. You will practice fitting, predicting, and assessing linear model performance on a simple example. You will then apply those concepts to a more realistic application. You will train numerous linear basis function models ranging from simple to very complex. You will identify the best model considering training and hold-out test set performance. You will contrast the model selection based on a train/test split with an information metric via the log-Evidence (log marginal likelihood).
**IMPORTANT**: The RMarkdown assumes you have downloaded the data sets (CSV files) to the same directory you saved the template Rmarkdown file. If you do not have the CSV files in the correct location, the data will not be loaded correctly.
### IMPORTANT!!!
Certain code chunks are created for you. Each code chunk has `eval=FALSE` set in the chunk options. You **MUST** change it to be `eval=TRUE` in order for the code chunks to be evaluated when rendering the document.
You are free to add more code chunks if you would like.
## Load packages
You will use the `tidyverse` in this assignment, as you have done in the previous assignments.
```{r, load_packages}
library(tidyverse)
```
This assignment also uses the `splines` and `MASS` packages. Both are installed with base `R` and so you do not need to download any additional packages to complete the assignment.
## Problem 01
We introduced many important regression topics at the beginning of the semester. Our primary example back then involved a continuous output $y$ and a single continuous input $x$. The mean trend, $\mu_n$, was related to the input via a quadratic relationship:
$$
\mu_n = \beta_0 + \beta_1 x_n + \beta_2 x_{n}^{2}
$$
We did not go into the mathematical details of the regression problem earlier in the semester. Instead, we conceptually introduced uncertainty and confidence intervals. We also discussed their impact on selecting the best model. However, we have recently covered the necessary mathematical and statistical fundamentals to calculate the uncertainty! You will begin by assuming the likelihood noise, $\sigma$, is known in order to focus on the regression coefficients (the $\boldsymbol{\beta}$ parameters). You will therefore study the *scaled* or *relative* coefficient uncertainty. This is a useful starting point because it allows you to focus on the impact the **features** have on learning the regression coefficients.
You will study the *scaled* regression coefficient uncertainty several ways. You will compare the posterior uncertainty based on informative, vague, and no prior specifications with "big" and "small" data applications.
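Before working with the homework data, the "big" versus "small" data comparison can be previewed with a quick self-contained sketch. The inputs below are synthetic (they are NOT the homework CSV): with an infinitely diffuse prior and $\sigma = 1$, the scaled posterior standard deviations come from the diagonal of $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$, and they shrink as observations are added.

```{r, uncertainty_preview_sketch}
### illustrative sketch with SYNTHETIC inputs (not the homework data):
### scaled posterior standard deviations shrink as observations are added
x_few  <- seq(-2, 2, length.out = 5)
x_many <- seq(-2, 2, length.out = 30)

### quadratic design matrices: intercept, linear, and squared features
X_few  <- cbind(1, x_few, x_few^2)
X_many <- cbind(1, x_many, x_many^2)

### scaled posterior standard deviations (sigma = 1, diffuse prior)
sd_few  <- sqrt(diag(solve(t(X_few) %*% X_few)))
sd_many <- sqrt(diag(solve(t(X_many) %*% X_many)))

sd_few
sd_many
```

Every coefficient's scaled standard deviation is smaller with 30 points than with 5, which is the behavior you will study with the actual data below.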
The data from the "quadratic demo" are read in the code chunk below and assigned to the `quad_data_big` object.
```{r, read_quad_data}
quad_data_big <- readr::read_csv('quad_demo_data.csv', col_names = TRUE)
```
The glimpse below shows that there are 30 rows for the two continuous variables, the input `x` and the output `y`.
```{r, show_quad_data_glimpse}
quad_data_big %>% glimpse()
```
The data are labeled as "big" because the 30 observations will serve the role as the "big" or "large" data set in this example. The "small" data version is created for you in the code chunk below by slicing the first 5 rows.
```{r, make_small_quad_data}
quad_data_small <- quad_data_big %>% slice(1:5)
```
The entire small data set is displayed for you below.
```{r, show_quad_data_small}
quad_data_small
```
### 1a)
Let's start by visualizing the output to input relationship. You will use the `color` aesthetic to visually identify the first five rows which were separated out to create the small version of the data. You can add a row identifier column via the `tibble::rowid_to_column()` function. By default the new column will be named, `rowid`.
**Create a scatter plot using `ggplot2` and the "full" or "big" data set to show the relationship between the output `y` and the input `x`. You should use `tibble::rowid_to_column()` to add the row identifier column. Map the `color` aesthetic within `geom_point()` to `rowid > 5` to identify the first five rows. Set the marker size to 3.**
*HINT*: Don't forget the importance of the `aes()` function!!!!!!!
#### SOLUTION
Insert your code here.
```{r}
quad_data_big %>%
  tibble::rowid_to_column() %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = rowid > 5), size = 3)
```
### 1b)
As stated in lecture, we need to create the **design matrix** before we can calculate anything associated with linear models. The design matrix contains the features. Those features can be input variables as well as quantities derived from the inputs. Features can therefore be *functions of* the inputs!
You must create **TWO** design matrices in this problem. Both design matrices will use a **quadratic** relationship between the input `x` and the mean trend. The first design matrix, `Xmat_quad_small`, corresponds to the sliced "small" data. The second design matrix, `Xmat_quad_big`, corresponds to the original "big" data.
You are allowed to use the `model.matrix()` function shown in lecture to create the design matrices.
**Create two design matrices. Both must use a quadratic relationship between the input and the trend. The comments and variable names in the code chunks below state which data set to use in creating the design matrix.**
*HINT*: You created polynomial features in previous assignments. Please review the syntax from those earlier assignments if you forget how to create the polynomial feature via the formula interface.
#### SOLUTION
```{r, solution_for_1b_a, eval=TRUE}
### make the design matrix for the SMALL data
Xmat_quad_small <- model.matrix(y ~ x + I(x^2), data = quad_data_small)
Xmat_quad_small
```
```{r, solution_for_1b_b, eval=TRUE}
### make the design matrix for the BIG data
Xmat_quad_big <- model.matrix(y ~ x + I(x^2), data = quad_data_big)
Xmat_quad_big
```
### 1c)
**How many rows and columns does EACH design matrix have?**
**Display the entire small data design matrix to the screen. What values are contained in the FIRST column and why?**
#### SOLUTION
Insert code chunks here.
```{r}
#Number of columns in design matrix of small dataset
nc_Xmat_quad_small = ncol(Xmat_quad_small)
nc_Xmat_quad_small
```
```{r}
#Number of rows in design matrix of small dataset
nr_Xmat_quad_small = nrow(Xmat_quad_small)
nr_Xmat_quad_small
```
```{r}
#Number of columns in design matrix of big dataset
nc_Xmat_quad_big = ncol(Xmat_quad_big)
nc_Xmat_quad_big
```
```{r}
#Number of rows in design matrix of big dataset
nr_Xmat_quad_big = nrow(Xmat_quad_big)
nr_Xmat_quad_big
```
```{r}
Xmat_quad_small
```
The first column contains all 1s. It is the intercept column: the constant feature that multiplies $\beta_0$ for every observation.
### 1d)
Let's now begin to study the posterior coefficient uncertainty. As stated previously, you will assume $\sigma$ is known. Let's keep things simple with an assumed value of $\sigma = 1$. If this bothers you, you can also think of this as examining the *scaled* uncertainty. We discussed in lecture that $\sigma$ acts as a scaling term on the posterior uncertainty. Thus, the "real" posterior uncertainty will be a multiple of what you are about to calculate. You are thus getting a rough idea on the behavior of the uncertainty due entirely to the inputs!
You will begin by assuming that the prior does not matter and thus are assuming an infinitely diffuse prior or a prior with infinite uncertainty.
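As a reminder from lecture, with an infinitely diffuse prior (and $\sigma$ assumed known), the posterior covariance matrix on the regression coefficients reduces to:

$$
\mathrm{cov}\left( \boldsymbol{\beta} \mid \mathbf{y} \right) = \sigma^2 \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1}
$$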
**Calculate the posterior covariance matrix for the regression coefficients assuming $\sigma=1$. You must calculate the posterior covariance matrix for both design matrices.**
**The comments and variable names below state which data set to use.**
#### SOLUTION
```{r}
sigma = 1
```
```{r, solution_for_1d_a, eval=TRUE}
### posterior covariance matrix for the SMALL data
covmat_quad_small_noprior <- (sigma^2) * solve(t(Xmat_quad_small)%*%Xmat_quad_small)
covmat_quad_small_noprior
```
```{r, solution_for_1d_b, eval=TRUE}
### posterior covariance matrix for the BIG data
covmat_quad_big_noprior <- (sigma^2) * solve(t(Xmat_quad_big)%*%Xmat_quad_big)
covmat_quad_big_noprior
```
### 1e)
**How many rows and columns does EACH posterior covariance matrix have?**
#### SOLUTION
Insert code chunks here.
```{r}
nc_covmat_quad_small_noprior = ncol(covmat_quad_small_noprior)
nc_covmat_quad_small_noprior
```
```{r}
nr_covmat_quad_small_noprior = nrow(covmat_quad_small_noprior)
nr_covmat_quad_small_noprior
```
```{r}
nc_covmat_quad_big_noprior = ncol(covmat_quad_big_noprior)
nc_covmat_quad_big_noprior
```
```{r}
nr_covmat_quad_big_noprior = nrow(covmat_quad_big_noprior)
nr_covmat_quad_big_noprior
```
### 1f)
**What is the posterior standard deviation on the quadratic feature for the small data case? What is the posterior standard deviation on the quadratic feature for the big data case?**
#### SOLUTION
Insert code chunks here.
```{r}
sd_quad_small_noprior = sqrt(diag(covmat_quad_small_noprior))
sd_quad_small_noprior
```
```{r}
sd_quad_big_noprior = sqrt(diag(covmat_quad_big_noprior))
sd_quad_big_noprior
```
### 1g)
The uncertainty displayed in the previous question does not tell us if a feature is statistically significant or not. We need to know where the coefficient is "located" to answer that. We thus need an *estimate* for the coefficient! We need output measurements to estimate coefficients. Thus, uncertainty can be studied *before* output data are collected but *significance* can only be examined **after** output data are collected.
You will now estimate the coefficients by calculating their posterior mean assuming the prior does not matter. To help you with this question, the output variables are separated from the dataframes and converted to the correct data type for you in the code chunk below. The output column vector for the small data case is displayed below.
```{r, separate_output_columns_quad}
y_quad_big_col <- quad_data_big %>% pull(y) %>% as.matrix()
y_quad_small_col <- quad_data_small %>% pull(y) %>% as.matrix()
y_quad_small_col
```
You **MUST** continue to assume that the prior does not matter and thus are assuming an infinitely diffuse prior or a prior with infinite uncertainty.
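As a reminder, with an infinitely diffuse prior the posterior mean vector equals the ordinary least squares estimate:

$$
\boldsymbol{\beta}_{post} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \mathbf{X}^{T} \mathbf{y}
$$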
**Calculate the regression coefficient posterior mean assuming $\sigma=1$ and the prior does not matter.**
**The comments and variable names below state which data set to use.**
**Display the posterior means for both data cases to the screen.**
**Would the values change if $\sigma$ was assumed to equal 5 instead of 1?**
#### SOLUTION
```{r, solution_for_1g_a, eval=TRUE}
### posterior mean vector for the SMALL data
postmeans_quad_small_noprior <- solve(t(Xmat_quad_small)%*%Xmat_quad_small)%*%(t(Xmat_quad_small)%*%y_quad_small_col)
```
```{r, solution_for_1g_b, eval=TRUE}
### posterior mean vector for the BIG data
postmeans_quad_big_noprior <- solve(t(Xmat_quad_big)%*%Xmat_quad_big)%*%(t(Xmat_quad_big)%*%y_quad_big_col)
```
Display the posterior means and answer the conceptual question.
```{r}
postmeans_quad_small_noprior
```
```{r}
postmeans_quad_big_noprior
```
No, the values would not change if $\sigma$ were assumed to equal 5 instead of 1. With the infinitely diffuse prior, the posterior mean is the ordinary least squares estimate, and $\sigma$ cancels out of that calculation.
### 1h)
**Calculate the posterior 95% uncertainty interval on the quadratic feature.**
**Assuming $\sigma=1$, is the quadratic feature considered to be statistically significant in the small data case? Assuming $\sigma=1$, is the quadratic feature considered to be statistically significant in the big data case?**
#### SOLUTION
Insert your code chunks.
```{r}
qnorm(c(0.025, 0.975),postmeans_quad_small_noprior[[3]], sd_quad_small_noprior[3])
```
```{r}
qnorm(c(0.025, 0.975),postmeans_quad_big_noprior[[3]], sd_quad_big_noprior[3])
```
What do you think?
Assuming $\sigma=1$, the quadratic feature is considered to be statistically significant in both the small and big data cases, because neither posterior 95% uncertainty interval contains zero.
## Problem 02
Let's now consider the influence of a prior on the posterior uncertainty. You will work with two priors in this problem. The first is an informative prior and the second is a vague or diffuse prior. Both priors are Multivariate Normal (MVN) distributions. Both assume the prior means are zero and the regression coefficients are independent. However, the prior standard deviations are different between the two priors.
**The informative prior assumes a prior standard deviation of 1.4 for each coefficient, while the vague prior assumes a prior standard deviation of 30 for each coefficient.**
### 2a)
You will specify the prior by creating the prior covariance matrix.
**Create 2 prior covariance matrices. The first is the prior covariance matrix for the informative prior and the second is the prior covariance matrix for the vague prior. The variable names and comments state which prior you are working with.**
**Display each prior covariance matrix to the screen.**
#### SOLUTION
```{r, solution_for_2a_a, eval=TRUE}
### prior covariance matrix for the INFORMATIVE prior
B0_inform <- (1.4^2)*diag(3)
```
```{r, solution_for_2a_b, eval=TRUE}
### prior covariance matrix for the VAGUE prior
B0_vague <- (30^2)*diag(3)
```
Insert code chunks to display the covariance matrices.
```{r}
B0_inform
```
```{r}
B0_vague
```
### 2b)
You will now calculate the posterior covariance matrix assuming $\sigma=1$ for the small and "big" data cases.
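As a reminder, with a zero-mean MVN prior with prior covariance matrix $\mathbf{B}_0$ and known likelihood noise $\sigma$, the posterior covariance matrix on the regression coefficients is:

$$
\mathrm{cov}\left( \boldsymbol{\beta} \mid \mathbf{y} \right) = \left( \mathbf{B}_{0}^{-1} + \frac{1}{\sigma^2} \mathbf{X}^{T} \mathbf{X} \right)^{-1}
$$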
First, let's focus on the results based on the **informative** prior.
**Calculate the posterior covariance matrix based on the informative prior for both the small and big data cases. The variable names and comments state which data set to use.**
#### SOLUTION
```{r}
sigma=1
```
```{r, solution_for_2b_a, eval=TRUE}
### posterior covariance matrix INFORMATIVE prior and SMALL data
covmat_quad_small_inform <- solve(solve(B0_inform) + (t(Xmat_quad_small) %*% Xmat_quad_small)/sigma^2)
covmat_quad_small_inform
```
```{r, solution_for_2b_b, eval=TRUE}
### posterior covariance matrix INFORMATIVE prior and BIG data
covmat_quad_big_inform <- solve(solve(B0_inform) + (t(Xmat_quad_big) %*% Xmat_quad_big)/sigma^2)
covmat_quad_big_inform
```
### 2c)
Next, let's consider the results based on the **vague** prior.
**Calculate the posterior covariance matrix based on the vague prior for both the small and big data cases. The variable names and comments state which data set to use.**
#### SOLUTION
```{r, solution_for_2c_a, eval=TRUE}
### posterior covariance matrix VAGUE prior and SMALL data
covmat_quad_small_vague <- solve(solve(B0_vague) + (t(Xmat_quad_small) %*% Xmat_quad_small)/sigma^2)
covmat_quad_small_vague
```
```{r, solution_for_2c_b, eval=TRUE}
### posterior covariance matrix VAGUE prior and BIG data
covmat_quad_big_vague <- solve(solve(B0_vague) + (t(Xmat_quad_big) %*% Xmat_quad_big)/sigma^2)
covmat_quad_big_vague
```
### 2d)
**What is the posterior standard deviation on the quadratic feature based on the INFORMATIVE prior for both data cases? Are the posterior standard deviations similar to the result from 1f)?**
#### SOLUTION
Insert your code chunks here.
```{r}
sd_quad_small_inform = sqrt(diag(covmat_quad_small_inform))
sd_quad_small_inform
```
```{r}
sd_quad_big_inform = sqrt(diag(covmat_quad_big_inform))
sd_quad_big_inform
```
### 2e)
**What is the posterior standard deviation on the quadratic feature based on the VAGUE prior for both data cases? Are the posterior standard deviations similar to the result from 1f)?**
#### SOLUTION
Insert your code chunks here.
```{r}
sd_quad_small_vague = sqrt(diag(covmat_quad_small_vague))
sd_quad_small_vague
```
```{r}
sd_quad_big_vague = sqrt(diag(covmat_quad_big_vague))
sd_quad_big_vague
```
### 2f)
**What is the influence of the prior on the linear model regression coefficients?**
#### SOLUTION
What do you think?
The prior shrinks the coefficients toward the prior mean and reduces the posterior uncertainty. The effect is noticeable for the small data set, but it is negligible for the large data set, where the likelihood dominates the prior.
## Problem 03
Now that you have worked through interpreting linear model results, it's time to fit a linear model! You will learn the unknown regression coefficients **and** the unknown likelihood noise, $\sigma$. From a Bayesian perspective, we no longer have simple formulas for the posterior distribution. We will use the Laplace Approximation to execute the Bayesian analysis.
This problem is focused on setting up the log-posterior function for a linear model. You will program the function using matrix math, such that you can easily scale your code from a linear relationship with a single input up to complex linear basis function models. You will assume independent Gaussian priors on all $\boldsymbol{\beta}$-parameters with a shared prior mean $\mu_{\beta}$ and shared prior standard deviation, $\tau_{\beta}$. An Exponential prior with rate parameter $\lambda$ will be assumed for the likelihood noise, $\sigma$. The complete probability model for the response, $y_n$, is shown below using the linear basis notation. The $n$-th row of the basis design matrix, $\boldsymbol{\Phi}$, is denoted as $\boldsymbol{\phi}_{n,:}$. It is assumed that the basis has $J$ degrees-of-freedom.
$$
y_n \mid \mu_n, \sigma \sim \mathrm{normal}\left(y_n \mid \mu_n, \sigma\right)
$$
$$
\mu_n = \boldsymbol{\phi}_{n,:}\boldsymbol{\beta}
$$
$$
\boldsymbol{\beta} \mid \mu_{\beta}, \tau_{\beta} \sim \prod_{j=0}^{J}\left( \mathrm{normal}\left( \beta_j \mid \mu_{\beta}, \tau_{\beta} \right) \right)
$$
$$
\sigma \mid \lambda \sim \mathrm{Exp}\left(\sigma \mid \lambda \right)
$$
### 3a)
You will fit the Bayesian linear model with the Laplace Approximation by programming the log-posterior function. However, before doing so you will create a list of required information, analogous to those you created in the previous homework assignments. The list will contain the data and the prior parameter specifications. The data consist of two major components: the output vector and the design matrix. You will need these details to execute the log-posterior function.
You will start out by fitting the quadratic trend function which you worked with in the previous two problems. As a reminder, the trend is:
$$
\mu_n = \beta_0 + \beta_1 x_{n} + \beta_2 x_{n}^{2}
$$
You will create the list of required information for the "big" data case which uses all observations.
**Complete the code chunk below by filling in the fields of the `info_quad` list. Assign the output vector to `yobs` and the "big" data design matrix to `design_matrix`. Specify the shared prior mean, `mu_beta`, to be 0, the shared prior standard deviation, `tau_beta`, as 3. The prior rate parameter on the noise, `sigma_rate`, should be assigned to 1.**
**Please make sure that you assign `yobs` as a "regular" vector and NOT as a matrix data type.**
#### SOLUTION
```{r, solution_03a, eval=TRUE}
info_quad <- list(
yobs = quad_data_big$y,
design_matrix = Xmat_quad_big,
mu_beta = 0,
tau_beta = 3,
sigma_rate = 1
)
```
### 3b)
You will now define the log-posterior function `lm_logpost()`. You will use the log-transformation on $\sigma$, and so you will actually define the log-posterior in terms of the regression coefficients, $\boldsymbol{\beta}$, and the unbounded noise parameter, $\varphi = \log\left[\sigma\right]$.
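The log-transformation introduces a change-of-variables (Jacobian) adjustment to the log-posterior. Since $\sigma = \exp\left(\varphi\right)$:

$$
\log \left| \frac{d\sigma}{d\varphi} \right| = \log\left( \exp\left(\varphi\right) \right) = \varphi
$$

This is why an additive term equal to $\varphi$ must be included in the log-posterior.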
The comments in the code chunk below tell you what you need to fill in. The unknown parameters to learn are contained within the first input argument, `unknowns`. You **must** assume that the unknown $\boldsymbol{\beta}$-parameters are before the unknown $\varphi$ parameter in the `unknowns` vector. You must specify the number of $\boldsymbol{\beta}$ parameters programmatically to allow scaling up your function to an arbitrary number of unknowns. You will assume that all variables contained in the `my_info` list (the second argument to `lm_logpost()`) are the same fields in the `info_quad` list you defined in Problem 3a).
**Define the log-posterior function by completing the code chunk below. You must calculate the mean trend, `mu`, using MATRIX MATH between the design matrix and the unknown $\boldsymbol{\beta}$ column vector. After you complete the function, test that it works by evaluating the log-posterior at two different sets of parameter values. Try values of -1 for all parameters, and then try out values of 1 for all parameters.**
*HINT*: If you have successfully completed the log-posterior function, you should get a value of `-1161.363` for the -1 guess values, and a value of `-137.0484` for the +1 guess values.
*HINT*: Don't forget about useful data type conversion functions such as `as.matrix()` and `as.vector()` (or `as.numeric()`).
#### SOLUTION
```{r, solution_03b, eval=TRUE}
lm_logpost <- function(unknowns, my_info)
{
# specify the number of unknown beta parameters
length_beta <- ncol(my_info$design_matrix)
# extract the beta parameters from the `unknowns` vector
beta_v <- unknowns[1:length_beta]
# extract the unbounded noise parameter, varphi
lik_varphi <- unknowns[length_beta+1]
# back-transform from varphi to sigma
lik_sigma <- exp(lik_varphi)
# extract design matrix
X <- my_info$design_matrix
# calculate the linear predictor
mu <- as.vector(X %*% as.matrix(beta_v))
# evaluate the log-likelihood
log_lik <- sum(dnorm(x= my_info$yobs, mean = as.numeric(mu), sd = lik_sigma, log = TRUE))
# evaluate the log-prior
log_prior_beta <- sum(dnorm(x = beta_v, mean = my_info$mu_beta, sd = my_info$tau_beta, log = TRUE))
log_prior_sigma <- dexp(x= lik_sigma, rate = my_info$sigma_rate, log = TRUE)
# add the mean trend prior and noise prior together
log_prior <- log_prior_beta + log_prior_sigma
# account for the transformation
log_derive_adjust <- lik_varphi
# sum together
return(log_lik+log_prior+log_derive_adjust)
}
```
Test out `lm_logpost()` with guess values of -1 for all parameters.
```{r, solution_03b_c}
###
lm_logpost(c(-1,-1,-1,-1), info_quad)
```
Test out `lm_logpost()` with guess values of 1 for all parameters.
```{r, solution_03b_b}
###
lm_logpost(c(1,1,1,1), info_quad)
```
### 3c)
The `my_laplace()` function is started for you in the code chunk below. You must fill in the portion after the optimization is executed. Although you have filled in these elements before it is good to review to make sure you remember how the posterior covariance matrix is calculated! Also, you must complete the calculation of the `int` variable which stands for "integration" and equals the Laplace Approximation's estimate to the Evidence.
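As a reminder, with $p$ unknown parameters, posterior mode $\boldsymbol{\theta}_{MAP}$, and posterior covariance matrix $\boldsymbol{\Sigma}$ (the negative inverse of the Hessian at the mode), the Laplace Approximation estimates the log-Evidence as:

$$
\log\left[\mathrm{Evidence}\right] \approx \frac{p}{2} \log\left(2\pi\right) + \frac{1}{2} \log\left( \det\left( \boldsymbol{\Sigma} \right) \right) + \log\left[ p\left( \mathbf{y}, \boldsymbol{\theta}_{MAP} \right) \right]
$$

The last term is the un-normalized log-posterior evaluated at the mode, which is exactly what `logpost_func(mode, ...)` returns.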
**Complete the `my_laplace()` function below and then fit the Bayesian linear model using a starting guess of zero for all parameters. Assign your result to the `laplace_quad` object. Print the posterior mode and posterior standard deviations to the screen.**
**Should you be concerned about the initial guess impacting the posterior results?**
#### SOLUTION
```{r, solution_03c, eval=TRUE}
my_laplace <- function(start_guess, logpost_func, ...)
{
# code adapted from the `LearnBayes` function `laplace()`
fit <- optim(start_guess,
logpost_func,
gr = NULL,
...,
method = "BFGS",
hessian = TRUE,
control = list(fnscale = -1, maxit = 1001))
mode <- fit$par
post_var_matrix <- -solve(fit$hessian)
p <- length(mode) # number of unknown parameters
int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
# package all of the results into a list
list(mode = mode,
var_matrix = post_var_matrix,
log_evidence = int,
converge = ifelse(fit$convergence == 0,
"YES",
"NO"),
iter_counts = as.numeric(fit$counts[1]))
}
```
Fit the Bayesian linear model.
```{r, solution_03c_b, eval=TRUE}
laplace_quad <- my_laplace(c(0,0,0,0), lm_logpost, info_quad)
```
Display the posterior mode and posterior standard deviations.
```{r, solution_03c_c}
###
laplace_quad$mode
```
```{r}
sqrt(diag(laplace_quad$var_matrix))
```
Should you be worried about the initial guess impacting the results?
No, the initial guess should not impact the posterior results. The log-posterior for this linear model is well behaved with a single mode, so the optimizer converges to the same mode from any reasonable starting guess.
### 3d)
The `generate_lm_post_samples()` function is started for you in the code chunk below. The first argument, `mvn_result`, is the Laplace Approximation result object returned from the `my_laplace()` function. The second argument, `length_beta`, specifies the number of mean trend $\boldsymbol{\beta}$-parameters in the model. The naming of the variables is taken care of for you. This function should look quite similar to the one from the previous assignment...
After completing the function, generate 2500 posterior samples of the parameters in your model and assign the result to the `post_samples_quad` object. You will then use the posterior samples to study the posterior distribution on the slope on the quadratic feature, $\beta_2$. The naming convention provided by the `generate_lm_post_samples()` function names $\beta_2$ as the `beta_02` column.
**Complete the `generate_lm_post_samples()` function below. After completing the function, generate 2500 posterior samples and assign the result to the `post_samples_quad` object. Create a histogram with 55 bins using `ggplot2` for the slope `beta_02`. Calculate the probability that the slope is positive.**
#### SOLUTION
```{r, solution_03d, eval=TRUE}
generate_lm_post_samples <- function(mvn_result, length_beta, num_samples)
{
MASS::mvrnorm(n = num_samples,
mu = mvn_result$mode ,
Sigma = mvn_result$var_matrix ) %>%
as.data.frame() %>% tibble::as_tibble() %>%
purrr::set_names(c(sprintf("beta_%02d", 0:(length_beta-1)), "varphi")) %>%
# back transform varphi to sigma!!!
mutate(sigma = exp(varphi) )
}
```
Generate posterior samples.
```{r, solution_03d_b, eval=TRUE}
set.seed(87123)
N=2500
post_samples_quad <- generate_lm_post_samples(laplace_quad, ncol(Xmat_quad_big), N)
```
Create the posterior histogram on $\beta_2$, calculate the posterior 95% uncertainty interval, and calculate the probability that $\beta_2$ is greater than zero.
Add as many code chunks as you feel are necessary.
```{r}
post_samples_quad %>%
ggplot(mapping = aes(x = beta_02)) +
geom_histogram(bins = 55)
```
```{r}
### empirical 95% uncertainty interval from the posterior samples
quantile(post_samples_quad$beta_02, c(0.025, 0.975))
```
```{r}
mean(post_samples_quad$beta_02>0)
```
## Problem 04
Now that you can fit a Bayesian linear model, it's time to work with making posterior predictions from the model. You will use those predictions to calculate and summarize the errors of the model relative to observations. Since we have discussed RMSE and R-squared throughout the semester, you will work with the Mean Absolute Error (MAE) metric.
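As a reminder, with $N$ observations $y_n$ and corresponding model predictions $\hat{y}_n$, the Mean Absolute Error is:

$$
\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - \hat{y}_n \right|
$$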
### 4a)
The `post_lm_pred_trend_samples()` function is started in the code chunk below. This function generates posterior mean trend predictions. The first argument, `Xnew`, is a potentially new or test design matrix that we wish to make predictions for. The second argument, `Bmat`, is a matrix of posterior samples of the $\boldsymbol{\beta}$-parameters. The `Xnew` matrix has rows equal to the number of predictions points, `M`, and the `Bmat` matrix has rows equal to the number of posterior samples `S`.
You must complete the function by performing the necessary matrix math to calculate the matrix of posterior mean trend predictions, `Umat`.
The `post_lm_pred_trend_samples()` returns the `Umat` and `Ymat` matrices contained within a list.
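The required matrix math follows from the dimensions: `Xnew` is $M \times \left(J+1\right)$ and `Bmat` is $S \times \left(J+1\right)$, so the $M \times S$ matrix of posterior mean trend predictions is:

$$
\mathbf{U} = \mathbf{X}_{new} \mathbf{B}^{T}
$$

Each column of $\mathbf{U}$ is the mean trend evaluated at all $M$ prediction points using one posterior sample of the $\boldsymbol{\beta}$-parameters.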
**Perform the necessary matrix math to calculate the matrix of posterior predicted mean trends `Umat`.**
#### SOLUTION
```{r, solution_02a, eval=TRUE}
post_lm_pred_trend_samples <- function(Xnew, Bmat, sigma_vector)
{
# matrix of posterior trend predictions
Umat <- Xnew %*% t(Bmat)
# create a matrix of random values from an arbitrary Gaussian by generating random values from a standard normal and rescaling
Rmat <- matrix(rep(sigma_vector, nrow(Xnew)), nrow(Xnew), byrow = TRUE)
Zmat <- matrix(rnorm(nrow(Xnew)*nrow(Bmat)), nrow(Xnew), byrow = TRUE)
#calculate $\mu$ + $\sigma$*z
Ymat <- Umat + Rmat * Zmat
# return Umat, completed for you
return(list(Umat = Umat, Ymat=Ymat))
}
```
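To sanity check the matrix math, here is a minimal self-contained sketch with toy sizes (`M = 4` prediction points, `P = 3` $\beta$-parameters, `S = 5` posterior samples); the `_toy` objects are illustrative stand-ins for the real design matrix and posterior samples:

```{r}
set.seed(4321)
M <- 4; P <- 3; S <- 5
Xnew_toy <- matrix(rnorm(M * P), nrow = M)   # M x P design matrix
Bmat_toy <- matrix(rnorm(S * P), nrow = S)   # S x P posterior beta samples
sigma_toy <- rexp(S)                         # S posterior sigma samples

Umat_toy <- Xnew_toy %*% t(Bmat_toy)         # (M x P) %*% (P x S) -> M x S
Rmat_toy <- matrix(rep(sigma_toy, M), M, byrow = TRUE)  # column j repeats sigma_j
Zmat_toy <- matrix(rnorm(M * S), M)          # M x S standard normal draws
Ymat_toy <- Umat_toy + Rmat_toy * Zmat_toy

dim(Umat_toy)   # M rows (prediction points) by S columns (posterior samples)
dim(Ymat_toy)
```

Filling `Rmat_toy` by row means each row equals `sigma_toy`, so every column is constant at the corresponding posterior $\sigma$ sample, which is exactly what the rescaling step needs.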
### 4b)
The code chunk below is completed for you. The function `make_post_lm_pred()` is a wrapper which calls the `post_lm_pred_trend_samples()` function. It contains two arguments. The first, `Xnew`, is a test design matrix. The second, `post`, is a data.frame of posterior samples. The function extracts the $\boldsymbol{\beta}$-parameter posterior samples and converts the object to a matrix. It also extracts the posterior samples on $\sigma$ and converts them to a vector. We will use the posterior samples on $\sigma$ in the next assignment, but the line of code is included below for demonstration purposes.
```{r, create_lm_pred_wrapper}
make_post_lm_pred <- function(Xnew, post)
{
Bmat <- post %>% select(starts_with("beta_")) %>% as.matrix()
sigma_vector <- post %>% pull(sigma)
post_lm_pred_trend_samples(Xnew, Bmat, sigma_vector)
}
```
You now have enough pieces in place to generate posterior predictions from your model.
**Make posterior predictions of the trend on the "big" quadratic demo training set. Assign the posterior trend predictions to the `post_trend_samples_quad` object. What are the dimensions of `post_trend_samples_quad`? What do the rows and columns correspond to?**
#### SOLUTION
Make posterior predictions on the training set.
```{r, solution_04b, eval=TRUE}
post_trend_samples_quad <- make_post_lm_pred(Xmat_quad_big, post_samples_quad)
```
The dimensionality of the posterior predicted mean trend matrix is:
```{r, solution_04b_b, eval=TRUE}
###
dim(post_trend_samples_quad$Umat)
```
What do the rows and columns correspond to?
The rows correspond to the prediction points (here, the training set observations), while the columns correspond to the posterior samples of the $\boldsymbol{\beta}$-parameters. Each column is therefore one complete set of mean trend predictions generated by a single posterior sample.
### 4c)
You will now use the model predictions to calculate the error between the model and the training set observations. Since you generated 2500 posterior samples, you have 2500 different sets of predictions! However, to get started you will focus on the first 3 posterior samples.
**Calculate the error between the predicted mean trend and the training set observations for each of the first 3 posterior predicted samples. Assign the errors to separate vectors, as indicated in the code chunk below.**
#### SOLUTION
```{r, solution_04c, eval=TRUE}
### error of the first posterior sample
error_quad_post_01 <- c(post_trend_samples_quad$Umat[,1] - quad_data_big$y)
### error of the second posterior sample
error_quad_post_02 <- c(post_trend_samples_quad$Umat[,2] - quad_data_big$y)
### error of the third posterior sample
error_quad_post_03 <- c(post_trend_samples_quad$Umat[,3] - quad_data_big$y)
```
### 4d)
You will now calculate the Mean Absolute Error (MAE) associated with each of the three error samples calculated in Problem 4c). However, before calculating the MAE, let's first consider the dimensions of the `error_quad_post_01`. What is the length of the `error_quad_post_01` vector? When you take the absolute value and then average across all elements in that vector, what are you averaging over?
**What is the length of the `error_quad_post_01` vector? Calculate the MAE associated with each of the 3 error vectors you calculated in Problem 4c).**
**What are you averaging over when you calculate the mean absolute error? Are the three MAE values the same? If not, why would they be different?**
*HINT*: The absolute value can be calculated with the `abs()` function.
#### SOLUTION
The length of the `error_quad_post_01` vector is:
```{r, solution_04d}
###
length(error_quad_post_01)
```
Now calculate the MAE associated with each of the first three posterior samples.
```{r, solution_04d_b, eval=TRUE}
mae_quad_post_01 <- mean(abs(error_quad_post_01))
mae_quad_post_02 <- mean(abs(error_quad_post_02))
mae_quad_post_03 <- mean(abs(error_quad_post_03))
```
Are the three MAE values the same?
```{r}
mae_quad_post_01 == mae_quad_post_02
```
```{r}
mae_quad_post_02 == mae_quad_post_03
```
When we calculate the mean absolute error, we are averaging over the training set observations for a single posterior sample. No, the three MAE values are not the same: each posterior sample has different values of the $\boldsymbol{\beta}$-parameters, so each sample produces a different mean trend prediction and therefore different errors relative to the observations.
### 4e)
In Problem 4d), you calculated the MAE associated with the first 3 posterior samples. You will now work through calculating the MAE associated with every posterior sample. Although it might seem like you need to use a for-loop to do so, `R` will simplify the operation for you. If you perform an addition or subtraction between a matrix and a vector, `R` will find the dimension that matches between the two and repeat the action over the other dimension (this repetition is referred to as *broadcasting*; in `R` the mechanism is called *recycling*). Consider the code below, which has a vector, `a_numeric`, subtracted from a matrix `a_matrix`:
`a_matrix - a_numeric`
Assuming that `a_matrix` has 10 rows and 25 columns and `a_numeric` is length 10, `R` will subtract `a_numeric` from each column in `a_matrix`. The result will be another matrix with the same dimensionality as `a_matrix`. To confirm this is the case, consider the example below where a vector of length 2 is subtracted from a matrix of 2 rows and 4 columns. The resulting dimensionality is 2 rows by 4 columns.
```{r, example_subtraction}
### a 2 x 4 matrix
matrix(1:8, nrow = 2, byrow = TRUE)
### a vector length 2
c(1, 2)
### subtracting the two yields a matrix
matrix(1:8, nrow = 2, byrow = TRUE) - c(1, 2)
```
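For completeness, the same operation can be written explicitly with `sweep()`, which documents the margin being operated over rather than relying on recycling; a small self-contained sketch:

```{r}
a_matrix <- matrix(1:8, nrow = 2, byrow = TRUE)
a_numeric <- c(1, 2)

# recycling-based subtraction: the vector is subtracted from each column
recycled <- a_matrix - a_numeric

# the explicit equivalent: sweep over MARGIN = 1 (the rows)
swept <- sweep(a_matrix, MARGIN = 1, STATS = a_numeric, FUN = "-")

all(recycled == swept)
```

`sweep()` is easier to read and fails loudly if the dimensions do not line up, so it can be a safer choice in longer scripts.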
You will use this fact to calculate the error associated with each training point and each posterior sample.
**Calculate the absolute value of the error between the mean trend matrix and the training set response. Assign the result to the `absError_quad_mat` object. Print the dimensions of the `absError_quad_mat` matrix to screen.**
#### SOLUTION
```{r, solution_04e, eval=TRUE}
absError_quad_mat <- abs(post_trend_samples_quad$Umat - quad_data_big$y)
### dimensions?
dim(absError_quad_mat)
```
### 4f)
You must now summarize the absolute error values by averaging them appropriately. Should you average across the rows or down the columns? In `R`, the `colMeans()` function calculates the average value associated with each column in a matrix and returns a vector. Likewise, the `rowMeans()` function calculates the average value along each row and returns a vector.
**Which function, `colMeans()` or `rowMeans()`, should you use to calculate the MAE associated with each posterior sample?**
**Calculate the MAE associated with each posterior sample and assign the result to the `MAE_quad` object. Print the data type (the class) of the `MAE_quad` to the screen and display its length. Check your result is consistent with the MAEs you previously calculated in Problem 4d).**
#### SOLUTION
Which function? Since each row corresponds to a training observation and each column corresponds to a posterior sample, `colMeans()` should be used: it averages down the rows within each column, producing one MAE per posterior sample.
```{r, solution_04f, eval=TRUE}
MAE_quad <- colMeans(absError_quad_mat)
### data type?
class(MAE_quad)
### length?
length(MAE_quad)
```
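To see concretely that `colMeans()` averages down the rows within each column (one MAE per posterior sample), here is a small self-contained illustration with a toy error matrix; the `toy_abs_err` values are made up for illustration:

```{r}
# toy M x S absolute error matrix: 3 observations (rows), 4 posterior samples (columns)
toy_abs_err <- matrix(c(1, 2, 3,
                        2, 4, 6,
                        1, 1, 1,
                        0, 3, 3), nrow = 3)

colMeans(toy_abs_err)        # one MAE per posterior sample (length 4)
apply(toy_abs_err, 2, mean)  # the same result, written explicitly
```

`rowMeans()` would instead average across the posterior samples for each observation, which is not the quantity asked for here.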
Check with the results you calculated previously.
```{r, solution_04f_b, eval=TRUE}
### your code here
MAE_quad[1] == mae_quad_post_01
```
```{r}
MAE_quad[2] == mae_quad_post_02
```
```{r}
MAE_quad[3] == mae_quad_post_03
```
### 4g)
You have calculated the MAE associated with each posterior sample, and thus represented the uncertainty in the MAE!
**Use the `quantile()` function to print out summary statistics associated with the MAE. You can use the default arguments, and thus pass in `MAE_quad` into `quantile()` without setting any other argument.**
**Why is the MAE uncertain? Or put another way, what causes the MAE to be uncertain?**
#### SOLUTION
Calculate the quantiles of the MAE below.
```{r, solution_04g, eval=TRUE}
###
quantile(MAE_quad)
```
Why is the MAE uncertain?
The MAE depends on the $\boldsymbol{\beta}$-parameters (the regression coefficients), which are themselves uncertain. Each posterior sample of the $\boldsymbol{\beta}$-parameters yields a different mean trend prediction, and therefore a different MAE, so the posterior uncertainty in the parameters carries through to uncertainty in the MAE.
## Problem 05
You will now make use of the model fitting and prediction functions you created in the previous problems to study the behavior of a more complicated non-linear modeling task. The code chunk below reads in two data sets. Both consist of two continuous variables, an input `x` and a response `y`. The first, `train_df`, will serve as the training set, and the second, `test_df`, will serve as a hold-out test set. You will only fit models based on the training set.
```{r, read_in_train_test_02, eval=TRUE}
file_for_train <- "hw07_train_data.csv"
train_df <- readr::read_csv(file_for_train, col_names = TRUE)
file_for_test <- "hw07_test_data.csv"
test_df <- readr::read_csv(file_for_test, col_names = TRUE)
```
Glimpses are provided for each data set below.
```{r, show_train_data_glimpse}
train_df %>% glimpse()
```
```{r, show_test_data_glimpse}
test_df %>% glimpse()
```
### 5a)
It's always a good idea to start out by visualizing the data before modeling.
**Create a scatter plot between the response and the input with `ggplot2`. Include both the training and test sets together in one graph. Use the marker color to distinguish between the two.**
#### SOLUTION
```{r solution_05a}
###
train_df %>%
  ggplot(mapping = aes(x = x, y = y, color = "train")) +
  geom_point() +
  geom_point(data = test_df, mapping = aes(color = "test"))
```
### 5b)
Hopefully you can tell from the previous visualization that the response is non-linearly related to the input. You will use basis functions to try and model this non-linear relationship. You could manually try out various models of different levels of complexity. However, let's build off the functions you created in the previous questions to programmatically train and assess many models of varying complexity.
Specifically, you will fit 25 different models to the training set. You will consider natural splines from 1 degree of freedom up to 49 degrees of freedom, using only odd values for the degrees of freedom. Your goal will be to find which spline is the "best" in terms of generalizing from the training set to the test set. To do so, you will calculate the MAE on the training set and on the test set for each model. It would be tedious to define the necessary information by hand, manually train each model, generate the posterior samples, and make predictions from each model. Therefore, you will work through completing functions that will enable you to programmatically loop over each candidate model.
You will begin by completing a function to create the training set and test set for a desired spline basis. The function `make_spline_basis_mats()` is started for you in the first code chunk below. The first argument is the desired spline basis degrees of freedom, `J`. The second argument is the training set, `train_data`, and the third argument is the hold-out test set, `test_data`.
The second through fifth code chunks below are provided to check that you completed the function correctly. The second code chunk uses `purrr::map_dfr()` to create the training and test matrices for all 25 models. A glimpse of the resulting object, `spline_matrices`, is displayed to the screen in the third code chunk and is printed to the screen in the fourth code chunk. You should see a `tibble` consisting of two variables, `design_matrix` and `test_matrix`. Both variables are lists containing matrices. The matrices contained in the `spline_matrices$design_matrix` variable are the training design matrices, while the matrices contained in `spline_matrices$test_matrix` are the corresponding hold-out test basis matrices.
The fifth code chunk below prints the dimensionality of the 1st and 2nd order spline basis matrices to the screen. It shows that to access a specific matrix, you need to use the `[[]]` notation.
**Complete the first code chunk below. You must specify the `splines::ns()` function call correctly such that the degrees-of-freedom, `df`, argument equals the user specified degrees of freedom, `J`, and that the basis is applied to the `x` variable within the user supplied `train_data` argument. The interior and boundary knots are extracted and saved to the `interior_knots` and `boundary_knots` objects, respectively. Create the training design matrix by calling the `model.matrix()` function with the `splines::ns()` function to create the basis for the `x` variable. The `interior_knots` object must be assigned to the `knots` argument and the `boundary_knots` object must be assigned to the `Boundary.knots` argument. Make sure you assign the data sets correctly to the `data` argument of `model.matrix()`.**
**How many rows are in the training matrices and how many rows are in the test matrices?**
*NOTE*: The `make_spline_basis_mats()` function **ASSUMES** the `train_data` and `test_data` arguments contain a column named `x`.
#### SOLUTION
Define the `make_spline_basis_mats()` function.
```{r, solution_05b, eval=TRUE}
make_spline_basis_mats <- function(J, train_data, test_data)
{
train_basis <- splines::ns(train_data$x, df = J)
interior_knots <- as.vector(attributes(train_basis)$knots)
boundary_knots <- as.vector(attributes(train_basis)$Boundary.knots)
train_matrix <- model.matrix(~ splines::ns(x, knots = interior_knots, Boundary.knots = boundary_knots), data = train_data)
test_matrix <- model.matrix(~ splines::ns(x, knots = interior_knots, Boundary.knots = boundary_knots), data = test_data)
tibble::tibble(
design_matrix = list(train_matrix),
test_matrix = list(test_matrix)
)
}
```
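As a side check (not required by the assignment), a `J` degree-of-freedom natural spline basis from `splines::ns()` has `J` columns, so the `model.matrix()` result, which prepends an intercept, has `J + 1` columns. A minimal self-contained sketch with toy inputs (`x_toy` and the `J` value here are illustrative):

```{r}
# toy input values standing in for the training input
x_toy <- seq(-3, 3, length.out = 50)
J <- 5

# the raw natural spline basis has J columns
basis_toy <- splines::ns(x_toy, df = J)
ncol(basis_toy)

# model.matrix() adds the intercept, giving J + 1 columns
Xmat_toy <- model.matrix(~ splines::ns(x, df = J), data = data.frame(x = x_toy))
ncol(Xmat_toy)
```

This is why extracting and reusing the training set's interior and boundary knots matters: the test basis must be built from the *training* knot locations, not recomputed from the test inputs.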
Create each of the training and test basis matrices.
```{r, solution_05b_b, eval=TRUE}
spline_matrices <- purrr::map_dfr(seq(1, 50, by = 2),
make_spline_basis_mats,
train_data = train_df,
test_data = test_df)
```
Get a glimpse of the structure of `spline_matrices`.
```{r, solution_05b_c, eval=TRUE}
glimpse(spline_matrices)
```
Display the elements of `spline_matrices` to the screen.
```{r, solution_05b_d, eval=TRUE}
spline_matrices
```
The code chunk below is created for you. It shows how to check the dimensions of several training and test matrices.
```{r, solution_05b_e, eval=TRUE}
dim(spline_matrices$design_matrix[[1]])
dim(spline_matrices$test_matrix[[1]])
dim(spline_matrices$design_matrix[[2]])
dim(spline_matrices$test_matrix[[2]])
```
**How many rows are the training design matrix and how many rows are in the test matrix?**
There are 150 rows in each training design matrix and 50 rows in each test matrix.
### 5c)
Each element in the `spline_matrices$design_matrix` object is a separate design matrix. You will use this structure to programmatically train the models. The first code chunk creates a list of information which stores the training set responses and defines the prior parameters. The second code chunk below defines the `manage_spline_fit()` function. The first argument is a design matrix `Xtrain`, the second argument is the log-posterior function, `logpost_func`, and the third argument is `my_settings`. `manage_spline_fit()` sets the initial starting values for the $\boldsymbol{\beta}$-parameters by generating random values from a standard normal. The initial value for the unbounded $\varphi$ parameter is set by log-transforming a random draw from the prior on $\sigma$. After creating the initial guess values, the `my_laplace()` function is called to fit the model.
You will complete both code chunks in order to programmatically train all 25 spline models. After completing the first two code chunks, the third code chunk performs the training for you. The fourth code chunk below shows how to access training results associated with the second trained natural spline which has 3 degrees-of-freedom by using the `[[]]` operator. The fifth code chunk checks that each model converged.
**Complete the first two code chunks below. In the first code chunk, assign the training responses to the `yobs` variable within the `info_splines_train` list. Specify the prior mean and prior standard deviation on the $\boldsymbol{\beta}$-parameters to be 0 and 20, respectively. Specify the rate parameter on the unknown $\sigma$ to be 1.**
**Complete the second code chunk by generating a random starting guess for all $\boldsymbol{\beta}$-parameters from a standard normal. Create the random initial guess for $\varphi$ by generating a random number from the Exponential prior distribution on $\sigma$ and log-transforming the variable. Complete the call to the `my_laplace()` function by passing in the initial values as a vector of correct size.**
*HINT*: How can you determine the number of unknown $\boldsymbol{\beta}$-parameters if you know the training design matrix?
**Please make sure that you assign `yobs` as a "regular" vector and NOT as a matrix data type.**
#### SOLUTION
Assemble the list of required information.
```{r, solution_05c_a, eval=TRUE}
info_splines_train <- list(
yobs = train_df$y,
mu_beta = 0,
tau_beta = 20,