---
title: "Assignment 1: Linear model selection"
author: Thomas Vanpoucke, Boris Shilov
output:
  bookdown::pdf_document2:
    fig_height: 4
    fig_width: 5
    number_sections: true
---
```{r include=FALSE}
library(glmnet)
library(pls)
library(leaps)
set.seed(10)
prostate = get(load("prostate2.Rdata"))
prostate$svi = as.factor(prostate$svi)
attach(prostate)
options(digits=3)
```
# Variable comparison and evaluation
There are 97 data points, seven predictor variables and one response variable, Cscore. There is one binary variable, svi. Looking at the scatterplots below, there appear to be a few strong positive correlations, such as those between Cscore and lpsa, and between lcavol and lpsa.
Looking at the first data points, we noticed some recurring values, particularly in the lbph and lcp predictor variables.
```{r fig.width=10}
head(prostate)
summary(prostate)
plot(prostate)
```
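These visual impressions can be quantified with a quick correlation check; a minimal sketch, which simply excludes the factor svi before calling `cor`:
```{r}
# Correlation matrix of the numeric variables (svi is a factor and is excluded).
num_vars = sapply(prostate, is.numeric)
round(cor(prostate[, num_vars]), 2)
```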
# Forward stepwise selection
We implemented Algorithm 6.2 for forward stepwise selection. To choose which predictor to add in each round, we picked the candidate model with the highest $R^2$, as per the algorithm. To select the single best model from the resulting sequence, we used BIC. The full function code can be found in the source file for this document.
```{r echo = FALSE}
forward_stepwise <- function(data){
  data_vars = names(data)
  # all the predictors that have been selected so far
  predictor_list = c()
  # the names of the best models of each round of stepwise selection
  models <- paste("Mk", 1:(length(data_vars) - 1), sep = "")
  for (j in 1:(length(data_vars) - 1)){ # -1 because Cscore is not a predictor
    rsq = 0
    best_predictor = NULL
    name = paste("Mk", j, sep = "")
    for (i in 2:length(data_vars)){ # starts at 2 because Cscore, the response variable, is column 1
      if (!is.element(data_vars[i], predictor_list)){
        # build the formula Cscore ~ (all variables selected in previous rounds) + this variable
        candidate_formula = paste(data_vars[1], "~",
                                  paste(c(predictor_list, data_vars[i]), collapse = " + "))
        Mki <- lm(as.formula(candidate_formula), data = data)
        new_rsq = summary(Mki)$r.squared
        # if this model has a higher R^2 than the best model so far this round,
        # make this the selected variable
        if (new_rsq > rsq){
          rsq = new_rsq
          best_predictor = data_vars[i]
          assign(name, Mki)
        }
      }
    }
    # append the selected variable of this round to the list of used predictor variables
    predictor_list <- c(predictor_list, best_predictor)
  }
  # based on the list of model names, collect all the forward stepwise models in a list
  model_list <- lapply(models, function(x) get(x))
  # compute a BIC value for every forward stepwise model
  BIC_models = sapply(model_list, BIC)
  # return the model with the lowest BIC
  best_model_nr = which.min(BIC_models)
  return(model_list[[best_model_nr]])
}
result = forward_stepwise(prostate)
result
summary(result)
```
Thus, the best model returned by our algorithm contains only two predictor variables, lpsa and svi:
$$
Cscore = `r result$coefficients[1]` + `r result$coefficients[2]` \times lpsa + `r result$coefficients[3]` \times svi1
$$
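As a sanity check, the same kind of forward selection with a BIC criterion can be reproduced with `regsubsets` from the leaps package, which is already loaded above; a minimal sketch:
```{r}
# Built-in forward stepwise selection; summary()$bic holds the BIC of the best
# model of each size, and which.min picks the best size overall.
check = regsubsets(Cscore ~ ., data = prostate, method = "forward", nvmax = 7)
coef(check, which.min(summary(check)$bic))
```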
# Lasso
To perform cross-validated lasso and ridge regression, we first split the dataset into a training and a test set. We fitted a model on the training data and cross-validated it within that training data, selecting the $\lambda$ with the lowest cross-validation error. The test set was then used to compute the mean squared test error at that best $\lambda$. Finally, we used the entire dataset and the best $\lambda$ value to construct our lasso model.
```{r}
x = model.matrix(Cscore~.,prostate)[,-1]
y = prostate$Cscore
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
grid=10^seq(10,-2,length=100)
lasso.model=glmnet(x[train,],y[train],alpha=1,lambda=grid, thresh = 1e-12)
plot(lasso.model, label = TRUE)
plot(lasso.model,xvar="lambda",label=TRUE)
```
```{r}
OLS.model=glmnet(x[train,],y[train],alpha=1,lambda=0, thresh = 1e-12)
```
```{r}
cv.lasso.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.lasso.out)
```
```{r}
bestlambda.lasso=cv.lasso.out$lambda.min
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.pred = predict(lasso.model,s=bestlambda.lasso,newx=x[test,])
lasso.mean = mean((lasso.pred-y.test)^2)
lasso.coef = predict(out,type="coefficients",s=bestlambda.lasso)[1:8,]
OLS.pred = predict(OLS.model, s=0, newx=x[test,])
OLS.mean = mean((OLS.pred-y.test)^2)
OLS.coef = predict(out, type="coefficients", s=0)[1:8,]
lasso.mean
OLS.mean
lasso.coef
OLS.coef
```
We can clearly see that the lasso zeroes out several of our variables, substantially reducing model complexity. Two of the retained variables, lpsa and svi, agree with our forward stepwise selection. Since the lasso test error is also lower than the OLS test error, the lasso is clearly preferable to ordinary least squares in this instance. The resulting lasso model becomes:
$$
Cscore = `r lasso.coef[1]` + `r lasso.coef[6]` \times svi1 + `r lasso.coef[7]` \times lcp + `r lasso.coef[8]` \times lpsa
$$
# Ridge
```{r}
grid=10^seq(10,-2,length=100)
ridge.model=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh = 1e-12)
OLS.model.ridge=glmnet(x[train,],y[train],alpha=0,lambda=0, thresh = 1e-12)
plot(ridge.model, label = TRUE)
plot(ridge.model,xvar="lambda",label=TRUE)
```
```{r}
cv.ridge.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.ridge.out)
bestlambda.ridge=cv.ridge.out$lambda.min
```
The best $\lambda$ is the one with the lowest cross-validation error; here it equals `r bestlambda.ridge`.
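Besides `lambda.min`, `cv.glmnet` also reports `lambda.1se`, the largest $\lambda$ whose cross-validation error lies within one standard error of the minimum; choosing it instead would give a more strongly regularised model:
```{r}
# The one-standard-error alternative to lambda.min.
cv.ridge.out$lambda.1se
```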
```{r}
out.ridge=glmnet(x,y,alpha=0,lambda=grid)
ridge.pred = predict(ridge.model,s=bestlambda.ridge,newx=x[test,])
ridge.mean = mean((ridge.pred-y.test)^2)
ridge.coef = predict(out.ridge,type="coefficients",s=bestlambda.ridge)[1:8,]
OLS.pred.ridge = predict(OLS.model.ridge, s=0, newx=x[test,])
OLS.mean.ridge = mean((OLS.pred.ridge-y.test)^2)
OLS.coef.ridge = predict(out.ridge, type="coefficients", s=0)[1:8,]
ridge.mean
OLS.mean.ridge
ridge.coef
OLS.coef.ridge
```
There is still a substantial improvement over OLS with ridge. Note that, unlike the lasso, ridge shrinks coefficients towards zero without setting them exactly to zero, so all variables remain in the model; the largest coefficients belong to svi, lcp and lpsa. Writing out the intercept and these three largest terms, the ridge model is approximately:
$$
Cscore = `r ridge.coef[1]` + `r ridge.coef[6]` \times svi1 + `r ridge.coef[7]` \times lcp + `r ridge.coef[8]` \times lpsa
$$
# Principal Component Regression
## How many principal components would you select for your PCR model?
We would select three principal components, because the cross-validated mean squared error plotted against the number of components reaches a low point there. Taking all seven components gives a still lower error, but then we are practically using all of the original variables. Moreover, three components already explain approximately 78% of the variance in the predictors.
```{r}
set.seed(10)
pcr_model=pcr(Cscore~.,data=prostate,scale=TRUE,validation="CV")
summary(pcr_model)
explvar(pcr_model)
plot(pcr_model, plottype = "scores", comps = 1:3)
plot(pcr_model, "loadings", comps = 1:3, legendpos = "topleft",labels = 1:3)
abline(h = 0)
pcr_model=pcr(Cscore~.,data=prostate,subset=train,scale=TRUE,validation="CV")
validationplot(pcr_model,val.type="MSEP")
pcr_pred=predict(pcr_model,x[test,],ncomp=3)
pcr.mean = mean((pcr_pred-y.test)^2)
pcr_model=pcr(y~x,scale=TRUE,ncomp=3)
summary(pcr_model)
pcr.mean
OLS.mean
```
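The roughly 78% figure quoted above can be read off as a cumulative sum of the per-component explained variances, as a small check on the final three-component fit:
```{r}
# Cumulative percentage of predictor variance explained by the components;
# the third entry is the share captured by the first three components.
cumsum(explvar(pcr_model))
```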
## How appropriate is PCR for this dataset?
Looking at our analysis, there is certainly some benefit to using PCR for this dataset, since three principal components already explain a large share of the variance in the predictors.
However, while the components capture most of the variance in the predictor variables, they explain little of the variance in the response. The principal directions of the predictors thus do not align well with the relationship to the response, which means the assumption on which PCR is based is not fulfilled.
When we cross-validate our three-component model, the test error of the PCR model is higher than the mean squared error of an ordinary linear regression model (see the test errors above).
Based on these remarks, we conclude that PCR is not optimal for this dataset.
# Model Comparison
## How well do they perform in general?
In general, all of the models achieve test errors of a similar order of magnitude; the regularised models (lasso and ridge) perform best, while PCR performs worst (see the test errors collected in the final subsection).
## Do they yield similar models?
Yes, for the most part the same variables are included, although the coefficient values differ from model to model. PCR is the exception, since it performs no variable selection.
## Which variables are the most important in the prediction of the progression of prostate cancer, according to each model?
The forward stepwise, lasso and ridge models all indicate that svi and lpsa are important variables. Lasso and ridge also attribute some influence to lcp.
For PCR, it is harder to say which variable is the most important, since every principal component has loadings on all of the variables. We looked at the largest loadings of the first principal component to determine which variables are important according to PCR. In our case, those are approximately the 1st, 5th, 6th and 7th variables, which correspond to lcavol, svi, lcp and lpsa respectively (see the variable loading plot in the PCR section).
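For reference, the first-component loadings can also be read directly off the fitted object; the largest absolute values mark the most influential variables:
```{r}
# Loadings of the first principal component of the final PCR fit.
round(pcr_model$loadings[, 1], 2)
```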
## Which model do you think is the most appropriate for this dataset?
We selected the best model based on test error. For this comparison to be valid, we first had to refit the forward stepwise model using only the training data.
```{r echo=FALSE}
forward_training = forward_stepwise(prostate[train,])
forward_pred = predict(forward_training, newdata = prostate[test,], type = "response")
forward.mean = mean((forward_pred-y.test)^2)
OLS.mean
forward.mean
lasso.mean
ridge.mean
pcr.mean
```
The values shown are the test errors for an ordinary OLS model, forward stepwise selection, lasso, ridge and PCR respectively. As we can see, only the lasso and ridge test errors fall below the OLS test error, with lasso slightly lower than ridge. This leads us to conclude that our lasso model is the most appropriate for this dataset.
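For easier side-by-side reading, the same errors can be collected into a labelled vector and sorted:
```{r}
# Test errors of all five approaches, sorted from best to worst.
test_errors = c(OLS = OLS.mean, Forward = forward.mean,
                Lasso = lasso.mean, Ridge = ridge.mean, PCR = pcr.mean)
sort(test_errors)
```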