---
title: "Nonlinear Modelling"
author: "Thomas Vanpoucke, Boris Shilov"
output:
  pdf_document: default
  html_notebook: default
---
```{r echo = FALSE, include = FALSE}
library(nlme)
library(ggplot2)
set.seed(10)
prostate = get(load("prostate2.Rdata"))
prostate$svi = as.factor(prostate$svi)
attach(prostate)
options(digits=3)
```
# Exploring the association
First we randomly split the data set into a training and a validation subset of equal size.
```{r}
sample = sample.int(n = nrow(prostate), size = floor(.5*nrow(prostate)), replace = F)
train = prostate[sample,]
test = prostate[-sample,]
```
```{r}
defaultfit = lm(Cscore ~ lpsa, data = prostate, subset = sample)
plot(Cscore ~ lpsa)
abline(defaultfit)
```
```{r}
summary(defaultfit)
```
Even by visual inspection, the trend between Cscore and lpsa does not appear perfectly linear. The summary statistics confirm that a simple linear model explains only about half of the variance in this association (adjusted R² = 50.8%).
Having confirmed our suspicion that there may be some non-linearity involved, we can turn to more powerful diagnostics to determine more conclusively what is happening.
```{r}
plot(defaultfit)
```
The plot of residuals against fitted values clearly resembles a parabola rather than the horizontal band expected when the linearity assumption holds, so we strongly suspect a large deviation from linearity.
The Q-Q plot is less concerning, but a parabolic pattern is still noticeable. The tails of the plot edge upwards, especially in the upper half, where several of the problematic data points are located; data point 96 is particularly problematic. The Q-Q plot thus indicates a modest departure from the assumption of normality.
The scale-location plot again shows a strong parabolic pattern. Although the points are reasonably spread out, the problematic observations at the higher end are highlighted once more, so we suspect that the residual variance is not constant across the predictor range and that the assumption of homoscedasticity is violated as well.
The last diagnostic plot, residuals against leverage, shows that data point 96 is indeed a very influential case, so the model would probably change substantially if we excluded it from the analysis. Data point 1 is also relatively influential.
We can therefore conclude with reasonable confidence that most of the assumptions underlying the linear model do not hold well for this association.
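As a purely illustrative supplement to these visual diagnostics, formal tests of homoscedasticity and residual normality could also be run. The sketch below assumes the lmtest package is installed; it is not used anywhere else in this analysis and the chunk is not evaluated here.
```{r eval = FALSE}
# Optional numerical cross-checks of the visual diagnostics (illustrative only).
# Breusch-Pagan test for heteroscedasticity; requires the lmtest package.
library(lmtest)
bptest(defaultfit)
# Shapiro-Wilk test of the residuals for normality (base R).
shapiro.test(residuals(defaultfit))
```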
# Nonlinear modelling
## Polynomial regression
The first model we construct is a polynomial regression. We fit polynomials up to the 9th degree, since a 9th-degree polynomial uses ten degrees of freedom (including the intercept).
```{r}
# Fit polynomials of degree 1 to 9 on the training set and record the test MSE
poly_list = list()
MSE_list = list()
for (i in 1:9){
  model = lm(Cscore~poly(lpsa,i), data=prostate, subset=sample)
  poly_list = c(poly_list, list(model))
  mse = mean((Cscore-predict(model,prostate))[-sample]^2)
  MSE_list = c(MSE_list, list(mse))
}
# Test MSE per degree and the degree with the lowest test MSE
unlist(MSE_list)
which.min(unlist(MSE_list))
```
As can be seen, the third degree polynomial has the lowest test MSE. We can further confirm this choice with a sequence of ANOVA F-tests between the nested polynomial models.
```{r}
# Compare the nested polynomial fits with sequential F-tests
do.call(anova, poly_list)
```
Model 3 is a significant improvement over Model 2, whereas Model 4 gives no significant improvement under a 0.05 significance threshold. Thus a second- or third-degree polynomial provides a reasonable fit to the data, and combining this with the test MSE result above we pick the third-degree polynomial as the best model with some confidence.
```{r}
# Refit the chosen third degree polynomial on the training set only, so that
# the test MSE computed later is evaluated on genuinely held-out data
best_poly_model = lm(Cscore~poly(lpsa, 3), data = prostate, subset = sample)
lpsalims = range(lpsa)
lpsa.grid = seq(lpsalims[1], lpsalims[2], length.out = 50)
prediction = predict(best_poly_model, newdata = list(lpsa = lpsa.grid), se = T)
SE.bands = cbind(prediction$fit+2*prediction$se.fit, prediction$fit-2*prediction$se.fit)
plot(lpsa, Cscore, xlim=lpsalims)
title("Third degree polynomial")
lines(lpsa.grid, prediction$fit)
matlines(lpsa.grid, SE.bands, lwd=1, col="red", lty=2)
```
## Cubic splines
The next model we construct is a cubic spline. Here we start from four degrees of freedom, since a cubic spline has three polynomial terms (first, second and third order) and at least one knot. We let bs() place the knots at uniform quantiles of lpsa (see the short check after the next chunk).
```{r}
library(splines)
# Fit cubic splines with 4 to 10 degrees of freedom on the training set
cubic_list = list()
MSE_list = c()
for (i in 4:10){
  model = lm(Cscore~bs(lpsa, df = i), data=prostate, subset=sample)
  cubic_list = c(cubic_list, list(model))
  MSE_list = c(MSE_list, mean((Cscore - predict(model, prostate))[-sample]^2))
}
best_cubic_model = cubic_list[[which.min(MSE_list)]]
best_cubic_model
```
```{r}
best_cubic = lm(Cscore~bs(lpsa, df = 4),data=prostate, subset=sample)
lpsalims = range(lpsa)
lpsa.grid = seq(lpsalims[1], lpsalims[2], length.out = 50)
prediction = predict(best_cubic, newdata = list(lpsa = lpsa.grid), se = T)
SE.bands = cbind(prediction$fit+2*prediction$se.fit, prediction$fit-2*prediction$se.fit)
plot(lpsa, Cscore, xlim=lpsalims)
title("Single knot cubic spline")
lines(lpsa.grid, prediction$fit)
matlines(lpsa.grid, SE.bands, lwd=1, col="red", lty=2)
```
## Smoothing Splines
The last model we construct is a smoothing spline. We use the built-in leave-one-out cross-validation to arrive at a good model with an appropriate effective degrees of freedom.
```{r warning = FALSE}
plot(lpsa, Cscore)
# Fit a smoothing spline on the training data, choosing the smoothing
# parameter by leave-one-out cross-validation (cv = TRUE)
fit2 = smooth.spline(lpsa[sample], Cscore[sample], cv=T)
lpsalims = range(lpsa)
lpsa.grid = seq(lpsalims[1], lpsalims[2], length.out = 50)
# predict.smooth.spline takes the new x values directly
prediction = predict(fit2, x = lpsa.grid)
lines(prediction)
fit2$df
# Test MSE on the held-out half of the data
predictedVals = predict(fit2, test$lpsa)$y
errors = test$Cscore - predictedVals
smooth.mse = mean(errors^2)
smooth.mse
```
## Model comparison
We compare all the models to see which one gives us the best test MSE.
```{r}
linear.mse = mean((Cscore-predict(defaultfit, prostate))[-sample]^2)
poly.mse = mean((Cscore-predict(best_poly_model, prostate))[-sample]^2)
cubic.mse = mean((Cscore-predict(best_cubic, prostate))[-sample]^2)
linear.mse
poly.mse
cubic.mse
smooth.mse
```
These are the test mean squared errors for the linear, polynomial, cubic spline and smoothing spline models, respectively. All of the nonlinear models outperform the linear regression, and the polynomial regression still beats the cubic spline, with a test MSE of 260. Even the cubic spline, the worst-performing of the nonlinear models, almost halves the test MSE.
We were surprised that the polynomial regression is the best model, since it is a cubic polynomial and the cubic spline is also built from cubic pieces, in this case with a single knot. We posit that the cause of this difference is that the knot adds undesired variance.
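For readability, the four test MSEs can also be gathered into a single table (a small convenience sketch using the objects computed above):
```{r}
# Collect the test MSEs of the four models for a side-by-side comparison
data.frame(model = c("linear", "third degree polynomial", "cubic spline (df = 4)", "smoothing spline"),
           test.MSE = c(linear.mse, poly.mse, cubic.mse, smooth.mse))
```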
# General additive model
## All variables
```{r}
library(gam)
# Full GAM: a smoothing spline with 5 df for every continuous predictor,
# plus svi as a categorical term
gam.all = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(age,5)+s(lbph,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)
plot(gam.all)
gam.all.mse = mean((Cscore-predict(gam.all, prostate))[-sample]^2)
gam.all.mse
summary(gam.all)
```
Looking at the summary of the full GAM, the predictor variables are all significant or nearly significant, with the exception of svi, which is a categorical variable. This was to be expected, as the model we use is quite flexible.
The test MSE of this model is 570, much lower than the 803 of the forward-selected model from our previous work, so it is the better model.
We can also see that the fitted smooths are less stable near the edges for some of the predictors, because the data are sparse at these extreme values.
The model is thus an improvement on our previous work.
## Backward selection
```{r}
# Helper: collect the AIC values stored in objects whose names share a prefix,
# e.g. getAIC("gam.4.", 5) gathers gam.4.1 ... gam.4.5
getAIC <- function(data, nbloops){
  namelist = c()
  for (i in 1:nbloops){
    name = paste(data, i, sep = "")
    namelist = c(namelist, name)
  }
  aiclist = lapply(namelist, function(x) get(x))
  aiclist
}
# Five-variable starting model and all four-variable candidates (AIC only)
gam.5 = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)$aic
gam.4.1 = gam(Cscore~s(lweight,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)$aic
gam.4.2 = gam(Cscore~s(lcavol,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)$aic
gam.4.3 = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(lpsa,5)+svi, data = train)$aic
gam.4.4 = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(lcp,5)+svi, data = train)$aic
gam.4.5 = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(lcp,5)+s(lpsa,5), data = train)$aic
aic_4 = getAIC("gam.4.", 5)
aic_4
gam.4 = gam.4.2  # lowest AIC among the four-variable candidates
# Three-variable candidates, dropping one term at a time from the chosen model
gam.3.1 = gam(Cscore~s(lcp,5)+s(lpsa,5)+svi, data = train)$aic
gam.3.2 = gam(Cscore~s(lcavol,5)+s(lpsa,5)+svi, data = train)$aic
gam.3.3 = gam(Cscore~s(lcavol,5)+s(lcp,5)+svi, data = train)$aic
gam.3.4 = gam(Cscore~s(lcavol,5)+s(lcp,5)+s(lpsa,5), data = train)$aic
aic_3 = getAIC("gam.3.", 4)
aic_3
gam.3 = gam.3.1
# Two-variable candidates
gam.2.1 = gam(Cscore~s(lpsa,5)+svi, data = train)$aic
gam.2.2 = gam(Cscore~s(lcp,5)+svi, data = train)$aic
gam.2.3 = gam(Cscore~s(lcp,5)+s(lpsa,5), data = train)$aic
aic_2 = getAIC("gam.2.", 3)
aic_2
gam.2 = gam.2.3
# One-variable candidates
gam.1.1 = gam(Cscore~s(lpsa,5), data = train)$aic
gam.1.2 = gam(Cscore~s(lcp,5), data = train)$aic
aic_1 = getAIC("gam.1.", 2)
aic_1
gam.1 = gam.1.1
```
To pick the best GAM using backward selection, we first perform the backward search manually. At each step, the best model with a given number of variables is the one with the lowest AIC. The starting model includes every continuous variable as a smoothing spline with five degrees of freedom, plus svi. We use AIC rather than R², since the AIC is readily available in the summary and we are not confident that R² is a good measure for evaluating a GAM.
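For reference, the criterion minimised at each step is the usual Akaike information criterion,
$$
\mathrm{AIC} = 2k - 2\ln(\hat{L}),
$$
where $k$ is the effective number of parameters of the fitted GAM and $\hat{L}$ its maximised likelihood; lower values indicate a better trade-off between fit and complexity.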
```{r}
gam5 = gam(Cscore~s(lcavol,5)+s(lweight,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)
gam4 = gam(Cscore~s(lcavol,5)+s(lcp,5)+s(lpsa,5)+svi, data = train)
gam3 = gam(Cscore~s(lcp,5)+s(lpsa,5)+svi, data = train)
gam2 = gam(Cscore~s(lcp,5)+s(lpsa,5), data = train)
gam1 = gam(Cscore~s(lpsa,5), data = train)
gam5.mse = mean((Cscore-predict(gam5, prostate))[-sample]^2)
gam4.mse = mean((Cscore-predict(gam4, prostate))[-sample]^2)
gam3.mse = mean((Cscore-predict(gam3, prostate))[-sample]^2)
gam2.mse = mean((Cscore-predict(gam2, prostate))[-sample]^2)
gam1.mse = mean((Cscore-predict(gam1, prostate))[-sample]^2)
testerrs = c(gam5.mse,gam4.mse,gam3.mse,gam2.mse,gam1.mse)
testerrs
```
To select among the models with different numbers of variables, we compare their test MSEs on the validation set. The model with two variables has the lowest test MSE, so we use it for the rest of the analysis and vary the degrees of freedom of its two predictor variables.
```{r}
# Grid search over the degrees of freedom of the two smooth terms,
# keeping the combination with the lowest test MSE
MSE = Inf
for (i in 1:15){
  for (j in 1:15){
    model = gam(Cscore~s(lcp,i)+s(lpsa,j), data = train)
    mse = mean((Cscore-predict(model, prostate))[-sample]^2)
    if (mse < MSE){
      i.select = i
      j.select = j
      MSE = mse
    }
  }
}
c(lcp.df = i.select, lpsa.df = j.select)
backwards_selected = gam(Cscore~s(lcp,2)+s(lpsa,3), data = train)
summary(backwards_selected)
```
We select the model with the lowest test MSE as the best model. The degrees of freedom of both predictor variables are varied between one and fifteen; the best combination is two degrees of freedom for lcp and three degrees of freedom for lpsa.
We have also debugged the backward selection routine provided with the gam package, the step.Gam function. The code is available in stepGamfixed.R and is not reproduced here for brevity; a sketch of a typical call is shown further below. The result differs depending on whether backward selection is run on the full data or on the halved training set.
The full dataset takes 19 steps to converge on the model:
$$
\text{Cscore} \sim \text{lcavol} + \text{age} + s(\text{lcp}, 4) + s(\text{lpsa}, 5)
$$
The training set takes seven steps to converge on the model:
$$
\text{Cscore} \sim \text{lcavol} + s(\text{lweight}, 5) + s(\text{age}, 5) + s(\text{lbph}, 2) + s(\text{lcp}, 5) + s(\text{lpsa}, 5)
$$
This has a test MSE of 583. The differences from our implementation can probably be explained by the extra sophistication of the step.Gam implementation, but they also demonstrate once again that stepwise variable selection can produce highly variable results and should be done very carefully, if at all.
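For reference, a minimal sketch of how such a step.Gam call is typically set up is shown below. The scope list is illustrative only; the exact scope used in stepGamfixed.R may differ (for example, it may include intermediate smoothness levels such as s(x, 2) or s(x, 4)), so the chunk is not evaluated here.
```{r eval = FALSE}
# Illustrative only: backward selection with gam::step.Gam.
# Each scope entry lists the forms a term may take, from omitted (~1)
# through linear to a smoothing spline with 5 df.
scope.list = list(
  "lcavol"  = ~1 + lcavol  + s(lcavol, 5),
  "lweight" = ~1 + lweight + s(lweight, 5),
  "age"     = ~1 + age     + s(age, 5),
  "lbph"    = ~1 + lbph    + s(lbph, 5),
  "lcp"     = ~1 + lcp     + s(lcp, 5),
  "lpsa"    = ~1 + lpsa    + s(lpsa, 5),
  "svi"     = ~1 + svi
)
step.Gam(gam.all, scope = scope.list, direction = "backward")
```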
## Comparison to linear models
To compare our final model with the linear models, we also used the test MSE.
The linear models all have a test MSE between 650 and 900; the test mean squared error of our manually backward-selected GAM model is the following:
```{r}
backwards_selected.mse = mean((Cscore-predict(backwards_selected, prostate))[-sample]^2)
backwards_selected.mse
```
It is safe to conclude that this nonlinear model is better than all of the linear models, since it reduces the test error by at least $63.1\%$ compared to the best-performing linear model, the lasso.