---
title: "GLMs, GAMs, and CARTs"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
```{r load, echo=FALSE, cache=FALSE}
load(".Rdata")
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction #
There are a number of extensions of, or alternatives to, regression-like models for circumstances in which the (fairly restrictive) set of assumptions underlying the use of ordinary least squares for fitting the regression model is violated (e.g. binary response variables, relationships that cannot be linearized, etc.). These alternative approaches are not simply workarounds; they are useful in their own right for exploring or describing relationships between a response variable and several predictors.
# Generalized Linear Models (GLMs) #
Generalized linear models (GLMs) can be considered as THE general case of the "General Linear Model" that underlies analysis of variance and regression (and note the subtle distinction between "Generalized" and "General"). GLMs (pronounced "glims") relax many of the assumptions that underlie OLS (ordinary least squares) regression, such as the assumption that the residuals are "i.i.d. normal" (independent (i.e. not autocorrelated) and identically (normally) distributed random variables). Data sets in which the response variable is binary (0 or 1), a proportion or fraction, or a count automatically violate that assumption.
(Note that the "General Linear Model" (which includes ordinary least-squares (OLS) regression and analysis of variance) was developed first, and should be considered a special case of the more recently developed "Generalized Linear Model" (GLM).)
GLMs consist of a linear predictor (which looks like an ordinary regression equation) that, together with a "link function", describes the distribution of the response variable (where that distribution may be one of several from the exponential family of distributions). When the distribution of the dependent variable is Gaussian (i.e. normal), and the link function is just the "identity" function, the model is equivalent to the standard OLS regression model.
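In symbols, the linear predictor and the link function $g(\cdot)$ combine as

$$g(E[Y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

and with a Gaussian response and the identity link ($g(\mu) = \mu$), this reduces to the familiar OLS regression model.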
GLMs thus have two elements that must be chosen based on the data at hand: 1) a *family* attribute that describes the distribution of the errors (residuals), which includes, but is not limited to, the normal distribution assumed in OLS regression, and 2) a *link function* that describes the relationship of the response to the predictor variables. In fitting GLMs, a quantity called the *deviance* is minimized; it is analogous to the sum of squares of deviations minimized in OLS regression.
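The two choices appear directly as arguments in the `glm()` call; a minimal sketch (the response `y` and predictors `x1` and `x2` here are placeholders, not variables from the data sets below):

```{r glm-sketch, eval=FALSE}
# family and link are chosen to match the response variable:
glm(y ~ x1 + x2, family = gaussian(link = "identity"))  # equivalent to lm()
glm(y ~ x1 + x2, family = binomial(link = "logit"))     # binary or proportion response
glm(y ~ x1 + x2, family = poisson(link = "log"))        # count response
```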
This example uses data from Chuck Gottfried's master's thesis (Gottfried, C.E., 1992. Residential wood heating and urban air quality : evaluation of a voluntary wood-heating curtailment program. MA Thesis, University of Oregon, Department of Geography, 83 p.) [[burn_csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/burn_csv).
The objective of this study was to understand the controls on woodstove use in Lane County (i.e. the proportion of wood-stove chimneys that were "hot" (`hotp`) during a particular survey). The potential predictor variables included various meteorological variables, like the dew-point temperature (`dewpt`), for which values that are low relative to the air temperature indicate generally cool, damp conditions, as well as a binary variable indicating whether a wood-stove-use curtailment advisory was issued (to reduce air pollution during cool, stable, foggy conditions).
```{r glm02, eval=TRUE, warning=FALSE, message=FALSE}
# Gottfried MS Thesis data (UO Dept. Geography)
attach(burn)
names(burn)
```
As always, the first thing to do is to examine the data:
```{r glm03}
# examine variables
pairs(cbind(hotp,tmin,tmax,degday,trange,precip,dewpt,wind,api,adv))
```
The matrix scatter plots suggest that minimum temperature `tmin` and dewpoint temperature `dewpt` are correlated with the proportion of wood stoves in use `hotp`, e.g.:
```{r glm04}
plot(hotp ~ dewpt)
```
## OLS regression ##
Here's a standard OLS (linear) regression model with `hotp` as the response and `dewpt` as the predictor:
```{r glm05}
# linear model
burn_lm1 <- lm(hotp ~ dewpt)
burn_lm1
summary(burn_lm1)
# diagnostic plots
opar <- par(mfrow = c(2,2))
plot(burn_lm1, which=c(1,2,4))
hist(burn_lm1$residuals, breaks=20)
par(opar)
```
Note that the histogram of the residuals (lower-right plot) looks nothing like a normal distribution--there appear to be two distinct modes.
The regression line superficially looks fine (left, below), but when plotted over a larger range (right) it shows that unreasonable proportions (greater than 1.0 or less than 0.0) could result.
```{r glm06, fig.width=13}
# plot regression line
opar <- par(mfrow = c(1,2))
plot(hotp ~ dewpt, ylim=c(0,1))
abline(burn_lm1)
# plot regression line again -- larger range
plot(hotp ~ dewpt, ylim=c(0,1), xlim=c(-40,40))
abline(burn_lm1)
par(opar)
```
## GLM regression ##
A generalized linear model can be fit using the `glm()` function. Because the response here is a proportion (based on a number of individual binary (burn/no-burn) observations), the appropriate family is the "binomial" with a "logit" link function, with each observation weighted by the total number of chimneys that were surveyed.
```{r glm07}
# generalized linear model
burn_glm1 <- glm(hotp ~ dewpt, binomial(link = "logit"), weights=total)
burn_glm1
summary(burn_glm1)
# diagnostic plots
opar <- par(mfrow = c(2,2))
plot(burn_glm1, which=c(1,2,4))
hist(burn_glm1$residuals, breaks=20)
par(opar)
```
Note that the regression diagnostic plots are much better "behaved" than previously. Get predicted values for a range of `dewpt` values for plotting:
```{r glm08}
# evaluate predicted values
new_dewpt <- seq(-40,40,.2)
newdata <- data.frame(dewpt=new_dewpt)
hotp_pred1 <- predict(burn_glm1,newdata,type="resp")
```
Here's the regression line generated by the `glm()` function--note that it is curvilinear, and approaches the possible limits of a proportion (0 and 1) asymptotically.
```{r glm09, fig.width=13}
# plot regression line
opar <- par(mfrow = c(1,2))
plot(hotp ~ dewpt, ylim=c(0,1))
lines(hotp_pred1 ~ new_dewpt, col="blue")
# plot regression line again
plot(hotp ~ dewpt, ylim=c(0,1), xlim=c(-40,40))
lines(hotp_pred1 ~ new_dewpt, col="blue")
par(opar)
```
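The asymptotic behavior follows from the logit link: the fitted values are the inverse logit (`plogis()` in base R) of the linear predictor, and so are confined to the interval (0, 1). A minimal check against the predicted values computed above:

```{r glm09b}
# fitted curve = inverse logit of the linear predictor
eta <- coef(burn_glm1)[1] + coef(burn_glm1)[2] * new_dewpt
head(cbind(plogis(eta), hotp_pred1))  # the two columns should match
```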
## A second GLM model ##
Inspection of the matrix scatter plot also suggests that the issuance of a "no-use" advisory (`adv` = 0 if no advisory was issued, `adv` = 1 if one was) is correlated with the proportion of stoves in use--in a perverse way: more stoves were in use when a no-burn advisory was issued than when not. This might be expected, because the meteorological conditions that lead to the issuance of an advisory are just those in which a nice fire would be, well, nice. Look at a model with an interaction (`dewpt*adv`) between dew-point temperature and the issuance of an advisory. (Note that in an R formula, `dewpt*adv` generates the set of predictors `dewpt`, `adv`, and `dewpt:adv`, and does *not* simply include the product of `dewpt` and `adv`; a quick check of the design matrix follows the model summary below.)
```{r glm10}
# generalized linear model -- two predictors
burn_glm2 <- glm(hotp ~ dewpt*adv, binomial(link = "logit"), weights=total)
burn_glm2
summary(burn_glm2)
```
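A quick way to verify the formula expansion (a sketch using base R's `model.matrix()`, which returns the design matrix actually used in the fit):

```{r glm10b}
# columns: (Intercept), dewpt, adv, and the product dewpt:adv
head(model.matrix(burn_glm2))
```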
The two models can be compared by doing an "analysis of deviance", in a fashion parallel to an analysis of variance. A large `Deviance` value (as judged by its chi-squared *p*-value) signals that the second model is better than the first, as it is here.
```{r glm11}
# compare models
anova(burn_glm1,burn_glm2,test="Chi")
```
The two models can also be compared using their AIC (Akaike Information Criterion) values, a statistic that trades off the goodness of fit of a model against the number of parameters that have to be estimated. In general, we favor a model with a lower AIC. The first model has an AIC value of `r burn_glm1$aic`, while the second has a lower value of `r burn_glm2$aic`, and so is preferred on this basis as well.
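The same comparison can be made directly with base R's `AIC()` function:

```{r glm11b}
# compare AIC values side by side
AIC(burn_glm1, burn_glm2)
```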
Finally, plot the regression lines:
```{r glm12, fig.width=13}
# evaluate predicted values
new_dewpt <- seq(-40,40,.2)
new_adv0 <- rep(0,length(new_dewpt))
newdata_adv0 <- data.frame(dewpt=new_dewpt, adv=new_adv0)
hotp_pred2_adv0 <- predict(burn_glm2, newdata_adv0, type="resp")
new_adv1 <- rep(1,length(new_dewpt))
newdata_adv1 <- data.frame(dewpt=new_dewpt, adv=new_adv1)
hotp_pred2_adv1 <- predict(burn_glm2, newdata_adv1, type="resp")
opar <- par(mfrow = c(1,2))
# plot regression lines
plot(hotp ~ dewpt, ylim=c(0,1))
lines(hotp_pred2_adv0 ~ new_dewpt, col="green")
lines(hotp_pred2_adv1 ~ new_dewpt, col="magenta")
legend("topright", legend = c("no advisory", "advisory"), lty = c(1, 1), lwd = 2,
cex = 1, col = c("green", "magenta"))
# plot regression lines again
plot(hotp ~ dewpt, ylim=c(0,1), xlim=c(-40,40))
lines(hotp_pred2_adv0 ~ new_dewpt, col="green")
lines(hotp_pred2_adv1 ~ new_dewpt, col="magenta")
legend("topright", legend = c("no advisory", "advisory"), lty = c(1, 1), lwd = 2,
cex = 1, col = c("green", "magenta"))
```
The interpretation is straightforward: as the dew-point temperature falls, the likelihood of using a woodstove increases, and it increases faster when a no-burn advisory has been issued.
Here is everything on a single scatter diagram:
```{r glm13}
# plot all
plot(hotp ~ dewpt, ylim=c(0,1), xlim=c(-40,40))
abline(burn_lm1)
lines(hotp_pred1 ~ new_dewpt, col="blue")
lines(hotp_pred2_adv0 ~ new_dewpt, col="green")
lines(hotp_pred2_adv1 ~ new_dewpt, col="magenta")
legend("topright", legend = c("OLS", "GLM", "GLM no advisory", "GLM advisory"), lty = c(1,1,1,1), lwd = 2,
cex = 1, col = c("black", "blue", "green", "magenta"))
detach(burn)
```
[[Back to top]](lec15.html)
# Another GLM example #
This second example of GLMs uses a data set and code from Crawley, M.J. (2013) *The R Book*, Wiley. [[island_csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/island_csv)
The data describe the incidence (presence or absence) of a particular bird species on a set of islands, along with potential controls such as the area of each island, its isolation, the presence of predators, and so on. Such *binary response* data sets occur frequently in practice, and OLS regression is not appropriate for them for a number of reasons.
First, look at the data.
```{r glm202}
# examine variables
attach(island)
pairs(cbind(incidence,area,isolation,quality,enemy,competitors))
```
Compare two models for `incidence`: a simple model with `area` and `isolation` as predictors, and a model that also includes their interaction.
## Simple model, no interaction ##
```{r glm203}
# simple model
island_glm1<-glm(incidence~area+isolation, binomial)
summary(island_glm1)
```
The model can be visualized by generating a grid of values of the predictor variables, "plugging in" those values to the fitted model using the `predict()` function, and displaying the results as a perspective plot.
```{r}
new_area <- seq(from=0, to=10, len=40); new_isolation <- seq(from=-2, to=12, len=40)
new_x <- expand.grid(area=new_area, isolation=new_isolation)
island_glm1_sfc <- matrix(predict(island_glm1, new_x, type="response"),40,40)
persp(new_area, new_isolation, island_glm1_sfc, theta=-150, phi=20, d=1.5,
col="gray", ticktype="detailed", zlim=c(0,1), xlab="Area",
ylab="Isolation", zlab = "Species Incidence")
```
## A second model, with interaction ##
Here's a second model, one that allows for an interaction between `area` and `isolation`:
```{r}
# interaction
island_glm2<-glm(incidence~area*isolation, binomial)
summary(island_glm2)
island_glm2_sfc <- matrix(predict(island_glm2, new_x, type="response"),40,40)
persp(new_area, new_isolation, island_glm2_sfc, theta=-150, phi=20, d=1.5,
col="gray", ticktype="detailed", zlim=c(0,1), xlab="Area",
ylab="Isolation", zlab = "Species Incidence")
```
```{r}
# compare models
anova(island_glm1,island_glm2,test="Chi")
```
The comparison here suggests that the simpler model is better: the interaction term does not significantly reduce the deviance. This could also be expected because in the simpler model the *z*-statistic tests of the significance of the regression coefficients show that the coefficients of `area` and `isolation` are both significant, whereas in the second model none of the coefficients is significant.
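The coefficient tables that underlie those *z*-statistics can be pulled out directly for a side-by-side look:

```{r glm206}
# z-statistics and p-values for the two models
summary(island_glm1)$coefficients
summary(island_glm2)$coefficients
```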
Another way to look at the relationships is via two bivariate models:
```{r glm204, fig.width=13}
island_glma<-glm(incidence ~ area, binomial)
summary(island_glma)
island_glmi<-glm(incidence ~ isolation, binomial)
summary(island_glmi)
```
Take a look at two scatter plots with fitted curves using `area` and `isolation` as individual predictors.
```{r glm205}
par(mfrow=c(1,2))
# fitted curve for area
xv <- seq(0, 9, 0.01)
yv <- predict(island_glma, list(area=xv), type="response")
plot(area, incidence)
lines(xv, yv, lwd=2)
# fitted curve for isolation
xv2 <- seq(0, 10, 0.1)
yv2 <- predict(island_glmi, list(isolation=xv2), type="response")
plot(isolation, incidence)
lines(xv2, yv2, lwd=2)
par(mfrow=c(1,1))
```
[[Back to top]](lec15.html)
# Generalized Additive Models (GAMs) #
Generalized additive models implement an overall strategy for relaxing the assumption that the relationships between the response variable and the predictor variables are linear, by allowing the form of each relationship to be determined by the data, in a manner analogous to using loess to describe a relationship.
Generalized additive models (GAMs) can in some ways be considered the general case of regression analysis, with GLMs a special case that allows for different kinds of responses (e.g. binary, counts, proportions, as well as "continuous" interval- or ratio-scale data), and OLS regression a very special case in which the residuals are i.i.d. normal and the relationships between response and predictors are linear. GAMs are quite flexible, making no a priori assumptions about the forms of the relationships between the response and the predictors.
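In symbols, a GAM replaces each term of the linear predictor with a smooth function $f_j(\cdot)$ estimated from the data:

$$g(E[Y]) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)$$

where $g(\cdot)$ is a link function as in a GLM; if each $f_j$ is constrained to be linear, the model collapses back to a GLM.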
This simple example of GAMs uses a data set and code from Crawley, M.J. (2013) *The R Book*, Wiley: [[ozone_csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/ozone_csv) The `ozone` data set is one of the classic data sets that has been used to illustrate analyses in S and R.
Look at the data:
```{r gam02}
attach(ozone_data)
pairs(ozone_data, panel=function(x,y) {points(x,y); lines(lowess(x,y), lwd=2, col="red")})
```
## A simple example ##
Fit a GAM using the `gam` package, with `ozone` levels as the response and solar radiation (`rad`), temperature (`temp`) and wind speed (`wind`) as predictors.
```{r gam03}
# load Hastie & Tibshirani-style GAM package
library(gam)
ozone_gam1 <- gam(ozone ~ s(rad)+s(temp)+s(wind))
summary(ozone_gam1)
```
Note that GAMs do not produce a "regression equation". Instead, the sense of the relationship between the response and each individual predictor is illustrated using "partial residual plots", which show the relationship between that predictor and the residuals from a prediction of the response using all of the other predictors.
```{r gam04}
opar <- par(mfrow=c(2,2))
plot(ozone_gam1, resid=T, pch=16)
par(opar)
```
```{r gam05}
# check for interactions
wt<-wind*temp
ozone_gam2 <- gam(ozone~s(temp)+s(wind)+s(rad)+s(wt))
summary(ozone_gam2)
opar <- par(mfrow=c(2,2))
plot(ozone_gam2, resid=T, pch=16)
par(opar)
```
```{r gam06}
# compare models
anova(ozone_gam1,ozone_gam2,test="F")
```
The *F*-statistic is not very large (as indicated by its *p*-value), but the partial residual plots show evidence of a strong effect of including the `wind x temp` interaction, and that term is significant in the fitted model, so it might be scientifically reasonable to include it.
[[Back to top]](lec15.html)
# Classification and regression trees (CARTs) #
CARTs implement a quite different approach for describing the relationship between a continuous or categorical response variable and several predictor variables, one that makes few assumptions about the nature of the relationship. Instead of an equation (or an R object that is "equation-like" in the sense of turning plugged-in values of the predictor variables into estimates of the response via the R `predict()` function), a tree structure reminiscent of a "binomial key" is constructed and used to arrive at a prediction of the response variable at the tip of a branch. When the response is categorical, a "classification tree" results; when it is continuous, a "regression tree" is produced.
This example comes from Crawley, M.J. (2013) *The R Book*, Wiley.
## A regression-tree example ##
The easiest way to visualize what a regression tree looks like and does is simply to produce one:
```{r cart02}
library(tree)
ozone_tree1 <- tree(ozone ~ ., data=ozone_data)
plot(ozone_tree1,type="u")
text(ozone_tree1)
```
The `tree()` function produces what amounts to an (upside-down) decision tree, in which (starting at the top) at each level and branch the value of a specific variable is examined, and a decision is made as to which branch to take based on whether or not the value of that variable exceeds some threshold. The values at the tip of each branch show the values of the response variable that prevail under the conditions indicated by the branches that were followed.
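Like the regression objects above, the fitted tree can be used with `predict()`. A minimal sketch, with illustrative (made-up) values for the three predictors in `ozone_data`:

```{r cart02b}
# follow the branches for a hypothetical day (values are illustrative only)
predict(ozone_tree1, data.frame(rad=200, temp=90, wind=5))
```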
A second example from Crawley (2013): [[Pollute.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/Pollute.csv)
Look at the data:
```{r cart03}
attach(Pollute)
pairs(Pollute, panel=function(x,y) {points(x,y); lines(lowess(x,y))})
```
Fit a regression tree for `SO2Conc`:
```{r cart04}
SO2Conc_tree1 <- tree(SO2Conc ~ . , Pollute)
plot(SO2Conc_tree1)
text(SO2Conc_tree1)
```
Printing the object created by `tree()` provides a text description of the tree and branches:
```{r cart05}
print(SO2Conc_tree1)
```
The variables and their threshold values are determined by "partitioning": at each step, the observations that remain are split into the two groups that maximize the between-group variance (equivalently, minimize the within-group variance). This can be illustrated for the first split, where a threshold value of 748 for `Industry` maximizes the distinctiveness of the two groups.
```{r cart06}
xcut <- 748
plot(SO2Conc ~ Industry)
lines(c(xcut,xcut),c(min(SO2Conc),max(SO2Conc)), col="blue", lwd=2, lty="dashed")
m1 <- mean(SO2Conc[Industry < xcut])
m2 <- mean(SO2Conc[Industry >= xcut])
lines(c(0,xcut),c(m1,m1), col="blue", lwd=2)
lines(c(xcut,max(Industry)),c(m2,m2), col="blue", lwd=2)
```
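The split criterion can be checked numerically; a minimal sketch, reusing `xcut`, `m1`, and `m2` from the chunk above, shows the reduction in the sum of squares achieved by the first split:

```{r cart06b}
# total and within-group sums of squares, and the reduction due to the split
ss_total <- sum((SO2Conc - mean(SO2Conc))^2)
ss_within <- sum((SO2Conc[Industry < xcut] - m1)^2) +
  sum((SO2Conc[Industry >= xcut] - m2)^2)
c(ss_total, ss_within, ss_total - ss_within)
```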
## A "Maptree" example
Denis White (OSU/EPA in Corvallis) developed a package that allows the results of recursive partitioning and regression trees ("RPART", the same thing as CART) to be mapped. Here the data consist of bird species distributions and environmental variables for Oregon; the tree-like partitioning derives clusters of data points (hexagonal grid boxes) with similar attributes, and `maptree` maps them.
```{r maptr02, warning=FALSE, message=FALSE}
library(maptree)
data(oregon.env.vars, oregon.grid, oregon.border)
names(oregon.env.vars)
attach(oregon.env.vars)
```
```{r maptr3}
# regression tree of Oregon environmental variables
model_tree1 <- "bird.spp ~ jan.temp+jul.temp+rng.temp+ann.ppt+min.elev+rng.elev+max.slope"
bird_tree1 <- rpart(model_tree1)
plot(bird_tree1)
text(bird_tree1)
```
"Pruning" (as on a real tree) removes some long, dangly branches...
```{r maptr4}
# prune the tree, keeping the best seven nodes
bird_tree1_pruned <- clip.rpart(rpart(model_tree1), best=7)
group <- group.tree(bird_tree1_pruned)
plot(bird_tree1_pruned)
text(bird_tree1_pruned)
```
The following code plots the clusters...
```{r maptr5}
# plot the regression tree
names(group) <- row.names(oregon.env.vars)
map.groups(oregon.grid, group=group)
map.key(0.05, 0.6, labels=as.character(seq(6)), pch=19, head="node")
lines(oregon.border)
```
... and redraws the tree to serve as a legend for the map.
```{r maptr6}
draw.tree(bird_tree1_pruned, nodeinfo=TRUE,
  units="species", cases="cells", cex=.7, pch=16, digits=0)
```
[[Back to top]](lec15.html)
# Readings #
GLMs, GAMs and CARTs are well covered in Chapters 13-18 and 23 in Crawley, M.J. (2013) *The R Book*, Wiley. To get to the book, visit [http://library.uoregon.edu](http://library.uoregon.edu), login, and search for the 2013 edition of the book.
Here's a direct link, once you're logged on: [http://onlinelibrary.wiley.com/book/10.1002/9781118448908](http://onlinelibrary.wiley.com/book/10.1002/9781118448908)
Kuhnert & Venables (An Introduction...): p. 141-168, 169-178, 259-300; Maindonald (Using R...): ch. 9