---
title: "More regression analysis"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
```{r load, echo=FALSE, cache=FALSE}
load(".Rdata")
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction #
Multiple regression is (conceptually) a simple extension of bivariate regression, in which the influence of more than one predictor variable on the response can be estimated. For the case with two predictor variables, the analysis can be thought of as fitting a plane (as opposed to a line in the bivariate case), and the equations for the OLS estimates of the regression coefficients are only a little more complicated algebraically. For three or more predictors the algebra is still straightforward, but matrix notation is required to express it compactly.
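For the two-predictor case, the fitted surface is the plane

$$\widehat{y} = b_0 + b_1 x_1 + b_2 x_2,$$

with the coefficients chosen, as in the bivariate case, to minimize the sum of squared vertical deviations of the observations from the plane.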
A couple of illustrations jointly describe the idea of fitting a plane:
- [fitting a plane using two predictor variables](https://pjbartlein.github.io/GeogDataAnalysis/images/nwk7_1.gif)
- [one data point and its deviation from the regression plane or response surface](https://pjbartlein.github.io/GeogDataAnalysis/images/mreg.gif)
# Fitting a multiple regression equation #
The mathematics behind multiple regression analysis is more complicated than that for bivariate regression, but can be presented elegantly using matrix algebra:
- [matrix algebra](https://pjbartlein.github.io/GeogDataAnalysis/topics/matrix.pdf)
- [regression analysis in matrix algebra terms](https://pjbartlein.github.io/GeogDataAnalysis/topics/matreg.pdf)
The following example provides a short illustration of the use of matrix algebra to obtain the regression coefficients.
The example data set for illustrating the use of regression diagnostics [[regrex3.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/regrex3.csv) is used here; in particular, the multiple regression uses `x1` and `x2` as predictors for the response variable `y5`.
First, take a look at the different variables in the example data set.
```{r matreg}
# regrex3
attach(regrex3)
summary(regrex3)
head(cbind(y5,x1,x2))
```
Create an *n* row by 1 column matrix (i.e. a column vector) called **y**:
```{r create y}
# create the column vector y
n <- length(y5)
y <- matrix(y5, nrow=n, ncol=1)
dim(y)
head(y)
```
Create an *n* row by *p*+1 matrix, **X**, with 1's in the first column, and `x1` and `x2` in the second and third columns:
```{r create x}
# create the predictor-variable matrix
X <- matrix(cbind(rep(1,n),x1,x2), nrow=n, ncol=3)
dim(X)
head(X)
```
Now use matrix algebra to calculate **b**, the *p*+1 row by 1 column matrix (i.e. a column vector) of regression coefficients, *b*<sub>0</sub>, *b*<sub>1</sub> and *b*<sub>2</sub>: **b** = (**X'X**)<sup>-1</sup>**X'y**.
```{r b}
# calculate the regression coefficients
b <- solve(t(X) %*% X) %*% (t(X) %*% y)
print(b)
dim(b)
```
The matrix functions and operators used in the above expression include `t()`, which transposes a matrix, `%*%`, which is the matrix multiplication operator, and `solve()`, which inverts a matrix.
Compare these values with those obtained using the `lm()` function:
```{r lm1}
# linear model with lm()
lm1 <- lm(y5 ~ x1+x2, data=regrex3)
lm1
```
Now calculate the fitted values (*y-hats*), i.e. \(\mathbf{\widehat{y}}\) = **Xb**:
```{r matrix fitted values}
# matrix fitted values
yhat <- X %*% b
```
and compare these with those obtained using the `lm()` function:
```{r compare fitted}
head(cbind(yhat,lm1$fitted.values))
```
In addition to providing a compact way of representing the derivation of terms and their properties in regression analysis in general, matrix algebra also provides an efficient way of doing the actual calculations.
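Matrix algebra also yields the other familiar regression quantities. The following sketch (using the `X`, `y`, `b`, and `n` objects created above, and the fitted model `lm1` for comparison) computes the residuals, the residual standard error, and the standard errors of the regression coefficients:

```{r matreg-se-sketch, eval=FALSE}
# residuals: e = y - Xb
e <- y - X %*% b

# residual standard error: s = sqrt(e'e / (n - p - 1)), here p = 2 predictors
p <- ncol(X) - 1
s <- sqrt(as.numeric(t(e) %*% e) / (n - p - 1))
s

# standard errors of the coefficients: square roots of the diagonal of s^2 (X'X)^-1
se_b <- sqrt(diag(s^2 * solve(t(X) %*% X)))
se_b

# compare with the lm() results
summary(lm1)$sigma
summary(lm1)$coefficients[, "Std. Error"]
```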
[[Back to top]](lec13.html)
# Regression Assumptions #
The basic regression model (as well as more complicated ones) has certain underlying assumptions. Violations of these assumptions affect the optimality of the fitted model, that is, the extent to which the model and its parameters are the best that can be obtained, whether the task is representing the relationship between the response variable and the predictor variable(s), or predicting future values of the response given new values of the predictors.
The main assumptions that underlie regression analysis are:
- the prediction errors or residuals are independent, identically normally distributed random variables, with a mean of 0 and a standard deviation of *s*;
- the *X*'s (predictor or independent variables) are known without error;
- the *X*'s are not correlated;
- the correct model has been specified (i.e. the right predictors have been included in the model).
If the assumptions are not violated, then the Gauss-Markov theorem indicates that the usual OLS estimates are optimal in the sense of being unbiased and having minimum variance. If one or more of the assumptions are violated, then the estimated regression coefficients may be biased (i.e. they may be systematically in error) and may no longer have minimum variance (i.e. there may be more uncertainty in the coefficients than is apparent from the results).
## Consequences of assumption violations ##
If the assumptions are violated, then there may be two consequences: the estimated coefficients may be biased (i.e. systematically wrong), and they may no longer have minimum variance (i.e. their uncertainty increases).
- the notion of variability of the regression coefficients
- illustrations using repeated simulations of data sets with built-in assumption violations [[examples]](https://pjbartlein.github.io/GeogDataAnalysis/images/violate1.gif), [[solutions]](https://pjbartlein.github.io/GeogDataAnalysis/images/violate2.gif); a small simulation sketch follows below
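The following small simulation (an illustrative sketch, not based on the example data; all names here are invented for the illustration) conveys the second consequence: when the errors are heteroscedastic, the usual OLS standard error of the slope tends to understate the actual sampling variability of the slope estimates.

```{r violate-sim, eval=FALSE}
# illustrative simulation: heteroscedastic errors and the variability of the slope
# true intercept = 2, true slope = 0.5
set.seed(42)
nsim <- 1000; nobs <- 50
x <- seq(1, 10, length.out = nobs)
slope <- se_slope <- numeric(nsim)
for (i in 1:nsim) {
  y <- 2 + 0.5 * x + rnorm(nobs, sd = 0.1 * x^2)   # error sd grows with x
  fit <- lm(y ~ x)
  slope[i] <- coef(fit)[2]
  se_slope[i] <- summary(fit)$coefficients[2, "Std. Error"]
}
# the empirical sd of the slope estimates exceeds the average reported standard error
c(empirical_sd = sd(slope), mean_reported_se = mean(se_slope))
```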
[[Back to top]](lec13.html)
# Testing for assumption violations with diagnostic analyses #
There are several ways to test whether the assumptions that underlie regression analysis have been violated. As might be expected, these include analytical methods, in which the values of particular test statistics are computed and compared with appropriate reference distributions, and graphical methods, which are often easier to implement (and just as informative).
A good way to understand how the various statistics and diagnostic plots allow one to examine the regression equation, its goodness-of-fit, and the possibility of assumption violations is to build in various assumption violations and issues, and compare the results to those from a regression analysis without issues.
An example data set for illustrating the use of regression diagnostics [[regrex3.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/regrex3.csv)
First, take a look at the different variables in the example data set.
```{r rdiag01, fig.width=7, fig.height=7}
# regrex3
summary(regrex3)
# matrix plot of the data
plot(regrex3[,2:8])
```
There are two potential predictor variables `x1` and `x2`, and five potential responses, each with particular issues or assumption violations. The idea here is to examine the summary output and diagnostic plots for each regression, and compare them to a case (the first regression) where there are no major issues.
```{r rdiag02}
# set up for plotting
oldpar <- par(mfrow = c(2, 2))
```
## Example 1 -- no major issues ##
Here's an initial regression that shows no major issues:
```{r rdiag03}
# first regression
regr_y1x1 <- lm(y1 ~ x1, data=regrex3)
summary(regr_y1x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y1x1, which=c(1,2,4,5), main="First Regression")
```
## Example 2 -- non-significant predictor ##
One of the most common issues in regression equations is the inclusion of a predictor that is not significant, in the sense that the *t*-test of the null hypothesis that the slope parameter for that predictor is zero is not significant (i.e. there is no support for rejecting that hypothesis). Here's an example, where `x2` is not significant:
```{r rdiag04}
# add a second predictor
regr_y1x1x2 <- lm(y1 ~ x1 + x2, data=regrex3)
summary(regr_y1x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y1x1x2, which=c(1,2,4,5), main="Two Predictors")
```
## Example 3 -- outliers present ##
Outliers frequently arise from mechanical errors (coding or recording), from sampling issues (i.e. selection of the cases or locations), or from the simple inclusion of data points that are "outside" of the domain of interest. The third example shows how outliers can be identified in the diagnostic plots; cases 79 and 49 stand out.
```{r diag05}
# outliers
regr_y2x1 <- lm(y2 ~ x1, data=regrex3)
summary(regr_y2x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y2x1, which=c(1,2,4,5), main="Outliers")
```
## Example 4 -- outliers and an additional predictor ##
Sometimes the existence of outliers can be traced to the omission of a potentially important predictor. In this case, the inclusion of `x2` does not eliminate the outliers, and as before, the slope coefficient of `x2` is not significant.
```{r diag06}
# no help for outliers
regr_y2x1x2 <- lm(y2 ~ x1 + x2, data=regrex3)
summary(regr_y2x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y2x1x2, which=c(1,2,4,5), main="Still Outliers")
```
## Example 5 -- heteroscedasticity (inhomogeneity of variance) of the residuals ##
Heteroscedasticity, or inhomogeneity of variance, can arise in several ways: it can be "inherited" from the response variable in cases where that variable is not completely "explained" by the predictor variable(s), and it can also signal issues with the structure of the model. In practice, heteroscedastic residuals increase the variability of the regression parameter estimates, which means that we can be less confident that the "true" values of the coefficients have been estimated. Heteroscedasticity is signaled by the diagnostic plots, in particular by "megaphone" patterns in the residual scatterplot and "doglegs" in the normal probability plot.
```{r rdiag07}
# heteroscedasticity
regr_y3x1 <- lm(y3 ~ x1, data=regrex3)
summary(regr_y3x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y3x1, which=c(1,2,4,5), main="Heteroscedasticity")
```
## Example 6 -- nonlinear relationships ##
Standard regression analyses attempt to fit *linear* relationships among variables, and often that's not a reasonable assumption to make. This example illustrates how nonlinear relationships are exposed by the residual diagnostic plots (if they haven't already been recognized), in particular by well-organized patterns in the residual scatterplot and leverage plot.
```{r rdiag08}
# nonlinear relationship
regr_y4x1 <- lm(y4 ~ x1, data=regrex3)
summary(regr_y4x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y4x1, which=c(1,2,4,5), main="Nonlinear Relationship")
```
## Example 7 -- transformation ##
The usual first-order strategy for dealing with heteroscedasticity and non-linearity is to transform one or the other of the variables (or both) in a bivariate regression. Here, the values of `y4` are transformed by taking their (base-10) logarithms. Because the logs of 0 and negative numbers are undefined, the values of `y4` are adjusted to make them nonzero and positive. However, there are still major problems signalled by the regression diagnostic plots.
```{r rdiag09}
# alternative nonlinear log(y4) ~ x1
regr_logy4x1 <- lm(log10(y4-min(y4) + 0.001) ~ x1, data=regrex3)
summary(regr_logy4x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_logy4x1, which=c(1,2,4,5), main="Transformation of y")
```
## Example 8 -- fitting a nonlinear (quadratic) function ##
An alternative approach to "linearizing" a relationship is to fit a quadratic polynomial (i.e. containing the terms *x* and *x<sup>2</sup>*) to the data. The `I()` function indicates that `x1^2` should be interpreted arithmetically (i.e. as the square of `x1`) rather than as part of the model-formula syntax of `lm()`.
```{r rdiag10}
# alternative nonlinear (quadratic) y4 ~ x1, x1^2
regr_y4x1x1 <- lm(y4 ~ x1 +I(x1^2), data=regrex3)
summary(regr_y4x1x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y4x1x1, which=c(1,2,4,5), main="Quadratic Polynomial")
```
This equation fits much better than the previous one, and shows no major problems.
## Example 9 -- model inadequacy due to a missing predictor ##
Here's an example regression, where the data were generated by an equation that included `x2`. Note that there are slight megaphone and dog-leg patterns in the residual plots.
```{r rdiag11}
# model inadequacy, missing predictor
regr_y5x1 <- lm(y5 ~ x1, data=regrex3)
summary(regr_y5x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y5x1, which=c(1,2,4,5), main="Missing Predictor")
```
The typical way of examining the possibility that the inclusion of another predictor may be warranted is to examine correlations between the residuals and other potential predictors. In this case it looks like the residuals are correlated with `x2`.
```{r rdiag12}
# check for correlation between residuals and other predictors
par(oldpar)
plot(regr_y5x1$residuals ~ regrex3$x2)
```
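A numeric version of the same check (a one-liner using the residuals from the model above):

```{r rdiag12b, eval=FALSE}
# correlation between the residuals and the candidate predictor x2
cor(regr_y5x1$residuals, regrex3$x2)
```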
When `x2` is included in the model, the various diagnostic checks show no major issues, except possibly some extra influence by points 100, 24 and 4.
```{r rdiag13}
# last model, y5 ~ x1 and x2
regr_y5x1x2 <- lm(y5 ~ x1+x2, data=regrex3)
summary(regr_y5x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y5x1x2, which=c(1,2,4,5), main="Two Predictors")
par(oldpar)
```
[[Back to top]](lec13.html)
# An example of iterative diagnostic analysis #
*Multiple regression* is the extension of bivariate regression to the case where there are multiple predictor variables. (The case where there are multiple predictors and multiple responses is generally referred to as *multivariate regression*.) This example illustrates the development of a "transfer function," a not-ideally-named approach for reconstructing past climates from fossil-pollen data. The idea is to build a regression equation that describes the relationship between a particular climate variable (as the response) and a number of pollen variables (as predictors) using modern pollen and climate data, and then to apply this relationship "downcore" to fossil-pollen data (plugging in pollen, and getting out climate). The data set has a few issues that often arise in practice: the pollen variables (pollen types) are themselves correlated (violating the OLS assumption of no collinearity); the climate variables are highly spatially autocorrelated, which can lead to autocorrelation in the residuals (violating the assumption that the residuals are independent); and the relationships between the climate and pollen variables are often curvilinear (violating yet another assumption, correct specification of the model).
These issues will be examined using the standard diagnostic plots, as well as the Moran statistic for assessing spatial autocorrelation, and the curvilinearity/collinearity issue will be handled by transformation and "best subsets" predictor selection.
Here are the data [[midwtf2]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/midwtf2.csv), and the components of a shape file for mapping the data and results [[.dbf]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.dbf) [[.shx]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.shx) [[.shp]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.shp)
[[.prj]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.prj)
Load some libraries.
```{r libraries, echo=FALSE, include=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
library(spdep)
library(leaps)
```
Load the packages (the data frame is attached below):
```{r mr01, eval=FALSE}
# libraries
library(sf)
library(ggplot2)
library(RColorBrewer)
library(spdep)
library(leaps)
```
The pollen variables with numbers in their names (e.g. `Querc.25`) are transformations of the original pollen variables (e.g. `Quercus`), constructed using "power function" transformations (i.e. `Querc.25 <- Quercus^0.25`) to linearize the relationships between climate and pollen (see the scatter diagrams below). The transformation parameters were chosen by inspecting individual scatter diagrams between a climate variable and a pollen variable.
## Inspection of the data ##
*Always* look at the data before running any kind of analysis. Begin by mapping the data: get the point locations of each modern pollen (and climate) sample, and plot the distance-weighted neighbors (for calculating the Moran statistic) using a distance threshold of 100 km.
Read a shapefile of outlines for the region:
```{r}
# read a state outline shape file
shapefile <- "/Users/bartlein/Documents/geog495/data/shp/midwotl.shp"
midwotl_sf <- st_read(shapefile)
midwotl_sf
plot(midwotl_sf) # plot the outline
plot(st_geometry(midwotl_sf), axes = TRUE)
```
Read the data, if it's not already in the workspace:
```{r, echo=TRUE, eval=FALSE}
# read the data file (if not already in the workspace)
csv_path <- "/Users/bartlein/Documents/geog495/data/csv/midwtf2.csv"
midwtf2 <- read.csv(csv_path)
```
```{r mr03}
attach(midwtf2)
names(midwtf2)
```
Plot the network of sites, and also get the distance-neighbor matrix for calculating spatial autocorrelation.
```{r}
# get point locations of modern data
midwtf2_loc <- cbind(longitud, latitude)
```
Get the distance neighbors (i.e. the contiguity matrix **W**)
```{r dist}
# distance neighbors
d <- 100 # points are neighbors if locations are within d km of one another
midwtf2_neighbors_dist <- dnearneigh(midwtf2_loc, 0, d, longlat=TRUE)
plot(st_geometry(midwotl_sf), axes=TRUE)
points(midwtf2_loc)
plot(midwtf2_neighbors_dist, midwtf2_loc, add=TRUE, col="magenta")
```
Next, examine the data set. First plot latitude, longitude, and the climate variables relative to one another (to reveal the geographical gradients in the climate data).
```{r mr04}
# examine the data set
# plot lat, lon and climate variables
plot(midwtf2[,2:6])
```
Now plot July temperature (the response variable here) relative to a few (untransformed) pollen types, and note the curvilinear relationships.
```{r mr05}
# plot July temperature and selected pollen types
plot(midwtf2[,c(4,7,10,13,21)])
```
Next, plot July temperature vs. the transformed pollen types, and note how the transformations linearize the relationships.
```{r mr06}
# plot July temperature and transformed pollen
plot(midwtf2[,c(4,22,25,28,36)])
```
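As a check, a transformed variable can be reconstructed from its untransformed counterpart; this sketch assumes the exponent encoded in the variable name (0.25 for `Querc.25`):

```{r power-check, eval=FALSE}
# reconstruct the transformed Quercus variable and compare with the stored column
Querc.25_check <- Quercus^0.25
head(cbind(Querc.25_check, Querc.25))
```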
## Spatial autocorrelation of July temperature values ##
The assumption that the prediction errors or residuals are independent, identically normally distributed random variables is often violated in geographical or Earth-system science data sets, owing to spatial or temporal autocorrelation in the dependent or response variable. If the regression model cannot explain *all* of the variation in the response, then the autocorrelation of the response will be inherited by the residuals or prediction errors. This inheritance will be evident as spatial or temporal patterns in the residuals, which can be thought of as information that should be incorporated in the "systematic" component of the model.
In this example, spatial autocorrelation of the residuals will be the issue. The *Moran statistic* is a measure of spatial autocorrelation, and in the regressions below it will be used to test for spatial pattern in the residuals as the model is gradually refined.
- [Moran statistic details](https://pjbartlein.github.io/GeogDataAnalysis/topics/moran.pdf)
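For reference, for a variable \(x\) observed at \(n\) locations with spatial weights \(w_{ij}\) (here, the distance-based neighbor weights constructed above), the Moran statistic takes the standard form

$$ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \;
\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}. $$

Values near its expectation of \(-1/(n-1)\) (essentially zero for large \(n\)) indicate no spatial autocorrelation, while larger positive values indicate that nearby observations tend to be similar.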
To illustrate the Moran statistic, and to gain an impression of the level of spatial autocorrelation in the response (as opposed to the residuals), calculate the Moran statistic for July temperature, the response. For comparison with the residuals (which should be centered on zero), we'll convert the July temperature values to "centered" values by subtracting the mean. We should see the amplitude of the values diminish as the fit gets better and better, which is consistent with the idea of the residuals having no "pattern" or systematic variation.
```{r mr07}
# ggplot2 map of temperature
plotvar <- tmeanjul
plotvar <- plotvar-mean(plotvar)
plottitle <- "Modern Climate (mean subtracted)"
leglable <- "tmeanjul (C)"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-5, 5), breaks = seq(-5, 5, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
The map shows that July temperatures are quite spatially autocorrelated, which the Moran statistic confirms.
```{r mr09}
# spatial autocorrelation of dependent variable (tmeanjul)
moran.test(tmeanjul, zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
In some ways, the objective of the regression analysis can be posed in terms of "taking the autocorrelation out" of July temperature.
## A "naive" model ##
The simplest model (in terms of not worrying about assumption violations, etc.) is one in which all of the potential predictor variables are included, in their "raw" or untransformed form:
```{r mr10}
# naive model
tmeanjul_lm0 <- lm(tmeanjul ~ Picea + Abies + Juniper
+ Pinus + Betula + Fraxinus + Quercus + Ulmus
+ Acer + Juglans + Carya + Tsuga + Fagus
+ Alnus + Herbsum)
summary(tmeanjul_lm0)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm0, which=c(1,2,4,5))
par(oldpar)
```
Map the residuals from the naive model
```{r mr11}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm0)
plotvar <- plotvar-mean(plotvar)
plottitle <- "Naive model residuals (C)"
leglable <- "residuals tmeanjul_lm0"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
... and calculate the Moran statistic for the naive model residuals:
```{r mr12}
# Moran test
moran.test(residuals(tmeanjul_lm0), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
There are a few troublesome issues with the naive model. The obvious one is the similarity of the standard errors of the regression coefficients, which is one of the byproducts of collinearity. Here, the pollen data are almost exactly collinear, because the pollen percentages must sum to 100. (A few minor types are omitted, so the sums don't reach exactly 100.)
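A quick check of this near-closure (a sketch that assumes the pollen types used in the model above are columns of `midwtf2`; the sums should fall a little short of 100 because of the omitted minor types):

```{r closure-check, eval=FALSE}
# pollen percentages should sum to nearly 100
pollen_vars <- c("Picea", "Abies", "Juniper", "Pinus", "Betula", "Fraxinus", "Quercus",
  "Ulmus", "Acer", "Juglans", "Carya", "Tsuga", "Fagus", "Alnus", "Herbsum")
summary(rowSums(midwtf2[, pollen_vars]))
```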
The extent of collinearity can be confirmed by calculating something called the "condition number" of the matrix **X<sup>T</sup>X** (where **X** is the matrix of predictor variables), which is the ratio of the largest to the smallest eigenvalue of **X<sup>T</sup>X**. This number should be small, say less than 50 or so, and here it isn't.
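The eigenvalue ratio described above can also be computed directly (a sketch using the design matrix of the fitted model; note that `kappa()` works from the model's QR decomposition and reports a value on roughly the square-root scale of this ratio, so the two numbers will differ):

```{r condition-number-sketch, eval=FALSE}
# eigenvalue-ratio condition number of X'X for the naive model
X0 <- model.matrix(tmeanjul_lm0)       # design matrix, including the intercept column
ev <- eigen(crossprod(X0))$values
max(ev) / min(ev)
```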
```{r mr13}
kappa(tmeanjul_lm0)
```
## Second model, with transformed predictors ##
There are two approaches for dealing with collinearity here: 1) omitting some variables, and 2) repeating the regression with the transformed values of the pollen variables (which we are inclined to do anyway):
```{r mr14}
# a second model, transformed predictors
tmeanjul_lm1 <- lm(tmeanjul ~ Picea.50 + Abies.100 + Junip.100
+ Pinus.100 + Betula.50 + Frax.15 + Querc.25 + Ulmus.25
+ Acer.100 + Juglans.25 + Carya.25 + Tsuga.100 + Fagus.100
+ Alnus.100 + Herbs.25)
summary(tmeanjul_lm1)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm1, which=c(1,2,4,5))
par(oldpar)
```
Map the residuals.
```{r mr15}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm1)
plotvar <- plotvar-mean(plotvar)
plottitle <- "Second model residuals (C)"
leglable <- "residuals tmeanjul_lm1"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
Get the Moran statistic for the residuals.
```{r mr16}
# Moran test
moran.test(residuals(tmeanjul_lm1), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
Note that this model is better behaved, in the sense of not showing strange values for the standard errors of the regression coefficients, and the spatial autocorrelation, while still barely significant, is lower too. Many of the *t*-tests of the regression coefficients are not significant (meaning that the null hypothesis that the regression coefficient is 0 cannot be rejected). A model with a subset of (significant) predictors can be selected using "all/best possible subsets" regression.
Note that the condition number for this model is still relatively high, because, even though the predictor variables have been transformed, there is still a lot of correlation among them.
```{r mr17}
kappa(tmeanjul_lm1)
```
## All possible subsets regression ##
The `regsubsets()` function in the `leaps` package uses a clever algorithm from the field of numerical analysis to examine a large set of (possibly all) regressions with subsets of the predictors.
```{r mr18}
# another model, all possible subsets regression
tmeanjul_lm2 <- regsubsets(tmeanjul ~ Picea.50 + Abies.100 + Junip.100
+ Pinus.100 + Betula.50 + Frax.15 + Querc.25 + Ulmus.25
+ Acer.100 + Juglans.25 + Carya.25 + Tsuga.100 + Fagus.100
+ Alnus.100 + Herbs.25, data=midwtf2)
summary(tmeanjul_lm2)
```
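To help choose among the candidate subset sizes, the summary of a `regsubsets` object carries goodness-of-fit measures for each size; here is a short sketch tabulating the adjusted *R<sup>2</sup>* and BIC:

```{r regsubsets-stats, eval=FALSE}
# adjusted R-squared and BIC for each subset size
ss <- summary(tmeanjul_lm2)
data.frame(size = seq_along(ss$adjr2), adjr2 = ss$adjr2, bic = ss$bic)
```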
To get the diagnostic plots and statistics for one of the "best" models, that regression is simply rerun using the `lm()` function.
```{r mr19}
# rerun one of the best-subsets regressions
tmeanjul_lm3 <- lm(tmeanjul ~ Picea.50 + Abies.100 + Betula.50 + Fraxinus
+ Querc.25 + Fagus.100 + Herbs.25)
summary(tmeanjul_lm3)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm3, which=c(1,2,4,5))
par(oldpar)
```
```{r mr20}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm3)
plotvar <- plotvar-mean(plotvar)
plottitle <- "All-possible subsets model residuals (C)"
leglable <- "residuals tmeanjul_lm3"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
```{r mr21}
# Moran test
moran.test(residuals(tmeanjul_lm3), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
The condition number is now acceptably low:
```{r mr22}
kappa(tmeanjul_lm3)
```
The progress of the analysis can be gauged by comparing the individual Moran statistics, which should get smaller as the model gets more refined. The Moran statistic for the data is 12.08, and this could be thought of as a "totally naive" model, in which the response variable is expressed as a function of just its mean value. The naive model with all untransformed predictors does achieve a lower Moran statistic (2.197), indicating that some of the pattern in the response has indeed been accounted for by the predictors, but there are other issues with this model. The second regression, including transformed predictors, has a Moran statistic of 2.480, but this model too has some issues; in particular, it is "overfit". The final model, the result of the best-possible-subsets regression, has the lowest Moran statistic, 1.834.
```{r mr23}
# compare Moran tests
# dependent variable (tmeanjul)
moran.test(tmeanjul, zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm0 (naive model) no transformation, all predictors included
moran.test(residuals(tmeanjul_lm0), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm1 transformed predictors, all predictors included
moran.test(residuals(tmeanjul_lm1), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm3 transformed predictors, best subset model
moran.test(residuals(tmeanjul_lm3), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
[[Back to top]](lec13.html)
# Another example -- dummy-variable regression #
Sometimes a data set exists in which fitting different relationships for subsets of the observations is in order. This situation is exemplified by the Scandinavian EU preference data. The technique known as *dummy-variable* regression allows the simultaneous calculation of separate regression lines for subsets of the data, or in the example here, for the individual countries. The regression lines can vary in their slopes, their intercepts, or both.
Here are the shapefile components, which include the data: [[.dbf]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.dbf) [[.shx]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.shx) [[.shp]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.shp)
```{r dum01, echo=FALSE, include=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
```
Load some packages, and attach the data frame:
```{r dum02, eval=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
```
## Look at the data ##
First, get a reference map of the commune/county/district names. Read a shapefile of Scandinavian communes.
```{r, echo=TRUE, eval=FALSE}
# read a Scandinavian province/county shape file
shapefile <- "/Users/bartlein/Documents/geog495/data/shp/scand_prov.shp"
scand_prov_sf <- st_read(shapefile)
scand_prov_sf
plot(st_geometry(scand_prov_sf), axes = TRUE)
```
Get the centroids of the individual districts for locating labels later:
```{r dum03}
# get coordinates of district centroids
scand_prov_pts <- (st_centroid(scand_prov_sf))
```
Ignore the warning, which alerts us to the fact that latitude and longitude are not isotropic, and so we really should be calculating centroids only with projected data.
Plot the District (commune) names:
```{r plot names}
# ggplot2 map of District names
ggplot() +
geom_sf(data = scand_prov_sf, fill=NA, color="gray") +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=as.character(District)), size = 2) +
coord_sf(xlim=c(2.5,32.5)) +
labs(x="Longitude", y="Latitude") +
theme_bw()
```
Next, extract the variable values from the shapefile, and plot the Yes votes. (The data are the same as those in `scanvote_csv`, but are extracted from the shapefile here to make sure that the individual values are in the same order as the polygons in the shapefile.)
```{r dum04}
# get variable values from .dbf attributes
Y <- scand_prov_sf$Yes
X <- scand_prov_sf$Pop
Country <- scand_prov_sf$Country
```
Plot the "Yes" vote:
```{r}
# plot Yes votes -- setup
pal <- brewer.pal(8, "PuOr")
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = Yes)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=as.character(Yes)), size = 2) +
scale_fill_gradientn(colors = pal) +
labs(title = "Yes Vote (%)", x="Longitude", y="Latitude")
```
Here's a standard scatter plot of the data, with the points labeled by Country.
```{r dum05}
# scatter plot with Country labels
plot(Y ~ log10(X), type="n")
text(log10(X),Y, labels=Country, cex=0.8)
```
## First regression, all data ##
The regression model for all data, considered without respect to individual country, is as follows:
```{r dum06}
# linear model, Yes ~ log10(Pop)
vote_lm1 <- lm(Y ~ log10(X))
summary(vote_lm1)
oldpar <- par(mfrow = c(2, 2))
plot(vote_lm1, which=c(1,2,4,5))
par(oldpar)
```
The initial regression has several issues: there is a distinct arch in the residual scatter plot, the normal QQ plot indicates there are several outliers, and the Cook's distance and leverage plots indicate the presence of several overly influential observations. Further examination of the results is warranted.
```{r dum07}
# examine the regression equation
plot(Y ~ log10(X), type="n")
text(log10(X),Y, labels=Country)
abline(vote_lm1)
segments(log10(X), fitted(vote_lm1), log10(X), Y)
```
Examination of this plot suggests that positive deviations are often contributed by communes in Finland, and negative deviations by communes in Norway. This suggests that Norwegian communes were less likely, and Finnish communes more likely, to have voted Yes than Scandinavian communes as a group. This idea is reinforced by looking at the prediction and confidence intervals for the regression.
```{r dum08}
# confidence intervals
pred_data <- data.frame(X=seq(1,1151,by=50))
pred_int <- predict(vote_lm1, int="p", newdata=pred_data)
conf_int <- predict(vote_lm1, int="c", newdata=pred_data)
plot(Y ~ log10(X), ylim=range(Y, pred_int, na.rm=T), type="n")
text(log10(X),Y, labels=Country, cex=0.7)
pred_X <- log10(pred_data$X)
matlines(pred_X, pred_int, lty=c(1,2,2), col="black")
matlines(pred_X, conf_int, lty=c(1,2,2), col="red")
```
Mapping the residuals suggests the same thing--there is an obvious spatial pattern in the sign of the residuals (so obvious that it's not really necessary to calculate a Moran statistic).
```{r dum09}
# map the residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm1$residuals, 1))
plottitle <- "Residuals from lm1"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm1$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
Boxplots of the response variable (`Y`) and the residuals by country also suggest the same thing.
```{r dum10}
opar <- par(mfrow=c(1,4))
# country-effect
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
par(opar)
```
## Dummy variable regression, with Country as a predictor ##
A dummy-variable regression can be run by including `Country` as a predictor. The formula `Y ~ log10(X)+Country` specifies a regression in which separate intercept values are calculated for each country. In essence, two new variables are generated, each binary (0 or 1), one for Sweden and the other for Norway. The intercepts that are estimated work as follows: The "plain" intercept applies to the country left out (Finland), while the intercept for Sweden is that value plus the coefficient estimated for Sweden, and likewise for Norway.
```{r dum11}
# Scandinavian EU Preference Vote -- dummy variable regression
# model with a factor as predictor
vote_lm2 <- lm(Y ~ log10(X)+Country)
summary(vote_lm2)
```
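The dummy coding can be inspected directly; here is a quick check using the model fitted above. With R's default treatment contrasts, Finland (the first level alphabetically) is the baseline:

```{r dummy-coding-check, eval=FALSE}
# indicator (dummy) columns that lm() constructs for Country
head(model.matrix(vote_lm2))
# contrast coding of the Country factor
contrasts(factor(Country))
```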
The different regression lines can be displayed as follows:
```{r dum12}
# display the fitted lines
plot(Y ~ log10(X))
legend("bottomright", legend=c("N","F","S"), lty=c(1,1,1), lwd=2, cex=1, col=c("red","blue","purple"))
lines(log10(X)[Country=="N"],fitted(vote_lm2)[Country=="N"], lwd=2, col="red")
lines(log10(X)[Country=="F"],fitted(vote_lm2)[Country=="F"], lwd=2, col="blue")
lines(log10(X)[Country=="S"],fitted(vote_lm2)[Country=="S"], lwd=2, col="purple")
```
Notice that Finland has the highest intercept (34.434), while that for Sweden is lower (34.434 - 6.679), and that for Norway is lower still (34.434 - 8.609).
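These country-specific intercepts can be recovered from the fitted coefficients. The sketch below assumes the default treatment contrasts, so the coefficient names `CountryS` and `CountryN` are the offsets for Sweden and Norway relative to the Finnish baseline:

```{r country-intercepts, eval=FALSE}
# country-specific intercepts implied by the dummy-variable coefficients
cf <- coef(vote_lm2)
c(Finland = unname(cf["(Intercept)"]),
  Sweden  = unname(cf["(Intercept)"] + cf["CountryS"]),
  Norway  = unname(cf["(Intercept)"] + cf["CountryN"]))
```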
Examine the model
```{r dum13}
# examine the model
oldpar <- par(mfrow = c(2, 2))
plot(vote_lm2, which=c(1,2,4,5))
par(oldpar)
```
The arch in the residual scatter plot is now much less obvious, as is the departure from normality of the residuals. There are still a few overly influential points, but their influence is reduced as well.
Map the residuals:
```{r dum14}
# map residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm2$residuals, 1))
plottitle <- "Residuals from lm2"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm2$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
The obvious patterns in the residuals from the first regression are also much reduced.
```{r dum19}
opar <- par(mfrow=c(1,4))
# country-effect
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm2) ~ Country, ylim=c(-15,15))
par(opar)
```
It is also the case that the obvious differences in the location (i.e. the medians) and scale (i.e. the interquartile range) of the residuals among countries in the first regression model (as illustrated by the boxplots) are reduced, bringing the residuals more in line with the assumption that they are independent and identically distributed.
## Dummy-variable regression with both slope and intercept varying by Country ##
A second dummy-variable regression, in which both the intercept and the slope vary, can be specified using the model formula `Y ~ log10(X)*Country`.
```{r dum15}
vote_lm3 <- lm(Y ~ log10(X)*Country)
summary(vote_lm3)
```
None of the *t*-tests of the significance of the dummy-variable terms (intercepts and slopes) are significant, and the adjusted *R<sup>2</sup>* value is lower, and so this model is likely "overfitted". Nevertheless, it's useful to look at how both the intercepts and slopes can be made to vary among countries.
```{r dum16}
# display the fitted lines
plot(Y ~ log10(X))
legend("bottomright", legend=c("N","F","S"), lty=c(1,1,1), lwd=2, cex=1, col=c("red","blue","purple"))
lines(log10(X)[Country=="N"],fitted(vote_lm3)[Country=="N"], lwd=2, col="red")
lines(log10(X)[Country=="F"],fitted(vote_lm3)[Country=="F"], lwd=2, col="blue")
lines(log10(X)[Country=="S"],fitted(vote_lm3)[Country=="S"], lwd=2, col="purple")
```
Map the residuals from the third model.
```{r dum17}
# map residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm3$residuals, 1))
plottitle <- "Residuals from lm3"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm3$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
Look at the country effect on the residuals from the third model.
```{r dum18}
# country-effect
opar <- par(mfrow=c(1,4))
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm2) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm3) ~ Country, ylim=c(-15,15))
par(opar)
```
There is little difference between the third model (in which both the slope and the intercept vary among countries) and the second, and so the "principle of parsimony" indicates that the second model should be preferred.
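A formal version of this comparison (a sketch, assuming both fitted models are still in the workspace) is the nested-model *F*-test of whether the extra slope terms in the third model improve the fit:

```{r compare-models, eval=FALSE}
# nested-model F-test: does letting the slopes vary by Country improve the fit?
anova(vote_lm2, vote_lm3)
```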
[[Back to top]](lec13.html)
# Readings #
- Kuhnert & Venables (*An Introduction...*): p. 109-120
- Maindonald (*Using R...*): ch. 5