---
title: "Regression 2 -- Multiple regression"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r regression2-1, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2024, but may undergo further edits.</span></p>
# Introduction #
Multiple regression is (conceptually) a simple extension of bivariate regression, in which the influence of more than one predictor variable on the response can be estimated. For the case with two predictor variables, the analysis can be thought of as involving the fitting of a plane (as opposed to a line in the bivariate regression case), and the equations for the OLS estimates of the regression equations are only a little more complicated algebraically. For three or more predictors, the algebra is also quite simple, but requires the use of matrix algebra.
A couple of illustrations jointly describe the idea of fitting a plane:
- [fitting a plane using two predictor variables](https://pjbartlein.github.io/REarthSysSci/images/nwk7_1.gif)
- [one data point and its deviation from the regression plane or response surface](https://pjbartlein.github.io/REarthSysSci/images/mreg.gif)
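As a minimal sketch of the plane-fitting idea (simulated data with made-up coefficients, used only for illustration), fitting a plane with two predictors looks like this; the estimates should recover the coefficients of the plane:
```{r}
# fit a plane to simulated data: y = 1 + 2*x1 - 0.5*x2 + noise
set.seed(42)
sim <- data.frame(x1 = runif(100), x2 = runif(100))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(100, sd = 0.1)
coef(lm(y ~ x1 + x2, data = sim))  # should be close to 1.0, 2.0, and -0.5
```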
# Fitting a multiple regression equation #
The mathematics behind multiple regression analysis is more complicated than that for bivariate regression, but it can be elegantly presented using matrix algebra:
- [matrix algebra](https://pjbartlein.github.io/REarthSysSci/topics/matrix.pdf)
- [regression analysis in matrix algebra terms](https://pjbartlein.github.io/REarthSysSci/topics/matreg.html)
The following example provides a short illustration of the use of matrix algebra to obtain the regression coefficients.
The example data set used for illustrating regression diagnostics (`regrex3.csv`, in `/Users/bartlein/projects/RESS/data/csv_files/`) is used here; in particular, the multiple regression using `x1` and `x2` as predictors for the response variable `y5`.
Read the data:
```{r regression2-2 }
# read regrex3.csv
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "regrex3.csv"
csv_file <- paste(csv_path, csv_name, sep="")
regrex3 <- read.csv(csv_file)
```
First, take a look at the different variables in the example data set.
```{r regression2-3 }
# regrex3
attach(regrex3)
summary(regrex3)
```
```{r}
head(cbind(y5,x1,x2))
```
Create an *n* row by 1 column matrix (i.e. a column vector) called **y**:
```{r regression2-4 }
# create the column vector y
n <- length(y5)
y <- matrix(y5, nrow=n, ncol=1)
dim(y)
head(y)
```
Create an *n* row by *p*+1 matrix, **X**, with 1's in the first column, and `x1` and `x2` in the second and third columns:
```{r regression2-5 }
# create the predictor-variable matrix
X <- matrix(cbind(rep(1,n),x1,x2), nrow=n, ncol=3)
dim(X)
head(X)
```
Now use matrix algebra to calculate **b**, the *p*+1 row by 1 column matrix (i.e. a column vector) of regression coefficients, *b*<sub>0</sub>, *b*<sub>1</sub> and *b*<sub>2</sub>: **b** = (**X'X**)<sup>-1</sup>**X'y**.
```{r regression2-6 }
# calculate the regression coefficients
b <- solve(t(X) %*% X) %*% (t(X) %*% y)
print(b)
dim(b)
```
The matrix functions and operators used in the above expression include `t()`, which transposes a matrix, `%*%`, which is the matrix multiplication operator, and `solve()`, which inverts a matrix.
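As an aside, the same coefficients can be computed a little more efficiently (and with better numerical stability) using `crossprod()`, which forms **X'X** directly, and by letting `solve()` solve the normal equations rather than explicitly inverting the matrix. This is just a sketch of an equivalent formulation, not a different estimator:
```{r}
# equivalent calculation: crossprod(X) is t(X) %*% X, and
# solve(A, b) solves the system directly, without forming the inverse
b2 <- solve(crossprod(X), crossprod(X, y))
print(b2)
```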
Compare these values with those obtained using the `lm()` function:
```{r regression2-7 }
# linear model with lm()
lm1 <- lm(y5 ~ x1+x2, data=regrex3)
lm1
```
Now calculate the fitted values (*y-hats*), i.e. \(\mathbf{\widehat{y}}\) = **Xb**:
```{r regression2-8 }
# matrix fitted values
yhat <- X %*% b
```
and compare these with those obtained using the `lm()` function:
```{r regression2-9 }
head(cbind(yhat,lm1$fitted.values))
```
In addition to being able to efficiently represent the derivation of terms and their properties in regression analysis in general, matrix algebra also provides an efficient way of doing the actual calculations.
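For example, the residual variance and the standard errors of the regression coefficients also follow directly from the matrix expressions (a short sketch; compare the results with the `Std. Error` column of `summary(lm1)`):
```{r}
# residual variance and coefficient standard errors via matrix algebra
e <- y - yhat                                # residuals
s2 <- as.numeric(t(e) %*% e) / (n - 3)       # residual variance, with n - (p + 1) df
se_b <- sqrt(diag(s2 * solve(t(X) %*% X)))   # standard errors of b0, b1, b2
cbind(b, se_b)
```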
[[Back to top]](regression2.html)
# Regression Assumptions #
The basic regression model (as well as more complicated ones) has certain underlying assumptions, violations of which have an impact on the optimality of the fitted model, i.e. the extent to which the model and its parameters represent the best model that can be fit, that is, the one that performs best in representing the relationship between the response variable and the predictor variable(s), or in predicting future values of the response variable given new values of the predictor variables.
The main assumptions that underlie regression analysis are:
- the prediction errors or residuals are independent, identically distributed normal random variables, with a mean of 0 and a standard deviation of *s*,
- the *X*'s (predictor or independent variables) are known without error,
- the *X*'s are not correlated, and
- the correct model has been specified (i.e. the right predictors have been included in the model).
If the assumptions are not violated, then the Gauss-Markov theorem indicates that the usual OLS estimates are optimal, in the sense of being unbiased and having minimum variance. If one or more of the assumptions are violated, then the estimated regression coefficients may be biased (i.e. they may be systematically in error) and may not have minimum variance (i.e. there may be more uncertainty in the coefficients than is apparent from the results).
## Consequences of assumption violations ##
If the assumptions are violated, then there may be two consequences: the estimated coefficients may be biased (i.e. systematically wrong), and they may no longer have minimum variance (i.e. their uncertainty increases).
- the notion of variability of the regression coefficients
- illustrations using repeated simulations of data sets with built-in assumption violations [[examples]](https://pjbartlein.github.io/REarthSysSci/images/violate1.gif), [[solutions]](https://pjbartlein.github.io/REarthSysSci/images/violate2.gif); a small simulation in this spirit is sketched below
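Here is a minimal simulation of that kind (an assumed setup for illustration, not the linked examples): many data sets are generated with the same trend, once with independent errors and once with autocorrelated AR(1) errors of the same marginal variance, and the spread of the estimated slopes is compared:
```{r}
# repeated simulations: OLS slopes with iid vs. autocorrelated errors
set.seed(1)
nsim <- 500; nobs <- 100; x <- 1:nobs
slope_iid <- slope_ar <- numeric(nsim)
for (i in 1:nsim) {
  y_iid <- 0.05 * x + rnorm(nobs)
  y_ar <- 0.05 * x + as.numeric(arima.sim(list(ar = 0.8), n = nobs, sd = sqrt(1 - 0.8^2)))
  slope_iid[i] <- coef(lm(y_iid ~ x))[2]
  slope_ar[i] <- coef(lm(y_ar ~ x))[2]
}
# both sets of slopes are centered on the true value 0.05 (unbiased),
# but the slopes from the autocorrelated-error data are more variable
c(sd_iid = sd(slope_iid), sd_ar = sd(slope_ar))
```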
[[Back to top]](regression2.html)
# Example -- NOAA Global Land + Ocean temperatures
It's plain to see that Earth's average temperature is increasing (e.g. the [[NOAA 2023 Global Climate Report]](https://www.ncei.noaa.gov/access/monitoring/monthly-report/global/202313); see the second figure). One issue that arises is the estimation of the rate of temperature increase, and whether the rate is also changing over time. One factor that complicates that assessment is ENSO (El Niño-Southern Oscillation), the variation in temperature of the tropical Pacific Ocean [[https://www.climate.gov/enso]](https://www.climate.gov/enso) between warm phases (El Niño) and cool phases (La Niña). In addition to the influence of ENSO on atmospheric circulation across the Pacific, owing to the large surface area of the Pacific Ocean, ENSO may influence global average temperatures, locally (in time) increasing them during warm phases (El Niño) and locally decreasing them during cool phases (La Niña) ([[NOAA 2023 Global Climate Report]](https://www.ncei.noaa.gov/access/monitoring/monthly-report/global/202313); see the third figure).
Because both things are going on at once (the overall trend toward increasing temperature, and the local ENSO-related increases and decreases), it's difficult to quantitatively assess or separate the trend and the local variability from the inspection of figures alone. That's where quantitative estimation using regression comes in. The information for colorizing the 1950-present monthly temperature plot comes from the [[Oceanic Niño Index]](https://www.climate.gov/news-features/understanding-climate/climate-variability-oceanic-nino-index), a composite of several tropical Pacific sea-surface temperature (SST) indices, which has been detrended. The data are from [[ONI data]](https://www.cpc.ncep.noaa.gov/data/indices/oni.ascii.txt); see here for the data in tabular form: [[ONI data table]](https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php).
In addition to the ONI data, we'll also use the [[NOAA Global Monthly Land + Ocean time series]](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/ytd/12/1850-2023) (back to 1950). Because the trend in global temperatures is not mechanistically controlled by time, but by CO~2~, we'll also use the monthly Scripps/NOAA CO~2~ trends data [[https://gml.noaa.gov/ccgg/trends/]](https://gml.noaa.gov/ccgg/trends/) (March 1958 onward). The data can be found here: [[NOAA CO~2~ Trends]](https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.txt). The input data set (`NOAA_Globe_T_ENSO_CO2_1950-2023.csv`) contains the year, month, decimal year, temperatures of the land, ocean, and land plus ocean, the ONI monthly total value, anomaly, and code, and the monthly mean CO~2~, the deseasonalized CO~2~ value, and their difference (the "local" CO~2~ anomaly).
## Read the NOAA data
Load the libraries:
```{r regression2-10}
library(ggplot2)
library(forecast)
```
Read the data, modifying paths as necessary:
```{r regression2-11}
# read NOAA monthly temperature, ONI (ENSO), and CO2
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "NOAA_Globe_T_ENSO_CO2_1950-2023.csv"
csv_file <- paste(csv_path, csv_name, sep="")
NOAA <- read.csv(csv_file)
str(NOAA)
summary(NOAA)
```
Note the missing CO~2~ data prior to 1958.
Decode the `ONI_code` variable, a character on input, into a factor, and order the levels from "La Niña" to "El Niño" (with "Neutral" in the middle):
```{r regression2-12}
# recode ONI_code as a factor
NOAA$ONI_code <- factor(NOAA$ONI_code, levels = c("La Niña", "Neutral", "El Niño"))
```
Get a simple plot:
```{r regression2-13}
# plots
plot(T_LandOcean ~ YrMn, type = "l", col = "gray70", lwd = 1.5, data = NOAA)
points(T_LandOcean ~ YrMn, pch = 16, cex =0.8, data = NOAA)
```
Here's a `{ggplot2}` version of the basic plot:
```{r regression2-14}
# ggplot2 version
ggplot(data = NOAA, aes(x=YrMn, y=T_LandOcean)) +
geom_line(color = "gray70") +
geom_point(size = 1) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
Even at this point, there's some visual evidence that the relationship is curvilinear, with the slope of the trend increasing over time. We'll postpone dealing with that for a little while.
For later use, create another data frame, removing the observations with missing CO~2~ data.
```{r regression2-15}
# remove records with missing CO2
NOAA2 <- NOAA[!is.na(NOAA[, 10]), ]
summary(NOAA2)
str(NOAA2)
head(NOAA2)
```
[[Back to top]](regression2.html)
## Simple linear regression (lm_01)
First, do a simple linear regression with `YrMn` as the predictor and land plus ocean temperatures `T_LandOcean` as the response.
```{r regression2-16}
# simple linear (OLS) regression line
lm_01 <- lm(T_LandOcean ~ YrMn, data = NOAA)
summary(lm_01)
AIC(lm_01)
```
The fit of the "curve" (a straight line in this instance) is pretty good: the *R^2^* value is 0.8135, which is almost identical to the adjusted *R^2^*, which also accounts for the automatic increase in *R^2^* that occurs as more predictors are added. The Akaike Information Criterion (AIC) (-803.5893 here) is a similar statistic that trades off the goodness of fit against the number of parameters that have to be estimated. In practice we would choose the model with the lowest AIC value among the choices. The *t*- and *F*-statistics are also significant. These test the null hypothesis that the regression coefficients are 0.0; rejecting that hypothesis indicates a significant relationship between the response and the predictor(s).
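As a quick check on what the AIC is doing, it can be reproduced by hand from the log-likelihood, AIC = 2*k* - 2log(*L*), where *k* is the number of estimated parameters (here the intercept, the slope, and the residual variance):
```{r}
# AIC by hand: 2k - 2*logLik, with k = 2 coefficients + 1 variance
k <- length(coef(lm_01)) + 1
2 * k - 2 * as.numeric(logLik(lm_01))  # should match AIC(lm_01)
```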
Here's a look at the data, fitted values (the line) and the residuals, whose sum-of-squares were minimized:
```{r regression2-17}
# examine the fitted model -- residuals
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
segments(NOAA$YrMn, fitted(lm_01), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
```
The fitted model can be further examined by looking at the prediction intervals, which gauge the uncertainty in the predictions from the model, and the confidence intervals, which gauge the uncertainty in the regression line.
```{r regression2-18}
# examine the fitted model -- prediction and confidence limits
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
pred_data <- data.frame(YrMn=NOAA$YrMn)
pred_int <- predict(lm_01, int="p", newdata=pred_data)
conf_int <- predict(lm_01, int="c", newdata=pred_data)
matlines(pred_data$YrMn, pred_int, lty=c(1,2,2), col="black")
matlines(pred_data$YrMn, conf_int, lty=c(1,2,2), col="orange")
```
Close inspection of both intervals shows that they flare out slightly farther away from the centroid of the data. The fitted line always goes through the centroid of the data (the mean of the response and the mean of the predictor), which acts more or less as a "pivot", and so it makes sense that the uncertainties are greater the farther away from that centroid one goes.
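That "pivot" property is easy to verify: predicting at the mean of the predictor recovers the mean of the response:
```{r}
# the fitted line passes through the centroid of the data
predict(lm_01, newdata = data.frame(YrMn = mean(NOAA$YrMn)))
mean(NOAA$T_LandOcean)
```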
The assumptions can be checked using a few diagnostic plots:
```{r regression2-19}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_01, which=c(1,2,4))
acf(lm_01$residuals)
par(oldpar)
```
The residual plot (residuals plotted as a function of the fitted values) has a distinct inverted arch, suggesting that there is some non-linearity in the relationship between the response and the predictor (i.e. a non-linear trend). The `qqnorm()` plot suggests that the residuals are essentially normally distributed (except that a few large residuals tend to make the upper tail of the distribution a little "fat"). Cook's distance measures the influence of individual observations on the regression; it's not unusual in time-series data for the first and last observations to show greater leverage on the regression coefficients. The residual autocorrelation function does, however, signal that there is considerable autocorrelation in the residuals. This could arise from incompletely accounting for the trend in the data, or from short-term persistence in the response. This will be examined further below.
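The strength of that autocorrelation can be quantified directly; for example, the lag-1 autocorrelation of the residuals (a quick base-R check, matching the first bar of the `acf()` plot) is:
```{r}
# lag-1 autocorrelation of the lm_01 residuals
res_01 <- residuals(lm_01)
cor(res_01[-length(res_01)], res_01[-1])
```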
### Eye-balling the influence of ENSO on the regression model
The potential influence of ENSO on the trend in the data can be illustrated by simply labeling the points by the ONI_code values, red for El Niño, blue for La Niña, and gray for Neutral conditions:
```{r regression2-20}
# examine the regression equation again, by ENSO state
pal <- c("blue", "gray", "red")
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
segments(NOAA$YrMn, fitted(lm_01), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.8, col = pal[NOAA$ONI_code])
```
It's easy to see that many positive residuals are red, and that many negative residuals are blue, suggesting that the regression line underpredicts during El Niño conditions and overpredicts during La Niña conditions.
Here's a `{ggplot2}` package version of the labeled plot:
```{r regression2-21}
# points colored by ENSO state
ggplot() +
geom_line(data = NOAA, aes(x = YrMn, y = T_LandOcean)) +
geom_abline(intercept = lm_01$coefficients[1], slope = lm_01$coefficients[2], color = "black") +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean, color = ONI_code), size = 1.5) +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean), color = "black", shape = 1, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
Here's a plot of the residuals, on the same scales as the data plot:
```{r regression2-22}
# residuals
ggplot() +
geom_line(data = NOAA, aes(x = YrMn, y = residuals(lm_01))) +
geom_point(data = NOAA, aes(x = YrMn, y = residuals(lm_01), color = ONI_code), size = 1.5) +
geom_point(data = NOAA, aes(x = YrMn, y = residuals(lm_01)), color = "black", shape = 1, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25)
, minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean (Residuals)") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
The plot of the residuals from the first model versus time again hints at some non-linearity in the trend, and the sorting of the residual values by ENSO state is quite evident. To look at this further, we can use a grouped box plot:
```{r regression2-23}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
From the box plots, it's possible to infer that the median of the residuals during El Niño conditions is about 0.1 ^o^C higher than during La Niña conditions, while the median during Neutral conditions is very close to 0.0.
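The group medians can be checked directly with `tapply()`:
```{r}
# medians of the lm_01 residuals by ENSO state
tapply(residuals(lm_01), NOAA$ONI_code, median)
```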
The assumptions that underlie ordinary least-squares regression essentially state that there should be no discernible pattern in the residuals, apart from being normally distributed random numbers. When that assumption is satisfied, then the Gauss-Markov theorem assures us that there is no better way to estimate the regression coefficients. Here, however, there are clear patterns in the residuals: they have a slight curvilinear trend, are autocorrelated, but mainly show considerable bias related to ENSO state.
[[Back to top]](regression2.html)
## Dummy-variable regression (lm_02)
One approach for easily learning about the consequences of a "group membership" variable like ENSO state, as represented by the `ONI_code`, is the unfortunately named "dummy-variable regression." The name arises from the generation of various "dummy" (or indicator) variables that can be employed to estimate separate intercept and slope values for the different groups. For example, a set of variables for examining the variations in intercepts across groups would take on the value 0.0, except when the particular ENSO state occurs, when it would be set equal to 1.0; likewise, to examine variations in slope, the dummy variables would be set equal to 0.0, except when the particular ENSO state occurs, when they would be set equal to the predictor variable, in this case `YrMn`. Although dummy variables could be generated for all three ENSO states (in this example), only two of each kind would be included in what's now a multiple regression model, because including all three would lead to collinearities among the predictors: the sum of the intercept-estimating dummy variables would equal 1.0 for every observation, while the sum of the slope-estimating variables would equal `YrMn`.
In the current example, we'll include the dummy variables for Neutral and El Niño conditions (omitting those for La Niña conditions). To interpret the results, the intercept and slope for La Niña conditions would simply be the overall intercept and slope of the fitted model. The intercept for El Niño conditions would be the sum of the overall intercept and the coefficient of the El Niño-condition (0 or 1) intercept dummy variable, while the slope for El Niño conditions would be the sum of the overall slope and the coefficient of the El Niño condition (0 or `YrMn`) slope dummy variable. Dummy-variable regression is much more elegant than its name implies.
It turns out that the formula language in R takes care of the generation of the dummy variables (and of leaving one out). (But there will be an example of explicit generation of dummy variables below.) Here's the code to look at only the differences in intercept among the ENSO-state groups.
```{r regression2-24}
# dummy-variable regression
lm_02 <- lm(T_LandOcean ~ YrMn + ONI_code, data = NOAA)
summary(lm_02)
```
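To see the dummy variables that the formula machinery generated behind the scenes, inspect the model matrix: the first column is the all-1's intercept column, and the two `ONI_code` columns are the 0/1 indicator variables (with La Niña, the first factor level, left out):
```{r}
# the dummy (indicator) variables generated by the formula
head(model.matrix(lm_02))
```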
All of the coefficients in the model are significant as before, and the *R^2^* value has gone up. One way to gauge the improvement by the second model over the first is to compare the AIC values:
```{r regression2-25}
# print AIC values
print(c(AIC(lm_01), AIC(lm_02)))
```
The AIC value for the second model is indeed lower than that for the first model. Another comparison can be made by doing a simple analysis of variance:
```{r regression2-26}
# compare model by ANOVA
anova(lm_01, lm_02)
```
The *F*-statistic is significant (i.e. it has a low *p*-value), and so overall, the second, dummy-variable model (lm_02) appears to be (statistically) superior to the simple linear regression model (lm_01).
Here's what the fitted model looks like:
```{r regression2-27}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_02)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_02)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_02)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
```
Here is a comparison of the grouped box plots for the two models:
```{r regression2-28}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_02) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
The bias in the residuals (positive for El Niño conditions, negative for La Niña conditions) has evidently been removed.
[[Back to top]](regression2.html)
## Dummy-variable regression, both intercept and slope varying (lm_03)
Here is the code for fitting a model where both intercept and slope vary. (Note the subtle difference, an `*` instead of a `+` in the model specification.)
```{r regression2-29}
# dummy-variable regression, slope and intercept varying
lm_03 <- lm(T_LandOcean ~ YrMn * ONI_code, data = NOAA)
summary(lm_03)
```
Compare the models:
```{r regression2-30}
# print AIC values
print(c(AIC(lm_01), AIC(lm_02), AIC(lm_03)))
```
```{r regression2-31}
# compare via ANOVA
anova(lm_02, lm_03)
```
The AIC value for this third model (intercept and slope varying) is a little lower than that for the second, but the analysis of variance suggests there's not much difference in the explained variance (small *F*-statistic, large *p*-value).
The fitted lines show little variation:
```{r regression2-32}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_03)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_03)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_03)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
```
The slope for the Neutral conditions is a little lower than for the La Niña and El Niño, which doesn't have an immediately apparent physical explanation.
Here is the same plot, with the regression lines for `lm_02` added as dashed lines:
```{r regression2-33}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_03)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_03)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_03)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
lines(fitted(lm_02)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=1, lty = 2, col="blue")
lines(fitted(lm_02)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=1, lty = 2,col="black")
lines(fitted(lm_02)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=1, lty = 2,col="red")
```
With negligible improvement in the goodness of fit of `lm_03`, and no practical differences in the curves, there seems to be little reason to declare the model where both intercept and slope vary among ENSO states as any better than the one where just the intercept varies. In fact the "principle of parsimony" in statistics would strongly argue for `lm_02` as the better model.
[[Back to top]](regression2.html)
# A model with CO~2~ as the predictor variable
Although it makes sense when trying to assess the overall trend in global temperatures to use time (`YrMn`) as a predictor, we know that in fact it's the level of CO~2~ and other greenhouse gases (GHGs) that is responsible for the temperature increase. The NOAA data set from Mauna Loa contains both the mean monthly values of CO~2~ (plotted in red below) and a "deseasonalized" value that takes out the prominent annual cycle in the CO~2~ data related to the "respiration" of the Northern Hemisphere biosphere (plotted in black). There is some spatial variability in monthly CO~2~ values, with the Northern and Southern Hemispheres being "antiphased", and so for looking at global temperatures it makes the most sense to use the deseasonalized values as a potential predictor.
```{r regression2-34}
plot(CO2_mean ~ YrMn, data = NOAA2, pch = 16, col = "red", cex = 0.3, ylab = expression("Mauna Loa CO"[2]))
points(CO2_deseas ~ YrMn, data = NOAA2, pch = 16, col = "black", cex = 0.2)
```
## Simple linear regression, CO~2~ as a predictor (lm_05)
For comparison, using now the `NOAA2` data frame, from which the earlier months, when CO~2~ data were not yet available, have been removed, the simple linear model with `YrMn` as a predictor is:
```{r regression2-35}
# simple linear (OLS) regression line, only with NOAA2
lm_04 <- lm(T_LandOcean ~ YrMn, data = NOAA2)
summary(lm_04)
AIC(lm_04)
```
Now here is the model with CO~2~ as the predictor:
```{r regression2-36}
# simple linear (OLS) regression line, CO2 as predictor
lm_05 <- lm(T_LandOcean ~ CO2_deseas, data = NOAA2)
summary(lm_05)
AIC(lm_05)
```
Here are the residual diagnostic plots:
```{r regression2-37}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_05, which=c(1,2,4))
acf(residuals(lm_05))
par(oldpar)
```
In general, except for the autocorrelation of the residuals, the residuals are a bit better behaved relative to the first model, `lm_01`. The *R^2^* values are higher, and the AIC is more negative for the model with CO~2~ as a predictor, suggesting that overall it's a better predictor of the trend in monthly globally averaged temperatures than time alone. (Not surprising.)
```{r regression2-38}
print(c(AIC(lm_04), AIC(lm_05)))
```
Here's a plot:
```{r regression2-39}
# points colored by ENSO state
ggplot() +
geom_abline(intercept = lm_05$coeff[1], slope = lm_05$coeff[2], color = "black") +
geom_point(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean), color = "black", shape = 1, size = 2) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(315, 430, by = 15)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = expression(CO[2]~" "(ppm)), y = "Land + Ocean") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
[[Back to top]](regression2.html)
## Dummy-variable regression, intercepts-only, CO~2~ as a predictor (lm_06)
Next, we'll fit a dummy-variable regression, with CO~2~ as a predictor, allowing the intercepts to vary among ENSO states:
```{r regression2-40}
# dummy-variable regression
lm_06 <- lm(T_LandOcean ~ CO2_deseas + ONI_code, data = NOAA2)
summary(lm_06)
print(c(AIC(lm_05), AIC(lm_06)))
anova(lm_05, lm_06)
```
As before, with `YrMn` as the predictor, the dummy-variable regression performs better. Here's a plot:
```{r regression2-41}
# display the fitted lines
plot(T_LandOcean ~ CO2_deseas, data = NOAA2, type="n")
points(T_LandOcean ~ CO2_deseas, data = NOAA2, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_06)[NOAA2$ONI_code == "La Niña"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_06)[NOAA2$ONI_code == "Neutral"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_06)[NOAA2$ONI_code == "El Niño"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "El Niño"], lwd=2, col="red")
```
Here's the `{ggplot2}` version:
```{r regression2-42}
# intercepts and slopes lm_06
ln_intercept <- lm_06$coeff[1]
ln_slope <- lm_06$coeff[2]
n_intercept <- lm_06$coeff[1] + lm_06$coeff[3]
n_slope <- lm_06$coeff[2]
en_intercept <- lm_06$coeff[1] + lm_06$coeff[4]
en_slope <- lm_06$coeff[2]
# ggplot2 plot with regression line
ggplot(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean)) +
geom_abline(intercept = ln_intercept, slope = ln_slope, color = "blue") +
geom_abline(intercept = n_intercept, slope = n_slope, color = "gray40") +
geom_abline(intercept = en_intercept, slope = en_slope, color = "red") +
geom_point(data = NOAA2, aes(x=CO2_deseas, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=CO2_deseas, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(315, 430, by = 15)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = expression(CO[2]~" "(ppm)), y = "Land + Ocean") +
theme_bw()
```
The CO~2~ values have an evident non-linear trend over time, while the relationship of temperature with CO~2~ seems quite linear. Recalling that the relationship of temperature with time tends to be slightly curvilinear, this suggests that global temperatures indeed have a curvilinear, as opposed to a linear, trend. An interesting addition to a scatter plot of temperature vs. time is the fitted values of the regression with CO~2~ as the predictor:
```{r regression2-43}
# scatter plot with CO2-fitted values ===============================
ln_fit <- ln_intercept + ln_slope*NOAA2$CO2_deseas
n_fit <- n_intercept + n_slope*NOAA2$CO2_deseas
en_fit <- en_intercept + en_slope*NOAA2$CO2_deseas
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_point(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", shape = 4, size = 0.5) +
geom_point(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", shape = 4, size = 0.5) +
geom_point(aes(x = NOAA2$YrMn, y = en_fit), color = "red", shape = 4, size = 0.5) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
Above, the fitted or predicted values from `lm_06` are plotted as tiny "x"'s, which together give the impression of a curvilinear relationship.
[[Back to top]](regression2.html)
# Curvilinear trends
## A simple polynomial trend (lm_07)
A curvilinear, second-order polynomial regression model (i.e. *Y~t~* = *b~0~* + *b~1~X~t~* + *b~2~X~t~^2^* + *e~t~*) can be fit as follows. Note the use of the `poly()` function to generate the predictor variables.
```{r regression2-44}
# second-order polynomial
lm_07 <- lm(T_LandOcean ~ poly(YrMn, 2, raw = TRUE), data = NOAA)
summary(lm_07)
AIC(lm_07)
```
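(The `raw = TRUE` argument requests ordinary powers of `YrMn`. The default, orthogonal polynomials, yields different coefficients but identical fitted values, and is often better behaved numerically when the predictor values are large, as the years are here; `lm_07o` below is just a throwaway object for the comparison.)
```{r}
# orthogonal polynomials: same fitted values, different coefficients
lm_07o <- lm(T_LandOcean ~ poly(YrMn, 2), data = NOAA)
all.equal(fitted(lm_07), fitted(lm_07o))
```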
```{r regression2-45}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_07, which=c(1,2,4))
acf(lm_07$residuals)
par(oldpar)
```
Note that the AIC value has decreased substantially relative to that of `lm_01`, the linear model fit to the same data.
Here is the plot of the polynomial fit:
```{r regression2-46}
# examine the regression equation again, by ENSO state
pal <- c("blue", "gray", "red")
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
segments(NOAA$YrMn, fitted(lm_07), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
lines(NOAA$YrMn, fitted(lm_07))
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.8, col = pal[NOAA$ONI_code])
```
The polynomial line seems to better discriminate between the various ENSO states than the linear regression line did. Here's a `{ggplot2}` version with linear and loess smooths added:
```{r regression2-47}
# ggplot2 plot with regression line
ggplot(data = NOAA, aes(x = YrMn, y = T_LandOcean)) +
geom_line(color = "gray70") +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean), color = "black", shape = 1, size = 2) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
geom_smooth(method = "lm", se = TRUE, level = 0.95) +
geom_smooth(method = "loess", se = TRUE, level = 0.95, col = "purple") +
geom_line(aes(x = YrMn, y = fitted(lm_07)), color = "black", size = 1) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
And the residual grouped box plot shows that, relative to the original model `lm_01`, the variability of the residuals is smaller, but they are still biased by ENSO state:
```{r regression2-48}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_07) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
## A polynomial dummy-variable regression (lm_08)
The obvious solution to deal with the ENSO-state bias in the residuals is to again implement a dummy-variable regression. Here's the code and results:
```{r regression2-49}
# polynomial dummy-variable regression
lm_08 <- lm(T_LandOcean ~ poly(YrMn, 2, raw = TRUE) + ONI_code, data = NOAA)
summary(lm_08)
AIC(lm_08)
```
Here's the `{ggplot2}` plot:
```{r regression2-50}
# get the fitted values
ln_fit <- lm_08$coeff[1] + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
n_fit <- (lm_08$coeff[1] + lm_08$coeff[4]) + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
en_fit <- (lm_08$coeff[1] + lm_08$coeff[5]) + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
```
```{r regression2-51}
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_line(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = en_fit), color = "red", linewidth = 1.0) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
And the residual diagnostic plots:
```{r regression2-52}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_08, which=c(1,2,4))
acf(lm_08$residuals)
par(oldpar)
```
Except for the autocorrelation in the residuals, these are the "best-looking" plots we've seen. The AIC values confirm that `lm_08` fits the data better than the earlier models:
```{r regression2-53}
print(c(AIC(lm_01), AIC(lm_03), AIC(lm_08)))
```
And this is also confirmed by the ANOVA results:
```{r regression2-54}
# anova
anova(lm_07, lm_08)
anova(lm_01, lm_08)
```
```{r regression2-55}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_02) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_08) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
As was the case for the linear model, the dummy-variable polynomial model eliminated the ENSO-state related residual bias.
[[Back to top]](regression2.html)
# A regression model with autocorrelated residuals
There is one remaining issue: the residuals are autocorrelated. This is not really a surprise, either looking at the data or considering what ENSO SST variability is all about. However, the autocorrelation does violate the OLS assumptions. The autocorrelation can be explicitly accounted for (and hopefully removed) by simultaneously fitting a "Box-Jenkins" type of time-series model to the data while fitting the polynomial dummy-variable regression. This can be accomplished using the `arima()` function from the base `{stats}` package. The process requires a little more set-up, but is not difficult.
Box-Jenkins, or "ARIMA" (autoregressive integrated moving-average), models are a topic in themselves, but they basically relate the current value of a time series to its previous values, and to the current and previous values of "innovations" or "exogenous shocks". They thus can represent both "memory" in a system as well as current perturbations to its state. In the case of ENSO, this sounds like the memory incorporated by the simple thermal inertia of sea water, displayed by the slow changes in SST anomalies, as well as the faster current and previous interactions with the atmosphere.
## Set up
To use the `arima()` function to fit a model that includes autoregressive and moving-average terms as well as the polynomial dummy variables, we have to make the dummy variables manually:
```{r regression2-56}
# make the dummy variables
dum_ln <- rep(0, length(NOAA$YrMn))
dum_n <- rep(0, length(NOAA$YrMn))
dum_en <- rep(0, length(NOAA$YrMn))
dum_ln[NOAA$ONI_code == "La Niña"] <- 1
dum_n[NOAA$ONI_code == "Neutral"] <- 1
dum_en[NOAA$ONI_code == "El Niño"] <- 1
# head(cbind(dum_ln, dum_n, dum_en), 30)
```
Note that we made dummy variables for all three ENSO states, but only two will be included in the model, to avoid collinearity. Next, make a data frame of the regression predictors, `xreg`:
```{r regression2-57}
# make regression data frame
xreg <- data.frame(NOAA$YrMn, NOAA$YrMn^2, dum_n, dum_en) # note: leave dum_ln out
names(xreg) <- c("YrMn", "YrMn^2", "Neutral", "El Niño")
head(xreg)
```
## Time-series regression model
Next, the time-series regression. There is a regular set of procedures for iteratively determining the form of the ARIMA model (model identification), fitting the model (estimation), and diagnostic checking, which are beyond this example. Here, it looks like an ARMA(2,2) model (two autoregressive terms, two moving-average terms) will be apt. The model is a little complicated, but the equation that is being fit to the data (by minimizing the sum-of-squares of the residuals) is *Y~t~ = φ~1~Y~t-1~ + φ~2~Y~t-2~ + θ~1~a~t-1~ + θ~2~a~t-2~ + b~0~ + b~1~X~t~ + b~2~X~t~^2^ + b~3~D~(1)t~ + b~4~D~(2)t~ + a~t~*, where *Y~t~* is the current temperature value, the *X~t~*'s are the predictors (including `YrMn` and the dummy variables *D~(1)t~* and *D~(2)t~*), *a~t~* is the residual or prediction error, and *φ~p~*, *θ~q~*, and *b~r~* are the autoregressive (AR), moving-average (MA), and regression coefficients.
```{r regression2-58}
# time series regression, AR(2) and MA(2)
tsreg_01 <- arima(NOAA$T_LandOcean, order = c(2, 0, 2), xreg = xreg, include.mean = TRUE)
tsreg_01
```
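As a cross-check on the choice of an ARMA(2,2) model, the `auto.arima()` function from the `{forecast}` package (already loaded) can search over candidate orders while including the same regression predictors; this is only a sketch of one way to do the identification step, and the order it selects may differ somewhat from the one chosen by inspection:
```{r}
# automatic order selection as a cross-check (the search may take a moment)
auto.arima(NOAA$T_LandOcean, d = 0, xreg = as.matrix(xreg))
```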
Check the residuals using a function from the `{forecast}` package:
```{r regression2-59}
checkresiduals(tsreg_01, test = "LB", lag = 12)
```
The Ljung-Box test examines the null hypothesis that the autocorrelations of the residuals are 0.0. Here the large *p*-value suggests that this hypothesis cannot be rejected, i.e. the residuals appear to be uncorrelated.
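The same test is available in base R as `Box.test()` from the `{stats}` package; the `fitdf` argument discounts the degrees of freedom for the four estimated ARMA coefficients:
```{r}
# Ljung-Box test directly, discounting the 2 AR + 2 MA coefficients
Box.test(tsreg_01$residuals, lag = 12, type = "Ljung-Box", fitdf = 4)
```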
Get the fitted values:
```{r regression2-60}
# list the coefficients of the model
tsreg_01$coef
# the intercept is the 5-th term in the model
c1 <- 5
```
```{r regression2-61}
# fitted values
ln_fit <- tsreg_01$coef[c1] + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
n_fit <- (tsreg_01$coef[c1] + tsreg_01$coef[c1 + 3]) + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
en_fit <- (tsreg_01$coef[c1] + tsreg_01$coef[c1 + 4]) + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
head(cbind(ln_fit, n_fit, en_fit))
```
The usual plot:
```{r regression2-62}
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_line(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = en_fit), color = "red", linewidth = 1.0) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25)
, minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
The residual diagnostic plots (note that there is no time-series equivalent of Cook's distance):
```{r regression2-63}
# residual diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(fitted(tsreg_01), tsreg_01$residuals)
qqnorm(tsreg_01$residuals)
qqline(tsreg_01$residuals)
acf(tsreg_01$residuals)
par(oldpar)
```
These look good. The AIC values suggest that this model fits still better than the previous ones, so it was worthwhile to include the ARMA terms:
```{r regression2-64}
print(c(AIC(lm_01), AIC(lm_08), AIC(tsreg_01)))
```
Here are the residual grouped box plots:
```{r regression2-65}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_08) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(tsreg_01) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
The spread of the residuals from the time-series regression model is clearly smaller than that of the others, which can be seen by simply overplotting the residuals from `lm_08` with those from `tsreg_01`:
```{r regression2-66}
plot(NOAA$YrMn, residuals(lm_08), type = "l", col = "magenta")
lines(NOAA$YrMn, tsreg_01$residuals, type = "l", col = "green")
```
[[Back to top]](regression2.html)
## Slope of the polynomial trend
The original focus of this example was to estimate the trend in monthly global average temperatures, and the influence of the ENSO state on rising temperatures. However, the fact that the trend is curvilinear complicates this, because the slope varies over time. The slope of a polynomial is just its first derivative. The increasing slope of the fitted curves (which are the same for each ENSO state; only the intercepts vary) can be obtained as follows:
```{r regression2-67}
# slope is first derivative of the polynomial,
# If f(x) = b0 + b1 * x + b2 * x^2, then f'(x) = 2 * b2 * x + b1
c1 <- 5
slope <- 2.0 * tsreg_01$coef[c1 + 2] * NOAA$YrMn + tsreg_01$coef[c1 + 1]
plot(NOAA$YrMn, slope, type = "o", cex = 0.5, col = "red", xlab = "Year", ylab = "slope (degC/Year)")
```
As can be seen, the slope of the polynomial increases over time. Here's another perspective:
```{r regression2-68}
slope_1960 <- 2.0 * tsreg_01$coef[c1 + 2] * 1960 + tsreg_01$coef[c1 + 1]
slope_2020 <- 2.0 * tsreg_01$coef[c1 + 2] * 2020 + tsreg_01$coef[c1 + 1]
slope_ratio <- (slope_2020 / slope_1960)
slopes <- data.frame(slope_1960, slope_2020, slope_ratio)
colnames(slopes) <- c("slope_1960", "slope_2020", "ratio")
rownames(slopes) <- NULL
slopes
```
The rate of global warming has increased roughly fivefold between 1960 and 2020!
[[Back to top]](regression2.html)