---
title: "Regression 2 -- Multiple regression"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r regression2-1, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2024, but may undergo further edits.</span></p>
# Introduction #
Multiple regression is (conceptually) a simple extension of bivariate regression, in which the influence of more than one predictor variable on the response can be estimated. For the case with two predictor variables, the analysis can be thought of as involving the fitting of a plane (as opposed to a line in the bivariate regression case), and the equations for the OLS estimates of the regression equations are only a little more complicated algebraically. For three or more predictors, the algebra is also quite simple, but requires the use of matrix algebra.
A couple of illustrations jointly describe the idea of fitting a plane:
- [fitting a plane using two predictor variables](https://pjbartlein.github.io/REarthSysSci/images/nwk7_1.gif)
- [one data point and its deviation from the regression plane or response surface](https://pjbartlein.github.io/REarthSysSci/images/mreg.gif)
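As a minimal sketch of the plane-fitting idea (simulated data with made-up coefficients, used only for illustration), fitting a plane with two predictors looks like this; the estimates should recover the coefficients of the plane:
```{r}
# fit a plane to simulated data: y = 1 + 2*x1 - 0.5*x2 + noise
set.seed(42)
sim <- data.frame(x1 = runif(100), x2 = runif(100))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(100, sd = 0.1)
coef(lm(y ~ x1 + x2, data = sim))  # should be close to 1.0, 2.0, and -0.5
```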
# Fitting a multiple regression equation #
The mathematics behind multiple regression analysis is more complicated than that for bivariate regression, but it can be elegantly presented using matrix algebra:
- [matrix algebra](https://pjbartlein.github.io/REarthSysSci/topics/matrix.pdf)
- [regression analysis in matrix algebra terms](https://pjbartlein.github.io/REarthSysSci/topics/matreg.html)
The following example provides a short illustration of the use of matrix algebra to obtain the regression coefficients.
The example data set used for illustrating regression diagnostics (`regrex3.csv`, in `/Users/bartlein/projects/RESS/data/csv_files/`) is used here; in particular, the multiple regression using `x1` and `x2` as predictors for the response variable `y5`.
Read the data:
```{r regression2-2 }
# read regrex3.csv
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "regrex3.csv"
csv_file <- paste(csv_path, csv_name, sep="")
regrex3 <- read.csv(csv_file)
```
First, take a look at the different variables in the example data set.
```{r regression2-3 }
# regrex3
attach(regrex3)
summary(regrex3)
```
```{r}
head(cbind(y5,x1,x2))
```
Create an *n* row by 1 column matrix (i.e. a column vector) called **y**:
```{r regression2-4 }
# create the column vector y
n <- length(y5)
y <- matrix(y5, nrow=n, ncol=1)
dim(y)
head(y)
```
Create an *n* row by *p*+1 matrix, **X**, with 1's in the first column, and `x1` and `x2` in the second and third columns:
```{r regression2-5 }
# create the predictor-variable matrix
X <- matrix(cbind(rep(1,n),x1,x2), nrow=n, ncol=3)
dim(X)
head(X)
```
Now use matrix algebra to calculate **b**, the *p*+1 row by 1 column matrix (i.e. a column vector) of regression coefficients, *b*<sub>0</sub>, *b*<sub>1</sub> and *b*<sub>2</sub>: **b** = (**X'X**)<sup>-1</sup>**X'y**.
```{r regression2-6 }
# calculate the regression coefficients
b <- solve(t(X) %*% X) %*% (t(X) %*% y)
print(b)
dim(b)
```
The matrix functions and operators used in the above expression include `t()`, which transposes a matrix, `%*%`, which is the matrix multiplication operator, and `solve()`, which inverts a matrix.
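As an aside, the same coefficients can be computed a little more efficiently (and with better numerical stability) using `crossprod()`, which forms **X'X** directly, and by letting `solve()` solve the normal equations rather than explicitly inverting the matrix. This is just a sketch of an equivalent formulation, not a different estimator:
```{r}
# equivalent calculation: crossprod(X) is t(X) %*% X, and
# solve(A, b) solves the system directly, without forming the inverse
b2 <- solve(crossprod(X), crossprod(X, y))
print(b2)
```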
Compare these values with those obtained using the `lm()` function:
```{r regression2-7 }
# linear model with lm()
lm1 <- lm(y5 ~ x1+x2, data=regrex3)
lm1
```
Now calculate the fitted values (*y-hats*), i.e. \(\mathbf{\widehat{y}}\) = **Xb**:
```{r regression2-8 }
# matrix fitted values
yhat <- X %*% b
```
and compare these with those obtained using the `lm()` function:
```{r regression2-9 }
head(cbind(yhat,lm1$fitted.values))
```
In addition to being able to efficiently represent the derivation of terms and their properties in regression analysis in general, matrix algebra also provides an efficient way of doing the actual calculations.
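For example, the residual variance and the standard errors of the regression coefficients also follow directly from the matrix expressions (a short sketch; compare the results with the `Std. Error` column of `summary(lm1)`):
```{r}
# residual variance and coefficient standard errors via matrix algebra
e <- y - yhat                                # residuals
s2 <- as.numeric(t(e) %*% e) / (n - 3)       # residual variance, with n - (p + 1) df
se_b <- sqrt(diag(s2 * solve(t(X) %*% X)))   # standard errors of b0, b1, b2
cbind(b, se_b)
```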
[[Back to top]](regression2.html)
# Regression Assumptions #
The basic regression model (as well as more complicated ones) has certain underlying assumptions, violations of which have an impact on the optimality of the fitted model, i.e. the extent to which the model and its parameters represent the best model that can be fit, that is, the one that performs best in representing the relationship between the response variable and the predictor variable(s), or in predicting future values of the response variable given new values of the predictor variables.
The main assumptions that underlie regression analysis are:
- the prediction errors or residuals are independent, identically distributed normal random variables, with a mean of 0 and a standard deviation of *s*,
- the *X*'s (predictor or independent variables) are known without error,
- the *X*'s are not correlated, and
- the correct model has been specified (i.e. the right predictors have been included in the model).
If the assumptions are not violated, then the Gauss-Markov theorem indicates that the usual OLS estimates are optimal, in the sense of being unbiased and having minimum variance. If one or more of the assumptions are violated, then the estimated regression coefficients may be biased (i.e. they may be systematically in error) and may not have minimum variance (i.e. there may be more uncertainty in the coefficients than is apparent from the results).
## Consequences of assumption violations ##
If the assumptions are violated, then there may be two consequences: the estimated coefficients may be biased (i.e. systematically wrong), and they may no longer have minimum variance (i.e. their uncertainty increases).
- the notion of variability of the regression coefficients
- illustrations using repeated simulations of data sets with built-in assumption violations [[examples]](https://pjbartlein.github.io/REarthSysSci/images/violate1.gif), [[solutions]](https://pjbartlein.github.io/REarthSysSci/images/violate2.gif); a small simulation in this spirit is sketched below
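Here is a minimal simulation of that kind (an assumed setup for illustration, not the linked examples): many data sets are generated with the same trend, once with independent errors and once with autocorrelated AR(1) errors of the same marginal variance, and the spread of the estimated slopes is compared:
```{r}
# repeated simulations: OLS slopes with iid vs. autocorrelated errors
set.seed(1)
nsim <- 500; nobs <- 100; x <- 1:nobs
slope_iid <- slope_ar <- numeric(nsim)
for (i in 1:nsim) {
  y_iid <- 0.05 * x + rnorm(nobs)
  y_ar <- 0.05 * x + as.numeric(arima.sim(list(ar = 0.8), n = nobs, sd = sqrt(1 - 0.8^2)))
  slope_iid[i] <- coef(lm(y_iid ~ x))[2]
  slope_ar[i] <- coef(lm(y_ar ~ x))[2]
}
# both sets of slopes are centered on the true value 0.05 (unbiased),
# but the slopes from the autocorrelated-error data are more variable
c(sd_iid = sd(slope_iid), sd_ar = sd(slope_ar))
```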
[[Back to top]](regression2.html)
# Example -- NOAA Global Land + Ocean temperatures
It's plain to see that Earth's average temperature is increasing (e.g. the [[NOAA 2023 Global Climate Report]](https://www.ncei.noaa.gov/access/monitoring/monthly-report/global/202313); see the second figure). One issue that arises is the estimation of the rate of temperature increase, and whether the rate is also changing over time. One factor that complicates that assessment is ENSO (El Niño-Southern Oscillation), the variation in temperature of the tropical Pacific Ocean [[https://www.climate.gov/enso]](https://www.climate.gov/enso) between warm phases (El Niño) and cool phases (La Niña). In addition to the influence of ENSO on atmospheric circulation across the Pacific, owing to the large surface area of the Pacific Ocean, ENSO may influence global average temperatures, locally (in time) increasing them during warm phases (El Niño) and locally decreasing them during cool phases (La Niña) ([[NOAA 2023 Global Climate Report]](https://www.ncei.noaa.gov/access/monitoring/monthly-report/global/202313); see the third figure).
Because both things are going on at once (the overall trend toward increasing temperature, and the local ENSO-related increases and decreases), it's difficult to quantitatively assess or separate the trend and the local variability from the inspection of figures alone. That's where quantitative estimation using regression comes in. The information for colorizing the 1950-present monthly temperature plot comes from the [[Oceanic Niño Index]](https://www.climate.gov/news-features/understanding-climate/climate-variability-oceanic-nino-index), a composite of several tropical Pacific sea-surface temperature (SST) indices, which has been detrended. The data are from [[ONI data]](https://www.cpc.ncep.noaa.gov/data/indices/oni.ascii.txt); see here for the data in tabular form: [[ONI data table]](https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php).
In addition to the ONI data, we'll also use the [[NOAA Global Monthly Land + Ocean time series]](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/ytd/12/1850-2023) (back to 1950). Because the trend in global temperatures is not mechanistically controlled by time, but by CO~2~, we'll also use the monthly Scripps/NOAA CO~2~ trends data [[https://gml.noaa.gov/ccgg/trends/]](https://gml.noaa.gov/ccgg/trends/) (March 1958 onward). The data can be found here: [[NOAA CO~2~ Trends]](https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.txt). The input data set (`NOAA_Globe_T_ENSO_CO2_1950-2023.csv`) contains the year, month, decimal year, temperatures of the land, ocean, and land plus ocean, the ONI monthly total value, anomaly, and code, and the monthly mean CO~2~, the deseasonalized CO~2~ value, and their difference (the "local" CO~2~ anomaly).
## Read the NOAA data
Load the libraries:
```{r regression2-10}
library(ggplot2)
library(forecast)
```
Read the data, modifying paths as necessary:
```{r regression2-11}
# read NOAA monthly temperature, ONI (ENSO), and CO2
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "NOAA_Globe_T_ENSO_CO2_1950-2023.csv"
csv_file <- paste(csv_path, csv_name, sep="")
NOAA <- read.csv(csv_file)
str(NOAA)
summary(NOAA)
```
Note the missing CO~2~ data prior to 1958.
Decode the `ONI_code` variable, a character on input, into a factor, and order the levels from "La Niña" to "El Niño" (with "Neutral" in the middle):
```{r regression2-12}
# recode ONI_code as a factor
NOAA$ONI_code <- factor(NOAA$ONI_code, levels = c("La Niña", "Neutral", "El Niño"))
```
Get a simple plot:
```{r regression2-13}
# plots
plot(T_LandOcean ~ YrMn, type = "l", col = "gray70", lwd = 1.5, data = NOAA)
points(T_LandOcean ~ YrMn, pch = 16, cex =0.8, data = NOAA)
```
Here's a `{ggplot2}` version of the basic plot:
```{r regression2-14}
# ggplot2 version
ggplot(data = NOAA, aes(x=YrMn, y=T_LandOcean)) +
geom_line(color = "gray70") +
geom_point(size = 1) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
Even at this point, there's some visual evidence that the relationship is curvilinear, with the slope of the trend increasing over time. We'll postpone dealing with that for a little while.
For later use, create another data frame, removing the observations with missing CO~2~ data.
```{r regression2-15}
# remove records with missing CO2
NOAA2 <- NOAA[!is.na(NOAA[, 10]), ]
summary(NOAA2)
str(NOAA2)
head(NOAA2)
```
[[Back to top]](regression2.html)
## Simple linear regression (lm_01)
First, do a simple linear regression with `YrMn` as the predictor and land plus ocean temperatures `T_LandOcean` as the response.
```{r regression2-16}
# simple linear (OLS) regression line
lm_01 <- lm(T_LandOcean ~ YrMn, data = NOAA)
summary(lm_01)
AIC(lm_01)
```
The fit of the "curve" (a straight line in this instance) is pretty good: the *R^2^* value is 0.8135, which is almost identical to the adjusted *R^2^*, which also accounts for the automatic increase in *R^2^* that occurs as more predictors are added. The Akaike Information Criterion (AIC) (-803.5893 here) is a similar statistic that trades off the goodness of fit against the number of parameters that have to be estimated. In practice we would choose the model with the lowest AIC value among the choices. The *t*- and *F*-statistics are also significant. These test the null hypothesis that the regression coefficients are 0.0; rejecting that hypothesis indicates a significant relationship between the response and the predictor(s).
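As a quick check on what the AIC is doing, it can be reproduced by hand from the log-likelihood, AIC = 2*k* - 2log(*L*), where *k* is the number of estimated parameters (here the intercept, the slope, and the residual variance):
```{r}
# AIC by hand: 2k - 2*logLik, with k = 2 coefficients + 1 variance
k <- length(coef(lm_01)) + 1
2 * k - 2 * as.numeric(logLik(lm_01))  # should match AIC(lm_01)
```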
Here's a look at the data, fitted values (the line) and the residuals, whose sum-of-squares were minimized:
```{r regression2-17}
# examine the fitted model -- residuals
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
segments(NOAA$YrMn, fitted(lm_01), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
```
The fitted model can be further examined by looking at the prediction intervals, which gauge the uncertainty in the predictions from the model, and the confidence intervals, which gauge the uncertainty in the regression line.
```{r regression2-18}
# examine the fitted model -- prediction and confidence limits
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
pred_data <- data.frame(YrMn=NOAA$YrMn)
pred_int <- predict(lm_01, int="p", newdata=pred_data)
conf_int <- predict(lm_01, int="c", newdata=pred_data)
matlines(pred_data$YrMn, pred_int, lty=c(1,2,2), col="black")
matlines(pred_data$YrMn, conf_int, lty=c(1,2,2), col="orange")
```
Close inspection of both intervals shows that they flare out slightly farther away from the centroid of the data. The fitted line always goes through the centroid of the data (the mean of the response and the mean of the predictor), which acts more or less as a "pivot", and so it makes sense that the uncertainties are greater the farther away from that centroid one goes.
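That "pivot" property is easy to verify: predicting at the mean of the predictor recovers the mean of the response:
```{r}
# the fitted line passes through the centroid of the data
predict(lm_01, newdata = data.frame(YrMn = mean(NOAA$YrMn)))
mean(NOAA$T_LandOcean)
```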
The assumptions can be checked using a few diagnostic plots:
```{r regression2-19}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_01, which=c(1,2,4))
acf(lm_01$residuals)
par(oldpar)
```
The residual plot (residuals plotted as a function of the fitted values) has a distinct inverted arch, suggesting that there is some non-linearity in the relationship between the response and the predictor (i.e. a non-linear trend). The `qqnorm()` plot suggests that the residuals are essentially normally distributed (except that a few large residuals tend to make the upper tail of the distribution a little "fat"). Cook's distance measures the influence of individual observations on the regression; it's not unusual in time-series data for the first and last observations to show greater leverage on the regression coefficients. The residual autocorrelation function does, however, signal that there is considerable autocorrelation in the residuals. This could arise from incompletely accounting for the trend in the data, or from short-term persistence in the response. This will be examined further below.
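The strength of that autocorrelation can be quantified directly; for example, the lag-1 autocorrelation of the residuals (a quick base-R check, matching the first bar of the `acf()` plot) is:
```{r}
# lag-1 autocorrelation of the lm_01 residuals
res_01 <- residuals(lm_01)
cor(res_01[-length(res_01)], res_01[-1])
```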
### Eye-balling the influence of ENSO on the regression model
The potential influence of ENSO on the trend in the data can be illustrated by simply labeling the points by the ONI_code values, red for El Niño, blue for La Niña, and gray for Neutral conditions:
```{r regression2-20}
# examine the regression equation again, by ENSO state
pal <- c("blue", "gray", "red")
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
abline(lm_01)
segments(NOAA$YrMn, fitted(lm_01), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.8, col = pal[NOAA$ONI_code])
```
It's easy to see that many positive residuals are red, and that many negative residuals are blue, suggesting that the regression line underpredicts during El Niño conditions and overpredicts during La Niña conditions.
Here's a `{ggplot2}` package version of the labeled plot:
```{r regression2-21}
# points colored by ENSO state
ggplot() +
geom_line(data = NOAA, aes(x = YrMn, y = T_LandOcean)) +
geom_abline(intercept = lm_01$coefficients[1], slope = lm_01$coefficients[2], color = "black") +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean, color = ONI_code), size = 1.5) +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean), color = "black", shape = 1, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
Here's a plot of the residuals, on the same scales as the data plot:
```{r regression2-22}
# residuals
ggplot() +
geom_line(data = NOAA, aes(x = YrMn, y = residuals(lm_01))) +
geom_point(data = NOAA, aes(x = YrMn, y = residuals(lm_01), color = ONI_code), size = 1.5) +
geom_point(data = NOAA, aes(x = YrMn, y = residuals(lm_01)), color = "black", shape = 1, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25)
, minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean (Residuals)") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
The plot of the residuals from the first model versus time again hints at some non-linearity in the trend, and the sorting of the residual values by ENSO state is quite evident. To look at this further, we can use a grouped box plot:
```{r regression2-23}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
From the box plots, it's possible to infer that the median of the residuals during El Niño conditions is about 0.1 ^o^C higher than during La Niña conditions, while the median during Neutral conditions is very close to 0.0.
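The group medians can be checked directly with `tapply()`:
```{r}
# medians of the lm_01 residuals by ENSO state
tapply(residuals(lm_01), NOAA$ONI_code, median)
```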
The assumptions that underlie ordinary least-squares regression essentially state that there should be no discernible pattern in the residuals, apart from being normally distributed random numbers. When that assumption is satisfied, then the Gauss-Markov theorem assures us that there is no better way to estimate the regression coefficients. Here, however, there are clear patterns in the residuals: they have a slight curvilinear trend, are autocorrelated, but mainly show considerable bias related to ENSO state.
[[Back to top]](regression2.html)
## Dummy-variable regression (lm_02)
One approach for easily learning about the consequences of a "group membership" variable like ENSO state, as represented by the `ONI_code`, is the unfortunately named "dummy-variable regression." The name arises from the generation of various "dummy" (or indicator) variables that can be employed to estimate separate intercept and slope values for the different groups. For example, a set of variables for examining the variations in intercepts across groups would take on the value 0.0, except when the particular ENSO state occurs, when it would be set equal to 1.0; likewise, to examine variations in slope, the dummy variables would be set equal to 0.0, except when the particular ENSO state occurs, when they would be set equal to the predictor variable, in this case `YrMn`. Although dummy variables could be generated for all three ENSO states (in this example), only two of each kind would be included in what's now a multiple regression model, because including all three would lead to collinearities among the predictors: the sum of the intercept-estimating dummy variables would equal 1.0 for every observation, while the sum of the slope-estimating variables would equal `YrMn`.
In the current example, we'll include the dummy variables for Neutral and El Niño conditions (omitting those for La Niña conditions). To interpret the results, the intercept and slope for La Niña conditions would simply be the overall intercept and slope of the fitted model. The intercept for El Niño conditions would be the sum of the overall intercept and the coefficient of the El Niño-condition (0 or 1) intercept dummy variable, while the slope for El Niño conditions would be the sum of the overall slope and the coefficient of the El Niño condition (0 or `YrMn`) slope dummy variable. Dummy-variable regression is much more elegant than its name implies.
It turns out that the formula language in R takes care of the generation of the dummy variables (and of leaving one out). (But there will be an example of explicit generation of dummy variables below.) Here's the code to look at only the differences in intercept among the ENSO-state groups.
```{r regression2-24}
# dummy-variable regression
lm_02 <- lm(T_LandOcean ~ YrMn + ONI_code, data = NOAA)
summary(lm_02)
```
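To see the dummy variables that the formula machinery generated behind the scenes, inspect the model matrix: the first column is the all-1's intercept column, and the two `ONI_code` columns are the 0/1 indicator variables (with La Niña, the first factor level, left out):
```{r}
# the dummy (indicator) variables generated by the formula
head(model.matrix(lm_02))
```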
All of the coefficients in the model are significant as before, and the *R^2^* value has gone up. One way to gauge the improvement by the second model over the first is to compare the AIC values:
```{r regression2-25}
# print AIC values
print(c(AIC(lm_01), AIC(lm_02)))
```
The AIC value for the second model is indeed lower than that for the first model. Another comparison can be made by doing a simple analysis of variance:
```{r regression2-26}
# compare model by ANOVA
anova(lm_01, lm_02)
```
The *F*-statistic is significant (i.e. it has a low *p*-value), and so overall, the second, dummy-variable model (lm_02) appears to be (statistically) superior to the simple linear regression model (lm_01).
Here's what the fitted model looks like:
```{r regression2-27}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_02)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_02)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_02)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
```
Here is a comparison of the grouped box plots for the two models:
```{r regression2-28}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_02) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
The bias in the residuals (positive for El Niño conditions, negative for La Niña conditions) has evidently been removed.
[[Back to top]](regression2.html)
## Dummy-variable regression, both intercept and slope varying (lm_03)
Here is the code for fitting a model where both intercept and slope vary. (Note the subtle difference, an `*` instead of a `+` in the model specification.)
```{r regression2-29}
# dummy-variable regression, slope and intercept varying
lm_03 <- lm(T_LandOcean ~ YrMn * ONI_code, data = NOAA)
summary(lm_03)
```
Compare the models:
```{r regression2-30}
# print AIC values
print(c(AIC(lm_01), AIC(lm_02), AIC(lm_03)))
```
```{r regression2-31}
# compare via ANOVA
anova(lm_02, lm_03)
```
The AIC value for this third model (intercept and slope varying) is a little lower than that for the second, but the analysis of variance suggests there's not much difference in the explained variance (small *F*-statistic, large *p*-value).
The fitted lines show little variation:
```{r regression2-32}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_03)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_03)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_03)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
```
The slope for the Neutral conditions is a little lower than for the La Niña and El Niño, which doesn't have an immediately apparent physical explanation.
Here is the same plot, with the regression lines for `lm_02` added as dashed lines:
```{r regression2-33}
# display the fitted lines
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_03)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_03)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_03)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=2, col="red")
lines(fitted(lm_02)[NOAA$ONI_code == "La Niña"] ~ NOAA$YrMn[NOAA$ONI_code == "La Niña"], lwd=1, lty = 2, col="blue")
lines(fitted(lm_02)[NOAA$ONI_code == "Neutral"] ~ NOAA$YrMn[NOAA$ONI_code == "Neutral"], lwd=1, lty = 2,col="black")
lines(fitted(lm_02)[NOAA$ONI_code == "El Niño"] ~ NOAA$YrMn[NOAA$ONI_code == "El Niño"], lwd=1, lty = 2,col="red")
```
With negligible improvement in the goodness of fit of `lm_03`, and no practical differences in the curves, there seems to be little reason to declare the model where both intercept and slope vary among ENSO states as any better than the one where just the intercept varies. In fact the "principle of parsimony" in statistics would strongly argue for `lm_02` as the better model.
[[Back to top]](regression2.html)
# A model with CO~2~ as the predictor variable
Although it makes sense when trying to assess the overall trend in global temperatures to use time (`YrMn`) as a predictor, we know that in fact it's the level of CO~2~ and other greenhouse gases (GHGs) that is responsible for the temperature increase. The NOAA data set from Mauna Loa contains both the mean monthly values of CO~2~ (plotted in red below) and a "deseasonalized" value that takes out the prominent annual cycle in the CO~2~ data related to the "respiration" of the Northern Hemisphere biosphere (plotted in black). There is some spatial variability in monthly CO~2~ values, with the Northern and Southern Hemispheres being "antiphased", and so for looking at global temperatures it makes the most sense to use the deseasonalized values as a potential predictor.
```{r regression2-34}
plot(CO2_mean ~ YrMn, data = NOAA2, pch = 16, col = "red", cex = 0.3, ylab = expression("Mauna Loa CO"[2]))
points(CO2_deseas ~ YrMn, data = NOAA2, pch = 16, col = "black", cex = 0.2)
```
## Simple linear regression, CO~2~ as a predictor (lm_05)
For comparison, using now the `NOAA2` data frame, from which the earlier months, when CO~2~ data were not yet available, have been removed, the simple linear model with `YrMn` as a predictor is:
```{r regression2-35}
# simple linear (OLS) regression line, only with NOAA2
lm_04 <- lm(T_LandOcean ~ YrMn, data = NOAA2)
summary(lm_04)
AIC(lm_04)
```
Now here is the model with CO~2~ as the predictor:
```{r regression2-36}
# simple linear (OLS) regression line, CO2 as predictor
lm_05 <- lm(T_LandOcean ~ CO2_deseas, data = NOAA2)
summary(lm_05)
AIC(lm_05)
```
Here are the residual diagnostic plots:
```{r regression2-37}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_05, which=c(1,2,4))
acf(residuals(lm_05))
par(oldpar)
```
In general, except for the autocorrelation of the residuals, the residuals are a bit better behaved relative to the first model, `lm_01`. The *R^2^* values are higher, and the AIC is more negative for the model with CO~2~ as a predictor, suggesting that overall it's a better predictor of the trend in monthly globally averaged temperatures than time alone. (Not surprising.)
```{r regression2-38}
print(c(AIC(lm_04), AIC(lm_05)))
```
Here's a plot:
```{r regression2-39}
# points colored by ENSO state
ggplot() +
geom_abline(intercept = lm_05$coeff[1], slope = lm_05$coeff[2], color = "black") +
geom_point(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean), color = "black", shape = 1, size = 2) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(315, 430, by = 15)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = expression(CO[2]~" "(ppm)), y = "Land + Ocean") +
guides(color = guide_legend(override.aes = list(size = 3))) +
theme_bw()
```
[[Back to top]](regression2.html)
## Dummy-variable regression, intercepts-only, CO~2~ as a predictor (lm_06)
Next, we'll fit a dummy-variable regression, with CO~2~ as a predictor, allowing the intercepts to vary among ENSO states:
```{r regression2-40}
# dummy-variable regression
lm_06 <- lm(T_LandOcean ~ CO2_deseas + ONI_code, data = NOAA2)
summary(lm_06)
print(c(AIC(lm_05), AIC(lm_06)))
anova(lm_05, lm_06)
```
As before, with `YrMn` as the predictor, the dummy-variable regression performs better. Here's a plot:
```{r regression2-41}
# display the fitted lines
plot(T_LandOcean ~ CO2_deseas, data = NOAA2, type="n")
points(T_LandOcean ~ CO2_deseas, data = NOAA2, pch = 16, cex = 0.5)
legend("bottomright", legend=c("La Niña", "Neutral", "El Niño"), lty=c(1,1,1), lwd=3, cex=1, col=c("blue","gray","red"))
lines(fitted(lm_06)[NOAA2$ONI_code == "La Niña"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "La Niña"], lwd=2, col="blue")
lines(fitted(lm_06)[NOAA2$ONI_code == "Neutral"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "Neutral"], lwd=2, col="gray")
lines(fitted(lm_06)[NOAA2$ONI_code == "El Niño"] ~ NOAA2$CO2_deseas[NOAA2$ONI_code == "El Niño"], lwd=2, col="red")
```
Here's the `{ggplot2}` version:
```{r regression2-42}
# intercepts and slopes lm_06
ln_intercept <- lm_06$coeff[1]
ln_slope <- lm_06$coeff[2]
n_intercept <- lm_06$coeff[1] + lm_06$coeff[3]
n_slope <- lm_06$coeff[2]
en_intercept <- lm_06$coeff[1] + lm_06$coeff[4]
en_slope <- lm_06$coeff[2]
# ggplot2 plot with regression line
ggplot(data = NOAA2, aes(x = CO2_deseas, y = T_LandOcean)) +
geom_abline(intercept = ln_intercept, slope = ln_slope, color = "blue") +
geom_abline(intercept = n_intercept, slope = n_slope, color = "gray40") +
geom_abline(intercept = en_intercept, slope = en_slope, color = "red") +
geom_point(data = NOAA2, aes(x=CO2_deseas, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=CO2_deseas, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(315, 430, by = 15)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = expression(CO[2]~" "(ppm)), y = "Land + Ocean") +
theme_bw()
```
The CO~2~ values have an evident non-linear trend over time, while the relationship of temperature with CO~2~ seems quite linear. Recalling that the relationship of temperature with time tends to be slightly curvilinear, this suggests that global temperatures indeed have a curvilinear, as opposed to a linear, trend. An interesting addition to a scatter plot of temperature vs. time is the fitted values of the regression with CO~2~ as the predictor:
```{r regression2-43}
# scatter plot with CO2-fitted values ===============================
ln_fit <- ln_intercept + ln_slope*NOAA2$CO2_deseas
n_fit <- n_intercept + n_slope*NOAA2$CO2_deseas
en_fit <- en_intercept + en_slope*NOAA2$CO2_deseas
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_point(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", shape = 4, size = 0.5) +
geom_point(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", shape = 4, size = 0.5) +
geom_point(aes(x = NOAA2$YrMn, y = en_fit), color = "red", shape = 4, size = 0.5) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
Above, the fitted or predicted values from `lm_06` are plotted as tiny "x"'s, which together give the impression of a curvilinear relationship.
[[Back to top]](regression2.html)
# Curvilinear trends
## A simple polynomial trend (lm_07)
A curvilinear, second-order polynomial regression model (i.e. *Y~t~* = *b~0~* + *b~1~X~t~* + *b~2~X~t~^2^* + *e~t~*) can be fit as follows. Note the use of the `poly()` function to generate the predictor variables.
```{r regression2-44}
# second-order polynomial
lm_07 <- lm(T_LandOcean ~ poly(YrMn, 2, raw = TRUE), data = NOAA)
summary(lm_07)
AIC(lm_07)
```
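(The `raw = TRUE` argument requests ordinary powers of `YrMn`. The default, orthogonal polynomials, yields different coefficients but identical fitted values, and is often better behaved numerically when the predictor values are large, as the years are here; `lm_07o` below is just a throwaway object for the comparison.)
```{r}
# orthogonal polynomials: same fitted values, different coefficients
lm_07o <- lm(T_LandOcean ~ poly(YrMn, 2), data = NOAA)
all.equal(fitted(lm_07), fitted(lm_07o))
```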
```{r regression2-45}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_07, which=c(1,2,4))
acf(lm_07$residuals)
par(oldpar)
```
Note that the AIC value has decreased substantially relative to that of `lm_01`, the linear model fit to the same data.
Here is the plot of the polynomial fit:
```{r regression2-46}
# examine the regression equation again, by ENSO state
pal <- c("blue", "gray", "red")
plot(T_LandOcean ~ YrMn, data = NOAA, type="n")
segments(NOAA$YrMn, fitted(lm_07), NOAA$YrMn, NOAA$T_LandOcean, col = "gray")
lines(NOAA$YrMn, fitted(lm_07))
points(T_LandOcean ~ YrMn, data = NOAA, pch = 16, cex = 0.8, col = pal[NOAA$ONI_code])
```
The polynomial line seems to better discriminate between the various ENSO states than the linear regression line did. Here's a `{ggplot2}` version with linear and loess smooths added:
```{r regression2-47}
# ggplot2 plot with regression line
ggplot(data = NOAA, aes(x = YrMn, y = T_LandOcean)) +
geom_line(color = "gray70") +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA, aes(x = YrMn, y = T_LandOcean), color = "black", shape = 1, size = 2) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
geom_smooth(method = "lm", se = TRUE, level = 0.95) +
geom_smooth(method = "loess", se = TRUE, level = 0.95, col = "purple") +
geom_line(aes(x = YrMn, y = fitted(lm_07)), color = "black", size = 1) +
scale_x_continuous(breaks = seq(1950, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
And the residual grouped box plot shows that, relative to the original model `lm_01`, the variability of the residuals is smaller, but they are still biased by ENSO state:
```{r regression2-48}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_07) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
## A polynomial dummy-variable regression (lm_08)
The obvious solution to deal with the ENSO-state bias in the residuals is to again implement a dummy-variable regression. Here's the code and results:
```{r regression2-49}
# polynomial dummy-variable regression
lm_08 <- lm(T_LandOcean ~ poly(YrMn, 2, raw = TRUE) + ONI_code, data = NOAA)
summary(lm_08)
AIC(lm_08)
```
Here's the `{ggplot2}` plot:
```{r regression2-50}
# get the fitted values
ln_fit <- lm_08$coeff[1] + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
n_fit <- (lm_08$coeff[1] + lm_08$coeff[4]) + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
en_fit <- (lm_08$coeff[1] + lm_08$coeff[5]) + lm_08$coeff[2] * NOAA2$YrMn + lm_08$coeff[3] * NOAA2$YrMn^2
```
```{r regression2-51}
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_line(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = en_fit), color = "red", linewidth = 1.0) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25),
minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
And the residual diagnostic plots:
```{r regression2-52}
# regression diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(lm_08, which=c(1,2,4))
acf(lm_08$residuals)
par(oldpar)
```
Except for the autocorrelation in the residuals, these are the "best-looking" plots we've seen. The AIC values confirm that `lm_08` fits the data better than the earlier models:
```{r regression2-53}
print(c(AIC(lm_01), AIC(lm_03), AIC(lm_08)))
```
And this is also confirmed by the ANOVA results:
```{r regression2-54}
# anova
anova(lm_07, lm_08)
anova(lm_01, lm_08)
```
```{r regression2-55}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_02) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_08) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
As was the case for the linear model, the dummy-variable polynomial model eliminated the ENSO-state related residual bias.
[[Back to top]](regression2.html)
# A regression model with autocorrelated residuals
There is one remaining issue: the residuals are autocorrelated. This is not really a surprise, either looking at the data or considering what ENSO SST variability is all about. However, the autocorrelation does violate the OLS assumptions. The autocorrelation can be explicitly accounted for (and hopefully removed) by simultaneously fitting a "Box-Jenkins" type of time-series model to the data while fitting the polynomial dummy-variable regression. This can be accomplished using the `arima()` function from the base `{stats}` package. The process requires a little more set-up, but is not difficult.
Box-Jenkins, or "ARIMA" (autoregressive integrated moving-average), models are a topic in themselves, but they basically relate the current value of a time series to its previous values, and to the current and previous values of "innovations" or "exogenous shocks". They thus can represent both "memory" in a system as well as current perturbations to its state. In the case of ENSO, this sounds like the memory incorporated by the simple thermal inertia of sea water, displayed by the slow changes in SST anomalies, as well as the faster current and previous interactions with the atmosphere.
## Set up
To use the `arima()` function to fit a model that includes autoregressive and moving-average terms as well as the polynomial dummy variables, we have to make the dummy variables manually:
```{r regression2-56}
# make the dummy variables
dum_ln <- rep(0, length(NOAA$YrMn))
dum_n <- rep(0, length(NOAA$YrMn))
dum_en <- rep(0, length(NOAA$YrMn))
dum_ln[NOAA$ONI_code == "La Niña"] <- 1
dum_n[NOAA$ONI_code == "Neutral"] <- 1
dum_en[NOAA$ONI_code == "El Niño"] <- 1
# head(cbind(dum_ln, dum_n, dum_en), 30)
```
Note that we made dummy variables for all three ENSO states, but only two will be included in the model, to avoid collinearity. Next, make a data frame of the regression predictors, `xreg`:
```{r regression2-57}
# make regression data frame
xreg <- data.frame(NOAA$YrMn, NOAA$YrMn^2, dum_n, dum_en) # note: leave dum_ln out
names(xreg) <- c("YrMn", "YrMn^2", "Neutral", "El Niño")
head(xreg)
```
## Time-series regression model
Next, the time-series regression. There is a regular set of procedures for iteratively determining the form of the ARIMA model (model identification), fitting the model (estimation), and diagnostic checking, which are beyond this example. Here, it looks like an ARMA(2,2) model (two autoregressive terms, two moving-average terms) will be apt. The model is a little complicated, but the equation that is being fit to the data (by minimizing the sum-of-squares of the residuals) is *Y~t~ = φ~1~Y~t-1~ + φ~2~Y~t-2~ + θ~1~a~t-1~ + θ~2~a~t-2~ + b~0~ + b~1~X~t~ + b~2~X~t~^2^ + b~3~D~(1)t~ + b~4~D~(2)t~ + a~t~*, where *Y~t~* is the current temperature value, the *X~t~*'s are the predictors (including `YrMn` and the dummy variables *D~(1)t~* and *D~(2)t~*), *a~t~* is the residual or prediction error, and *φ~p~*, *θ~q~*, and *b~r~* are the autoregressive (AR), moving-average (MA), and regression coefficients.
```{r regression2-58}
# time series regression, AR(2) and MA(2)
tsreg_01 <- arima(NOAA$T_LandOcean, order = c(2, 0, 2), xreg = xreg, include.mean = TRUE)
tsreg_01
```
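As a cross-check on the choice of an ARMA(2,2) model, the `auto.arima()` function from the `{forecast}` package (already loaded) can search over candidate orders while including the same regression predictors; this is only a sketch of one way to do the identification step, and the order it selects may differ somewhat from the one chosen by inspection:
```{r}
# automatic order selection as a cross-check (the search may take a moment)
auto.arima(NOAA$T_LandOcean, d = 0, xreg = as.matrix(xreg))
```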
Check the residuals using a function from the `{forecast}` package:
```{r regression2-59}
checkresiduals(tsreg_01, test = "LB", lag = 12)
```
The Ljung-Box test examines the null hypothesis that the autocorrelations of the residuals are 0.0. Here the large *p*-value suggests that this hypothesis cannot be rejected, i.e. the residuals appear to be uncorrelated.
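The same test is available in base R as `Box.test()` from the `{stats}` package; the `fitdf` argument discounts the degrees of freedom for the four estimated ARMA coefficients:
```{r}
# Ljung-Box test directly, discounting the 2 AR + 2 MA coefficients
Box.test(tsreg_01$residuals, lag = 12, type = "Ljung-Box", fitdf = 4)
```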
Get the fitted values:
```{r regression2-60}
# list the coefficients of the model
tsreg_01$coef
# the intercept is the 5-th term in the model
c1 <- 5
```
```{r regression2-61}
# fitted values
ln_fit <- tsreg_01$coef[c1] + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
n_fit <- (tsreg_01$coef[c1] + tsreg_01$coef[c1 + 3]) + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
en_fit <- (tsreg_01$coef[c1] + tsreg_01$coef[c1 + 4]) + tsreg_01$coef[c1 + 1] * NOAA2$YrMn + tsreg_01$coef[c1 + 2] * NOAA2$YrMn^2
head(cbind(ln_fit, n_fit, en_fit))
```
The usual plot:
```{r regression2-62}
# ggplot2 plot with fitted values line
ggplot(data = NOAA2, aes(x = YrMn, y = T_LandOcean)) +
geom_line(aes(x = NOAA2$YrMn, y = ln_fit), color = "blue", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = n_fit), color = "gray", linewidth = 1.0) +
geom_line(aes(x = NOAA2$YrMn, y = en_fit), color = "red", linewidth = 1.0) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean, color = ONI_code), size = 2) +
geom_point(data = NOAA2, aes(x=YrMn, y=T_LandOcean), color = "black",shape = 1, size = 2 ) +
scale_color_manual(values = c("blue", "gray", "red"),
limits = c("La Niña", "Neutral", "El Niño")) +
scale_x_continuous(breaks = seq(1960, 2025, by = 10)) +
scale_y_continuous(limits = c(-0.75, 1.25), breaks = seq(-1.75, 1.25, by = 0.25)
, minor_breaks = seq(-0.75, 1.25, by = 0.05)) +
labs(title = paste("NOAA Global Average Temperature Anomalies (","°","C)", sep = ""),
x = "Year", y = "Land + Ocean") +
theme_bw()
```
The residual diagnostic plots (note that there is no time-series equivalent of Cook's distance):
```{r regression2-63}
# residual diagnostics
oldpar <- par(mfrow = c(2, 2))
plot(fitted(tsreg_01), tsreg_01$residuals)
qqnorm(tsreg_01$residuals)
qqline(tsreg_01$residuals)
acf(tsreg_01$residuals)
par(oldpar)
```
These look good. The AIC values suggest that this model fits still better than the previous ones, so it was worthwhile to include the ARMA terms:
```{r regression2-64}
print(c(AIC(lm_01), AIC(lm_08), AIC(tsreg_01)))
```
Here are the residual grouped box plots:
```{r regression2-65}
# residual grouped box plot
opar <- par(mfrow=c(1,3))
boxplot(residuals(lm_01) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(lm_08) ~ NOAA$ONI_code, ylim=c(-1,1))
boxplot(residuals(tsreg_01) ~ NOAA$ONI_code, ylim=c(-1,1))
par(opar)
```
The spread of the residuals from the time-series regression model is clearly smaller than that of the others, which can be seen by simply overplotting the residuals from `lm_08` with those from `tsreg_01`:
```{r regression2-66}
plot(NOAA$YrMn, residuals(lm_08), type = "l", col = "magenta")
lines(NOAA$YrMn, tsreg_01$residuals, type = "l", col = "green")
```
[[Back to top]](regression2.html)
## Slope of the polynomial trend
The original focus of this example was to estimate the trend in monthly global average temperatures, and the influence of the ENSO state on rising temperatures. However, the fact that the trend is curvilinear complicates this, because the slope varies over time. The slope of a polynomial is just its first derivative. The increasing slope of the fitted curves (which are the same for each ENSO state; only the intercepts vary) can be obtained as follows:
```{r regression2-67}
# slope is first derivative of the polynomial,
# If f(x) = b0 + b1 * x + b2 * x^2, then f'(x) = 2 * b2 * x + b1
c1 <- 5
slope <- 2.0 * tsreg_01$coef[c1 + 2] * NOAA$YrMn + tsreg_01$coef[c1 + 1]
plot(NOAA$YrMn, slope, type = "o", cex = 0.5, col = "red", xlab = "Year", ylab = "slope (degC/Year)")
```
As can be seen, the slope of the polynomial increases over time. Here's another perspective:
```{r regression2-68}
slope_1960 <- 2.0 * tsreg_01$coef[c1 + 2] * 1960 + tsreg_01$coef[c1 + 1]
slope_2020 <- 2.0 * tsreg_01$coef[c1 + 2] * 2020 + tsreg_01$coef[c1 + 1]
slope_ratio <- (slope_2020 / slope_1960)
slopes <- data.frame(slope_1960, slope_2020, slope_ratio)
colnames(slopes) <- c("slope_1960", "slope_2020", "ratio")
rownames(slopes) <- NULL
slopes
```
The rate of global warming has increased roughly fivefold between 1960 and 2020!
[[Back to top]](regression2.html)