---
title: "More regression analysis"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
```{r load, echo=FALSE, cache=FALSE}
load(".Rdata")
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction #
Multiple regression is (conceptually) a simple extension of bivariate regression, in which the influence of more than one predictor variable on the response can be estimated. For the case with two predictor variables, the analysis can be thought of as fitting a plane (as opposed to a line in the bivariate case), and the equations for the OLS estimates of the regression coefficients are only a little more complicated algebraically. For three or more predictors the algebra is still straightforward, but matrix notation is required to express it compactly.
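For the two-predictor case, the fitted surface is the plane

$$\widehat{y} = b_0 + b_1 x_1 + b_2 x_2,$$

with the coefficients chosen, as in the bivariate case, to minimize the sum of squared vertical deviations of the observations from the plane.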
A couple of illustrations jointly describe the idea of fitting a plane:
- [fitting a plane using two predictor variables](https://pjbartlein.github.io/GeogDataAnalysis/images/nwk7_1.gif)
- [one data point and its deviation from the regression plane or response surface](https://pjbartlein.github.io/GeogDataAnalysis/images/mreg.gif)
# Fitting a multiple regression equation #
The mathematics behind multiple regression analysis is more complicated than that for bivariate regression, but can be presented elegantly using matrix algebra:
- [matrix algebra](https://pjbartlein.github.io/GeogDataAnalysis/topics/matrix.pdf)
- [regression analysis in matrix algebra terms](https://pjbartlein.github.io/GeogDataAnalysis/topics/matreg.pdf)
The following example provides a short illustration of the use of matrix algebra to obtain the regression coefficients.
The example data set for illustrating the use of regression diagnostics [[regrex3.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/regrex3.csv) is used here; in particular, the multiple regression uses `x1` and `x2` as predictors for the response variable `y5`.
First, take a look at the different variables in the example data set.
```{r matreg}
# regrex3
attach(regrex3)
summary(regrex3)
head(cbind(y5,x1,x2))
```
Create an *n* row by 1 column matrix (i.e. a column vector) called **y**:
```{r create y}
# create the column vector y
n <- length(y5)
y <- matrix(y5, nrow=n, ncol=1)
dim(y)
head(y)
```
Create an *n* row by *p*+1 matrix, **X**, with 1's in the first column, and `x1` and `x2` in the second and third columns:
```{r create x}
# create the predictor-variable matrix
X <- matrix(cbind(rep(1,n),x1,x2), nrow=n, ncol=3)
dim(X)
head(X)
```
Now use matrix algebra to calculate **b**, the *p*+1 row by 1 column matrix (i.e. a column vector) of regression coefficients, *b*<sub>0</sub>, *b*<sub>1</sub> and *b*<sub>2</sub>: **b** = (**X'X**)<sup>-1</sup>**X'y**.
```{r b}
# calculate the regression coefficients
b <- solve(t(X) %*% X) %*% (t(X) %*% y)
print(b)
dim(b)
```
The matrix functions and operators used in the above expression include `t()`, which transposes a matrix, `%*%`, which is the matrix multiplication operator, and `solve()`, which inverts a matrix.
Compare these values with those obtained using the `lm()` function:
```{r lm1}
# linear model with lm()
lm1 <- lm(y5 ~ x1+x2, data=regrex3)
lm1
```
Now calculate the fitted values (*y-hats*), i.e. \(\mathbf{\widehat{y}}\) = **Xb**:
```{r matrix fitted values}
# matrix fitted values
yhat <- X %*% b
```
and compare these with those obtained using the `lm()` function:
```{r compare fitted}
head(cbind(yhat,lm1$fitted.values))
```
In addition to providing a compact way of representing the derivation of terms and their properties in regression analysis in general, matrix algebra also provides an efficient way of doing the actual calculations.
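Matrix algebra also yields the other familiar regression quantities. The following sketch (using the `X`, `y`, `b`, and `n` objects created above, and the fitted model `lm1` for comparison) computes the residuals, the residual standard error, and the standard errors of the regression coefficients:

```{r matreg-se-sketch, eval=FALSE}
# residuals: e = y - Xb
e <- y - X %*% b

# residual standard error: s = sqrt(e'e / (n - p - 1)), here p = 2 predictors
p <- ncol(X) - 1
s <- sqrt(as.numeric(t(e) %*% e) / (n - p - 1))
s

# standard errors of the coefficients: square roots of the diagonal of s^2 (X'X)^-1
se_b <- sqrt(diag(s^2 * solve(t(X) %*% X)))
se_b

# compare with the lm() results
summary(lm1)$sigma
summary(lm1)$coefficients[, "Std. Error"]
```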
[[Back to top]](lec13.html)
# Regression Assumptions #
The basic regression model (as well as more complicated ones) has certain underlying assumptions. Violations of these assumptions affect the optimality of the fitted model, that is, the extent to which the model and its parameters are the best that can be obtained, whether the task is representing the relationship between the response variable and the predictor variable(s), or predicting future values of the response given new values of the predictors.
The main assumptions that underlie regression analysis are:
- the prediction errors or residuals are independent, identically normally distributed random variables, with a mean of 0 and a standard deviation of *s*;
- the *X*'s (predictor or independent variables) are known without error;
- the *X*'s are not correlated;
- the correct model has been specified (i.e. the right predictors have been included in the model).
If the assumptions are not violated, then the Gauss-Markov theorem indicates that the usual OLS estimates are optimal in the sense of being unbiased and having minimum variance. If one or more of the assumptions are violated, then the estimated regression coefficients may be biased (i.e. they may be systematically in error) and may no longer have minimum variance (i.e. there may be more uncertainty in the coefficients than is apparent from the results).
## Consequences of assumption violations ##
If the assumptions are violated, then there may be two consequences: the estimated coefficients may be biased (i.e. systematically wrong), and they may no longer have minimum variance (i.e. their uncertainty increases).
- the notion of variability of the regression coefficients
- illustrations using repeated simulations of data sets with built-in assumption violations [[examples]](https://pjbartlein.github.io/GeogDataAnalysis/images/violate1.gif), [[solutions]](https://pjbartlein.github.io/GeogDataAnalysis/images/violate2.gif); a small simulation sketch follows below
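The following small simulation (an illustrative sketch, not based on the example data; all names here are invented for the illustration) conveys the second consequence: when the errors are heteroscedastic, the usual OLS standard error of the slope tends to understate the actual sampling variability of the slope estimates.

```{r violate-sim, eval=FALSE}
# illustrative simulation: heteroscedastic errors and the variability of the slope
# true intercept = 2, true slope = 0.5
set.seed(42)
nsim <- 1000; nobs <- 50
x <- seq(1, 10, length.out = nobs)
slope <- se_slope <- numeric(nsim)
for (i in 1:nsim) {
  y <- 2 + 0.5 * x + rnorm(nobs, sd = 0.1 * x^2)   # error sd grows with x
  fit <- lm(y ~ x)
  slope[i] <- coef(fit)[2]
  se_slope[i] <- summary(fit)$coefficients[2, "Std. Error"]
}
# the empirical sd of the slope estimates exceeds the average reported standard error
c(empirical_sd = sd(slope), mean_reported_se = mean(se_slope))
```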
[[Back to top]](lec13.html)
# Testing for assumption violations with diagnostic analyses #
There are several ways to test whether the assumptions that underlie regression analysis have been violated. As might be expected, these include analytical methods, in which the values of particular test statistics are computed and compared with appropriate reference distributions, and graphical methods, which are often easier to implement (and just as informative).
A good way to understand how the various statistics and diagnostic plots allow one to examine the regression equation, its goodness-of-fit, and the possibility of assumption violations is to build in various assumption violations and issues, and compare the results to those from a regression analysis without issues.
An example data set for illustrating the use of regression diagnostics [[regrex3.csv]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/regrex3.csv)
First, take a look at the different variables in the example data set.
```{r rdiag01, fig.width=7, fig.height=7}
# regrex3
summary(regrex3)
# matrix plot of the data
plot(regrex3[,2:8])
```
There are two potential predictor variables `x1` and `x2`, and five potential responses, each with particular issues or assumption violations. The idea here is to examine the summary output and diagnostic plots for each regression, and compare them to a case (the first regression) where there are no major issues.
```{r rdiag02}
# set up for plotting
oldpar <- par(mfrow = c(2, 2))
```
## Example 1 -- no major issues ##
Here's an initial regression that shows no major issues:
```{r rdiag03}
# first regression
regr_y1x1 <- lm(y1 ~ x1, data=regrex3)
summary(regr_y1x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y1x1, which=c(1,2,4,5), main="First Regression")
```
## Example 2 -- non-significant predictor ##
One of the most common issues in regression equations is the inclusion of a predictor that is not significant, in the sense that the *t*-test of the null hypothesis that the slope parameter for that predictor is zero is not significant (i.e. there is no support for rejecting that hypothesis). Here's an example, where `x2` is not significant:
```{r rdiag04}
# add a second predictor
regr_y1x1x2 <- lm(y1 ~ x1 + x2, data=regrex3)
summary(regr_y1x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y1x1x2, which=c(1,2,4,5), main="Two Predictors")
```
## Example 3 -- outliers present ##
Outliers frequently arise from mechanical errors (coding or recording), from sampling issues (i.e. selection of the cases or locations), or from the simple inclusion of data points that are "outside" of the domain of interest. The third example shows how outliers can be identified in the diagnostic plots; cases 79 and 49 stand out.
```{r diag05}
# outliers
regr_y2x1 <- lm(y2 ~ x1, data=regrex3)
summary(regr_y2x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y2x1, which=c(1,2,4,5), main="Outliers")
```
## Example 4 -- outliers and an additional predictor ##
Sometimes the existence of outliers can be traced to the omission of a potentially important predictor. In this case, the inclusion of `x2` does not eliminate the outliers, and as before, the slope coefficient of `x2` is not significant.
```{r diag06}
# no help for outliers
regr_y2x1x2 <- lm(y2 ~ x1 + x2, data=regrex3)
summary(regr_y2x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y2x1x2, which=c(1,2,4,5), main="Still Outliers")
```
## Example 5 -- heteroscedasticity (inhomogeneity of variance) of the residuals ##
Heteroscedasticity, or inhomogeneity of variance, can arise in several ways: it can be "inherited" from the response variable in cases where that variable is not completely "explained" by the predictor variable(s), and it can also signal issues with the structure of the model. In practice, heteroscedastic residuals increase the variability of the regression parameter estimates, which means that we can be less confident that the "true" values of the coefficients have been estimated. Heteroscedasticity is signaled by the diagnostic plots, in particular by "megaphone" patterns in the residual scatterplot and "doglegs" in the normal probability plot.
```{r rdiag07}
# heteroscedasticity
regr_y3x1 <- lm(y3 ~ x1, data=regrex3)
summary(regr_y3x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y3x1, which=c(1,2,4,5), main="Heteroscedasticity")
```
## Example 6 -- nonlinear relationships ##
Standard regression analyses attempt to fit *linear* relationships among variables, and often that's not a reasonable assumption to make. This example illustrates how nonlinear relationships are exposed by the residual diagnostic plots (if they haven't already been recognized), in particular by well-organized patterns in the residual scatterplot and leverage plot.
```{r rdiag08}
# nonlinear relationship
regr_y4x1 <- lm(y4 ~ x1, data=regrex3)
summary(regr_y4x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y4x1, which=c(1,2,4,5), main="Nonlinear Relationship")
```
## Example 7 -- transformation ##
The usual first-order strategy for dealing with heteroscedasticity and non-linearity is to transform one or the other of the variables (or both) in a bivariate regression. Here, the values of `y4` are transformed by taking their (base-10) logarithms. Because the logs of 0 and negative numbers are undefined, the values of `y4` are adjusted to make them nonzero and positive. However, there are still major problems signalled by the regression diagnostic plots.
```{r rdiag09}
# alternative nonlinear log(y4) ~ x1
regr_logy4x1 <- lm(log10(y4-min(y4) + 0.001) ~ x1, data=regrex3)
summary(regr_logy4x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_logy4x1, which=c(1,2,4,5), main="Transformation of y")
```
## Example 8 -- fitting a nonlinear (quadratic) function ##
An alternative approach to "linearizing" a relationship is to fit a quadratic polynomial (i.e. containing the terms *x* and *x<sup>2</sup>*) to the data. The `I()` function indicates that `x1^2` should be interpreted arithmetically (i.e. as the square of `x1`) rather than as part of the model-formula syntax of `lm()`.
```{r rdiag10}
# alternative nonlinear (quadratic) y4 ~ x1, x1^2
regr_y4x1x1 <- lm(y4 ~ x1 +I(x1^2), data=regrex3)
summary(regr_y4x1x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y4x1x1, which=c(1,2,4,5), main="Quadratic Polynomial")
```
This equation fits much better than the previous one, and shows no major problems.
## Example 9 -- model inadequacy due to a missing predictor ##
Here's an example regression, where the data were generated by an equation that included `x2`. Note that there are slight megaphone and dog-leg patterns in the residual plots.
```{r rdiag11}
# model inadequacy, missing predictor
regr_y5x1 <- lm(y5 ~ x1, data=regrex3)
summary(regr_y5x1)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y5x1, which=c(1,2,4,5), main="Missing Predictor")
```
The typical way of examining the possibility that the inclusion of another predictor may be warranted is to examine correlations between the residuals and other potential predictors. In this case it looks like the residuals are correlated with `x2`.
```{r rdiag12}
# check for correlation between residuals and other predictors
par(oldpar)
plot(regr_y5x1$residuals ~ regrex3$x2)
```
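A numeric version of the same check (a one-liner using the residuals from the model above):

```{r rdiag12b, eval=FALSE}
# correlation between the residuals and the candidate predictor x2
cor(regr_y5x1$residuals, regrex3$x2)
```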
When `x2` is included in the model, the various diagnostic checks show no major issues, except possibly some extra influence by points 100, 24 and 4.
```{r rdiag13}
# last model, y5 ~ x1 and x2
regr_y5x1x2 <- lm(y5 ~ x1+x2, data=regrex3)
summary(regr_y5x1x2)
oldpar <- par(mfrow = c(2, 2))
plot(regr_y5x1x2, which=c(1,2,4,5), main="Two Predictors")
par(oldpar)
```
[[Back to top]](lec13.html)
# An example of iterative diagnostic analysis #
*Multiple regression* is the extension of bivariate regression to the case where there are multiple predictor variables. (The case where there are multiple predictors and multiple responses is generally referred to as *multivariate regression*.) This example illustrates the development of a "transfer function," a not-ideally-named approach for reconstructing past climates from fossil-pollen data. The idea is to build a regression equation that describes the relationship between a particular climate variable (as the response) and a number of pollen variables (as predictors) using modern pollen and climate data, and then to apply this relationship "downcore" to fossil-pollen data (plugging in pollen, and getting out climate). The data set has a few issues that often arise in practice: the pollen variables (pollen types) are themselves correlated (violating the OLS assumption of no collinearity); the climate variables are highly spatially autocorrelated, which can lead to autocorrelation in the residuals (violating the assumption that the residuals are independent); and the relationships between the climate and pollen variables are often curvilinear (violating yet another assumption, correct specification of the model).
These issues will be examined using the standard diagnostic plots, as well as the Moran statistic for assessing spatial autocorrelation, and the curvilinearity/collinearity issue will be handled by transformation and "best subsets" predictor selection.
Here are the data [[midwtf2]](https://pjbartlein.github.io/GeogDataAnalysis/data/csv/midwtf2.csv), and the components of a shape file for mapping the data and results [[.dbf]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.dbf) [[.shx]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.shx) [[.shp]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.shp)
[[.prj]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/midwotl.prj)
Load some libraries.
```{r libraries, echo=FALSE, include=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
library(spdep)
library(leaps)
```
Load the packages (the data frame is attached below):
```{r mr01, eval=FALSE}
# libraries
library(sf)
library(ggplot2)
library(RColorBrewer)
library(spdep)
library(leaps)
```
The pollen variables with numbers in their names (e.g. `Querc.25`) are transformations of the original pollen variables (e.g. `Quercus`), constructed using "power function" transformations (i.e. `Querc.25 <- Quercus^0.25`) to linearize the relationships between climate and pollen (see the scatter diagrams below). The transformation parameters were chosen by inspecting individual scatter diagrams between a climate variable and a pollen variable.
## Inspection of the data ##
*Always* look at the data before running any kind of analysis. Begin by mapping the data: get the point locations of each modern pollen (and climate) sample, and plot the distance-weighted neighbors (for calculating the Moran statistic) using a distance threshold of 100 km.
Read a shapefile of outlines for the region:
```{r}
# read a state outline shape file
shapefile <- "/Users/bartlein/Documents/geog495/data/shp/midwotl.shp"
midwotl_sf <- st_read(shapefile)
midwotl_sf
plot(midwotl_sf) # plot the outline
plot(st_geometry(midwotl_sf), axes = TRUE)
```
Read the data, if it's not already in the workspace:
```{r, echo=TRUE, eval=FALSE}
# read the data file (if not already in the workspace)
csv_path <- "/Users/bartlein/Documents/geog495/data/csv/midwtf2.csv"
midwtf2 <- read.csv(csv_path)
```
```{r mr03}
attach(midwtf2)
names(midwtf2)
```
Plot the network of sites, and also get the distance-neighbor matrix for calculating spatial autocorrelation.
```{r}
# get point locations of modern data
midwtf2_loc <- cbind(longitud, latitude)
```
Get the distance neighbors (i.e. the contiguity matrix **W**)
```{r dist}
# distance neighbors
d <- 100 # points are neighbors if locations are within d km of one another
midwtf2_neighbors_dist <- dnearneigh(midwtf2_loc, 0, d, longlat=TRUE)
plot(st_geometry(midwotl_sf), axes=TRUE)
points(midwtf2_loc)
plot(midwtf2_neighbors_dist, midwtf2_loc, add=TRUE, col="magenta")
```
Next, examine the data set. First plot latitude, longitude, and the climate variables relative to one another (to reveal the geographical gradients in the climate data).
```{r mr04}
# examine the data set
# plot lat, lon and climate variables
plot(midwtf2[,2:6])
```
Now plot July temperature (the response variable here) relative to a few (untransformed) pollen types, and note the curvilinear relationships.
```{r mr05}
# plot July temperature and selected pollen types
plot(midwtf2[,c(4,7,10,13,21)])
```
Next, plot July temperature vs. the transformed pollen types, and note how the transformations linearize the relationships.
```{r mr06}
# plot July temperature and transformed pollen
plot(midwtf2[,c(4,22,25,28,36)])
```
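As a check, a transformed variable can be reconstructed from its untransformed counterpart; this sketch assumes the exponent encoded in the variable name (0.25 for `Querc.25`):

```{r power-check, eval=FALSE}
# reconstruct the transformed Quercus variable and compare with the stored column
Querc.25_check <- Quercus^0.25
head(cbind(Querc.25_check, Querc.25))
```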
## Spatial autocorrelation of July temperature values ##
The assumption that the prediction errors or residuals are independent, identically normally distributed random variables is often violated in geographical or Earth-system science data sets, owing to spatial or temporal autocorrelation in the dependent or response variable. If the regression model cannot explain *all* of the variation in the response, then the autocorrelation of the response will be inherited by the residuals or prediction errors. This inheritance will be evident as spatial or temporal patterns in the residuals, which can be thought of as information that should be incorporated in the "systematic" component of the model.
In this example, spatial autocorrelation of the residuals will be the issue. The *Moran statistic* is a measure of spatial autocorrelation, and in the regressions below it will be used to test for spatial pattern in the residuals as the model is gradually refined.
- [Moran statistic details](https://pjbartlein.github.io/GeogDataAnalysis/topics/moran.pdf)
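For reference, for a variable \(x\) observed at \(n\) locations with spatial weights \(w_{ij}\) (here, the distance-based neighbor weights constructed above), the Moran statistic takes the standard form

$$ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \;
\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}. $$

Values near its expectation of \(-1/(n-1)\) (essentially zero for large \(n\)) indicate no spatial autocorrelation, while larger positive values indicate that nearby observations tend to be similar.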
To illustrate the Moran statistic, and to gain an impression of the level of spatial autocorrelation in the response (as opposed to the residuals), calculate the Moran statistic for July temperature, the response. For comparison with the residuals (which should be centered on zero), we'll convert the July temperature values to "centered" values by subtracting the mean. We should see the amplitude of the values diminish as the fit gets better and better, which is consistent with the idea of the residuals having no "pattern" or systematic variation.
```{r mr07}
# ggplot2 map of temperature
plotvar <- tmeanjul
plotvar <- plotvar-mean(plotvar)
plottitle <- "Modern Climate (mean subtracted)"
leglable <- "tmeanjul (C)"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-5, 5), breaks = seq(-5, 5, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
The map shows that July temperatures are quite spatially autocorrelated, which the Moran statistic confirms.
```{r mr09}
# spatial autocorrelation of dependent variable (tmeanjul)
moran.test(tmeanjul, zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
In some ways, the objective of the regression analysis can be posed in terms of "taking the autocorrelation out" of July temperature.
## A "naive" model ##
The simplest model (in terms of not worrying about assumption violations, etc.) is one in which all of the potential predictor variables are included, in their "raw" or untransformed form:
```{r mr10}
# naive model
tmeanjul_lm0 <- lm(tmeanjul ~ Picea + Abies + Juniper
+ Pinus + Betula + Fraxinus + Quercus + Ulmus
+ Acer + Juglans + Carya + Tsuga + Fagus
+ Alnus + Herbsum)
summary(tmeanjul_lm0)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm0, which=c(1,2,4,5))
par(oldpar)
```
Map the residuals from the naive model
```{r mr11}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm0)
plotvar <- plotvar-mean(plotvar)
plottitle <- "Naive model residuals (C)"
leglable <- "residuals tmeanjul_lm0"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
... and calculate the Moran statistic for the naive model residuals:
```{r mr12}
# Moran test
moran.test(residuals(tmeanjul_lm0), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
There are a few troublesome issues with the naive model. The obvious one is the similarity of the standard errors of the regression coefficients, which is one of the byproducts of collinearity. Here, the pollen data are almost exactly collinear, because the pollen percentages must sum to 100. (A few minor types are omitted, so the sums don't reach exactly 100.)
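A quick check of this near-closure (a sketch that assumes the pollen types used in the model above are columns of `midwtf2`; the sums should fall a little short of 100 because of the omitted minor types):

```{r closure-check, eval=FALSE}
# pollen percentages should sum to nearly 100
pollen_vars <- c("Picea", "Abies", "Juniper", "Pinus", "Betula", "Fraxinus", "Quercus",
  "Ulmus", "Acer", "Juglans", "Carya", "Tsuga", "Fagus", "Alnus", "Herbsum")
summary(rowSums(midwtf2[, pollen_vars]))
```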
The extent of collinearity can be confirmed by calculating something called the "condition number" of the matrix **X<sup>T</sup>X** (where **X** is the matrix of predictor variables), which is the ratio of the largest to the smallest eigenvalue of **X<sup>T</sup>X**. This number should be small, say less than 50 or so, and here it isn't.
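The eigenvalue ratio described above can also be computed directly (a sketch using the design matrix of the fitted model; note that `kappa()` works from the model's QR decomposition and reports a value on roughly the square-root scale of this ratio, so the two numbers will differ):

```{r condition-number-sketch, eval=FALSE}
# eigenvalue-ratio condition number of X'X for the naive model
X0 <- model.matrix(tmeanjul_lm0)       # design matrix, including the intercept column
ev <- eigen(crossprod(X0))$values
max(ev) / min(ev)
```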
```{r mr13}
kappa(tmeanjul_lm0)
```
## Second model, with transformed predictors ##
There are two approaches for dealing with collinearity here: 1) omitting some variables, and 2) repeating the regression with the transformed values of the pollen variables (which we are inclined to do anyway):
```{r mr14}
# a second model, transformed predictors
tmeanjul_lm1 <- lm(tmeanjul ~ Picea.50 + Abies.100 + Junip.100
+ Pinus.100 + Betula.50 + Frax.15 + Querc.25 + Ulmus.25
+ Acer.100 + Juglans.25 + Carya.25 + Tsuga.100 + Fagus.100
+ Alnus.100 + Herbs.25)
summary(tmeanjul_lm1)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm1, which=c(1,2,4,5))
par(oldpar)
```
Map the residuals.
```{r mr15}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm1)
plotvar <- plotvar-mean(plotvar)
plottitle <- "Second model residuals (C)"
leglable <- "residuals tmeanjul_lm1"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
Get the Moran statistic for the residuals.
```{r mr16}
# Moran test
moran.test(residuals(tmeanjul_lm1), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
Note that this model is better behaved, in the sense of not showing strange values for the standard errors of the regression coefficients, and the spatial autocorrelation, while still barely significant, is lower too. Many of the *t*-tests of the regression coefficients are not significant (meaning that the null hypothesis that the regression coefficient is 0 cannot be rejected). A model with a subset of (significant) predictors can be selected using "all/best possible subsets" regression.
Note that the condition number for this model is still relatively high, because, even though the predictor variables have been transformed, there is still a lot of correlation among them.
```{r mr17}
kappa(tmeanjul_lm1)
```
## All possible subsets regression ##
The `regsubsets()` function in the `leaps` package uses a clever algorithm from the field of numerical analysis to examine a large set of (possibly all) regressions with subsets of the predictors.
```{r mr18}
# another model, all possible subsets regression
tmeanjul_lm2 <- regsubsets(tmeanjul ~ Picea.50 + Abies.100 + Junip.100
+ Pinus.100 + Betula.50 + Frax.15 + Querc.25 + Ulmus.25
+ Acer.100 + Juglans.25 + Carya.25 + Tsuga.100 + Fagus.100
+ Alnus.100 + Herbs.25, data=midwtf2)
summary(tmeanjul_lm2)
```
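To help choose among the candidate subset sizes, the summary of a `regsubsets` object carries goodness-of-fit measures for each size; here is a short sketch tabulating the adjusted *R<sup>2</sup>* and BIC:

```{r regsubsets-stats, eval=FALSE}
# adjusted R-squared and BIC for each subset size
ss <- summary(tmeanjul_lm2)
data.frame(size = seq_along(ss$adjr2), adjr2 = ss$adjr2, bic = ss$bic)
```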
To get the diagnostic plots and statistics for one of the "best" models, that regression is simply rerun using the `lm()` function.
```{r mr19}
# rerun one of the best-subsets regressions
tmeanjul_lm3 <- lm(tmeanjul ~ Picea.50 + Abies.100 + Betula.50 + Fraxinus
+ Querc.25 + Fagus.100 + Herbs.25)
summary(tmeanjul_lm3)
oldpar <- par(mfrow = c(2, 2))
plot(tmeanjul_lm3, which=c(1,2,4,5))
par(oldpar)
```
```{r mr20}
# ggplot2 map of residuals
plotvar <- residuals(tmeanjul_lm3)
plotvar <- plotvar-mean(plotvar)
plottitle <- "All-possible subsets model residuals (C)"
leglable <- "residuals tmeanjul_lm3"
ggplot() +
geom_sf(data = midwotl_sf, fill="grey70", color="black") +
geom_point(data = midwtf2, aes(x = longitud, y = latitude, color = plotvar), size = 3) +
scale_fill_gradient2(low = "blue", mid= "white", high = "red", aesthetics = 'color',
limits = c(-3, 3), breaks = seq(-3, 3, by = 1), guide = 'colorbar', name = leglable) +
labs(x = "Longitude", y = "Latitude", title = plottitle)
```
```{r mr21}
# Moran test
moran.test(residuals(tmeanjul_lm3), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
The condition number is now acceptably low:
```{r mr22}
kappa(tmeanjul_lm3)
```
The progress of the analysis can be gauged by comparing the individual Moran statistics, which should get smaller as the model gets more refined. The Moran statistic for the data is 12.08, and this could be thought of as a "totally naive" model, in which the response variable is expressed as a function of just its mean value. The naive model with all untransformed predictors does achieve a lower Moran statistic (2.197), indicating that some of the pattern in the response has indeed been accounted for by the predictors, but there are other issues with this model. The second regression, including transformed predictors, has a Moran statistic of 2.480, but this model too has some issues; in particular, it is "overfit". The final model, the result of the best-possible-subsets regression, has the lowest Moran statistic, 1.834.
```{r mr23}
# compare Moran tests
# dependent variable (tmeanjul)
moran.test(tmeanjul, zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm0 (naive model) no transformation, all predictors included
moran.test(residuals(tmeanjul_lm0), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm1 transformed predictors, all predictors included
moran.test(residuals(tmeanjul_lm1), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
# lm3 transformed predictors, best subset model
moran.test(residuals(tmeanjul_lm3), zero.policy=T,
nb2listw(midwtf2_neighbors_dist, zero.policy=T, style="W"))
```
[[Back to top]](lec13.html)
# Another example -- dummy-variable regression #
Sometimes a data set exists in which fitting different relationships for subsets of the observations is in order. This situation is exemplified by the Scandinavian EU preference data. The technique known as *dummy-variable* regression allows the simultaneous calculation of separate regression lines for subsets of the data, or in the example here, for the individual countries. The regression lines can vary in their slopes, their intercepts, or both.
Here are the shapefile components, which include the data: [[.dbf]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.dbf) [[.shx]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.shx) [[.shp]](https://pjbartlein.github.io/GeogDataAnalysis/data/shp/scand.shp)
```{r dum01, echo=FALSE, include=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
```
Load some packages, and attach the data frame:
```{r dum02, eval=FALSE}
library(sf)
library(ggplot2)
library(RColorBrewer)
```
## Look at the data ##
First, get a reference map of the commune/county/district names. Read a shapefile of Scandinavian communes.
```{r, echo=TRUE, eval=FALSE}
# read a Scandinavian province/county shape file
shapefile <- "/Users/bartlein/Documents/geog495/data/shp/scand_prov.shp"
scand_prov_sf <- st_read(shapefile)
scand_prov_sf
plot(st_geometry(scand_prov_sf), axes = TRUE)
```
Get the centroids of the individual districts for locating labels later:
```{r dum03}
# get coordinates of district centroids
scand_prov_pts <- (st_centroid(scand_prov_sf))
```
Ignore the warning, which alerts us to the fact that latitude and longitude are not isotropic, and so we really should be calculating centroids only with projected data.
Plot the District (commune) names:
```{r plot names}
# ggplot2 map of District names
ggplot() +
geom_sf(data = scand_prov_sf, fill=NA, color="gray") +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=as.character(District)), size = 2) +
coord_sf(xlim=c(2.5,32.5)) +
labs(x="Longitude", y="Latitude") +
theme_bw()
```
Next, extract the variable values from the shapefile, and plot the Yes votes. (The data are the same as those in `scanvote_csv`, but are extracted from the shapefile here to make sure that the individual values are in the same order as the polygons in the shapefile.)
```{r dum04}
# get variable values from .dbf attributes
Y <- scand_prov_sf$Yes
X <- scand_prov_sf$Pop
Country <- scand_prov_sf$Country
```
Plot the "Yes" vote:
```{r}
# plot Yes votes -- setup
pal <- brewer.pal(8, "PuOr")
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = Yes)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=as.character(Yes)), size = 2) +
scale_fill_gradientn(colors = pal) +
labs(title = "Yes Vote (%)", x="Longitude", y="Latitude")
```
Here's a standard scatter plot of the data, with the points labeled by Country.
```{r dum05}
# scatter plot with Country labels
plot(Y ~ log10(X), type="n")
text(log10(X),Y, labels=Country, cex=0.8)
```
## First regression, all data ##
The regression model for all data, considered without respect to individual country, is as follows:
```{r dum06}
# linear model, Yes ~ log10(Pop)
vote_lm1 <- lm(Y ~ log10(X))
summary(vote_lm1)
oldpar <- par(mfrow = c(2, 2))
plot(vote_lm1, which=c(1,2,4,5))
par(oldpar)
```
The initial regression has several issues: there is a distinct arch in the residual scatter plot, the normal QQ plot indicates there are several outliers, and the Cook's distance and leverage plots indicate the presence of several overly influential observations. Further examination of the results is warranted.
```{r dum07}
# examine the regression equation
plot(Y ~ log10(X), type="n")
text(log10(X),Y, labels=Country)
abline(vote_lm1)
segments(log10(X), fitted(vote_lm1), log10(X), Y)
```
Examination of this plot suggests that positive deviations are often contributed by communes in Finland, and negative deviations by communes in Norway. This suggests that Norwegian communes were less likely, and Finnish communes more likely, to have voted Yes than Scandinavian communes as a group. This idea is reinforced by looking at the prediction and confidence intervals for the regression.
```{r dum08}
# confidence intervals
pred_data <- data.frame(X=seq(1,1151,by=50))
pred_int <- predict(vote_lm1, int="p", newdata=pred_data)
conf_int <- predict(vote_lm1, int="c", newdata=pred_data)
plot(Y ~ log10(X), ylim=range(Y, pred_int, na.rm=T), type="n")
text(log10(X),Y, labels=Country, cex=0.7)
pred_X <- log10(pred_data$X)
matlines(pred_X, pred_int, lty=c(1,2,2), col="black")
matlines(pred_X, conf_int, lty=c(1,2,2), col="red")
```
Mapping the residuals suggests the same thing--there is an obvious spatial pattern in the sign of the residuals (so obvious that it's not really necessary to calculate a Moran statistic).
```{r dum09}
# map the residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm1$residuals, 1))
plottitle <- "Residuals from lm1"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm1$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
Boxplots of the response variable (`Y`) and the residuals by country also suggest the same thing.
```{r dum10}
opar <- par(mfrow=c(1,4))
# country-effect
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
par(opar)
```
## Dummy variable regression, with Country as a predictor ##
A dummy-variable regression can be run by including `Country` as a predictor. The formula `Y ~ log10(X)+Country` specifies a regression in which separate intercept values are calculated for each country. In essence, two new variables are generated, each binary (0 or 1), one for Sweden and the other for Norway. The intercepts that are estimated work as follows: The "plain" intercept applies to the country left out (Finland), while the intercept for Sweden is that value plus the coefficient estimated for Sweden, and likewise for Norway.
```{r dum11}
# Scandinavian EU Preference Vote -- dummy variable regression
# model with a factor as predictor
vote_lm2 <- lm(Y ~ log10(X)+Country)
summary(vote_lm2)
```
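The dummy coding can be inspected directly; here is a quick check using the model fitted above. With R's default treatment contrasts, Finland (the first level alphabetically) is the baseline:

```{r dummy-coding-check, eval=FALSE}
# indicator (dummy) columns that lm() constructs for Country
head(model.matrix(vote_lm2))
# contrast coding of the Country factor
contrasts(factor(Country))
```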
The different regression lines can be displayed as follows:
```{r dum12}
# display the fitted lines
plot(Y ~ log10(X))
legend("bottomright", legend=c("N","F","S"), lty=c(1,1,1), lwd=2, cex=1, col=c("red","blue","purple"))
lines(log10(X)[Country=="N"],fitted(vote_lm2)[Country=="N"], lwd=2, col="red")
lines(log10(X)[Country=="F"],fitted(vote_lm2)[Country=="F"], lwd=2, col="blue")
lines(log10(X)[Country=="S"],fitted(vote_lm2)[Country=="S"], lwd=2, col="purple")
```
Notice that Finland has the highest intercept (34.434), while that for Sweden is lower (34.434 - 6.679), and that for Norway is lower still (34.434 - 8.609).
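These country-specific intercepts can be recovered from the fitted coefficients. The sketch below assumes the default treatment contrasts, so the coefficient names `CountryS` and `CountryN` are the offsets for Sweden and Norway relative to the Finnish baseline:

```{r country-intercepts, eval=FALSE}
# country-specific intercepts implied by the dummy-variable coefficients
cf <- coef(vote_lm2)
c(Finland = unname(cf["(Intercept)"]),
  Sweden  = unname(cf["(Intercept)"] + cf["CountryS"]),
  Norway  = unname(cf["(Intercept)"] + cf["CountryN"]))
```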
Examine the model
```{r dum13}
# examine the model
oldpar <- par(mfrow = c(2, 2))
plot(vote_lm2, which=c(1,2,4,5))
par(oldpar)
```
The arch in the residual scatter plot is now much less obvious, as is the departure from normality of the residuals. There are still a few overly influential points, but their influence is reduced as well.
Map the residuals:
```{r dum14}
# map residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm2$residuals, 1))
plottitle <- "Residuals from lm2"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm2$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
The obvious patterns in the residuals from the first regression are also much reduced.
```{r dum19}
opar <- par(mfrow=c(1,4))
# country-effect
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm2) ~ Country, ylim=c(-15,15))
par(opar)
```
It is also the case that the obvious differences in the location (i.e. the medians) and scale (i.e. the interquartile range) of the residuals among countries in the first regression model (as illustrated by the boxplots) are reduced, bringing the residuals more in line with the assumption that they are independent and identically distributed.
## Dummy-variable regression with both slope and intercept varying by Country ##
A second dummy-variable regression, in which both the intercept and the slope vary, can be specified using the model formula `Y ~ log10(X)*Country`.
```{r dum15}
vote_lm3 <- lm(Y ~ log10(X)*Country)
summary(vote_lm3)
```
None of the *t*-tests of the significance of the dummy-variable terms (intercepts and slopes) are significant, and the adjusted *R<sup>2</sup>* value is lower, and so this model is likely "overfitted". Nevertheless, it's useful to look at how both the intercepts and slopes can be made to vary among countries.
```{r dum16}
# display the fitted lines
plot(Y ~ log10(X))
legend("bottomright", legend=c("N","F","S"), lty=c(1,1,1), lwd=2, cex=1, col=c("red","blue","purple"))
lines(log10(X)[Country=="N"],fitted(vote_lm3)[Country=="N"], lwd=2, col="red")
lines(log10(X)[Country=="F"],fitted(vote_lm3)[Country=="F"], lwd=2, col="blue")
lines(log10(X)[Country=="S"],fitted(vote_lm3)[Country=="S"], lwd=2, col="purple")
```
Map the residuals from the third model.
```{r dum17}
# map residuals
pal <- brewer.pal(8, "PuOr")
plotlab <- as.character(round(vote_lm3$residuals, 1))
plottitle <- "Residuals from lm3"
ggplot() +
geom_sf(data = scand_prov_sf, aes(fill = vote_lm3$residuals)) +
geom_sf_text(data = scand_prov_pts, aes(geometry=geometry, label=plotlab), size = 2) +
scale_fill_gradientn(colors = pal, limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +
labs(title = plottitle, x="Longitude", y="Latitude")
```
Look at the country effect on the residuals from the third model.
```{r dum18}
# country-effect
opar <- par(mfrow=c(1,4))
boxplot(Y ~ Country, ylab="Yes")
# residual grouped boxplot
boxplot(residuals(vote_lm1) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm2) ~ Country, ylim=c(-15,15))
boxplot(residuals(vote_lm3) ~ Country, ylim=c(-15,15))
par(opar)
```
There is little difference between the third model (in which both the slope and the intercept vary among countries) and the second, and so the "principle of parsimony" indicates that the second model should be preferred.
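A formal version of this comparison (a sketch, assuming both fitted models are still in the workspace) is the nested-model *F*-test of whether the extra slope terms in the third model improve the fit:

```{r compare-models, eval=FALSE}
# nested-model F-test: does letting the slopes vary by Country improve the fit?
anova(vote_lm2, vote_lm3)
```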
[[Back to top]](lec13.html)
# Readings #
- Kuhnert & Venables (*An Introduction...*): p. 109-120
- Maindonald (*Using R...*): ch. 5