-
Notifications
You must be signed in to change notification settings - Fork 10
/
lec17.Rmd
211 lines (154 loc) · 10.3 KB
/
lec17.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: "MANOVA and discriminant analysis"
output:
html_document:
fig_caption: no
number_sections: yes
toc: yes
toc_float: false
collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
```{r load, echo=FALSE, cache=FALSE}
load(".Rdata")
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction #
There are two related multivariate analysis methods, *MANOVA* and *discriminant analysis* that could be thought of as answering the questions, "Are these groups of observations different, and if so, how?" MANOVA is an extension of ANOVA, while one approach to discriminant analysis is somewhat analogous to principal components analysis in that new variables are created that have certain desirable properties.
# MANOVA #
Multivariate analysis of variance, or MANOVA, like univariate analysis of variance is aimed at testing the null hypothesis that the means of groups of observations are identical. Rejection of this hypothesis is generally accompanied by the scientific conclusion that the groups of observations are indeed different, or were generated by some different process, or come from different underlying populations.
Illustration of the conceptual model:
- The univariate analysis of variance situation [[aov.gif]](https://pjbartlein.github.io/GeogDataAnalysis/images/aov.gif)
- The multivariate analysis of variance situation [[manova.gif]](https://pjbartlein.github.io/GeogDataAnalysis/images/manova1.gif)
- [Details of MANOVA](https://pjbartlein.github.io/GeogDataAnalysis/topics/manova.html)
## ANOVA on individual variables ##
As an example of ANOVA, we can use the Summit Cr. data, testing the hypothesis that the mean values of some of the hydraulic geometry variables differ from reach to reach (or from HU to HU). First, we examine some univariate analyses of variance with a single grouping variable (Reach), which implements "one-way" analysis of variance.
```{r man01, echo=FALSE, include=FALSE}
load(".Rdata")
```
Start by loading the `lattice` package and attaching the `sumcr` data frame.
```{r man02}
# ANOVA and MANOVA
library(lattice)
attach(sumcr)
```
Then, analysis of variances of individual variables can be conducted by setting a temporary variable (and its name) to one of the variables in the data frame:
```{r man03}
# univariate analysis of variance
# set variable to analyze
varname <- "WidthWS"
# plot distributions by groups
histogram(~ eval(get(varname)) | Reach, nint=20, aspect=1/3, xlab=varname)
# test for homogeneity of group variances
tapply(get(varname), Reach, var)
bartlett.test(get(varname) ~ Reach)
# analysis of variance
sumcr_aov1 <- aov(get(varname) ~ Reach)
summary(sumcr_aov1)
```
In the above example (for `WidthWS`), first the variable was plotted by groups, then Bartlett's homogeneity-of-variance test was run, and finally the one-way analysis of variance. In this example, Bartlett's *K*-statistic was not so large (with a small *p*-value) to reject a null hypothesis of equal variances, while in contrast, the *F*-statistic was large enough to reject the null hypothesis of equality of group means.
Univariate analyses of variance for individual variables can be gotten simply by replacing the name assigned to `varname` and rerunning the block of code.
The main issue that arises in conducting some number of individual ANOVAs is the possibility that some analyses will appear significant, simply by chance.
## MANOVA ##
MANOVA aims to answer the question of whether the groups differ, when all variables are considered jointly. The `summary.aov()` function displays the individual ANOVAs, but the examination of these is not a replacement for examining the individual ANOVAs in detail as above.
```{r man04}
# MANOVA
Y <- cbind(DepthWS, WidthWS, WidthBF, HUAreaWS, HUAreaBF, wsgrad)
sumcr_mva1 <- manova(Y ~ Reach)
summary.aov(sumcr_mva1)
summary(sumcr_mva1, test="Wilks")
```
Note that `manova()` shows the individual ANOVAs, along with the Wilk's Λ and associated *F* statistic at the bottom of the output.
Here, the *F*-statistic is large enough (with a correspondingly small *p*-value) to reject the null hypothesis of equality of group means.
# Discriminant Analysis #
Discriminant analysis has several interrelated objectives that include:
- identification of how one or more groups of observations differ as described by one or more, usually several, interrelated variables
- identification of which variables best distinguish among the different groups
- assignment or classification of new observations to one or another of the groups based on the values of the variables
In short, the analysis can be thought of as answering the related questions: Are the groups of observations different, and if so, *how* are they different?
Illustration of the conceptual model
- The discriminant analysis situation: [[discrim1.gif]](https://pjbartlein.github.io/GeogDataAnalysis/images/discrim1.gif)
Details and examples
- [Details of discriminant analysis](https://pjbartlein.github.io/GeogDataAnalysis/topics/da.html)
As an example of discriminant analysis, following up on the MANOVA of the Summit Cr. data, we can investigate how the reaches differ from one another, or in other words, we can identify the variables that best illustrate the difference among the reaches.
## First example -- two variables and two groups (grazed (Reaches A and C), and ungrazed (Reach B)) ##
Here, the three reaches are recoded into two, grazed (`G`, reaches `A` and `C`) and ungrazed (`U`, reach `B`)
```{r da02, message=FALSE}
# discriminant analysis example, Summit Cr. data, two variables, two groups
library(MASS)
library(lattice) # for plots
attach(sumcr)
```
Recode the `Reach` variable (which has the values `A`, `B` and `C`) first to integer values (1 to 3) using the `as.integer()` function, then convert the 3's to 1's, and then create a factor (using the `as.factor()` function) with the values "G" and "U" (grazed and ungrazed, assigned using the `levels()` function).
The labelled scatter plot displays a distinct separation in a space defined by `WidthWS` and `DepthWS` of the grazed `G` and ungrazed `U` reaches (with ungrazed reaches tending to be deep and narrow, and grazed reaches wide and shallow).
```{r da03, fig.asp=1}
# recode Reach into two levels, Grazed and Ungrazed
Grazed <- as.integer(Reach)
Grazed[Grazed==3] <- 1
Grazed <- as.factor(Grazed)
levels(Grazed) <- c("G","U")
plot(WidthWS, DepthWS, type="n")
text(WidthWS, DepthWS, label=as.character(Grazed))
```
Now compute the discriminant function (using the `lda()` function from the `MASS` package). Discriminant "loadings" (correlations between the new discriminant functions and the original variables) are found simply with the `cor()` function, and the discriminant function scores for each observation are plotted using the `lattice()` function.
```{r da04}
# discriminant analysis, coefficients and scores
Grazed_lda1 <- lda(Grazed ~ WidthWS + DepthWS, method="moment")
Grazed_lda1
plot(Grazed_lda1)
Grazed_dscore <- predict(Grazed_lda1, dimen=1)$x
cor(cbind(WidthWS, DepthWS, Grazed_dscore))
histogram(~ Grazed_dscore[,1] | Grazed, nint=20, aspect=1/3)
```
The scatter plot, as well as the loadings show that `WidthWS` is the more important variable for "discriminating" the difference between the grazed and ungrazed observations.
The following code plots the discriminant function on the scatter plot.
```{r da5, fig.asp=1}
# plot discriminant function and classification line
# to make plots easier to interpret, rescale DepthWS by 10x
plot(WidthWS, DepthWS*10, type="n", asp=1)
text(WidthWS, DepthWS*10, label=as.character(Grazed), cex=.5)
chw <- par()$cxy[1]
text(WidthWS+chw, DepthWS*10, label=as.character(round(Grazed_dscore,2)),
cex=.6)
# DF1
slope_DF1 <- Grazed_lda1$scaling[2]/Grazed_lda1$scaling[1]
abline(7, slope_DF1, lwd=2)
# classification threshold line
slope_CL1 <- -1/(Grazed_lda1$scaling[2]/Grazed_lda1$scaling[1])
int_CL1 <- ((Grazed_lda1$means[1,]%*%Grazed_lda1$scaling +
Grazed_lda1$means[2,]%*%Grazed_lda1$scaling)/2)*
(Grazed_lda1$scaling[1]/
Grazed_lda1$scaling[2]/Grazed_lda1$scaling[2])
abline(int_CL1, slope_CL1, col="purple", lwd=2)
```
The black line is the actual first discriminant function, along which the groups are maximally separated. The purple line is orthogonal to the discriminant function. Points whose `WidthWS` and `DepthWS` values plot above the line are likely to be ungrazed, while those below are more likely to be Grazed_
## A second example ##
In this second example, all three reaches are considered, along with multiple variables. Two discriminant functions are generated.
```{r da6, fig.asp=1}
# second example
# linear discriminant analysis
Reach_lda1 <- lda(Reach ~ DepthWS + WidthWS + WidthBF + HUAreaWS + HUAreaBF
+ wsgrad)
Reach_lda1
plot(Reach_lda1, asp=1)
Reach_dscore <- predict(Reach_lda1, dimen=2)$x
histogram(~Reach_dscore[,1] | Reach, nint=20, aspect=1/3)
histogram(~Reach_dscore[,2] | Reach, nint=20, aspect=1/3)
cor(sumcr[6:11],Reach_dscore)
```
The `Coefficients of linear discriminants` provide the equation for the discriminant functions, while the correlations aid in the interpretation of functions (e.g. which variables they're correlated with).
The `mosicplot()` function compares the true group membership, with that predicted by the discriminant functions. The large boxes along the diagonal of the mosaicplot show the the discriminant functions provide useful information for distinguishing among the groups.
```{r da7}
Reach_pred <- predict(Reach_lda1, dimen=2)$class
class_table <- table(Reach, Reach_pred)
mosaicplot(class_table, color=T)
```
# Readings #
- Ch. 14 in Zuur et al. (2007) *Analysing Ecological Data*, Springer. (Search the UO Library catalog for the Bivand et al. book (as in lecture 7), then search for "Analysing Ecological Data" in the *Springer Link* search field.)
- Chapter 25, Multivariate Statistics, in Crawley, M.J. (2013) *The R Book*, Wiley. To get to the book, visit [http://library.uoregon.edu](http://library.uoregon.edu), login, and search for the 2013 edition of the book. Here's a direct link, once you're logged on: [http://onlinelibrary.wiley.com/book/10.1002/9781118448908](http://onlinelibrary.wiley.com/book/10.1002/9781118448908)
- Maindonald (*Using R...*), Ch. 6