---
title: "Reference distributions"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
```{r load, echo=FALSE, cache=FALSE}
load(".Rdata")
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction -- Significance evaluation #
Descriptive statistics (like the mean or standard deviation) are useful for characterizing a set of observations, but the significance of a particular value of a statistic (i.e. is it large or small? commonplace or unusual?) will in general not be immediately apparent, and must be judged using some external (to the data set) standard.
The significance of a particular value of a statistic (e.g. the 30-yr mean annual temperature of a place) can be evaluated by determining how rare or how commonplace that particular value is. In ordinary practice, those values of a statistic with absolute values that are equaled or exceeded in no more than 5% of the realizations of that statistic are regarded as significant.
The significance of a particular value of a statistic can be judged by comparing that value to the distribution of "all" values of the statistic. That distribution is known as the reference distribution, and in practice, can be either an *empirical distribution* (as represented by a histogram or cumulative frequency curve for an actual data set), or a *theoretical distribution* (as represented by an equation or function).
# Empirical reference distributions #
Empirical reference distributions are constructed using observed (i.e. real) data sets, which, if sufficiently large, allow statements to be made about the chances of observing particular values of a variable--possibly new observations of the variable, or of observing combinations of values of a variable, such as the mean for some subset of observations.
If a relatively large data set exists, then the distribution of a particular statistic (e.g. a 30-yr mean) can be created by repeatedly
- drawing a sample from the data set,
- calculating the value of the statistic for the sample,
- saving the statistic,
and then, after a sufficient number of samples have been taken (the more the better),
- looking at the distribution of the sample statistics (via the histogram or other plot, or by direct inspection).
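A minimal sketch of this procedure in R (using the built-in `replicate()` function and a hypothetical data vector `x`; the Sierra Nevada example below does the same thing with an explicit loop):
```{r resampleSketch}
# hypothetical data vector (placeholder values only)
x <- rnorm(500)
# draw 200 samples of size 30, calculate the mean of each, and save the means
sample_means <- replicate(200, mean(sample(x, 30, replace=TRUE)))
# the distribution of the sample means is the empirical reference distribution
hist(sample_means)
```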
## Use of an empirical reference distribution ##
The examples here use the 1189-year-long time series of summer temperature (TSum) for the Sierra Nevada, as reconstructed from tree-ring data (Graumlich, 1993). [`sierra.csv`] This data set is long enough to illustrate the idea of having a large amount of empirical information on which to base judgements about the *significance* (or the relative unusualness) of individual values or statistics.
### How unusual is a particular observation of `TSum`? ###
To answer this question, the appropriate reference distribution is the one formed by all of the individual observations of `TSum`. In other words, the entire data set should be used to answer questions about the significance (or relative size) of an individual observation, simply by comparing that observation to all other observations. This comparison is facilitated by first sorting the observations, and then creating a new variable that indicates each observation's relative position within the sorted values.
```{r refDist01}
# Sierra Nevada tree-ring summer temperature reconstructions
attach(sierra)
# sort and rank TSum values
sort_Tsum <- sort(TSum) # arranges values from low to high
rank_Tsum <- rank(TSum) # replaces values with rank number (1 to n)
# assemble the original, sorted, and ranked values, along with
# the year in which each sorted value occurred
sierra_sort <- data.frame(Year, TSum, rank(TSum), sort_Tsum, rank(sort_Tsum),
  Year[order(TSum)])
head(sierra_sort)
tail(sierra_sort)
```
The table can be inspected by opening it in a data viewer:
```{r view sierra_sort, echo=TRUE, eval=FALSE}
View(sierra_sort)
```
From this table, for example, the value 13.79 C can be seen to have occurred three times, in 1101, 1479, and 1892, and is a relatively low value of temperature, in contrast to the highest temperature value, 17.48, which occurred in 1194.
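That reading of the table can be checked directly (a quick sketch; the small tolerance guards against the floating-point representation of 13.79):
```{r checkYears, echo=TRUE, eval=FALSE}
# years in which TSum takes the value 13.79 (to two decimals)
Year[abs(TSum - 13.79) < 0.005]
# year of the highest value
Year[which.max(TSum)]
```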
Similar information can be gained from the histogram.
```{r refDist02}
# nice histogram
cutpts <- seq(12.0, 18.0, by=0.1)
hist(TSum, breaks=cutpts, probability=T, xlim=c(13,18))
lines(density(TSum))
rug(TSum)
```
### What value of `TSum` is exceeded 10 percent of the time? ###
Other information can be gained by examining the empirical cumulative distribution function, which is a plot of the sorted values of `TSum` against the proportion of observations that are less than or equal to each value (i.e. the cumulative probability).
```{r refDist03}
# empirical cumulative distribution function
ecdf_Tsum <- rank(sort_Tsum)/length(TSum)
plot(sort_Tsum, ecdf_Tsum, type="l")
sierra_ecdf <- data.frame(Year, TSum, rank(TSum), sort_Tsum, rank(sort_Tsum),
  Year[order(TSum)], ecdf_Tsum)
head(sierra_ecdf); tail(sierra_ecdf)
```
Again, the specific answer can be obtained by inspection of the data frame, the histogram, or the line plot of `ecdf_Tsum` vs. `sort_Tsum`.
```{r view sierra_ecdf, echo=TRUE, eval=FALSE}
View(sierra_ecdf)
```
Alternatively, the values corresponding to specific quantiles can be obtained with the following command:
```{r refDist04}
quantile(TSum, probs = c(0.0,.05,.10,.25,.50,.75,.90,.95, 1.0))
```
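Because the question asks for the value of `TSum` that is exceeded 10 percent of the time, the 0.90 quantile provides the answer:
```{r refDist04b}
# the value equaled or exceeded by 10 percent of the observations
quantile(TSum, probs = 0.90)
```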
### How unusual is the mean of the last 30 years of the record? ###
The following command gets the mean of the last 30 yrs of data:
```{r refDist05}
mean(TSum[(length(TSum)-29):(length(TSum))])
```
To answer the above question, the appropriate reference distribution is no longer that formed from the whole data set, but instead is the distribution that would be formed by a set of mean values of 30-year long samples from the data. This set can be obtained by repeated sampling of the `TSum` data using the following:
```{r refDist06}
# repeated sampling and calculation of means
nsamp <- 200 # number of samples
Means30yr <- matrix(1:nsamp) # matrix to hold means
for (i in 1:nsamp) {
  samp <- sample(TSum, 30, replace=T)
  Means30yr[i] <- mean(samp)
}
```
The script causes `nsamp` samples of 30 observations each to be drawn at random (with replacement) from `TSum`. For each sample, the mean is calculated and stored in the variable `Means30yr`. The distribution of this variable (`Means30yr`), rather than that of the original data, is the appropriate one for answering the above question.
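With the resampled means in hand, one way to answer the question is to find the proportion of the 30-yr means that equal or exceed the observed mean of the last 30 years (a sketch, reusing the objects created above):
```{r refDist06b}
# observed mean of the last 30 years of the record
obs_mean <- mean(TSum[(length(TSum)-29):length(TSum)])
# proportion of resampled 30-yr means that equal or exceed the observed mean
sum(Means30yr >= obs_mean)/nsamp
```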
The two empirical reference distributions can be compared by looking at their histograms:
```{r refDist07}
# histograms of data and of 30-Means
oldpar <- par(mfrow=c(2,1))
cutpts <- seq(12.0, 18.0, by=0.1)
hist(TSum, breaks=cutpts, probability=T, xlim=c(12,20))
lines(density(TSum))
hist(Means30yr, breaks=cutpts, probability=T, xlim=c(12,20))
lines(density(Means30yr))
par(oldpar)
```
The above procedure draws 30 years at random from TSum. It might be more appropriate to draw 30-consecutive-year samples from TSum. This can be done as follows:
```{r refDist08}
# repeated sampling and calculation of means
# consecutive samples
nsamp <- 200 # number of samples
Means30yrb <- matrix(1:nsamp) # matrix to hold means
for (i in 1:nsamp) {
  # random starting year of a 30-consecutive-year window
  start <- round(runif(1, min=1, max=length(Year)-30))
  # accumulate the 30 consecutive values and average them
  sampmean <- 0.0
  for (j in 1:30) {
    sampmean <- sampmean + TSum[start+j-1]
  }
  Means30yrb[i] <- sampmean/30.0
}
```
And here are the histograms:
```{r refDist09}
# histograms of data and of 30-Means from consecutive samples
oldpar <- par(mfrow=c(2,1))
cutpts <- seq(12.0, 18.0, by=0.1)
hist(TSum, breaks=cutpts, probability=T, xlim=c(12,20))
lines(density(TSum))
hist(Means30yrb, breaks=cutpts, probability=T, xlim=c(12,20))
lines(density(Means30yrb))
```
Set the plotting parameters back to the originals:
```{r}
par(oldpar)
```
Note that this approach for using an empirical reference distribution is a general one--the significance of other statistics (e.g. the variance) can be evaluated the same way.
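For example, a parallel sketch for the variance of 30-yr samples (same procedure, different statistic):
```{r refDist10}
# empirical reference distribution for the variance of 30-yr samples
nsamp <- 200 # number of samples
Vars30yr <- numeric(nsamp) # vector to hold the variances
for (i in 1:nsamp) {
  samp <- sample(TSum, 30, replace=T)
  Vars30yr[i] <- var(samp)
}
hist(Vars30yr)
```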
[[Back to top]](lec09.html)
# A theoretical reference distribution -- the normal (or standard normal) distribution #
Much of the time, a data set large enough to make probabilistic-type statements about the values of a statistic does not exist, or the statistic may be difficult to observe. In such cases, theoretical reference distributions allow the significance of particular values of statistics to be evaluated. One distribution that is applicable to many cases in practice is the normal distribution, and the Central Limit Theorem explains the source of that general applicability.
- [the normal distribution](https://pjbartlein.github.io/GeogDataAnalysis/topics/normaldist.pdf)
## Density functions for the normal distribution ##
The standard normal distribution is widely used in inferential statistics, owing to its:
- applicability (many processes are intrinsically normally distributed, and the Central Limit Theorem assures us that sample means in particular, and many statistics that involve summation or integration in general, are approximately normally distributed)
- simplicity (it has a relatively simple shape, owing to the particular values of its two parameters, mu = 0 and sigma = 1)
The standard normal distribution is described by an equation and is represented by its:
Probability density function (pdf) (R/S-Plus term is Density),
```{r normal01}
# probability density function
x <- seq(-3, +3, by=.1)
pdf_norm<- dnorm(x, mean=0, sd=1)
plot(x, pdf_norm, type="l")
```
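For reference, the curve plotted above is the standard normal probability density function, which `dnorm()` evaluates with `mean=0` and `sd=1`:

$$f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2}$$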
Cumulative density function (cdf) (R/S-Plus term is Probability), and
```{r normal02}
# cumulative density function
x <- seq(-3, +3, by=.1)
cdf_norm<-pnorm(x, mean=0, sd=1)
plot(x, cdf_norm, type="l")
```
Inverse cumulative density function (invcdf) (R/S-Plus term is Quantile).
```{r normal03}
# inverse cumulative density function
pr <- seq(0, 1, by=.01)
invcdf_norm<- qnorm(pr, mean=0, sd=1)
plot(pr, invcdf_norm, type="l")
```
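The inverse cdf is also the source of the familiar critical values used in significance testing; for example, the value of *z* that is exceeded only 2.5 percent of the time:
```{r normal03b}
# the 0.975 quantile of the standard normal distribution (approximately 1.96)
qnorm(0.975, mean=0, sd=1)
```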
Random numbers drawn from the normal distribution can be obtained as follows:
```{r normal04}
# normally distributed random numbers
z <- rnorm(n=1000, mean=0, sd=1)
hist(z, nclass=40)
plot(z)
```
## The Central Limit Theorem ##
The Central Limit Theorem provides the explanation for the wide applicability of the normal distribution in significance testing: it guarantees (provided certain assumptions are met) that statistics like the mean, which are obtained by summing or integrating over a set of observations, will be approximately normally distributed, no matter what the distribution of the underlying data is.
The Central Limit Theorem is a mathematical demonstration that statistics (like the mean) that are based on summations or integrations of a set of observations are approximately normally distributed, regardless of the distribution of the data from which the samples are drawn. The theorem therefore assures us that the normal distribution is the appropriate reference distribution for a sample mean to be compared with, in order to judge the relative size of that sample mean.
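In its usual form for the sample mean, the theorem states that for independent observations with mean $\mu$ and finite variance $\sigma^2$, the mean of a sample of size $n$ is approximately normally distributed for large $n$:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right) \quad \text{(approximately)}$$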
### Empirical demonstration of the Central Limit Theorem ###
Generate 1000 random numbers from four different distributions: the standard normal, the normal, the log-normal, and the uniform distributions. Histograms of the random numbers will make it clear what the shapes of the distributions are like.
```{r central01}
n <- 1000
stdNormal <- rnorm(n, mean=0, sd=1)
Normal <- rnorm(n, mean=4.0, sd=2.0)
logNormal <- rlnorm(n, meanlog=1.0, sdlog=0.5)
Uniform <- runif(n, min=0.0, max=1.0)
```
```{r central02}
oldpar <- par(mfrow=c(2,2))
hist(stdNormal)
hist(Normal)
hist(logNormal)
hist(Uniform)
par(oldpar)
```
The following script generates `nsamp` sample means from each of these distributions:
```{r central03}
# take repeated samples from each distribution and calculate and save means
# repeated sampling and calculation of means
nsamp <- 200 # number of samples
mean_stdNormal <- matrix(1:nsamp) # matrix to hold means
mean_normal <- matrix(1:nsamp) # matrix to hold means
mean_logNormal <- matrix(1:nsamp) # matrix to hold means
mean_Uniform <- matrix(1:nsamp) # matrix to hold means
for (i in 1:nsamp) {
  samp <- sample(stdNormal, 30, replace=T)
  mean_stdNormal[i] <- mean(samp)
  samp <- sample(Normal, 30, replace=T)
  mean_normal[i] <- mean(samp)
  samp <- sample(logNormal, 30, replace=T)
  mean_logNormal[i] <- mean(samp)
  samp <- sample(Uniform, 30, replace=T)
  mean_Uniform[i] <- mean(samp)
}
```
Histograms of the original data, as well as of the sample means can be obtained with the following script.
```{r central04}
# histograms of data and of sample means
oldpar <- par(mfrow=c(2,1))
# standard Normal
xmax <- max(stdNormal)
xmin <- min(stdNormal)
hist(stdNormal, nclass=40, probability=T, xlim=c(xmin,xmax))
hist(mean_stdNormal, nclass=40, probability=T, xlim=c(xmin,xmax))
# Normal
xmax <- max(Normal)
xmin <- min(Normal)
hist(Normal, nclass=40, probability=T, xlim=c(xmin,xmax))
hist(mean_normal, nclass=40, probability=T, xlim=c(xmin,xmax))
# log Normal
xmax <- max(logNormal)
xmin <- min(logNormal)
hist(logNormal, nclass=40, probability=T, xlim=c(xmin,xmax))
hist(mean_logNormal, nclass=40, probability=T, xlim=c(xmin,xmax))
# Uniform
xmax <- max(Uniform)
xmin <- min(Uniform)
hist(Uniform, nclass=40, probability=T, xlim=c(xmin,xmax))
hist(mean_Uniform, nclass=40, probability=T, xlim=c(xmin,xmax))
par(oldpar)
```
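A quick check of the theorem's claim (a sketch using a normal quantile-quantile plot of the means from the most skewed of the four parent distributions, the log-normal):
```{r central05}
# if the Central Limit Theorem is doing its job, the sample means should plot
# close to the straight reference line
qqnorm(as.vector(mean_logNormal))
qqline(as.vector(mean_logNormal))
```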
[[Back to top]](lec09.html)
## Other theoretical reference distributions ##
Although the normal distribution is applicable in a number of situations (owing to the implications of the Central Limit Theorem), other situations arise in which other reference distributions apply.
- [details of other reference distributions](https://pjbartlein.github.io/GeogDataAnalysis/topics/theoretical.pdf)
[[Back to top]](lec09.html)
# Readings #
- Owen (The R Guide): Ch. 6
- Irizarry (*Intro to Data Science*): Ch. 12 - 15
[[Back to top]](lec09.html)