---
title: "STAT-462 Worked questions"
author: "hlg5155"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float:
      collapsed: no
    number_sections: yes
    df_print: paged
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
```
All the questions and plots here use the Palmer Penguins dataset, to
carry on the penguin health theme of the course. You don't need to do
this, but so you can see where each command comes from, I have also
added the package prefix, e.g. `ggstatsplot::gghistostats()` is the same
as `library(ggstatsplot)` followed by `gghistostats()`.
# Load Data and Packages
```{r, message=FALSE, warning=FALSE}
# Important core packages
library(tidyverse)
library(knitr)
library(rmarkdown)
library(skimr)
library(Stat2Data)
library(readxl)
library(ggpubr)
# Regression packages
library(olsrr)
library(nortest)
library(lmtest)
library(IMTest)
library(MASS)
library(visreg)
library(ggstatsplot)
# Plotting packages
library(lattice)
library(RColorBrewer)
library(viridis)
library(corrplot)
library(plotly)
library(car)
library(GGally)
# Datasets
library(ISLR) # contains a credit dataset
library(yarrr) # contains a toy dataset about pirates
library(palmerpenguins) # contains the penguin dataset
```
# Load data
I use the `data()` command because the data already exists inside the
`palmerpenguins` package. Note, because the data is already "pre-loaded
into R", I don't need to give it a variable name. In fact this command
loads TWO datasets: `penguins` and `penguins_raw`. If in doubt, see the
help file e.g. `?penguins`.
```{r}
#?penguins
data("penguins")
```
```{r}
skimr::skim(penguins)
```
Check column names
```{r}
names(penguins)
```
# Descriptive questions
## What is the unit of analysis?
The "thing" the study is about. I used the help files, `?penguins` and
`?penguins_raw`, to show that the unit of analysis is a single penguin.
There are 344 penguins in the sample, although some have missing
measurements.
## What is a reasonable population?
These are the boundaries that the study's results can be extrapolated to.
Again, using the helpfile, `?penguins`: The study collected data on
nesting observations, penguin size data, and isotope measurements from
blood samples for foraging adult Adélie, Chinstrap, and Gentoo near
Palmer Station, Antarctica.
I would say a reasonable population would be either adult Antarctic
penguins of those breeds (maybe splitting into a smaller geographic area
after consulting with subject experts). If I split the sample into
individual species (I do this later on), then I expect I might be able
to use these results for all populations of that penguin breed, again
after consulting with experts.
## What variables exist in the dataset?
Here is our dataset:
```{r}
#?penguins
data("penguins")
skimr::skim(penguins)
```
or alternatively
```{r}
glimpse(penguins)
```
Our data is comprised of a sample of 344 penguins, with variables:
- `species`: *factor, nominal categorical* Penguin species (Adélie,
Chinstrap and Gentoo)
- `island`: *factor, nominal categorical* Island in Palmer
Archipelago, Antarctica (Biscoe, Dream or Torgersen) that the
penguin was observed on
- `bill_length_mm` : *numeric* Bill (beak) length (millimeters)
- `bill_depth_mm` : *numeric* Bill depth (millimeters)
- `flipper_length_mm` : *integer, numeric* Flipper length
(millimeters)
- `body_mass_g` : *integer, numeric* Body mass (grams)
- `sex` *factor, nominal categorical* Penguin sex (female, male)
- `year` *integer, numeric* Study year (2007, 2008, or 2009)
Most of our data stored as an integer is not "true" integer data; e.g.
it's *physically possible* for a penguin to have a body mass of
4500.5 g. So I will convert my integer columns to numeric.
```{r}
# take all the integers in penguins, convert to float numeric and overwrite.
penguins <- dplyr::mutate_if( penguins,
is.integer, as.numeric)
# and check - yep, dbl means "double float"
glimpse(penguins)
```
## Describe a single variable (including W-S Normality)
**Describe the body mass of Palmer Penguins**
```{r}
ggstatsplot::gghistostats(data=penguins,
x=body_mass_g,
results.subtitle=FALSE,
normal.curve = TRUE,
xlab = "Penguin mass (g)",
title = "Palmer penguin body mass")
```
This histogram shows the distribution of penguin weights in grams for
the *sample* of 344 Adélie, Chinstrap and Gentoo penguins in the Palmer
penguin dataset.
The distribution looks approximately unimodal, with a mean penguin mass
of 4201.75g. There does appear to be some right skew:
```{r, message=FALSE, warning=FALSE}
ggpubr::ggqqplot(penguins$body_mass_g,col="blue")
```
```{r}
WS.test <- shapiro.test(penguins$body_mass_g)
WS.test
```
- H~0~ : The sample comes from a population which has a normal
  distribution. The Shapiro-Wilk test statistic would be close to 1:
  $W = 1$
- H~1~ : The sample is unlikely to come from a population which has a
  normal distribution. The Shapiro-Wilk test statistic would be LESS
  than 1: $W < 1$
The results above suggest that the test statistic, W, is
`r WS.test$statistic` (relatively normal), but the likelihood of seeing
this result with a large sample of 344 penguins is exceptionally small
(p = `r WS.test$p.value`). It is important to note that the sample is
large, so this test is very sensitive.
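As a quick sanity check of how the Shapiro-Wilk statistic behaves, here is a small sketch on simulated data (not the penguin sample): W stays near 1 for normal draws and drops noticeably for skewed ones.

```{r}
# Shapiro-Wilk on simulated data: normal draws vs clearly skewed draws
set.seed(462)
normal_sample <- rnorm(150)  # should give W close to 1
skewed_sample <- rexp(150)   # should give W noticeably below 1
shapiro.test(normal_sample)$statistic
shapiro.test(skewed_sample)$statistic
```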
Finally, I expect that the data is not fully independent, which would
affect the normality test results. The dataset is composed of three
individual species. Breaking these apart:
```{r}
ggbetweenstats(data = penguins,
x = species,
y = body_mass_g ,
plot.type = "box",xlab = "Penguin mass (g)",
centrality.point.args=list(color = "darkblue")
,pairwise.display = "significant",centrality.plotting=FALSE,
results.subtitle = FALSE)
```
The data is clearly not independent. So from now onwards I will simply
select a single species.
```{r}
adelie <- dplyr::filter(penguins, species=="Adelie")
skim(adelie)
```
- **New unit of analysis:** an individual Adelie penguin
- **New reasonable population:** All Adélie penguins - although I would
  want to talk to a biologist before extrapolating results beyond
  either the specific penguin colony or Antarctica.
```{r}
ggstatsplot::gghistostats(data=adelie,
x=body_mass_g,
results.subtitle=FALSE,
normal.curve = TRUE,
xlab = "mass (g)",
title = "Palmer penguin body mass: Species- Adelie")
```
```{r, message=FALSE, warning=FALSE}
ggpubr::ggqqplot(adelie$body_mass_g,col="blue")
```
```{r}
WS.test <- shapiro.test(adelie$body_mass_g)
WS.test
```
- H~0~ : The sample comes from a population which has a normal
  distribution. The Shapiro-Wilk test statistic would be close to 1:
  $W = 1$
- H~1~ : The sample is unlikely to come from a population which has a
  normal distribution. The Shapiro-Wilk test statistic would be LESS
  than 1: $W < 1$
The results above suggest that the sampled test statistic, W, is
`r WS.test$statistic`, close to Normal. The p-value is still relatively
low at p = `r WS.test$p.value`, but good enough to proceed.
## Describe correlations
**Describe the correlation between Adelie penguin's mass vs other
variables**
```{r, message=FALSE, warning=FALSE}
GGally::ggpairs( adelie[ , sapply(adelie,is.numeric)])
```
For our sample, there appears to be a moderate positive correlation
between Adelie penguin mass and both bill depth and flipper length.
There is very little correlation with bill length.
# Simple Linear Models
## Set up a linear model
```{r}
knitr::include_graphics("bird.png")
```
**Create a linear model between Adelie Penguin mass/weight and Flipper
Length**
- Unit of analysis: An individual Adelie penguin
- Response (Y) : Penguin mass (g)
- Predictor (X1) : Flipper length (mm)
```{r}
Model1 <- lm(body_mass_g ~ flipper_length_mm, data=adelie)
```
```{r}
# INSTALL GGSIDE IN THE CONSOLE
ggstatsplot::ggscatterstats(adelie,
x=flipper_length_mm,
y=body_mass_g,
results.subtitle=FALSE,
xlab="Flipper length (mm)",
ylab="body mass (g)")
```
The relationship can be expressed as:
```{r,asis=TRUE}
equatiomatic::extract_eq(Model1,use_coefs = TRUE)
```
```{r}
Model1.Summary <- olsrr::ols_regress(Model1)
Model1.Summary
```
## Describe the intercept
If our sample regression is:
$$
\operatorname{\widehat{y}} = b_{0} + b_{1}x
$$
Then the intercept is the MODELLED AVERAGE value of y when x is 0.
In our case, the intercept is the .\
Does it make sense?\
.
## Describe the slope
If our sample regression is:
$$
\operatorname{\widehat{y}} = b_{0} + b_{1}x
$$
Then the gradient/slope is the change in the MODELLED AVERAGE value of y
with a unit increase in x.
The slope suggests that as the flipper length increases by 1mm,
## R^2^ - how much variability is explained.
R^2^ is the percentage of the variability in our response that can be
"explained" by our predictor. For example 96% of the variability in
building height can be "explained" by how many floors/stories it has.
R^2^ is also known as the coefficient of determination.
```{r}
Model1.Summary
```
See here for more:
<https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/assessing-fit-least-squares-regression/a/r-squared-intuition>
We can see from this model summary that. % of the variability in our
sampled Penguin mass can be explained by flipper length.
The correlation is
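To make the definition concrete, R^2^ can be computed by hand as $1 - SSE/SST$ and checked against `summary()`. This sketch uses the built-in `mtcars` data rather than the penguin model:

```{r}
# R^2 by hand: one minus (unexplained variability / total variability)
fit <- lm(mpg ~ wt, data = mtcars)
sse <- sum(residuals(fit)^2)                   # sum of squared errors
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2_manual <- 1 - sse / sst
all.equal(r2_manual, summary(fit)$r.squared)   # TRUE
```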
# Hypothesis test on slope
## T-test on slope, one sided & $\beta_{1}$ is a number
**A famous penguin researcher thinks that the slope between mass and
flipper length is *at least* 40 g/mm. Given your sample, do you believe
this is true, with a critical significance of 5%?**
We can no longer use the R summary.
- H~0~ : $\beta_{1} = 40$
If the researcher's statement really was true, then at a MINIMUM the
slope should be 40 g/mm. So it would be pretty typical to get a higher
slope from our sample. What would be unusual would be to get a lower
slope.
- H~1~ : $\beta_{1} < 40$ It would be unusual to see a slope as *low*
  as our sample's if H~0~ really was true.
The reason that H0 is an equals is that H0 is always the model we set up
to test our scenario.
OK, so here is the point where I find drawing the diagram helps. So if
the scientist was correct, then higher slopes are to be expected. We
only care if it is low.
```{r,echo=FALSE}
x=seq(40-(4*5.076),40+(4*5.076),length=2000)
plot(x,dnorm(x,mean = 40,sd=5.076),type="l",lwd=2,col="blue",
xlab="H0, Beta1=40",ylab="",axes=FALSE,mgp=c(1,1,0))
box()
axis(2)
axis(1,labels=FALSE)
abline(v=32.832,col="red")
```
$$
t = \frac{b_1 - 40}{se(b_1)} = \frac{32.832 - 40}{5.076} = -1.412136
$$
So in this case, we use lower.tail=TRUE
```{r}
pt(-1.412136,df=149,lower.tail = TRUE)
```
The p-value is - as this is above , we do / do not have enough evidence
to reject H~0~. E.g.
```{r}
# shade the lower 5% rejection region of the t(149) distribution
x <- seq(-4, 4, length = 2000)
plot(x, dt(x, 149), type = "l", lwd = 2, col = "blue",
     xlab = "t-statistic", ylab = "")
polygon(c(-4, x, 4), c(0, dt(x, 149), 0), col = "gray")
lower <- qt(0.05, 149)  # define the cutoff BEFORE using it
x2 <- seq(lower, 4, length = 2000)
polygon(c(lower, x2, 4), c(0, dt(x2, 149), 0), col = "white")
text(0, 0.05, "95%"); text(-3, 0.05, "5%")
abline(v = -1.412136, col = "red", lwd = 2)
text(-2.3, 0.3, "<- 7.973%", col = "red"); text(-0.7, 0.3, "92% ->", col = "red")
```
To recap:
**A famous penguin researcher thinks that the slope between mass and
flipper length is *at least* 40 g/mm. Given your sample, do you believe
this is true, with a critical significance of 5%?**
- H~0~ : $\beta_{1} = 40$
- H~1~ : $\beta_{1} < 40$ It would be unusual to see a slope as *low*
  as our sample's if H~0~ really was true.
We found a roughly 8% chance of seeing a slope as low as or lower than
our sample's. As this is above our 5% critical significance, we do not
have enough evidence to reject H~0~. Therefore our sample is consistent
with the scientist's statement.
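The whole test can be reproduced in a few lines of base R, using the slope and standard error quoted above (32.832 and 5.076):

```{r}
# One-sided t-test on the slope, from the reported sample values
b1      <- 32.832  # sample slope
se_b1   <- 5.076   # standard error of the slope
beta_H0 <- 40      # hypothesised value under H0
t_stat <- (b1 - beta_H0) / se_b1
p_val  <- pt(t_stat, df = 149, lower.tail = TRUE)  # lower tail: H1 is beta1 < 40
t_stat  # about -1.41
p_val   # about 0.08, above the 5% cutoff
```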
## F-test on slope, Is $\beta_{1} = 0$?
If we simply want to see if our model is explaining any variation in y
(e.g. is our predictor explaining the variability in our response), we
can use an F test.
**Is there a relationship between body mass and study year, at a 1%
critical significance?**
```{r}
# Model2: does study year explain body mass?
Model2 <- lm(body_mass_g ~ year, data = adelie)
anova(Model2)
```
- H~0~ : $\beta_{1} = 0$
- H~1~ : $\beta_{1} \neq 0$
The test statistic is the F statistic.
As always, the P-value is obtained by answering the question: "What is
the probability that we'd get an F\* statistic as large as we did, if
the null hypothesis is true?"
The P-value is determined by comparing F\* to an F distribution with 1
numerator degree of freedom and n-2 denominator degrees of freedom.
Here we can see that the p-value is 0.72, which is well above 0.01. So
we fail to reject H~0~: there is no evidence to suggest a relationship
between body mass and study year.
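The p-value mechanics behind `anova()` can be sketched directly with `pf()`; the F statistic below is illustrative, not the one from the year model:

```{r}
# Upper-tail p-value for an F statistic with 1 and n-2 degrees of freedom
f_star <- 0.13  # illustrative F statistic (a small F gives a large p)
n      <- 151
pf(f_star, df1 = 1, df2 = n - 2, lower.tail = FALSE)
```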
## Is the F test one tailed or two tailed?
This is a common confusion! See
<http://daniellakens.blogspot.com/2016/04/one-sided-f-tests-and-halving-p-values.html>
and <https://psycnet.apa.org/record/1981-00280-001>: *"I pointed out to
my student that although t distributions and F distributions each have
probability values associated with both tails of their respective
curves, the curves bear an exponential relationship, not a linear one. I
reminded them that for d.f. = l/n, F = P. Thus, while values of t can be
both positive and negative, values of F are limited to positive numbers.
They saw the point immediately, and told me that in the case of a t
curve, the areas contained in, for example, the extreme 5% of each of
the two tails, positive plus negative, would be combined and found in
one tail of the F curve, the tail of the curve for the values of F
greater than one. I agreed and then asked them what the probability
associated with an F of that size would be. It was clear to them that
the area beneath the F curve which contained the 5% end of the positive
tail of the corresponding t curve and the 5% end of the negative tail of
the t curve, was 10%. Now they understood why they should have compared
the F values of his analyses with significance levels corresponding to
10%, not 5% - the F test is a one-tailed, two-tailed test"*
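The point of the quote is easy to verify numerically: with one numerator degree of freedom, $F = t^2$, so the single upper tail of F contains both tails of t combined.

```{r}
# For df1 = 1, the upper tail of F equals the two-sided tail of t
t_val <- 2.1
df    <- 149
p_two_sided_t <- 2 * pt(-abs(t_val), df)
p_upper_f     <- pf(t_val^2, df1 = 1, df2 = df, lower.tail = FALSE)
all.equal(p_two_sided_t, p_upper_f)  # TRUE
```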
# Hypothesis test on the intercept
# Confidence intervals
These ALL have the same formula; i.e. we are looking at the error bar on
the TRUE population parameter given our imperfect sample.
$$
Sample.Estimate ± (T.multiplier_{(df,1-\alpha/2)} × Standard.Error.On.Estimate)
$$
## 95% Confidence interval on the slope.
$$
Sample.Estimate ± (T.multiplier_{(df,1-\alpha/2)} × Standard.Error.On.Estimate)
$$ So
$$
b_1 ± T.multiplier_{(df,1-\alpha/2)} × se(b_1)
$$
Here's how to get the slope and intercept
```{r}
Model1.Summary <- ols_regress(Model1)
intercept <- Model1.Summary$betas[1]
se.intercept <- Model1.Summary$std_errors[1]
slope <- Model1.Summary$betas[2]
se.slope <- Model1.Summary$std_errors[2]
```
So the 95% interval on the slope for model 1 is between 22.8 g/mm and
42.86 g/mm:
```{r}
lower <- slope - qt(0.975,df=149)*se.slope
upper <- slope + qt(0.975,df=149)*se.slope
lower
upper
```
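Base R can also produce these intervals in one call with `confint()`. A sketch on the built-in `mtcars` data (not `Model1`), checked against the manual estimate-plus-multiplier formula:

```{r}
# confint() reproduces: estimate +/- t-multiplier * standard error
fit <- lm(mpg ~ wt, data = mtcars)
est <- coef(summary(fit))["wt", "Estimate"]
se  <- coef(summary(fit))["wt", "Std. Error"]
manual <- est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se
all.equal(unname(manual), unname(confint(fit, "wt", level = 0.95)[1, ]))  # TRUE
```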
## 99% Confidence interval on the intercept
$$
Sample.Estimate ± (T.multiplier_{(df,1-\alpha/2)} × Standard.Error.On.Estimate)
$$ So
$$
b_0 ± T.multiplier_{(df,1-\alpha/2)} × se(b_0)
$$
```{r}
Model1.Summary <- ols_regress(Model1)
intercept <- Model1.Summary$betas[1]
se.intercept <- Model1.Summary$std_errors[1]
```
So the 99% error bar /confidence interval on the intercept for model 1
is between -5053g and -18.5g.
```{r}
lower <- intercept - qt(0.995,df=149)*se.intercept
upper <- intercept + qt(0.995,df=149)*se.intercept
lower
upper
```
## 90% Confidence interval on the model when flipper length = 170mm
So we need to know the range of values of the best-fit line when
flipper length = 170mm. We use the `predict` function:
```{r}
test_location <- data.frame(flipper_length_mm = 170)
# note: the argument is "level", not "levels"
predict(Model1, test_location, interval = "confidence", level = 0.9)
```
The 90% error bar on the AVERAGE mass of penguins with flipper length =
170mm is between 2834g and 3526g.
# Prediction intervals
## 90% PREDICTION interval on the model when flipperlength=170mm
If the confidence interval is "where is the true regression line when
x=170", the prediction interval is, where is the *cloud of points* for
x=170mm.
AKA if I had a new penguin with flipper length of 170mm, what is our
best guess of the range of weights it might have
```{r}
new_bird <- data.frame(flipper_length_mm = 170)
# note: the argument is "level", not "levels"
predict(Model1, new_bird, interval = "prediction", level = 0.9)
```
If our model is true, a new bird is likely to weigh between 2215g and
3876g.
## PREDICTION INTERVALS ARE BIGGER THAN CONFIDENCE INTERVALS
If you are confused, see the textbook or the many resources online;
here are just two:
<https://www.indeed.com/career-advice/career-development/prediction-interval-vs-confidence-interval>
<https://www.statology.org/confidence-interval-vs-prediction-interval/>
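A quick demonstration on the built-in `mtcars` data (not the penguin model) shows the prediction interval is always the wider of the two at the same x:

```{r}
# Confidence vs prediction interval at the same x value
fit   <- lm(mpg ~ wt, data = mtcars)
new_x <- data.frame(wt = 3)
ci <- predict(fit, new_x, interval = "confidence", level = 0.90)
pi <- predict(fit, new_x, interval = "prediction", level = 0.90)
ci  # narrow: where is the AVERAGE response at wt = 3?
pi  # wide: where might a single new observation land?
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```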