---
title: "Introduction to R for Biostatistics 4"
output: html_document
---
```{r global_options, include=FALSE}
library(plyr)
library(dplyr)
library(ggplot2)
library(Rmisc)
set.seed(1) # this makes everything reproducible
```
# In-class worksheet 4
**29 Feb to 4 Mar, 2016**
In this worksheet you will learn how to generate data from different distributions and how to assess graphically whether a dataset is normally distributed. Please answer the questions in the document.
## Generating data from distributions
### Instructions
For many applications, it is important to be able to generate random numbers from specified distributions. R provides functions for many distributions, including of course the normal distribution. In fact, there is a whole family of functions for the normal distribution: *rnorm* generates random samples, *dnorm* evaluates the density function, and *pnorm* evaluates the cumulative distribution function. Let's start with generating some data:
```{r}
m = 5 # mean
s = 2 # sd
x <- rnorm(50, mean = m, sd = s) # generate 50 samples from a normal distribution with mean 5 and sd 2
data = data.frame(x)
ggplot(data, aes(x=x)) +
  stat_bin(aes(y=..count../sum(..count..))) + # histogram of relative frequencies (stat_bin is the statistic that geom_histogram uses internally)
  geom_vline(aes(xintercept=mean(x, na.rm=T)), # compute the sample mean and add it to the plot as a vertical line
             color="green", linetype="dashed", size=.75) +
  geom_vline(aes(xintercept=m), # add the true mean to the plot as a vertical line
             color="red", linetype="dashed", size=.75) +
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = m, sd = s)) + # add the Gaussian density with the true parameters
  xlab("x") + ylab("Fraction") +
  theme_grey(base_size=14) +
  xlim(m-3*s, m+3*s)
```
Is the sample mean the same as the true mean?
> Provide your answer here.
Run the cell multiple times. Does the histogram always look the same and lead to the same mean?
> Provide your answer here.
Play with the parameters of the distribution to observe what changes (mind the scale of the x-axis!).
> Provide your answer here.
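Before moving on, here is the whole family of normal-distribution functions mentioned above in one place. This is a quick illustrative sketch; the specific values passed to the functions are arbitrary.
```{r}
dnorm(5, mean = 5, sd = 2)     # density (height of the bell curve) at x = 5
pnorm(5, mean = 5, sd = 2)     # fraction of the distribution below 5 (0.5, since 5 is the mean)
qnorm(0.975, mean = 5, sd = 2) # value below which 97.5% of the distribution lies (inverse of pnorm)
rnorm(3, mean = 5, sd = 2)     # three random draws
```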
Let's look at the relationship between the sample size and the variability of the mean in a bit more detail. We will now generate 1000 samples with the same mean and sd for each of a set of sample sizes. For each sample size, we will not look at the distribution of the data points, but at the distribution of the *means* of the individual samples.
```{r}
m = 5 # mean
s = 2 # sd
n = c(10, 20, 50, 200, 500) # we will compute the mean from samples of these sizes
rep = 1000 # for each sample size, we will do that 1000 times
# this loop generates the data - try to understand it
x = numeric(length(n)*rep)
nsamp = numeric(length(n)*rep)
for(i in 1:length(n)){
  for(j in 1:rep){
    x[(i-1)*rep+j] = mean(rnorm(n[i], mean=m, sd=s))
    nsamp[(i-1)*rep+j] = n[i]
  }
}
data = data.frame(x, nsamp)
# we create faceted histograms of the data
ggplot(data, aes(x=x)) +
  geom_histogram(aes(y=..density..), bins=50, alpha=.5, position="stack") +
  facet_grid(nsamp ~ .) +
  xlab("Sample mean") + ylab("Density") +
  theme_grey(base_size=14)
```
Are there any samples whose mean is larger than six for samples of size 10? What about samples of size 500?
> Provide your answer here.
Describe how the width of the distribution of means changes with the sample size.
> Provide your answer here.
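As an aside, the nested loop above can also be written more compactly. The following sketch should produce data with the same structure (the random draws will differ, and it assumes the variables `m`, `s`, `n`, and `rep` from the chunk above are still defined):
```{r}
# compact alternative to the nested loop: one replicate() per sample size, collected with lapply
x_alt <- unlist(lapply(n, function(ni) replicate(rep, mean(rnorm(ni, mean = m, sd = s)))))
nsamp_alt <- base::rep(n, each = rep) # base::rep, because 'rep' is also the name of a variable above
data_alt <- data.frame(x = x_alt, nsamp = nsamp_alt)
```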
The width of this distribution is called the standard error of the mean. This quantity is the standard deviation of the distribution of means and is defined as $SEM=\frac{\sigma}{\sqrt{N}}$, where $\sigma$ is the standard deviation of the underlying distribution and $N$ is the sample size. We can check graphically that this formula is correct:
```{r}
# compute summary statistics about the distribution of means
data_sum <- summarySE(data = data, measurevar = "x", groupvars = c("nsamp"))
semfun <- function(n, s) {s/sqrt(n)} # this is the SEM formula
ggplot(data_sum, aes(x=nsamp, y=sd)) + # the sd of the distribution of means is the SEM
  geom_point() +
  stat_function(fun=semfun, args = list(s=2)) + # here we evaluate the SEM formula
  xlab("Sample size") + ylab("SEM") +
  theme_grey(base_size=14)
```
This looks reassuring. The points are all on the line. What does this mean?
> Provide your answer here.
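For a single sample size, we can also compare the empirical and theoretical values numerically. A small sketch, using the summary table computed above:
```{r}
# empirical sd of the 1000 means of samples of size 50, versus the theoretical SEM
data_sum$sd[data_sum$nsamp == 50]
semfun(n = 50, s = 2) # 2 / sqrt(50), roughly 0.28
```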
We are now able to generate samples from a Normal distribution. Next, we are interested in the question of what fraction of the samples lies below a certain value. We can easily read this off a *cumulative histogram*:
```{r}
m = 5 # mean
s = 2 # sd
x <- rnorm(50, mean = m, sd = s) # generate 50 samples from a normal distribution with mean 5 and sd 2
data = data.frame(x)
ggplot(data, aes(x=x)) +
  geom_histogram(aes(y=cumsum(..count..)), bins = 20) +
  ylab("Cumulative count")
```
Roughly how many of the samples are smaller than 2.5?
> Provide your answer here.
How can you use this plot to graphically estimate the mean? Add the required lines to the plot (use the *+geom_hline* command).
> Provide your answer here.
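Note that ggplot2 can also draw the empirical cumulative distribution of a sample directly, without binning, via *stat_ecdf*. A short sketch using the same data frame as above:
```{r}
# empirical cumulative distribution function of the sample (no binning)
ggplot(data, aes(x = x)) +
  stat_ecdf() +
  ylab("Cumulative fraction")
```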
Such a function is also defined for the Normal distribution itself, not only for samples from it. There it is called the *cumulative distribution function* (or simply *distribution function*). It is very useful because it lets you assess the percentiles of certain values quite easily. This is how it looks (for comparison, the density is shown as a dashed line):
```{r}
m = 5 # mean
s = 2 # sd
x <- seq(-5, 15, by = 0.1) # x range for the plot
data = data.frame(x)
ggplot(data, aes(x=x)) +
  stat_function(fun = pnorm, colour = "red",
                args = list(mean = m, sd = s)) + # add the Gaussian cumulative distribution function with the true parameters
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = m, sd = s), linetype="dashed") + # add the Gaussian density, dashed, for comparison
ylab("Cumulative density")
```
If you are given the cumulative distribution function of a normally distributed dataset, you can find out its mean and its standard deviation. How?
> Provide your answer here.
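To illustrate how the cumulative distribution function answers percentile questions directly, here is a one-line sketch (the value 7 is arbitrary):
```{r}
pnorm(7, mean = 5, sd = 2) # fraction of a Normal(5, 2) distribution that lies below 7
```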
For many applications, it is important to diagnose whether a given dataset comes from a Normal distribution or not. One graphical tool for this is the *QQ-plot*, where *QQ* stands for quantile-quantile. The plot shows the quantiles of the actual data values against the corresponding quantiles of a standard Normal distribution.
```{r}
set.seed(99)
m = 5 # mean
s = 2 # sd
x <- rnorm(50, mean = m, sd = s) # generate 50 samples from a normal distribution with mean 5 and sd 2
data = data.frame(x)
ggplot(data, aes(sample=x)) +
  stat_qq(color="firebrick2", alpha=1) +
  geom_abline(intercept = mean(x), slope = sd(x)) # reference line: sample mean as intercept, sample sd as slope
```
If the sample roughly follows a Normal distribution, the points lie approximately on a straight line. The slope of that line is the standard deviation and the y-intercept is the mean. Except for the most extreme point, the sample we plotted follows a straight line quite well.
How is the plot constructed? Let's say we have a sample of N points.
1. Sort the points in the sample. The position of a point in the sorted sample is called its rank; we call it k.
2. For the k-th value, compute $(k-.5)/N$ and obtain the value where it should lie under a standard normal distribution. You can think of this as starting at $(k-.5)/N$ on the y-axis and going to the right until you hit the cumulative distribution function of a standard normal distribution, then going straight down until you hit the x-axis. The direct way to compute this number is the inverse distribution function, or quantile function, called *qnorm*. Find out how it works using *?qnorm*.
3. Make a scatter plot of the sorted points and the theoretical expected values, assuming a standard normal. If the data comes from a normal, you expect the points to lie roughly on a straight line. If not, they can form all kinds of shapes.
Now implement this construction "by hand" and compare to the plot above.
```{r}
set.seed(99)
m = 5 # mean
s = 2 # sd
x <- rnorm(50, mean = m, sd = s) # generate 50 samples from a normal distribution with mean 5 and sd 2
# ASSIGNMENT
xs = sort(x)      # step 1: sort the sample
N = length(x)
p = (1:N - .5)/N  # step 2: (k - 0.5)/N for each rank k ...
q = qnorm(p)      # ... and the corresponding quantiles of a standard normal
data = data.frame(xs, q)
ggplot(data, aes(x=q, y=xs)) + # step 3: scatter plot of sorted sample against theoretical quantiles
  geom_point()
```
Now, finally, let's use this to diagnose whether some samples are normally distributed or not. Of course, this is not a rigorous quantitative procedure, but in many cases it is quite helpful. Read the data from *qqplot.csv*. There are four data sets combined in one data frame, only one of which is normally distributed. Visualize the data using histograms and use QQ-plots to find out which one is likely normally distributed.
```{r}
# set.seed(552)
# idx <- factor(sample(1:4))
# generating data
# m = 5 # mean
# s = 2 # sd
#
# x <- rnorm(100, mean = m, sd = s) # generate 100 samples from a normal distribution with mean 5 and sd 2
# data <- data.frame(x=x,set=idx[1])
#
# x <- rlnorm(100,mean=1,sdlog=1)
# data <- rbind(data,data.frame(x=x,set=idx[2]))
#
# x <- rgamma(100,shape=.5, scale=1)
# data <- rbind(data,data.frame(x=x,set=idx[3]))
#
# x <- rt(100,df=2)
# data <- rbind(data,data.frame(x=x,set=idx[4]))
data = read.csv("qqplot.csv") # adjust the path if qqplot.csv is stored elsewhere
ggplot(data, aes(sample=x)) +
  stat_qq() +
  facet_grid(. ~ set)
```
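The task above also asks for histograms. Here is a sketch of faceted histograms of the same four data sets (it assumes the data frame loaded in the chunk above):
```{r}
# one histogram per data set; free x scales because the distributions have very different ranges
ggplot(data, aes(x = x)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ set, scales = "free_x")
```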
Which dataset is likely normally distributed?
> Provide your answer here.
What can you say about the shape of the other distributions by looking at the qq-plots?
> Provide your answer here.