# Probability
\def\Prob{\operatorname{Pr}}
\def\E{\operatorname{E}}
## Some terminology
Probability is about quantifying the relative chances of various possible
outcomes of a random process.
As a very simple example (used to illustrate the terminology below), consider
rolling a single 6-sided die.
**sample space:** The set of all possible outcomes of a random process.
[{1, 2, 3, 4, 5, 6}]
**event:** a set of outcomes (subset of sample space) [E = {2, 4, 6} is the event that we obtain an even number]
**probability:** a number between 0 and 1 assigned to an event
(really a function that assigns numbers to each event). We write this
P(E). [P(E) = 1/2 where E = {1, 2, 3}]
**random variable:** a random process that produces a number. [So rolling a die can
be considered a random variable.]
**probability distribution:** a description of all possible outcomes and their
probabilities.
For rolling a die, we might describe the distribution with a table like this:
1 | 2 | 3 | 4 | 5 | 6
---:|:---:|:---:|:---:|:---:|:---
1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6
**support (of a random variable):** the set of possible values of
a random variable. This is very similar to the sample space.
**probability mass function (pmf):** a function (often denoted with
$p$ or $f$) that takes possible values of a
discrete random variable as input and returns the probability of that outcome.
* If $S$ is the support of the random variable, then
$$
\sum_{x \in S} p(x) = 1
$$
and any nonnegative function with this property is a pmf.
* Probabilities of events are
obtained by adding the probabilities of all outcomes in the event
(a quick numerical check appears after this list):
$$
\Prob(E) = \sum_{x \in E} p(x)
$$
* pmfs can be represented in a table (like the one above) or graphically with a
probability histogram or lollipop plot like the ones below. [These are not
for the 6-sided die, as we can tell because the probabilities are not
the same for each input; the die rolling example would make very boring plots.]
```{r ch04-binom-01, echo = FALSE}
gf_dist("binom", size = 6, prob = 0.25, kind = "histogram",
        binwidth = 1, color = "black",
        title = "A histogram for a discrete random variable")
gf_dist("binom", size = 6, prob = 0.25,
        title = "A lollipop plot for a discrete random variable")
```
* Histograms are generally presented on the **density scale** so the total
area of the histogram is 1. (In this example, the bin widths are 1, so this
is the same as being on the probability scale.)
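As a quick numerical check of the first two properties above, here is a minimal
sketch using the fair-die pmf from our running example:

```{r ch04-die-pmf}
# pmf for a fair 6-sided die
p <- rep(1/6, 6)
sum(p)              # total probability should be 1
sum(p[c(2, 4, 6)])  # P(E) for the event E = {2, 4, 6}
```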
**probability density function (pdf):** a function (often denoted with $p$ or $f$)
that takes the possible
values of a continuous random variable as input and returns the probability
*density*.
* If $S$ is the support of the random variable, then [^04-1]
$$
\int_{x \in S} f(x) \; dx = 1
$$
and any nonnegative function with this property is a pdf.
[^04-1]: Kruschke likes to write his integrals in a different order: $\int dx \; f(x)$
instead of $\int f(x) \; dx$. Either order means the same thing.
* Probabilities are obtained by integrating
(visualized by the area under the density curve):
$$
\Prob(a \le X \le b) = \int_a^b f(x) \; dx
$$
```{r ch04-beta-01, echo = FALSE}
xpbeta(c(0.20, 0.40), shape1 = 3, shape2 = 6, return = "plot") %>%
  gf_labs(title = "Probability = area under a density curve")
```
**kernel function:**
If $f$ is nonnegative and $\int_{x \in S} f(x) \; dx = k$ for some finite,
positive real number $k$, then
$f$ is a kernel function. We can obtain the pdf from the kernel by dividing
by $k$.
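For example, here is a minimal sketch using a kernel of our own choosing,
$f(x) = e^{-|x|}$: `integrate()` finds $k$, and dividing by $k$ turns the kernel
into a pdf.

```{r ch04-kernel-01}
kernel <- function(x) exp(-abs(x))       # integrates to 2, so not a pdf
k <- integrate(kernel, -Inf, Inf)$value  # k = 2
pdf <- function(x) kernel(x) / k         # divide by k to get a pdf
integrate(pdf, -Inf, Inf)$value          # total area is now 1
```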
**cumulative distribution function (cdf):** a function (often denoted with a
capital $F$) that takes a possible value
of a random variable as input and returns the probability of obtaining a
value less than or equal to the input:
$$
F_X(x) = \Prob(X \le x)
$$
cdfs can be defined for both discrete and continuous random variables.
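For a continuous random variable, the cdf value $F(x)$ is the area under the pdf
to the left of $x$. A quick check with the standard normal distribution:

```{r ch04-cdf-check}
integrate(dnorm, -Inf, 1)$value  # area under the pdf up to 1
pnorm(1)                         # the cdf returns the same value
```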
**a family of distributions:** a collection of distributions which
share common features but are distinguished by different parameter values.
For example, we could have the family of distributions of fair dice
random variables. The parameter would tell us how many sides the die has.
Statisticians call this family the **discrete uniform distributions** because
all the probabilities are equal (1/6 for a 6-sided die, 1/10 for a $D_{10}$, etc.).
We will get to know several important families of distributions; the
**binomial**, **beta**, **normal**, and **t** families will be
especially useful. You may already be familiar with some or all of these.
We will also use distributions that have no name and are only described by
a pmf or pdf, or perhaps only by a large number of random samples from which
we attempt to estimate the pmf or pdf.
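For instance (a minimal sketch with made-up data), if all we had were random
samples from some unnamed distribution, we could estimate its pdf with a
density plot:

```{r ch04-density-estimate}
x <- rbeta(1000, shape1 = 3, shape2 = 6)  # pretend we don't know where x came from
gf_dens(~ x, title = "A density estimate based on 1000 random samples")
```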
## Distributions in R
pmfs, pdfs, and cdfs are available in R for many important families
of distributions. You just need to know a few things (illustrated below with
the binomial family):
* each family has a standard abbreviation in R
* pmf and pdf functions begin with the letter `d` followed by the family abbreviation
* cdf functions begin with the letter `p` followed by the family abbreviation
* the inverse of the cdf function is called the quantile function;
it starts with the letter `q` followed by the family abbreviation
* functions beginning with `r` can generate random samples from a distribution
* help for any of these functions will tell you what R calls the parameters
of the family.
* `gf_dist()` can be used to make various plots of distributions.
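For example, the binomial family has abbreviation `binom` and parameters `size`
and `prob`, so:

```{r ch04-dpqr-binom}
dbinom(2, size = 6, prob = 0.25)    # pmf:  P(X = 2)
pbinom(2, size = 6, prob = 0.25)    # cdf:  P(X <= 2)
qbinom(0.5, size = 6, prob = 0.25)  # quantile function: the median
rbinom(5, size = 6, prob = 0.25)    # 5 random draws
```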
### Example: Normal distributions
As an example, let's look at the family of normal distributions. If you type
`dnorm(` and then hit TAB, or if you type `args(dnorm)`, you can see the arguments
for this function.
```{r ch04-dnorm-args}
args(dnorm)
```
From this we see that the parameters are called `mean` and `sd` and have
default values of 0 and 1. These values will be used if we don't specify
something else.
As with many of the pmf and pdf functions,
there is also an option to get back the log of the
pmf or pdf by setting `log = TRUE`. This turns out to be computationally much
more efficient in many contexts, as we will see.
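Here is a quick illustration of why: the product of many densities (as arises
when computing likelihoods) underflows to 0 in floating point, while the
corresponding sum of log densities is perfectly manageable.

```{r ch04-log-scale}
x <- rnorm(1000, mean = 10, sd = 2)
prod(dnorm(x, mean = 10, sd = 2))             # underflows to 0
sum(dnorm(x, mean = 10, sd = 2, log = TRUE))  # fine on the log scale
```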
Let's begin with some pictures of a normal distribution with
mean 10 and standard deviation 2:
```{r ch04-norm-01}
gf_dist("norm", mean = 10, sd = 2, title = "pdf for Norm(10, 2)")
gf_dist("norm", mean = 10, sd = 2, kind = "cdf", title = "cdf for Norm(10, 2)")
```
Now some exercises. Assume $X \sim {\sf Norm}(10, 2)$.
1. What is $\Prob(X \le 5)$?
We can see by inspection that it is less than 0.5. `pnorm()` will give us
the value we are after; `xpnorm()` will provide more verbose output and a plot
as well.
```{r ch04-pnorm-01}
pnorm(5, mean = 10, sd = 2)
xpnorm(5, mean = 10, sd = 2)
```
2. What is $\Prob(5 \le X \le 10)$?
```{r ch04-pnorm-02}
pnorm(10, mean = 10, sd = 2) - pnorm(5, mean = 10, sd = 2)
```
3. How tall is the density function at its peak?
Normal distributions are symmetric about their means, so we need the
value of the pdf at 10.
```{r ch04-dnorm-01}
dnorm(10, mean = 10, sd = 2)
```
4. What is the mean of a Norm(10, 2) distribution?
Ignoring for the moment that we know the answer is 10, we can compute it.
Notice the use of `dnorm()` in the computation.
```{r ch04-integrate}
integrate(function(x) x * dnorm(x, mean = 10, sd = 2), -Inf, Inf)
```
5. What is the variance of a Norm(10, 2) distribution?
Again, we know the answer is the square of the standard deviation, so 4.
But let's get R to compute it in a way that would work for other distributions
as well.
```{r ch04-integrate-02}
integrate(function(x) (x - 10)^2 * dnorm(x, mean = 10, sd = 2), -Inf, Inf)
```
6. Simulate a data set with 50 values drawn from a ${\sf Norm}(10, 2)$
distribution and make a histogram of the results and overlay the
normal pdf for comparison.
```{r ch04-rnorm}
x <- rnorm(50, mean = 10, sd = 2)
# be sure to use a density histogram so it is on the same scale as the pdf!
gf_dhistogram(~ x, bins = 10) %>%
  gf_dist("norm", mean = 10, sd = 2, color = "red")
```
### Simulating running proportions

The chunk below simulates 500 flips of a fair coin and plots the running
proportion of heads, which settles toward 0.5 as the number of flips grows.
```{r ch04-simulation}
library(ggformula)
library(dplyr)
theme_set(theme_bw())

Flips <-
  tibble(
    n = 1:500,
    flip = rbinom(500, 1, 0.5),
    running_count = cumsum(flip),
    running_prop = running_count / n
  )

gf_line(
  running_prop ~ n, data = Flips,
  color = "skyblue",
  ylim = c(0, 1.0),
  xlab = "Flip Number", ylab = "Proportion Heads",
  title = "Running Proportion of Heads") %>%
  gf_hline(yintercept = 0.5, linetype = "dotted")
```
\newpage
## Joint, marginal, and conditional distributions
Sometimes (most of the time, actually) we are interested in joint distributions.
A joint distribution is the distribution of multiple random variables
that result from the same random process. For example, we might roll a pair
of dice and obtain two numbers (one for each die). Or we might collect
a random sample of people and record the height for each of them. Or we might
randomly select one person, but record multiple facts (height and weight, for example). All of these situations are covered by joint distributions.[^04-2]
[^04-2]: Kruschke calls these 2-way distributions, but more than two variables
can be involved.
### Example: Hair and eye color
Kruschke illustrates joint distributions with an example of hair and eye color
recorded for a number of people. [^04-3] The table below gives the proportions
for each hair/eye color combination.
[^04-3]: The datasets package has a version of this data with a third
variable: `sex`. (It is stored as a 3-d table rather than as a data frame.) According
to the help for this data set, these data come from
"a survey of students at the University of Delaware reported by Snee (1974).
The split by Sex was added by Friendly (1992a) for didactic purposes."
It isn't exactly clear what population this might represent.
```{r ch04-HairEyeColor, include = FALSE}
# NB: there is a different HairEyeColor data set in datasets!
CalvinBayes::HairEyeColor %>%
  group_by(Hair) %>%
  summarise(
    Blue  = sum(Count[Eye == "Blue"]),
    Green = sum(Count[Eye == "Green"]),
    Hazel = sum(Count[Eye == "Hazel"]),
    Brown = sum(Count[Eye == "Brown"]))
df_stats(Count ~ Eye + Hair, data = CalvinBayes::HairEyeColor, total = sum) %>%
  mutate(prop = round(total/sum(total), 3))
```
<!-- Hair/Eyes | Blue | Green | Hazel | Brown -->
<!-- ---------:|:-----:|:-----:|:-----:|:------: -->
<!-- Black | 20 | 5 | 15 | 68 -->
<!-- Blond | 94 | 16 | 10 | 7 -->
<!-- Brown | 84 | 29 | 54 | 119 -->
<!-- Red | 17 | 14 | 14 | 26 -->
Hair/Eyes | Blue | Green | Hazel | Brown
---------:|:-----:|:-----:|:-----:|:------:
Black | 0.034 | 0.008 | 0.025 | 0.115
Blond | 0.159 | 0.027 | 0.017 | 0.012
Brown | 0.142 | 0.049 | 0.091 | 0.201
Red | 0.029 | 0.024 | 0.024 | 0.044
Each value in the table indicates the proportion
of people that have a particular hair color *and*
a particular eye color. So the upper left
cell says that 3.4% of people
have black hair and blue eyes
(in this particular
sample -- the proportions will vary a lot depending
on the population of interest).
We will denote this as
$$
\Prob(\mathrm{Hair} = \mathrm{black}, \mathrm{Eyes} = \mathrm{blue}) = 0.034 \;.
$$
or more succinctly as
$$
p(\mathrm{black}, \mathrm{blue}) = 0.034 \;.
$$
This type of probability is called a **joint probability**
because it gives the probability of **both** things
happening.
Use the table above to do the following.
1. What is $p(\mathrm{brown}, \mathrm{green})$ and what
does that number mean?
2. Add the proportions across each row and down each column.
(Record them to the right and along the bottom of the table.)
For example, in the first row we get
$$
0.034 + 0.008 + 0.025 + 0.115 =
`r 0.034 + 0.008 + 0.025 + 0.115` \;.
$$
a. Explain why
$p(\mathrm{black}) = `r 0.034 + 0.115 + 0.008 + 0.025`$
is good notation for this number.
2. Up to round-off error, the total of all the
proportions should be 1. Check that this is true.
3. What proportion of people with black hair have
blue eyes?
This is called a conditional probability. We denote it
as
$\Prob(\mathrm{Eyes} = \mathrm{blue} \mid \mathrm{Hair} = \mathrm{black})$.
or
$p(\mathrm{blue} \mid \mathrm{black})$.
4. Compute some other conditional probabilities.
a. $p(\mathrm{black} \mid \mathrm{blue})$.
a. $p(\mathrm{blue} \mid \mathrm{blond})$.
a. $p(\mathrm{blond} \mid \mathrm{blue})$.
a. $p(\mathrm{brown} \mid \mathrm{hazel})$.
a. $p(\mathrm{hazel} \mid \mathrm{brown})$.
5. There are 32 such conditional probabilities that we can
compute from this table. Which is largest? Which is smallest?
6. Write a general formula for computing the
conditional probability
$p(c \mid r)$ from the $p(r,c)$ values.
($r$ and $c$ are to remind you of rows and columns.)
7. Write a general formula for computing the
conditional probability
$p(r \mid c)$ from the $p(r,c)$ values.
(A numerical check of these formulas appears after this list.)
If we have continuous random variables, we can do a similar
thing. Instead of working with probability, we will
work with a pdf. Instead of sums, we will have integrals.
8. Write a general formula for computing each of the following
if $p(x,y)$ is a continuous joint pdf.
a. $p_X(x) = p(x) =$
b. $p_Y(y) = p(y) =$
c. $p_{Y\mid X}(y\mid x) = p(y \mid x) =$
d. $p_{X\mid Y}(x\mid y) = p(x \mid y) =$
9. We can express both versions of conditional probability
using a word equation. Fill in the missing numerator and
denominator:
$$
\mathrm{conditional} = \frac{\phantom{joint}}{\phantom{marginal}}
$$
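Once you have answered the questions above, you can check your work in R.
Here is a minimal sketch that enters the joint proportions by hand and computes
the marginal distributions and one conditional distribution:

```{r ch04-joint-check}
# joint proportions from the table above: rows = hair, columns = eyes
joint <- rbind(
  Black = c(Blue = 0.034, Green = 0.008, Hazel = 0.025, Brown = 0.115),
  Blond = c(Blue = 0.159, Green = 0.027, Hazel = 0.017, Brown = 0.012),
  Brown = c(Blue = 0.142, Green = 0.049, Hazel = 0.091, Brown = 0.201),
  Red   = c(Blue = 0.029, Green = 0.024, Hazel = 0.024, Brown = 0.044)
)
rowSums(joint)    # marginal distribution of hair color
colSums(joint)    # marginal distribution of eye color
sum(joint)        # should be 1 (up to round-off)
joint["Black", ] / sum(joint["Black", ])  # p(eyes | black hair)
```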
### Independence
If $p(x \mid y) = p(x)$ (conditional = marginal)
for all combinations of $x$ and $y$, we say that
$X$ and $Y$ are **independent**.
10. Use the definitions above to express independence another way.
11. Are hair and eye color independent in our example?
12. If we randomly select a card from
a standard deck (52 cards, 13 denominations, 4 suits),
are suit and denomination independent?
13. Create a table for two independent random variables
$X$ and $Y$, each of which takes on only 3 possible values.
14. Now create a table for a different pair $X$ and $Y$
that are not independent but have the same marginal probabilities
as in the previous exercise.
## Exercises {#ch04-exercises}
<!-- similar to Kruschke2-4.4 -->
1. Suppose a random variable has the pdf $p(x) = 6x (1-x)$ on the interval
$[0,1]$. (That means it is 0 outside of that interval.)
a. Use `function()` to create a function in R that is equivalent to `p(x)`.
b. Use `gf_function()` to plot the function on the interval $[0, 1]$.
c. Integrate by hand to show that the total area under the pdf is 1
(as it should be for any pdf).
c. Now have R compute that same integral (using `integrate()`).
d. What is the largest value of $p(x)$? At what value of $x$ does it occur?
Is it a problem that this value is larger than 1?
Hint: differentiation might be useful.
2. Recall that
$\E(X) = \int x f(x) \;dx$ for a continuous random variable with pdf $f$ and
$\E(X) = \sum x f(x)$ for a discrete random variable with pmf $f$. (The
integral or sum is over the support of the random variable.) Compute
the expected value for the following random variables.
a. $A$ is discrete with pmf $f(x) = x/10$ for $x \in \{1, 2, 3, 4\}$.
b. $B$ is continuous with kernel $f(x) = x^2(1-x)$ on $[0, 1]$.
Hint: first figure out what the pdf is.
3. Compute the variance and standard deviation
of each of the distributions in the previous problem.
<!-- similar to Kruschke2-4.5 -->
4. In Bayesian inference, we will often need to come up with a distribution
that matches certain features that correspond to our knowledge or intuition about
a situation. Find a normal distribution with a mean of 10 such that half of the
distribution is within 3 of 10 (i.e., between 7 and 13).
Hint: use `qnorm()` to determine how many standard deviations are between 10 and 7.
<!-- Kruschke2-4.6 -->
5. School children were surveyed regarding their favorite foods. Of the total
sample, 20% were 1st graders, 20% were 6th graders, and 60% were 11th graders.
For each grade, the following table shows the proportion of respondents that
chose each of three foods as their favorite.
a. From that information, construct a table of joint probabilities of grade and favorite food.
a. Are grade and favorite food independent? Explain how you ascertained the answer.
grade | Ice cream | Fruit | French fries
:----:|:----------:|:-----------:|:------------:
1st | 0.3 | 0.6 | 0.1
6th | 0.6 | 0.3 | 0.1
11th | 0.3 | 0.1 | 0.6
<!--
Hint: You are given p(grade) and p(food|grade).
You need to determine p(grade,food). -->
6. Three cards are placed in a hat. One is black on both sides, one is white on both sides, and the third is white on one side and black on the other. One card is selected at random from the three cards in the hat and placed on the table.
The top of the card is black.
a. What is the probability that the bottom is also black?
b. What notation should we use for this probability?
7. The three cards from the previous problem are returned to the hat.
Once again a card is pulled and placed on the table, and once again the top
is black. This time a second card is drawn and placed on the table. The
top side of the second card is white.
a. What is the probability that the bottom side of the first card is black?
b. What is the probability that the bottom side of the second card is black?
c. What is the probability that the bottom side of both cards is black?
8. **Pandas.**
Suppose there are two species of panda, A and B. Without a special blood
test, it is not possible to tell them apart. But it is known that half of
pandas are of each species and that 10% of births from species A are twins and
20% of births from species B are twins.
a. If a female panda has twins, what is the probability that she is
from species A?
b. If the same panda later has another set of twins, what is the probability
that she is from species A?
c. A different panda has twins and a year later gives birth to a single panda.
What is the probability that this panda is from species A?
9. **More Pandas.**
You get more interested in pandas and learn that at your favorite zoo,
70% of pandas are species A and 30% are species B. You learn that one of the
pandas has twins.
a. What is the probability that the panda is species A?
b. The same panda gives birth to a single panda the next year. Now what is the
probability that she is species A?
## Footnotes