-
Notifications
You must be signed in to change notification settings - Fork 1
/
002-the-coalescent-and-demographic-scenarios.Rmd
323 lines (245 loc) · 12.8 KB
/
002-the-coalescent-and-demographic-scenarios.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
---
title: "The Coalescent with Population Structure and Demography, etc."
output:
html_notebook:
toc: true
toc_float: true
author: "Eric C. Anderson"
bibliography: references.bib
---
## Using `ms` via GSImulator
We are using `ms` this way because it is fairly easy to distribute the binaries
in an R package.
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(GSImulator)
```
`ms` is a coalescent simulator written in C by Dick Hudson at U Chicago. It has
been around for a long time, but a paper describing it came out long after it had
been in routine use [@hudsonGeneratingSamplesWright2002]
It can be downloaded from [here](http://home.uchicago.edu/rhudson1/source/mksamples.html).
However, I have
wrapped up compiled versions in an R package, so you needn't worry about that.
It is quite fast for simulating small segments of
DNA without much recombination. It is worth noting, however, that
a more recent program that implements the whole process more efficiently
is available and much better for whole-genome-scale simulation [@kelleherEfficientCoalescentSimulation2016].
### Running `ms` within R with the system2 command
`ms` is a program that runs in the shell. Here we use R's `system2()` command to run it
from within R.
If we invoke `ms` with no arguments it gives us a message about what kinds
of arguments one can use:
```{r}
# ms_binary() gives the path to ms on your system
system2(command = ms_binary(), args = "")
```
That shows a lot of fun things to do. Let's start by just simulating coalescent
trees and looking at them. In order to do that, you will need to install another
couple packages if you don't already have them. See the instructions at: [https://eriqande.github.io/coalescent-hands-on/000-prep-for-coalescent.nb.html](https://eriqande.github.io/coalescent-hands-on/000-prep-for-coalescent.nb.html).
Load the other packages we will use:
```{r, message=FALSE, warning=FALSE}
library(ape)
library(ggtree)
```
### A basic vanilla tree
Here we simulate a single tree with 10 tips and tell `ms` to just print the tree out
in Newick format:
```{r}
system2(command = ms_binary(), args = "10 1 -T")
```
The resulting coalescent tree is written out above in "Newick" format, which is
named after a restaurant ("Newick's") in New Hampshire where a group of phylogeneticist's
sketched it out on a napkin.
It shows the branch tip numbers and the patterns in which
branches in a tree join one another. It also shows the branch
lengths in units of time.
In trees made by `ms`, the branch lengths are coalescence times are
in units of $4N$ generations. So an `ms` time of 0.5 is the expected
coalescence time of a single pair of lineages.
Cool. Now, we can redirect that sort of output into a file, read it in and plot
a tree.
```{r}
system2(command = ms_binary(), args = "10 1 -T", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
ggtree(ctree, layout = "rectangular") +
geom_tiplab(size = 3) + coord_flip() + scale_x_reverse()
```
Note that from here on out, we are going to flip our trees sideways, since it is a little
easier to plot them that way with ggtree. Like this:
```{r}
p <- ggtree(ctree, layout = "rectangular") +
geom_tiplab(size = 3) +
theme_tree2()
revts(p)
```
We also put a "timeline" on that one so that the units of time can be visualized.
In this case, time going backwards is depicted as negative.
### Multiple trees
One characteristic of the coalescent process is its variability. Here
we simulate 12 different trees (each with 10 tips) and plot
them to appreciate how variable
trees can be, even if simulated from the same neutral process.
```{r}
system2(command = ms_binary(), args = "10 12 -T", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:12
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
### Population structure
The standard coalescent can easily be updated to include population structure. Basically,
at the time of sampling (i.e., "the present"), you specify that your samples are from $K$
different populations. This is done with the `-I` flag. For example, `-I 2 20 30` means that
there are 2 subpopulations with
the first 20 (haploid) samples from the first subpopulation, and the next 30 from the second
subpopulation. Note that the nsam parameter (number of samples) must be 50 = 20 + 30 in that case.
After the `-I 2 20 30` you then give a number which is the _population scaled migration rate_, $4Nm$,
where $m$ in the per-generation probability that a parent in the population is an
immigrant from a different population. (You can actually specify a whole migration matrix if you
want, but we won't worry about that now.) You can specify an arbitrary number of populations,
but we will only focus on two of them in this notebook.
What does the coalescent with migration look like? When everyone gets down to this
point, ask Eric, and he will show you with a small example.
Think about this: what does $4Nm$ mean in terms of how many migrational events you would
expect to see in the history of a pair of gene copies? And why? (Once again, think of
each branch as a bead-string of generations...)
Let's do two populations, each of 10 chromosomes ($2N = 10$), and let the migration
rate be quite low. Like $4Nm = 0.01$, to see what we get. Let's do 6 reps:
```{r}
system2(command = ms_binary(), args = "20 6 -T -I 2 10 10 0.01", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
OK, at such low migration rates, we see that all 10 gene copies in each population have,
almost without exception, coalesced with one another before coalescing with any genes from
the other population.
That is not the always the case when migration rate gets higher, for example $4Nm = 0.2$.
```{r}
system2(command = ms_binary(), args = "20 6 -T -I 2 10 10 0.2", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
By taking the limit as the mutation rate goes to zero, Monty Slatkin, showed
that the parameter $F_\mathrm{ST}$ can be defined in terms of the expected coalescent
times between pairs of genes drawn within and between different populations.
$$
F_\mathrm{ST} = \frac{\bar{t} - \bar{t}_0}{\bar{t}}
$$
where $\bar{t}_0$ is the expected coalescence time between two genes drawn at random
from within subpopulations and $\bar{t}$ is the expected coalescence time for two gene
copies drawn at random from amongst all those in either subpopulation [@slatkinInbreedingCoefficientsCoalescence2007].
### Population growth
As the manual for `ms` notes, "The switch `-G` [used like `ms nsam nreps -t 4 -G a`] is used to specify that the population has been growing (or shrinking) exponentially. That is, the population size at time $t$ in the past is given by: $N(t) = N_0 \exp^{-at}$, where $t$ is the time before the present, measured in units of $4N_0$ generations." Note that $N_0$ is the size of the population at the time of sampling. So, if $a>0$ it means the population was smaller further back in time.
Let's make some plots to grok out what this means. First, let's pretend that we are thinking about a
population that, at the time of sampling is of size $N_0 = 10,000$. And so, if we were interested in
what its population size was, say 34,000 generations ago, then that would mean that time, measured in units
of $4N_0$ generations would be:
$$
t = \frac{34,000}{4N_0} = \frac{34,000}{40,0000} \approx 0.85
$$
So, let's make a picture of what the population size looks like at different times going back in the past.
You can play with the values of `N_0` and `a` below to see what you get
```{r}
# define parameter values
N_0 <- 10000
a = 5
# make a tibble of different values at different times
ptib <- tibble(
generations = seq(0, 5 * N_0, by = N_0 / 100)
) %>%
mutate(t = generations / (4 * N_0),
Nt = N_0 * exp(-t * a)
)
# make a plot
ggplot(ptib, aes(x = generations, y = Nt)) +
geom_line() +
scale_x_continuous("Time in generations", sec.axis = sec_axis(~ . / (4 * N_0), name = "Time in 4N_0 generations" ))
```
Population growth can stretch trees out. If a population is growing exponentially
then the terminal branches tend to be longer. Let's experiment with that.
```{r}
system2(command = ms_binary(), args = "10 6 -T -G 5", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
Yep, longer terminal branches. Play around with different settings and check out the
results.
### Instantaneous population size changes
Past changes in the populations are controlled by values of `-eSomething` options. Where
the e stands for event. The simplest of these are instantaneous changes in population
size. The simplest is:
* `-en t i x` which means "Set subpop $i$ size to $x \times N_0$ at time $t$ and growth rate to zero.
So, for example, in our migration model above we could make the two populations very different sizes
and see what the results are. Let's make population 2 ten times bigger than population 1 from
the present all the way back into the past...
```{r}
system2(command = ms_binary(), args = "20 6 -T -I 2 10 10 0.01 -en 0 2 10", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
Aha! That is interesting. It is hard to tell, because the coalescence times are so short in population
1, but if you squint, you can convince yourself that, on average, the trees in population 2 reach
a MRCA in about 10 times the amount of time as the trees in population 1.
The results also show that there is a lot of variability in the coalescent process.
### Population Splitting/Merging
My goodness, there are a lot of options in `ms`. To see the full rundown on all the options,
you can read the full documentation: [msdoc.pdf](https://eriqande.github.io/coalescent-hands-on/msdoc.pdf). But for now, we will
just discuss one more, used to join or split two populations going back in time.
* `-ej t i j` which means move all lineages in subpopulation $i$ into subpopulation $j$ at time $t$
in units of $4N_0$. The `j` means "join". It corresponds to populations "joining" while going
backward in time. But, _forward in time_ this corresponds to population _splitting_.\
* `-es t i p` which means split subpopulation $i$ into two subpopulations $i$ and $npop + 1$ $t$
(in units of $4N_0$). Lineages currently in $i$ remain in $i$ with probability $p$ and
with probability $1-p$ they get placed into subpopulation $npop + 1$. The `s` means split.
It corresponds to populations "joining" while going forward in time.
**Discuss with your partner**: What are some conservation-genetic scenarios that
these last two demographic scenarios might correspond to?
Here is an example of a population join going back in time. We have 10 samples from
each population with no migration between them, and the populations join at time $2N_0$ in the
past (at which point most will have coalesced within subpopulations):
```{r}
system2(command = ms_binary(), args = "20 6 -T -I 2 10 10 0.00 -ej 0.5 1 2", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
## Your Mission/Homework!
All right, after playing around with all the above, look at the tree below and, with your
group:
1. think about and discuss what the demography might have looked like to produce the tree.
Think about how different population sizes must have been and at which time periods population
size changes might have happened.
2. Check your hypotheses by generating some trees with `ms`.
Here is a hint. To make the trees below, I didn't bother with exponential
growth, I just used (quite unrealistic) instantaneous population size changes.
```{r, echo=FALSE}
# Two subpopulations. The first one of size N0. We want to expect
# TMRCA for that one to be X = 4N0. For subpop 2 we want to expect only
# one coalescence by 5X. 10 lineages in each subpopulation, so n(n-1) is
# 90. That means subpop 2 must be 450 times bigger than subpop 1. At time
# 5X we join subpops 1 and 2, and set the size of the merged population to
# N0 so we have the same expected time to MRCA as we did with subpop 1 in the
# beginning.
system2(command = ms_binary(), args = "20 6 -T -I 2 10 10 0.00 -en 0 2 450 -ej 20 2 1 -en 20.001 1 1 ", stdout = "out.tree", stderr = FALSE)
ctree <- read.tree("out.tree")
names(ctree) <- 1:6
ggtree(ctree, layout = "rectangular") +
facet_wrap(~.id, ncol = 3) +
geom_tiplab(size = 2)
```
## References