forked from eriqande/coalescent-hands-on
-
Notifications
You must be signed in to change notification settings - Fork 0
/
004-one-dimensional-SFS.Rmd
173 lines (140 loc) · 5.54 KB
/
004-one-dimensional-SFS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
---
title: "One-dimensional Site Frequency Spectra"
output:
html_notebook:
toc: true
toc_float: true
author: "Eric C. Anderson"
bibliography: references.bib
---
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(GSImulator)
library(ape)
library(ggtree)
```
## Introduction
In class, we spoke about the fact that, under the neutral coalescent with
inifinite sites mutation, the expected number of mutations that were inherited
by $i$ haploid chromosomes in our sample is $\theta/i$, where $\theta = 4N\mu$.
Let's write that out using the notation that $S_n^{i}$ denotes the number
of mutations that were inherited by $i$ haploid chromosomes in the sample. We
have:
$$
E(S_n^{i}) = \frac{\theta}{i}~~,~~i = 1,\ldots, n-1
$$
So, what does that look like? Let's make a picture.
Imagine we have a sample of 25 diploids, so 50 haploid chromosomes
in our samples. The total number of SNPs we might find in that
sample depends on $\theta$, of course, but let's simple imagine that
we found 100K SNPs total. If they were generated from a neutral coalescent
we could compute the proportions they are expected to be in using the
equation above (basically we know that we expect to have 1/2 as many mutations found
in two copies as in one copy, and a third as many in 3 copies as in 1 copy, and
so forth). In R code we could say:
```{r}
NumSNPs <- 100000
NumSam <- 50 # number of haploid chromosomes
# here are the unnormalized fractions we expect to find
fracts <- 1 / 1:(NumSam - 1)
# normalize so they sum to one, the multiply by NumSNPs
Sin <- NumSNPs * fracts / sum(fracts)
# let's plot that:
tibble(
i = 1:(NumSam - 1),
E_of_S = Sin) %>%
ggplot(aes(x = i, y = E_of_S)) +
geom_col() +
theme_bw()
```
Question 1: with a sample of size 50 gene copies, what is the allele count for
a SNP with a minor allele frequency of 0.1?
Question 2: Look at the figure and roughly estimate how many SNPs are expected to
have a minor allele frequency less than 0.1.
Does the answer explain why your data set gets so much smaller when you filter
on minor allele frequency?
## Uses of the SFS
The 1d (and 2-d and higher) site frequency spectra are often used to make demographic
inference about the populations that produced them. Site frequency spectra are
quite simple to compute, and represent a huge reduction in data, which means
that they can be useful when dealing with huge genomic data sets.
We won't do any actual inference from site frequency spectra. The programs
$\partial a \partial i$ and _moments_ will do that. We will talk about that next
semester. But for now, I just want to give students the chance to explore how
the 1-d SFS changes under different demographic scenarios.
## Generating SFS from coalescent simulations
To compute average SFS from coalescent simulations, we can simulate thousands of
independent coalescent trees, each with a single mutation on them, and then
process those in R to compute the SFS. Here is how we would simulate 10K independent
SNPs with a sample of 25 diloids (50 haploids):
```{r}
system2(command = ms_binary(), args = "50 10000 -s 1", stdout = "out.snps", stderr = FALSE)
```
That creates a rather large file (1.3 Mb or so). We can, however, read that
into R and return a site frequency spectrum (which is really little more than
a table of counts) using this nifty function:
```{r}
#' parse ms output with one snp per tree simulated and return the SFS
#'
#' @param nsam The number of haploid chromosomes simulated
#' @param file The name of the file. By default = "out.snps"
ms2sfs <- function(nsam, file = "out.snps") {
x <- readr::read_lines(file)
mat <- matrix(x[x == "0" | x == "1"], nrow = nsam)
tibble(
S = as.numeric(colSums(mat == 1))
) %>%
count(S) %>%
left_join(tibble(i = 1:(nrow(mat) - 1)), ., by = c("i" = "S")) %>%
rename(S = n) %>%
mutate(S = ifelse(is.na(S), 0, S))
}
```
See how that works like this:
```{r}
S <- ms2sfs(50)
S
```
And now, it would be nice to be able to plot that SFS, along with
a picture of what the neutral, standard coalescnet would look like.
Here is a function for that:
```{r}
#' plot an SFS that comes out of ms2sfs()
#' @param S the SFS tibble that comes out of ms2sfs()
#' @param add_neural logical. If true, a blue dashed frame is printed
# in the location of the expected SFS under the standard coalescent
# with nothing fancy going on.
plot_sfs <- function(S, add_neutral = TRUE) {
g <- ggplot(S, aes(x = i, y = S)) +
geom_col(fill = "gray") +
theme_bw()
if(add_neutral == TRUE) {
NumSNPs <- sum(S$S)
NumSam <- max(S$i) + 1
fracts <- 1 / 1:(NumSam - 1)
Sin <- NumSNPs * fracts / sum(fracts)
expy <- tibble(
i = 1:(NumSam - 1),
S = Sin)
g <- g +
geom_col(data = expy, fill = NA, colour = "blue", linetype = "dashed", size = 0.5)
}
g
}
```
Now, we can have at it in three easy steps:
```{r}
system2(command = ms_binary(), args = "50 10000 -s 1", stdout = "out.snps", stderr = FALSE)
S <- ms2sfs(50)
plot_sfs(S)
```
And we see that our observed SFS looks a lot like that expected under
the neutral coalescent.
## Your Task
Now, think about how you might simulate different scenarios that would give you
an SFS pushed more to the left (more singletons and low-frequency variants---what
would the tree look like for that?), and also scenarios that would give
you a SFS pushed more to the right (more high-frequency variants). Then simulate
those scenarios and verify your intuition.
Turn in a narrative about what you did and why you thought it would change the SFS
in the way that it did, your R code, and your pictures.