004-one-dimensional-SFS.Rmd

---
title: "One-dimensional Site Frequency Spectra"
output: 
  html_notebook:
    toc: true
    toc_float: true
author: "Eric C. Anderson"
bibliography: references.bib
---


```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(GSImulator)
library(ape)
library(ggtree)
```


## Introduction

In class, we spoke about the fact that, under the neutral coalescent with 
inifinite sites mutation, the expected number of mutations that were inherited
by $i$ haploid chromosomes in our sample is $\theta/i$, where $\theta = 4N\mu$.

Let's write that out using the notation that $S_n^{i}$ denotes the number
of mutations that were inherited by $i$ haploid chromosomes in the sample.  We
have:
$$
E(S_n^{i}) = \frac{\theta}{i}~~,~~i = 1,\ldots, n-1
$$

So, what does that look like?  Let's make a picture.
Imagine we have a sample of 25 diploids, so 50 haploid chromosomes
in our samples.  The total number of SNPs we might find in that
sample depends on $\theta$, of course, but let's simple imagine that
we found 100K SNPs total.  If they were generated from a neutral coalescent
we could compute the proportions they are expected to be in using the
equation above (basically we know that we expect to have 1/2 as many mutations found
in two copies as in one copy, and a third as many in 3 copies as in 1 copy, and
so forth). In R code we could say:
```{r}
NumSNPs <- 100000
NumSam <- 50 # number of haploid chromosomes

# here are the unnormalized fractions we expect to find
fracts <-  1 / 1:(NumSam - 1)

# normalize so they sum to one, the multiply by NumSNPs
Sin <- NumSNPs * fracts / sum(fracts)

# let's plot that:
tibble(
  i = 1:(NumSam - 1),
  E_of_S = Sin) %>%
ggplot(aes(x = i, y = E_of_S)) +
  geom_col() +
  theme_bw()
```

Question 1: with a sample of size 50 gene copies, what is the allele count for
a SNP with a minor allele frequency of 0.1?

Question 2: Look at the figure and roughly estimate how many SNPs are expected to
have a minor allele frequency less than 0.1.  

Does the answer explain why your data set gets so much smaller when you filter
on minor allele frequency?


## Uses of the SFS

The 1d (and 2-d and higher) site frequency spectra are often used to make demographic
inference about the populations that produced them.  Site frequency spectra are
quite simple to compute, and represent a huge reduction in data, which means
that they can be useful when dealing with huge genomic data sets.

We won't do any actual inference from site frequency spectra. The programs
$\partial a \partial i$ and _moments_ will do that.  We will talk about that next
semester.  But for now, I just want to give students the chance to explore how 
the 1-d SFS changes under different demographic scenarios.

## Generating SFS from coalescent simulations

To compute average SFS from coalescent simulations, we can simulate thousands of
independent coalescent trees, each with a single mutation on them, and then
process those in R to compute the SFS.  Here is how we would simulate 10K independent
SNPs with a sample of 25 diloids (50 haploids):
```{r}
system2(command = ms_binary(), args = "50 10000 -s 1", stdout = "out.snps", stderr = FALSE)
```
That creates a rather large file (1.3 Mb or so).  We can, however, read that
into R and return a site frequency spectrum (which is really little more than
a table of counts) using this nifty function:
```{r}
#' parse ms output with one snp per tree simulated and return the SFS
#'
#' @param nsam The number of haploid chromosomes simulated
#' @param file The name of the file.  By default = "out.snps"
ms2sfs <- function(nsam, file = "out.snps") {
  x <- readr::read_lines(file)
  mat <- matrix(x[x == "0" | x == "1"], nrow = nsam)
  
  tibble(
    S = as.numeric(colSums(mat == 1))
  ) %>%
    count(S) %>%
    left_join(tibble(i = 1:(nrow(mat) - 1)), ., by = c("i" = "S")) %>%
    rename(S = n) %>%
    mutate(S = ifelse(is.na(S), 0, S))
}
```

See how that works like this:
```{r}
S <- ms2sfs(50)
S
```

And now, it would be nice to be able to plot that SFS, along with
a picture of what the neutral, standard coalescnet would look like.
Here is a function for that:
```{r}
#' plot an SFS that comes out of ms2sfs()
#' @param S the SFS tibble that comes out of ms2sfs()
#' @param add_neural logical.  If true, a blue dashed frame is printed
# in the location of the expected SFS under the standard coalescent
# with nothing fancy going on.
plot_sfs <- function(S, add_neutral = TRUE) {
  g <- ggplot(S, aes(x = i, y = S)) +
  geom_col(fill = "gray") +
  theme_bw()
  
  if(add_neutral == TRUE) {
    NumSNPs <- sum(S$S)
    NumSam <- max(S$i) + 1
    fracts <-  1 / 1:(NumSam - 1)
    Sin <- NumSNPs * fracts / sum(fracts)
    
    expy <- tibble(
      i = 1:(NumSam - 1),
      S = Sin)
    
    g <- g +
      geom_col(data = expy, fill = NA, colour = "blue", linetype = "dashed", size = 0.5)
  }
  g
}
```

Now, we can have at it in three easy steps:
```{r}
system2(command = ms_binary(), args = "50 10000 -s 1", stdout = "out.snps", stderr = FALSE)
S <- ms2sfs(50)
plot_sfs(S)

```

And we see that our observed SFS looks a lot like that expected under
the neutral coalescent.

## Your Task

Now, think about how you might simulate different scenarios that would give you
an SFS pushed more to the left (more singletons and low-frequency variants---what
would the tree look like for that?), and also scenarios that would give
you a SFS pushed more to the right (more high-frequency variants).  Then simulate
those scenarios and verify your intuition.

Turn in a narrative about what you did and why you thought it would change the SFS
in the way that it did, your R code, and your pictures.