---
title: "Statistics for e-Science"
subtitle: "Week 4 exercises"
author: "Tycho Canter, Gowthami Rajukkannu and Antonio Ortega"
date: "December 19, 2016"
output: 
  ioslides_presentation: 
    logo: figures/ku_logo.png
runtime: shiny
---

```{r setup, include = FALSE, tidy = T}
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
```

```{r, echo = F, message = F}
library("ggfortify")
library("magrittr")
library("ggplot2")
library("ggvis")
library("dplyr")
library("shiny")
library("grid")
library("png")
text_size <- 18
theme_set(theme_bw(base_size = text_size))

# p <- ggplot(mtcars, aes(mpg, wt)) +
#   geom_point()
# p

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

```

## Exercise 4.4 - Full and nested models

```{r fig.width = 6, echo = FALSE}
img <- readPNG("figures/investigacion.png")
grid.raster(img)
```


## Is the nested model enough?

- Simulate data coming from similar, but *not identical* normal distributions

- Run tests to evaluate the significance of the difference in means

- Conclude whether the two distributions can be treated as one, and compare what the tests yield

## 4.4.1 - Simulate data

Assume: 

$X \sim N(\mu = 2, \sigma = 4)$

$Y \sim N(\mu = 2.5, \sigma = 4)$


```{r, echo = T, eval = T}
x <- rnorm(20, 2, 4)
y <- rnorm(40, 2.5, 4)
xy <- c(x, y)
```
- x consists of a sample of X with size 20

- y consists of a sample of Y with size 40

- xy results from merging both samples into a single one


```{r}
x.mean <- mean(x)
y.mean <- mean(y)
```


## Theory and practice
```{r, message = F}
p <- ggdistribution(dnorm, x = seq(-10, 14, 0.1), mean = 2, sd = 4, fill = "blue", alpha = 0.2) +
  geom_vline(xintercept = 2, linetype = "longdash") +
  scale_x_continuous(breaks = seq(-10, 14, 2))

p <- ggdistribution(dnorm, x = seq(-10, 14, 0.1), p = p, mean = 2.5, sd = 4, fill = "red", alpha = 0.2) +
  geom_vline(xintercept = 2.5, linetype = "longdash") +
  ggtitle(label = "Theoretical distributions")

q <- autoplot(density(x), fill = "blue")
q <- autoplot(density(y), p = q, fill = "red") +
  ggtitle("Simulated data") +
  scale_x_continuous(breaks = seq(-10, 14, 2))

multiplot(p, q, cols = 2)
```

## Models

* **Full model: distinguish between X and Y**

x and y come from different distributions (different means)

Parameter space: $\Theta \rightarrow \{ \mu_X, \mu_Y, \sigma \}$


* **Nested model: no distinction between X and Y**

Assume x and y came from the same distribution

Parameter space: $\Theta_0 \rightarrow \{ \mu, \sigma \}$
$( \mu_X = \mu_Y )$


**The nested model is a simplification of the full model: it assumes the two means are close enough to be treated as equal**

## 4.4.2 - Perform MLE and compute ML values

*MLE = Maximum Likelihood Estimation*

We want to minimise the -log(likelihood) value for each model using the simulated data

Minimise: $-\sum { \log { f_N\left( x_{ i } \right)  } }$

## 4.4.2 - Perform MLE and compute ML values 

Minimise (optimise), i.e. find the parameters that minimise:

Full model                                                                                                                                      | Nested model 
-------------                                                                                                                                   | ------------- 
$- \left( \sum_{ i = 1 }^{ 20 } { \log { f_X\left( x_{ i } \right)  } } + \sum_{ i = 1}^{40} { \log { f_Y\left( y_{ i } \right)  } } \right)$  | $- \sum_{ i = 1 }^{ 60 } { \log { f_{XY}\left( xy_{ i } \right) } }$

**Compute likelihood using optimized parameters**

1. Compute the density of each value under the model with the ML parameters
2. Multiply all densities to obtain the likelihood of the parameters (equivalently, sum the log-densities)
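
As a rough illustration of these two steps, the sketch below evaluates the likelihood of a simulated sample both as a product of densities and as a sum of log-densities. The sample, the seed and the variable names are assumptions made for this example only; the log scale is usually preferred because the raw product shrinks towards zero as the sample grows.

```r
# Hypothetical sample; seed, size and parameters are illustrative only
set.seed(1)
sample.xy <- rnorm(60, mean = 2, sd = 4)

# Step 1: density of each value under the candidate parameters
dens <- dnorm(sample.xy, mean = 2, sd = 4)

# Step 2: the likelihood is the product of the densities...
lik <- prod(dens)

# ...and the log-likelihood is the sum of the log-densities
loglik <- sum(dnorm(sample.xy, mean = 2, sd = 4, log = TRUE))

# Both routes give the same value on the log scale
all.equal(log(lik), loglik)
```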

## R Implementation

```{r, echo = T}
# nested
mllk <- function(params) {
    -(xy %>% dnorm(params[1], params[2], log=TRUE) %>% sum)
}

nested.model <- optim(c(1,1), mllk)
  
# full
mllk <- function(params) {
  -(x %>% dnorm(params[1], params[3], log=TRUE) %>% sum) - 
    (y %>% dnorm(params[2], params[3], log=TRUE) %>% sum)
}

full.model <- optim(c(1,1,1), mllk)
```
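
For normal models with a common standard deviation the ML estimates also have a closed form (sample means, and a standard deviation that divides by n rather than n - 1), which offers a quick sanity check on the numerical optimisation. The sketch below is a standalone example under that assumption; the seed and the names `nll.nested`, `nll.full` and `fit.nested` are made up for illustration.

```r
# Standalone sketch: simulated data with an arbitrary seed
set.seed(1)
x <- rnorm(20, 2, 4)
y <- rnorm(40, 2.5, 4)
xy <- c(x, y)
n <- length(xy)

# Nested model: one mean, one sd (the ML estimate of sd divides by n)
mu.hat <- mean(xy)
sigma0.hat <- sqrt(mean((xy - mu.hat)^2))
nll.nested <- -sum(dnorm(xy, mu.hat, sigma0.hat, log = TRUE))

# Full model: two means, one shared sd
rss.full <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
sigma1.hat <- sqrt(rss.full / n)
nll.full <- -sum(dnorm(x, mean(x), sigma1.hat, log = TRUE)) -
  sum(dnorm(y, mean(y), sigma1.hat, log = TRUE))

# Numerical optimisation of the nested model, for comparison
fit.nested <- suppressWarnings(
  optim(c(1, 1), function(p) -sum(dnorm(xy, p[1], p[2], log = TRUE)))
)

c(closed.form = nll.nested, optim = fit.nested$value)
```

The full model can never fit worse than the nested one, so `nll.full` should not exceed `nll.nested`.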

## 4.4.3 - Compute the LRT statistic and the p-value

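Compute the quotient of likelihoods $\Lambda$ between the nested model and the full model

*This quotient will ALWAYS be* $\le 1$

```{r, echo = T}
# Difference of minimised -log(likelihood) values, i.e. q = -log(Lambda) >= 0
q <- nested.model$value  - full.model$value
```

If the nested model captures the same information as the full one, $-2\log{\Lambda}$ approximately follows a $\chi^2$ distribution whose degrees of freedom equal the number of parameters dropped by the nested model (here 3 - 2 = 1)

$-2\log{\left( \Lambda \right)} = 2q \sim \chi^{2}_{1}$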

```{r}
statistic <- 2 * q
```

## P-value (LRT)

$H_0$: The nested model is sufficient

* The p-value gives the probability, assuming $H_0$ is true, of observing a statistic at least as extreme as the one observed

* If it is low, we reject the null hypothesis

```{r, echo = T}
lrt.pval <- 1 - pchisq(statistic, df = 1)
lrt.pval
```

## 4.4.4 P-value (T-test)

$H_0$: The nested model is sufficient

* We can also use Student's t-test

* All distributions considered have the same variance (homoscedastic)

```{r, echo = T}
ttest.pval <- t.test(x, y,
                     var.equal = TRUE,
                     alternative = "two.sided")$p.value

ttest.pval
```
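
For two normal samples with a common variance, the LRT statistic and the t statistic are tied by a standard identity: $-2\log{\Lambda} = n \log{\left(1 + t^2 / (n - 2)\right)}$, with n the total sample size. The sketch below checks this on simulated data; the seed and the names `t.stat`, `lrt.stat`, `rss.full` and `rss.nested` are assumptions for the example, and the LRT is computed in closed form rather than with `optim`.

```r
# Standalone sketch: simulated data with an arbitrary seed
set.seed(1)
x <- rnorm(20, 2, 4)
y <- rnorm(40, 2.5, 4)
n <- length(x) + length(y)

# Two-sample t statistic with pooled variance
t.stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)

# Closed-form LRT statistic: n * log(RSS_nested / RSS_full)
rss.full <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
rss.nested <- sum((c(x, y) - mean(c(x, y)))^2)
lrt.stat <- n * log(rss.nested / rss.full)

# The identity linking both statistics
all.equal(lrt.stat, n * log(1 + t.stat^2 / (n - 2)))

# The two p-values use different reference distributions
pchisq(lrt.stat, df = 1, lower.tail = FALSE)
2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE)
```

Because one statistic is a monotone function of the other, the two tests rank samples identically; the p-values differ only because the $\chi^2$ reference is an approximation while the t distribution is exact here.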

## Compare p-values

```{r}
ggplot(data = data.frame(test = as.factor(c("Likelihood ratio", "Student's T")), p_value = c(lrt.pval, ttest.pval))) +
  geom_bar(aes(x = test, y = p_value, fill = test), stat = "identity") +
  scale_y_continuous(limits = 0:1)
```

## Distribution of p-values

```{r}
N <- 100000
t.values <- numeric(N)
lrt.values <- numeric(N)

# Progress bar: uncomment while debugging  
#pb <- tcltk::tkProgressBar(title = "progress bar", min = 0,
#                    max = N, width = 300)

for(i in 1:N)
{
  x <- rnorm(20, 2, 4)
  y <- rnorm(40, 2.5, 4)
  x.mean <- mean(x)
  y.mean <- mean(y)
  xy <- c(x, y)
  
  # nested
  mllk <- function(params) {
      -(xy %>% dnorm(params[1], params[2], log=TRUE) %>% sum)
  }
  
  nested.model <- optim(c(1,1), mllk)
    
  # full
  mllk <- function(params) {
    -(x %>% dnorm(params[1], params[3], log=TRUE) %>% sum) - 
      (y %>% dnorm(params[2], params[3], log=TRUE) %>% sum)
  }

  full.model <- optim(c(1,1,1), mllk)
  
  q <- nested.model$value - full.model$value
  
  statistic <- 2 * q
  
  lrt.pval <- 1 - pchisq(statistic, df = 1)
  
  
  ttest.pval <- t.test(x, y,
                       var.equal = TRUE,
                       alternative = "two.sided")$p.value
  
  t.values[i] <- ttest.pval
  lrt.values[i] <- lrt.pval
  
  # setTkProgressBar(pb, i,
  #                  label = paste( round(i / N * 100, 0), "% done"))
}

pvalue.data <- data.frame(id = factor(rep(c("t.test", "lrt"), each = N)), value = c(t.values, lrt.values))
```

```{r, eval = T}
ggplot(pvalue.data, aes(x = value, fill = id, y = ..density..)) +
  geom_histogram(alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.5) +
  facet_wrap( ~ id)

# ggplot(pvalue.data, aes(x = value, fill = id)) +
#   geom_histogram(alpha = 0.5, position = "identity") +
#   facet_wrap( ~ id)
```
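
One way to summarise the two histograms is the share of p-values falling below a threshold, which estimates how often each test detects the difference of means. The sketch below is a lighter standalone version of the simulation: it assumes 2000 repetitions, a 0.05 threshold and a closed-form LRT statistic, none of which come from the slides, and the helper `simulate.pvalues` is hypothetical.

```r
set.seed(1)
n.rep <- 2000   # far fewer repetitions than the slides, for speed

# One simulated experiment: returns the LRT and t-test p-values
simulate.pvalues <- function() {
  x <- rnorm(20, 2, 4)
  y <- rnorm(40, 2.5, 4)
  n <- length(x) + length(y)
  rss.full <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
  rss.nested <- sum((c(x, y) - mean(c(x, y)))^2)
  lrt.stat <- n * log(rss.nested / rss.full)
  c(lrt = pchisq(lrt.stat, df = 1, lower.tail = FALSE),
    t.test = t.test(x, y, var.equal = TRUE)$p.value)
}

# 2 x n.rep matrix, one column per simulated experiment
pvals <- replicate(n.rep, simulate.pvalues())

# Proportion of experiments rejecting H0 at the 0.05 level
rowMeans(pvals < 0.05)
```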


## 4.5 - Simulations on Confidence Intervals

```{r fig.width = 6, echo = FALSE}
img <- readPNG("figures/stars.png")
grid.raster(img)
```

## Study the behaviour of Confidence Intervals according to:

* Sample size

* $\alpha$, the probability of leaving the actual mean out of the interval

## 4.5.1 - Set n = 20, μ = 2, σ = 3, N = 1000


```{r, echo = T}
n <- 20
mu <- 2
sigma <- 3
N <- 1000
```

## 4.5.2 - Simulate an observation x of length n following N (μ, σ) with mean μ and standard deviation σ.

```{r, echo = T}
x <- rnorm(n, mu, sigma)
```

## 4.5.3 - Compute the 95%-confidence interval for μ using data x.

Process each sample 

```{r, echo = T}
# Compute mean and sem of sample
average <- mean(x)
sem <- sd(x) / sqrt(n)
```

## 4.5.3 - Compute the 95%-confidence interval for μ using data x.

Build the confidence interval

```{r, echo = T}
p <- 0.95  # Confidence level
z <- qt(p + (1 - p) / 2, df = n - 1)  # 0.975 quantile of Student's t with n - 1 df

inferior <- average - z * sem
superior <- average + z * sem
inferior
superior
```
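
The interval built by hand can be checked against the one reported by `t.test()`, which uses the same Student's t quantile. This is a standalone sketch; the seed and the names `manual.ci` and `builtin.ci` are assumptions for the example.

```r
# Standalone sketch: one simulated sample with an arbitrary seed
set.seed(1)
n <- 20
x <- rnorm(n, mean = 2, sd = 3)
level <- 0.95

# Manual interval: mean +/- t quantile * standard error of the mean
t.quantile <- qt(1 - (1 - level) / 2, df = n - 1)
sem <- sd(x) / sqrt(n)
manual.ci <- mean(x) + c(-1, 1) * t.quantile * sem

# Interval reported by the one-sample t-test
builtin.ci <- as.numeric(t.test(x, conf.level = level)$conf.int)

all.equal(manual.ci, builtin.ci)
```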

## 4.5.4 - Check if the confidence interval contains μ and output a True or False.

```{r, echo = T}
result <- mu > inferior & mu < superior
```

## 4.5.5 - Repeat steps 2 - 4 for N times. How many repetitions give a True?

* Put all the code in 4.5.2 - 4.5.4 in a for loop running N times

* Store mean and sem of each sample

   + *average* and *sem* become vectors storing each sample mean and sem
  
* *result* becomes a logical vector
   
   + Element i tells if the CI i contains $\mu$ or not
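
The steps listed above could be sketched as follows. This is one possible layout of the loop rather than the code behind the app, and the seed is an assumption added for reproducibility.

```r
set.seed(1)
n <- 20
mu <- 2
sigma <- 3
N <- 1000

# Pre-allocate one slot per repetition
average <- numeric(N)
sem <- numeric(N)
result <- logical(N)

t.quantile <- qt(0.975, df = n - 1)   # 95% confidence level

for (i in seq_len(N)) {
  x <- rnorm(n, mu, sigma)
  average[i] <- mean(x)
  sem[i] <- sd(x) / sqrt(n)
  inferior <- average[i] - t.quantile * sem[i]
  superior <- average[i] + t.quantile * sem[i]
  # TRUE when interval i contains the true mean
  result[i] <- mu > inferior & mu < superior
}

sum(result)    # number of intervals containing mu
mean(result)   # observed coverage, expected to be close to 0.95
```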

## Visualization [App](https://antortjim.shinyapps.io/shiny_4_5/) 

```{r, echo = F}
shinyAppFile(appFile = "ex4_5/app.R")
```

## 4.6 - Power of a statistical test

```{r fig.width = 6, echo = FALSE}
img <- readPNG("figures/StatPower.png")
grid.raster(img)
```

## Formulation of the problem

* A blood test measurement follows the $X$ distribution

* $X \sim N(\mu, \sigma)$

* Healthy population $\rightarrow X_1 \sim N(\mu = 0, \sigma = 1)$

* Sick population $\rightarrow X_2 \sim N(\mu = 2, \sigma = 1)$

## Formulation of the problem

```{r, include = F}
lower <- -5
upper <- +7
length.out <- 1001

healthy <- curve(dnorm(x, mean = 0, sd = 1), from = lower, to = upper, n = length.out)$y
sick <- curve(dnorm(x, mean = 2, sd = 1), from = lower, to = upper, n = length.out)$y
blood.data <- data.frame(id = factor(rep(c("healthy", "sick"), each = length.out)),
                         X = seq(lower, upper, length.out = length.out), 
                         p = c(healthy, sick))
```
```{r}
p <- ggplot(data = blood.data, aes(x = X, y = p, fill = id)) +
  geom_area(alpha = 0.5, position = 'identity') +
  geom_line(size = 1.5, aes(col = id)) + 
  guides(col = FALSE) +
  scale_fill_discrete(name = "Population",
                    breaks = c("healthy", "sick"),
                    labels = c("Healthy", "Sick")) +
  scale_y_continuous(name = "Density") +
  scale_x_continuous(limits = c(lower, upper))
  
p
```


## Formulation of the problem

* Statistical test to tell if person is sick

* The null hypothesis assumes that the measurement comes from $X_1$

$H_0: E(X) = 0$

* We reject the null hypothesis if statistic is greater than a threshold

$t\left(X\right) > \tau$

* The statistic used is X itself for simplicity (no need to process it)

## 4.6.1 - Calculating the probability of α and β

* Use $\tau = 2$

* Type I errors (α) are false positives. Very dangerous in science!

* Type II errors (β) are false negatives

* The power of the test is 1 - β

## 4.6.1 - Calculating the probability of α

```{r}
  tau <- 2
  ggplot(data = subset(blood.data, id == "healthy"), aes(x = X, y = p, fill = "red")) +
  geom_area(alpha = 0.5, position = 'identity') +
  geom_line(size = 1.5, col = "red") + 
  guides(col = FALSE, fill = FALSE) +
  scale_fill_discrete(name = "Population",
                    breaks = c("healthy", "sick"),
                    labels = c("Healthy", "Sick")) +
  scale_y_continuous(name = "Density") +
  scale_x_continuous(name = "X", limits = c(lower, upper), breaks = seq(lower, upper, 1))

```

## 4.6.1 - Calculating the probability of α

```{r}
  ggplot(data = filter(blood.data, id == "healthy" & X < tau), aes(x = X, y = p)) +
  geom_area(alpha = 0.5, position = 'identity', fill = "red") +
  geom_line(size = 1.5, col = "red") + 
  scale_y_continuous(name = "Density") +
  scale_x_continuous(name = "X", limits = c(lower, upper), breaks = seq(lower, upper, 1)) +
  geom_area(data = filter(blood.data, id == "healthy" & X > tau), aes(x = X, y = p),
            fill = "grey9") +
  geom_line(data = filter(blood.data, id == "healthy" & X > tau),
            size = 1.5, col = "grey9") +
  scale_fill_discrete(name = "Population",
                    breaks = c("healthy", "sick"),
                    labels = c("Healthy", "Sick"))
```


## 4.6.1 - Calculating the probability of α

```{r, echo = T}
1 - pnorm(q = 2, mean = 0, sd = 1)
```

## 4.6.1 - Calculating the probability of β

```{r}
  ggplot(data = subset(blood.data, id == "sick"), aes(x = X, y = p)) +
  geom_area(alpha = 0.5, position = 'identity', fill = "blue") +
  geom_line(size = 1.5, col = "blue") + 
  guides(col = FALSE, fill = FALSE) +
  scale_fill_discrete(name = "Population",
                    breaks = c("healthy", "sick"),
                    labels = c("Healthy", "Sick")) +
  scale_y_continuous(name = "Density") +
  scale_x_continuous(name = "X", limits = c(lower, upper), breaks = seq(lower, upper, 1))

```

## 4.6.1 - Calculating the probability of β

```{r}
  ggplot(data = filter(blood.data, id == "sick" & X > tau), aes(x = X, y = p)) +
  geom_area(alpha = 0.5, position = 'identity', fill = "blue") +
  geom_line(size = 1.5, col = "blue") + 
  scale_y_continuous(name = "Density") +
  scale_x_continuous(name = "X", limits = c(lower, upper), breaks = seq(lower, upper, 1)) +
  geom_area(data = filter(blood.data, id == "sick" & X < tau), aes(x = X, y = p),
            fill = "grey9") +
  geom_line(data = filter(blood.data, id == "sick" & X < tau),
            size = 1.5, col = "grey9") +
  scale_fill_discrete(name = "Population",
                    breaks = c("healthy", "sick"),
                    labels = c("Healthy", "Sick")) 
```


## 4.6.1 - Calculating the probability of β

```{r, echo = T}
# beta = P(X_2 <= tau): a sick person falls below the threshold
pnorm(2, mean = 2, sd = 1)
```

# Power = 0.5

* The power is the probability of the event complementary to a Type II error, so it equals 1 - β

* This is represented by the complementary area of the previous bell

## 4.6.2 - Probability of $\alpha$, $\beta$ and the power as a function of $\tau$

* We can repeat the same calculation for several values of $\tau$ using curve()

* Just select the y element in the output list

```{r}
tau <- seq(from = lower, to = upper, length.out = length.out)
```

```{r, echo = T, include = F}
alpha <- curve( {1 - pnorm(x, mean = 0, sd = 1)},
                from = lower, to = upper, n = length.out)$y

beta <- curve( {pnorm(x, mean = 2, sd = 1)},
               from = lower, to = upper, n = length.out,
               add = T)$y

power <- 1 - beta
```
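
Since `pnorm()` is vectorised, the same three curves can also be computed directly on a grid of thresholds, without going through `curve()`. The sketch below is an alternative under that assumption; the names ending in `.grid` are made up for the example.

```r
lower <- -5
upper <- 7
length.out <- 1001

# Grid of candidate thresholds
tau.grid <- seq(lower, upper, length.out = length.out)

# alpha: a healthy person exceeds the threshold (false positive)
alpha.grid <- pnorm(tau.grid, mean = 0, sd = 1, lower.tail = FALSE)

# beta: a sick person stays below the threshold (false negative)
beta.grid <- pnorm(tau.grid, mean = 2, sd = 1)

# power: a sick person is correctly detected
power.grid <- 1 - beta.grid

head(data.frame(tau = tau.grid, alpha = alpha.grid,
                beta = beta.grid, power = power.grid))
```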

```{r}
data <- data.frame(
  τ = tau,
  name = as.factor(rep(c("alpha", "beta", "power"), each = length.out)),
  value = c(alpha, beta, power)
)
```

## 4.6.2 - Visualization

```{r}
data$id <- 1:nrow(data)

base <- data %>%
  ggvis(~τ, ~value, stroke = ~name, key := ~id) %>%
  layer_points(fill = ~name)

base %>% add_tooltip(function(data){paste0(data$name, "<br />Value: ", round(data$value, digits = 2), "<br />Tau: ", round(data$τ, digits = 2))}, "hover")
```


# Trade off $\alpha$ and $\beta$

* We can't decrease $\alpha$ without increasing $\beta$

* The compromise will favour very low values of $\alpha$ because Type I errors are considered more dangerous

* The power increases as $\beta$ decreases, since power = 1 - $\beta$
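
To make the trade-off concrete, one can fix $\alpha$ first and read off the threshold and the resulting $\beta$. The sketch below assumes a target of $\alpha = 0.05$, a value chosen only for illustration, and uses the healthy and sick distributions defined earlier; the names ending in `.star` are hypothetical.

```r
target.alpha <- 0.05   # illustrative choice, not taken from the exercise

# Threshold leaving a probability of target.alpha above it for healthy people
tau.star <- qnorm(1 - target.alpha, mean = 0, sd = 1)

# Resulting Type II error rate and power for the sick population
beta.star <- pnorm(tau.star, mean = 2, sd = 1)
power.star <- 1 - beta.star

c(tau = tau.star, alpha = target.alpha, beta = beta.star, power = power.star)
```

Lowering `target.alpha` pushes `tau.star` to the right, which raises `beta.star` and lowers the power.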

## 4.7 - Can the exponential distribution provide enough detail to fit our data?

Compare the fit offered by the gamma and exponential distributions


```{r fig.width = 6, echo = FALSE}
img <- readPNG("figures/curves_2.png")
grid.raster(img)
```


## Download the data

```{r, echo = T}
neuron <- read.table("http://math.ku.dk/~tfb525/teaching/statbe/neuronspikes.txt", col.names = "time")
```

## Formulate expression to optimise

```{r, echo = T}
mllk.gamma <- function(params, data) {
  -(dgamma(data, params[1], params[2], log = T) %>% sum)
}

mllk.exponential <- function(rate, data) {
  -(dexp(data, rate, log = T) %>% sum)
}
```
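
The two models are nested because the exponential distribution is a gamma distribution with its shape fixed at 1, which is also why the test later uses one degree of freedom. A small check, using made-up values for the times and the rate:

```r
# Hypothetical values, for illustration only
times <- c(0.1, 0.5, 1.2, 3.4)
rate <- 0.8

# A gamma density with shape = 1 coincides with the exponential density
all.equal(dgamma(times, shape = 1, rate = rate), dexp(times, rate = rate))
```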

## Optimise and compute -log(Likelihood)

```{r, echo = T, message = F, warning= F}
# The rate must be positive, so the search interval starts just above 0
exponential.optim <- optimize(mllk.exponential, interval = c(1e-8, 10), data = neuron$time)
gamma.optim <- optim(c(1, 1), mllk.gamma, data = neuron$time)
```
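
For the exponential model the ML estimate of the rate has a closed form, 1 / mean(data), which can be used to check the numerical result. The sketch below runs on simulated data standing in for `neuron$time`; the seed, the sample size, the true rate and the names `spikes`, `nll.exp` and `fit.exp` are assumptions for the example.

```r
# Hypothetical stand-in for neuron$time
set.seed(1)
spikes <- rexp(200, rate = 1.5)

# Negative log-likelihood of the exponential model
nll.exp <- function(rate, data) -sum(dexp(data, rate, log = TRUE))

# Numerical optimisation over positive rates
fit.exp <- optimize(nll.exp, interval = c(1e-8, 10), data = spikes)

# Numerical and closed-form estimates should agree closely
c(numerical = fit.exp$minimum, closed.form = 1 / mean(spikes))
```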


## Likelihood ratio test: statistic and p-value 

```{r, echo = T, eval = T}
statistic <- 2 * (exponential.optim$objective - gamma.optim$value)
p.value <- 1 - pchisq(statistic, df = 1)
p.value
```

The obtained p-value is very low, thus we reject the null hypothesis

**The nested model is not sufficient**.

**The gamma distribution provides a better fit to our data**.

## Visualization [App](https://antortjim.shinyapps.io/ex4_7/) 

```{r, echo = F}
shinyAppFile(appFile = "ex4_7/app.R")
```


# Thank you!

* Hadley Wickham for ggvis, ggplot2, and dplyr

* Slides made with ioslides

* Apps made with Shiny, an R web application framework

* Code running in R