stancon2019-intro.Rmd

---
title: "StanCon 2019"
author: "Cambridge, UK"
date: "August 20-23"
output:
  html_document:
    toc: true
    toc_depth: 2
---

## Setup

```{r setup, results="hide", message=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, 
  dev = "png",
  dpi = 150,
  fig.align = "center",
  comment = NA
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)

theme_set(bayesplot::theme_default())

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123) 

# load data
pest_data <- readRDS('data/pest_data.RDS')
standata_hier <- readRDS('data/standata_hier.RDS')
```

## The problem

### Background

Imagine that you are a statistician or data scientist working as an independent
contractor. One of your clients is a company that owns many residential buildings 
throughout New York City. The property manager explains that they are concerned about the number
of cockroach complaints that they receive from their buildings. Previously
the company has offered monthly visits from a pest inspector as a solution to
this problem. While this is the default solution of many property managers in
NYC, the tenants are rarely home when the inspector visits, and so the manager
reasons that this is a relatively expensive solution that is currently not very
effective.

One alternative to this problem is to deploy long term bait stations. In this
alternative, child and pet safe bait stations are installed throughout the
apartment building. Cockroaches obtain quick acting poison from these stations
and distribute it throughout the colony. The manufacturer of these bait stations
provides some indication of the space-to-bait efficacy, but the manager suspects
that this guidance was not calculated with NYC roaches in mind. NYC roaches, the
manager rationalizes, have more hustle than traditional roaches; and NYC
buildings are built differently than other common residential buildings in the
US. This is particularly important as the uni§t cost for each bait station per
year is quite high.

### The goal

The manager wishes to employ your services to help them to find the optimal
number of roach bait stations they should place in each of their buildings in
order to minimize the number of cockroach complaints while also keeping
expenditure on pest control affordable. 

A subset of the company's buildings have been randomly selected for an experiment: 

* At the beginning of each month, a pest inspector randomly places a number of
bait stations throughout the building, without knowledge of the current
cockroach levels in the building
* At the end of the month, the manager records
the total number of cockroach complaints in that building. 
* The manager would like to determine the optimal number of traps ($\textrm{traps}$) that
balances the lost revenue ($R$) that complaints ($\textrm{complaints}$) generate
with the all-in cost of maintaining the traps ($\textrm{TC}$). 

Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.

Formally, we are interested in finding

$$
\arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})]
$$

The property manager would also, if possible, like to learn how these results 
generalize to buildings they haven't treated so they can understand the
potential costs of pest control at buildings they are acquiring as well as for
the rest of their building portfolio.

As the property manager has complete control over the number of traps set, the
random variable contributing to this expectation is the number of complaints
given the number of traps. We will model the number of complaints as a function
of the number of traps.


## The data

The data provided to us is in a file called `pest_data.RDS`. Let's
load the data and see what the structure is:

```{r load-data}
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
```

We have access to the following fields: 

* `complaints`: Number of complaints per building per month
* `building_id`: The unique building identifier
* `traps`: The number of traps used per month per building
* `date`: The date at which the number of complaints are recorded
* `live_in_super`: An indicator for whether the building as a live-in super
* `age_of_building`: The age of the building
* `total_sq_foot`: The total square footage of the building
* `average_tenant_age`: The average age of the tenants per building
* `monthly_average_rent`: The average monthly rent per building
* `floors`: The number of floors per building

First, let's see how many buildings we have in the data:

```{r describe-data}
length(unique(pest_data$building_id))
```

And make some plots of the raw data: 

```{r data-plot-1}
ggplot(pest_data, aes(x = complaints)) + 
  geom_bar()
```

```{r data-plot-2}
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) + 
  geom_jitter()
```

```{r data-plot-ts, fig.width = 6, fig.height = 8}
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) + 
  geom_line(aes(linetype = "Number of complaints")) + 
  geom_point(color = "black") + 
  geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) + 
  facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) + 
  scale_x_date(name = "Month", date_labels = "%b") + 
  scale_y_continuous(name = "", limits = range(pest_data$complaints)) + 
  scale_linetype_discrete(name = "") + 
  scale_color_discrete(name = "Live-in super")
```


The first question we'll look at is just whether the number of complaints per
building per month is associated with the number of bait stations per building
per month, ignoring the temporal and across-building variation (we'll get to
that later). This requires only two variables, $\textrm{complaints}$ and
$\textrm{traps}$. How should we model the number of complaints?


## Modeling count data : Poisson distribution

We already know some rudimentary information about what we should expect. The
number of complaints over a month should be either zero or an integer. The
property manager tells us that it is possible but unlikely that number of
complaints in a given month is zero. Occasionally there are a very large number
of complaints in a single month. A common way of modeling this sort of skewed,
single bounded count data is as a Poisson random variable. One concern about
modeling the outcome variable as Poisson is that the data may be
over-dispersed, but we'll start with the Poisson model and then check 
whether over-dispersion is a problem by comparing our model's predictions 
to the data.

### Model 

Given that we have chosen a Poisson regression, we define the likelihood to be
the Poisson probability mass function over the number bait stations placed in
the building, denoted below as `traps`. This model assumes that the mean and
variance of the outcome variable `complaints` (number of complaints) is the
same. We'll investigate whether this is a good assumption after we fit the 
model.

For building $b = 1,\dots,10$ at time (month) $t = 1,\dots,12$, we have

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$


### Prior predictive checks

Before we fit the model to real data, we should check that our priors and model
can generate plausible simulated data. 

In R we can simulate from the prior predictive distribution using a 
function like this:

```{r simulate-poisson-data}
# using normal distributions for priors on alpha and beta 
simple_poisson_dpg <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
  N <- length(traps)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
  beta <- rnorm(1, mean = beta_mean, sd = beta_sd);
  complaints <- rpois(N, lambda = exp(alpha + beta * traps))
  return(complaints)
}
```

Try with $N(0,10)$ priors on both parameters: 

```{r prior-pred-really-bad}
# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 10,
  beta_mean = 0,
  beta_sd = 10
)
```

Try with $N(0,1)$ priors on both parameters: 

```{r prior-pred-bad}
simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 1,
  beta_mean = 0,
  beta_sd = 1
)
```

Let's calculate some summary statistics of 1000 data sets generated according
to the $N(0,1)$ priors:

```{r prior-pred-check}
prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = 0,
    alpha_sd = 10,
    beta_mean = 0,
    beta_sd = 10
  )
}))

dim(prior_preds)
```

```{r prior-pred-mean}
summary(apply(prior_preds, 1, mean))
```
```{r prior-pred-min}
summary(apply(prior_preds, 1, min))
```
```{r prior-pred-max}
hist(apply(prior_preds, 1, max))
```


A more reasonable (though still quite flexible) prior:

```{r}
simple_poisson_dpg(
  traps = pest_data$traps,
  alpha_mean = log(7),
  alpha_sd = 0.5,
  beta_mean = -0.25,
  beta_sd = 0.5
)
```

Simulate 1000 times and calculate summary stats:

```{r prior-pred-better}
prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = log(7),
    alpha_sd = 0.5,
    beta_mean = -0.25,
    beta_sd = 0.5
  )
}))
```

```{r}
summary(apply(prior_preds, 1, mean))
```

```{r}
summary(apply(prior_preds, 1, min))
```

```{r}
summary(apply(prior_preds, 1, max))
```


### Writing and fitting our first Stan program

Let's encode the model in a Stan program. 

* Write `simple_poisson_regression.stan` together, put in `stan_programs` directory.

```{r compile-simple-poisson}
comp_simple <- stan_model('stan_programs/simple_poisson_regression.stan')
```

To fit the model to the data we'll first create a list to pass
to Stan using the variables in the `pest_data` data frame. The names of the 
list elements must match the names used in the `data` block of the Stan program.

```{r stan-data}
standata_simple <- list(
  N = nrow(pest_data), 
  complaints = pest_data$complaints,
  traps = pest_data$traps
)
str(standata_simple)
```

We have already compiled the model, so we can jump straight to sampling from it.

```{r fit_simple, cache=TRUE}
fit_simple <- sampling(comp_simple, data = standata_simple,
                       # these are the defaults but specifying them anyway
                       # so you can see how to use them: 
                       # posterior sample size = chains * (iter-warmup)
                       chains = 4, iter = 2000, warmup = 1000)
```

Print the summary of the intercept and slope parameters: 

```{r results_simple_P}
print(fit_simple, pars = c('alpha','beta'))
```

We can also plot the posterior distributions: 

```{r mcmc_hist}
# https://mc-stan.org/bayesplot/reference/MCMC-distributions
draws <- as.matrix(fit_simple, pars = c('alpha','beta'))
head(draws)
mcmc_hist(draws) # marginal posteriors of alpha and beta
mcmc_scatter(draws, alpha = 0.2, size = 1) # posterior of (alpha,beta)
```

And compare them to draws from the prior:


```{r compare-prior-posterior}
alpha_prior_post <- cbind(alpha_prior = rnorm(4000, log(7), 1), 
                          alpha_posterior = draws[, "alpha"])
mcmc_hist(alpha_prior_post, facet_args = list(nrow = 2), binwidth = 0.1) + 
  xlim(range(alpha_prior_post))


beta_prior_post <- cbind(beta_prior = rnorm(4000, -0.25, 0.5), 
                         beta_posterior = draws[, "beta"])
mcmc_hist(beta_prior_post, facet_args = list(nrow = 2), binwidth = 0.05) + 
  xlim(range(beta_prior_post))
```


From the posterior of `beta`, it appears the number of bait stations set in a
building is associated with the number of complaints about cockroaches that were
made in the following month. However, we still need to consider how well the
model fits.


### Posterior predictive checking

```{r y_rep_simple}
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various ways
# of extracting contents from stanfit objects
y_rep <- as.matrix(fit_simple, pars = "y_rep")
```

```{r marginal_PPC}
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])
```

In the plot above we have the kernel density estimate of the observed data ($y$,
thicker curve) and 200 simulated data sets ($y_{rep}$, thin curves) from the
posterior predictive distribution. Here the posterior predictive
simulations are not as dispersed as the real data and don't seem to capture the
rate of zeros well at all. This suggests the Poisson model may not be sufficiently
flexible for this data.

Let's explore this further by looking directly at the proportion of zeros in the
real data and predicted data.
```{r ppc_stat-prop_zero}
# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)

# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = .01
)
```
The plot above shows the observed proportion of zeros (thick vertical line) and
a histogram of the proportion of zeros in each of the simulated data sets. It is
clear that the model does not capture this feature of the data well at all.


The rootogram is another useful plot to compare the observed vs expected number
of complaints. This is a plot of the square root of the expected counts
(continuous line) vs the observed counts (blue histogram)

```{r rootogram}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)
```

If the model was fitting well these would be relatively similar, however in this
figure we can see the number of complaints is underestimated if there are few
complaints, over-estimated for medium numbers of complaints, and underestimated
if there are a large number of complaints.

The `ppc_bars()` function will make a bar plot of the observed values and
overlay prediction intervals but not on the square root scale (unlike the
rootogram):

```{r ppc_bars}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep)
```


We can also view how the predicted number of complaints varies with the number
of traps:

```{r intervals}
ppc_intervals(y = standata_simple$complaints, yrep = y_rep,
              # jitter number of traps since multiple observations at some values
              x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)) +  
  labs(x = "Number of traps", y = "Number of complaints")
```


## Expanding the model: multiple predictors

Currently, our model's mean parameter is a rate of complaints per 30 days, but
we're modeling a process that occurs over an area as well as over time. We have
the square footage of each building, so if we add that information into the
model, we can interpret our parameters as a rate of complaints per square foot
per 30 days.

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$

The term $\text{sq_foot}$ is called an exposure term. If we log the term, we can 
put it in $\eta_{b,t}$:

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b
\end{align*}
$$


We will also include whether there is a live-in super or not as a predictor 
for the number of complaints, which gives us gives us: 

$$
\eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} +
\beta_{\rm super} \textrm{live_in_super}_{b,t} +
\textrm{log_sq_foot}_b
$$
 
Add these new variables to the data list for Stan.

```{r add-predictors-to-standata}
standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super
```


### Stan program for Poisson multiple regression

Now we need a new Stan program that uses multiple predictors.

* Write `multiple_poisson_regression.stan`

```{r compile-multi-poisson}
comp_multi <- stan_model('stan_programs/multiple_poisson_regression.stan')
```

### Fit the Poisson multiple regression

```{r fit_multi}
fit_multi <- sampling(comp_multi, data = standata_simple, 
                      refresh = 100) # turn off printed progress updates
print(fit_multi, pars = c("alpha", "beta", "beta_super"))
```


```{r ppc_dens_overlay-random-subset}
y_rep <- as.matrix(fit_multi, pars = "y_rep")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])

# in this case we have very high effective sample sizes, but if there is
# nontrivial autocorrelation then it's better to take a random sample of the draws, 
# for example:
ids <- sample(nrow(y_rep), size = 200)
ppc_dens_overlay(standata_simple$complaints, y_rep[ids, ])
```

This again looks like we haven't captured the smaller counts very well, nor
have we captured the larger counts.

We're still severely underestimating the proportion of zeros in the data:

```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = standata_simple$complaints, yrep = y_rep, 
         stat = "prop_zero", binwidth = 0.01)
```

Ideally this vertical line would fall somewhere within the histogram.

We can also plot uncertainty intervals for the predicted complaints for different
numbers of traps.

```{r}
ppc_intervals(
  y = standata_simple$complaints, 
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps", y = "Number of complaints")
```

We can see that we've increased the tails a bit more at the larger numbers of traps
but we still have some large observed numbers of complaints that the model
would consider extremely unlikely events. 
 

## Modeling count data with the Negative Binomial

When we considered modeling the data using a Poisson, we saw that the model
didn't appear to fit as well to the data as we would like. In particular the
model underpredicted low and high numbers of complaints, and overpredicted the
medium number of complaints. This is one indication of over-dispersion, where
the variance is larger than the mean. A Poisson model doesn't fit over-dispersed
count data very well because the same parameter $\lambda$, controls both the
expected counts and the variance of these counts. The natural alternative to
this is the negative binomial model:

$$
\begin{align*}
\text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b}
\end{align*}
$$

In Stan the negative binomial mass function we'll use is called 
$\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)$ 
in Stan. Like the `poisson_log` function, this negative binomial mass function
that is parameterized in terms of its log-mean, $\eta$, but it also has a
precision $\phi$ such that

$$
\mathbb{E}[y] \, = \lambda = \exp(\eta)
$$

$$
\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.
$$ 

As $\phi$ gets larger the term $\lambda^2 / \phi$ approaches zero and so 
the variance of the negative-binomial approaches $\lambda$, i.e., the
negative-binomial gets closer and closer to the Poisson.

### Stan program for negative-binomial regression

* Write `multiple_NB_regression.stan` together

```{r compile-multi-NB, cache=TRUE, results="hide", message=FALSE}
comp_multi_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
```


### Fit to data and check the fit

```{r fit_multi_NB}
fit_multi_NB <- sampling(comp_multi_NB, data = standata_simple, refresh = 250)

# to demonstrate extracting as a list instead of using as.matrix
samps_NB <- rstan::extract(fit_multi_NB) 
```

Let's look at our predictions vs. the data.

```{r ppc_dens_overlay_NB}
ppc_dens_overlay(y = standata_simple$complaints, yrep = samps_NB$y_rep[1:200, ])
```

It appears that our model now captures both the number of small counts better
as well as the tails. 

Let's check the proportion of zeros:

```{r prop_zero_NB}
ppc_stat(y = standata_simple$complaints, yrep = samps_NB$y_rep, 
         stat = "prop_zero", binwidth = 0.01)
```


Check predictions by number of traps: 

```{r}
ppc_intervals(
  y = standata_simple$complaints, 
  yrep = samps_NB$y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps (jittered)", y = "Number of complaints")
```

We haven't used the fact that the data are clustered by building yet. A posterior 
predictive check might elucidate whether it would be a good idea to add the building
information into the model.

```{r ppc-group-means, fig.width=7, fig.height=6}
ppc_stat_grouped(
  y = standata_simple$complaints, 
  yrep = samps_NB$y_rep, 
  group = pest_data$building_id, 
  stat = 'mean',
  binwidth = 0.2
)
```

We're getting plausible predictions for most building means but some are
estimated better than others and some have larger uncertainties than we might
expect. If we explicitly model the variation across buildings we may be able to
get much better estimates.

## Hierarchical modeling

### Modeling varying intercepts for each building

Let's add a hierarchical intercept parameter, $\alpha_b$ at the building level
to our model.

$$
\text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t}  = \exp{(\eta_{b,t})} \\
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\
\mu_b \sim \text{Normal}(\alpha, \sigma_{\mu})
$$

In our Stan model, $\mu_b$ is the $b$-th element of the vector
$\texttt{mu}$ which has one element per building.

One of our predictors varies only by building, so we can rewrite the above model
more efficiently like so:

$$
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\
\mu_b \sim \text{Normal}(\alpha +  \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu})
$$


We have more information at the building level as well, like the average age of
the residents, the average age of the buildings, and the average per-apartment
monthly rent so we can add that data into a matrix called
`building_data`, which will have one row per building and four columns:

* `live_in_super`
* `age_of_building`
* `average_tentant_age`
* `monthly_average_rent`

We'll write the Stan model like:

$$
\eta_{b,t} = \mu_b + \beta \, {\rm traps} + \text{log_sq_foot}\\
\mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu})
$$

### Prepare building data for hierarchical model

We'll need to do some more data prep before we can fit our models. Firstly to
use the building variable in Stan we will need to transform it from a factor
variable to an integer variable.

```{r prep-data}
N_months <- length(unique(pest_data$date))
N_buildings <- length(unique(pest_data$building_id))
# Add some IDs for building and month
pest_data <- pest_data %>%
  mutate(
    building_fac = factor(building_id, levels = unique(building_id)),
    building_idx = as.integer(building_fac),
    ids = rep(1:N_months, N_buildings),
    mo_idx = lubridate::month(date)
  )

# Center and rescale the building specific data
building_data <- pest_data %>%
    select(
      building_idx,
      live_in_super,
      age_of_building,
      average_tenant_age,
      monthly_average_rent
    ) %>%
    unique() %>%
    arrange(building_idx) %>%
    select(-building_idx) %>%
    scale(scale=FALSE) %>%
    as.data.frame() %>%
    mutate( # scale by constants to put variables on similar scales
      age_of_building = age_of_building / 10,
      average_tenant_age = average_tenant_age / 10,
      monthly_average_rent = monthly_average_rent / 1000
    ) %>%
    as.matrix()

# Make data list for Stan
standata_hier <-
  with(pest_data,
        list(complaints = complaints,
             N = length(complaints),
             traps = traps,
             log_sq_foot = log(pest_data$total_sq_foot/1e4),
             M = N_months,
             mo_idx = as.integer(as.factor(date)),
             J = N_buildings,
             K = ncol(building_data),
             building_idx = building_idx,
             building_data = building_data
             )
        )
```

### Compile and fit the hierarchical model

Let's compile the model.

```{r comp_hier_NB, results="hide", message=FALSE}
comp_hier_NB <- stan_model('stan_programs/hier_NB_regression.stan')
```

Fit the model to data.

```{r run-NB-hier}
fit_hier_NB <-
  sampling(
    comp_hier_NB,
    data = standata_hier,
    refresh = 1000, # only report progress every 1000 iterations
    # can run markov chains in parallel. you can check how many cores you 
    # have available using parallel::detectCores()
    cores = 4 
  )
```

### Diagnostics

We get a bunch of warnings from Stan about divergent transitions, which is
an indication that there may be regions of the posterior that have not been
explored by the Markov chains.

Divergences are discussed in more detail in the course slides as well as
the **bayesplot** [MCMC diagnostics vignette](http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
and [_A Conceptual Introduction to Hamiltonian Monte Carlo_](https://arxiv.org/abs/1701.02434).

In this example we will see that we have divergent transitions because we need to
reparameterize our model - i.e., we will retain the overall structure of the
model, but transform some of the parameters so that it is easier for Stan to
sample from the parameter space. Before we go through exactly how to do this
reparameterization, we will first go through what indicates that this is
something that reparameterization will resolve. We will go through:

1. Examining the fitted parameter values, including the effective sample size
2. Plots that reveal particular patterns in locations of the divergences.

Then we print the fits for the parameters that are of most interest.

```{r fit_hier_NB-low-ESS}
print(fit_hier_NB, pars = c('sigma_mu','beta','alpha','phi','mu'))
```
You can see that the effective sample sizes are quite low for many of the parameters
relative to the total number of samples. This alone isn't indicative of the need
to reparameterize, but it indicates that we should look further at the trace
plots and pairs plots. First let's look at the traceplots to see if the divergent
transitions form a pattern.

```{r traceplot-with-divs}
# change bayesplot color scheme (to make chains very different colors)
color_scheme_set("brewer-Spectral") 
mcmc_trace(
  # use as.array to keep the markov chains separate for trace plots
  as.array(fit_hier_NB,pars = 'sigma_mu'),
  np = nuts_params(fit_hier_NB),
  window = c(500,1000) # zoom into a window to see more clearly
)
color_scheme_set("blue")
```

Looks like the divergences (marked in red below the traceplot)
correspond to spots where sampler gets stuck.

```{r, scatter-with-divs}
# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
  as.array(fit_hier_NB),
  pars = c("mu[3]", "sigma_mu"),
  # look at sigma on log-scale (what Stan uses under the hood for positive parameters)
  transform = list("sigma_mu" = "log"), 
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB),
  np_style = scatter_style_np(div_size = 1.5) # change style of divergences points
)
scatter_with_divs + coord_equal()
```

What we have here is a cloud-like shape, with most of the divergences clustering
towards the bottom. We'll see a bit later that we actually want this to look more
like a funnel than a cloud, but the divergences are indicating that the sampler
can't explore the narrowing neck of the funnel.

One way to see why we should expect some version of a funnel is to look at some
simulations from the _prior_, which we can do without MCMC and thus with no risk
of sampling problems:

```{r}
N_sims <- 1000
sigma_mu_prior <- rep(NA, N_sims)
mu_prior <- rep(NA, N_sims)
for (j in 1:N_sims) {
  sigma_mu_prior[j] <- abs(rnorm(1, mean = 0, sd = 1))
  mu_prior[j] <- rnorm(1, mean = 0, sd = sigma_mu_prior[j])
}
prior_draws <- cbind("mu" = mu_prior, "log(sigma_mu)" = log(sigma_mu_prior))

mcmc_scatter(prior_draws)
```

Of course, if the data is at all informative we shouldn't expect the posterior
to look exactly like the prior. But unless the data is incredibly informative
about the parameters and the posterior concentrates away from the narrow neck of
the funnel, the sampler is going to have to confront the funnel geometry. 
(See the bayesplot 
[Visual MCMC Diagnostics](https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
vignette for more on this.)

Another way to look at the divergences is via a parallel coordinates plot.

```{r}
# https://mc-stan.org/bayesplot/reference/MCMC-parcoord.html#plot-descriptions
# this plot is a bit computationally intensive
parcoord_with_divs <- mcmc_parcoord(
  as.array(fit_hier_NB, pars = c("sigma_mu", "mu")),
  np = nuts_params(fit_hier_NB)
)
parcoord_with_divs
```

Again, we see evidence that our problems concentrate when $\texttt{sigma_mu}$ is small.

### Reparameterize and recheck diagnostics

Instead, we should use the non-centered parameterization for $\mu_b$. We
define a vector of auxiliary variables in the parameters block,
$\texttt{mu_raw}$ that is given a $\text{Normal}(0, 1)$ prior in the model
block. We then make $\texttt{mu}$ a transformed parameter:
We can reparameterize the random intercept $\mu_b$, which is distributed:

$$
\mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu})
$$

```
transformed parameters {
  vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
```

This implies $\texttt{mu}$ is
$\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)$
distribution, but it decouples the dependence of the density of each element of
$\texttt{mu}$ from $\texttt{sigma_mu}$ ($\sigma_\mu$).
hier_NB_regression_ncp.stan uses the non-centered parameterization for
$\texttt{mu}$. We will examine the effective sample size of the fitted model to
see whether we've fixed the problem with our reparameterization.


```{r comp_hier_NB_ncp}
comp_hier_NB_ncp <- stan_model('stan_programs/hier_NB_regression_ncp.stan')
```

Fit the model to the data.

```{r fit_hier_NB_ncp}
fit_hier_NB_ncp <- sampling(comp_hier_NB_ncp, data = standata_hier, cores = 4, 
                            refresh = 1000)
```

You should have no divergences (or only a few) this time and much improved
effective sample sizes:

```{r fit_hier_NB_ncp-ESS}
print(fit_hier_NB_ncp, pars = c('sigma_mu','beta','alpha','phi','mu')) # optional 'probs' argument
```


Let's make the same scatter plot we made for the model with many divergences and 
compare: 

```{r compare-scatterplots}
scatter_no_divs <- mcmc_scatter(
  as.array(fit_hier_NB_ncp),
  pars = c("mu[3]", 'sigma_mu'),
  transform = list('sigma_mu' = "log"),
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB_ncp)
)

bayesplot_grid(
  scatter_with_divs, scatter_no_divs,
  grid_args = list(ncol = 2), 
  ylim = c(-10, 1), 
  subtitles = c("Centered parameterization", 
                "Non-centered parameterization")
)
```

Comparing the plots, we see that the divergences were a sign that there was part
of the posterior that the chains were not exploring.

The marginal PPC plot, again.

```{r ppc-full-hier}
y_rep <- as.matrix(fit_hier_NB_ncp, pars = "y_rep")
ppc_dens_overlay(standata_hier$complaints, y_rep[1:200,])
```

This looks quite nice. 

If we've captured the building-level means well, then the
posterior distribution of means by building should match well with the observed
means of the quantity of building complaints by month.

```{r ppc-group_means-hier}
ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = 'mean',
  binwidth = 0.5
)
```

We weren't doing terribly with the building-specific means before, but now they
are all estimated a bit better by the model. The model is also able to do a
decent job estimating within-building variability and, for the most part, 
the probability of zero complaints:

```{r}
ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = 'sd',
  binwidth = 0.5
)
```

```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = prop_zero,
  binwidth = 0.02
)

# plot separately for each building
ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = prop_zero,
  binwidth = 0.05
)
```


Check quantiles of predictions: 

```{r}
q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = "q90",
  binwidth = 1
)
```

Predictions by number of traps:

```{r}
ppc_intervals_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  x = standata_hier$traps + rnorm(standata_hier$N, 0, 0.2), # jitter
  group = standata_hier$building_idx
  # difference between grouping by pest_data$building_id and standata_hier$building_idx
  # is that the latter is consecutive from 1 to N_buildings to make things easier 
  # to code in Stan
) +
  labs(
    x = "Number of traps", 
    y = "Predicted complaints", 
    title = "Traps vs predicted complaints by building"
  )
```