# Priors
> Yeah, well, you know, that’s just, like, your opinion, man. — The Dude (*The Big Lebowski*)
## Levels of Priors
Prior distributions can be ordered from least to most informative:
1. Flat prior
1. Vague but proper prior, e.g., $\dnorm(0, 1e6)$
1. Very weakly informative prior, e.g., $\dnorm(0, 10)$
1. Generic weakly informative prior, e.g., $\dnorm(0, 1)$
1. Specific informative prior
## Conjugate Priors
In a few cases, the posterior distribution,
$$
p(\theta | y) = \frac{p(y | \theta) p(\theta)}{\int p(y | \theta') p(\theta') d\theta'},
$$
has a [closed-form solution](https://en.wikipedia.org/wiki/Closed-form_expression).
In those cases the posterior can be calculated exactly, and more costly numerical approximation methods are unnecessary.
Unfortunately, such cases are few, and most of them involve **conjugate priors**.
In the case of a conjugate prior, the posterior distribution is in the same family as the prior distribution.
Here is a diagram of a few common conjugate priors.[^conjugate]
```{r conjugate-gamma, echo=FALSE}
DiagrammeR::grViz("diagrams/conjugate_gamma.gv")
DiagrammeR::grViz("diagrams/conjugate_beta.gv")
```
[^conjugate]: Based on John Cook's [Diagram of Conjugate Prior distributions](https://www.johndcook.com/blog/conjugate_prior_diagram/).
The table in the Wikipedia page for [Conjugate priors](https://en.wikipedia.org/wiki/Conjugate_prior#cite_note-beta-interp-4) is as complete as any out there.
See @Fink1997a for a compendium of references.
Also see [Distributions] for more information about probability distributions.
### Binomial-Beta
Binomial distribution: If $N \in \Nats$ (number of trials), $\theta \in (0, 1)$ (success probability in each trial),
then for $n \in \{0, \dots, N\}$,
$$
\dBinom(n | N, \theta) = \binom{N}{n} \theta^{n} (1 - \theta)^{N - n} .
$$
Beta distribution: If $\alpha \in \RealPos$ (shape) and $\beta \in \RealPos$ (shape), then for $\theta \in (0, 1)$,
$$
\dbeta(\theta | \alpha, \beta) = \frac{1}{\mathrm{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1},
$$
where $\mathrm{B}$ is the beta function,
$$
\mathrm{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} .
$$
Then,
$$
\begin{aligned}[t]
p(\theta | \alpha, \beta) &= \dbeta(\theta | \alpha, \beta) && \text{Beta prior} \\
p(y | \theta) &= \dBinom(y | N, \theta) && \text{Binomial likelihood} \\
p(\theta | y, \alpha, \beta) &= \dbeta(\theta | \alpha + y, \beta + N - y) && \text{Beta posterior}
\end{aligned}
$$
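A minimal R sketch of this update follows; the prior shape parameters and the data below are illustrative values, not taken from the text.

```{r beta-binomial-update}
a <- 2; b <- 2    # prior shape parameters (illustrative values)
N <- 20; y <- 14  # number of trials and successes (illustrative values)
# Plot the Beta(a, b) prior and the Beta(a + y, b + N - y) posterior.
curve(dbeta(x, a, b), from = 0, to = 1, ylim = c(0, 5),
      xlab = expression(theta), ylab = "density")
curve(dbeta(x, a + y, b + N - y), add = TRUE, lty = 2)
legend("topleft", legend = c("prior", "posterior"), lty = c(1, 2))
```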
### Categorical-Dirichlet
The Dirichlet distribution is a multivariate generalization of the Beta distribution.
If $K \in \N$ and $\alpha \in (\R^{+})^{K}$, then for $\theta$ in the $K$-simplex (so $\theta_k > 0$ for all $k$ and $\sum_{k = 1}^{K} \theta_k = 1$),
$$
\ddirichlet(\theta | \alpha) = \frac{\Gamma(\sum_{k = 1}^K \alpha_k)}{\prod_{k = 1}^K \Gamma(\alpha_k)} \prod_{k = 1}^K \theta_{k}^{\alpha_k - 1}
$$
The multinomial distribution is a generalization of the binomial distribution with $K$ categories instead of 2.
If $K \in \N$, $N \in \N$, and $\theta$ is in the $K$-simplex, then for $y \in \N^{K}$ such that $\sum_{k = 1}^K y_k = N$,
$$
\dmultinom(y | \theta) = \binom{N}{y_1, \dots, y_K} \prod_{k = 1}^{K} \theta_k^{y_k},
$$
where the multinomial coefficient is defined as,
$$
\binom{N}{y_1, \dots, y_K} = \frac{N!}{\prod_{k = 1}^K y_k!}
$$
Then,
$$
\begin{aligned}[t]
p(\theta | \alpha) &= \ddirichlet(\theta | \alpha) && \text{Dirichlet prior} \\
p(y | \theta) &= \dmultinom(y | N, \theta) && \text{Multinomial likelihood} \\
p(\theta | y, \alpha) &= \ddirichlet(\theta | \alpha + y) && \text{Dirichlet posterior}
\end{aligned}
$$
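A short R sketch of this update follows; the prior concentration vector and the observed counts are illustrative values. Base R has no Dirichlet sampler, so draws are generated with the standard normalized-gamma construction.

```{r dirichlet-multinomial-update}
# Sample from a Dirichlet distribution via normalized Gamma draws.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}
alpha <- c(1, 1, 1)  # prior concentration parameters (illustrative values)
y <- c(12, 5, 3)     # observed category counts (illustrative values)
draws <- rdirichlet(1000, alpha + y)   # draws from the Dirichlet posterior
colMeans(draws)                        # simulation estimate of posterior means
(alpha + y) / sum(alpha + y)           # analytic posterior means
```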
### Poisson-Gamma
Poisson distribution: If $\lambda \in \R^{+}$ (rate), then for $n \in \N$,
$$
\dpois(n|\lambda) = \frac{1}{n!} \lambda^n \exp(-\lambda)
$$
Gamma distribution: If $\alpha \in \R^{+}$ (shape) and $\beta \in \R^{+}$ (rate, i.e., inverse scale), then for $y \in \R^{+}$,
$$
\dgamma(y | \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} y^{\alpha - 1} \exp(- \beta y)
$$
Then,
$$
\begin{aligned}[t]
p(\lambda) &= \dgamma(\lambda | \alpha, \beta) && \text{Gamma prior} \\
p(n | \lambda) &= \dpois(n | \lambda) && \text{Poisson likelihood} \\
p(\lambda | n, \alpha, \beta) &= \dgamma(\lambda | \alpha + n, \beta + 1) && \text{Gamma posterior}
\end{aligned}
$$
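A minimal R sketch for a single observed count; the prior shape and rate and the observed count are illustrative values.

```{r poisson-gamma-update}
a <- 2; b <- 1  # prior shape and rate (illustrative values)
n <- 6          # a single observed count (illustrative value)
# Plot the Gamma(a, b) prior and the Gamma(a + n, b + 1) posterior.
curve(dgamma(x, shape = a, rate = b), from = 0, to = 20,
      xlab = expression(lambda), ylab = "density")
curve(dgamma(x, shape = a + n, rate = b + 1), add = TRUE, lty = 2)
legend("topright", legend = c("prior", "posterior"), lty = c(1, 2))
```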
### Normal with known variance
Suppose a single observation $y$ has a normal likelihood with known variance $\sigma^2$, and the mean $\mu$ has a normal prior with mean $\mu_0$ and variance $\sigma_0^2$. Then,
$$
\begin{aligned}[t]
p(\mu | \mu_0, \sigma_0^2) &= \dnorm(\mu | \mu_0, \sigma_0^2) && \text{Normal prior} \\
p(y | \mu) &= \dnorm(y | \mu, \sigma^2) && \text{Normal likelihood} \\
p(\mu | y, \sigma^2, \mu_0, \sigma_0^2) &= \dnorm(\mu | \tilde{\mu}, \tilde{\sigma}^2) && \text{Normal posterior} \\
\tilde{\mu} &= \tilde{\sigma}^{2} \left(\frac{\mu_0}{\sigma_0^2} + \frac{y}{\sigma^2} \right) \\
\tilde{\sigma}^2 &= \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1}
\end{aligned}
$$
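This precision-weighting is easy to check numerically. In the R sketch below, the prior mean and standard deviation, the known data standard deviation, and the single observation are all illustrative values.

```{r normal-known-variance-update}
mu0 <- 0; sigma0 <- 10  # prior mean and standard deviation (illustrative values)
sigma <- 2              # known standard deviation of the data (illustrative value)
y <- 3                  # a single observation (illustrative value)
# Posterior precision is the sum of the prior and data precisions;
# the posterior mean is the precision-weighted average of mu0 and y.
post_var  <- 1 / (1 / sigma0^2 + 1 / sigma^2)
post_mean <- post_var * (mu0 / sigma0^2 + y / sigma^2)
c(mean = post_mean, sd = sqrt(post_var))
```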
### Exponential Family
Likelihood functions in the [exponential family](https://en.wikipedia.org/wiki/Exponential_family) have conjugate priors, often also in the exponential family.[^expconj]
## Improper Priors
If a parameter is given an [improper uniform prior](https://en.wikipedia.org/wiki/Prior_probability), $p(\theta) \propto 1$, then the posterior distribution is proportional to the likelihood,
$$
p(\theta | y) \propto p(y | \theta) p(\theta) \propto p(y | \theta)
$$
## Cromwell's Rule
Priors that place a probability of 0 or 1 on events should be avoided, except where those events are excluded by logical impossibility.
If a prior places probabilities of 0 or 1 on an event, then no amount of data can update that prior.
The name Cromwell's Rule comes from a quotation attributed to Oliver Cromwell,
> I beseech you, in the bowels of Christ, think it possible that you may be mistaken.
Lindley (1991) describes it as
> Leave a little probability for the moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.
If $p(\theta = x) = 0$ for some value $x$, then the posterior probability of $\theta = x$ is zero no matter what data are observed,
$$
p(\theta = x | y) \propto p(y | \theta = x) p(\theta = x) = 0
$$
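The following R sketch makes the same point on a grid; the dogmatic prior and the data are illustrative, not from the text.

```{r cromwell-demo}
theta <- seq(0.01, 0.99, by = 0.01)
prior <- ifelse(theta <= 0.5, 1, 0)            # dogmatic prior: zero mass above 0.5
lik   <- dbinom(95, size = 100, prob = theta)  # data strongly favor theta near 0.95
post  <- prior * lik / sum(prior * lik)        # normalized grid posterior
max(post[theta > 0.5])                         # still exactly zero, whatever the data say
```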
## Asymptotics
As the sample size increases, the posterior distribution converges to a normal distribution centered on the true value of the parameter.
Suppose the data $y_1, \dots, y_n$ are an i.i.d. sample from the distribution $f(y)$.
Suppose that the data are modeled with a parametric family $p(y | \theta)$ and a prior distribution $p(\theta)$.
If the data distribution is included in the parametric family, meaning that there exists a $\theta_0$ such that $p(y | \theta_0) = f(y)$, then the posterior distribution is *consistent* in that it converges to the true parameter value $\theta_0$ as $n \to \infty$.
Otherwise, the posterior converges to the value of $\theta$ for which $p(y | \theta)$ is closest to the true distribution.
As $n \to \infty$, the likelihood dominates the prior in determining the posterior distribution.
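A small simulation sketch of this concentration (the binomial model, the uniform prior, and the true proportion are illustrative assumptions): the Beta posterior for a binomial proportion piles up around the true value, and its standard deviation shrinks, as $n$ grows.

```{r asymptotics-demo}
set.seed(1)
theta_true <- 0.3
for (n in c(10, 100, 10000)) {
  y <- rbinom(1, size = n, prob = theta_true)
  a <- 1 + y; b <- 1 + n - y   # Beta posterior under a uniform Beta(1, 1) prior
  post_mean <- a / (a + b)
  post_sd   <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))
  cat(sprintf("n = %5d: posterior mean = %.3f, posterior sd = %.4f\n",
              n, post_mean, post_sd))
}
```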
There are cases in which the normal approximation is incorrect.
1. parameters are non-identified
1. the number of parameters increases with sample size
1. aliasing or non-identified parameters due to label switching
1. unbounded likelihoods. This can happen if variance parameters go to zero.
1. improper posterior distributions
1. prior distributions that exclude the point of convergence. See Cromwell's Rule.
1. convergence to the edge of the parameter space
1. tails of the distribution can be inaccurate even if the normal approximation converges to the correct value; e.g. the normal approximation will still place a non-zero density on negative values of a non-negative parameter.
## Proper and Improper Priors
A prior distribution $p(\theta)$ is improper when it is not a probability distribution, meaning
$$
\int p(\theta) \,d\theta = \infty .
$$
Perhaps the most common improper prior is an unbounded uniform distribution,
$$
p(\theta) \propto 1
$$
for $-\infty < \theta < \infty$.
Improper priors can sometimes be used because the posterior distribution may still be proper even if the prior is not.
| Prior    | Posterior          |
| -------- | ------------------ |
| Improper | Proper or improper |
| Proper   | Proper             |
One common case of this is a linear regression model with improper priors.
$$
\begin{aligned}[t]
y &\sim \dnorm(X \beta, \sigma^{2} I) \\
p(\beta, \sigma) &\propto 1 .
\end{aligned}
$$
If the number of observations (rows of $X$) is less than the number of
independent columns in $X$ (variables plus the constant), then the MLE of
$\beta$ is not uniquely defined, and the posterior distribution is also improper.
But if we alter that model to include proper priors,
$$
\begin{aligned}[t]
y &\sim \dnorm(X \beta, \sigma^{2} I) \\
\beta &\sim \dnorm(\mu_\beta, \Sigma_\beta) \\
\sigma &\sim p(\sigma) && \text{(a proper prior)}
\end{aligned}
$$
then we can estimate $p(\beta, \sigma | y, X)$ even if the number of
observations is less than the number of variables. Consider the extreme case in which we
observe *no* data; then the posterior distribution is equal to the prior, and
since the prior is a proper distribution, so is the posterior.
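A minimal R sketch of this contrast follows. The dimensions, the residual variance, and the prior variance are illustrative assumptions, and the residual variance is treated as known so that the posterior mean has a closed form.

```{r proper-prior-regression}
set.seed(123)
n <- 5; p <- 10                                     # fewer observations than columns
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- as.vector(X %*% rnorm(p)) + rnorm(n)
coef(lm(y ~ X - 1))       # several coefficients are NA: no unique flat-prior estimate
sigma2 <- 1; tau2 <- 1    # known residual variance and prior variance (assumed)
# Posterior mean of beta under a N(0, tau2 * I) prior with known sigma2.
post_prec <- crossprod(X) / sigma2 + diag(p) / tau2
post_mean <- solve(post_prec, crossprod(X, y) / sigma2)
drop(post_mean)           # a proper prior yields a well-defined posterior mean
```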
However, the fact that a proper prior yields a proper posterior does not imply
that the posterior distribution is any "good"; assessing that is the role of
the evaluation step in Bayesian analysis. In the cases where an improper prior
would lead to an improper posterior, the choice of a proper prior is especially
important, because the prior will dominate the shape of the posterior distribution.
One way of thinking about many "identification" assumptions in MLE models is
that they can loosely be considered "priors". The division between what is
called the likelihood and what is called the prior is not sharply defined, and
the choice of likelihood function is often both subjective and the most
important part of the analysis.
Regarding improper priors, see also the asymptotic result that the posterior
distribution depends increasingly on the likelihood as the sample size increases.
Stan: If no prior distribution is specified for a parameter, it is implicitly
given a uniform prior over its declared support; when that support is unbounded,
e.g. $(-\infty, +\infty)$, this default prior is improper.
## Hyperpriors and Hyperparameters
A hyperparameter is a parameter of a prior distribution.
A hyperprior is a prior distribution placed on a hyperparameter.
Consider the case of a binomial likelihood with a beta prior on the proportion parameter $\theta$.
The observed data are the number of successes $n$ and the number of trials $N$:
$$
\begin{aligned}[t]
n &\sim \dBinom(N, \theta) && \text{likelihood for } n \\
\theta &\sim \dbeta(a, b) && \text{prior on } \theta
\end{aligned}
$$
This is a model of the posterior distribution of $\theta$ given the data,
where the data consist of the $n$ successes, the total number of trials $N$, *and* $a$ and $b$, the assumed shape parameters of the beta prior on $\theta$.
Thus our posterior distribution is,
$$
p(\theta | n, N, a, b) .
$$
However, suppose we decided that since we did not have a good reason to choose any particular values of $a$ and $b$ for the prior distribution, we would treat the shape parameters of the beta distribution as parameters and assign them their own prior distributions,
$$
\begin{aligned}[t]
n &\sim \dBinom(N, \theta) && \text{likelihood for } n \\
\theta &\sim \dbeta(\alpha, \beta) && \text{prior on } \theta \\
\alpha &\sim \dexp(a^*) && \text{hyperprior on } \alpha \\
\beta &\sim \dexp(b^*) && \text{hyperprior on } \beta
\end{aligned}
$$
Now the parameters of the model are $\theta$, $\alpha$, and $\beta$, and the data are $n$, $N$, $a^{*}$, and $b^{*}$.
Since the prior for one parameter in the model ($\theta$) depends on other parameters, $\alpha$ and $\beta$, we call $\alpha$ and $\beta$ hyperparameters and their priors hyperpriors.
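The marginal posterior of $\theta$ in this model can be approximated on a grid. In the R sketch below, the observed data and the hyperprior rates $a^*$ and $b^*$ are illustrative values; the grid over $(\alpha, \beta)$ crudely approximates the integral over the hyperparameters.

```{r hyperprior-grid}
n <- 7; N <- 10          # observed successes and trials (illustrative values)
a_star <- b_star <- 0.1  # hyperprior rate parameters (illustrative values)
theta_grid <- seq(0.005, 0.995, by = 0.005)
ab <- expand.grid(alpha = seq(0.1, 10, by = 0.1),
                  beta  = seq(0.1, 10, by = 0.1))
# p(theta | n) is proportional to p(n | theta) times the prior for theta,
# with the prior obtained by summing over the (alpha, beta) grid.
post <- sapply(theta_grid, function(th) {
  prior_th <- sum(dbeta(th, ab$alpha, ab$beta) *
                  dexp(ab$alpha, rate = a_star) *
                  dexp(ab$beta,  rate = b_star))
  dbinom(n, N, th) * prior_th
})
post <- post / sum(post * 0.005)  # normalize on the theta grid
plot(theta_grid, post, type = "l",
     xlab = expression(theta), ylab = "approximate posterior density")
```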
## References
- The [Stan Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) and the [rstanarm](https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html) vignette include comprehensive advice on prior choice.
- @Betancourt2017a provides numerical simulations of how the shapes of weakly informative priors affect inferences.
- @Stan2016a for discussion of some types of priors in regression models
- @ChungRabe-HeskethDorieEtAl2013a discuss scale priors in penalized MLE models
- @GelmanJakulinPittauEtAl2008a discusses using Cauchy(0, 2.5) for prior distributions
- @Gelman2006a provides a prior distribution on variance parameters in hierarchical models.
- @PolsonScott2012a on using Half-Cauchy priors for scale parameters
[^expconj]: <https://en.wikipedia.org/wiki/Exponential_family#Bayesian_estimation:_conjugate_distributions>