appendix-misc.Rmd

# Miscellaneous

```{r include=FALSE, message=FALSE}
library("tidyverse")
library("stringr")
library("bayz")
```

## Parameters

| Category                | Description                                        |
| ----------------------- | -------------------------------------------------- |
| modeled data            | Data, assigned distribution                        |
| unmodeled data          | Data not given a distribution                      |
| modeled parameters      | Parameters with an informative prior distribution  |
| unmodeled parameters    | Parameters with non-informative prior distribution |
| derived quantities      | Variables defined deterministically                |

See @GelmanHill2007a [p. 366]

## Location-Scale Families

In a [location-scale family]( https://en.wikipedia.org/wiki/Location%E2%80%93scale_family) of distributions, if the random variable $X$ is distributed with mean 0 and standard deviation 1, then the random variable $Y$,
$$
Y = \mu + \sigma X ,
$$
has mean $\mu$ and standard deviation $\sigma$.

**Normal distribution:** Suppose $X \sim \dnorm(0, 1)$, then
$$
Y = \mu + \sigma X,
$$
is equivalent to $Y \sim \dnorm(\mu, \sigma)$ (normal with mean $\mu$ and standard deviation $\sigma$).

**Student-t distribution** (including Cauchy):
$$
\begin{aligned}[t]
X &\sim \dt{\nu}(0, 1) \\
Y &= \mu + \sigma X
\end{aligned}
$$
implies
$$
Y \sim \dt{\nu}(\mu, \sigma),
$$
i.e. $Y$ is distributed Student-$t$ with location $\mu$ and scale $\sigma$.

In Stan, it can be useful parameterize distributions in terms of a mean 0, scale 1 parameters, and separate parameters for the locations and scales. E.g. with normal distributions,

``` stan
parameters {
  real mu;
  real<lower = 0.0> sigma;
  vector[n] eps;
}
transformed parameters {
  vector[n] y;
  y = mu + sigma * eps;
}
model {
  eps ~ normal(0.0, 1.0);
}
```

## Scale Mixtures of Normal Distributions

Some commonly used distributions can be represented as scale mixtures of normal distributions.
For formal details of scale mixtures of normal distributions see @West1987a.
Distributions that are scale-mixtures of normal distributions can be written as,
$$
Y \sim \dnorm(\mu, \sigma_i^2) \\
\sigma_i \sim \pi(\sigma_i)
$$
As its name suggests, the individual variances (scales) themselves, have a distribution.

Some examples:

-   Student-t
-   Double Exponential
-   Horseshoe or Hierarchical Shrinkage (HS)
-   Horseshoe Plus or Hierarchical Shrinkage Plus (HS+)

Even when analytic forms of the distribution are available, representing them as scale mixtures of normal distributions may be convenient in modeling.
In particular, it may allow for drawing samples from the distribution easily.
And in HMC, it may induce a more tractable posterior density.

## Covariance-Correlation Matrix Decomposition

The suggested method for modeling covariance matrices in Stan is the separation strategy which decomposes a covariance matrix $\Sigma$ can be decomposed into a standard deviation vector $\sigma$,  and a correlation matrix $R$ [@BarnardMcCullochMeng2000a],
$$
\Sigma = \diag(\sigma) R \diag(\sigma) .
$$
This is useful for setting priors on covariance because separate priors can be set
for the scales of the variables via $\sigma$, and the correlation between them,
via $R$.

The [rstanarm](https://github.com/stan-dev/rstanarm/wiki/Prior-distributions) `decov` prior goes further and decomposes the covariance matrix into a correlation matrix, $\mat{R}$,
a diagonal variance matrix $\mat{\Omega}$ with trace $n \sigma^2$, a scalar global variance $\sigma^2$, and a simplex $\vec{\pi}$ (proportion of total variance for each variable):
$$
\begin{aligned}[t]
\mat{\Sigma} &= \mat{\Omega} \mat{R}  \\
\diag(\mat{\Omega}) &= n \vec{\pi} \sigma^2
\end{aligned}
$$
Separate and interpretable priors can be put on $\mat{R}$, $\vec{\pi}$, and $\sigma^2$.

The LKJ distribution is a distribution over correlation coefficients,
$$
R \sim \dlkjcorr(\eta) ,
$$
where
$$
\dlkjcorr(\Sigma | \eta) \propto \det(\Sigma)^{(\eta - 1)} .
$$

This distribution has the following properties:

-   $\eta = 1$: uniform correlations
-   $\eta \to \infty$: approaches the identity matrix
-   $0 < \eta < 1$: there is a trough at the identity matrix with higher probabilities placed on non-zero correlations.
-   For all positive $\eta$ ($\eta > 0$), $\E(R) = \mat{I}$.

```{r}
lkjcorr_df <- function(eta, n = 2) {
  out <- as.data.frame(rlkjcorr(n, eta))
  out$.row <- seq_len(nrow(out))
  out <- gather(out, .col, value, -.row)
  out$.col <- as.integer(str_replace(out$.col, "^V", ""))
  out$eta <- eta
  out  
}

lkjsims <- purrr::map_df(c(0.01, 0.1, 1, 2, 50, 1000), lkjcorr_df, n = 50)
```

This simulates a single matrix from the LKJ distribution with different values of $\eta$.
As $\eta \to \infty$, the off-diagonal correlations tend towards 0, and the correlation matrix to the identity matrix.
```{r}
ggplot(lkjsims, aes(x = .row, y = .col, fill = value)) +
  facet_wrap(~ eta, ncol = 2) +
  scale_fill_distiller(limits = c(-1, 1), type = "div", palette = "RdYlBu") +
  geom_raster() +
  theme_minimal() +
  theme(panel.grid = element_blank(), axis.text = element_blank()) +
  labs(x = "", y = "")
```

The density of the off-diagonal correlations.
```{r}
lkjsims %>%
  filter(.row < .col) %>%
  ggplot(aes(x = value, colour = factor(eta))) +
  geom_density()
```

For other discussions of the LKJ correlation distribution, see these:

-   <https://stats.stackexchange.com/questions/2746/how-to-efficiently-generate-random-positive-semidefinite-correlation-matrices/125017#125017>
-   <http://www.zinkov.com/posts/2015-06-09-where-priors-come-from/>
-   <http://www.psychstatistics.com/2014/12/27/d-lkj-priors/>

## QR Factorization

For a full-rank $N \times K$ matrix, the QR factorization is
$$
\mat{X} = \mat{Q} \mat{R}
$$
where $\mat{Q}$ is an orthonormal matrix such that $\mat{Q}\T \mat{Q}$ and
$\mat{R}$ is an upper triangular matrix.

Stan function
@Stan2016a suggest writing it is
$$
\begin{aligned}[t]
\mat{Q}^{*} = \mat{Q} \times \sqrt{N - 1} \\
\mat{R}^{*} = \frac{1}{\sqrt{N - 1}} \mat{R}
\end{aligned}
$$

This is used for solving linear model.

Suppose $\vec{\beta}$  is a $K \times 1$ vector, then
$$
\vec{eta} = \mat{x} \vec{\beta} = \mat{Q} \mat{R} \vec{\beta} = \mat{Q}^{*} \mat{R}^{*} \vec{\beta} .
$$
Suppose $\mat{theta} = \mat{R}^{*} \vec{\beta}$, then $\vec{eta} = \mat{Q}^{*} \mat{\theta}$ and $\vec{beta} = {\mat{R}^{*}}^{-1} \mat{\theta}$.

[rstanarm](https://cran.r-project.org/web/packages/rstanarm/vignettes/lm.html) provides a prior for a normal linear model which uses the QR decomposition to parameterize a prior in terms of $R^2$.

Stan functions:

-   `qr_Q(matrix A)`
-   `qr_R(matrix A)`

See @Stan2016a [Sec 8.2]

## Cholesky Decomposition

The [Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition) of a
positive definite matrix $A$ is,
$$
\mat{A} = \mat{L} \mat{L}\T ,
$$
where $\mat{L}$ is a lower-triangular matrix.

-   It is similar to a square root for a matrix.

-   It often more numerically stable or efficient to work with the Cholesky decomposition, than with
    a covariance matrix. When working with the covariance matrix, numerical precision can
    result in a non positive definite matrix. However, working with $\mat{L}$ will ensure
    that $\mat{A} = \mat{L} \mat{L}\T$ will be positive definite.

-   In Stan

    -   Types types  `cholesky_factor_cov`, and `cholesky_factor_corr` represent the Cholesky factor
        of covariance and correlation matrices, respectively.

    -   Cholesky decomposition function is `cholesky_decompose(matrix A)`

-   Multiple functions in Stan are parameterized with Cholesky decompositions instead of or in addition
    to covariance matrices. Use them if possible; they are more numerically stable.

    -   `lkj_corr_chol_lpdf`
    -   `multi_normal_cholesky_lpdf`

The Cholesky factor is used for sampling from a multivariate normal distribution using i.i.d. standard normal distributions.
Suppose $X_1, \dots, X_N$ are $N$ i.i.d. standard normal distributions, $\mat{\Omega}$ is an $N \times N$ lower-triangular matrix such that $\mat{\Omega} \mat{Omega}\T = \mat{\Sigma}$, and $\mu$ is an $N \times 1$ vector, then
$$
\vec{\mu} + \mat{\Omega} X \sim \dnorm(\vec{\mu}, \mat{\Sigma})
$$

See @Stan2016a [p. 40, 147, 241, 246]

## Scaled and Unscaled Variables

Though priors shouldn't depend on the data itself, many priors depend on the scales of the data.
The priors can be defined as a function of the scale of the data.
In linear regressions, using scaled variables or defining priors in terms of scaled parameters is common.

Suppose $\tilde{Y}$, $\tilde{X}$, $\tilde{\alpha}$, $\tilde{\beta}$, and $\epsilon$ are random variables, such that
$$
\tilde{Y} = \tilde{\alpha} + \tilde{\beta} \tilde{X} + \epsilon .
$$
These random variables have the following properties:
$$
\begin{aligned}
\tilde{Y} &= \frac{Y - \bar{Y}}{\sigma_Y}, & \E\[\tilde{Y}\] &= 0, & \sigma_Y^2 &= \V\[\tilde{Y}\] = 1 \\
\tilde{X} &= \frac{X - \bar{X}}{\sigma_X}, &  \E\[\tilde{X}\] &= 0, & \sigma_X^2 &= \V\[\tilde{X}\] = 1 , \\
&& \E\[\epsilon\] &= 0 & \sigma_{\tilde{\epsilon}}^2 &= \V\[\tilde{\epsilon}\]
\end{aligned}
$$
where
$$
\begin{aligned}[t]
\bar{X} &= \E\[X\] , & s_X^2 &= \V\[X\] , \\
\bar{Y} &= \E\[Y\] , & s_Y^2 &= \V\[Y\] .
\end{aligned}
$$

Then via some algebra,
$$
\begin{aligned}
Y &= \underbrace{\sigma_{Y} \tilde{\alpha} + \bar{Y} - \frac{\sigma_Y }{\sigma_X} \tilde{\beta} \bar{X}}_{\alpha} +
\underbrace{\frac{\sigma_Y}{\sigma_X} \tilde{\beta}}_{\beta} X + \underbrace{\sigma_Y \tilde{\epsilon}}_{\epsilon} \\
  &= \alpha + \beta X + \epsilon .
\end{aligned}
$$
The primary relationships of interest are those between $\alpha$ and $\tilde{\alpha}$, $\beta$ and $\tilde{\beta}$, and $\epsilon$ and $\tilde{\epsilon}$.
These can be used to convert between coefficients estimated with standardized data to the coefficients on the data scale, or to adjust scale-free weakly informative priors to the data scale.
$$
\begin{aligned}[t]
\tilde{\alpha} &= \sigma_Y^{-1}\left(\alpha - \bar{Y} + \beta \bar{X} \right) &
\alpha &= \sigma_Y \tilde{\alpha} + \bar{Y} - \frac{\sigma_Y}{\sigma_X} \tilde{\beta} \bar{X} \\
&= \sigma_Y^{-1}\left(\alpha - \bar{Y} + \frac{\sigma_Y}{\sigma_X} \tilde{\beta} \bar{X} \right) &
&= \sigma_Y \tilde{\alpha} + \bar{Y} - \beta \bar{X} ,  \\
\tilde{\beta} &= \frac{\sigma_X}{\sigma_Y} \beta  &
\beta &= \frac{\sigma_Y}{\sigma_X} \tilde{\beta} , \\
\tilde{\epsilon} &= \epsilon / \sigma_Y  &
\epsilon &= \sigma_Y \tilde{\epsilon} .
\end{aligned}
$$
This implies the following relationships between their means and variances,
$$
\begin{aligned}[t]
E(\alpha) &= \sigma_{Y} E(\tilde{\alpha}) + \bar{Y} - \frac{\sigma_Y}{\sigma_X} \tilde{\beta} \bar{X} & 
V(\alpha) &= \sigma^2_Y V(\tilde{\alpha}) \\
E(\tilde{\alpha}) &= \frac{E(\alpha) - \bar{Y} + \beta \bar{X} }{\sigma_Y} &
V(\tilde{\alpha}) &= \frac{V(\alpha)}{\sigma_{Y}^{2}}
\end{aligned}
$$
For example, a weakly informative prior on $\tilde{\alpha}$ implies a prior on $\alpha$,
$$
\tilde{\alpha} \sim N(0, 10^2) \Leftrightarrow \alpha \sim N \left( \frac{\beta \bar{X} - \bar{Y}}{\sigma_Y}, \sigma_Y^2 10^2 \right) .
$$

$$
\begin{aligned}[t]
E(\beta) &= \frac{\sigma_Y}{\sigma_X} E(\tilde{\beta})  , &
V(\beta) &= \frac{\sigma_Y^2}{\sigma_X^2} V(\tilde{\beta})  , \\
E(\tilde{\beta}) &= \frac{\sigma_X}{\sigma_Y} E(\beta)  , &
V(\tilde{\beta}) &= \frac{\sigma_X^2}{\sigma_Y^2} V(\beta) .
\end{aligned}
$$

For example, a weakly informative prior on $\tilde{\beta}$ implies the following prior on $\beta$,
$$
\tilde{\beta} \sim N(0, 2.5^2) \Leftrightarrow \beta \sim N\left(0, \frac{\sigma_Y^2}{\sigma_X^2} 2.5^2 \right) .
$$

$$
\begin{aligned}[t]
E(\epsilon) &= 0 , & V(\epsilon) &= \sigma_Y^2 V(\tilde{\epsilon}), \\
E(\tilde{\epsilon}) &= 0 , & V(\tilde{\epsilon}) &= \sigma_Y^{-2}  V(\epsilon) .
\end{aligned}
$$
For example, a weakly informative prior on the variance of $\tilde{\epsilon}}$ implies a weakly informative prior on the variance of $\epsilon$,
$$
\sigma_{\tilde{\epsilon}} \sim C^{+}\left(0, 5 \right) \Leftrightarrow 
\sigma_{\epsilon} \sim C^{+}\left(0, 5 \sigma_Y \right) .
$$

All of the above calculations are a little sloppy since I'm treating the other random variables as constants when deriving these (basically assuming $\alpha$ and $\beta$ are independent).