-
Notifications
You must be signed in to change notification settings - Fork 4
/
5-diagnostics.Rtex
271 lines (175 loc) · 10.5 KB
/
5-diagnostics.Rtex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
\frame{
\sffamily
\frametitle{Diagnostics for MCMC algorithms}
\begin{itemize}
\item Recall that, although Metropolis-Hastings algorithm work if we run them long enough, the fact that samples are dependent presents challenges:
\begin{itemize}
\item Convergence.
\item Mixing.
\end{itemize}
\item \alert{Mixing:} As with every Monte Carlo algorithm an important question is how many samples should be generated in order to have estimates of the parameters with a reasonable level of accuracy.
\begin{itemize}
\item However, because samples are (most usually positively) correlated, standard theory (standard central limit theorems, Chebyshev's inequality, etc.) does not apply here!
\item Mixing is related to how fast the chain explores the posterior distribution.
\end{itemize}
\end{itemize}
}
\frame{
\sffamily
\frametitle{Diagnostics for MCMC algorithms}
\begin{itemize}
\item \alert{Convergence:} An additional question when using MCMC algorithms that is not a concern in standard Monte Carlo is how many of the initial observations need to be discarded before we can perform any inference.
\begin{itemize}
\item Remember that, unless your initial state is sampled from the stationary distribution, the samples of your chain are only distributed as the stationary distribution when the number of iterations goes to infinity.
\item In practice, if your posterior is unimodal, you just want your initial state to be close to the high-probability areas.
\item When posterior is multimodal, convergence can be a huge issue.
\end{itemize}
\end{itemize}
}
\frame{
\sffamily
\frametitle{Diagnostics for MCMC algorithms}
\begin{itemize}
\item These two questions are interrelated, but are not identical. For example, you can accelerate convergence to the posterior by carefully selecting the initial state of your chain even if the chain mixes slowly. On the other hand, even if your initial state is really distributed from the stationary distribution (e.g., by using perfect sampling), you might need to run a very long chain to have an acceptable level of uncertainty in your estimates if the autocorrelation in the parameters of interest is very high.
\item Time series plots of the realized chains (the so-called \textit{trace} plots) provide visual intuition about these two concepts, and can be used as an informal check (which should be supplemented with some of the formal techniques discussed later).
\end{itemize}
}
\frame{
\sffamily
\frametitle{Diagnostics for MCMC algorithms}
\includegraphics[height=5.4cm,angle=0]{plots/mod_aucor_good_initial.pdf}
\includegraphics[height=5.4cm,angle=0]{plots/high_aucor_bad_initial.pdf}
}
\frame{
\sffamily
\frametitle{Checking for convergence}
\begin{itemize}
\item Generally speaking, determining \textit{a priori} how many iterations are needed for burn-in in not possible. Indeed, except for some special cases, general theory about the behavior of the algorithm (e.g., the value of the second largest eigenvalue of the kernel function) is not easy to derive.
\item For this reason, most practical algorithms to monitor convergence look at the time-series behavior of a small number of selected parameters (or functions of parameters).
\begin{itemize}
\item Selecting which parameters need to be monitored is tricky and model dependent. In high-dimensional models I usually monitor either the likelihood or the posterior, a small (random) subset of the main first-stage parameters, and a couple of important hyperparameters.
\item All criteria here are implemented in the \texttt{R} package \texttt{CODA}.
\end{itemize}
\item We typically cannot ensure convergence, just argue that there is no evidence of \text{lack of convergence}.
\end{itemize}
}
\frame{
\sffamily
\frametitle{Checking for convergence (Multi-chain methods)}
\begin{itemize}
\item As the name suggests multi-chain methods use $M > 1$ realizations of the algorithm (each of length $B$, typically started from ``over dispersed'' values) and compare the output of the chains. Once the statistical properties of the realizations appear to be similar, the chains are deemed to have converged.
\item One example of this type of approach was introduced by \cite{GeRu92} who propose to monitor a statistic that compares the average within-chain variability against the between-chain variability:
$$
R = \left( \frac{B-1}{B} + \frac{M+1}{M} \frac{Z_B}{W_B} \right) \frac{\nu_B}{\nu_B - 2} ,
$$
where $Z_B$ is an estimate of between-chain variability, $W_B$ is an estimate of within-chain variability and $\nu_B$ is the approximate number of degrees of freedom.
\end{itemize}
}
\frame{
\sffamily
\frametitle{Checking for convergence (Multi-chain methods)}
If $\xi = h(\theta)$ is the quantity of interest being monitored, define
\begin{align*}
\bar{\xi}_m &= \frac{1}{B} \sum_{b=1}^{B} \xi_m^{(b)} & \bar{\xi} &= \frac{1}{M} \sum_{m=1}^{M} \bar{\xi}_m & s_m^2 &= \frac{1}{B} \sum_{b=1}^{B} \left(\xi_m^{(b)} - \bar{\xi}\right)^2
\end{align*}
Then
\begin{align*}
Z_B &= \frac{1}{M} \sum_{m=1}^{M} \left(\bar{\xi}_m - \bar{\xi}\right)^2 & W_B &= \frac{1}{M} \sum_{m=1}^{M} s_m^2
\end{align*}
and
$$
\nu_B = \frac{2\left( \frac{B-1}{B}W_B + \frac{M+1}{M} Z_B \right)^2}{W_B}
$$
}
\frame{
\sffamily
\frametitle{Checking for convergence (Multi-chain methods)}
\begin{itemize}
\item We have $B Z_B/W_B \sim F_{M-1, 2W_B^2/\varpi_B}$ approximately, where
$$
\varpi_B = \frac{1}{M^2} \left[ \sum_{m=1}^{M} s_m^4 - \frac{1}{M} \left( \sum_{m=1}^{M} s_m^2 \right)^2 \right]
$$
\item Once the chains have converged we expect $R \to 1$, so the approximate distribution can be used to test convergence at different values of $B$.
\item Samples from different chains are then combined together for computing any expectation of interest.
\end{itemize}
}
\begin{frame}[fragile]
\sffamily
\frametitle{Checking for convergence (Multi-chain methods)}
\begin{center}
\hspace{-0.9cm}
\begin{tabular}{lr}
\begin{minipage}{5.5cm}
%% begin.rcode bliss-setup, include=FALSE
%% end.rcode
%% begin.rcode bliss-model, include=FALSE
%% end.rcode
%% begin.rcode bliss-mcmc-trialrun, include=FALSE
%% end.rcode
%% begin.rcode bliss-mcmc-twoStarts, include=FALSE
%% end.rcode
%% begin.rcode bliss-gelman-rubin, size='tiny'
%% end.rcode
\end{minipage}
%&
\begin{minipage}{5.5cm}
\includegraphics[width=5.4cm,angle=0]{plots/bliss-gelman-rubin.pdf}
\end{minipage}
\end{tabular}
\end{center}
\end{frame}
\frame{
\sffamily
\frametitle{Checking for convergence (Multi-chain methods)}
\begin{itemize}
\item I tend to prefer multi-chain methods. One advantage is that they can be naively parallelized.
\item However, finding over-dispersed initial points can be hard. For example, if the posterior is multimodal and you do not start the chain at least once on each mode you might never find out.
\item For slow-mixing chains, inferences based on a single long chain might be more accurate than the inferences based on many small chains put together. (However, this is an issue mostly if computational resources are limited.)
\end{itemize}
}
\frame{
\sffamily
\frametitle{Checking for convergence (Single-chain methods)}
\begin{itemize}
\item An alternative to multi-chain methods is to use a single long chain and see if the properties of the chain have ``stabilized''.
\item More specifically, we can divide the original chain of length $B$ into $M$ sections of approximately equal length and compare their spectral densities.
\item In particular \cite{Ge92} proposes to compare the first and second half of the chain (i.e., use $M=2$). The algorithm is implemented in the \texttt{R} function \texttt{geweke.diag} and \texttt{geweke.plot}.
\item Interestingly, the most recent edition of Gelman et al. Bayesian Data Analysis suggests to use the Gelman-Rubin approach with each chain split into two to assess both mixing and convergence.
\end{itemize}
}
\frame{
\sffamily
\frametitle{Mixing}
\begin{itemize}
\item Again, it must be emphasized that mixing is a subtly different problem from convergence.
\item Note that a chain that mixes slowly, if started on a low-probability region of the space, might take a long time to converge. However, if we start it in a high probability region convergence is much less of an issue.
\item A simple way to decide how long the algorithm must be run \textit{after} the burn-in period is to evaluate the variance of the estimators.
\end{itemize}
}
\frame{
\sffamily
\frametitle{Mixing}
\begin{itemize}
\item Let $h$ be an integrable function. Recall that we aim at approximating $\E_{\bftheta \mid \bfy}\left\{ h(\bftheta) \right\} \approx \frac{1}{B} \sum_{b=1}^{B} h\left(\bftheta^{(b)}\right)$
\vspace{1mm}
\item If the samples $\bftheta^{(1)}, \ldots, \bftheta^{(B)}$ were independent, the variance of the estimator would be equal to $\frac{\Var \left\{ h(\bftheta) \right\} }{B}$. In practice a rough estimate of $\Var \left\{ h(\bftheta) \right\}$ can also be obtained by Monte Carlo.
\vspace{1mm}
\item From Markov chain theory, it is easy to show that for an MCMC that has converged the variance of the same estimator is $\tau_h \frac{\Var h(\bftheta)}{ B}$ where $\tau_h$ is the integrated autocorrelation: $\tau_h = 1 + 2 \sum_{t=1}^{\infty} \rho_t $, with $\rho_t$ the autocorrelation at lag $t$.
\item We call $\frac{B}{\tau_h}$ the equivalent sample size.
\end{itemize}
%so we call $\frac{n}{\tau_g}$ the effective sample size. Issues: highly dependent on $g$, it does not take into account the execution time of each step.
% For any $g$ it is possible to show that $\tau_g \ge \frac{1- \lambda_2}{1 + \lambda_2}$, where $\lambda_2 is the second largest eigenvalue
}
\frame{
\sffamily
\frametitle{Mixing}
\begin{itemize}
\item The equivalent sample size calculation is implemented in \texttt{CODA} through the function \texttt{effectiveSize}.
\item Note that the equivalent sample size will be different for different parameters. That means that a given number of iterations might be enough for one parameter, but not for another.
\item Some people like to ``thin" their chain (retain every X number of iterations and drop the rest) as a way to decrease $\tau_h$. However, this also reduces $B$!
\begin{itemize}
\item \citet{maceachern1994subsampling} proved the tradeoff is not favorable.
\item Unless storage (in memory or on disk) is an issue, thinning is not a good idea.
\end{itemize}
\end{itemize}
}