-
Notifications
You must be signed in to change notification settings - Fork 1
/
stancon2019-intro.Rmd
1042 lines (821 loc) · 32.8 KB
/
stancon2019-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "StanCon 2019"
author: "Cambridge, UK"
date: "August 20-23"
output:
html_document:
toc: true
toc_depth: 2
---
## Setup
```{r setup, results="hide", message=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
dev = "png",
dpi = 150,
fig.align = "center",
comment = NA
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
# load data
pest_data <- readRDS('data/pest_data.RDS')
standata_hier <- readRDS('data/standata_hier.RDS')
```
## The problem
### Background
Imagine that you are a statistician or data scientist working as an independent
contractor. One of your clients is a company that owns many residential buildings
throughout New York City. The property manager explains that they are concerned about the number
of cockroach complaints that they receive from their buildings. Previously
the company has offered monthly visits from a pest inspector as a solution to
this problem. While this is the default solution of many property managers in
NYC, the tenants are rarely home when the inspector visits, and so the manager
reasons that this is a relatively expensive solution that is currently not very
effective.
One alternative to this problem is to deploy long term bait stations. In this
alternative, child and pet safe bait stations are installed throughout the
apartment building. Cockroaches obtain quick acting poison from these stations
and distribute it throughout the colony. The manufacturer of these bait stations
provides some indication of the space-to-bait efficacy, but the manager suspects
that this guidance was not calculated with NYC roaches in mind. NYC roaches, the
manager rationalizes, have more hustle than traditional roaches; and NYC
buildings are built differently than other common residential buildings in the
US. This is particularly important as the uni§t cost for each bait station per
year is quite high.
### The goal
The manager wishes to employ your services to help them to find the optimal
number of roach bait stations they should place in each of their buildings in
order to minimize the number of cockroach complaints while also keeping
expenditure on pest control affordable.
A subset of the company's buildings have been randomly selected for an experiment:
* At the beginning of each month, a pest inspector randomly places a number of
bait stations throughout the building, without knowledge of the current
cockroach levels in the building
* At the end of the month, the manager records
the total number of cockroach complaints in that building.
* The manager would like to determine the optimal number of traps ($\textrm{traps}$) that
balances the lost revenue ($R$) that complaints ($\textrm{complaints}$) generate
with the all-in cost of maintaining the traps ($\textrm{TC}$).
Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.
Formally, we are interested in finding
$$
\arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})]
$$
The property manager would also, if possible, like to learn how these results
generalize to buildings they haven't treated so they can understand the
potential costs of pest control at buildings they are acquiring as well as for
the rest of their building portfolio.
As the property manager has complete control over the number of traps set, the
random variable contributing to this expectation is the number of complaints
given the number of traps. We will model the number of complaints as a function
of the number of traps.
## The data
The data provided to us is in a file called `pest_data.RDS`. Let's
load the data and see what the structure is:
```{r load-data}
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
```
We have access to the following fields:
* `complaints`: Number of complaints per building per month
* `building_id`: The unique building identifier
* `traps`: The number of traps used per month per building
* `date`: The date at which the number of complaints are recorded
* `live_in_super`: An indicator for whether the building as a live-in super
* `age_of_building`: The age of the building
* `total_sq_foot`: The total square footage of the building
* `average_tenant_age`: The average age of the tenants per building
* `monthly_average_rent`: The average monthly rent per building
* `floors`: The number of floors per building
First, let's see how many buildings we have in the data:
```{r describe-data}
length(unique(pest_data$building_id))
```
And make some plots of the raw data:
```{r data-plot-1}
ggplot(pest_data, aes(x = complaints)) +
geom_bar()
```
```{r data-plot-2}
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) +
geom_jitter()
```
```{r data-plot-ts, fig.width = 6, fig.height = 8}
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
geom_line(aes(linetype = "Number of complaints")) +
geom_point(color = "black") +
geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) +
facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
scale_x_date(name = "Month", date_labels = "%b") +
scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "Live-in super")
```
The first question we'll look at is just whether the number of complaints per
building per month is associated with the number of bait stations per building
per month, ignoring the temporal and across-building variation (we'll get to
that later). This requires only two variables, $\textrm{complaints}$ and
$\textrm{traps}$. How should we model the number of complaints?
## Modeling count data : Poisson distribution
We already know some rudimentary information about what we should expect. The
number of complaints over a month should be either zero or an integer. The
property manager tells us that it is possible but unlikely that number of
complaints in a given month is zero. Occasionally there are a very large number
of complaints in a single month. A common way of modeling this sort of skewed,
single bounded count data is as a Poisson random variable. One concern about
modeling the outcome variable as Poisson is that the data may be
over-dispersed, but we'll start with the Poisson model and then check
whether over-dispersion is a problem by comparing our model's predictions
to the data.
### Model
Given that we have chosen a Poisson regression, we define the likelihood to be
the Poisson probability mass function over the number bait stations placed in
the building, denoted below as `traps`. This model assumes that the mean and
variance of the outcome variable `complaints` (number of complaints) is the
same. We'll investigate whether this is a good assumption after we fit the
model.
For building $b = 1,\dots,10$ at time (month) $t = 1,\dots,12$, we have
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
### Prior predictive checks
Before we fit the model to real data, we should check that our priors and model
can generate plausible simulated data.
In R we can simulate from the prior predictive distribution using a
function like this:
```{r simulate-poisson-data}
# using normal distributions for priors on alpha and beta
simple_poisson_dpg <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
N <- length(traps)
alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
beta <- rnorm(1, mean = beta_mean, sd = beta_sd);
complaints <- rpois(N, lambda = exp(alpha + beta * traps))
return(complaints)
}
```
Try with $N(0,10)$ priors on both parameters:
```{r prior-pred-really-bad}
# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dpg(
pest_data$traps,
alpha_mean = 0,
alpha_sd = 10,
beta_mean = 0,
beta_sd = 10
)
```
Try with $N(0,1)$ priors on both parameters:
```{r prior-pred-bad}
simple_poisson_dpg(
pest_data$traps,
alpha_mean = 0,
alpha_sd = 1,
beta_mean = 0,
beta_sd = 1
)
```
Let's calculate some summary statistics of 1000 data sets generated according
to the $N(0,1)$ priors:
```{r prior-pred-check}
prior_preds <- t(replicate(1000, expr = {
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = 0,
alpha_sd = 10,
beta_mean = 0,
beta_sd = 10
)
}))
dim(prior_preds)
```
```{r prior-pred-mean}
summary(apply(prior_preds, 1, mean))
```
```{r prior-pred-min}
summary(apply(prior_preds, 1, min))
```
```{r prior-pred-max}
hist(apply(prior_preds, 1, max))
```
A more reasonable (though still quite flexible) prior:
```{r}
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = log(7),
alpha_sd = 0.5,
beta_mean = -0.25,
beta_sd = 0.5
)
```
Simulate 1000 times and calculate summary stats:
```{r prior-pred-better}
prior_preds <- t(replicate(1000, expr = {
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = log(7),
alpha_sd = 0.5,
beta_mean = -0.25,
beta_sd = 0.5
)
}))
```
```{r}
summary(apply(prior_preds, 1, mean))
```
```{r}
summary(apply(prior_preds, 1, min))
```
```{r}
summary(apply(prior_preds, 1, max))
```
### Writing and fitting our first Stan program
Let's encode the model in a Stan program.
* Write `simple_poisson_regression.stan` together, put in `stan_programs` directory.
```{r compile-simple-poisson}
comp_simple <- stan_model('stan_programs/simple_poisson_regression.stan')
```
To fit the model to the data we'll first create a list to pass
to Stan using the variables in the `pest_data` data frame. The names of the
list elements must match the names used in the `data` block of the Stan program.
```{r stan-data}
standata_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps
)
str(standata_simple)
```
We have already compiled the model, so we can jump straight to sampling from it.
```{r fit_simple, cache=TRUE}
fit_simple <- sampling(comp_simple, data = standata_simple,
# these are the defaults but specifying them anyway
# so you can see how to use them:
# posterior sample size = chains * (iter-warmup)
chains = 4, iter = 2000, warmup = 1000)
```
Print the summary of the intercept and slope parameters:
```{r results_simple_P}
print(fit_simple, pars = c('alpha','beta'))
```
We can also plot the posterior distributions:
```{r mcmc_hist}
# https://mc-stan.org/bayesplot/reference/MCMC-distributions
draws <- as.matrix(fit_simple, pars = c('alpha','beta'))
head(draws)
mcmc_hist(draws) # marginal posteriors of alpha and beta
mcmc_scatter(draws, alpha = 0.2, size = 1) # posterior of (alpha,beta)
```
And compare them to draws from the prior:
```{r compare-prior-posterior}
alpha_prior_post <- cbind(alpha_prior = rnorm(4000, log(7), 1),
alpha_posterior = draws[, "alpha"])
mcmc_hist(alpha_prior_post, facet_args = list(nrow = 2), binwidth = 0.1) +
xlim(range(alpha_prior_post))
beta_prior_post <- cbind(beta_prior = rnorm(4000, -0.25, 0.5),
beta_posterior = draws[, "beta"])
mcmc_hist(beta_prior_post, facet_args = list(nrow = 2), binwidth = 0.05) +
xlim(range(beta_prior_post))
```
From the posterior of `beta`, it appears the number of bait stations set in a
building is associated with the number of complaints about cockroaches that were
made in the following month. However, we still need to consider how well the
model fits.
### Posterior predictive checking
```{r y_rep_simple}
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various ways
# of extracting contents from stanfit objects
y_rep <- as.matrix(fit_simple, pars = "y_rep")
```
```{r marginal_PPC}
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])
```
In the plot above we have the kernel density estimate of the observed data ($y$,
thicker curve) and 200 simulated data sets ($y_{rep}$, thin curves) from the
posterior predictive distribution. Here the posterior predictive
simulations are not as dispersed as the real data and don't seem to capture the
rate of zeros well at all. This suggests the Poisson model may not be sufficiently
flexible for this data.
Let's explore this further by looking directly at the proportion of zeros in the
real data and predicted data.
```{r ppc_stat-prop_zero}
# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)
# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
y = standata_simple$complaints,
yrep = y_rep,
stat = "prop_zero",
binwidth = .01
)
```
The plot above shows the observed proportion of zeros (thick vertical line) and
a histogram of the proportion of zeros in each of the simulated data sets. It is
clear that the model does not capture this feature of the data well at all.
The rootogram is another useful plot to compare the observed vs expected number
of complaints. This is a plot of the square root of the expected counts
(continuous line) vs the observed counts (blue histogram)
```{r rootogram}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)
```
If the model was fitting well these would be relatively similar, however in this
figure we can see the number of complaints is underestimated if there are few
complaints, over-estimated for medium numbers of complaints, and underestimated
if there are a large number of complaints.
The `ppc_bars()` function will make a bar plot of the observed values and
overlay prediction intervals but not on the square root scale (unlike the
rootogram):
```{r ppc_bars}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep)
```
We can also view how the predicted number of complaints varies with the number
of traps:
```{r intervals}
ppc_intervals(y = standata_simple$complaints, yrep = y_rep,
# jitter number of traps since multiple observations at some values
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)) +
labs(x = "Number of traps", y = "Number of complaints")
```
## Expanding the model: multiple predictors
Currently, our model's mean parameter is a rate of complaints per 30 days, but
we're modeling a process that occurs over an area as well as over time. We have
the square footage of each building, so if we add that information into the
model, we can interpret our parameters as a rate of complaints per square foot
per 30 days.
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
The term $\text{sq_foot}$ is called an exposure term. If we log the term, we can
put it in $\eta_{b,t}$:
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b
\end{align*}
$$
We will also include whether there is a live-in super or not as a predictor
for the number of complaints, which gives us gives us:
$$
\eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} +
\beta_{\rm super} \textrm{live_in_super}_{b,t} +
\textrm{log_sq_foot}_b
$$
Add these new variables to the data list for Stan.
```{r add-predictors-to-standata}
standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super
```
### Stan program for Poisson multiple regression
Now we need a new Stan program that uses multiple predictors.
* Write `multiple_poisson_regression.stan`
```{r compile-multi-poisson}
comp_multi <- stan_model('stan_programs/multiple_poisson_regression.stan')
```
### Fit the Poisson multiple regression
```{r fit_multi}
fit_multi <- sampling(comp_multi, data = standata_simple,
refresh = 100) # turn off printed progress updates
print(fit_multi, pars = c("alpha", "beta", "beta_super"))
```
```{r ppc_dens_overlay-random-subset}
y_rep <- as.matrix(fit_multi, pars = "y_rep")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])
# in this case we have very high effective sample sizes, but if there is
# nontrivial autocorrelation then it's better to take a random sample of the draws,
# for example:
ids <- sample(nrow(y_rep), size = 200)
ppc_dens_overlay(standata_simple$complaints, y_rep[ids, ])
```
This again looks like we haven't captured the smaller counts very well, nor
have we captured the larger counts.
We're still severely underestimating the proportion of zeros in the data:
```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = standata_simple$complaints, yrep = y_rep,
stat = "prop_zero", binwidth = 0.01)
```
Ideally this vertical line would fall somewhere within the histogram.
We can also plot uncertainty intervals for the predicted complaints for different
numbers of traps.
```{r}
ppc_intervals(
y = standata_simple$complaints,
yrep = y_rep,
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) +
labs(x = "Number of traps", y = "Number of complaints")
```
We can see that we've increased the tails a bit more at the larger numbers of traps
but we still have some large observed numbers of complaints that the model
would consider extremely unlikely events.
## Modeling count data with the Negative Binomial
When we considered modeling the data using a Poisson, we saw that the model
didn't appear to fit as well to the data as we would like. In particular the
model underpredicted low and high numbers of complaints, and overpredicted the
medium number of complaints. This is one indication of over-dispersion, where
the variance is larger than the mean. A Poisson model doesn't fit over-dispersed
count data very well because the same parameter $\lambda$, controls both the
expected counts and the variance of these counts. The natural alternative to
this is the negative binomial model:
$$
\begin{align*}
\text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b}
\end{align*}
$$
In Stan the negative binomial mass function we'll use is called
$\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)$
in Stan. Like the `poisson_log` function, this negative binomial mass function
that is parameterized in terms of its log-mean, $\eta$, but it also has a
precision $\phi$ such that
$$
\mathbb{E}[y] \, = \lambda = \exp(\eta)
$$
$$
\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.
$$
As $\phi$ gets larger the term $\lambda^2 / \phi$ approaches zero and so
the variance of the negative-binomial approaches $\lambda$, i.e., the
negative-binomial gets closer and closer to the Poisson.
### Stan program for negative-binomial regression
* Write `multiple_NB_regression.stan` together
```{r compile-multi-NB, cache=TRUE, results="hide", message=FALSE}
comp_multi_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
```
### Fit to data and check the fit
```{r fit_multi_NB}
fit_multi_NB <- sampling(comp_multi_NB, data = standata_simple, refresh = 250)
# to demonstrate extracting as a list instead of using as.matrix
samps_NB <- rstan::extract(fit_multi_NB)
```
Let's look at our predictions vs. the data.
```{r ppc_dens_overlay_NB}
ppc_dens_overlay(y = standata_simple$complaints, yrep = samps_NB$y_rep[1:200, ])
```
It appears that our model now captures both the number of small counts better
as well as the tails.
Let's check the proportion of zeros:
```{r prop_zero_NB}
ppc_stat(y = standata_simple$complaints, yrep = samps_NB$y_rep,
stat = "prop_zero", binwidth = 0.01)
```
Check predictions by number of traps:
```{r}
ppc_intervals(
y = standata_simple$complaints,
yrep = samps_NB$y_rep,
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) +
labs(x = "Number of traps (jittered)", y = "Number of complaints")
```
We haven't used the fact that the data are clustered by building yet. A posterior
predictive check might elucidate whether it would be a good idea to add the building
information into the model.
```{r ppc-group-means, fig.width=7, fig.height=6}
ppc_stat_grouped(
y = standata_simple$complaints,
yrep = samps_NB$y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
```
We're getting plausible predictions for most building means but some are
estimated better than others and some have larger uncertainties than we might
expect. If we explicitly model the variation across buildings we may be able to
get much better estimates.
## Hierarchical modeling
### Modeling varying intercepts for each building
Let's add a hierarchical intercept parameter, $\alpha_b$ at the building level
to our model.
$$
\text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} = \exp{(\eta_{b,t})} \\
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\
\mu_b \sim \text{Normal}(\alpha, \sigma_{\mu})
$$
In our Stan model, $\mu_b$ is the $b$-th element of the vector
$\texttt{mu}$ which has one element per building.
One of our predictors varies only by building, so we can rewrite the above model
more efficiently like so:
$$
\eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\
\mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu})
$$
We have more information at the building level as well, like the average age of
the residents, the average age of the buildings, and the average per-apartment
monthly rent so we can add that data into a matrix called
`building_data`, which will have one row per building and four columns:
* `live_in_super`
* `age_of_building`
* `average_tentant_age`
* `monthly_average_rent`
We'll write the Stan model like:
$$
\eta_{b,t} = \mu_b + \beta \, {\rm traps} + \text{log_sq_foot}\\
\mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu})
$$
### Prepare building data for hierarchical model
We'll need to do some more data prep before we can fit our models. Firstly to
use the building variable in Stan we will need to transform it from a factor
variable to an integer variable.
```{r prep-data}
N_months <- length(unique(pest_data$date))
N_buildings <- length(unique(pest_data$building_id))
# Add some IDs for building and month
pest_data <- pest_data %>%
mutate(
building_fac = factor(building_id, levels = unique(building_id)),
building_idx = as.integer(building_fac),
ids = rep(1:N_months, N_buildings),
mo_idx = lubridate::month(date)
)
# Center and rescale the building specific data
building_data <- pest_data %>%
select(
building_idx,
live_in_super,
age_of_building,
average_tenant_age,
monthly_average_rent
) %>%
unique() %>%
arrange(building_idx) %>%
select(-building_idx) %>%
scale(scale=FALSE) %>%
as.data.frame() %>%
mutate( # scale by constants to put variables on similar scales
age_of_building = age_of_building / 10,
average_tenant_age = average_tenant_age / 10,
monthly_average_rent = monthly_average_rent / 1000
) %>%
as.matrix()
# Make data list for Stan
standata_hier <-
with(pest_data,
list(complaints = complaints,
N = length(complaints),
traps = traps,
log_sq_foot = log(pest_data$total_sq_foot/1e4),
M = N_months,
mo_idx = as.integer(as.factor(date)),
J = N_buildings,
K = ncol(building_data),
building_idx = building_idx,
building_data = building_data
)
)
```
### Compile and fit the hierarchical model
Let's compile the model.
```{r comp_hier_NB, results="hide", message=FALSE}
comp_hier_NB <- stan_model('stan_programs/hier_NB_regression.stan')
```
Fit the model to data.
```{r run-NB-hier}
fit_hier_NB <-
sampling(
comp_hier_NB,
data = standata_hier,
refresh = 1000, # only report progress every 1000 iterations
# can run markov chains in parallel. you can check how many cores you
# have available using parallel::detectCores()
cores = 4
)
```
### Diagnostics
We get a bunch of warnings from Stan about divergent transitions, which is
an indication that there may be regions of the posterior that have not been
explored by the Markov chains.
Divergences are discussed in more detail in the course slides as well as
the **bayesplot** [MCMC diagnostics vignette](http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
and [_A Conceptual Introduction to Hamiltonian Monte Carlo_](https://arxiv.org/abs/1701.02434).
In this example we will see that we have divergent transitions because we need to
reparameterize our model - i.e., we will retain the overall structure of the
model, but transform some of the parameters so that it is easier for Stan to
sample from the parameter space. Before we go through exactly how to do this
reparameterization, we will first go through what indicates that this is
something that reparameterization will resolve. We will go through:
1. Examining the fitted parameter values, including the effective sample size
2. Plots that reveal particular patterns in locations of the divergences.
Then we print the fits for the parameters that are of most interest.
```{r fit_hier_NB-low-ESS}
print(fit_hier_NB, pars = c('sigma_mu','beta','alpha','phi','mu'))
```
You can see that the effective sample sizes are quite low for many of the parameters
relative to the total number of samples. This alone isn't indicative of the need
to reparameterize, but it indicates that we should look further at the trace
plots and pairs plots. First let's look at the traceplots to see if the divergent
transitions form a pattern.
```{r traceplot-with-divs}
# change bayesplot color scheme (to make chains very different colors)
color_scheme_set("brewer-Spectral")
mcmc_trace(
# use as.array to keep the markov chains separate for trace plots
as.array(fit_hier_NB,pars = 'sigma_mu'),
np = nuts_params(fit_hier_NB),
window = c(500,1000) # zoom into a window to see more clearly
)
color_scheme_set("blue")
```
Looks like the divergences (marked in red below the traceplot)
correspond to spots where sampler gets stuck.
```{r, scatter-with-divs}
# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
as.array(fit_hier_NB),
pars = c("mu[3]", "sigma_mu"),
# look at sigma on log-scale (what Stan uses under the hood for positive parameters)
transform = list("sigma_mu" = "log"),
size = 1,
alpha = 0.3,
np = nuts_params(fit_hier_NB),
np_style = scatter_style_np(div_size = 1.5) # change style of divergences points
)
scatter_with_divs + coord_equal()
```
What we have here is a cloud-like shape, with most of the divergences clustering
towards the bottom. We'll see a bit later that we actually want this to look more
like a funnel than a cloud, but the divergences are indicating that the sampler
can't explore the narrowing neck of the funnel.
One way to see why we should expect some version of a funnel is to look at some
simulations from the _prior_, which we can do without MCMC and thus with no risk
of sampling problems:
```{r}
N_sims <- 1000
sigma_mu_prior <- rep(NA, N_sims)
mu_prior <- rep(NA, N_sims)
for (j in 1:N_sims) {
sigma_mu_prior[j] <- abs(rnorm(1, mean = 0, sd = 1))
mu_prior[j] <- rnorm(1, mean = 0, sd = sigma_mu_prior[j])
}
prior_draws <- cbind("mu" = mu_prior, "log(sigma_mu)" = log(sigma_mu_prior))
mcmc_scatter(prior_draws)
```
Of course, if the data is at all informative we shouldn't expect the posterior
to look exactly like the prior. But unless the data is incredibly informative
about the parameters and the posterior concentrates away from the narrow neck of
the funnel, the sampler is going to have to confront the funnel geometry.
(See the bayesplot
[Visual MCMC Diagnostics](https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html)
vignette for more on this.)
Another way to look at the divergences is via a parallel coordinates plot.
```{r}
# https://mc-stan.org/bayesplot/reference/MCMC-parcoord.html#plot-descriptions
# this plot is a bit computationally intensive
parcoord_with_divs <- mcmc_parcoord(
as.array(fit_hier_NB, pars = c("sigma_mu", "mu")),
np = nuts_params(fit_hier_NB)
)
parcoord_with_divs
```
Again, we see evidence that our problems concentrate when $\texttt{sigma_mu}$ is small.
### Reparameterize and recheck diagnostics
Instead, we should use the non-centered parameterization for $\mu_b$. We
define a vector of auxiliary variables in the parameters block,
$\texttt{mu_raw}$ that is given a $\text{Normal}(0, 1)$ prior in the model
block. We then make $\texttt{mu}$ a transformed parameter:
We can reparameterize the random intercept $\mu_b$, which is distributed:
$$
\mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu})
$$
```
transformed parameters {
vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
```
This implies $\texttt{mu}$ is
$\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)$
distribution, but it decouples the dependence of the density of each element of
$\texttt{mu}$ from $\texttt{sigma_mu}$ ($\sigma_\mu$).
hier_NB_regression_ncp.stan uses the non-centered parameterization for
$\texttt{mu}$. We will examine the effective sample size of the fitted model to
see whether we've fixed the problem with our reparameterization.
```{r comp_hier_NB_ncp}
comp_hier_NB_ncp <- stan_model('stan_programs/hier_NB_regression_ncp.stan')
```
Fit the model to the data.
```{r fit_hier_NB_ncp}
fit_hier_NB_ncp <- sampling(comp_hier_NB_ncp, data = standata_hier, cores = 4,
refresh = 1000)
```
You should have no divergences (or only a few) this time and much improved
effective sample sizes:
```{r fit_hier_NB_ncp-ESS}
print(fit_hier_NB_ncp, pars = c('sigma_mu','beta','alpha','phi','mu')) # optional 'probs' argument
```
Let's make the same scatter plot we made for the model with many divergences and
compare:
```{r compare-scatterplots}
scatter_no_divs <- mcmc_scatter(
as.array(fit_hier_NB_ncp),
pars = c("mu[3]", 'sigma_mu'),
transform = list('sigma_mu' = "log"),
size = 1,
alpha = 0.3,
np = nuts_params(fit_hier_NB_ncp)
)
bayesplot_grid(
scatter_with_divs, scatter_no_divs,
grid_args = list(ncol = 2),
ylim = c(-10, 1),
subtitles = c("Centered parameterization",
"Non-centered parameterization")
)
```
Comparing the plots, we see that the divergences were a sign that there was part
of the posterior that the chains were not exploring.
The marginal PPC plot, again.
```{r ppc-full-hier}
y_rep <- as.matrix(fit_hier_NB_ncp, pars = "y_rep")
ppc_dens_overlay(standata_hier$complaints, y_rep[1:200,])
```
This looks quite nice.
If we've captured the building-level means well, then the
posterior distribution of means by building should match well with the observed
means of the quantity of building complaints by month.
```{r ppc-group_means-hier}
ppc_stat_grouped(
y = standata_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.5
)
```
We weren't doing terribly with the building-specific means before, but now they
are all estimated a bit better by the model. The model is also able to do a
decent job estimating within-building variability and, for the most part,
the probability of zero complaints:
```{r}
ppc_stat_grouped(
y = standata_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'sd',
binwidth = 0.5
)
```
```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(
y = standata_hier$complaints,
yrep = y_rep,
stat = prop_zero,
binwidth = 0.02
)