Commit

add stan gp and particle filter

nicholasjclark committed Jun 10, 2022
1 parent e5bc971 commit 1d56920
Showing 87 changed files with 6,878 additions and 2,719 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: mvgam
Title: Multivariate Bayesian Generalised Additive Models for discrete time series
Version: 0.0.0.9000
Version: 1.0.0
Authors@R:
person(given = "Nicholas",
family = "Clark",
@@ -9,8 +9,8 @@ Authors@R:
Description: This package is for fitting Bayesian Generalised Additive Models to sets of discrete time series.
The primary purpose of the package is to build a version of dynamic factor models that incorporates
the flexibility of GAMs in the linear predictor and latent dynamic trend components that are useful
for time series analysis and forecasting. Estimation is performed using Markov Chain Monte Carlo (MCMC) by
the Gibbs sampling software JAGS. The package also includes
for time series analysis and forecasting. Estimation is performed using Markov Chain Monte Carlo (MCMC) with
either the Gibbs sampling software JAGS or with Hamiltonian Monte Carlo in the software Stan. The package also includes
utilities for online updating of forecast distributions with a recursive particle filter that
uses sequential importance sampling to assimilate new observations as they become available.
Maintainer: Nicholas J Clark <[email protected]>
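To make the online-updating workflow described above concrete, here is a minimal sketch using the particle-filter functions this commit exports (see the NAMESPACE changes below). The function names are real exports, but the argument names are assumptions for illustration rather than documented signatures:

```r
# Hedged sketch of the online-updating workflow. The function names are
# exported by the package, but the argument names below are illustrative
# assumptions, not documented signatures.
library(mvgam)

# fit <- mvgam(...)                                   # a previously fitted model
# pfilter_mvgam_init(object = fit, n_particles = 1000) # seed the particle filter
# pfilter_mvgam_online(data_assim = new_obs)          # assimilate new observations
# fc <- pfilter_mvgam_fc()                            # updated forecast distribution
```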
4 changes: 1 addition & 3 deletions NAMESPACE
@@ -1,6 +1,5 @@
# Generated by roxygen2: do not edit by hand

S3method(dic,mvgam)
S3method(plot,mvgam)
S3method(ppc,mvgam)
S3method(predict,mvgam)
@@ -15,11 +14,10 @@ export(add_stan_data)
export(add_trend_lines)
export(add_tweedie_lines)
export(compare_mvgams)
export(dic)
export(eval_mvgam)
export(get_monitor_pars)
export(lv_correlations)
export(mvjagam)
export(mvgam)
export(pfilter_mvgam_fc)
export(pfilter_mvgam_init)
export(pfilter_mvgam_online)
18 changes: 9 additions & 9 deletions NEON_manuscript/Case studies/mvgam_case_study1.Rmd
@@ -93,9 +93,9 @@ head(fake_data$data_train)
head(fake_data$data_test)
```

As a first pass at modelling this series, we will fit a GAM that includes a seasonal smooth and a yearly smooth, as well as their tensor product interaction. As the seasonality is cyclic, we will use a cyclic cubic regression spline for `season`. The knots are set so that the boundaries of the cyclic smooth match up between December 31 and January 1. We will stick with the default thin plate regression spline for `year`. This is similar to what we might do when fitting a model in `mgcv` to try and forecast ahead, except here we also have an explicit model for the residual component. `mvgam` uses the `jagam` function from the `mgcv` package to generate a skeleton `JAGS` file and updates that file to incorporate any dynamic trend components (so far limited to no trend, random walks or AR models up to order 3). This is advantageous as any GAM that is allowed in `mgcv` can in principle be used in `mvgam` to fit dynamic linear models with smooth functions for nonlinear covariate effects. For multivariate series, `mvgam` also includes an option for dimension reduction by inducing trend correlations via a dynamic factor process. Here we use the Negative Binomial family and a burnin length of `10000` iterations (in practice we would probably run these models for a bit longer, but they tend to converge rather quickly for univariate series thanks to the useful starting values supplied by `jagam`). Note that feeding the `data_test` object does not mean that these values are used in training of the model. Rather, they are included as `NA` so that we can automatically create a forecast from the posterior predictions for these observations. This is useful for plotting forecasts without needing to run new data through the model's equations later on
As a first pass at modelling this series, we will fit a GAM that includes a seasonal smooth and a yearly smooth, as well as their tensor product interaction. As the seasonality is cyclic, we will use a cyclic cubic regression spline for `season`. The knots are set so that the boundaries of the cyclic smooth match up between December 31 and January 1. We will stick with the default thin plate regression spline for `year`. This is similar to what we might do when fitting a model in `mgcv` to try and forecast ahead, except here we also have an explicit model for the residual component. `mvgam` uses the `jagam` function from the `mgcv` package to generate a skeleton `JAGS` file and updates that file to incorporate any dynamic trend components (so far limited to no trend, random walks or AR models up to order 3 for `JAGS` models). This is advantageous as any GAM that is allowed in `mgcv` can in principle be used in `mvgam` to fit dynamic linear models with smooth functions for nonlinear covariate effects. For multivariate series, `mvgam` also includes an option for dimension reduction by inducing trend correlations via a dynamic factor process. Here we use the Negative Binomial family and a burnin length of `10000` iterations (in practice we would probably run these models for a bit longer, but they tend to converge rather quickly for univariate series thanks to the useful starting values supplied by `jagam`). Note that feeding the `data_test` object does not mean that these values are used in training of the model. Rather, they are included as `NA` so that we can automatically create a forecast from the posterior predictions for these observations. This is useful for plotting forecasts without needing to run new data through the model's equations later on
```{r, message=FALSE, warning=FALSE}
mod1 <- mvjagam(data_train = fake_data$data_train,
mod1 <- mvgam(data_train = fake_data$data_train,
data_test = fake_data$data_test,
formula = y ~ s(season, bs = c('cc'), k = 12) +
s(year, k = 5) +
@@ -107,7 +107,7 @@ mod1 <- mvjagam(data_train = fake_data$data_train,
chains = 4)
```

We can view the `JAGS` model file that has been generated to see the additions that have been made to the base `gam` model. If the user selects `return_model_data = TRUE` when calling `mvjagam`, this file can be modified and the resulting `model_data` object can also be modified to fit more complex Bayesian models. Note that here the AR and phi terms have been set to zero as this model does not include a dynamic trend component
We can view the `JAGS` model file that has been generated to see the additions that have been made to the base `gam` model. If the user selects `return_model_data = TRUE` when calling `mvgam`, this file can be modified and the resulting `model_data` object can also be modified to fit more complex Bayesian models. Note that here the AR and phi terms have been set to zero as this model does not include a dynamic trend component
```{r}
mod1$model_file
```
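As a loose illustration of that workflow, a hedged sketch of editing the generated code by hand (the pattern matched below is hypothetical; inspect `mod1$model_file` to see the prior lines your model actually contains):

```r
# Hedged sketch: hand-editing the generated JAGS code before refitting.
# The prior strings matched here are hypothetical examples only.
modified_file <- mod1$model_file
modified_file <- gsub("dnorm(0, 0.01)", "dnorm(0, 0.1)",
                      modified_file, fixed = TRUE)
```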
@@ -201,7 +201,7 @@ ppc(mod1, series = 1, type = 'pit', data_test = fake_data$data_test)

There are some problems with the way this model is generating future predictions. Perhaps a different smooth function for `year` can help? Here we fit our original model again but use a different smooth term for `year` to try and capture the long-term trend (using `B` splines with multiple penalties, following the excellent example by Gavin Simpson about [extrapolating with smooth terms](https://fromthebottomoftheheap.net/2020/06/03/extrapolating-with-gams/)). This is similar to what we might do when trying to forecast ahead from a more wiggly function, as `B` splines have the useful property of allowing the penalty to be extended into the range of values we wish to predict (in this case, the years in `data_test`). Note that we drop the `year*season` tensor product interaction to help emphasize the extrapolation behaviour of the `B` spline
```{r, message=FALSE, warning=FALSE}
mod2 <- mvjagam(data_train = fake_data$data_train,
mod2 <- mvgam(data_train = fake_data$data_train,
data_test = fake_data$data_test,
formula = y ~ s(season, bs = c('cc'), k = 12) +
s(year, bs = "bs", m = c(2, 1)),
@@ -259,7 +259,7 @@ ppc(mod2, series = 1, type = 'pit', data_test = fake_data$data_test)

Can we improve the forecasts by removing our reliance on extrapolation? Now we will fit a model in which the GAM component of the linear predictor captures the repeated seasonality (again with a cyclic smooth) and a dynamic latent trend captures the temporal residual process using AR parameters (up to order 3). This model is a mildly modified version of the base `mgcv` model where the linear predictor is augmented with the latent trend component. Slightly longer burnin is used here due to the added complexity of the time series component, but the model still fits in ~ 30 seconds on most machines
```{r, message=FALSE, warning=FALSE}
mod3 <- mvjagam(data_train = fake_data$data_train,
mod3 <- mvgam(data_train = fake_data$data_train,
data_test = fake_data$data_test,
formula = y ~ s(season, bs = c('cc'), k = 12),
knots = list(season = c(0.5, 12.5)),
@@ -271,7 +271,7 @@ mod3 <- mvjagam(data_train = fake_data$data_train,
chains = 4)
```
In this case the fitted model is more different to the base `mgcv` model that was used in `jagam` to produce the skeleton `JAGS` file, so the summary of that base model is less accurate. But we can still check the model summary for the `mvjagam` mdoel to examine convergence for key parameters
In this case the fitted model differs more from the base `mgcv` model that was used in `jagam` to produce the skeleton `JAGS` file, so the summary of that base model is less accurate. But we can still check the model summary for the `mvgam` model to examine convergence for key parameters
```{r}
summary(mod3)
```
@@ -316,7 +316,7 @@ Benchmarking against "null" models is a very important part of evaluating a prop
```{r, message=FALSE, warning=FALSE}
fake_data$data_train$fake_cov <- rnorm(NROW(fake_data$data_train))
fake_data$data_test$fake_cov <- rnorm(NROW(fake_data$data_test))
mod4 <- mvjagam(data_train = fake_data$data_train,
mod4 <- mvgam(data_train = fake_data$data_train,
data_test = fake_data$data_test,
formula = y ~ s(fake_cov, k = 3),
family = 'poisson',
@@ -607,13 +607,13 @@ lynx_mgcv2_drps <- drps_mcmc_object(truth = lynx_test$population[1:10],
This forecast is highly overconfident, with very unrealistic uncertainty intervals due to the interpolation behaviours of the lag smooths. You can certainly keep trying different formulations (our experience is that the B spline variant above produces the best forecasts of any tested `mgcv` model, but we did not test an exhaustive set), but hopefully it is clear that forecasting using splines is tricky business, and each time you do it you are likely to end up homing in on different combinations of penalties, knot selections and so on. Now we will fit an `mvgam` model for comparison. This fits a model similar to the `mgcv` model directly above but with a full time series model for the errors (in this case an `AR1` process), rather than smoothing splines that do not incorporate a concept of the future. We do not use a `year` term, to reduce any possible extrapolation and because the latent dynamic component should capture this temporal variation.
```{r, message=FALSE, warning=FALSE}
lynx_mvgam <- mvjagam(data_train = lynx_train,
lynx_mvgam <- mvgam(data_train = lynx_train,
data_test = lynx_test,
formula = y ~ s(season, bs = 'cc', k = 19),
knots = list(season = c(0.5, 19.5)),
family = 'poisson',
trend_model = 'AR1',
burnin = 5000,
burnin = 8000,
chains = 4)
```
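To make the `AR1` error process concrete, here is a minimal simulation sketch (the `ar1` and `sigma` values are arbitrary illustrations, not estimates from the model above):

```r
# Minimal sketch: simulating the kind of AR(1) latent trend assumed above.
# Parameter values are arbitrary illustrations, not model estimates.
set.seed(42)
n_time <- 50
ar1 <- 0.7      # autoregressive coefficient
sigma <- 0.3    # innovation standard deviation
trend <- numeric(n_time)
for (t in 2:n_time) {
  trend[t] <- ar1 * trend[t - 1] + rnorm(1, mean = 0, sd = sigma)
}
plot(trend, type = 'l', xlab = 'Time', ylab = 'Simulated latent trend')
```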

2,461 changes: 2,164 additions & 297 deletions NEON_manuscript/Case studies/mvgam_case_study1.html

Large diffs are not rendered by default.

25 changes: 8 additions & 17 deletions NEON_manuscript/Case studies/mvgam_case_study2.Rmd
@@ -20,7 +20,7 @@ In this example we will examine multivariate forecasting models using `mvgam`, w
```{r}
set.seed(500)
library(mvgam)
dat <- sim_mvgam(T = 80, n_series = 6, n_trends = 2,
dat <- sim_mvgam(T = 80, n_series = 6, n_lv = 2,
family = 'poisson',
mu_obs = runif(8, 4, 6),
trend_rel = 0.6, train_prop = 0.8)
@@ -42,7 +42,7 @@ Clearly there are some correlations in the trends for these series. But how does

A challenge with any factor model is the need to determine the number of factors `K`. Setting `K` too small prevents temporal dependencies from being adequately modelled, leading to poor convergence and difficulty estimating smooth parameters. By contrast, setting `K` too large leads to unnecessary computation. `mvgam` approaches this problem by formulating a prior distribution that enforces exponentially increasing penalties on the factor variances to allow any un-needed factors to evolve as flat lines. Now let's fit a well-specified model for our simulated series in which we estimate random intercepts, a shared seasonal cyclic smooth and `2` latent dynamic factors
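But first, a rough conceptual illustration of the exponential-penalty idea (this is not `mvgam`'s actual prior specification, just the shrinkage logic it relies on):

```r
# Conceptual sketch only: penalties that grow exponentially with the
# factor index shrink higher-numbered factors toward flat lines.
# This is not mvgam's actual prior specification.
n_lv <- 4
penalty <- exp(1:n_lv)          # exponentially increasing penalties
prior_sd <- 1 / sqrt(penalty)   # implied prior scale for each factor's variance
round(prior_sd, 3)
```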
```{r, message=FALSE, warning=FALSE}
mod1 <- mvjagam(data_train = dat$data_train,
mod1 <- mvgam(data_train = dat$data_train,
data_test = dat$data_test,
formula = y ~ s(series, bs = 're') +
s(season, bs = c('cc'), k = 12),
@@ -52,7 +52,7 @@ mod1 <- mvjagam(data_train = dat$data_train,
family = 'poisson',
trend_model = 'RW',
chains = 4,
burnin = 4000)
burnin = 1000)
```

Look at a few plots. The estimated smooth function
@@ -72,7 +72,7 @@ plot_mvgam_factors(mod1)

Now we fit the same model but assume that we know nothing about how many factors to use, so we specify the maximum allowed (the total number of series; `8`). Note that this model is computationally more expensive, so it will take longer to fit
```{r, message=FALSE, warning=FALSE}
mod2 <- mvjagam(data_train = dat$data_train,
mod2 <- mvgam(data_train = dat$data_train,
data_test = dat$data_test,
formula = y ~ s(series, bs = 're') +
s(season, bs = c('cc'), k = 12),
@@ -82,7 +82,7 @@ mod2 <- mvjagam(data_train = dat$data_train,
family = 'poisson',
trend_model = 'RW',
chains = 4,
burnin = 4000)
burnin = 1000)
```

The same plots as model 1 to see if this model has also fit the data well
@@ -138,7 +138,7 @@ plot(series, legend.loc = 'topleft')

Now we will fit an `mvgam` model with shared seasonality and random intercepts per series. Our first attempt will ignore any temporal component in the residuals so that we can identify which GAM predictor combination gives us the best fit, prior to investigating how to deal with any remaining autocorrelation. We assume a Negative Binomial distribution with series-specific overdispersion parameters for the response. We use a complexity-penalising prior for the overdispersion, which allows the model to reduce toward a simpler Poisson observation process unless the data provide adequate information to support overdispersion. Also note that any smooths using the random effects basis (`s(series, bs = "re")` below) are automatically re-parameterised to use the [non-centred parameterisation that is necessary to help avoid common posterior degeneracies in hierarchical models](https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html). This parameterisation tends to work better for most ecological problems where the data for each group / context are not highly informative, but it is still probably worth investigating whether a centred or even a mix of centred / non-centred parameterisations would give better computational performance. We suppress the global intercept as it is not needed and will lead to identifiability issues when estimating the series-specific random intercepts
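As a minimal illustration of what the non-centred trick does (the values below are arbitrary):

```r
# Minimal sketch of the non-centred parameterisation for random intercepts.
# Centred:     b_i ~ Normal(mu, sigma)
# Non-centred: b_i = mu + sigma * z_i, where z_i ~ Normal(0, 1)
# The sampler explores the well-behaved z_i scale; values are arbitrary.
set.seed(1)
mu <- 2; sigma <- 0.5; n_series <- 6
z <- rnorm(n_series)      # standard normal 'raw' effects
b <- mu + sigma * z       # implied series-specific intercepts
```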
```{r, message=FALSE, warning=FALSE}
trends_mod1 <- mvjagam(data_train = trends_data$data_train,
trends_mod1 <- mvgam(data_train = trends_data$data_train,
data_test = trends_data$data_test,
formula = y ~ s(series, bs = 're') +
s(season, k = 12, m = 2, bs = 'cc') - 1,
Expand All @@ -151,7 +151,7 @@ trends_mod1 <- mvjagam(data_train = trends_data$data_train,

Given that these series could potentially be following a hierarchical seasonality, we will also trial a slightly more complex model with an extra smoothing term per series that allows its seasonal curve to deviate from the global seasonal smooth. Ignore the warning about repeated smooths, as this is not an issue for estimation.
```{r, message=FALSE, warning=FALSE}
trends_mod2 <- mvjagam(data_train = trends_data$data_train,
trends_mod2 <- mvgam(data_train = trends_data$data_train,
data_test = trends_data$data_test,
formula = y ~ s(season, k = 12, m = 2, bs = 'cc') +
s(season, series, k = 5, bs = 'fs', m = 1),
@@ -181,18 +181,9 @@ summary(trends_mod1$mgcv_model)
summary(trends_mod2$mgcv_model)
```

We can also use `JAGS` routines for interrogating models. Which model minimises in-sample DIC?
```{r}
dic(trends_mod1)
```

```{r}
dic(trends_mod2)
```

Model 2 seems to fit better so far, suggesting that hierarchical seasonality gives better performance for these series. But a problem with both of the above models is that their forecast uncertainties will not increase into the future, which is not how time series forecasts should behave. Here we fit Model 2 again, but this time specify a time series model for the latent trends. We assume the dynamic trends can be represented using latent factors that each follow an AR1 process, and we will rely on the exponential penalties to help regularise any un-needed factors by setting `n_lv = 4`
```{r, message=FALSE, warning=FALSE}
trends_mod3 <- mvjagam(data_train = trends_data$data_train,
trends_mod3 <- mvgam(data_train = trends_data$data_train,
data_test = trends_data$data_test,
formula = y ~ s(season, k = 12, m = 2, bs = 'cc') +
s(season, series, k = 5, bs = 'fs', m = 1),
1,971 changes: 1,783 additions & 188 deletions NEON_manuscript/Case studies/mvgam_case_study2.html

Large diffs are not rendered by default.
