Improve design vignette
jamesmbaazam committed May 15, 2024
1 parent a57a097 commit 1657537
Showing 1 changed file with 58 additions and 93 deletions.
151 changes: 58 additions & 93 deletions vignettes/design-principles.Rmd
@@ -18,13 +18,9 @@ knitr::opts_chunk$set(
)
```

This vignette outlines the design decisions that have been taken during the
development of the `{epichains}` R package, and provides some of the reasoning,
and possible pros and cons of each decision.
This vignette outlines the design decisions that have been taken during the development of the `{epichains}` R package, and provides some of the reasoning, and possible pros and cons of each decision.

The goal here is to make it easy to acquaint oneself with the code base
in the absence of the current maintainer. This will ease future contributions
and maintenance of the package.
The goal here is to make it easy to acquaint oneself with the code base in the absence of the current maintainer. This will ease future contributions and maintenance of the package.

## Scope

@@ -39,114 +35,83 @@ and maintenance of the package.
and `pois_length_ll()`.
- Numerical likelihood simulation using `offspring_ll()`.

Additionally, the package provides mixture probability distributions for
generating offspring distributions, for example, `rborel()`.
Additionally, the package provides mixture probability distributions for generating offspring distributions, for example, `rborel()`.

## Design decisions

### Simulation functions

Simulation of branching processes is achieved through `simulate_chains()` and
`simulate_summary()`. For details of the underlying methods, see the
[theory vignette](https://epiverse-trace.github.io/epichains/articles/theoretical_background.html). #nolint

The simulations are stochastic, meaning that one set of inputs can produce
varied results. The models here can also be used to explore scenario analyses
by varying the inputs. Often, in cases where there is need for more than one
run of the model and/or with more than one set of parameter values, the inputs
and outputs are stored in separate data structures. However, this approach can
be limiting when performing scenario analyses, as the user has to manually
manipulate the two objects with reshaping and joining operations. This has
the potential to lead to errors and loss of information. Hence,
`simulate_chains()` and `simulate_summary()` return the dedicated classes
`<epichains>` and `<epichains_summary>` objects respectively that store the
input parameters and the output of the simulation in a single object.

The `<epichains>` class extends the `<data.frame>`, using columns
to store information about the simulated transmission chains and the
parameter values as attributes. `<data.frame>` was chosen because its tabular
structure allows us to store information in rows and columns, and is a widely
used data structure in R. Similarly, the `<epichains_summary>` class is a
subclass (an extension) of R's `<numeric>` class and stores the parameter
values as attributes.

The `<epichains>` object contains information about the whole outbreak, but
key summaries are not easily deduced from a quick glance of the object. Hence,
the class has a dedicated `format()/print()` method to print the simulated
transmission chains in a manner similar to a `<data.frame>`, but accompanied
by extra summary information including the number of chains simulated, number
of generations reached, and the number of unique infectors created. These
summaries are useful for quickly assessing the output of the simulation.

Importantly, the `<epichains>` class has a `summary()` method that returns
an `<epichains_summary>` object. This is a design decision that was taken to
allow for easy coercion between an infection tree obtained from
`simulate_chains()` and summaries of the infection tree otherwise attainable
by a separate run of `simulate_summary()` with the same parameter values passed
to `simulate_chains()`.

Lastly, `<epichains>` objects have an `aggregate()` method that aggregates the
transmission tree into cases by "generation" or "time". This is syntactic sugar
for the `dplyr::group_by()` and `dplyr::summarise()` style of aggregation.
Simulation of branching processes is achieved through `simulate_chains()` and `simulate_chain_stats()`. For details of the underlying methods, see the [theory vignette](https://epiverse-trace.github.io/epichains/articles/theoretical_background.html). #nolint

The simulations are stochastic, meaning that one set of inputs can produce varied results. The models here can also be used to explore scenario analyses by varying the inputs. Often, in cases where more than one run of the model and/or more than one set of parameter values is needed, the inputs and outputs are stored in separate data structures. However, this approach can be limiting when performing scenario analyses, as the user has to manually combine the two objects with reshaping and joining operations. This has the potential to lead to errors and loss of information. Hence, `simulate_chains()` and `simulate_chain_stats()` return objects of the dedicated classes `<epichains>` and `<epichains_summary>` respectively, which store the input parameters and the output of the simulation in a single object.

The `<epichains>` class extends the `<data.frame>` class, using columns to store information about the simulated transmission chains and attributes to store the parameter values. `<data.frame>` was chosen because its tabular structure allows us to store information in rows and columns, and because it is a widely used data structure in R. Similarly, the `<epichains_summary>` class is a subclass (an extension) of R's `<numeric>` class and stores the parameter values as attributes.
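
The idea of extending `<data.frame>` and `<numeric>` while keeping the inputs as attributes can be sketched in base R as follows. This is only an illustration of the pattern, not the package's actual constructors: the class names `toy_chains` and `toy_chains_summary`, and both constructor functions, are hypothetical.

```r
# Hypothetical constructor: a <data.frame> subclass that carries the
# simulation inputs as attributes alongside the tabular output.
new_toy_chains <- function(tree, n_chains, statistic, offspring_dist) {
  stopifnot(is.data.frame(tree))
  structure(
    tree,
    n_chains = n_chains,
    statistic = statistic,
    offspring_dist = offspring_dist,
    class = c("toy_chains", class(tree))
  )
}

# Hypothetical constructor: a <numeric> subclass for per-chain summaries,
# again with the inputs stored as attributes.
new_toy_chains_summary <- function(x, n_chains, statistic) {
  stopifnot(is.numeric(x))
  structure(
    x,
    n_chains = n_chains,
    statistic = statistic,
    class = c("toy_chains_summary", class(x))
  )
}

# Made-up output for two chains; inputs and outputs travel together.
tree <- data.frame(
  chain = c(1L, 1L, 2L),
  infector = c(NA, 1L, NA),
  infectee = c(1L, 2L, 1L),
  generation = c(1L, 2L, 1L)
)
x <- new_toy_chains(tree, n_chains = 2L, statistic = "size", offspring_dist = rpois)
s <- new_toy_chains_summary(c(2, 1), n_chains = 2L, statistic = "size")

inherits(x, "data.frame") # data.frame methods still apply
attr(x, "statistic")      # the input is recoverable from the output
is.numeric(s)             # numeric operations still apply
```

Because the subclass name comes first in the class vector, dedicated methods can be added for it while everything else falls back to the `<data.frame>` or `<numeric>` behaviour.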

The `<epichains>` object contains information about the whole outbreak, but key summaries are not easily deduced from a quick glance at the object. Hence, the class has a dedicated `format()/print()` method to print the simulated transmission chains in a manner similar to a `<data.frame>`, but accompanied by extra summary information including the number of chains simulated, number of generations reached, and the number of infectors created. These summaries are useful for quickly assessing the output of the simulation.
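
A print method of this kind might look roughly like the base R sketch below. The class name `toy_chains`, the column names, and the wording of the summary lines are assumptions made for illustration; the package's own method may compute and present these summaries differently.

```r
# Toy object: a <data.frame> subclass (the class name is hypothetical).
toy <- structure(
  data.frame(
    chain = c(1L, 1L, 1L, 2L),
    infector = c(NA, 1L, 1L, NA),
    infectee = c(1L, 2L, 3L, 1L),
    generation = c(1L, 2L, 2L, 1L)
  ),
  class = c("toy_chains", "data.frame")
)

# Print the table as a plain data.frame, then append summary lines.
print.toy_chains <- function(x, ...) {
  df <- as.data.frame(x) # drops the subclass, keeps the data
  print(df, ...)
  # An infector is identified by its chain and its id within that chain.
  infectors <- unique(df[!is.na(df$infector), c("chain", "infector")])
  writeLines(c(
    sprintf("Number of chains: %d", length(unique(df$chain))),
    sprintf("Number of generations: %d", max(df$generation)),
    sprintf("Number of infectors: %d", nrow(infectors))
  ))
  invisible(x)
}

print(toy)
```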

Importantly, the `<epichains>` class has a `summary()` method that returns an `<epichains_summary>` object. This is a design decision that was taken to allow for easy coercion between an `<epichains>` object obtained from `simulate_chains()` and summaries of the `<epichains>` object otherwise attainable by a separate run of `simulate_chain_stats()` with the same parameter values passed to `simulate_chains()`.

Lastly, `<epichains>` objects have an `aggregate()` method that aggregates the simulated outbreak into cases by "generation" or "time". This is syntactic sugar for the `dplyr::group_by()` and `dplyr::summarise()` style of aggregation with the added benefit of not taking on `dplyr` as a dependency to achieve the goal.
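
For readers unfamiliar with the pattern, the kind of grouping that such an `aggregate()` method wraps can be written with base R alone. The sketch below uses a made-up data frame, and binning `time` with `floor()` is an assumption of this example rather than a statement about how the package bins time.

```r
# Made-up transmission tree: one row per case.
tree <- data.frame(
  chain = c(1L, 1L, 1L, 2L, 2L),
  generation = c(1L, 2L, 2L, 1L, 2L),
  time = c(0, 1.2, 2.7, 0, 3.1)
)

# Cases per generation: the base R counterpart of
# dplyr::group_by(generation) followed by dplyr::summarise(cases = dplyr::n()).
by_generation <- stats::aggregate(
  list(cases = tree$chain),
  by = list(generation = tree$generation),
  FUN = length
)

# Cases per unit of time, binning times with floor() (an assumption here).
by_time <- stats::aggregate(
  list(cases = tree$chain),
  by = list(time = floor(tree$time)),
  FUN = length
)

by_generation
by_time
```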

In summary, an `<epichains>` object has the following structure:

* Columns:
* `infectee_id` (`<numeric>`)
* `infector_id` (`<numeric>`)
* `generation` (`<numeric>`), and optionally,
* `chain` (`<integer>`)
* `infector` (`<integer>`)
* `infectee` (`<integer>`)
* `generation` (`<integer>`), and optionally,
* `time` (`<numeric>`), if `generation_time` is specified
* `susc_pop` (`<numeric>`), if `pop` is finite

* Attributes (See definitions in [simulate_chains](https://epiverse-trace.github.io/epichains/reference/simulate_chains.html)): #nolint
* `index_cases`,
* `n_chains`,
* `statistic`,
* `stat_max`,
* `stat_threshold`,
* `offspring_dist`, and
* `track_pop` (if `pop` is finite).
* `track_pop` (if `pop` is finite, i.e., if specified).

### Likelihood estimation

Likelihoods are estimated using the `likelihood()` function. The
function is designed to be flexible in two inputs:

* data can be supplied as a vector of chain summaries, a `<epichains>` object,
or a `<epichains_summary>` object, and
* the offspring distribution can be supplied as a function, allowing the
user to use custom offspring distributions.

`likelihood()` uses either analytical or numerical methods to estimate the
likelihood of observing the input chain summaries. The analytical methods are
closed-form likelihoods that take the form `<offspring_dist>_<statistic>_ll`,
for example, `gborel_size_ll()` and `pois_length_ll()` and are shipped in this
package. If the user supplies an offspring distribution and a statistic for
which a solution exists, internally, it is used. If not, simulations are used
to estimate the likelihood. The numerical likelihood simulation is achieved
using `offspring_ll()`, which is a wrapper around `simulate_summary()`.

The output type of `likelihood()` depends on the combination of the
arguments `individual`, `obs_prob`, and `nsim_obs` as summarised in the
table below:

| `individual` | `obs_prob` | Output type | Output length| Element length |
|--------------|------------|-------------|-------|----------------|
| `FALSE` | 1 | `<numeric>` | 1 | NA |
| `FALSE` | `obs_prob` >= 0 and `obs_prob` <= 1 | `<numeric>`|
`nsim_obs` | NA |
| `TRUE` | 1 | `<list>` | 1 | input data |
| `TRUE` | `obs_prob` >= 0 and `obs_prob` <= 1 | `<list>` |
`nsim_obs` | input data |
Likelihoods are estimated using the `likelihood()` function. The function is designed to be flexible in two inputs:

* data can be supplied as a vector of chain summaries, a `<epichains>` object, or a `<epichains_summary>` object, and
* the offspring distribution can be supplied as a function, allowing the user to use custom offspring distributions.

`likelihood()` uses either analytical or numerical methods to estimate the likelihood of observing the input chain summaries. The analytical methods are closed-form likelihoods that take the form `.<offspring_dist>_<statistic>_ll()`, for example, `.gborel_size_ll()` and `.pois_length_ll()` and are shipped in this package. If the user supplies an offspring distribution and a statistic for which a solution exists, internally, it is used. If not, simulations are used to estimate the likelihood. The numerical likelihood simulation is achieved using `.offspring_ll()`, an internal wrapper around `simulate_chain_stats()`.
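
To make the distinction between the two routes concrete, the sketch below compares a closed-form and a simulation-based log-likelihood for chain sizes under a Poisson offspring distribution, where the size of a chain started by a single case follows the Borel distribution, $P(n) = e^{-\lambda n} (\lambda n)^{n-1} / n!$. All function names here are hypothetical, and the simulation approach (empirical frequencies from repeated runs) is a simplified stand-in, not the package's internal implementation.

```r
# Closed-form log-likelihood of observed chain sizes under Poisson(lambda)
# offspring, using the Borel distribution.
pois_size_loglik <- function(sizes, lambda) {
  sum(-lambda * sizes + (sizes - 1) * log(lambda * sizes) - lgamma(sizes + 1))
}

# Simulate the size of one chain started by a single case. The loop stops
# once the chain dies out or reaches `max_size` (hypothetical helper).
sim_chain_size <- function(lambda, max_size = 1000L) {
  size <- 1L
  active <- 1L
  while (active > 0L && size < max_size) {
    active <- sum(rpois(active, lambda)) # offspring of the current generation
    size <- size + active
  }
  size
}

# Simulation-based log-likelihood: estimate each size's probability by its
# frequency among `nsim` simulated chains.
sim_size_loglik <- function(sizes, lambda, nsim = 20000L) {
  sims <- replicate(nsim, sim_chain_size(lambda))
  probs <- vapply(sizes, function(s) mean(sims == s), numeric(1))
  sum(log(probs))
}

set.seed(1)
observed <- c(1, 1, 2, 3)
analytic <- pois_size_loglik(observed, lambda = 0.5)
numeric_est <- sim_size_loglik(observed, lambda = 0.5)
c(analytic = analytic, simulated = numeric_est)
```

The simulated value only approximates the closed-form one and varies with the seed and the number of simulations, which is one reason to prefer the closed-form solution whenever one exists.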

The output type of `likelihood()` depends on the combination of the arguments `individual`, `obs_prob`, and `nsim_obs` as summarised in the table below:

| `individual` | `obs_prob` | Output type | Output length | Element length |
|--------------|------------|-------------|---------------|----------------|
| `FALSE` | 1 | `<numeric>` | 1 | NA |
| `FALSE` | `obs_prob` >= 0 and `obs_prob` <= 1 | `<numeric>`| `nsim_obs` | NA |
| `TRUE` | 1 | `<list>` | 1 | input data |
| `TRUE` | `obs_prob` >= 0 and `obs_prob` <= 1 | `<list>` | `nsim_obs` | input data |

## Naming and documentation style

The package uses the following naming conventions:

* Functions and arguments are named using snake_case, for example, `simulate_chains()`.

* Internal functions are prefixed with a period, for example, `.offspring_ll()`. This is only a visual cue and does not have any technical implications.

* In the documentation:

- Classes and objects are enclosed in angle brackets, for example, `<epichains>`.
- Packages are enclosed in curly braces, for example, `{epichains}`.
- All function arguments are defined in sentence case and punctuated (especially with full stops).
- Function titles are in imperative form.
- Functions are referred to with `function_name()`.

## Dependencies

* Input validation:
- [checkmate](https://github.com/mllg/checkmate/): exports many `test_*()`,
`check_*()` and `assert_*()` functions and is available on CRAN.
- [checkmate](https://github.com/mllg/checkmate/): exports many `test_*()`, `check_*()` and `assert_*()` functions and is available on CRAN.

## Development journey

{epichains} is a successor to the
[bpmodels](https://github.com/epiverse-trace/bpmodels) package,
which had the same functionality but not structured in an object-oriented
manner. {epichains} is a major refactoring of {bpmodels} to provide an
object-oriented approach to simulating and handling transmission chains in R.
{epichains} is a successor to the [bpmodels](https://github.com/epiverse-trace/bpmodels) package, which will be retired when {epichains} is released.

{epichains} was born out of a need to refactor {bpmodels}, which led to a name change and subsequent changes in design that would have required significant disruptive changes to {bpmodels}. {epichains} is a major refactoring of {bpmodels} to provide a simulation function that accounts for susceptible depletion and population immunity without restrictions on the offspring distributions, better function and long form documentation, and an object-oriented approach to simulating and handling transmission chains in R.

Future plans include simulation of contacts of infectees, the incorporation of network effects, an object-oriented approach to estimating chain size and length likelihoods, and interoperability with the [{epiparameter}](https://epiverse-trace.github.io/epiparameter) package for ease of setting up various epidemiological delays.
