Guillem Hurault 2022-05-16
The purpose of this project is to showcase best coding practices in R. The analysis is adapted from a coding test, where fake data is generated and the goal is to develop models to predict an outcome variable (a count). The proposed analysis does not attempt to find the best possible model (the true data-generating mechanism is known) but merely illustrates a reproducible workflow.
This project is organised as a research compendium, with a similar structure as R packages:
- Functions/helpers are located in the `R/` directory. There is only a single script, `functions.R`. Note that functions are documented using an Roxygen2 skeleton.
- `data/` contains the training set, testing set and true coefficients as csv files.
- `data-raw/` contains a script (`generate_fakedata.R`) to generate the fake data.
- Analysis reports/scripts are located in the `analysis/` directory. There are two R Markdown documents, one for the exploratory data analysis (`exploration.Rmd`) and one for the modelling (`modelling.Rmd`).
- HTML reports are located in `docs/` and can be accessed on the GitHub project site.
- Tests are located in the `tests/` folder.
- `renv/` and `renv.lock` are the folder/file used by the renv package to manage package dependencies (see details below). The `.Rprofile` is also created by renv to launch renv at the start of a new session.
- `_targets.R` and `_targets/` are the file/folder used by the targets package to manage the computational workflow (see details below). `run.R` and `run.sh` are also created by targets and store the command to run the pipeline (namely `targets::tar_make()`).
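As an illustration, a function documented with an Roxygen2 skeleton looks like the following (`count_na()` is a hypothetical helper for the sake of the example, not necessarily one of the functions in `functions.R`):

```r
#' Count missing values in a vector
#'
#' @param x A vector.
#'
#' @return An integer, the number of `NA` values in `x`.
#'
#' @examples
#' count_na(c(1, NA, 3, NA))
count_na <- function(x) {
  sum(is.na(x))
}
```

If the project were later converted to a package, this documentation could be rendered to help pages with `devtools::document()` without further changes.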
Research compendiums are often organised as de facto R packages, but I find this is not always helpful. In particular:
- renv is better suited to reproducing the computational environment than a DESCRIPTION file. For example, a DESCRIPTION file does not specify the exact versions of the packages needed to reproduce an analysis. We could use renv within a package, but then we would have to follow two standards to specify the computational environment.
- A package is designed to share code, which is not the main focus of a one-off analysis.
Having said that, this structure makes it easy to convert the project to an R package if needed (especially if the documentation is already written using Roxygen skeleton). A package could notably be useful to take advantage of the package ecosystem, for example to quickly build a website for the project using pkgdown or easily implement continuous integration with GitHub Actions.
I use Docker, renv and targets to ensure the analysis is fully reproducible. Briefly,
- Docker is used to run the analysis on a virtual environment (container), irrespective of the machine that is being used and the system’s configuration.
- renv is used to manage package dependencies and reproduce the computational environment by restoring the project library.
- targets is used to specify the computational workflow (i.e. how to produce output files from scratch).
To reproduce the analysis on a Docker container, make sure you have installed Docker on your computer first.
To reproduce the analysis:

1. Clone this repository.
2. In the command line (e.g. Git Bash on Windows), navigate to the project directory.
3. Pull the Docker image from Docker Hub with `docker pull ghurault/reproducible-workflow` (alternatively, build the Docker image with `docker build . -t ghurault/reproducible-workflow`, although the image may not be exactly the same as the one used for the analysis).
4. Run the container with `MSYS_NO_PATHCONV=1 docker run -d --rm -p 8787:8787 -e DISABLE_AUTH=true -v $(pwd):/home/rstudio/reproducible-workflow -v /home/rstudio/reproducible-workflow/renv ghurault/reproducible-workflow`.
5. Go to `http://localhost:8787/`.
6. Open the `reproducible-workflow` directory and click on `reproducible-workflow.Rproj` to open the RStudio project.
7. Run `targets::tar_make()`.
The `Dockerfile` specifies the configuration of this virtual environment. Here I am setting up a Linux machine where I install R, RStudio Server, various system dependencies and the packages needed to reproduce the analysis (using renv).
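For reference, a minimal Dockerfile for this kind of setup might look as follows. This is a sketch based on the rocker images; the R version, system libraries and paths are illustrative and the actual `Dockerfile` in this repository may differ:

```dockerfile
# Start from a rocker image that ships R and RStudio Server
FROM rocker/rstudio:4.2.0

# System dependencies commonly required by R packages (illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libssl-dev libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

# Restore the project library from the lockfile with renv
WORKDIR /home/rstudio/reproducible-workflow
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"
```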
renv is used to manage package dependencies. The details of the packages needed to reproduce the analysis are stored in `renv.lock` and configuration files, and the project library (ignored by git) is stored in `renv/`. After installing renv itself (`install.packages("renv")`), the project library can be restored by calling `renv::restore()`.
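In practice, the renv workflow boils down to a few calls (a sketch; see the renv documentation for details):

```r
# install.packages("renv")  # install renv itself, if not already available

renv::restore()    # install the package versions recorded in renv.lock
# renv::snapshot() # update renv.lock after adding/removing packages
# renv::status()   # check whether the library and lockfile are in sync
```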
`docker build . -t ghurault/reproducible-workflow` builds the Docker image from the current directory (cf. `.`) and names it `ghurault/reproducible-workflow`. This is the image that was pushed to Docker Hub.
Instruction 4) runs the container as a daemon (detached mode, cf. the `-d` option). In addition:

- `MSYS_NO_PATHCONV=1` is only useful if the command is run in Git Bash on Windows, as it prevents file path conversion when mounting volumes.
- `--rm` removes the container when it exits.
- `-p 8787:8787` publishes the container to `http://localhost:8787/`.
- `-e DISABLE_AUTH=true` disables login (otherwise a password is generated automatically or can be specified as an option).
- `-v $(pwd):/home/rstudio/reproducible-workflow` mounts the current directory at `/home/rstudio/reproducible-workflow` in the container. This means that local files are available inside the container, and that changes made inside the container are reflected in the local files as well.
- `-v /home/rstudio/reproducible-workflow/renv` prevents the local `renv/` directory from being mounted.
Once the virtual environment is set up and the RStudio project is opened, the analysis can be reproduced using targets. The workflow is declared in `_targets.R` and the information that the package needs is stored in `_targets/`. We can inspect the current status of the pipeline with `targets::tar_visnetwork()` and run the pipeline with `targets::tar_make()`.
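A `_targets.R` file typically sources the helper functions and returns a list of targets. The sketch below shows the general shape; the target names, file paths and the `fit_model()` helper are hypothetical, not the ones used in this project:

```r
library(targets)

# Make the helper functions available to the pipeline
source("R/functions.R")

# Packages loaded when building targets
tar_option_set(packages = c("dplyr", "readr"))

list(
  # Track the raw data file so the pipeline reruns when it changes
  tar_target(train_file, "data/train.csv", format = "file"),
  tar_target(train, readr::read_csv(train_file)),
  # fit_model() would be one of the helpers defined in R/functions.R
  tar_target(fit, fit_model(train))
)
```

Because each target's dependencies are declared explicitly, `targets::tar_make()` only rebuilds the targets whose upstream inputs have changed.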
Unit tests are implemented using the testthat package. The tests are written in R scripts beginning with `test-` in `tests/testthat/` and can be run by sourcing `tests/testthat.R`. Currently, the tests are only here to showcase how to perform unit tests outside a package environment. In practice, more/better tests should be implemented.
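For instance, a test file such as `tests/testthat/test-functions.R` could contain something like the following (the `count_na()` helper is hypothetical; it is defined inline here so that the example is self-contained, whereas in the project it would be sourced from `R/functions.R` by `tests/testthat.R`):

```r
library(testthat)

# Hypothetical helper; in the project this would come from R/functions.R
count_na <- function(x) sum(is.na(x))

test_that("count_na() counts missing values", {
  expect_equal(count_na(c(1, NA, 3)), 1)
  expect_equal(count_na(numeric(0)), 0)
})
```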
- Set up continuous integration, e.g. when there is a push to the remote repository, run the analysis in GitHub Actions (cf. tldr above) and publish the report (and potentially the pipeline status), in addition to running unit tests.
The pipeline dependency graph can also be accessed here as an HTML widget.