Guillem Hurault 2022-05-16
The purpose of this project is to showcase best coding practices in R. The analysis is adapted from a coding test, where fake data is generated and the goal is to develop models to predict an outcome variable (a count). The proposed analysis does not attempt to find the best possible model (the true data-generating mechanism is known) but merely illustrates a reproducible workflow.
This project is organised as a research compendium, with a similar structure as R packages:
- Functions/helpers are located in the `R/` directory. There is only a single script, `functions.R`. Note that functions are documented using an Roxygen2 skeleton.
- `data/` contains the training set, testing set and true coefficients as csv files.
- `data-raw/` contains a script (`generate_fakedata.R`) to generate the fake data.
- Analysis reports/scripts are located in the `analysis/` directory. There are two R Markdown documents, one for the exploratory data analysis (`exploration.Rmd`) and one for the modelling (`modelling.Rmd`).
- HTML reports are located in `docs/` and can be accessed on the GitHub project site.
- Tests are located in the `tests/` folder.
- `renv/` and `renv.lock` are the folder/file used by the renv package to manage package dependencies (see details below). The `.Rprofile` is also created by renv to launch renv at the start of a new session.
- `_targets.R` and `_targets/` are the file/folder used by the targets package to manage the computational workflow (see details below). `run.R` and `run.sh` are also created by targets and store the command to run the pipeline (namely `targets::tar_make()`).
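As an illustration, a function documented with an Roxygen2 skeleton looks like the following (`count_na()` is a hypothetical helper for the sake of the example, not necessarily one of the functions in `functions.R`):

```r
#' Count missing values in a vector
#'
#' @param x A vector.
#'
#' @return An integer, the number of `NA` values in `x`.
#'
#' @examples
#' count_na(c(1, NA, 3, NA))
count_na <- function(x) {
  sum(is.na(x))
}
```

If the project were later converted to a package, this documentation could be rendered to help pages with `devtools::document()` without further changes.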
Research compendiums are often organised as de facto R packages, but I find this is not always helpful. In particular:
- renv is better suited to reproducing the computational environment than a DESCRIPTION file. For example, a DESCRIPTION file does not specify the exact versions of the packages needed to reproduce an analysis. We could use renv within a package, but then we would have to follow two standards to specify the computational environment.
- A package is designed to share code, which is not the main focus of a one-off analysis.
Having said that, this structure makes it easy to convert the project to an R package if needed (especially if the documentation is already written using Roxygen skeleton). A package could notably be useful to take advantage of the package ecosystem, for example to quickly build a website for the project using pkgdown or easily implement continuous integration with GitHub Actions.
I use Docker, renv and targets to ensure the analysis is fully reproducible. Briefly,
- Docker is used to run the analysis on a virtual environment (container), irrespective of the machine that is being used and the system’s configuration.
- renv is used to manage package dependencies and reproduce the computational environment by restoring the project library.
- targets is used to specify the computational workflow (i.e. how to produce output files from scratch).
To reproduce the analysis on a Docker container, make sure you have installed Docker on your computer first.
To reproduce the analysis:

1. Clone this repository.
2. In the command line (e.g. Git Bash on Windows), navigate to the project directory.
3. Pull the Docker image from Docker Hub with `docker pull ghurault/reproducible-workflow` (alternatively, build the Docker image with `docker build . -t ghurault/reproducible-workflow`, although the image may not be exactly the same as the one used for the analysis).
4. Run the container with `MSYS_NO_PATHCONV=1 docker run -d --rm -p 8787:8787 -e DISABLE_AUTH=true -v $(pwd):/home/rstudio/reproducible-workflow -v /home/rstudio/reproducible-workflow/renv ghurault/reproducible-workflow`.
5. Go to `http://localhost:8787/`.
6. Open the `reproducible-workflow` directory and click on `reproducible-workflow.Rproj` to open the RStudio project.
7. Run `targets::tar_make()`.
The `Dockerfile` specifies the configuration of this virtual environment. Here I am setting up a Linux machine where I install R, RStudio Server, various system dependencies and the packages needed to reproduce the analysis (using renv).
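For reference, a minimal Dockerfile for this kind of setup might look as follows. This is a sketch based on the rocker images; the R version, system libraries and paths are illustrative and the actual `Dockerfile` in this repository may differ:

```dockerfile
# Start from a rocker image that ships R and RStudio Server
FROM rocker/rstudio:4.2.0

# System dependencies commonly required by R packages (illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libssl-dev libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

# Restore the project library from the lockfile with renv
WORKDIR /home/rstudio/reproducible-workflow
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"
```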
renv is used to manage package dependencies. The details of the packages needed to reproduce the analysis are stored in `renv.lock` and configuration files, and the project library (ignored by git) is stored in `renv/`. After installing renv itself (`install.packages("renv")`), the project library can be restored by calling `renv::restore()`.
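In practice, the renv workflow boils down to a few calls (a sketch; see the renv documentation for details):

```r
# install.packages("renv")  # install renv itself, if not already available

renv::restore()    # install the package versions recorded in renv.lock
# renv::snapshot() # update renv.lock after adding/removing packages
# renv::status()   # check whether the library and lockfile are in sync
```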
`docker build . -t ghurault/reproducible-workflow` builds the Docker image from the current directory (cf. `.`) and names it `ghurault/reproducible-workflow`. This is the image that was pushed to Docker Hub.
Instruction 4) runs the container as a daemon (detached mode, cf. the `-d` option). In addition:

- `MSYS_NO_PATHCONV=1` is only useful if the command is run in Git Bash on Windows, as it prevents file path conversion when mounting volumes.
- `--rm` removes the container when it exits.
- `-p 8787:8787` publishes the container to `http://localhost:8787/`.
- `-e DISABLE_AUTH=true` disables login (otherwise a password is generated automatically or can be specified as an option).
- `-v $(pwd):/home/rstudio/reproducible-workflow` mounts the current directory at `/home/rstudio/reproducible-workflow` in the container. This means that local files are available inside the container, and that changes made inside the container are reflected in the local files as well.
- `-v /home/rstudio/reproducible-workflow/renv` prevents the local `renv/` directory from being mounted.
Once the virtual environment is set up and the RStudio project is opened, the analysis can be reproduced using targets. The workflow is declared in `_targets.R` and the information that the package needs is stored in `_targets/`. We can inspect the current status of the pipeline with `targets::tar_visnetwork()` and run the pipeline with `targets::tar_make()`.
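A `_targets.R` file typically sources the helper functions and returns a list of targets. The sketch below shows the general shape; the target names, file paths and the `fit_model()` helper are hypothetical, not the ones used in this project:

```r
library(targets)

# Make the helper functions available to the pipeline
source("R/functions.R")

# Packages loaded when building targets
tar_option_set(packages = c("dplyr", "readr"))

list(
  # Track the raw data file so the pipeline reruns when it changes
  tar_target(train_file, "data/train.csv", format = "file"),
  tar_target(train, readr::read_csv(train_file)),
  # fit_model() would be one of the helpers defined in R/functions.R
  tar_target(fit, fit_model(train))
)
```

Because each target's dependencies are declared explicitly, `targets::tar_make()` only rebuilds the targets whose upstream inputs have changed.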
Unit tests are implemented using the testthat package. The tests are written in R scripts beginning with `test-` in `tests/testthat/` and can be run by sourcing `tests/testthat.R`. Currently, the tests are only here to showcase how to perform unit tests outside a package environment. In practice, more/better tests should be implemented.
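For instance, a test file such as `tests/testthat/test-functions.R` could contain something like the following (the `count_na()` helper is hypothetical; it is defined inline here so that the example is self-contained, whereas in the project it would be sourced from `R/functions.R` by `tests/testthat.R`):

```r
library(testthat)

# Hypothetical helper; in the project this would come from R/functions.R
count_na <- function(x) sum(is.na(x))

test_that("count_na() counts missing values", {
  expect_equal(count_na(c(1, NA, 3)), 1)
  expect_equal(count_na(numeric(0)), 0)
})
```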
- Set up continuous integration, e.g. when there is a push to the remote repository, run the analysis in GitHub Actions (cf. tldr above) and publish the report (and potentially the pipeline status), in addition to running unit tests.
The pipeline dependency graph can also be accessed here as an HTML widget.