Discuss the dependency and reproducibility page #6
Comments
This is all well and good on personal machines, but build tools are usually locked down on government systems (for good reason!). As a result, it can be impossible to return to older states.
There also seem to be issues installing packages that need compilation under packrat. For instance, at MOJ, we can install devtools only when packrat is disabled. Unsure of the cause.
I've only used packrat a couple of times and always found it a bit clunky, but I did manage to get it working on quite a locked-down machine. Similarly, I've only used Docker a couple of times, but for true reproducibility, saving the final image somewhere is probably easiest. I think a lot of it comes down to proportionality (e.g. for the RAP companion I think using packrat or Docker is overkill anyway, but I appreciate that for RAP projects it may be of greater benefit). We have done a piece of work where we combined a Dockerfile with a packrat collection, but the packrat file is so massive that I'm unconvinced of the benefit versus just saving the resultant image. Perhaps someone who has thought about it more will chip in.
Some thoughts:
TL;DR: Docker (or equivalent) is probably the most robust tool for reproducibility if used correctly. However, it's also frequently unavailable, easy to use incorrectly, and difficult for less technical colleagues to understand. If you have control over your environment, packrat is probably the easiest tool to gain reproducibility. The best solution may be a combination of Docker and packrat, with the user doing dependency management in packrat, and reproducibility coming from the fact that their RStudio environment is running in a Docker container (they don't need to know about this). Another possible solution is simply to force users to use a specific Dockerised RStudio with no ability to install new packages.

More details

Our objective is to be able to return to an old project, or run someone else's project, and get the same result. Overall, I've found no workflow that is quick and easy. Various solutions work but are quite technical (less technical users find them confusing, time consuming, or both).

Note that I mainly have experience using packrat on RStudio Server, run from rocker/rstudio. I also have experience of running packrat on Mac OS X with a local install of RStudio, and on Windows.

No dependency management

Packrat

One of the most frustrating things I've found with packrat is the need to install packages afresh into each project. This is a particular problem on Linux-based systems, where packages need to recompile every time. This can take upwards of an hour for a big project. We have a solution for this problem here, but this solution is fairly specific to our setup (running RStudio from the rocker/rstudio Docker image).

There doesn't seem to be an easy way of using your global package install directory as a 'cache'. This is an option in packrat.opts, but it doesn't work the way you may think it should.

This doesn't make much sense to me: if you have a specific version of dplyr installed globally (it takes 10 minutes or more to compile anew), why isn't there an easy way to symlink the existing install rather than recompiling? Overall, the problem seems to be that packrat is too strict. This is discussed in detail by the devs and others here.

Docker

If you have access to Docker, including a Docker repository that you trust to back up your built images indefinitely, this is probably a reasonable solution. A huge gotcha, though, is that if you rebuild the Docker image you might get a different result. It seems fairly common to just use

Once you start to worry about this versioning issue, you probably want packrat anyway, because it will handle cataloguing the specific version of

Our 'solution'

Having said all of this, I have spent countless hours struggling with packrat, and it's one of my least favourite tools. Here's a recent tweet with various opinions from the community.

*Another possible option

Other stuff
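For readers who haven't used it, the packrat workflow discussed above can be sketched roughly as below. This is a minimal sketch rather than anyone's actual setup: the three wrapper names (`init_project`, `record_versions`, `rebuild_library`) are made up for illustration, and only `packrat::init()`, `packrat::snapshot()` and `packrat::restore()` are packrat's own functions.

```r
# Illustrative wrappers (hypothetical names) around the three core packrat calls.
# Nothing runs until one of the functions is called.

# One-off, per project: create a private package library and a packrat.lock file.
init_project <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("the packrat package is not installed")
  }
  packrat::init(project = project, enter = FALSE)
}

# After installing or upgrading packages: record the exact versions in the lockfile.
record_versions <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("the packrat package is not installed")
  }
  packrat::snapshot(project = project, prompt = FALSE)
}

# On another machine, or months later: reinstall the recorded versions.
# This is the step that can be slow on Linux, where packages compile from source.
rebuild_library <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("the packrat package is not installed")
  }
  packrat::restore(project = project, prompt = FALSE)
}
```

The lockfile is the part worth committing to version control; the compiled library itself is what makes each project slow to set up.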
PS @alexander-newton - if you're having issues with devtools installation, raise an issue here :-)
On @RobinL's final point, reducing package dependencies also reduces the maintenance burden of those dependencies and the exposure to security vulnerabilities in them that are yet to be discovered. Some strategies are discussed here: https://gds-way.cloudapps.digital/standards/tracking-dependencies.html This may or may not be an issue depending on the context of the specific RAP use case.
Also, there is a rung up on the ladder from no dependency management before you get to packrat and Docker:
It is also important to understand what "reproducibility" is. Reproducibility ensures that the results you achieved can be achieved by others (a good thing when we are publishing). However, using packrat or Docker images for a piece of analysis that you perform each year is not necessarily ideal if there are bugs in your dependencies that have been fixed in the interim. Whilst unit tests make it easier to check for issues when reproducing tables, it is trickier when you move towards more algorithmic projects (e.g. using regression or optimization algorithms). Adding the aforementioned security issues, my current leaning would be, first, a package (with CI to see if and when it breaks) combined with session info from when the analysis was run. If you want to archive it for reproducibility, build it in Docker and make the image available.
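Recording session info alongside the outputs, as suggested above, needs nothing beyond base R. A minimal sketch follows; the helper name `record_session` and the file name are illustrative choices, not anything prescribed in this thread.

```r
# Write the R version, platform and attached package versions to a text file
# so that the environment used for a given run can be inspected later.
record_session <- function(path = "session-info.txt") {
  info <- utils::sessionInfo()
  writeLines(utils::capture.output(print(info)), con = path)
  invisible(path)
}

# Example: call at the end of an analysis run (a temp directory is used here).
out <- record_session(file.path(tempdir(), "session-info.txt"))
```

This only documents the environment; it does not recreate it, which is where packrat, checkpoint or a saved Docker image come in.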
Is
There is also checkpoint, which may be a good solution for some. One downside, as @RobinL points out, is that it can take an age on Linux. I used checkpoint in the original RAP, and it works more smoothly than packrat, but it lacks the fine-grained control of dependencies that you get from packrat, even if packrat is rather temperamental. Frankly, this lack of easy dependency management in the R ecosystem is a huge pain, and I expect someone will solve it at some point with a recipe-based system similar to Python's pip. In the long run I think that containerisation is a sensible solution, and one could envisage a container relating to each publication. Here is one from the first RAP: https://github.com/DCMSstats/eesectorsdocker (I used checkpoint to manage dependencies here).
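For comparison with packrat, checkpoint pins a whole project to the state of CRAN on a single date rather than tracking versions package by package. A rough sketch, assuming the checkpoint package is installed; `use_snapshot` is a hypothetical wrapper, and the date in the usage note is purely illustrative.

```r
# Hypothetical wrapper: pin the project to the CRAN snapshot for one date.
use_snapshot <- function(snapshot_date) {
  # Validate the date first so that a typo fails before anything is installed.
  parsed <- as.Date(snapshot_date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("snapshot_date must look like 'YYYY-MM-DD'")
  }
  if (!requireNamespace("checkpoint", quietly = TRUE)) {
    stop("the checkpoint package is not installed")
  }
  # checkpoint scans the project for the packages it uses and installs them
  # as they were on CRAN on that date.
  checkpoint::checkpoint(format(parsed))
}
```

Calling something like `use_snapshot("2017-06-01")` at the top of a script is the whole interface, which is why it feels smoother than packrat but offers less fine-grained control.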
I saw this at a UseR event today; looks like someone already did it: https://github.com/trinker/pacman/blob/master/R/p_install_version.R
@sebastian-fox wrote the new material about present-day you, future you, etc., and the section on renv.
Original issue description: "Packrat is more trouble than it's worth: discuss."
The packrat package is for managing dependencies. Informal comments at the meetup suggested it can be a pain, especially on Windows when attempting to compile old versions of packages. Docker was suggested as an alternative with the benefit that it isn't specific to R.