This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

Discuss the dependency and reproducibility page #6

nacnudus opened this issue Jan 4, 2019 · 12 comments
Labels
discuss Discuss existing site content

Comments

@nacnudus
Collaborator

nacnudus commented Jan 4, 2019

Original issue description: "Packrat is more trouble than it's worth: discuss."

The packrat package is for managing dependencies. Informal comments at the meetup suggested it can be a pain, especially on Windows when attempting to compile old versions of packages. Docker was suggested as an alternative with the benefit that it isn't specific to R.

@alexander-newton

CRAN only publishes binary packages for the 'current' version of a particular package, so once you attempt to install an older version of a package (as you might do if you've frozen the R package dependencies for a project, and attempt to transfer that project to a separate machine) you'll need to install from sources, which requires build tools.

Packrat could potentially download and store package binaries within a project for re-use, but unfortunately this is not done right now.

This is all fine and well on personal machines, but build tools are usually locked down on government systems (for good reason!). As a result, it can be impossible to return to older states.

@alexander-newton

There also seem to be issues installing packages with compilation in packrat. For instance, at MOJ, we can install devtools only when packrat is disabled. Unsure of the cause.

@TimTaylor

I've only used packrat a couple of times and always found it a bit clunky but did manage to get it working on quite a locked down machine. Similarly I've only used docker a couple of times but for true reproducibility saving the final image somewhere is probably easiest. I think a lot of it comes down to proportionality (e.g. for the rap companion I think using packrat or docker is overkill anyway but appreciate for rap projects it may be of greater benefit). We have done a piece of work where we combined a dockerfile with a packrat collection, but the packrat file is so massive I'm unconvinced of the benefit versus just saving the resultant image. Perhaps someone who has thought about it more will chip in.

@matt-dray
Contributor

Some thoughts:

  1. No two RAP journeys are the same. Sure, packrat may suck for a behemoth publication, but it's probably okay for helping publish a few RAPped tables or a very short doc with few package requirements.
  2. The 'packrat problem' can be helped (but not solved) by following the tinyverse approach as per #81.
  3. With R in mind, what other non-Docker solutions do we have? miniCRAN as per #84, but should we be thinking about RStudio package manager?
  4. As we all know, RAP is language- and tooling-agnostic and the companion can't provide total coverage of these. This might mean an explanation for some approaches (e.g. packrat for R or virtualenv and requirements.txt for Python) but a tutorial or how-to on the One True Way (e.g. Docker) – emphasised words as per the Divio blog.

@RobinL

RobinL commented Jan 4, 2019

TL;DR: Docker (or equivalent) is probably the most robust tool for reproducibility if used correctly. However, it's also frequently unavailable, easy to use incorrectly, and difficult to understand for less technical colleagues. If you have control over your environment, packrat is probably the easiest tool to gain reproducibility. The best solution may be a combination of Docker and packrat, with the user doing dependency management in packrat, and reproducibility coming from the fact their R Studio environment is running in a Dockerised container (they don't need to know about this). Another possible solution is just to force users to use a specific dockerised R Studio with no ability to install new packages.

More details

Our objective is to be able to return to an old project or run someone else's project and get the same result. Overall, I've found no workflow that is quick and easy. Various solutions work but are quite technical (less technical users find them confusing, time consuming, or both).

Note that I mainly have experience using packrat on R Studio server, running using rocker/rstudio. I also have experience of running packrat on Mac OS X with a local install of R Studio, and on Windows.

No dependency management
Using no dependency management is extremely bad news. Attempting to get others' code working which was written even 6 months before can be an absolute nightmare. I'm completely convinced that, despite its flaws, some form of dependency management is absolutely critical.

Packrat
Overall I've found that it's fairly common to run into packrat problems. Often these are failures to packrat::restore or to create working packrat.lock files. Error messages are difficult to decipher, and often don't tell you the root cause of your problem, sending you on a wild goose chase. I have spent hours and hours wrestling with packrat. The 'acid test' is basically whether you can build (packrat::restore) your project in a new Docker environment. If it works, then you're fairly safe...ish. If you don't test this, then your ability to packrat::restore may be contingent on some operating-system level dependency that isn't tracked by packrat.

One of the most frustrating things I've found with packrat is the need to install packages afresh into each project. This is a particular problem on Linux-based systems, where packages need to re-compile every time. This can take upwards of an hour for a big project. We have a solution for this problem here, but this solution is fairly specific to our setup (running R Studio from the rocker/rstudio Docker image).

There doesn't seem to be an easy way of using your global package install directory as a 'cache'. This is an option in packrat.opts but it doesn't work the way you might think it should. This doesn't make much sense to me: if you have a specific version of dplyr installed globally (it takes 10 mins plus to compile anew), why isn't there an easy way to symlink the existing install rather than recompiling?

Overall the problem seems to be that packrat is too strict. This is discussed in detail by the devs and others here.

Docker
Some people don't bother with packrat. It seems that a common pattern is to just run install.packages statements in your Dockerfile. You can then use the Docker cache to prevent having to recompile all the time.

If you have access to Docker, including a Docker repository that you trust to back up your built images indefinitely, this is probably a reasonable solution. A huge gotcha, though, is that if you rebuild the Docker image then you might get a different result. It seems fairly common to just use install.packages("dplyr") or whatever in the Dockerfile, which of course will pull the latest version. You may say, well, let's use remotes to pull a specific version. But now you're basically writing a packrat.lock file, and I'm not sure how this will deal with updates to dplyr's dependencies. It's probably not as reproducible as it looks unless you're very diligent in cataloguing all your Docker builds.

Once you start to worry about this versioning issue, then you probably want packrat anyway, because it will handle cataloguing the specific version of dplyr and all its dependencies. So I feel like to get full reproducibility here you need a Dockerfile that performs a packrat::restore within it.
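To make the versioning worry concrete, here is a minimal sketch in base R that checks whether the installed packages match a set of pinned versions. The function name check_pins and the example pins are hypothetical and only for illustration; in a real project the pins would come from a lock file such as packrat.lock, and a mismatch could be fixed with something like remotes::install_version().

```r
# Compare installed package versions against a named vector of pinned versions.
# `pins` is a hypothetical stand-in for what a lock file would record.
check_pins <- function(pins) {
  vapply(names(pins), function(pkg) {
    # packageVersion() errors if the package is not installed; treat that as a mismatch.
    installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
    !is.null(installed) && installed == package_version(pins[[pkg]])
  }, logical(1))
}

# Example: 'stats' ships with R, so its version equals the running R version.
pins <- c(stats = as.character(getRversion()))
check_pins(pins)
```

A check like this only reports drift for the packages you list; it does not resolve or pin their dependencies, which is the part packrat does for you.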

Our 'solution'
At the MoJ, we're deploying R Studio to users from this Dockerfile. The user therefore doesn't realise they're using Docker - but it doesn't matter - they're guaranteed that their computing environment is fully specified and under version control. They then use packrat for reproducibility. They get fast Linux based package installs using our custom CRAN proxy.

Having said all of this, I have spent countless hours struggling with packrat, and it's one of my least favourite tools. Here's a recent tweet with various opinions from the community.

Another possible option
Given all of these troubles, I wonder whether a more extreme solution might turn out to be better for users: give them a 'reproducibility' Docker build of R Studio when they're doing certain projects like RAPs. This would be a totally fixed computing environment, where the list of packages is predetermined and changes only very infrequently (annually?). If I were to start again with RAP, I think this might actually be my preferred approach.

Other stuff
I've slowly learned that it's almost always best to reduce the number of package dependencies to the minimum possible. The promise of re-using others' code comes with a big dependency management cost. So others' packages should be used judiciously and each one should be treated as a 'cost'. I don't agree with this tweet but I do think it contains a grain of truth.
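One rough way to put a number on that 'cost' is to count how many packages each dependency pulls in, directly and indirectly. The sketch below uses tools::package_dependencies() from base R against the locally installed packages; the helper name count_deps is hypothetical, and the counts reflect whatever versions happen to be installed on your machine.

```r
# Count the recursive (Depends/Imports/LinkingTo) dependencies of each package,
# using the locally installed packages as the dependency database.
count_deps <- function(pkgs, db = installed.packages()) {
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  lengths(deps)  # named integer vector: package -> number of dependencies
}

# Example with packages that ship with R; swap in the packages your project uses.
count_deps(c("stats", "tools"))
```

Packages that are not installed locally will show a count of zero, so treat the output as a prompt for discussion, not an audit.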

@RobinL

RobinL commented Jan 4, 2019

ps @alexander-newton - if you're having issues with devtools installation, raise an issue here :-)

@mammykins

On @RobinL's final point, reducing package dependencies also reduces the maintenance burden and the exposure to security vulnerabilities that have yet to be discovered. Some strategies are discussed here: https://gds-way.cloudapps.digital/standards/tracking-dependencies.html

This may or may not be an issue depending on the context of the specific RAP use case.

@mammykins

Also, there is a rung up on the ladder from no dependency management before you get to packrat and docker:
devtools::session_info()
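
As a minimal sketch of this rung of the ladder, the snippet below writes the session details to a text file that can be committed alongside the outputs. It uses base R's sessionInfo() so that it runs without extra packages; devtools::session_info() could be swapped in for a more detailed report. The function name record_session and the file name are hypothetical.

```r
# Save the R version, platform and attached package versions to a text file.
record_session <- function(path = "session_info.txt") {
  info <- utils::capture.output(print(utils::sessionInfo()))
  stamp <- paste("Recorded:", format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
  writeLines(c(stamp, "", info), path)
  invisible(path)  # return the path so the call can be chained or checked
}

# Example: write to a temporary directory; in a project you might use the repo root.
record_session(file.path(tempdir(), "session_info.txt"))
```

This records what was used but does nothing to restore it, which is why it sits below packrat and Docker on the ladder.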

@TimTaylor

It is also important to understand what "reproducibility" is. Reproducibility ensures that the results you achieved can be achieved by others (a good thing when we are publishing). However, using packrat / docker images for a piece of analysis that you perform each year is not necessarily ideal if there are bugs in your dependencies that have been fixed in the interim. Whilst unit tests make it easier to check for issues when reproducing tables, it is trickier when you move towards more algorithmic projects (e.g. using regression or optimization algorithms). Adding the aforementioned security issues, my current leaning would first be a package (with CI to see if/when it breaks) combined with session info from when the analysis was run. If you want to archive it for reproducibility, build it in Docker and make the image available.

@matt-dray
Contributor

Is renv going to be the solution for R? Keep an eye on it.

The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.

@ivyleavedtoadflax

There is also checkpoint, which may be a good solution for some. One downside, as @RobinL points out, is that it can take an age on Linux. I used checkpoint in the original RAP, and it works more smoothly than packrat, but it lacks the fine-grained control of dependencies that you get from packrat, even if packrat is rather temperamental.

Frankly this lack of easy dependency management in the R ecosystem is a huge pain, and I expect someone will solve it at some point with a recipe based system similar to Python's pip.

In the long run I think that containerisation is a sensible solution, and one could envisage a container relating to each publication: here is one from the first RAP: https://github.com/DCMSstats/eesectorsdocker -- I used checkpoint to manage dependencies here.

@ivyleavedtoadflax

I expect someone will solve it at some point with a recipe based system similar to Python's pip.

I saw this at a UseR event today; looks like someone already did it: https://github.com/trinker/pacman/blob/master/R/p_install_version.R

@nacnudus nacnudus transferred this issue from ukgovdatascience/rap_companion Apr 15, 2019
@nacnudus nacnudus changed the title Packrat is more trouble than it's worth: discuss Discuss the dependencies and reproducibility page Apr 15, 2019
@nacnudus nacnudus changed the title Discuss the dependencies and reproducibility page Discuss the dependency and reproducibility page Apr 15, 2019
@nacnudus nacnudus added the discuss Discuss existing site content label Apr 23, 2019
nacnudus added a commit that referenced this issue May 3, 2019
@sebastian-fox wrote the new material about present-day you, future you, etc.,
and the section on renv.
nacnudus added a commit that referenced this issue May 7, 2019
Redraft dependency article with renv (#6 #8)