forked from b-rodrigues/rap4all
-
Notifications
You must be signed in to change notification settings - Fork 0
/
repro_intro.qmd
734 lines (605 loc) · 29.4 KB
/
repro_intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
# Basic reproducibility: freezing packages
We are at a stage where the analysis is done. Converting our scripts into Rmds
was quite easy to justify because writing the Rmds is also writing the report
that we need to send to our boss (or our research paper, etc). But it might be
harder to justify writing further documentation or package the functions we’ve
had to write for reuse later and otherwise ensure that the analysis is and stays
reproducible. So we are going to start with the simplest and most cost-effective
solution to the reproducibility issue, which is recording the versions of the
packages that were used. This is quite easy and quick to do and provides at
least some hope that the analysis will stay reproducible. But this will not do
anything to make the analysis more easily re-usable, will not improve the
documentation, nor ensure that what we wrote is indeed correct. For this, we
would need to write tests, which are missing from our current analysis. We only
wrote one test, which made sure that all the communes were accounted for. This
is why going with a package is so useful (and I need to stress here that the aim
of writing a package is NOT to upload it to CRAN): packages offer us a great
framework for documenting, testing and sharing our code (even if only sharing
internally in your company/team, or even just future you). So in order to take
advantage of what packaging our code has to offer, we will learn about the
`{fusen}` package in the next chapter. `{fusen}` will enable us to convert our
Rmd files into a package quite quickly; your analysis is much closer to being a
package than you think. As you shall see in the next chapter, going from our
Rmds to a fully functioning package is much easier than you expect, even if
you’ve never written a package in your life.
So I hope that I made my point clear: it is not recommended to stop at this
stage, but I also recognize that we live in the real world with real physical
constraints. So because we live in this imperfect world, sometimes we need to
deliver imperfect work. So let’s see what we can do that is very cheap in terms
of effort and time, but that still allows us to have some hope of having our
analysis reproducible, all thanks to the `{renv}` package. It is quite easy to
get started with `{renv}`: simply install it, and get a record of the used
packages and their versions with one single command. This record gets saved
inside a file that can then be used to restore this project’s library in the
future, and without interfering with the other packages that you already have
installed on your machine. You see, `{renv}` creates a per-project library (a
library is the set of R packages installed on your machine) which means that you
can have as many versions of `{dplyr}` as needed (one per project). The right
version of `{dplyr}` will be installed and used for the right project only, and
without interfering with other installed versions.
Let’s see how this works by creating such a project-specific library for our little
project.
## Recording packages’ version with `{renv}`
So now that you’ve used functional and literate programming, we need to start
thinking about the infrastructure surrounding our code. By infrastructure I
mean:
- the R version;
- the packages used for the analysis;
- and otherwise the whole computational environment, even the computer hardware itself.
`{renv}` enables you to create **R**eproducible **Env**ironments and is a
package that takes care of point number 2: it allows you to easily record the
packages that were used for a specific project. This record is a file called
`renv.lock` which will appear at the root of your project once you’ve set up
`{renv}` and run it. You can use `{renv}` once you’re done with an analysis like
in our case, or better yet, immediately at the start of the project. You can
keep updating the `renv.lock` file as you add or remove packages from your
analysis. The `renv.lock` file can then be used to restore the exact same
package library that was used for your analysis on another computer, or on the
same computer but in the future.
This works because `{renv}` does more than simply create a list of the used
packages and recording their versions inside the `renv.lock` file: it actually
creates a per-project library that is completely isolated for the main, default,
R library on your machine, but also from the other `{renv}` libraries that you
might have set up for your other projects (remember, the *library* is the set of
R packages installed on your computer). To save time when setting up an `{renv}`
library, packages simply get copied over from your main library instead of being
re-downloaded and re-installed (if the required packages are already installed
in your default library).
To get started, install the `{renv}` package (make sure to start a fresh R
session):
```{r, eval = F}
install.packages("renv")
```
and then go to the folder containing the Rmds we wrote together in the previous
chapter. Make sure that you have the two following files in that folder:
- `save_data.Rmd`, the script that downloads and prepares the data;
- `analyse_data.Rmd`, the script that analyses the data.
Also, make sure that the changes are correctly backed up on Github.com, so if
you haven’t already, commit and push any change to the `rmd` branch. Because we
will be experimenting with a new feature, create a new branch called `renv`. You
should know the drill by now, but if not simply follow along:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git checkout -b renv
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git checkout -b renv
```
:::
```bash
Switched to a new branch 'renv'
```
We will now be working on this branch. Simply work as usual, but when pushing,
make sure to push to the `renv` branch:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git add .
owner@localhost ➤ git commit -m "some changes"
owner@localhost ➤ git push origin renv
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git add .
owner@localhost $ git commit -m "some changes"
owner@localhost $ git push origin renv
```
:::
Once this is done, start an R session, and simply type the following in a console:
```{r, eval = F}
renv::init()
```
You should see the following:
```{r, eval = F}
* Initializing project ...
* Discovering package dependencies ... Done!
* Copying packages into the cache ... [76/76] Done!
The following package(s) will be updated in the lockfile:
# CRAN ===============================
***and then a long list of packages***
The version of R recorded in the lockfile will be updated:
- R [*] -> [4.2.2]
* Lockfile written to 'path/to/housing/renv.lock'.
* Project 'path/to/housing' loaded. [renv 0.16.0]
* renv activated -- please restart the R session.
```
Let’s take a look at the files that were created (if you prefer using your file
browser, feel free to do so, but I prefer the command line):
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ ls -la
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ ls -la
```
:::
```bash
total 1070
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:44 .
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:35 ..
-rw-r--r-- 1 owner Domain Users 27 Feb 27 12:44 .Rprofile
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:40 .git
-rw-r--r-- 1 owner Domain Users 306 Feb 27 12:35 README.md
-rw-r--r-- 1 owner Domain Users 2398 Feb 27 12:38 analyse_data.Rmd
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:44 renv
-rw-r--r-- 1 owner Domain Users 20502 Feb 27 12:44 renv.lock
-rw-r--r-- 1 owner Domain Users 6378 Feb 27 12:38 save_data.Rmd
```
As you can see, there are two new files and one folder. The new files are the
`renv.lock` file that I mentioned before and a file called `.Rprofile`. The new
folder is simply called `renv`. The `renv.lock` is the file that lists all the
packages used for the analysis. `.Rprofile` files are files that get read by R
automatically at startup (as discussed at the very beginning of part one of this
book). You should have a system-wide one that gets read on startups of R, but if
R discovers an `.Rprofile` file in the directory it starts on, then that file
gets read instead. Let’s see the contents of this file (you can open this file
in any text editor, like Notepad on Windows, but then again I prefer the command
line):
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ cat .Rprofile
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ cat .Rprofile
```
:::
The files contains the single line:
```r
source("renv/activate.R")
```
`activate.R` is an R script that was also create by `renv::init()`, which you
can find in the `renv` folder. Let’s take a look at the contents of this folder:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ ls renv
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ ls renv
```
:::
```bash
total 107
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:44 .
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:35 ..
-rw-r--r-- 1 owner Domain Users 27 Feb 27 12:44 activate.R
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:40 .gitignore
drwxr-xr-x 1 owner Domain Users 0 Feb 27 12:40 library
-rw-r--r-- 1 owner Domain Users 6378 Feb 27 12:38 settings.dcf
```
So inside the `renv` folder, there is another folder called `library`: this is
the folder that contains our isolated library for just this project. That’s
something that we would not want to back up on Github as it grows quite large.
To avoid tracking this folder and backing it up on Github.com a file called
`.gitignore` (notice the `.` at the start of the name) gets used. This is a file
that contains the paths to other files and folders that should be ignored. If
you open it, you will see that the `library/` folder is listed there so it will
be ignored. You can have as many `.gitignore` files as necessary, but if you put
one in the root folder of your project, then that `.gitignore` will work
project-wide.
For example, if you are working with sensitive data, you could also add a
`.gitignore` file in the root of the project’s directory, and simply list the
folder containing the sensitive data. Create this file using your favourite text
editor and simply add, for example if you’re working with sensitive data, the
following:
```bash
datasets/
```
This will prevent the `datasets/` folder from being tracked and backed up.
Let’s restart a fresh R session in our project’s directory; you should see the
following startup message:
```{r, eval = F}
* Project 'path/to/housing' loaded. [renv 0.16.0]
```
This means that this R session will use the packages installed in the isolated
library we’ve just created. Let’s now take a look at the `renv.lock` file:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ cat renv.lock
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ cat renv.lock
```
:::
::: {.content-visible when-format="pdf"}
```json
{
"R": {
"Version": "4.2.2",
"Repositories": [
{
"Name": "CRAN",
"URL": "https://packagemanager.rstudio.com/ all/latest"
}
]
},
"Packages": {
"MASS": {
"Package": "MASS",
"Version": "7.3-58.1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "762e1804143a332333c054759f89a706",
"Requirements": []
},
"Matrix": {
"Package": "Matrix",
"Version": "1.5-1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "539dc0c0c05636812f1080f473d2c177",
"Requirements": [
"lattice"
]
***and many more packages***
```
:::
::: {.content-hidden when-format="pdf"}
```json
{
"R": {
"Version": "4.2.2",
"Repositories": [
{
"Name": "CRAN",
"URL": "https://packagemanager.rstudio.com/all/latest"
}
]
},
"Packages": {
"MASS": {
"Package": "MASS",
"Version": "7.3-58.1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "762e1804143a332333c054759f89a706",
"Requirements": []
},
"Matrix": {
"Package": "Matrix",
"Version": "1.5-1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "539dc0c0c05636812f1080f473d2c177",
"Requirements": [
"lattice"
]
***and many more packages***
```
:::
The `renv.lock` file is a json file listing all the packages, as well as their
dependencies, that are used for the project. At the top of the file, the R
version that was used to generate the `renv.lock` file is also named. It is
important to remember that when you’ll use `{renv}` to restore a project’s
library on a new machine, the R version will not be restored: so in the future,
you might be restoring this project and running old versions of packages on a
newer version of R, which may sometimes be a problem (but we’re going to discuss
this later).
So... that’s it. You’ve generated the `renv.lock` file, which means that future
you, or someone else, can restore the library that you used to write this
analysis. All that’s required is for that person (or future you) to install
`{renv}` and then use the `renv.lock` file that you generated to restore the
library. Let’s see how this works by cloning the following Github repository on
this
[link](https://github.com/b-rodrigues/targets-minimal)^[https://is.gd/jMVfCu]
(forked from this one
[here](https://github.com/wlandau/targets-minimal)^[https://is.gd/AAnByB]):
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git clone [email protected]:b-rodrigues/targets-minimal.git
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git clone [email protected]:b-rodrigues/targets-minimal.git
```
:::
You should see a `targets-minimal` folder on your computer now. Start an R session
in that folder and type the following command:
```{r, eval = F}
renv::restore()
```
You should be prompted to activate the project before restoring:
```{r, eval = F}
This project has not yet been activated.
Activating this project will ensure the project library
is used during restore.
Please see `?renv::activate` for more details.
Would you like to activate this project before restore? [Y/n]:
```
Type `Y` and you should see a list of packages that need to be installed. You’ll
get asked once more if you want to proceed, type `y` and watch as the packages
get installed. If you pay attention to the links, you should see that many of
them get pulled from the CRAN archive, for example:
::: {.content-visible when-format="pdf"}
```r
Retrieving 'https://cloud.r-project.org/src/contrib/Archive /vroom/vroom_1.5.5.tar.gz'
```
:::
::: {.content-hidden when-format="pdf"}
```r
Retrieving 'https://cloud.r-project.org/src/contrib/Archive/vroom/vroom_1.5.5.tar.gz'
```
:::
Notice the word "Archive" in the url? That’s because this project uses `{vroom}`
1.5.5, but as of writing (early 2023), `{vroom}` is at version 1.6.1.
Now, maybe you’ve run `renv::restore()`, but the installation of the packages
may have failed. If that’s the case, let me explain what likely happened.
I tried restoring the project’s library on two different machines: a Windows
laptop and a Linux workstation. `renv::restore()` failed on the Windows laptop,
but succeeded on the Linux workstation.
Why does that happen? Well in the case of the Windows laptop, compilation of the
`{dplyr}` package failed. This is likely because my Windows laptop does not have
the right version of Rtools installed. If you look inside the `renv.lock` file
that came with the `targets-minimal` project, you should notice that the
recorded R version is 4.1.0, but I’m running R 4.2.2 on my computers. So
libraries get compiled using Rtools 4.2 and not Rtools 4.0 (which includes the
libraries for R 4.1 as well).
So in order to run this project successfully, I should install the right version
of R and Rtools which on Windows should not be too complicated. But that might
be a problem on other operating systems. Does that mean that `{renv}` is
useless? No, not at all.
At a minimum, `{renv}` ensures that a project’s library doesn’t interfere with
another project’s library while you’re working on different projects. You can be
working on several projects and be sure that if you update your library (for
example, to use a nice new function from one specific package), that update will
only affect the project where the library was updated, and not the other
libraries from the other projects. When working with a system-wide library,
updating packages for a project could introduce breaks in other projects.
Without `{renv}`, such breaks could happen because another function coming from
some other package that also got updated and that you use in another long-term
project got removed, or renamed, or simply works differently now. This types of
issues can be avoided by using a per-project library, which is exactly what
`{renv}` does.
But also, apart from that already quite useful feature, `renv.lock` files
provide a very useful blueprint for Docker, which we are going to explore in a
future chapter. Only to give you a little taste of what’s coming: since the
`renv.lock` file lists the R version that was used to record the packages, we
can start from a Docker image that contains the right version of R. From there,
restoring the project using `renv::restore()` should succeed without issues. If
you have no idea what this all means, do not worry, you will know by the end of
the book, so hang in there.
So should you use `{renv}`? I see two scenarios where it makes sense:
- You’re done with the project and simply want to keep a record of the packages used. Simply call `renv::init()` at the end of the project and commit and push the `renv.lock` file on Github.
- You want to use `{renv}` from the start to isolate the project’s library from your whole R installation’s library to avoid any interference (I would advise you to do it like this).
In the next section, we’ll quickly review how to use `{renv}` on a "daily basis".
### Daily `{renv}` usage
So let’s say that you start a new project and want to use `{renv}` right from
the start. You start with an empty directory, and add a template `.Rmd` file,
and let’s say it looks like this:
````{verbatim}
---
title: "My new project"
output: html_document
---
```{r setup, include=FALSE}
library(dplyr)
```
## Overview
## Analysis
````
Before continuing, make sure that it correctly compiles into a HTML file by
running `rmarkdown::render("test.Rmd")` in the correct directory.
In the `setup` chunk you load the packages that you need. Now, save this file,
and start a fresh session in the same directory and run `renv::init()`. You
should see the familiar prompts described above, as well as the `renv.lock` file
(which will only contain `{dplyr}` and its dependencies).
Now, after the `library(dplyr)` line, add the following `library(ggplot2)` (or
any other package that you use on a daily basis). Make sure to save the `.Rmd`
file and try to render it again by using `rmarkdown::render("test.Rmd")` (or if
you’re using RStudio, by clicking the right button), but, spoiler alert, it
won’t work. Instead you should see this:
```r
Quitting from lines 7-9 (my_new_project.Rmd)
Error in library(ggplot2) : there is no package called 'ggplot2'
```
Don’t be confused: remember that `{renv}` is now activated, and that each
project where `{renv}` is enabled has its own project-specific library. You may
have `{ggplot2}` installed on your system-wide library, but this
`{renv}`-enabled project does not have it yet on its own specific library. This
means that you need to install `{ggplot2}` for your project. To do so, simply
start an R session within your project and run `install.packages("ggplot2")`. If
the version installed on your system-wide library is the latest version
available on CRAN, the package will simply be copied over from the system-wide
to the project-specific library, if not, the latest version will be downloaded
onto your project’s library. You can now update the `renv.lock` file. This is
done using `renv::snapshot()`; this will show you a list of new packages to
record inside the `renv.lock` file and ask you to continue:
```{r, eval = F}
**list of many packages over here**
Do you want to proceed? [y/N]:
* Lockfile written to 'path/to/my_new_project/renv.lock'.
```
If you now open the `renv.lock` file, and look for the string `"ggplot2"` you
should see it listed there alongside its dependencies. Let me reiterate: this
version of `{ggplot2}` is now unique to this project. You can work on other
projects with other versions of `{ggplot2}` without interfering with this one.
If for some reason you didn’t want to have the latest version of `{ggplot2}`
installed in the project-specific library, you could have installed an older
version, still thanks to `{renv}`. For example, to install an older version of
`{AER}`:
```{r, eval = F}
renv::install("[email protected]") # this is a version from August 2008
```
But just like in the previous section, where we wanted to restore an old project
that used `{renv}`, installation of older packages may fail. If you need to use
old packages there are other easier approaches which we are also going to
explore in this chapter.
When you install more of the required packages for your project, you need to
call `renv::snapshot()` to add the packages to the `renv.lock` file. Once you’re
done with your project, call `renv::snapshot()` one last time to make sure that
every dependency is correctly accounted for. Don't forget to version the
`renv.lock` file using Git!
### Collaborating with `{renv}`
`{renv}` is also quite useful when collaborating. You can start the project and
generate the lock file, and when your teammates clone the repository from
Github, they can get the exact same package versions as you. All of you only
need to make sure that everyone is running the same R version to avoid any
issues.
There is a vignette on just this that I invite you to read for more details,
see [here](https://rstudio.github.io/renv/articles/collaborating.html)^[https://is.gd/sXpWVp].
### {renv}’s shortcomings
As useful as `{renv}` is, it also comes with some shortcomings (which I’ve
already alluded to before). It is quite important to understand what `{renv}`
does and what it doesn’t do, and why `{renv}` alone is not enough to ensure the
long-term reproducibility of your projects.
The first problem, and I’m repeating myself here, is that `{renv}` only records
the R version used for the project, but does not restore it when calling
`renv::restore()`. You need to install the right R version yourself. On Windows
this should be fairly easy to do, but then you need to start juggling R versions
and know which scrips need which R version, which can get confusing.
There is the `{rig}` package that makes it easy to install and switch between R
versions that you could check
[out](https://github.com/r-lib/rig)^[https://is.gd/dvH2Sj] if you’re interested.
However, I don’t think that `{rig}` should be used for our purposes. I believe
that it is safer to use Docker instead, and we shall see how to do so in the
coming chapters.
The other issue of using `{renv}` is that future you, or your teammates or
people that want to reproduce your results need to install packages that may be
quite difficult to install, either because they’re very old by now, or because
their dependencies are difficult to satisfy. Have you ever tried to install a
package that depended on `{rJava}`? Or the `{rgdal}` package? Installing these
packages can be quite challenging because they need specific system
requirements that may be impossible for you to install (either because you don’t
have admin rights on your workstation, or because the required version of these
system dependencies is not available anymore). Having to install these packages
(and potentially quite old versions at that) can really hinder the
reproducibility of your project. Here again, Docker provides a solution. Future
you, your teammates or other people simply need to be able to run a Docker
container, which is a much lower bar than installing these old libraries.
Note also that during your journey, you will want to improve your scripts, you
maybe will also want to upgrade some dependencies to take advantage of recent
developments in some packages. This will require upgrading the `{renv}`
environment. To do so, run `update.packages()` to update the packages, and then
`renv::snapshot()` to to generate a new `renv.lock` file. But how will you be
sure that these upgrades will not impair some other parts of your code? Turning
the analysis into a package will help with this, as I’ll explain in the next
chapter, because once the analysis is turned into a package, testing gets much
easier. If the tests indicate that something went wrong, you can simply restore
the previous lock file using Git and restore the old library.
I want to stress that this does not mean that `{renv}` is useless: we will keep
using it, but together with Docker to ensure the reproducibility of our project.
As I’ve written above already, *at a minimum `{renv}` ensures that a project’s
library doesn’t interfere with another project’s library* and this is in itself
already quite useful. The `renv.lock` file also provides a blueprint that can be
used (by you or someone else) to build a Docker image for long-term
reproducibility.
Let’s now quickly discuss two other packages before finishing this chapter,
which provide an answer to the question: *how to rerun an old analysis if no
`renv.lock` file was made available at the time the analysis was made?*.
## Becoming an R-cheologist
So let’s say that you need to run an old script, and there’s no `renv.lock` file
around for you to restore the right package library. There might still be a
solution (apart from running the script on the current version on R and
packages, and hope that everything goes well), but for this you need to at least
know roughly *when* that script was written. Let’s say that you know that
this script was written back in 2017, somewhere around October. If you know
that, you can use the `{rang}` or `{groundhog}` packages to download the
packages as of October 2018 in a separate library and then run your script.
`{rang}` is fairly recent as of writing (February 2023) so I won’t go into too
much detail now, as it is likely that the package will keep evolving rapidly in
the coming weeks. So if you want to use it and follow its development, take a
look at its Github repository
[here](https://github.com/chainsawriot/rang)^[https://is.gd/sQu7NV] and read the
[prepint](https://arxiv.org/abs/2303.04758)^[https://arxiv.org/abs/2303.04758] [@schoch2023].
`{groundhog}` ([website](https://groundhogr.com)^[https://groundhogr.com]) is another option that has been around for more time and is fairly
easy to use. Suppose that you have a script from October 2018 that looks like
this:
```{r, eval = F}
library(purrr)
library(ggplot2)
data(mtcars)
myplot <- ggplot(mtcars) +
geom_line(aes(y = hp, x = mpg))
ggsave("/home/project/output/myplot.pdf", myplot)
```
If you want to run this script with the versions of `{purrr}` and `{ggplot2}`
that were current in October 2017, you can achieve this by simply changing the
`library()` calls:
```{r, eval = F}
groundhog::groundhog.library("
library(purrr)
library(ggplot2)",
"2017-10-04"
)
data(mtcars)
myplot <- ggplot(mtcars) +
geom_line(aes(y = hp, x = mpg))
ggsave("/home/project/output/myplot.pdf", myplot)
```
but you will get the following message:
```r
-----------------------------------------------
|IMPORTANT.
| Groundhog says: you are using R-4.2.2, but the version of R current
| for the entered date, '2017-10-04', is R-3.4.x. It is recommended
| that you either keep this date and switch to that version of R, or
| you keep the version of R you are using but switch the date to
| between '2022-04-22' and '2023-01-08'.
|
| You may bypass this R-version check by adding:
| `tolerate.R.version='4.2.2'`as an option in your groundhog.library()
| call. Please type 'OK' to confirm you have read this message.
| >ok
```
Groundhog is speaking to us, advising us to switch to the version of R that was
current at that time the script was written. If we want to ignore the message’s
advice, and add `tolerate.R.version = '4.2.2'`, we may get the script to run
anyways:
```{r, eval = F}
groundhog.library("
library(purrr)
library(ggplot2)",
"2017-10-04",
tolerate.R.version = "4.2.2")
data(mtcars)
myplot <- ggplot(mtcars) +
geom_line(aes(y = hp, x = mpg))
ggsave("/home/project/output/myplot.pdf", myplot)
```
But just like for `{renv}` (or `{rang}`), installation of the packages can fail,
and for the same reasons (unmet system requirements most of the time).
So here again, the solution is to take care of the missing piece of the
reproducibility puzzle, which is the whole computational environment itself.
## Conclusion
In this chapter you had a first (maybe a bit sour) taste of reproducibility.
This is because while the tools presented here are very useful, they will not be
sufficient if we want our project to be truly reproducible, especially in the
long term. There are too many things that can go wrong when re-installing old
package versions, so we must instead provide a way for users to not have to do
it at all. This is where Docker is going to be helpful. But before that, we need
to hit the development bench again. We are actually not quite done with our
project; before going to full reproducibility, we should turn our analysis into
a package. And as you will see, this is going to be much, much, easier than you
might expect. You already did 95% of the job! There are many advantages to
turning our analysis into a package, and not only from a reproducibility
perspective.