-
Notifications
You must be signed in to change notification settings - Fork 12
/
60-rr.Rmd
311 lines (235 loc) · 12.7 KB
/
60-rr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
# Reproducible research {#sec-rr}
**Learning Objectives**
- Understand the concept of reproducible research and reproducible
documents.
- Understand the process by which a source document is compiled into a
final report.
- Generate a reproducible report in html or pdf from an Rmarkdown
document using RStudio.
For a general introduction on the topic in French, see
@Pouzat:2015. If you want to explore the topic of reproducible
research in French, the [Recherche reproductible : principes
méthodologiques pour une science
transparente](https://www.fun-mooc.fr/courses/course-v1:inria+41016+session01bis/about)
MOOC is of interest.
Reproducible research refers to research that can be reproduced under
various conditions and by different people. It applies to every area
of research, both experimental and computational, but is often (but
not always) easier to implement for computational work. The different
levels of reproducibility can formalised[^rrblog] as follows:
[^rrblog]: The nomenclature is taken from [this blog
post](https://lgatto.github.io/rr-what-should-be-our-goals/) that
provides links and highlights that confusion that exists around
the terms and concepts of reproducibility.
- **Repeat** my experiment, i.e. obtain the same tables/graphs/results
using the same setup (data, software, ...) in the same lab or on the
same computer. That's basically re-running one of my analysis some
time after I original developed it.
- **Reproduce** an experiment (not ones own), i.e. obtain the same
tables/graphs/results in a different lab or on a different computer,
using the same setup (the data would be downloaded from a public
repository and the same software, but possibly using a different
version, or a different operation system).
- **Replicate** an experiment, i.e. obtain the same (similar enough)
tables/graphs/results in a different set up. The data could still be
downloaded from the public repository, or possibly
re-generated/re-simulated, and the analysis would be re-implemented
based on the original description.
- Finally, **re-use** the information/knowledge from one experiment to
run a different experiment with the aim to **confirm** results from
scratch.
The table below summerised these concepts focusing on data and code in
computational projects.
| | Same data | Different data |
|--------------------|:---------:|:--------------:|
| **Same code** | Repeat | Reproduce |
| **Different code** | Reproduce | Replicate |
There are many reasons to work reproducibly, and @Markowetz:2015
nicely summarises 5 good reasons. Importantly, he stressed out that
the first beneficiary of reproducible work are the student/researcher
that apply these principles:
1. Reproducibility helps to avoid disaster.
2. Reproducibility makes it easier to write papers[^rrexam1].
3. Reproducibility helps reviewers see it your way[^rrexam2].
4. Reproducibility enables continuity of your work.
5. Reproducibility helps to build your reputation.
[^rrexam1]: And course reports! The exam of this course will consist
in a reproducible report using the tools described below.
[^rrexam2]: Not only reviewers, also professors that read exams. See
previous footnote.
## `knitr` and `rmarkdown`
Reproducible research is an essential part of any data analysis. With
the tools that are available, one can argue that it has become more
difficult not to produce reproducible reports than to producing then.
Reproducible documents have been a part of R since the very
beginning. See for example @biocwp2, to see how such *compendia* play a
central role within the [Bioconductor](https://bioconductor.org/)
project (more about Bioconductor in it's dedicated
chapter). Originally, these were written in LaTeX, interleaved with R
code chunks, forming so called Sweave documents (with extension
`.Rnw`).
```{r rmarkdownsticker, results='asis', fig.margin=TRUE, fig.cap="The rmarkdown sticker", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE}
knitr::include_graphics("./figs/rmarkdown-200x232.png")
```
More recently, it has become to use the
[markdown](https://daringfireball.net/projects/markdown/) syntax
markup language, rather than LaTeX. Once interleaved with R code
chunks, these documents become **Rmarkdown** files (`.Rmd`). The can be
converted into markdown using `knitr::knit`, that executes the code
chunk and incorporates their output in the resulting markdown
documents, which itself is converted to one of many output formats,
typically pdf of html using [pandoc](http://pandoc.org/). In R, this
final conversion is done using `rmarkdown::render` (that relies on
pandoc).
- `knitr::knit` converts the `Rmd` into `md` by executing the code
chunks and replacing the code by its output (text, tables, figures,
...).
- The `md` file is then compiled into the desired [output
format](https://rmarkdown.rstudio.com/lesson-9.html) (typically html
or pdf) using `pandoc`.
- In practice, in R, these two steps are automatically handled in one
go by `rmarkdown::render()`.
```{r rmarkdownflow, results='asis', fig.cap="The rmarkdown workflow (image from RStudio)", out.width = '100%', echo=FALSE, purl=FALSE}
knitr::include_graphics("./figs/rmarkdownflow.png")
```
The [rmarkdown](http://rmarkdown.rstudio.com/) package is developed
and maintained by RStudio and benefits from excellent documentation,
support and integrates into the RStudio editor.
An Rmarkdown document is composed of
- An optional YAML **header**, delimited by `---`.
- Text in [simple **markdown**
format](https://pandoc.org/MANUAL.html#pandocs-markdown).
- One or more R **code chunks** delimited by three backticks. Each
code chunk can be uniquely named and parametrised with a set of code
chunk [options](http://yihui.name/knitr/options/).
These respective parts of an `Rmd` file are show below and will be
demonstrated during the course.
![](./figs/rmd.png)
RStudio also supports
[Notebook](https://bookdown.org/yihui/rmarkdown/notebook.html)
documents[^jupyter] that execute individual code chunks independently
and display directly in the source document.
![](./figs/rnb.png)
[^jupyter]: See also [Jupyter notebooks](https://jupyter.org/)
initially developed for Python, but that can run R code as well.
Here is an [R markdown cheat
sheet](https://rmarkdown.rstudio.com/lesson-15.html) provided by
RStudio and an [introduction
article](https://rmarkdown.rstudio.com/articles_intro.html).
The following video, [*R Markdown: The bigger
picture*](https://resources.rstudio.com/rstudio-conf-2019/r-markdown-the-bigger-picture)
by Garrett Grolemund at a 2019 RStudio conference provides a very nice
introduction on the many reasons why writing reproducible documents in
essential in data science and biomedical research.
## Additional features
- Among the most options that can set for code chunks is
`cache`. Setting `cache = TRUE` will avoid that specific code chunk to
be cached and not recomputed every time the documented in *knitted*,
unless the code chunk was modified. This is an important feature
when long computations are necessary.
- The `DT::datatable` function allows to create dynamic tables
directly from R, as show below.
```{r dtexample}
ip <- installed.packages()
DT::datatable(ip[, c(1, 3, 5, 6, 10)], rownames = FALSE)
```
- It is always useful to finish a Rmarkdown report with a section
providing all the session information details with the
`sessionInfo()` function, such at the end of this material. This
allows readers to review the version of R itself and all the
packages that were used to produce the report.
Using Rmarkdown, it is also possible to produce
[slides](https://rmarkdown.rstudio.com/lesson-11.html),
[websites](https://rmarkdown.rstudio.com/lesson-13.html), and
[complete books](https://bookdown.org/), [interactive
documents](https://rmarkdown.rstudio.com/lesson-14.html) and [R
package vignettes](http://bioconductor.org/help/package-vignettes/).
`r msmbstyle::question_begin()`
Prepare an Rmarkdown report summarising the portal ecology data. The
report should include a *Material and methods* section where the data
is read in (ideally from the online file) and briefly described, a
*Data preparation* section where rows with missing values are filtered
out, and a *Visualisation* section where one or two plots are
rendered. Finish your report with a *Session information* section.
`r msmbstyle::question_end()`
The [R Markdown Cheat
Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown.pdf)
and [Reference
Guide](https://www.dataquest.io/blog/r-markdown-guide-cheatsheet/)
will help you with the markdown synatax, R code chunk options, and
RStudio utlisation.
**Note**: When you prepare an `Rmd` report, it is advised to start
with a code chunk that load all packages, and update that code chunk
as you proceed with your analysis and use new packages. This avoids
the situation where some commands work in your R console (where you
initially loaded the package) but fail when you compile your
report. Indeed, an `Rmd` file is compiled in a new, clean R session,
without access to your working session, the packages that were loaded,
and the variables that were created.
<iframe width="560" height="315" src="https://www.youtube.com/embed/4OXyyMIM6A8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Docker containers
There are other tools for reproducible research, that aim to
disseminate more than code and data. Docker containers for example
enable to share the complete image of an operating system, including
all system dependencies and software/data to repeat a complete
analysis. These are useful tools, even though their aren't necessarily
ideal, and beyond the aim of this course. In the annex, (chapter
\@ref(sec-anx)), we show how to use a pre-build course-specific
RStudio cloud instance based on the [Renku](https://renkulab.io/)
infrastructure.
## Additional exercises
`r msmbstyle::question_begin()`
Repeat the *Beer consumption* analysis using the that was done in
chapters \@ref(sec-dplyr) and \@ref(sec-vis).
To test if your report is fully reproducible, uniquely name the `Rmd`
file as `surname_surname_beers_report.Rmd`, post it on the course
forum and ask your neighbour to download it, compile it into pdf and
provide feedback on whether the document was reproducible and easy to
follow[^asinexam].
[^asinexam]: This is an important exercise, as it will mimic the exam
situation, where you will hand in your `Rmd` reports that will
need to be compiled (to pdf) before marking. In addition, it
demonstrates the challenges of writing and reproducing an
understandable and reproducible document.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
1. Create a new R markdown file called `student_results.Rmd`, to be
compiled in either pdf or html. Remove everything but the lines
containing the header. In the example below, the header is composed
of the 6 first lines, starting and ending with `---`.
```
---
title: "Student results"
author: "Laurent"
date: "March 14, 2020"
output: pdf_document
---
```
2. Add a first section called *Data input*, in which you will load the
`rWSBIM1207` package, use the `interroA.csv()` function to get the
name of a csv file containing test results for a set of students,
and read these data into R using `read_csv`. [^readf] Display the
few first observations and write a short sentence explaining the
data.
[^readf]: It is important here **not** to copy/paste the filename
returned by `interroA.csv()`. That file is distributed with the
`rWSBIM1207` package, and has a computer-specific path. Because we
want our reports to be reproducible, we want to use the filename
as returned by the function on any computer, not the one of a
specific computer. Make sure you either pass `interroA.csv()`
directly to `read_csv()` or store its output into a variable that
is passed to `read_csv()`.
3. Make sure that you can compile you `Rmd` file into either pdf or html.
4. Create a new section called *Data visualisation*.
Here, the goal is to visualise the score distributions for the four
tests using `ggplot2`. These distributions will be visualised using
boxplots. You will need to visualise these distribution for each
test separately, and for male and female students.
5. As discussed during the course, we need data in a long format to be
able to use `ggplot2`. Start by converting these data into a long
format using `pivot_longer()`. Display the first rows of these new
data and write a short sentence describing them and the
transformation you just applied.
6. Use `ggplot2` to visualise the score distributions along boxplots
for each test and for female and male students.
`r msmbstyle::question_end()`