-
Notifications
You must be signed in to change notification settings - Fork 0
/
main.rmd
398 lines (287 loc) · 20.3 KB
/
main.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
---
title: "Combining scientific data processing and publication using Literate Programming to enable reproducible research"
author: "Andy Clifton"
date: '`r Sys.Date()`'
output:
bookdown::word_document2:
fig_caption: yes
reference_docx: default
bookdown::pdf_document2:
latex_engine: pdflatex
toc: true
number_sections: true
fig_caption: true
keep_tex: true
citation_package: natbib
bookdown::html_document2:
number_sections: true
toc: true
toc_float:
collapsed: false
smooth_scroll: false
fontsize: 11pt
geometry: margin=1in
graphics: yes
bibliography: main.bib
linkcolor: blue
urlcolor: red
citecolor: cyan
link_citations: true
---
<!-- If you are reading the source code:
This is R Markdown. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com> or <http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html>.
This script is designed to be used with knitr in R. -->
# Introduction{#sec:intro}
<!-- Something something reproducible research, mumble, grumble, get off my lawn, grumble. -->
Scientific progress is underpinned by the exchange of information between scientists. This allows others to reuse methods, repeat work and confirm -- or refute -- findings, and adds to the reputation of the research. These ideals are sometimes lumped together under the heading of ``reproducible research''. As a result of applying these methods, new developments can make their way into the body of knowledge.
Reproducible research is beneficial in many non-academic settings, for example by showing that a consultant applied all relevant standards in designing a product, or during an audit when it can be used to show conformance to business processes.
Reproducible research is however often just a goal, rather than a reality. For example, analysis processes are often separated from publication because of the need to switch software. Analysis might be done using python, Fortran, Matlab, or any number of tools, while the writing might be done using LaTeX, a desktop publishing software, or something else. The use of multiple software types means that there is a disconnect between the two steps; it is possible that the published document does not actually accurately represent the data. This is not, then, reproducible research.
Recent growth in the capabilities of data repositories such as GitHub and Zenodo means that is feasible to store data and documents together and share them. This assists in sharing data and algorithms and thus replicating research, but the disconnect between data and publication persists. What is therefore needed is a method to clearly link data, algorithms, and publications, and allow that to be shared. This would be beneficial to many people.
## Linking analysis and publication workflows
This document demonstrates the application of Literate Programming to reproducible research. Literate programming means that the program documentation is complete and contained within the program itself [@Knuth1984]^[Yes, that's the same `Knuth' who invented LaTeX]. It is important to note that the documentation is effectively a publication, and thus it is possible to combine data analysis with the creation of a publication in the same file. The use of literate programming therefore mitigates this barrier to reproducible research.
A literate program could be stored together with the input files, processing algorithms, and output documents in a repository. That repository could be shared with others directly who can then use the contents however they need. As an example, the document that you are reading has been archived in a GitHub repository, and can be downloaded and used by anyone to replicate the output documents. Later in this document I will explore some different ways of storing and distributing such information depending on confidentiality requirements.
## How Literate Programming was used to write this document
In this example, an output PDF document and results are generated from a file called _main.rmd_. _main.rmd_ is an [R markdown file](https://rmarkdown.rstudio.com/authoring_basics.html).
R markdown is a flavor of markdown -- basically pandoc -- that can be processed by the R programming language [@R-base] to run code (i.e, do analysis) and create documentation from the same document. It looks like normal text and the markdown document contains a mixture of documentation and ``code chunks''.
The code chunks look like this:
````
`r ''````{r, echo=TRUE}
y = 40 + 2
print(y)
```
````
.. which evaluates to
```{r y1, echo=TRUE}
y = 40 + 2
print(y)
```
The code chunks above are written in R. They are evaluated at the time the document is processed. This is enabled by the _knitr_ package, which also includes support for other languages (See section \@ref(sec:pleaseNotR)). Code chunks can be configured so that their outputs are echoed to the document (or not). You can thus completely hide data wrangling operations in your output PDF and just display results.
Pandoc supports raw latex, and so we can also leverage all of the capabilities of LaTeX, but care should be taken not to start loading lots of latex packages; you need to remember the limitations of your output formats.
There are a lot of different possible output formats, including PDF, HTML, Notebooks, other markdown formats, and many others. Corporate formatting can usually be applied without modifying their content. The details of this are out of scope for this paper; instead, see @R-Markdown-Guide or https://bookdown.org/yihui/rmarkdown/documents.html for more information.
Instructions for how to run process (render) _main.rmd_ are included in the _HOWTO.md_ file in this repository. A far more detailed guide to writing using R markdown can be found in @R-Markdown-Guide.
There is a learning curve to all of this. I suggest reading this PDF together with the R markdown file (_main.rmd_) and possibly the _knitr_ package's instructions^[See https://yihui.name/knitr/]. This will greatly help in understanding what is done in the processing and what makes it to the publication.
<!-- Execute some code without displaying results. -->
```{r clean up, echo = FALSE}
rm(list = ls())
```
## Please not R{#sec:pleaseNotR}
If you can't handle learning yet another new language, this statement might interest you:
> "A less well-known fact about R Markdown is that many other languages are also supported, such as Python, Julia, C++, and SQL. The support comes from the _knitr_ package, which has provided a large number of language engines."
>
> --- @R-Markdown-Guide
You can use the other programming languages by replacing the `` {r, '' in the code chunk with the name of another language. The currently available language engines are:
```{r languages,}
require(knitr)
names(knitr::knit_engines$get())
```
So, you have no excuse. You can write your code in any of those `r NROW(names(knitr::knit_engines$get()))` languages, and off you go.
The only challenge that might arise is that the document is _R-flavoured_ markdown. This means that it needs to be rendered from within R, but that step can be dealt with relatively easily, for example leveraging online resources such as the stackoverflow network.
# Implementing a coupled analysis and publication workflow
Let's assume that you have formed a hypothesis and gathered a bunch of data to test it. You now want to analyse that data and create a report or article that summarizes your findings. An analysis and publication workflow usually follows a similar path:
1. Set up the computing environment
1. Load some external packages
1. Load our own data processing routines
1. Import data
1. Plot it
1. Do some operations
1. Plot some more
1. Write
1. Save
1. Iterate around items 1-9 for a while
1. Format for publication
1. Submit
All of this can be captured in our _main.rmd_ R Markdown file, with the exception of the final ``submit'' stage.
## Setting up the computing environment
Like most scripts, _main.rmd_ includes a few variables that the user must set to run the analysis.
* The `base.dir` variable defines the location of the files required for this analysis.
* The `made.by` variable forms part of a label that will be added to the plots.
An advantage of using markdown and the _knitr_ package is that we can execute the code and show the code and results inline:
```{r user-defined options, echo=TRUE, results = 'hold'}
# Where can files be found?
base.dir <- file.path(getwd())
# Who ran this script
made.by = "A. Clifton"
```
We can also show the value of those variables in the documentation. For example:
* `base.dir` is `r base.dir`
* `made.by` is `r made.by`.
Imagine that when we started our project, we set up several subdirectories in `base.dir`. These are:
* _code/_ contains functions required for the analysis
* _data/_ contains the data files to be analyzed.
Let's tell the code where to find these directories. And, while we do it, we can also change R's working directory (`working.dir`) to the root directory of the project.
```{r define file locations, message=FALSE, echo = TRUE}
# define the working directory
working.dir <- base.dir
setwd(working.dir)
#identify data directory
data.dir = "data"
#identify code directory
code.dir = "code"
```
We now want to create a new directory for the results of the analysis.
```{r make output directory, message=FALSE, echo = FALSE}
# define the path to the directory that we will use to store all of the data
output.dir = "analysis"
# create this directory if it doesn't already exist
dir.create(file.path(base.dir,
output.dir),
showWarnings = FALSE,
recursive = TRUE)
```
Looking at your file system, you'll see there is now a new directory called _`r output.dir`_.
## Load packages
Packages are required to supplement base functions in R and many other languages. For example, this script requires the _reticulate_, _bookdown_, _ggplot2_, _grid_, _knitr_, _RColorBrewer_, _rgdal_, and _stringr_ packages to run. These are called from the script using the `require()` function. This assumes that the packages are available on your system.^[For details of how to install packages, see the RStudio help.]
** Note: ** The use of packages represents a challenge to reproducable and repeatable research as it is possible that the function and output of the packages may change over time.
```{r load packages, message=FALSE, echo = FALSE}
require(bookdown)
require(ggplot2)
if(packageVersion("ggplot2") < "2.2.0") {
stop("Need package:ggplot2 2.2 or higher for labs function")
}
require(grid)
require(reshape2)
require(stringr)
```
## Loading our own routines
Every data processing workflow requires its own scripts or functions to run. In this example, they are included in the `code.dir` (_`r code.dir`_) directory and sourced during the preparation of this document. I have included output below to show these codes being called.
```{r source codes, message=TRUE, echo = TRUE}
# source these functions
code.files = dir(file.path(base.dir,code.dir), pattern = "\\.R$")
for (file in code.files){
source(file = file.path(base.dir,code.dir,file))
print(paste0("Sourcing ", file, "."))
}
```
## Load the data
We now analyse the data from the simple data set. In this case, code has been written to load all of the files in the `data.dir` directory (`r data.dir`). I'm also going to map the three columns in the data files to the variables $x$, $y$, and $z$.^[See https://www.calvin.edu/~rpruim/courses/s341/S17/from-class/MathinRmd.html for more information about including maths in R markdown]
```{r load individual data files, results='asis', echo = FALSE}
# identify the data sets that we have available
data.files = dir(file.path(base.dir, data.dir),
recursive = TRUE,
pattern = "\\.csv$")
# load each file
for (data.file in data.files){
# Read this data set;
df.new <- read.csv(file.path(base.dir, data.dir, data.file),
col.names = c("x","y","z"),
header=TRUE)
# capture information about this data
df.new$source = data.file
# check for the existence of an aggregate data set. If it's not there, create a new one
if (!exists("df.in")){
# theres no data
df.in <- df.new
} else {
# append new data
df.in <- rbind(df.in,
df.new)
}
}
```
## Plot input data
The next step is to plot the input data. Let's plot all of the input data together using the `plotSomething()` function in the `code.dir` directory.
```{r configure graphics, message=FALSE, echo = FALSE}
# configure graphics appearance
#theme_set(theme_Literate(base_size = 8,
# base_family = "sans"))
```
```{r plot-input-data, echo= FALSE, fig.show='hold', fig.width = 6, fig.height = 3, fig.cap="Input Data"}
p <- plotSomething(df=df.in, made.by)
plot(p)
```
I've used the `ggplot2` package to make this figure. This has the advantage that figures can be given a consistent look and feel through ggplot's themes.
For convenience, we'll also save a copy of the figure as a _.png_ file to the `output.dir` (_analysis/_) directory.
```{r save plot of input data, message = FALSE, echo=FALSE}
ggsave(filename = file.path(base.dir,output.dir, "DataAll.png"),
plot = last_plot(),
width = 6,
height = 4,
units = "in",
dpi = 300)
```
## Operate on the data
At this point we can do any number of operations on the data. For sake of demonstration, let's add 2 to all $x$ values.
```{r add two,}
df.all <- df.in
df.all$x <- df.in$x + 2.0
```
## Plot the results
Let's run that `plotSomething()` routine again.
```{r plot-modified-data, echo= FALSE, fig.show='hold', fig.width = 6, fig.height = 3, fig.cap="Modified Data"}
p <- plotSomething(df=df.all, made.by)
plot(p)
```
And, as we can see in Figure \@ref(fig:plot-modified-data), the data have shifted along $x$ by a small amount.
## Write {#implementWrite}
Writing in an R Markdown document is similar to most other types of markdown.
So far we have demonstrate that we can import and manipulate data and plot results. Another important part of a publication is the ability to generate statistics or summary information from data and include that in our text.
To demonstrate that, I can calculate that the maximum value of $y$ in the input data sets was `r max(df.all$y)`. This can be confirmed by checking the input data files. I could also include more complex logic in these statements, for example to say if one statistic is bigger or larger than another.
I can also include equations, and reference them (e.g. Equation \@ref(eq:eq)). Markdown does not directly support equations, so we can be a bit flexible in notation. For simplicity in working with other parts of the workflow, I use Latex `begin{equation}...end{equation}` to indicate an equation.
\begin{equation}
A=\frac{\pi}{27d^2}
(\#eq:eq)
\end{equation}
This also works well with Word-format output.
<!-- see e.g. https://stackoverflow.com/questions/55923290/consistent-math-equation-numbering-in-bookdown-across-pdf-docx-html-output?noredirect=1&lq=1 for comparisons -->
You've already seen citations [@R-base], which refers to bibtex entries in [_main.bib_](main.bib). These first appeared in section \@ref(sec:intro). Footnotes work too.^[See what I did there?]
We sometimes need to include formatted tables in documents. This can be done using the `kable()` function (Table \@ref(tab:dfall)).
```{r dfall, echo = TRUE, fig.cap="The _df.all_ data frame."}
knitr::kable(df.all,
format="pandoc",
caption = "The _df.all_ data frame.")
```
If our goal was just to produce HTML, we could add styling, too. This is detailed in the _kable_ package vignettes.^[In this example I have explicitly stated `_format="pandoc"_` so that this file will work in HTML and PDF.]
## Save{#implementSave}
We now write our processed data to file.
```{r save data, echo=TRUE}
# save the data
save(list = c("base.dir",
"made.by",
"df.all"),
file = file.path(base.dir, output.dir, "Data.RData"),
envir = .GlobalEnv)
```
In R it is also possible to save the whole workspace. We can do that here as well:
```{r save workspace, echo=TRUE}
# save the workspace
save.image(file=file.path(base.dir, output.dir,"workspace.RData"))
```
<!--- ? Is it possible to save packages locally the first time they are called and then pick them up there afterwards, instead of from a repo? --->
## Iterate {#implementIterate}
To re-run the analysis, render _main.rmd_ every so often following the instructions in _HOWTO.md_.
Some IDEs also allow the code chunks to be evaluated separately, which might help when dealing with larger data sets, more complex analysis, or bigger documents.
## Apply formatting {#implementFformat}
Scientific Journals often have their own formatting requirements. These requirements can still be met using markdown. The mechanics of such a process are beyond the scope of this paper and should probably be done as the last step in the publishing process. The reader is encouraged to look at the [_rticles_](https://github.com/rstudio/rticles) package and to use the detailed instructions in section 13 of the R Markdown Guide [@R-Markdown-Guide].
# Does this approach lead to reproducable research?
If you would like to test the reproducibility of this approach, try this:
1. Make sure you have set up the computing environment.
1. Make a copy of the entire working directory. Put that somewhere safe.
1. In the working directory, remove the results:
1. Delete the _`r output.dir`_ directory.
1. Delete the _publications/_ directory and all of its subdirectories. This is where _main.PDF_, _main.html_, and _main.md_ are.
You should now be left with some raw data and a few codes. There's no documentation anymore and no results.
5. Render _main.rmd_. There are instructions on how to do this in the _HOWTO.md_ file included in the root directory of the repository.
You now should have your data folder back and all of the documentation, but you'll notice that the dates on the figures might have changed. This is reproducable research in action.
### R^5
> "need to finish this"
## Is this FAIR?
``FAIR'' data stands for data that are Findable, Accessible, Interoperable, and Reusable.
> "need to finish this"
### Findable
### Accesible
### Interoperable data
Our data were stored as comma-separated values in .csv files. These are not ideal; besides the meaningless column headers, there's no metadata. Instead, it would have been better if the data had been in a relevant, `standardized` format that included metadata in the headers. An even better solution would be a self-describing format that includes metadata for all of the included fields.
### Reusable
## Making the results portable
One way to ensure repeatability and reproducability may be to have a generic ``data processing'' image of a computer system, e.g. as a Docker image, that is started for each new project and used _only_ for that project. This clean system is then used for one data processing task which is managed through a file such as the one you are reading. When the project is complete, the image is simply stored for as long as required. This would also avoid problems associated with changes introduced by package and system updates. A drawback to this approach would be the need to migrate the data every 5 years or so to a new system, which would be required to avoid data being stranded on old software.
> "need to finish this"
<!--- Algorithms could be called directly from third party services or accessed via APIs. This shifts the onus to those third parties to provide the tracking required for auditting, but does preserve intellectual property. --->
# Conclusions
Literate programming allows the creation of a single document that captures all of the process of preparing and analysing data, and creating a publication to describe that data. This is a fundamental requirement of reproducible research.
# Referencing this document {-}
This document has been assigned the Digital Object Identifier [10.5281/zenodo.3497450](http://dx.doi.org/10.5281/zenodo.3497450). Citations in a range of formats can be obtained through Zenodo.
[![DOI](DOIBadge3497450.pdf)](https://doi.org/10.5281/zenodo.3497450)
The source code for this document is available through [github.com/AndyClifton/lit-pro-sci-pub-demo](https://github.com/AndyClifton/lit-pro-sci-pub-demo).
# Acknowledgements {-}
Many thanks to Nikola Vasiljevic at DTU for prompting me to get this done.
# Bibliography {-}