# Rewriting our project
In this chapter, we will use what we’ve learned until now to rewrite our
project. As a reminder, here are the scripts we wrote together:
- save_data.R: [https://is.gd/7PhUjd](https://is.gd/7PhUjd)
- analysis.R: [https://is.gd/X7XXJg](https://is.gd/X7XXJg)
The `analysis.R` file already includes one change: the one
from the chapter on collaborating with GitHub, where Bruno wrote a function to
make the plots for each commune.
If you skipped part one of the book, or for any other reason do not have a
GitHub repository with these two files yet, then now is the time to create one.
Create a repository and name it *housing_lux* or anything you’d like, and put
these two files there. I will assume that you have these files safely versioned,
and will not be telling you systematically when to commit and push. Simply do so
as often as you’d like! You should have a repository with a *master* or *main*
branch containing these two scripts. On your computer, calling `git status` in
Git Bash (on Windows) or in a terminal (for Linux and macOS) should result in
this:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git status
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git status
```
:::
```bash
On branch master
nothing to commit, working tree clean
```
If that’s the case, congrats, we can start working. Start by creating a new branch,
and call it `rmd`:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git checkout -b rmd
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git checkout -b rmd
```
:::
```bash
Switched to a new branch 'rmd'
```
We will now be working on this branch. Simply work as usual, but when pushing, make
sure to push to the `rmd` branch:
::: {.content-hidden when-format="pdf"}
```bash
owner@localhost ➤ git add .
owner@localhost ➤ git commit -m "some changes"
owner@localhost ➤ git push origin rmd
```
:::
::: {.content-visible when-format="pdf"}
```bash
owner@localhost $ git add .
owner@localhost $ git commit -m "some changes"
owner@localhost $ git push origin rmd
```
:::
This will push whatever changes you've made to files to the `rmd` branch. By using two branches like this, you keep the
original `.R` scripts in the main branch, and then will end up with the `.Rmd`
files in the `rmd` branch.
Before moving forward, now is the right moment to discuss why you would
want to convert the scripts into Rmds. There are several reasons. First, as
argued in the chapter on literate programming, a document that mixes prose and
code is easier to read and share than a script. Next, since this Rmd file can
get knitted into any type of document (PDF, Word, etc.), it also makes it
easier to arrive at what interests us, the output. A script is simply a means,
it’s not an end. The end is (in most cases) a document, so we might as well use
literate programming to avoid the cursed loop of changing the script, editing
the document, going back to the script, etc.
But there is yet another benefit; even if the Rmd file is not supposed to get
shared with anyone else, we will, later on, use it as our starting point for the
*Rmd first* method of package development as promoted by Sébastien Rochette, the
author of `{fusen}`. This Rmd first method involves making use of a
*development* Rmd file that contains all the usual steps that we would take to
create a package. This is in contrast with the usual package development
process, in which we would type the required commands to build the package in
the terminal. The functions, tests, and documentation that we want to add to the
package get defined using Rmd files as well. This makes them much easier to read
and also share with a non-technical audience. All these Rmd files can then be
converted (or *inflated* in `{fusen}` jargon) to create a fully working package.
If this sounds complicated or confusing, don’t worry. Trust the process,
push on, and all the pieces of the puzzle will elegantly fit together in a
couple of chapters.
In the following sections I will rewrite the scripts by using functional and
literate programming: if you don’t want to rewrite everything, don’t worry, I
link the final Rmd files at the end of each section. But I would advise that you
follow along by writing everything as it will make absorbing the contents much
simpler.
## An Rmd for cleaning the data
So, let’s start with the `save_data.R` script. Since we are going to use
functional programming and literate programming, we are going to start from an
empty `.Rmd` file. So open an empty `.Rmd` file and start with the following
lines:
````{verbatim}
---
title: "Nominal house prices data in Luxembourg - Data cleaning"
author: "Put your name in here"
date: "`r Sys.Date()`"
---
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
library(janitor)
library(purrr)
library(readxl)
library(rvest)
library(stringr)
```
## Downloading the data
````
We start by writing a header to define the title of the document, the name of
the author and the current date using inline R code (you can also hardcode the
date as a string if you prefer). We then load packages in a chunk with options
`warning=FALSE` and `message=FALSE` which will avoid showing packages’ startup
messages in the knitted document.
Then we start with a new section called `## Downloading the data`. We then add
a paragraph explaining from where and how we are going to download the data:
::: {.content-hidden when-format="pdf"}
````{verbatim}
This data is downloaded from the Luxembourgish [Open Data
Portal](https://data.public.lu/fr/datasets/prix-annonces-des-logements-par-commune/)
(the data set called *Série rétrospective des prix
annoncés des maisons par commune, de 2010 à 2021*),
and the original data is from the "Observatoire de
l'habitat". This data contains prices for houses sold
since 2010 for each Luxembourgish commune.
The function below uses the permanent URL from the Open Data
Portal to access the data, but I have also rehosted the data,
and use my link to download the data (for archival purposes):
````
:::
::: {.content-visible when-format="pdf"}
````{verbatim}
This data is downloaded from the Luxembourgish [Open Data
Portal](https://data.public.lu/fr/datasets/ prix-annonces-des-logements-par-commune/)
(the data set called *Série rétrospective des prix
annoncés des maisons par commune, de 2010 à 2021*),
and the original data is from the "Observatoire de
l'habitat". This data contains prices for houses sold
since 2010 for each Luxembourgish commune.
The function below uses the permanent URL from the Open Data
Portal to access the data, but I have also rehosted the data,
and use my link to download the data (for archival purposes):
````
:::
This is much more detailed than using comments in a script, one of the benefits
of literate programming. Then comes a function to download and get the data.
This function simply wraps the lines from our original script that did the
downloading and the cleaning. As a reminder, here are the lines from the
original script, which I will then rewrite as a function:
```{r, eval = FALSE}
url <- "https://is.gd/1vvBAc"
raw_data <- tempfile(fileext = ".xlsx")
download.file(url, raw_data, method = "auto", mode = "wb")
sheets <- excel_sheets(raw_data)
# Read one sheet and store the sheet name (the year) in a column
read_clean <- function(..., sheet){
  read_excel(..., sheet = sheet) |>
    mutate(year = sheet)
}
raw_data <- map(
  sheets,
  ~read_clean(raw_data,
              skip = 10,
              sheet = .)
) |>
  bind_rows() |>
  clean_names()
raw_data <- raw_data |>
  rename(
    locality = commune,
    n_offers = nombre_doffres,
    average_price_nominal_euros = prix_moyen_annonce_en_courant,
    average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant
  ) |>
  mutate(locality = str_trim(locality)) |>
  select(year, locality, n_offers, starts_with("average"))
```
and here is the same code, but as a function:
````{verbatim}
```{r}
get_raw_data <- function(url = "https://is.gd/1vvBAc"){
  raw_data <- tempfile(fileext = ".xlsx")
  download.file(url,
                raw_data,
                mode = "wb")
  sheets <- excel_sheets(raw_data)
  # Read one sheet and store the sheet name (the year) in a column
  read_clean <- function(..., sheet){
    read_excel(..., sheet = sheet) %>%
      mutate(year = sheet)
  }
  raw_data <- map_dfr(
    sheets,
    ~read_clean(raw_data,
                skip = 10,
                sheet = .)) %>%
    clean_names()
  raw_data %>%
    rename(
      locality = commune,
      n_offers = nombre_doffres,
      average_price_nominal_euros = prix_moyen_annonce_en_courant,
      average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant
    ) %>%
    mutate(locality = str_trim(locality)) %>%
    select(year, locality, n_offers, starts_with("average"))
}
```
````
As you see, it’s almost exactly the same code. So why use a function? Our
function has the advantage that it uses the url of the data as an argument. This
means that we can use it on other datasets (let’s remember that we are here
focusing on prices of houses, but there’s another dataset of prices of
apartments) or use it on an updated version of this dataset (which gets updated
yearly). We can now more easily re-use this function later on (especially once
we’ve turned this Rmd into a package in the next chapter). You can decide to
show the source code of the function or hide it with the chunk option
`include=FALSE` or `echo=FALSE` (the difference between `include` and `echo` is
that `include` hides both the source code chunk and the output of that chunk).
Showing the source code in the output of your Rmd file can be useful if you want
to share it with other developers. The next part of the Rmd file is simply using
the function we just wrote:
````{verbatim}
```{r}
raw_data <- get_raw_data(url = "https://is.gd/1vvBAc")
```
````
We can now continue by explaining what’s wrong with the data and what cleaning
steps need to be taken:
````{verbatim}
We need to clean the data: "Luxembourg" is "Luxembourg-Ville" in 2010 and 2011,
then "Luxembourg". "Pétange" is also spelt inconsistently, and we also need
to convert columns to the right type. We also directly remove rows where the
locality contains information on the "Source":
```{r}
clean_raw_data <- function(raw_data){
  raw_data %>%
    mutate(locality = ifelse(grepl("Luxembourg-Ville", locality),
                             "Luxembourg",
                             locality),
           locality = ifelse(grepl("P.tange", locality),
                             "Pétange",
                             locality)
           ) %>%
    filter(!grepl("Source", locality)) %>%
    mutate(across(starts_with("average"), as.numeric))
}
```
```{r}
flat_data <- clean_raw_data(raw_data)
```
````
The text above the chunk explains what we’re doing and why we’re doing it, and the
chunk itself wraps the cleaning steps in a function (based on what we already wrote). Here again, the advantage of
having this as a function is that it will be easier to run on updated data.
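To see what the cleaning function does, it can help to run it on a handful of rows. The sketch below repeats `clean_raw_data()` from the chunk above and applies it to a toy table; the rows in `toy_raw_data` are invented for illustration and are not taken from the real dataset.

```r
library(dplyr)

# Same function as in the Rmd chunk above
clean_raw_data <- function(raw_data){
  raw_data %>%
    mutate(locality = ifelse(grepl("Luxembourg-Ville", locality),
                             "Luxembourg",
                             locality),
           locality = ifelse(grepl("P.tange", locality),
                             "Pétange",
                             locality)) %>%
    filter(!grepl("Source", locality)) %>%
    mutate(across(starts_with("average"), as.numeric))
}

# Invented rows that mimic the problems described above:
# two spellings of the capital, a misspelt Pétange, and a "Source" row
toy_raw_data <- tibble(
  year = c("2010", "2012", "2012", "2012"),
  locality = c("Luxembourg-Ville", "Luxembourg", "Petange", "Source: made up"),
  n_offers = c(10, 12, 5, NA),
  average_price_nominal_euros = c("500000", "550000", "400000", NA)
)

toy_flat_data <- clean_raw_data(toy_raw_data)
toy_flat_data
```

The "Source" row is dropped, both spellings of the capital end up as "Luxembourg", and the price column is converted from character to numeric.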
We now continue with establishing a list of communes:
````{verbatim}
We now need to make sure that we got all the communes/localities
in there. There were mergers in 2011, 2015 and 2018. So we need
to account for these localities.
We now scrape the list of former Luxembourgish communes from Wikipedia:
```{r}
get_former_communes <- function(
            url = "https://w.wiki/_wFe7",
            min_year = 2009,
            table_position = 3
            ){
  read_html(url) %>%
    html_table() %>%
    pluck(table_position) %>%
    clean_names() %>%
    filter(year_dissolved > min_year)
}
```
```{r}
former_communes <- get_former_communes()
```
We can scrape current communes:
```{r}
get_current_communes <- function(
            url = "https://w.wiki/6nPu",
            table_position = 1
            ){
  read_html(url) %>%
    html_table() %>%
    pluck(table_position) %>%
    clean_names()
}
```
```{r}
current_communes <- get_current_communes()
```
````
This is quite a long chunk, but there is nothing new in here, so I won’t explain
it line by line. What’s important is that the code doing the actual work is all
being wrapped inside functions. I reiterate: this will make reusing, testing and
documenting much easier later on. Using the objects `former_communes` and
`current_communes` we can now build the complete list:
````{verbatim}
Let’s now create a list of all communes:
```{r}
get_test_communes <- function(former_communes, current_communes){
  communes <- unique(c(former_communes$name, current_communes$commune))
  # we need to rename some communes
  # Different spelling of these communes between wikipedia and the data
  communes[which(communes == "Clemency")] <- "Clémency"
  communes[which(communes == "Redange")] <- "Redange-sur-Attert"
  communes[which(communes == "Erpeldange-sur-Sûre")] <- "Erpeldange"
  communes[which(communes == "Luxembourg-City")] <- "Luxembourg"
  communes[which(communes == "Käerjeng")] <- "Kaerjeng"
  communes[which(communes == "Petange")] <- "Pétange"
  communes
}
```
```{r}
former_communes <- get_former_communes()
current_communes <- get_current_communes()
communes <- get_test_communes(former_communes, current_communes)
```
````
Once again, we write a function for this. We need to merge these two lists, and
need to make sure that the communes’ names are spelled the same way in
this list and in the data.
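Since the scraping functions need an internet connection, a quick way to check the merging logic is to feed it two small hand-made tables. This is only a sketch: the function below is a shortened copy of `get_test_communes()` that keeps two of the renames, and the `former` and `current` tables are invented stand-ins for the scraped Wikipedia tables.

```r
# Shortened copy of get_test_communes(), keeping only two renames
get_test_communes <- function(former_communes, current_communes){
  communes <- unique(c(former_communes$name, current_communes$commune))
  communes[which(communes == "Clemency")] <- "Clémency"
  communes[which(communes == "Luxembourg-City")] <- "Luxembourg"
  communes
}

# Invented stand-ins for the two scraped tables
former <- data.frame(name = c("Clemency", "Bascharage"),
                     stringsAsFactors = FALSE)
current <- data.frame(commune = c("Luxembourg-City", "Wiltz", "Bascharage"),
                      stringsAsFactors = FALSE)

communes <- get_test_communes(former, current)
communes
```

The result contains each commune once, with the Wikipedia spellings replaced by the spellings used in the data.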
We now run the actual test:
````{verbatim}
Let’s test to see if all the communes from our dataset are represented.
```{r}
setdiff(flat_data$locality, communes)
```
If the above code doesn’t show any communes, then this means that we are
accounting for every commune.
````
This test is quite simple, and we will see how to create something a bit more
robust and useful later on.
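In the meantime, one possible intermediate step is to make the check fail loudly instead of relying on someone reading the output. The helper `check_localities()` below is hypothetical (it is not part of the project files), and the commune names passed to it are made up for the example.

```r
# Hypothetical helper: stop with an informative message if a locality
# from the data is missing from the list of communes
check_localities <- function(localities, communes){
  unknown <- setdiff(localities, communes)
  if (length(unknown) > 0) {
    stop("Localities missing from the list of communes: ",
         paste(unknown, collapse = ", "))
  }
  invisible(TRUE)
}

# Every locality is accounted for, so this passes silently
ok <- check_localities(c("Luxembourg", "Wiltz"),
                       c("Luxembourg", "Wiltz", "Esch-sur-Alzette"))

# "Atlantis" is not in the list; we catch the error to look at its message
failure <- tryCatch(
  check_localities(c("Luxembourg", "Atlantis"), c("Luxembourg", "Wiltz")),
  error = function(e) conditionMessage(e)
)
failure
```

Called inside a chunk without the `tryCatch()`, a failing check like this would stop the knitting, which is usually what you want when the data and the list of communes disagree.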
Now, let’s extract the national average from the data and create a separate
dataset with the national level data:
````{verbatim}
Let’s keep the national average in another dataset:
```{r}
make_country_level_data <- function(flat_data){
  country_level <- flat_data %>%
    filter(grepl("nationale", locality)) %>%
    select(-n_offers)
  offers_country <- flat_data %>%
    filter(grepl("Total d.offres", locality)) %>%
    select(year, n_offers)
  full_join(country_level, offers_country) %>%
    select(year, locality, n_offers, everything()) %>%
    mutate(locality = "Grand-Duchy of Luxembourg")
}
```
```{r}
country_level_data <- make_country_level_data(flat_data)
```
````
and finally, let’s do the same but for the commune level data:
````{verbatim}
We can finish cleaning the commune data:
```{r}
make_commune_level_data <- function(flat_data){
  flat_data %>%
    filter(!grepl("nationale|offres", locality),
           !is.na(locality))
}
```
```{r}
commune_level_data <- make_commune_level_data(flat_data)
```
````
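To make the split between national and commune level data concrete, here is a sketch that runs both functions on a toy table. The four rows of `toy_flat_data` and their labels are invented so that they match the patterns used in the functions; the only change to the functions themselves is the explicit `by = "year"` in the join, which I added to silence the join message.

```r
library(dplyr)

# Invented rows: one commune, the national average, the national total
# of offers, and an empty row
toy_flat_data <- tibble(
  year = c("2010", "2010", "2010", "2010"),
  locality = c("Bertrange", "Moyenne nationale", "Total d'offres", NA),
  n_offers = c(20, NA, 500, NA),
  average_price_nominal_euros = c(600000, 550000, NA, NA)
)

make_country_level_data <- function(flat_data){
  country_level <- flat_data %>%
    filter(grepl("nationale", locality)) %>%
    select(-n_offers)
  offers_country <- flat_data %>%
    filter(grepl("Total d.offres", locality)) %>%
    select(year, n_offers)
  # by = "year" is my addition, the original relies on the default
  full_join(country_level, offers_country, by = "year") %>%
    select(year, locality, n_offers, everything()) %>%
    mutate(locality = "Grand-Duchy of Luxembourg")
}

make_commune_level_data <- function(flat_data){
  flat_data %>%
    filter(!grepl("nationale|offres", locality),
           !is.na(locality))
}

toy_country <- make_country_level_data(toy_flat_data)
toy_commune <- make_commune_level_data(toy_flat_data)
```

`toy_country` has a single row that combines the national average price with the national number of offers, and `toy_commune` only keeps the row for the commune.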
We can finish with a chunk to save the data to disk:
````{verbatim}
We now save the dataset in a folder for further analysis (keep chunk option to
`eval = FALSE` to avoid running it when knitting):
```{r, eval = FALSE}
write.csv(commune_level_data,
          "datasets/house_prices_commune_level_data.csv",
          row.names = FALSE)
write.csv(country_level_data,
          "datasets/house_prices_country_level_data.csv",
          row.names = FALSE)
```
````
This last chunk is something I like to add to my Rmd files.
I show it in the final document without evaluating it, using the chunk option
`eval = FALSE`. If you would rather not have it appear in the compiled document at
all, you could add `include = FALSE`; keep in mind that `include = FALSE` on its own
still runs the chunk, so combine it with `eval = FALSE` if the code should not run. The first time you compile
this document, you could change the option to `eval = TRUE`, so that the data gets
written to disk, and then change it back to `eval = FALSE` to avoid overwriting the data
on subsequent knittings. This is up to you, and it also depends on who the
audience of the knitted output is (do they want to see this chunk at all?).
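Another option, if you don’t want to toggle `eval` by hand, is to only write the file when it doesn’t exist yet. This is a sketch of that idea and not what the finalised Rmd does: `save_if_missing()` is a hypothetical helper, and the example writes a toy data frame to a temporary folder instead of `datasets/`.

```r
# Hypothetical helper: write the csv only if the file is not there yet
save_if_missing <- function(data, path){
  if (file.exists(path)) {
    message("Skipping, file already exists: ", path)
    return(invisible(FALSE))
  }
  # Create the target folder if needed
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  write.csv(data, path, row.names = FALSE)
  invisible(TRUE)
}

toy_data <- data.frame(year = "2010", locality = "Luxembourg")
target <- file.path(tempfile("datasets_"), "toy_house_prices.csv")

first_run <- save_if_missing(toy_data, target)   # writes the file
second_run <- save_if_missing(toy_data, target)  # leaves it untouched
```

The trade-off is that an outdated file never gets refreshed automatically, so you would have to delete it yourself when the source data changes.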
Ok, and that’s it. You can take a look at the finalised file
[here](https://raw.githubusercontent.com/b-rodrigues/rap4all/master/rmds/save_data.Rmd)^[https://is.gd/eBbcsR].
You can now remove the `save_data.R` script, as you have successfully ported the
code over to an RMarkdown file. If you have not done it yet, you can commit
these changes and push.
Let’s now do the same thing for the analysis script.
## An Rmd for analysing the data
We will follow the same steps as before to convert the analysis script into an
analysis RMarkdown file. Instead of showing the whole file here, I will show you
two important points.
The first point is removing redundancy. In the original script, we had the
following lines:
```{r, eval = F}
# Let’s compute the Laspeyres index for each commune:
commune_level_data <- commune_level_data %>%
  group_by(locality) %>%
  mutate(p0 = ifelse(year == "2010",
                     average_price_nominal_euros,
                     NA)) %>%
  fill(p0, .direction = "down") %>%
  mutate(p0_m2 = ifelse(year == "2010",
                        average_price_m2_nominal_euros,
                        NA)) %>%
  fill(p0_m2, .direction = "down") %>%
  ungroup() %>%
  mutate(
    pl = average_price_nominal_euros/p0*100,
    pl_m2 = average_price_m2_nominal_euros/p0_m2 * 100)
# Let’s also compute it for the whole country:
country_level_data <- country_level_data %>%
  mutate(p0 = ifelse(year == "2010",
                     average_price_nominal_euros,
                     NA)) %>%
  fill(p0, .direction = "down") %>%
  mutate(p0_m2 = ifelse(year == "2010",
                        average_price_m2_nominal_euros,
                        NA)) %>%
  fill(p0_m2, .direction = "down") %>%
  mutate(
    pl = average_price_nominal_euros/p0*100,
    pl_m2 = average_price_m2_nominal_euros/p0_m2 * 100)
```
As you can see, this is almost exactly the same code twice. The only difference
between the two code snippets is that we need to group by commune when
computing the Laspeyres index for the communes (remember, this index will make
it easy to make comparisons). Instead of repeating 99% of the lines, we should
create a function that will group the data if the data is the commune level
data, and not group the data if it’s the national data. Here is this function:
```{r, eval = F}
get_laspeyeres <- function(dataset, start_year = "2010"){
  which_dataset <- deparse(substitute(dataset))
  group_var <- if(grepl("commune", which_dataset)){
                 quo(locality)
               } else {
                 NULL
               }
  dataset %>%
    group_by(!!group_var) %>%
    mutate(p0 = ifelse(year == start_year,
                       average_price_nominal_euros,
                       NA)) %>%
    fill(p0, .direction = "down") %>%
    mutate(p0_m2 = ifelse(year == start_year,
                          average_price_m2_nominal_euros,
                          NA)) %>%
    fill(p0_m2, .direction = "down") %>%
    ungroup() %>%
    mutate(
      pl = average_price_nominal_euros/p0*100,
      pl_m2 = average_price_m2_nominal_euros/p0_m2*100)
}
```
So, the first step is naming the function. We’ll call it `get_laspeyeres()`, and
it’ll be a function of two arguments. The first is the data (commune or national
level data) and the second is the starting year of the data. This second
argument has a default value of "2010". This is the year the data starts, and so
this is the year the Laspeyres index will have a value of 100.
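The arithmetic behind the index is simple enough to check by hand. The sketch below uses base R and three invented prices for a single locality: `p0` is the price in the start year, and the index `pl` is each year’s price divided by `p0`, times 100.

```r
# Invented prices for one locality
prices <- data.frame(
  year = c("2010", "2011", "2012"),
  average_price_nominal_euros = c(400000, 440000, 500000),
  stringsAsFactors = FALSE
)

start_year <- "2010"

# Price in the start year, used as the base of the index
p0 <- prices$average_price_nominal_euros[prices$year == start_year]

# Index: 100 in the start year, then the price relative to p0
prices$pl <- prices$average_price_nominal_euros / p0 * 100
prices
```

Here `pl` is 100 in 2010, 110 in 2011 and 125 in 2012 (up to floating point rounding), so prices in 2012 are 25% above their 2010 level.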
The following lines are probably the most complicated:
```{r, eval = FALSE}
which_dataset <- deparse(substitute(dataset))
group_var <- if(grepl("commune", which_dataset)){
               quo(locality)
             } else {
               NULL
             }
```
The first line captures the expression that the caller supplied for the
`dataset` argument without evaluating it (that’s what `substitute()` does), for
example the name `commune_level_data`, and then converts this
name into a string (using `deparse()`). So when the user provides
`commune_level_data`, `which_dataset` will be defined as equal to
`"commune_level_data"`. We then use this string to detect whether the data needs
to be grouped or not. So if we detect the word "commune" in the `which_dataset`
variable, we set the grouping variable to `locality`, if not to `NULL`. Note that
this detection relies on the name of the object passed to the function, so the
same data stored under a name that doesn’t contain "commune" would not be grouped. But you
might have the following questions: why is `locality` given as an input to
`quo()`, and what is `quo()`?
A simple explanation: `locality` is a variable in the `commune_level_data` dataset.
If we don’t *quote* it using `quo()`, our function will look for a variable
called `locality` in the body of the function, but since there is no variable
defined that is called `locality` in there, the function will look for this
variable in the global environment. But this is not a variable defined in the
global environment either, it is a column in our dataset. So we need a way to
tell this to our function: *don’t worry about evaluating this just yet, I’ll
tell you when it’s time*.
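If you want to see this delayed evaluation in isolation, here is a small sketch using `{rlang}` directly (the package that provides `quo()`, which `{dplyr}` re-exports). The `prices` data frame is invented, and `eval_tidy()` is used here only to illustrate, roughly, the kind of lookup that `{dplyr}` verbs do for you.

```r
library(rlang)

# Invented data: `locality` only exists as a column of this data frame
prices <- data.frame(locality = c("Esch", "Wiltz"),
                     pl = c(100, 110),
                     stringsAsFactors = FALSE)

# quo() captures the expression `locality` without evaluating it,
# so this works even though no object called `locality` exists
delayed <- quo(locality)

# When it's time, evaluate the quoted expression with the columns
# of the data frame in scope
result <- eval_tidy(delayed, data = prices)
result
```

`result` contains the values of the `locality` column, which shows that the lookup only happened once we supplied the data.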
So by using `quo()`, we can delay evaluation. So how can we tell the function
that it’s time to evaluate `locality`? This is where we need `!!` (pronounced
*bang-bang*). You’ll see that `!!` gets used on `group_var` inside `group_by()`:
```{r, eval = FALSE}
group_by(!!group_var)
```
So if we are calling the function on `commune_level_data`, then `group_var`
holds the quoted `locality`, if not it’s `NULL`. `!!group_var` means that now it’s
time to evaluate `group_var` (or rather, `locality`). Because `!!group_var` gets
replaced by the quoted `locality`, and because `group_by()` is a `{dplyr}` function
that knows how to deal with quoted variables, `locality` gets looked up among
the columns of the data frame. If it’s `NULL` nothing happens, so the data
doesn’t get grouped.
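To convince yourself that the grouping really depends on the name of the object, you can isolate that part of `get_laspeyeres()`. In the sketch below, `count_groups()` is a hypothetical helper that reuses the same detection logic and simply reports the number of groups; the two toy datasets are invented.

```r
library(dplyr)

# Hypothetical helper reusing the detection logic of get_laspeyeres()
count_groups <- function(dataset){
  which_dataset <- deparse(substitute(dataset))
  group_var <- if (grepl("commune", which_dataset)) {
    quo(locality)
  } else {
    NULL
  }
  dataset %>%
    group_by(!!group_var) %>%
    n_groups()
}

# Invented toy datasets
commune_level_data <- tibble(locality = c("A", "A", "B"),
                             year = c("2010", "2011", "2010"))
country_level_data <- tibble(locality = "Grand-Duchy of Luxembourg",
                             year = c("2010", "2011"))

n_commune <- count_groups(commune_level_data)  # grouped by locality
n_country <- count_groups(country_level_data)  # not grouped

# Same commune data under another name: "commune" is not detected
renamed <- commune_level_data
n_renamed <- count_groups(renamed)
```

The commune data gets one group per locality, while the country data and the renamed copy are treated as a single group. This is the price of detecting the dataset by its name: it is convenient for our two datasets, but it means the function should be called on objects whose names follow this convention.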
This is a big topic unto itself, so if you want to know more you can start by
reading the famous `{dplyr}` vignette called *Programming with dplyr*
[here](https://dplyr.tidyverse.org/articles/programming.html)^[https://dplyr.tidyverse.org/articles/programming.html].
In case you use `{dplyr}` a lot, I recommend you study this vignette because
mastering *tidy evaluation* (the name of this framework) is key to becoming
comfortable with programming using `{dplyr}` (and other *tidyverse* packages).
You can also read the chapter I wrote on this in my other [free
ebook](http://modern-rstats.eu/defining-your-own-functions.html#functions-that-take-columns-of-data-as-arguments)^[https://is.gd/f11De1].
The next lines of the script that we need to port over to the Rmd are quite
standard: we write code to create some plots (which were already refactored into
a function in the chapter on collaborating on GitHub). But remember, we want to
have an Rmd file that can be compiled into a document that can be read by
humans. To keep the document clear, I suggest that we create one
subsection per commune that we plot. Thankfully, we have learned all about child
documents in the literate programming chapter, and this is what we will be using
to avoid having to repeat ourselves. The first part is simply the function that
we’ve already written:
````{verbatim}
```{r}
make_plot <- function(commune){
  commune_data <- commune_level_data %>%
    filter(locality == commune)
  data_to_plot <- bind_rows(
    country_level_data,
    commune_data
  )
  ggplot(data_to_plot) +
    geom_line(aes(y = pl_m2,
                  x = year,
                  group = locality,
                  colour = locality))
}
```
````
Now comes the interesting part:
````{verbatim}
```{r, results = "asis"}
res <- lapply(communes, function(x){
  knitr::knit_child(text = c(
    '\n',
    '## Plot for commune: `r x`',
    '\n',
    '```{r, echo = FALSE}',
    'print(make_plot(x))',
    '```'
     ),
     envir = environment(),
     quiet = TRUE)
})
cat(unlist(res), sep = "\n")
```
````
I won’t explain this now in great detail, since that was already done in the
chapter on literate programming. Before continuing, really make sure that you
understand what is going on here. Take a look at the finalised file
[here](https://raw.githubusercontent.com/b-rodrigues/rap4all/master/rmds/analyse_data.Rmd)^[https://is.gd/L2GICG].
You’ll notice that at the start of the RMarkdown file, I also load some packages
and the data saved by the `save_data.Rmd` RMarkdown file.
You can see what the outputs look like by browsing to the links below:
- [save_data.html, compiled from the save_data.Rmd source](https://is.gd/Z15Ycy)^[https://is.gd/Z15Ycy]
- [analyse_data.html, compiled from the analyse_data.Rmd source](https://is.gd/D1o4XJ)^[https://is.gd/D1o4XJ]
Of course, you could compile the files into Word documents or PDF, depending on
your needs, and you could of course write many more details than me. I wanted to
keep it short; the point of this chapter was to show you how to use literate
programming and not to write a very detailed analysis.
## Conclusion
This chapter was short, but quite dense, especially when we converted the
analysis script to an Rmd, because we’ve had to use two advanced concepts, *tidy
evaluation* and RMarkdown child documents. *Tidy* evaluation is not a topic that
I wanted to discuss in this book, because it doesn’t have anything to do with
the main topic at hand. However, part of building a robust, reproducible
pipeline is to avoid repetition. In this sense, programming with `{dplyr}` and
tidy evaluation are quite important. As suggested before, take a look at the
linked vignette above, and then the chapter from my other free ebook. This
should help get you started.
The end of this chapter marks an important step: many analyses stop here, and
this can be due to a variety of reasons. Maybe there’s no time left to go
further, and after all, you’ve got the results you wanted. Maybe this analysis
is useful, but you don’t necessarily need it to be reproducible in 5, 10 years,
so all you want is to make sure that you can at least rerun it in some months or
only a couple of years later (but be careful with this assessment, sometimes an
analysis that wasn’t supposed to be reproducible for too long turns out to need
to be reproducible for way longer than expected...).
Because I want this book to be a pragmatic guide, I will now talk about the
least amount of effort needed to make your current analysis reproducible:
freezing package versions, which I will show you in the next chapter.