-
Notifications
You must be signed in to change notification settings - Fork 72
/
cm105.Rmd
388 lines (276 loc) · 15.7 KB
/
cm105.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
# (5) Automate tasks and pipelines I
```{r include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)
```
**100% complete**
## Today's Agenda
- Announcements: (10 mins)
- COVID-19 precautions
- Assignment 1 & 2 solutions will be posted tonight
- Assignment 3 is released (Peer review)
- Milestone 3 is released
- In the (currently very unlikely) event of UBC's closure
- Part 1: Odds and ends (5 mins)
- Loading/Saving model objets in R
- Running knitr from command line
- Part 2: Run an analysis from start to finish (15 mins)
- Clone the demo_project repo
- Run each script individually
- Delete files and run each script
- Part 3: Building a data analysis pipeline using MAKE (45 mins)
- `Make` targets
- `Make` clean
- `Make` all
## Part 1: Odds and ends
### Knitting files from the command line as a script
You can knit an Rmd file by clicking th "Knit" button in RStudio or by running the following command in an R console:
```
rmarkdown::render('docs/finalreport.Rmd',
c("html_document", "pdf_document"))
```
Notice that the first argument is the file that should be knit, and the second argument is a vector of the output types: `html` and `pdf`.
Though it seems silly to just have a single line of code as a script, there are many more options available when knitting an Rmd file.
You'll see more of these options in Thursday's lecture (cm107)
### Saving and loading R objects
The `.rds` and `.rda` file formats are special files that permit you to save single (rds) or multiple R objects (rda).
```
saveRDS(object, file = "filename.rds")
```
This file can be the output of a script (for instance, your analysis script).
To read the object back into R, simply use `readRDS`:
```
readRDS(file = "filename.rds")
```
To save multiple R objects, specify them with commas:
```
save(data1, data2, file = "data1and2.rda")
```
and then to load it again,
```
load("data1and2.rda")
```
The above material was from [this documentation](http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata).
## Part 2: Run an analysis from start to finish
Building a data analysis pipeline using the UNIX shell
To illustrate how to `Make` a data analysis pipeline, we are going to work through an example together.
First, you will have to [clone the demo project locally here](https://github.com/STAT547-UBC-2019-20/demo_project).
It contains several files, but the most relevant are the ones in the `docs` directory and the `src` (scripts) directory:
- load.R: Loads the data file into a CSV.
- clean.R: Cleans the data.
- eda.R: Performs exploratory data analysis.
- knit.R: Knits the .Rmd file into pdf and html files.
- finalreport.Rmd : template of the final report (technically Rmd files are also scipt)
All those scripts are linked, they use each others' outputs as their inputs.
can say that those scripts implement a data analysis pipeline.
1. Load the data from a URL and save it to a location on your computer
2. Save the dataset as a simple csv for later loading
3. Clean and re-save the data
4. Create the EDA and generate the corresponding plots
5. Do the modelling (to be done in milestone3!)
6. Generate the final report (to be done in milestone3!)
## Part 3: Building a data analysis pipeling using MAKE
### Test to ensure you have make installed
- Open a new Terminal window
- Type in `make` and press enter
- If you have make installed, it should show something like: `make: *** No targets specified and no makefile found. Stop.`
- If you do not have it installed, it will say something like "Command not found"
**Notes for Windows Users**
- If you do not have MAKE installed (should not be a problem on Windows or macOS), DO NOT spend class time trying to install it, it's better you try to work through the class and understand how it works.
We can try and help you during office hours to get it installed.
These instructions do not apply to Mac or Linux machines!
1. Go to the [Make for Windows](http://gnuwin32.sourceforge.net/packages/make.htm) web site.
1. [Download](http://gnuwin32.sourceforge.net/downlinks/make.php) the Setup program.
1. Install the file you just downloaded and copy to your clipboard the directory in which it is being installed.
1. FYI: The default directory is C:\Program Files (x86)\GnuWin32\
1. You now have make installed, but you need to tell Windows where to find the program. 1. This is called updating your PATH. You will want to update the PATH to include the bin directory of the newly installed program.
Here is how to update your path:
If you installed Make for Windows (as opposed to the make that comes with Git for Windows), you still need to update your PATH.
These are the steps on Windows 7 (we don’t have such a write-up yet for Windows 8 – feel free to send one!):
- Click on the Windows logo.
- Right click on Computer.
- Select Properties.
- Select Advanced System Settings.
- Select Environment variables.
- Select the line that has the PATH variable. You may have to scroll down to find it.
- Select Edit.
- Go to the end of the line and add a semicolon ;, followed by the path where the program was installed, followed by \bin.
- Typical example of what one might add: ;C:\Program Files (x86)\GnuWin32\bin
- Click Okay and close all the windows that you opened.
- Quit RStudio and open it again.
- You should now be able to use make from RStudio and the command line.
### Introduction to MAKE
GNU `Make` is a tool that controls the execution of files.
In order to explain to `Make` how to run our data analysis pipeline, we need the create a new file called *Makefile* (no file extension).
Each Makefile is made of blocks of code, called rules.
A rule follows the following structure :
```{structure_rule}
file_to_create.png : data_it_depends_on.csv script_it_depends_on.R
action
```
- `file_to_create.png` is a "target": it is the file that we want to output/create.
If there are multiple targets, you just have to seperate the different target names by a space:
```{structure_rule}
file_to_create.png file2.png : data_it_depends_on.csv script_it_depends_on.R
action
```
- `data_it_depends_on.csv` and `script_it_depends_on.R` are the dependencies : they are the scripts/files needed to build or update the target.
There can be zero or more dependencies.
*Note* : The target and the dependencies are separated by `:`.
- `action` is an action : it is the command to run in order to build/update the target using the dependencies.
You **MUST** use a TAB to indent an action.
Let's create our first Makefile now!
### Create a Makefile
Open RStudio. Select File > New File > Text File. Save this file with the name `Makefile` in your project directory.
This is very important as by default, `Make` looks for the file called `Makefile`.
Now, we are going to add several rules to this `Makefile`.
Let's add our first rule to this Makefile.
Try to add the following code to your file :
```{makefile_1}
# author: Your name
# date: The date
# download data
autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"
```
- `download_data` is the "name" of the target you want to create (this should be closely related to the action that gets triggered).
- `src/load.R` is the script that we want to execute.
- `Rscript src/load.R --data_filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"` is the action that we want to do : it is the command to run in order to build the target using the dependencies.
#### Don't forget to use a TAB to indent the action!!
It's worth it for us to pause here and make sure that the usage of TAB is emphasized.
By default, RStudio replaces the "TAB" character with two spaces.
Once you’ve saved the file with the name `Makefile`, RStudio *should* indent with tabs instead of spaces.
I recommend you display whitespace in order to visually confirm this: RStudio > Preferences > Code > Display > Display whitespace characters.
#### Return to the Makefile
Now, make sure that you save this file in the root directory of the repository.
Once this is done, let's try to run Make on the target that we created: "download_data".
To do so, you have to run the following command in your Terminal :
```{ru_make}
make autism.csv
```
`Make` then runs the script in the action you created.
Note that by default, the **first** target is run if you just type in `make` in a directory with a `Makefile`.
If you see this kind of error:
```{makefile_error}
Makefile:3: *** missing separator. Stop.
```
check the syntax, make sure the colon is in the right place, script names and directories are correct, and finally check that you used the TAB key to indent the action.
If you try to run `Make` again, without changing anything, `Make` will tell you that everything is up to date.
This is one of the reasons why `Make` is more efficient than running individual scripts one by one : before running a script, `Make` checks the 'last modification time' of both the target and its dependencies.
If any of the dependencies has been updated after the target was generated for the last time, `Make` will run the action.
In other words, if the target already exists and the dependencies have not been modified since the creation of the target, `Make` will not run the script again!
This is why `Make` is faster than running multiple scripts over and over again.
Another advantage of `Make` is that it allows us to remove all the intermediate data files that have been created by our data analysis pipeline, with only one command.
This can be useful if we want to run our analysis from scratch.
To do so, we have to create a new target called `clean`,.
As the actions for this target, it should deletes all files generated from our script.
In this case, the `clean` target should not have any dependencies.
Here is how to create that target :
```{makefile_clean}
# author: Your name
# date: The date
# download data
autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"
clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html
```
Note that you can even have multiple actions if you put one one each line!
The `-f` flag in the `rm` command ignores nonexistent files and arguments, and does not prompt you to confirm (see the [manual](http://man7.org/linux/man-pages/man1/rm.1.html) for more flags)
Now, try to delete the `data/autism.csv` file as well as all files (* character) in the images folder using `Make - be careful using this!.
```{command_run_make_clean}
make clean
```
If you take a look at your `data/autism.csv` file, you will see it is gone, and all files in the `images` folder will be gone.
### Your turn to finish the Makefile for the demo_project
Try to complete the `Makefile` that we created in order to execute the whole data analysis pipeline.
```{answer_makefile}
# author: Your name
# date: The date
# download data
data/autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"
# clean data
## YOUR SOLUTION HERE
data/autism_cleaned.csv : src/clean.R data/autism.csv
<TAB>Rscript src/clean.R --filepath_raw="data/autism.csv" --filepath_cleaned="data/autism_cleaned.csv"
# EDA
## YOUR SOLUTION HERE
images/barplot.png images/correlation.png images/propbarplot.png : src/eda.R data/autism_cleaned.csv
<TAB>Rscript src/eda.R --filepath_cleaned="data/autism_cleaned.csv"
# Knit report
## YOUR SOLUTION HERE
docs/finalreport.html docs/finalreport.pdf : images/barplot.png images/correlation.png images/propbarplot.png docs/finalreport.Rmd data/autism_cleaned.csv src/knit.R docs/asd_refs.bib
<TAB>Rscript src/knit.R --finalreport="docs/finalreport.Rmd"
clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html
```
At this point, the major part of the `Makefile` is done!
There is just one last thing we have to do.
Now that we have several targets, if we want to run the `Makefile`, we have to specify to `Make` which target we are interrested in.
For instance, if I just want to update/create the figure `images/correlation.png`, I would run in my console:
```{command_run_make_1}
$ make images/correlation.png
```
But several times, the final output of you analysis is made of several files/figures, which do not always depend on each other... so we would have to call `make` on each one of those files/figures.
We would have to run multiple `make` command to make sure we capture everything and all the dependencies.
This is prone to errors.
This is why we are going to use another target `all`.
`all` is going to be a target, and its dependencies will be the final outputs of our data analysis pipeline. There will be no action associated to this target.
If we come back to our example, the `all` target would be the following :
```{makefile_all}
all: docs/finalreport.html docs/finalreport.pdf
```
We can see that `all` has only one dependency because all the other plots/files are already dependencies of previous targets.
### "Phony" targets
The .PHONY line is where you declare which targets are "phony" i.e. are not actual files to be made in the literal sense.
Both `all` and `clean` are phony targers.
It’s a good idea to explicitly tell `make` which targets are phony, instead of letting it try to deduce this.
`make` can get confused if you create a file that has the same name as a phony target.
If for example you create a directory named `clean` to hold your clean data and run `make clean`, then `make` will report `clean` is up to date, because a directory with that name already exists.
To take care of this, we just need to tell our `Makefile` (at the top of the file) which of the targets are "phony":
```{phony_targets}
.PHONY: all clean
```
Then, the final `Makefile` is :
```{makefile_final}
# author: Your name
# date: The date
.PHONY: all clean
## YOUR SOLUTION HERE FOR THE `all` target
all: docs/finalreport.html docs/finalreport.pdf
# download data
data/autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"
# clean data
## YOUR SOLUTION HERE
data/autism_cleaned.csv : src/clean.R data/autism.csv
<TAB>Rscript src/clean.R --filepath_raw="data/autism.csv" --filepath_cleaned="data/autism_cleaned.csv"
# EDA
## YOUR SOLUTION HERE
images/barplot.png images/correlation.png images/propbarplot.png : src/eda.R data/autism_cleaned.csv
<TAB>Rscript src/eda.R --filepath_cleaned="data/autism_cleaned.csv"
# Knit report
## YOUR SOLUTION HERE
docs/finalreport.html docs/finalreport.pdf : images/barplot.png images/correlation.png images/propbarplot.png docs/finalreport.Rmd data/autism_cleaned.csv src/knit.R docs/asd_refs.bib
<TAB>Rscript src/knit.R --finalreport="docs/finalreport.Rmd"
clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html
```
Now, you just have to run `make all` on your Terminal, and it will run the whole data analysis pipeline from beginning to end!
And if you want to clear all the outputs of your different scripts, you just have to run `make clear`.
This is it! Now you know how to build an efficient data analysis pipeline from beginning to end!
## Additional Resources
- [Automating Pipelines](https://stat545.com/automating-pipeline.html) chapter by Jenny Bryan on stat545.com; [associated slides](https://github.com/STAT545-UBC/STAT545-UBC-original-website/blob/master/automation01_slides/slides.md)
- Working demo of a Makefile in the [demo_project](http://github.com/STAT547-UBC-2019-20/demo_project)