-
Notifications
You must be signed in to change notification settings - Fork 1
/
06_reproducible-research-with-git.qmd
486 lines (336 loc) · 19.5 KB
/
06_reproducible-research-with-git.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
---
title: "Reproducible Research with Git and GitHub"
abstract: Git and GitHub are powerful software tools used to control different versions of a codebase, track changes, and collaborate with other programmers. This section introduces both tools.
format:
html:
toc: true
code-line-numbers: true
---
![Hansel and Gretel by Arthur Rackham](images/Hansel-and-gretel-rackham.jpg)
```{r}
#| echo: false
exercise_number <- 1
```
```{r}
#| label: quarto-setup
#| echo: false
#| message: false
#| warning: false
knitr::opts_chunk$set(fig.align = "center")
library(tidyverse)
library(gt)
library(knitr)
library(RXKCD)
source("src/motivation.R")
```
```{r}
#| label: tbl-roadmap
#| tbl-cap: "Opinionated Analysis Development"
#| echo: false
motivation |>
filter(!is.na(Section), Section == "Version Control") |>
select(-`Analysis Feature`) |>
arrange(Section) |>
gt() |>
tab_header(
title = "Opinionated Analysis Development"
) |>
tab_footnote(
footnote = "Added by Aaron R. Williams",
locations = cells_column_labels(columns = c(Tool, Section))
) |>
tab_source_note(
source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
)
```
::: {.callout-note}
`<>` are used throughout this chapter to indicate blanks that need to be filled in. Don't actually submit `<>`. Instead, replace them with the desired text.
:::
## Command Line
The command line (also known as shell or console) is a way of controlling computers without using a graphical user interface (i.e. pointing-and-clicking). The command line is useful because pointing-and-clicking is tough to reproduce or scale and because lots of useful software is only available through the command line. Furthermore, cloud computing often requires use of the command line.
There are different ways to use the command line.
Macs use the Terminal (@fig-terminal). Open Terminal like any other program on Mac.
![Mac Terminal](images/terminal.png){#fig-terminal width="400" fig-align="center" width=70%}
Git Bash, which is installed with Git, works well on Windows. If you have Git Bash, you should be able to right-click in a desired directory and select "Git Bash Here" to access Git Bash on Windows.
RStudio contains a terminal in the tab adjacent to the console (@fig-terminal-rstudio). This will allow us to work at the common line with a common experience on Mac-, Windows-, and Linux-based computers.
![RStudio Terminal](images/terminal-rstudio.png){#fig-terminal-rstudio width="400" fig-align="center" width=70%}
### Bash
Bash is a shell program and command language that allows us to control our computer at the command line. Fortunately, we only need to know a little Bash for version control with Git.
- `pwd` - print working directory - prints the file path to the current location in the
- `ls` - list - lists files and folders in the current working directory.
- `cd` - change directory - move the current working directory. Specify the relative path to move down in a directory. Use `cd ..` to move up a directory.
- `mkdir` - make directory - creates a directory (folder) in the current working directory.
- `touch` - creates a text file with the provided name.
- `mv` - move - moves a file from one location to the other.
- `cat` - concatenate - concatenate and print a file.
### Useful tips
- Tab completion can save a ton of typing. Hitting tab twice shows all of the available options that can complete from the currently typed text.
- Hit the up arrow to cycle through previously submitted commands.
- Use `man <command name>` to pull up help documentation. Hit `q` to exit.
- Typing `..` refers to the directory above the working directory. Writing `cd ..` changes to the directory above the working directory.
- Typing just `cd` changes to the home directory.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
1. Navigate to the `example-project` directory using `cd` in the RStudio terminal.
2. Submit `pwd` to confirm you are in the correct directory.
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
Use the following code to create a new text document called `haiku.txt` in the working directory.
``` bash
pwd
ls
touch haiku.txt
ls
```
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
Use following to add the haiku "The Old Pond" by Matsuo Basho to `haiku.txt`.
``` bash
cat haiku.txt
echo "An old silent pond" >> haiku.txt
cat haiku.txt
echo "A frog jumps into the pond-" >> haiku.txt
echo "Splash! Silence again." >> haiku.txt
echo "~Matsuo Basho" >> haiku.txt
cat haiku.txt
```
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
Use the following code to move `haiku.txt` to a subdirectory called `poems/`.
``` bash
ls
mkdir poems
ls
mv haiku.txt poems/haiku.txt
ls
cat poems/haiku.txt
```
:::
### Programs
We can run programs from the command line. Commands from programs always start with the name of the program. Running git commands intuitively start with `git`. For example:
``` bash
git init
git status
```
## Why version control?
::: {.callout-tip}
## Version Control
Version control is a system for managing and recording changes to files over time.
:::
Version control is essential to managing code and analyses. Good version control can:
- Create a permanent record of changes to code
- Easily undo mistakes by switching between iterations of code
- Allow multiple paths of development while protecting working versions of code
- Encourage communication between collaborators
- Facilitate multiple code reviews
- Be used for external communication
## Why distributed version control?
::: {.callout-tip}
## Centralized version control
Centralized version control stores all files and the log of those files in one centralized location.
:::
::: {.callout-tip}
## Distributed version control
Distributed version control stores files and logs in one or many locations and has tools for combining the different versions of files and logs.
:::
Centralized version control systems like Google Drive or Box are good for sharing a Microsoft Word document, but they are terrible for collaborating on code.
Distributed version control allows for the simultaneous editing and running of code. It also allows for code development without sacrificing a working version of the code.
::: {.callout-note}
Git and GitHub are difficult to motivate a priori but the value is obvious after adopting the tools. We've done our best to motivate the tools. If you are unconvinced, we ask that you just trust us on this one.
:::
## Git vs. GitHub
::: {.callout-tip}
## Git
Git is a distributed version-control system for tracking changes in code. Git is free, open-source software and can be used locally without an internet connection. It's like a turbo-charged version of Microsoft Word's track changes for code.
:::
::: {.callout-tip}
## GitHub
[GitHub](https://github.com/), which is owned by Microsoft, is an online hosting service for version control using Git. It also contains useful tools for collaboration and project management. It's like a turbo-charged version of Google Drive or Box for sharing repositories created using Git.
:::
At first, it's easy to mix up Git and GitHub. Just try to remember that they are separate tools that complement each other well.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
1. If you don't already have an account, sign up for [GitHub](https://github.com/).
:::
## SSH Keys for Authentication
GitHub started requiring token-based or SSH-based authentication in [2021](https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/). We will focus on creating SSH keys for authentication. For instructions on creating a personal access token for authentication, see @sec-ap-a below.
We will follow the [instructions for setting up SSH keys](https://happygitwithr.com/ssh-keys.html#option-2-set-up-from-the-shell) using the console, or terminal window, from Jenny Bryan's fantastic *Happy Git with R*.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
1. Follow [the instructions](https://happygitwithr.com/ssh-keys.html#option-2-set-up-from-the-shell) for setting up SSH keys using the console. We recommend using the default key location and key name. You can choose whether or not to add a password for the key. Note that if you choose to add a password, you will need to enter that password every time you perform operations with GitHub - so make sure you'll be able to remember it!
2. When you get to the section of the instructions to provide the public key to GitHub, we recommend obtaining the public key as follows:
- In a terminal window, run `cat ~/.ssh/id_ed25519.pub`
- Highlight the public key that is printed to the console and copy the text.
- Follow the instructions from Jenny Bryan at 10.5.3 to add the copied key to GitHub.
:::
## Git + GitHub Workflow
::: {.callout-important}
Git does not work well with shared drives like Box, Google Drive, and SharePoint. Fortunately, those tools aren't necessary for a Git + GitHub workflow.
:::
::: {.callout-tip}
## Repository
A repository is a collection of files, often a directory, where files are organized and logged by Git.
:::
Git and GitHub organize projects into repositories. Typically, a "repo" will correspond with the place where you started a .Rproj. When working with Git and GitHub, your files will exist in two places: **locally** on your computer and **remotely** on GitHub.
When creating a new repository, you can use either of the following alternatives:
- Initialize the repo locally on your computer and later add the repo to GitHub
- Initialize the repo remotely on GitHub and then copy (clone) the repo to your computer.
To create a repository (only needs to be done once per project):
`git init` initializes a local Git repository.
OR
`git clone <link>` copies a remote repository from GitHub to the location of the working directory on your computer.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
We're going to create a repo remotely on GitHub and clone it to our computer.
1. Go [here](https://github.com/awunderground/2024-08-03_reproducible-research-bootcamp) to see the repository for today's course materials.
2. Click the green "Code" button and copy the URL under SSH.
3. At the command line, navigate to where you want to clone the course notes.
4. `git clone https://github.com/awunderground/2024-08-03_reproducible-research-bootcamp`
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
We're going to create a repo remotely on GitHub and clone it to our computer.
1. Go to GitHub.
2. Click the big green "New" button.
3. Call the repo `my-first-repo`. Select a public repository and check the box that says `Add a README file`.
4. Click the "Create Repository" button.
5. Click the green "Code" button and copy the SSH link.
6. Using the command line, navigate to a sensible place on your computer to store a project.
7. Run `git clone < link >`, where you replace `< link >` with the SSH link you copied. This will create a folder called `my-first-repo` with your repo files, including the `README.md` file you created when you initialized the repo.
:::
### Basic Approach
1. Initialize a repository for a project (we've already done this!).
2. Tell Git which files to track. Track scripts. Avoid tracking data or binary files like `.docx` and `.xlsx`. [^files]
3. Take a snapshot of tracked files and add a commit message.
4. Save the tracked files to the remote GitHub repository.
5. Repeat, repeat, repeat
[^files]: GitHub refuses to store files larger than 100 MiB. This poses a challenge to writing reproducible code. However, many data sources can be downloaded directly from the web or via APIs, allowing code to be reproducible without relying on storing large data sets on GitHub.
![Basic Git + GitHub workflow](images/git-workflow-commands.png){fig-align="center" width=100%}
### Commands
`git status` prints out all of the important information about your repo. Use it before and after most commands to understand how code changes your repo.
`git add <file-name>` adds a file to staging. It says, "hey look at this!".
`git commit -m "<message>"` commits changes made to added files to the repository. It says, "hey, take a snapshot of the things you looked at in the previous command." **Don't forget the `-m`.** [^m]
[^m]: The `-m` stands for `message`. Writing a brief commit message like "fixes bug in data cleaning script" can help collaborators (including your future self) understand the purpose of your commits.
`git push origin main` pushes your local commits to the remote repository on GitHub. It says, "hey, take a look at these snapshots I just made". It is possible to push to branches other than `main`. Simply replace `main` with the desired branch name.
`git log --oneline` shows the commit history in the repository.
`git diff` shows changes since the last commit.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
1. Navigate to your `example-project` directory on the command line and run `git status` to confirm that there isn't a git repo.
2. Use `git init` to initialize a new repository.
3. In the terminal, run `git add analysis.qmd analysis.html`. Then run `git status`.
4. Then run `git commit -m "add draft analysis"` to commit the two files with a commit message. Then run `git status` again and observe the change.
5. Create a repository on GitHub called `example-project`. Select a public repository and **do not** check the box that says `Add a README file`.
6. Follow the instructions to `…or push an existing repository from the command line`.
:::
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
1. Make some changes to `index.qmd` and re-render the document.
2. Add your files.
3. Commit your files.
4. Push your files.
5. In the GitHub repo, click on `index.qmd` then click on the History icon in the top right corner. Note that you'll see your two commits. Click on the most recent commit and notice that you can see the changes that you've made to the file.
:::
## GitHub Pages {#sec-ghp}
[GitHub Pages](https://pages.github.com/) are free websites hosted directly from a GitHub repository. With a free GitHub account, a GitHub repo must be public to create a GitHub page with that repo. When you create a GitHub page, you associate it with a specific branch of your repo. GitHub Pages will look for an `index.html`, `index.md`, or `README.md` file as the entry file for your site.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
1. Go to your GitHub repository and select "Settings" on the top right.
2. Under the "Code and Automations" menu on the left side, select "Pages".
3. Under "Build and Deployment" and "Branch", use the drop down menu to change the branch to from "None" to "main". This will trigger the deployment of your GitHub page from the "main" branch of your repository. It will take a bit of time for your GitHub page to be ready.
4. Refresh the page, and when you see a box that says "your site is live" at the top of your page, click the link to go to your website!
:::
## .gitignore
We don't want to add every file to Git or GitHub. We want to avoid binary files like Word documents. We can't add very large files.
By default, every file in a local repository will show up after `git status` and those files we don't want to add will be at risk of being added by accident. This is annoying.
Fortunately, we can use a `.gitignore` to ignore files and directories. This cleans up our `git status` and protects us from accidentally add a file we don't want to add. To ignore a file, just add the name of the file or folder to the `.gitignore`.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- 1 + exercise_number
```
1. Run `git status`.
2. Add a file called `.gitignore`. In the file, add `poems/`.
3. Run `git status`.
:::
## Conclusion
Git is a distributed version-control system. It is used for tracking changes in the code. GitHub is an online hosting service for version control using git. Key workhorse commands are `git status`, `git add`, `git commit -m <message>` `git push` and `git diff`. GitHub is also great because it will host websites using GitHub Pages.
### Git is Confusing
```{r}
#| echo: false
#| fig-width: 6.5
#| fig-height: 6
meta_data <- getXKCD(which = 1597)
```
We promise that it's worth it.
### Resources
- [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf)
- [Happy Git and GitHub for the UserR](https://happygitwithr.com/)
- [Git Pocket Guide](https://www.amazon.com/Git-Pocket-Guide-Working-Introduction/dp/1449325866)
- [Getting Git Right](https://www.atlassian.com/git)
- [Git Learning Lab](https://lab.github.com/)
- [Git Handbook](https://guides.github.com/introduction/git-handbook/)
- [Mastering Markdown](https://guides.github.com/features/mastering-markdown/)
- [Understanding the GitHub Flow](https://guides.github.com/introduction/flow/)
- [Documenting Your Projects on GitHub](https://guides.github.com/features/wikis/)
- [Git Tutorial](https://www.katacoda.com/courses/git)
## Personal Access Tokens for Authentication {#sec-ap-a}
1. Starting on your GitHub account navigate through the following:
a. Click your icon in the far right
b. Select **Settings** at the bottom of the drop down menu
c. Select **Developer Settings** on the bottom left
d. Select **Personal access tokens** on the bottom left
e. Select **Generate new token**
2. Set up your Personal Access Token (PAT)
a. Add a note describing the use of your token. This is useful if you intend to generate multiple tokens for different uses.
b. Select "No expiration". You may want tokens to expire if that access sensitive resources.
c. Select scopes. You must select at least the "repo" scope. You may want to add other scopes but they are not required for this course.
3. Click **Generate token**
4. This is your only chance to view the token. Copy and paste the token and store it somewhere safe. If you lose the token, you can always generate a new token.
5. Git will prompt you for your GitHub username and password sometimes while cloning repositories or pushing to private repositories. Use your GitHub username when prompted for username. Use your generated PAT when prompted for password.
## Initialize a Repo Locally and Add to GitHub {#sec-app-b}
*This only needs to happen once per repository*
1. Initialize a local repository with `git init` as outlined above.
2. On GitHub, click the plus sign in the top, right corner and select `New Repository`.
3. Create a repository with the same name as your directory.
4. Copy the code under **...or push an existing repository from the command line** and submit it in the command line that is already open.