---
output:
  # html_document:
  #   keep_md: TRUE
  md_document:
    variant: markdown_github
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
How to obtain a bunch of GitHub issues or pull requests with R
=================
I want to make [`purrr`](https://github.com/hadley/purrr) and [`dplyr`](https://github.com/hadley/dplyr) and [`tidyr`](https://github.com/hadley/tidyr) play nicely with each other. How can I use `purrr` for iteration, while still using `dplyr` and `tidyr` to manage the data frame side of the house?
Three motivating examples, where I marshal data from the GitHub API using the excellent [`gh` package](https://github.com/gaborcsardi/gh):
* In [STAT 545](http://stat545-ubc.github.io), 10% of the course mark is awarded for engagement. I want to use contributions to the course [Discussion](https://github.com/STAT545-UBC/Discussion/issues) as a primary input here. This is how I fell down this rabbit hole in the first place.
* Oliver Keyes [tweeted](https://twitter.com/quominus/status/670398322696392705) that he wanted "a script that goes through all my GitHub repositories and generates a list of which ones have open issues". How could I resist this softball? Sure, there are [easier ways to do this](https://twitter.com/millerdl/status/670430991278858240), but why not use R?
* Jordan Ellenberg, [writing for the Wall Street Journal](http://www.wsj.com/articles/the-summers-most-unread-book-is-1404417569), used Amazon's "Popular Highlights" feature to define the **Hawking Index**:
> Take the page numbers of a book's five top highlights, average them, and divide by the number of pages in the whole book. The higher the number, the more of the book we're guessing most people are likely to have read.
I mean, how many people really stick with "A Brief History of Time" to the bitter end? I was reading through Hadley Wickham's [Advanced R](http://adv-r.had.co.nz) when I read Jordan's article and wondered ... how many people read this entire book? Or do they start and sort of fizzle out? So I wanted to look at the distribution of pull requests. Are they evenly distributed throughout the book or do they cluster in the early chapters?
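To make the index concrete, here's the arithmetic with completely made-up numbers (not real Amazon data):
```{r}
# Hawking Index, illustrated with invented numbers:
# pages of the five top highlights, in a hypothetical 350-page book
top_highlight_pages <- c(12, 30, 44, 90, 151)
mean(top_highlight_pages) / 350
```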
This is a glorified note-to-self. It might be interesting to a few other people. But I presume a lot of experience with R and a full-on embrace of `%>%`, `dplyr`, etc.
* [Oliver's open issues](#olivers-open-issues)
* [Pull requests on a repo](#pull-requests-on-a-repo)
* [Issue threads](#issue-threads)
### Oliver's open issues
Let's start with the easiest task: does Oliver have issues? If so, can we be more specific?
First, load packages. Install `gh` and `purrr` from GitHub, if necessary. `gh` is not on CRAN and `purrr` is under active development; I doubt my code would work with the CRAN version.
```{r}
# install_github("gaborcsardi/gh")
# install_github("hadley/purrr")
library(gh)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))
```
Use `gh()` to retrieve all of Oliver's public GitHub repositories.
```{r}
repos <- gh("/users/ironholds/repos", .limit = Inf)
length(repos)
```
Create a data frame with one row per repo and two variables.
* `repo` = repository name. Use `purrr::map_chr()` to extract all elements named `name` from the repository list. The map functions are much like base `lapply()` or `vapply()`. There is a lot of flexibility around how to specify the function to apply over the input list. Here I use a shortcut: the character vector `"name"` is converted into an extractor function.
* `issue` = list-column of the issues for each repository. Again, I use a map function, in this case to provide vectorization for `gh()`. I use a different shortcut: the `~` formula syntax creates an anonymous function on-the-fly, where `.x` stands for "the input".
```{r oliver-issue-data-frame, cache = TRUE}
iss_df <-
  data_frame(
    repo = repos %>% map_chr("name"),
    issue = repo %>%
      map(~ gh(repo = .x, endpoint = "/repos/ironholds/:repo/issues",
               .limit = Inf))
  )
str(iss_df, max.level = 1)
```
Create a decent display of how many open issues there are on each repo. I use `map_int()` to count the open issues for each repo and then standard `dplyr` verbs to select, filter, and arrange. I'm not even bothering with `knitr::kable()` here because these experiments are definitely not about presentation.
```{r}
iss_df %>%
  mutate(n_open = issue %>% map_int(length)) %>%
  select(-issue) %>%
  filter(n_open > 0) %>%
  arrange(desc(n_open)) %>%
  print(n = nrow(.))
```
A clean script for this is available in [open-issue-count-by-repo.R](open-issue-count-by-repo.R).
### Pull requests on a repo
*Even though it was [Advanced R](http://adv-r.had.co.nz) that got me thinking about this, I first started playing around with [R Packages](http://r-pkgs.had.co.nz), which happens to have 50% fewer PRs than Advanced R. But I've done this for both books and present a script and figure for each at the end of this example.*
Load packages. Even more this time.
```{r}
library(gh)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
suppressPackageStartupMessages(library(purrr))
library(curl)
suppressPackageStartupMessages(library(readr))
```
Use `gh()` to retrieve all pull requests on [`hadley/r-pkgs`](https://github.com/hadley/r-pkgs).
```{r pr-list, cache = TRUE}
owner <- "hadley"
repo <- "r-pkgs"
pr_list <-
  gh("/repos/:owner/:repo/pulls", owner = owner, repo = repo,
     state = "all", .limit = Inf)
length(pr_list)
```
Define a little helper function that [won't be necessary forever](https://github.com/hadley/purrr/issues/110), but is useful below when we dig info out of `pr_list`.
```{r}
# like map_chr(), but maps NULL results to NA_character_ instead of erroring
map_chr_hack <- function(.x, .f, ...) {
  map(.x, .f, ...) %>%
    map_if(is.null, ~ NA_character_) %>%
    flatten_chr()
}
```
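A quick toy demonstration (invented input, not real API output): `map_chr()` would throw an error on the `NULL` element, whereas the hack converts it to `NA`.
```{r}
x <- list(list(sha = "abc123"), list(sha = NULL))
map_chr_hack(x, "sha") # "abc123" NA
```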
Use `map_*()` functions to extract and data-frame-ize the potentially useful parts of the pull request list. I'm extracting much more than I ultimately use, which betrays how overly optimistic I was when I started. So far I can't figure out how to use the API to directly compare two commits, but I haven't given up yet.
```{r pr-df, cache = TRUE}
pr_df <- pr_list %>%
  {
    data_frame(number = map_int(., "number"),
               id = map_int(., "id"),
               title = map_chr(., "title"),
               state = map_chr(., "state"),
               user = map_chr(., c("user", "login")),
               commits_url = map_chr(., "commits_url"),
               diff_url = map_chr(., "diff_url"),
               patch_url = map_chr(., "patch_url"),
               merge_commit_sha = map_chr_hack(., "merge_commit_sha"),
               pr_HEAD_label = map_chr(., c("head", "label")),
               pr_HEAD_sha = map_chr(., c("head", "sha")),
               pr_base_label = map_chr(., c("base", "label")),
               pr_base_sha = map_chr(., c("base", "sha")),
               created_at = map_chr(., "created_at") %>% as.Date(),
               closed_at = map_chr_hack(., "closed_at") %>% as.Date(),
               merged_at = map_chr_hack(., "merged_at") %>% as.Date())
  }
pr_df
```
I want to know which files are affected by each PR. If I had all this stuff locally, I would do [something like this](http://stackoverflow.com/questions/1552340/how-to-list-the-file-names-only-that-changed-between-two-commits):
``` shell
git diff --name-only SHA1 SHA2
```
I have to emulate that with the GitHub API. It seems the [compare two commits feature](https://developer.github.com/v3/repos/commits/#compare-two-commits) only works for two branches or two tags, but not two arbitrary SHAs. Please enlighten me and answer [this question on StackOverflow](http://stackoverflow.com/questions/26925312/github-api-how-to-compare-2-commits) if you know how to do this.
My current workaround is to get info on the diff associated with a pull request from its associated patch file. We've already stored these URLs in the `pr_df` data frame. You can read my rather hacky helper function, [`get_pr_affected_files_from_patch()`](get-pr-affected-files-from-patch.R), if you wish, but I'll just source it here.
```{r}
source("get-pr-affected-files-from-patch.R")
```
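To give a flavor of what such a helper has to do, here's a minimal sketch of the same idea (this is my illustration, *not* the sourced function, which also captures a `diffstuff` column that the sketch ignores): download the patch and scrape file paths out of the `diff --git` header lines.
```{r, eval = FALSE}
# illustrative sketch only -- the real work happens in the sourced helper
sketch_files_from_patch <- function(patch_url) {
  patch <- readLines(curl::curl(patch_url), warn = FALSE)
  # file headers look like: diff --git a/path/to/file b/path/to/file
  headers <- grep("^diff --git ", patch, value = TRUE)
  sub("^diff --git a/(.*) b/.*$", "\\1", headers)
}
```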
Add a list-column to the data frame of pull requests. It holds one data frame per PR, with info on the file changes. We use `map()` again and also use `dplyr` and `purrr` together here, in order to preserve association between the existing PR info and the modified files. *This takes around 4 minutes for me FYI.*
```{r fetch-and-parse-patch, cache = TRUE}
pr_df <- pr_df %>%
  mutate(pr_files = patch_url %>% map(get_pr_affected_files_from_patch))
```
Sanity check the `pr_files` list-column. First, look at an example element. We have one row per file and two variables: `file` and `diffstuff` (currently I do nothing with this but ...). Do all elements of the list-column have exactly two variables? What's the distribution of the number of rows? I expect to see that the vast majority of PRs affect exactly 1 file, because there are lots of typo corrections.
```{r}
pr_df$pr_files[[69]]
pr_df$pr_files %>% map(dim) %>% do.call(rbind, .) %>% apply(2, table)
```
Simplify the list-column elements from data frame to character vector. Then use `tidyr::unnest()` to "explode" things, i.e. give each element its own row. Each row is now a file modified in a PR.
```{r}
nrow(pr_df)
pr_df <- pr_df %>%
  mutate(pr_files = pr_files %>% map("file")) %>%
  unnest(pr_files)
nrow(pr_df)
```
Write `pr_df` out to file, omitting lots of the variables I currently have no use for.
```{r}
pr_df %>%
  select(number, id, title, state, user, pr_files) %>%
  write_csv("r-pkgs-pr-affected-files.csv")
```
Here's a figure depicting how often each chapter has been the target of a pull request. I'm not adjusting for the length of the chapter or anything, so take it with a huge grain of salt. But there's no obvious evidence that people read and edit the earlier chapters more. We like to make suggestions about Git, apparently!
![](r-pkgs-pr-affected-files-barchart.png)
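If you just want a quick version of this figure, here's a minimal sketch drawn from the CSV written above (the real figure code is in the script linked below); it assumes a simple count of rows per file is all that's needed.
```{r, eval = FALSE}
# minimal sketch; see r-pkgs-pr-affected-files-figs.R for the real code
library(ggplot2)
readr::read_csv("r-pkgs-pr-affected-files.csv") %>%
  count(pr_files, sort = TRUE) %>%
  ggplot(aes(x = reorder(pr_files, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "number of PRs touching each file")
```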
Recap of files related to PRs on R Packages:
* script to marshal data: [r-pkgs-pr-affected-files.R](r-pkgs-pr-affected-files.R)
* ready-to-analyze data: [r-pkgs-pr-affected-files.csv](r-pkgs-pr-affected-files.csv)
* barchart: [r-pkgs-pr-affected-files-barchart.png](r-pkgs-pr-affected-files-barchart.png)
* script to make barchart: [r-pkgs-pr-affected-files-figs.R](r-pkgs-pr-affected-files-figs.R)
I went through the same steps with all pull requests on [`hadley/adv-r`](https://github.com/hadley/adv-r), the repository for [Advanced R](http://adv-r.had.co.nz).
Here's the same figure as above but for Advanced R. There's a stronger case for earlier chapters being targeted with PRs more often.
![](adv-r-pr-affected-files-barchart.png)
Recap of files related to PRs on Advanced R:
* script to marshal data: [adv-r-pr-affected-files.R](adv-r-pr-affected-files.R)
* ready-to-analyze data: [adv-r-pr-affected-files.csv](adv-r-pr-affected-files.csv)
* barchart: [adv-r-pr-affected-files-barchart.png](adv-r-pr-affected-files-barchart.png)
* script to make barchart: [adv-r-pr-affected-files-figs.R](adv-r-pr-affected-files-figs.R)
### Issue threads
[STAT 545](http://stat545-ubc.github.io) has a public Discussion repo, where we use the issues as a discussion board. I want to look at the posts there, as something related to student engagement that I can actually quantify.
This starts out fairly similar to the previous example: I retrieve all issues that have been modified since September 1, 2015.
```{r stat-545-issue-list, cache = TRUE}
owner <- "STAT545-UBC"
repo <- "Discussion"
issue_list <-
  gh("/repos/:owner/:repo/issues", owner = owner, repo = repo,
     state = "all", since = "2015-09-01T00:00:00Z", .limit = Inf)
(n_iss <- length(issue_list))
```
This retrieves `r n_iss` issues. I use this list to create a conventional data frame with one row per issue.
```{r}
issue_df <- issue_list %>%
  {
    data_frame(number = map_int(., "number"),
               id = map_int(., "id"),
               title = map_chr(., "title"),
               state = map_chr(., "state"),
               n_comments = map_int(., "comments"),
               opener = map_chr(., c("user", "login")),
               created_at = map_chr(., "created_at") %>% as.Date())
  }
issue_df
```
It turns out some of these issues were created during the 2014 run but show up here because I closed them in early September. Get rid of them.
```{r}
issue_df <- issue_df %>%
  filter(created_at >= "2015-09-01T00:00:00Z")
(n_iss <- nrow(issue_df))
```
Down to `r n_iss` issues.
My ultimate goal is a data frame with one row per issue comment, but it's harder than you might expect to get there. Each issue should be represented by at least one row and many will have several rows, as there are typically follow up comments.
I need to loop over the issues and retrieve the follow up comments. I mean that literally -- the [Issue Comment endpoint](https://developer.github.com/v3/issues/comments/) does not return a comment for the opening of the issue. This makes for a little extra data manipulation ... and more practice with `purrr` and `dplyr`!
Make a data frame of issue "opens" with a set of variables chosen for maximum bliss in future binds and joins. The `i` variable records comment position within the thread.
```{r}
opens <- issue_df %>%
  select(number, who = opener) %>%
  mutate(i = 0L)
opens
nrow(opens)
```
Make a data frame of issue follow up comments. At first, this has to hold an unfriendly list-column `res` where I dump issue comments as returned by the API.
```{r stat-545-issue-comments, cache = TRUE}
comments <- issue_df %>%
  select(number) %>%
  mutate(res = number %>% map(
    ~ gh(number = .x,
         endpoint = "/repos/:owner/:repo/issues/:number/comments",
         owner = owner, repo = repo, .limit = Inf)))
str(comments, max.level = 1)
```
What is the `res` variable? A list-column of length `r n_iss`, each component of which is another list of comments, each of which is also a nested list. Here's a look at 3 elements corresponding to issues that generated anywhere from no discussion to lots of discussion.
```{r}
comments %>%
  filter(number %in% c(275, 273, 272)) %>%
  select(res) %>%
  walk(str, max.level = 2, give.attr = FALSE)
```
All I really want to know is *who* made the comment, so I mutate `res` into `who`, using `map_chr()` with a character vector as the extractor function, pushed one level down into the `res` nested list. I can then drop the nasty `res` variable and revisit the same threads as above to show how much simpler things have gotten.
```{r}
comments <- comments %>%
  mutate(who = res %>% map(. %>% map_chr(c("user", "login")))) %>%
  select(-res)
comments %>%
  filter(number %in% c(275, 273, 272))
```
Use `tidyr::unnest()` to "explode" the `who` list-column and get one row per follow up comment. I now add the `i` variable for numbering within the thread.
```{r}
comments <- comments %>%
  unnest(who) %>%
  group_by(number) %>%
  mutate(i = row_number(number)) %>%
  ungroup()
comments
```
No more list-columns!
It's time for a sanity check. Do the empirical counts of follow up comments match the number of comments initially reported by the API?
```{r}
count_empirical <- comments %>%
  count(number)
count_stated <- issue_df %>%
  select(number, stated = n_comments)
checker <- left_join(count_empirical, count_stated)
with(checker, n == stated) %>% all() # hopefully TRUE
```
I row bind issue "opens" and follow up comments, feeling very smug that they have exactly the same variables, though it is no accident.
```{r}
atoms <- bind_rows(opens, comments)
```
Join back to the original data frame of issues, since that still holds issue title, state and creation date. It is intentional that the `number` variable has been set up as the natural `by` variable.
```{r}
finally <- atoms %>%
  left_join(issue_df) %>%
  select(number, id, opener, who, i, everything()) %>%
  arrange(desc(number), i)
```
A quick look at this and ... we're ready for analysis. Our work here is done.
```{r}
finally
finally %>%
  count(who, sort = TRUE)
# write_csv(finally, "stat545-discussion-threads.csv")
```
A clean script for this example is in [stat545-discussion-threads.R](stat545-discussion-threads.R).
---
Thanks to [`@hadley`](https://github.com/hadley) and [`@lionel-`](https://github.com/lionel-) for patiently answering all of my `purrr` questions. There have been many.
---
```{r}
devtools::session_info()
```