-
Notifications
You must be signed in to change notification settings - Fork 1
/
mxnet_gitlog.Rmd
768 lines (573 loc) · 31.5 KB
/
mxnet_gitlog.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
---
title: "Mxnet Authors"
author: "Augustina Ragwitz"
date: "July 17, 2017"
output: html_document
params:
git_path: "incubator-mxnet"
gitlog_sha: '92a93b80'
gitlog_out: 'data/mxnet_gitlog.txt'
gh_id: !r Sys.getenv("API_KEY_GITHUB_ID")
gh_secret: !r Sys.getenv("API_KEY_GITHUB_SECRET")
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(gh)
library(httr)
library(jsonlite)
library(readr)
library(reshape2)
library(stringr)
library(tidyr)
library(zoo)
```
# Overview
This notebook lives in Github and is the first in a series: https://github.com/missaugustina/mxnet_reports
# Commit History
Use the "git log" command to extract commit history. The "pretty" option allows the log output to be formatted.
```{r git_log}
git_log_cmd <- paste0('cd ', params$git_path,
'; git log ', params$gitlog_sha,
#' --no-merges ',
' --date=short --pretty=tformat:"%ad|%an|%ae|%h" > ../', params$gitlog_out)
system(git_log_cmd)
git_log_cmd
```
```{r gitlog_raw}
gitlog_raw <- read.csv(params$gitlog_out, header = FALSE, sep = "|", quote="",
col.names=c("git_log_date", "name", "email", "sha"),
stringsAsFactors = FALSE)
# add date formatting so these sort properly
gitlog_parsed_dates <- gitlog_raw %>%
mutate(commit_date=as.Date(git_log_date, tz="UTC"),
commit_month=as.yearmon(commit_date),
name=str_to_lower(name),
email=str_to_lower(email)) %>%
select(commit_date, commit_month, name, email, sha)
```
## Normalize Authors
Authors use different names and email addresses, so to get an accurate count of unique authors, we need to consolidate them. This method uses the first email address found in the commit log for an author as the authoritative one. While other methods should be explored and compared for better accuracy, this is sufficient to identify unique authors.
Due to a significant mismatch with Github's number of contributors (see below), the following authors were associated with the same Github login. They have been updated manually in this report, but future versions should come up with a better way to automate this.
```{r parsed_emails}
# manual updates for known authors that make things complicated
# TODO create table from commits_github_login and merge instead
gitlog_parsed_emails <- gitlog_parsed_dates %>%
mutate(name=ifelse(name=="tornadomeet", "wei wu", name),
name=ifelse(name=="yuan (terry) tang" | name == "terrytangyuan", "terry tang", name),
name=ifelse(name=="el potaeto" | name=="pommedeterresautee", "michaël benesty", name),
name=ifelse(name=="=", "luke metz", name),
name=ifelse(name=="codingcat", "nan zhu", name),
name=ifelse(name=="kaixhin", "kai arulkumaran", name),
name=ifelse(name=="byliu", "liu benyuan", name),
name=ifelse(name=="matthew hill", "matt hill", name),
name=ifelse(name=="linmin", "lin min", name),
name=ifelse(name=="nitnelave", "valentin tolmer", name),
name=ifelse(email=="[email protected]"
| email=="[email protected]"
| email=="[email protected]"
| name=="dongguosheng"
| name=="qiaohaijun",
"qiao hai-jun", name),
name=ifelse(name=="shallyan", "yan xiao", name),
name=ifelse(name=="wuwei", "wei wu", name),
name=ifelse(name=="wang gu", "gu wang", name),
name=ifelse(name=="yzhang87", "yu zhang", name)
)
# known email provider domains
email_provider_domains <- c("gmail.com",
"users.noreply.github.com",
"hotmail.com",
"googlemail.com",
"qq.com",
"126.com",
"163.com",
"outlook.com",
"me.com",
"live.com",
"yahoo.com", # manual verification showed these are not yahoo employees
"yahoo.fr",
"yahoo.de",
"foxmail.com",
"protonmail.com",
"msn.com",
"mail.com",
"comcast.net")
```
```{r normalize_authors}
# create an authoritative email address
# all matching emails should have the same sha
gh_authors_by_email <- gitlog_parsed_emails %>%
arrange(desc(commit_date)) %>%
group_by(email, name) %>%
summarise(num_commits = n(),
last_commit=max(commit_date))
source("normalize_authors.R")
gh_authors_lookup <- build_authors_lookup(gh_authors_by_email, email_provider_domains)
gitlog_parsed_authors <- merge(gitlog_parsed_emails, gh_authors_lookup, by=c("email"), all=TRUE)
gitlog_parsed_authors <- unique(gitlog_parsed_authors)
# check that each SHA only has one author
gitlog_parsed_authors_check <- gitlog_parsed_authors %>% group_by(sha) %>% mutate(n=n()) %>% filter(n>1)
# TODO use an assert
paste("Commits with more than one author (should be zero):", nrow(gitlog_parsed_authors_check))
```
```{r exclude_july}
# exclude July for summaries
gitlog_parsed <- gitlog_parsed_authors %>% filter(commit_month < "July 2017")
```
# Data Summaries
## Commit Months
Commits are grouped by month and author to determine frequencies. Commit frequency bins are created using the rounded natural log of the total commits for the author. The output is a list of total commits and authors per month.
```{r summaries_by_month}
# total number of commits per month (used for %)
total_commits_per_month <- gitlog_parsed %>% group_by(commit_month) %>%
summarise(num_commits=n())
# commits per author
commits_per_author <- gitlog_parsed %>%
group_by(commit_month, author) %>%
summarise(num_author_commits=n(),
email_id_host=first(email_id_host)
) %>%
group_by(author) %>% # number of months the author has commits for
mutate(num_months=n()) %>%
group_by(commit_month) %>% # total authors per month
mutate(num_authors=n())
commits_summary <- total_commits_per_month %>%
inner_join(commits_per_author, by="commit_month") %>%
mutate(commits_pct = num_author_commits/num_commits) # determine percent using total commits
# Authors with just one commit month
single_month_authors <- commits_per_author %>%
filter(num_months == 1) %>%
group_by(commit_month) %>%
summarise(num_authors_single=n())
commits_summary <- commits_summary %>% inner_join(single_month_authors, by="commit_month")
# clean up
rm(total_commits_per_month, single_month_authors)
```
```{r host_summaries}
# commits per month by host
commits_summary <- commits_summary %>%
group_by(commit_month, email_id_host) %>%
mutate(
num_host_commits = sum(num_author_commits),
num_host_authors = n_distinct(author)
) %>%
group_by(email_id_host) %>%
mutate(
num_host_months = n_distinct(commit_month),
total_host_authors = sum(num_host_authors)) %>%
group_by(commit_month) %>%
mutate(num_hosts = n_distinct(email_id_host))
```
## Commit Patterns
Do authors typically follow a similar number
ex: 5 months -> commits log = 1, 1, 2, 3, 2 -> frequency = 1:2, 2:2, 3:1
commit_freq_pct: 1:2/5, 2:2/5, 3:1/5 <- out of 5 months how often that commit log was observed
freq_bins_pct: 3/5 months share the same commit log
```{r commit_bins}
# create bins using the rounded log of number of commits
commits_summary <- commits_summary %>%
mutate(author_commits_log = round(log(num_author_commits))) %>%
group_by(author_commits_log) %>% # min/max commits range for each bin, makes reporting more clear
mutate(commits_min = min(num_author_commits),
commits_max = max(num_author_commits))
```
Do authors have similar commit activity each month?
```{r commit_patterns}
commit_patterns <- commits_summary %>%
group_by(author, author_commits_log) %>%
summarise(
num_months = first(num_months), # number of months with commits
commit_freq_months = n(), # number of months similar number of commits observed
commit_freq_months_pct = round(commit_freq_months/num_months, 2), # % months commit pattern observed
author_commits_min = min(num_author_commits),
author_commits_max = max(num_author_commits),
commits_min = first(commits_min), # min and max commits for each bin
commits_max = first(commits_max),
email_id_host = first(email_id_host)) %>%
group_by(author) %>%
mutate(
commit_dev_months = n(), # months that shared a commit pattern (or number of different patterns)
commit_dev_months_pct = round(commit_dev_months/num_months, 2)) # % months sharing a pattern
# ex: 5 months -> commits log = 1, 1, 2, 3, 2 -> frequency = 1:2, 2:2, 3:1
# observed_months_pct: 1:2/5, 2:2/5, 3:1/5 <- out of 5 months how often that commit log was observed
# shared_months_pct: 3/5 months share the same commit log
```
## Commit Delta
As authors continue to engage with a project, does their commit activity increase or decrease?
## Commit Percent Change
Does an author need to make more or less commits to acheive the same proportion?
# Github Comparison
How different are metrics in the Gitlog compared to what Github reports?
On July 31, 2017, Github reported the total Commits as 5,719 and total Contributors as 391.
![MXNet Contributors and Commits on Github - July 31, 2017](png/commits_contributors.png)
```{r github_stats}
total_github_commits <- 5719
total_github_authors <- 391
```
## Commits
We have the same number of commits as what Github reports.
```{r check_commits}
total_commits <- nrow(gitlog_parsed_authors)
paste("Total Commits in Git Log:", total_commits)
paste("Total Commits reported by GitHub:", total_github_commits)
paste("Difference from Github:", total_github_commits - total_commits)
```
## Contributors
If our normalization is accurate, Github should have more authors than us. A single author with multiple Github accounts or using an email address unassociated with a Github account would be counted by Github multiple times.
We show more authors than Github because Github does not count authors without a Github login when determining its total. The extra authors we've identified are authors that do not have a Github login nor are otherwise associated with an account that does. This is illustrated below.
```{r check_authors}
# authors stats
gitlog_authors_total <- gitlog_parsed_authors %>%
group_by(author) %>%
summarise(email_id=first(email_id), commits=n())
total_authors <- nrow(gitlog_authors_total)
paste("Total Authors in Git Log:", total_authors)
paste("Total Authors reported by GitHub:", total_github_authors)
authors_difference <- total_authors - total_github_authors
paste("Difference from Github:", authors_difference)
```
Because this is a big difference, we should get commits from the Github API to compare logins.
```{r gh_api_commits, eval=FALSE}
url <- "https://api.github.com/repos/apache/incubator-mxnet/commits"
query_params <- list(
client_id=params$gh_id,
client_secret=params$gh_secret,
per_page=100)
get_gh_commits <- function (url, query) {
req <- GET(url, query=query)
print(paste(req$url))
json <- content(req, as = "text")
commits <- fromJSON(json, flatten=TRUE)
return(commits)
}
commits <- data_frame()
for (n in 1:58) {
print(paste("Getting commits for page:", n))
commits <- bind_rows(commits, get_gh_commits(url, append(query_params, c(page=n))))
saveRDS(commits, paste0("commits/_commits_", n, ".rds"))
}
commits <- commits %>% mutate(sha_long=sha, sha = str_sub(sha_long, 1, 7))
saveRDS(commits, "mxnet_commits.rds")
```
The Github API returned 5 additional commits compared to a) what is in the Git log and b) what is reported on Github's web page. Github uses the number of commits in the Git log to determine the total number of commits. Therefore, we can exclude any commits from the API that are not in the Git log.
The number of authors reported by the API commits is also slightly higher than what Github reports. Given what we know about commits, it is most likely that the discrepancy is explained by a unique author in a more recent commit than what is in the Git log.
Another interesting observation is that authors that do not have a Github login are not counted as part of the total.
```{r, check_commits}
commits <-readRDS("mxnet_commits.rds")
github_api_commits_cnt <- nrow(commits)
paste("The number of commits from the Github API is:", github_api_commits_cnt)
paste("Difference between Github API and Github:", github_api_commits_cnt - total_github_commits)
commits_github_na_login <- commits %>% filter(is.na(author.login))
commits_github_login <- commits %>% filter(!is.na(author.login))
# do githubs own numbers match themselves?
github_api_cnt <- length(unique(commits_github_login$author.login))
github_na_cnt <- length(commits_github_na_login$author.login)
paste("Per commits from the Github API, the number of authors is:", github_api_cnt)
paste("Number of commits with no author.login:", github_na_cnt)
```
When we exclude all authors without a Github login, our total authors is slight less than Github's. When we consider authors we've identified as distinct that have multiple Github accounts, we are able to acheive about the same total of authors as what Github reports. The discrepancy is due to a timezone issue. Github uses "local" time, which is the local timezone of the commit author per the commit timestamp. We are using UTC to normalize the timezone.
```{r, gitlog_extra_authors}
# commits we have that are also in github
commits_check <- commits %>% inner_join(gitlog_parsed_authors, by="sha")
paste("Unique Authors (Gitlog):", length(unique(commits_check$author)))
commits_check_na <- commits_check %>%
filter(!is.na(author.login)) %>%
select(author, author.login) %>%
unique()
paste("Unique Authors (Gitlog, No Github Login):", length(unique(commits_check_na$author)))
commits_check_multi_login <- commits_check_na %>%
group_by(author) %>%
mutate(num_gh_logins=n_distinct(author.login)) %>%
filter(num_gh_logins > 1) %>%
mutate(num_extra_logins = num_gh_logins-1) %>%
select(author, num_extra_logins) %>%
unique()
paste("Unique Authors (Single in Gitlog, Multiple Github Login):",
sum(commits_check_multi_login$num_extra_logins))
```
## Pulse Report
This Pulse report shows the same number of commits and 1 additional author. Since our total unique authors count is off by 6, and the number of commits is correct, we have to assume there is a problem with the author normalization. A manual verification of authors who had more than one distinct email address all pointed to the same Github accounts. Timezones could be another factor, this report uses UTC to normalize the time zones and it's not clear if Git is using local or UTC for its total. We can't recreate the Pulse snapshot because the tensorflow-gardener continually "rewrites" the commit log history.
Overall a discrepancy of 1 author isn't significant.
According to the Github Pulse Report, "Excluding merges, 54 authors have pushed 148 commits to master and 850 commits to all branches. On master, 361 files have changed and there have been 22,499 additions and 4,039 deletions."
![MXNet Github Pulse Report - July 31, 2017](png/github_pulse_report.png)
```{r}
gh_pulse_check <- gitlog_parsed_authors %>%
filter(commit_date > as.Date("2017-06-29") & commit_date < as.Date("2017-07-31"))
gh_pulse_check_authors <- gh_pulse_check %>%
group_by(author) %>%
summarise(num_author_commits=n())
total_authors_pulse_check <- nrow(gh_pulse_check_authors)
paste("Total Unique Authors: ", total_authors_pulse_check)
paste("Difference From Pulse Check: ", total_authors_pulse_check - 54)
total_commits_pulse_check <- n_distinct(gh_pulse_check$sha)
paste("Total Commits: ", total_commits_pulse_check)
paste("Difference From Pulse Check: ", total_commits_pulse_check - 148)
```
Therefore, we can conclude that our methods for counting commits match Github's results. Despite our previous author discrepancy, the method appears to sufficiently accurate for the purposes of this analysis.
# Authors
## Unique Authors per Month
The number of unique authors has grown steadily per month.
```{r, authors_per_month}
ggplot(data = commits_summary, aes(x = factor(commit_month))) +
geom_point(aes(y = num_authors, color="Authors"), group=1) +
geom_line(aes(y = num_authors, color="Authors"), group=1) +
ylab("Count") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Unique Email Domains per Month
The number of unique email domains has grown steadily per month.
```{r, hosts_per_month}
ggplot(data = commits_summary, aes(x = factor(commit_month))) +
geom_point(aes(y = num_hosts, color="Email Domains"), group=1) +
geom_line(aes(y = num_hosts, color="Email Domains"), group=1) +
ylab("Count") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Proportion of Authors by Email Domain
```{r, authors_by_host}
# Exclude email-provider addresses
# more than one overall author (otherwise there are too many to plot)
ggplot(commits_summary %>%
filter(!email_id_host %in% email_provider_domains
& num_months > 1
& is.na(str_match(email_id_host, "\\.edu"))),
aes(x=factor(commit_month),
y=num_host_authors,
fill=reorder(email_id_host, -total_host_authors))) +
geom_bar(stat="identity", position="dodge") +
xlab("Months") +
ylab("Authors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
guides(fill=guide_legend(title="Hosts"))
# contains edu
ggplot(commits_summary %>%
filter(!is.na(str_match(email_id_host, "\\.edu"))),
aes(x=factor(commit_month),
y=num_host_authors,
fill=reorder(email_id_host, -total_host_authors))) +
geom_bar(stat="identity", position="dodge") +
xlab("Months") +
ylab("Authors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
guides(fill=guide_legend(title="Hosts"))
```
# Proportion of Commits by Email Domain
```{r, commits_by_host}
# does not contain edu + not in email_provider_domains
ggplot(commits_summary %>%
filter(!email_id_host %in% email_provider_domains
& num_months > 1
& is.na(str_match(email_id_host, "\\.edu"))),
aes(x=factor(commit_month), y=num_host_commits, fill=email_id_host)) +
geom_bar(stat="identity", position="dodge") +
xlab("Months") +
ylab("Email Domain") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
guides(fill=guide_legend(title="Hosts"))
# contains edu
ggplot(commits_summary %>%
filter(!is.na(str_match(email_id_host, "\\.edu"))),
aes(x=factor(commit_month), y=num_host_commits, fill=email_id_host)) +
geom_bar(stat="identity", position="dodge") +
xlab("Months") +
ylab("Email Domain") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
guides(fill=guide_legend(title="Hosts"))
```
# Authors vs Commits
## Unique Authors vs Total Commits per Month
Overall in the Tensorflow project, the number of authors and commits are not only decreasing but appear to be correlated. Logically one can assume that less authors means less commits.
```{r}
ggplot(data = commits_summary, aes(x = factor(commit_month))) +
geom_point(aes(y = num_authors*5, color="Authors (scaled)"), group=1) +
geom_line(aes(y = num_authors*5, color="Authors (scaled)"), group=1) +
geom_point(aes(y = num_commits, color="Commits"), group=2) +
geom_line(aes(y = num_commits, color="Commits"), group=2) +
ylab("Count") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(data = commits_summary, aes(x = num_commits, y = num_authors)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
xlab("Commits") +
ylab("Authors")
```
## Unique Authors vs Commits per Author per Month
How are commits per author distributed as the number of authors per month increases?
As the number of authors increases the proportion of commits becomes more evenly distributed, so we see a downward trend in percent of commits per author. Early on in the project, more authors had a higher proportion of the commits.
```{r}
ggplot(data = commits_summary, aes(x = num_authors)) +
geom_jitter(aes(y=num_author_commits/2, color="Commits")) +
geom_jitter(aes(y=commits_pct * 100, color="Commits %")) + # scaled for comparison
geom_smooth(aes(y=commits_pct*100), method="lm", se=FALSE) +
xlab("Authors") +
ylab("Commits per Author (Scaled)")
```
# Commits per Author
## Commits Per Author Frequency
What is the overall distribution for commits per author? How many commits do most authors make per month?
Commits per author were grouped into bins based on the rounded natural log. The maximum value of that set is reported in the density plots below.
The highest per month commit frequency seen is 1-4 per month.
```{r}
ggplot(commits_summary, aes(x=factor(commits_max))) +
geom_density() +
xlab("Commits per Author")
```
How many months have authors been contributing? Earlier analysis indicated that there were less authors in 2015 when the project started than there are in 2017.
The majority of Authors only have commits for 5 or less months.
```{r}
ggplot(commits_summary, aes(x=factor(num_months))) +
geom_bar() +
xlab("Author Months")
```
## Commits per Author Deviation
Commit deviation is a possible indicator of new committer proportion. As commit activity per author increases, so does the variability of commit activity from month to month. A project with a higher proportion of contributors with fewer months of commit history could be an indication of high developer interest. "Drive bys" are contributions from people do not continue to stay involved with the project, although they may continue to consume it. A high number of "drive bys" could indicate a strong developer user base.
Do authors make roughly the same number of commits each month?
The plot below shows what percentage of months the author comitted on have the same commit activity. In other words, for what percentage of months was the commit pattern observed?
This density plot would indicate that authors are fairly consistent in terms of their monthly commit activity. This consistency is likely due to the fact that the majority of authors have less than 6 months of commits.
```{r, commit_frequency}
ggplot(commit_patterns,
aes(x=commit_freq_months_pct*100)) +
geom_density() +
xlab("% of Months with Observed Commit Pattern")
```
The next plot looks at how much variation there was amongst commit patterns for authors for each month. What percentage of months share an observed commit pattern? If an author has the same commit pattern for 5 out of their 10 months, then 50% of that author's months share a commit pattern. The higher this percentage, the lower the overall deviation.
This density plot would indicate that a majority of authors show 100% shared commit pattern. This consistency is likely due to the fact that the majority of authors have less than 6 months of commits.
```{r, commit_deviation}
ggplot(commit_patterns,
aes(x=commit_dev_months_pct*100)) +
geom_density() +
xlab("% of Months with Shared Commit Pattern")
```
Do authors who have contributed for more than 6 months show more deviation in their commit patterns?
Authors with greater than 6 months of commit history represent about 7% of all authors (71/1002). The majority of authors only have one month of commit activity.
```{r, commit_deviation_6_months}
ggplot(commit_patterns %>% filter(num_months > 5),
aes(x=commit_freq_months_pct*100)) +
geom_bar() +
xlab("% of Months with Observed Commit Pattern") +
xlim(0,100)
ggplot(commit_patterns %>% filter(num_months > 5),
aes(x=commit_dev_months_pct*100)) +
geom_bar() +
xlab("% of Months with Shared Commit Pattern") +
xlim(0,100)
```
When did authors with only one commit month make that commit?
```{r, single_authors}
ggplot(data = commits_summary, aes(x = factor(commit_month))) +
geom_point(aes(y = num_authors_single, color="Singles"), group=2) +
geom_line(aes(y = num_authors_single, color="Singles"), group=2) +
ylab("Count") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
What proportion of authors each month do single authors represent?
Around 25-40% of authors each month will only ever commit for that month. When there are less authors in a particular month, the proportion of single month committers is significantly lower.
```{r, single_authors_percent}
commits_summary <- commits_summary %>% mutate(authors_single_pct=num_authors_single/num_authors)
ggplot(data = commits_summary, aes(x = factor(commit_month))) +
geom_point(aes(y = num_authors, color="Authors"), group=1) +
geom_line(aes(y = num_authors, color="Authors"), group=1) +
geom_point(aes(y = num_authors_single, color="Singles"), group=2) +
geom_line(aes(y = num_authors_single, color="Singles"), group=2) +
ylab("Count") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(data = commits_summary, aes(x = factor(commit_month), y = authors_single_pct*100)) +
geom_point(group=1) +
geom_line(group=1) +
ylab("% of Authors w/ 1 Month") +
xlab("Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Commit Author Frequency vs Deviation
Another indicator of project participation levels is the overall number of commits per author compared with commit frequency. Do authors that commit more frequently tend to have significantly more commits as well? In the Tensorflow project we see a small increase.
```{r}
ggplot(commits_summary %>% filter(commit_month > "Nov 2015"),
aes(x=factor(num_months), y=round(commits_pct,2)*100, colour=factor(commits_max))) +
geom_jitter() +
xlab("# Months w/ Commits") +
ylab("% Commits per Author") +
guides(colour=guide_legend(title="Commits"))
```
## Commits per Author Percent Change
Early on in the project there were less authors as is indicated by the higher proportion of commits. As the project has progressed however, the number of commits is more distributed.
The "tensor-gardener" is the only "author" consistently responsible for a high proportion of the commit activity. This comes from Google internally when employees make commits and they can't be linked to an existing Github account. The proportion of these "Google" commits appears to be gradually decreasing which could indicate an increase in non-Google authors.
```{r, commits_percent_change}
ggplot(data = commits_summary,
aes(x = factor(commit_month), y = commits_pct, fill=author)) +
geom_bar(stat="identity", position="stack") +
ylab("Commits") +
xlab("Month") +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Authors > 6 months
ggplot(data = commits_summary %>% filter(num_months > 6),
aes(x = factor(commit_month), y = commits_pct, fill=author)) +
geom_bar(stat="identity", position="stack") +
ylab("Commits") +
xlab("Month") +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(data = commits_summary %>% filter(num_months < 4),
aes(x = factor(commit_month), y = commits_pct, fill=author)) +
geom_bar(stat="identity", position="stack") +
ylab("Commits") +
xlab("Month") +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Does the commit percentage change increase or decrease over time? That is, does an author need to make more or less commits to achieve the same commit proportion?
If project activity is gradually increasing, then we should see a decrease. That is, as more authors engage with a project and make more commits, an author needs to make more commits to acheive the same commit proportion. If project activity is becoming more distributed, authors who started out with a higher percentage of commits should gradually start representing a lower percentage. If, however, project activity is not well distributed, then we would contiue to see a high percentage of commits from the original authors, thus indicating a decreasing or neutral percent change.
```{r}
ggplot(commits_summary %>%
filter(author_commits_log > 1 & author_commits_log < 5),
aes(x=factor(commit_month), y=commits_pct*100, colour=factor(commits_max))) +
geom_jitter() +
xlab("Month") +
ylab("Commits per Author (%)") +
guides(colour=guide_legend(title="Commits")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Is any one group responsible for most of the single-month contributors?
Authors commiting with google.com email domains are only responsible for a small proportion of the single commit month activity. While it's the largest proportion of any non-email-provider host, it's not likely that Google is dominating the single-month committer population.
```{r}
ggplot(commits_summary %>%
filter(num_months == 1
& (! email_id_host %in% email_provider_domains)
& is.na(str_match(email_id_host, "\\.edu"))),
aes(x=factor(commit_month),
y=num_host_authors,
fill=reorder(email_id_host, -total_host_authors))) +
geom_bar(stat="identity", position="stack") +
xlab("Months") +
ylab("Authors w/ 1 Commit Month") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
guides(fill=guide_legend(title="Hosts"))
```
# Conclusions
The key advantage of git log analysis is that all of the information is available locally (once the repository has been cloned) and does not take up much space. This is particularly advantageous for older projects that may have a large number of commits.
The key disadvantage is the inability to easily consolidate authors. Company affiliations identified through email address also need to be normalized and a change in company needs to be properly indicated for the new months. This was not fully addressed in this analysis. Tensorflow is a fairly young project so it isn't as significant here as it would be with an older project where more authors may have changed companies.
The analysis of Tensorflow's commit history suggests that a growing project will trend towards a smaller proportion of overall commits per month per author. The overall number of commits per author on its own does not provide much insight into overall community health or growth. The proportion of commits per author and whether the same number of commits per author goes up or down in proportion, however, is an indicator of activity distribution and is a potentially useful metric.
## Proposed Metrics
### Commit Months
Frequency of months with at least one commit. Shows consistency in involvement over time.
### Commit Delta
Rate of change in commit frequency. An overall low rate of change suggests a high proportion of new authors.
### Commit Deviation
Frequency of change in commit frequency. A high frequency suggests a high proportion of new authors.
### Commit Percent Change
Commit frequency percent change. A high decrease suggests an increase in overall commit distribution per author.
## Next Steps
* Apply the proposed metrics to a larger sample of projects to see how they compare.
* Improve author normalization
* Improve company-domain identification
* Compare domain identification with Github profile company
# References
* "Why Google Stores Billions of Lines of Code in a Single Repository" Communications of the ACM, Vol. 59 No. 7, Pages 78-87. https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
* "How the TensorFlow team handles open source support" O'Reilly Ideas. May 4, 2017. https://www.oreilly.com/ideas/how-the-tensorflow-team-handles-open-source-support