---
title: "Let's play together: Collaborative Data Science"
subtitle: |
  | Data Science Conference 4.0
institute: "Quantargo"
author: "Mario Annau"
date: "September 19, 2018"
output:
  revealjs::revealjs_presentation:
    includes:
      in_header: header.html
    center: yes
    reveal_options:
      previewLinks: yes
      slideNumber: yes
    self_contained: yes
    theme: black
    toc: yes
    transition: slide
    css: styles.css
    highlight: zenburn
    fig_caption: no
    incremental: no
---
```{r setup, include=FALSE}
library(h5)
library(zoo)
library(magrittr)
library(reshape2)
library(visNetwork)
library(readr)
library(dplyr)
library(plotly)
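removeBraces <- function(x) {
  # Drop bracketed annotations such as "[aut, cre]", "(>= 3.1.0)" and "<a@b.org>".
  # Each character class excludes its own closing delimiter so that a match
  # cannot run across several annotations.
  x <- gsub("\\s*\\[[^]]+\\]", "", x)
  x <- gsub("\\s*\\([^)]+\\)", "", x)
  x <- gsub("\\s*<[^>]+>", "", x)
  x
}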
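splitAndComma <- function(x) {
  # Split on commas or on the standalone word "and"; the word boundaries keep
  # names that merely contain "and" (e.g. "randomForest") intact.
  strsplit(x, ",|\\band\\b", perl = TRUE)
}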
removeArtifacts <- function(x) {
  x <- gsub("(c\\()", "", x, perl = TRUE)
  x <- gsub("\\n", "", x)
  x <- gsub("person", "", x)
  x <- gsub("\"", "", x)
  x <- gsub("\\(", "", x)
  x <- gsub("\\)", "", x)
  x
}
wstrim <- function(x) {
  lapply(x, trimws)
}
#pdb <- tools:::CRAN_package_db()
#saveRDS(pdb, "pdb.rds")
pdb <- readRDS("data/pdb.rds")
if (file.exists('test.h5')) file.remove('test.h5')
```
## Why is it so hard?
- Data Science is an interdisciplinary field.
- Most scientists care more about methods than code.
- Most engineers care more about code than methods.
- Psychological barriers keep people from collaborating.
##
<img src="img/The Righteous Mind.jpg" style="height: 90vh;">
## Why is it so important?
- Review of models and code improves overall quality.
- Collaboration can generate new ideas.
- Network effects arise when more people work together efficiently.
## Network Effects
<!-- Positive:
- Code is well managed and reusable.
- Team works efficiently together.
- Ideas are generated on a continuous basis.
-->
<!-- Negative:
- Spaghetti code
- Teams work against each other
- Ideas are not well discussed, spurious
-->
```{r, echo=FALSE}
f1 <- list(
  family = "Arial, sans-serif",
  size = 18,
  color = "white"
)
flegend <- list(
  color = "white"
)
xaxt <- list(
  title = "Number of Data Scientists",
  ticks = '',
  showticklabels = FALSE,
  titlefont = f1
)
yaxt <- list(
  title = "Productivity",
  ticks = '',
  showticklabels = FALSE,
  titlefont = f1
)
x <- 1:10
y.pos <- 1:10 * (1.02)^(1:10)
y.neg <- 1:10 * (1.02)^-(1:10)
plot_ly(x = x, y = x, type = 'scatter', mode = 'lines',
        name = 'Linear Productivity') %>%
  add_lines(x = x, y = y.neg, name = 'Negative Network Effect',
            fill = 'tonexty', line = list(color = "red")) %>%
  add_lines(x = c(x, 10), y = c(y.pos, 10), name = 'Positive Network Effect',
            fill = 'toself', line = list(color = "green")) %>%
  layout(xaxis = xaxt, yaxis = yaxt,
         paper_bgcolor = 'rgba(0,0,0,0)', plot_bgcolor = 'rgba(0,0,0,0)',
         legend = list(x = 0.1, y = 0.9, font = flegend))
```
## Improving Network Effects
- How can code be managed to have positive network effects?
- How can teams communicate and collaborate efficiently?

Case study: The CRAN package repository
## CRAN Packages Published
```{r, echo=FALSE, warning=FALSE}
load("data/pkgs.rda")
library(ggplot2)
title <- sprintf("CRAN Packages Published as of %s", format(max(pkgs$Date), "%m/%Y"))
xaxt <- list(
  title = "",
  titlefont = f1,
  tickfont = f1,
  color = "white"
)
yaxt <- list(
  title = "Number of Packages Published",
  titlefont = f1,
  tickfont = f1,
  color = "white"
)
plot_ly(pkgs, x = ~Date, y = ~published, type = 'scatter', mode = 'lines',
        name = 'Packages Published', fill = 'tonexty') %>%
  layout(xaxis = xaxt, yaxis = yaxt,
         paper_bgcolor = 'rgba(0,0,0,0)', plot_bgcolor = 'rgba(0,0,0,0)',
         legend = list(x = 0.1, y = 0.9, font = flegend))
```
##
```{r, eval=TRUE, echo=FALSE, warning=FALSE, error=FALSE}
# TODO: Remove packages already shipped with R
imports <- pdb$Imports %>% removeBraces %>% splitAndComma %>% wstrim
depends <- pdb$Depends %>% removeBraces %>% splitAndComma %>% wstrim
# Combine Imports and Depends per package, dropping missing entries and R itself
impdep <- lapply(seq_along(imports), function(x) {
  out <- c(imports[[x]], depends[[x]])
  out <- out[!is.na(out)]
  out <- out[out != "R"]
  out
})
names(impdep) <- pdb$Package
# Long format: one row per (package L1, dependency value) pair
impdep.df <- melt(impdep)
impdep.df$value.idx <- match(impdep.df$value, names(impdep))
impdep.df$L1.idx <- match(impdep.df$L1, names(impdep))
impdep.df <- na.omit(impdep.df)
#pkg_stats <- read_csv("data/2018-09-03.csv", col_types = cols())
#saveRDS(pkg_stats, file = "data/pkg_stats.rds")
#pkg_stats <- read_rds("data/pkg_stats.rds")
toppkg <- read_rds("data/pkg_stats.rds") %>%
  group_by(package) %>%
  summarise(ndownloads = n())
toppkglist <- toppkg %>% arrange(desc(ndownloads)) %>%
  select(package) %>% unlist(use.names = FALSE)
# Keep the dependencies of the 300 most downloaded packages
nodes.df <- impdep.df %>% filter(L1 %in% head(toppkglist, 300))
# Count reverse dependencies per package and bin them on a log scale for colouring
reversedep <- impdep.df %>% group_by(value) %>% summarise(dep = n())
nclasses <- 9
reversedep$interval <- cut(log(reversedep$dep), breaks = nclasses)
nodes.df.join <- nodes.df %>% inner_join(reversedep, by = c("L1" = "value"))
hcols <- rev(RColorBrewer::brewer.pal(nclasses, "YlGnBu"))
nodes.df.join$cols <- hcols[as.numeric(nodes.df.join$interval)]
nodes <- data.frame(id = nodes.df.join$L1.idx,
                    label = nodes.df.join$L1,
                    color = nodes.df.join$cols)
edges <- data.frame(from = nodes.df.join$L1.idx, to = nodes.df.join$value.idx)
visNetwork::visNetwork(unique(nodes), edges) %>%
  visInteraction(dragNodes = TRUE, dragView = TRUE, zoomView = TRUE) %>%
  visNodes(font = list(color = "white")) %>%
  visLayout(improvedLayout = FALSE)
```
## Authors per Package
```{r, echo=FALSE}
authors <- pdb$Author
authors.split <- authors %>% removeArtifacts %>% removeBraces %>% splitAndComma %>% wstrim
pdb$NumAuthors <- sapply(authors.split, length)
pdb$NumAuthors <- factor(ifelse(pdb$NumAuthors <= 10, pdb$NumAuthors, "More"), levels = c(1:10, "More"))
xaxt <- list(
  title = "",
  titlefont = f1,
  tickfont = f1,
  color = "white"
  #type="log"
)
yaxt <- list(
  title = "Number of Packages Published",
  titlefont = f1,
  tickfont = f1,
  color = "white"
)
plot_ly(x = pdb$NumAuthors, type = "histogram") %>%
  layout(xaxis = xaxt, yaxis = yaxt,
         paper_bgcolor = 'rgba(0,0,0,0)', plot_bgcolor = 'rgba(0,0,0,0)',
         legend = list(x = 0.1, y = 0.9, font = flegend))
#p <- ggplot(pdb, aes(x = NumAuthors)) + geom_histogram(bins = 10) + scale_x_log10() + theme_minimal() +
```
## Package Redundancy
- Lack of communication between authors can lead to redundant packages.
- Redundancy is not helpful for infrastructure packages.
- Example: Excel packages for R
- Example: HDF5 package development
##
```{r, eval=TRUE, echo=FALSE, warning=FALSE, error=FALSE}
networkfun <- function(pkg) {
  filter.excel <- (impdep.df$L1 %in% pkg) | (impdep.df$value %in% pkg)
  nodes.df.excel <- impdep.df[filter.excel, ]
  nodes.df.excel$cols <- ifelse(nodes.df.excel$L1 %in% pkg, "red", "green")
  nodes <- data.frame(id = nodes.df.excel$L1.idx,
                      label = nodes.df.excel$L1,
                      color = nodes.df.excel$cols)
  edges <- data.frame(from = nodes.df.excel$L1.idx, to = nodes.df.excel$value.idx)
  visNetwork::visNetwork(unique(nodes), edges) %>%
    visInteraction(dragNodes = TRUE, dragView = TRUE, zoomView = TRUE) %>%
    visNodes(font = list(color = "white")) %>%
    visLayout(improvedLayout = FALSE)
}
excelpkg <- c('readxl', 'xlsx', 'openxlsx', 'XLConnect', 'gdata')
networkfun(excelpkg)
```
## HDF5 packages
- HDF5[^1] stores large amounts of data, e.g. tick data
- Unsatisfied with rhdf5, hdf5, h5r, ... → h5
![HDF5 data exchange](img/data_exchange3.png){height=100%}
<!-- ![HDF5 data exchange](img/data_exchange2.png) -->
[^1]: Developed by NCSA and the tri-labs, supported by NASA
##
> 2 years ago ...
- Presentation of **h5** at R/Finance 2016
- **Rcpp** used to interface with the HDF5 C++ API
- Basic HDF5 features implemented
## ... 2 months later ...
On June 21, 2016 Holger wrote:
> ... my name is Holger Hoefling, I have developed a new version of a wrapper library for hdf5 (R6 Classes, almost all function calls wrapped, full support for all datatypes including tables etc) ...
## And I replied:
On June 21, 2016 Mario wrote:
> sounds interesting!
<!-- > - is it already open source/on github? would like to take a look... -->
<!-- > - are you planning to release it on cran? -->
<!-- > - are you building against the c++ api? -->
## What's different in hdf5r?
- Automatic code generation against HDF5 C API
- Usage of **R6** (instead of **S4**) classes
- Close connections during garbage collection
- Broad coverage of low-level library features
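
The garbage-collection point above can be illustrated in base R. This is a minimal sketch and not hdf5r code: `open_handle()`, `close_handle()` and `closed_paths` are hypothetical names, and a plain environment stands in for a file handle. R6 objects are environments as well, so a finalizer registered this way approximates how a handle could be closed once nothing refers to it any more.

```r
# Hypothetical stand-in for a file handle; not the hdf5r API.
closed_paths <- character(0)

# Cleanup routine: marks the handle as closed and records its path.
close_handle <- function(handle) {
  handle$is_open <- FALSE
  closed_paths <<- c(closed_paths, handle$path)
}

# Create a handle and register the cleanup to run on garbage collection
# (onexit = TRUE also runs it when the R session ends).
open_handle <- function(path) {
  handle <- new.env()
  handle$path <- path
  handle$is_open <- TRUE
  reg.finalizer(handle, close_handle, onexit = TRUE)
  handle
}

h <- open_handle("test.h5")
rm(h)            # drop the last reference to the handle
invisible(gc())  # a collection triggers the registered finalizer
```
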
## Merging codebases
- Maintain high-level interface and test cases from **h5**
- Get low-level HDF5 support within R
![Merge Git](img/merge-git_white.png){width=200px}
##
On Oct 10, 2016 Holger wrote:
> thanks - merged!
## The Joys of Collaboration
(after overcoming psychological barriers)
- Code reviews
- Higher-quality code
- End product of higher quality than the separate packages.
##
> Q: How can code be managed to have positive network effects?
- Put it into a reusable package.
- Continuous code reviews and tests.
- Use a transparent platform where the code can be inspected.
##
> Q: How can teams communicate and collaborate efficiently?
- Have the right tools and mindset in place.
- Incentivise collaborative efforts.
- Accept unexpected hypotheses and failures.
- Stay open-minded.
## Collaboration Torvalds Style
Tools used: E-mail, Git
![Merge Git](img/linus_screenshot.png)
https://www.youtube.com/watch?v=LE0JtUeyVJA
## {data-background-iframe="https://github.com/r-lib"}
## {data-background-iframe="https://www.quantargo.com"}
## Thank you!
<!-- TODO: Insert font-awesome icons for homepage, coffee?? -->
Check out our homepage at
https://www.quantargo.com
Presentation source available at
https://github.com/Quantargo/data-science-collaboration