---
title: "Navigating through the R packages for movement: Supporting information"
author: "Rocio Joo, Matthew E. Boone, Thomas A. Clay, Samantha C. Patrick, Susana Clusella-Trullas, and Mathieu Basille."
date: "May 16, 2019"
output:
github_document:
toc: true
toc_depth: 3
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE,
fig.path = "figures/")
library("igraph")
library("scales")
library("Matrix")
library("dplyr")
library("cowplot")
library("viridis")
library("reshape")
library("RColorBrewer")
## library("kableExtra")
library("ggrepel")
library("printr")
```
```{r data}
data_dir <- "data/"
data <- read.csv(paste0(data_dir, "survey-responses.csv"), stringsAsFactors = FALSE)
data_all <- data %>% filter(completion == 100)
packages <- read.csv(paste0(data_dir, "pkg-list-survey.csv"),
stringsAsFactors = FALSE)
data_question_1 <- data_all[, grep("q1", colnames(data_all))]
colnames(data_question_1) <- t(packages)
# dropping trajr that was added in the end and only got 1
# response
data_question_1 <- data_question_1[, -grep("trajr", colnames(data_question_1))]
packages_new <- packages[-which(packages == "trajr"), ]
Total <- sapply(1:dim(data_question_1)[1], function(x) {
sum(as.numeric(data_question_1[x, ] != "Never"))
})
discard_rows <- which(Total == 0)
data_all <- data_all[-discard_rows, ]
# Table with information of all packages from the survey
pkg_info <- read.csv(paste0(data_dir, "pkg-info.csv"), stringsAsFactors = FALSE)
```
[![DOI](https://zenodo.org/badge/153035894.svg)](https://zenodo.org/badge/latestdoi/153035894)
## Overview
This repository is a companion to the manuscript "*Navigating through
the R packages for movement: a review for users and developers*", from
Rocio Joo, Matthew E. Boone, Thomas A. Clay, Samantha C. Patrick,
Susana Clusella-Trullas, and Mathieu Basille (pre-print available on
[arXiv.org](https://arxiv.org/abs/1901.05935)). This document is a
dynamic R report, for which RMarkdown sources with full code are
available [here](README.Rmd). The repository also stores data about:
1. Information on [74 R packages](data/pkg-info.csv) related to
tracking data processing and analysis, collected between March and
August 2018. **58** of the packages were described in the review, and
**72** were the focus of a survey asking their users about their use,
their relevance and the quality of their documentation (see
[packages included in the survey](#packages-included-in-the-survey)
for more details). Additional details about this data file are available
[here](data/README_pkg-info.md).
2. [Responses to an anonymous survey](data/survey-responses.csv) about
the use, relevance and quality of the documentation of 72 packages
related to movement. The survey was conducted in the fall of 2018.
Additional details about this data file are available
[here](data/README_survey-responses.md).
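Both data files are plain CSV files and can be read directly from a
clone of this repository, for example:
```{r load-data-example, echo = TRUE, eval = FALSE}
## Read the two data files shipped in this repository
pkg_info <- read.csv("data/pkg-info.csv", stringsAsFactors = FALSE)
survey <- read.csv("data/survey-responses.csv", stringsAsFactors = FALSE)
head(pkg_info)
```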
This repository can be cited using its DOI: 10.5281/zenodo.3066226
## A large number of R packages for movement
The manuscript presents a review of R packages for movement. R is one
of the most widely used programming languages to process, visualize
and analyze data from tracking devices. The large number of existing
packages makes it difficult to keep track of the spectrum of choices,
with more packages becoming available every year (this is
**Figure 2** of the manuscript):
```{r ms-fig-2, fig.width = 7, fig.height = 5}
# only keeping columns that we will use
data_pkg <- pkg_info[, c("Package", "Year")]
# Loading the names of packages we include in the review
mov_pac <- read.csv(paste0(data_dir, "pkg-list-paper.csv"), stringsAsFactors = FALSE)
mov_pac[mov_pac == "SGAT/TripEstimation"] <- "SGAT"
mov_pac[mov_pac == "TwGeos/BAStag"] <- "TwGeos"
mov_pac[mov_pac == "ukfsst/kfsst"] <- "ukfsst"
# Subsetting by packages in review
ind.mov.pac <- match(mov_pac$Package, data_pkg$Package)
data_pkg <- data_pkg[ind.mov.pac, ]
# counting packages per year
theTable <- as.data.frame(table(data_pkg$Year, useNA = "no"))
colnames(theTable) <- c("Year", "Total")
# convert Year from factor back to numeric, so that years without
# any package publication can be filled in below
theTable$Year <- as.numeric(levels(theTable$Year))
# filter out 2018 which is not complete
theTable <- theTable %>% filter(Year < 2018)
range_year <- range(theTable$Year)
values_year <- range_year[1]:range_year[2]
missing_years <- setdiff(values_year, theTable$Year)
theTable <- rbind.data.frame(cbind.data.frame(Year = missing_years,
Total = rep(0, length(missing_years))), theTable)
theTable$Year <- factor(theTable$Year, levels = sort(theTable$Year))
ggplot(theTable, aes(x = Year, y = Total)) +
geom_bar(stat = "identity", position = "identity") +
xlab("Year of publication") + ylab("Number of packages") +
scale_y_continuous(minor_breaks = seq(0, 12, 1), breaks = seq(0,
12, 3)) +
background_grid(major = "y", minor = "none", colour.major = "grey80",
size.major = 0.5)
## ggsave("figures/Fig2.pdf", width = 7, height = 5)
```
Since the packages were reviewed between March and August 2018, the
year 2018 was incomplete and therefore not included in the graph.
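As a quick complement to the bar plot, the cumulative number of
tracking packages can be derived from the same yearly counts (a
minimal sketch using the `theTable` object built above):
```{r cumulative-packages, echo = TRUE}
## Cumulative number of tracking packages published up to each year,
## from the yearly totals computed for Figure 2
yearly <- theTable[order(as.numeric(as.character(theTable$Year))), ]
cumsum(setNames(yearly$Total, as.character(yearly$Year)))
```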
Many packages are not connected to each other, revealing a very
fragmented landscape of tracking packages in R. Here we show a network
representation of the dependencies and suggestions between tracking
packages (this is **Figure 4** of the manuscript). Arrows point toward
the package that others suggest (dashed arrows) or depend on (solid
arrows). Bold font corresponds to active packages. The size of each
circle is proportional to the number of packages that suggest or
depend on it.
```{r ms-fig-4, fig.width = 12, fig.height = 12}
# loading the import + suggest information for each package
imports <- read.csv(paste0(data_dir, "pkg-import-suggest.csv"),
stringsAsFactors = FALSE)
# getting the names of all packages that participate here as
# a dependent or dependency (or suggestion)
packages_dep <- unique(c(imports$Package, imports$Dependency))
imports$Dependency[which(imports$Dependency == "-")] <- NA
# accommodating everything as a matrix of counts
packages_dep <- packages_dep[-which(packages_dep == "-")] # excluding packages with neither dependencies nor suggestions
pack.id <- 1:length(packages_dep)
table.counts <- as.data.frame.matrix(table(imports$Package, imports$Dependency)) # table of counts with rows as list of packages and columns the packages they depend on/suggest
# but we only want to account for tracking packages.
# So in the end, we want a matrix of counts which would be
# square, with rows and columns of tracking packages.
# Problem? Not all tracking packages are counted as
# dependencies or suggestions of other packages, so they are
# not in the columns. We are going to add the missing ones
# first (with zeros) and then remove the columns that should
# not be there
new_col <- setdiff(t(mov_pac), colnames(table.counts))
zero_matrix <- matrix(0, ncol = length(new_col), nrow = dim(table.counts)[1])
row.names(zero_matrix) <- row.names(table.counts)
colnames(zero_matrix) <- new_col
table.counts <- cbind.data.frame(table.counts, zero_matrix)
ind.mov.row <- match(t(mov_pac), row.names(table.counts))
ind.mov.col <- match(t(mov_pac), colnames(table.counts))
table.counts <- table.counts[ind.mov.row, ind.mov.col]
# making sure that the order is right
col.names.df <- colnames(table.counts)
row.names.df <- row.names(table.counts)
table.counts <- table.counts[order(row.names.df, decreasing = FALSE),
order(col.names.df, decreasing = FALSE)]
# only keeping columns that we will use
data_pkg <- pkg_info[, c("Package", "Active")]
data_pkg$Package[data_pkg$Package == "SGAT/TripEstimation"] <- "SGAT"
data_pkg$Package[data_pkg$Package == "TwGeos/BAStag"] <- "TwGeos"
data_pkg$Package[data_pkg$Package == "ukfsst/kfsst"] <- "ukfsst"
ind.mov.cran <- match(t(mov_pac), data_pkg$Package)
data_pkg <- data_pkg[ind.mov.cran, ]
table.mov <- t(table.counts) # transposing for plotting
num_sugg <- apply(table.mov, 1, sum) # total number of dep/sugg
g.matrix <- graph.adjacency(t(as.matrix(table.mov)), weighted = TRUE,
mode = "directed", diag = FALSE)
g <- simplify(g.matrix)
font_text <- rep(1, nrow(data_pkg))
font_text[data_pkg$Active == "Yes"] <- 2 # bold for active packages
color_back <- "white" #alpha('snow3',data_pkg$downloads/max(data_pkg$downloads))
# now, we want to make dashed and more transparent arrows for
# suggestion but darker and solid arrows for import
el <- as_edgelist(g)
edges_sugg <- data.frame(suggesting = V(g)[el[, 1]]$name,
    suggested = V(g)[el[, 2]]$name, type = "suggestion",
    stringsAsFactors = FALSE) # plain data.frame instead of deprecated dplyr::data_frame
imports <- read.csv(paste0(data_dir, "pkg-import.csv"), stringsAsFactors = FALSE) # we need an only import file
imports <- imports[imports$Package %in% mov_pac$Package, ]
for (i in 1:nrow(edges_sugg)) {
ind <- (imports$Package %in% as.character(edges_sugg$suggesting[i])) +
(imports$Dependency %in% as.character(edges_sugg$suggested[i]))
if (any(ind == 2)) {
edges_sugg$type[i] <- "dependency"
}
}
line_type <- rep(2, length(E(g)))
line_type[edges_sugg$type == "dependency"] <- 1
line_color <- rep(alpha("#ef8a62", 0.5), length(E(g)))
line_color[edges_sugg$type == "dependency"] <- alpha("#ef8a62",
0.8)
# General options for plotting.
V(g)$label.family <- "Helvetica"
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
layout1 <- layout.fruchterman.reingold(g)
V(g)$label.color <- "darkblue"
V(g)$label.font <- font_text
V(g)$frame.color <- alpha("black", 0.7)
V(g)$label.cex <- 1.5
V(g)$color <- color_back
V(g)$size <- 40 * num_sugg/sum(num_sugg)
egam <- 3
E(g)$width <- egam * 1.5
E(g)$arrow.size <- egam/3
E(g)$lty <- line_type
E(g)$color <- line_color
# pdf('NetworkImportSuggestTrack.pdf',width = 18,height = 16)
plot(g, layout = layout1, vertex.label.dist = 0.5)
# dev.off()
```
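As a quick numeric complement to the network, the packages most often
imported or suggested by other tracking packages can be listed from
the `num_sugg` vector computed above:
```{r top-dependencies, echo = TRUE}
## Tracking packages ranked by how many other tracking packages
## depend on or suggest them (this drives circle size in the network)
head(sort(num_sugg, decreasing = TRUE))
```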
## The survey
Our review aimed to provide an objective introduction to the packages,
organized by the type of processing or analysis they focus on, and to
give developers feedback from a user perspective. For the second
objective, we designed a survey asking package users:
1. How popular those packages are;
2. How well documented they are;
3. How relevant they are for users.
Those were the three questions that we asked about the packages, plus
one about the survey participant's level as an R user. In the review
we showed results regarding package documentation. In the following,
we present the complete results of the survey.
### Packages included in the survey
In theory, almost any package could be useful for movement analysis:
a time series package, a spatial analysis one, or even `ggplot2` to
make more beautiful graphics! For the review, we considered only what
we referred to as **tracking packages**. Tracking packages were those
created either to analyze tracking data (i.e. $(x,y,t)$: locations in
space and time) or to transform data from tagging devices into proper
tracking data; see the toy example below. For instance, a package that
uses accelerometer, gyroscope and magnetometer data to reconstruct an
animal's trajectory via path integration, thus transforming those data
into an $(x,y,t)$ format, fits the definition; a package analyzing
accelerometry series to detect changes in behavior does not.
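To make the $(x,y,t)$ format concrete, here is a minimal, entirely
made-up track: two-dimensional coordinates plus a timestamp for each
relocation.
```{r xyt-example, echo = TRUE}
## A toy (x, y, t) track; values are invented purely for illustration
toy_track <- data.frame(
    x = c(0.0, 1.2, 2.5, 3.1),
    y = c(0.0, 0.8, 1.1, 2.4),
    t = as.POSIXct("2018-07-01 00:00", tz = "UTC") + 3600 * 0:3)
toy_track
```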
For this survey, we added packages that, though not tracking packages
*per se*, were created to process or analyze data extracted from
tracking devices in other formats (e.g. `accelerometry` for
accelerometry data, `diveMove` for time-depth recorders, or
`pathtrackr` for video tracking data). Packages from any public
repository (e.g. CRAN, GitHub, R-forge) were included in the
survey. Packages created for eye, computer-mouse or fishing-vessel
movement were not considered here. A couple of packages that were
ultimately discarded from the review, either because they were in
early stages of development (`movement`) or because they had been
archived on CRAN due to unfixed problems (`sigloc`), were still
included in the survey. Two packages, `lsmnsd` and `segclust2d`, were
added to an updated version of the review but did not make it in time
for the survey. The package `trajr` was added to the survey at a late
stage and, because of that and the fact that it got only one response,
was filtered out of the analysis.
A total of 72 packages were included in this survey: `acc`,
`accelerometry`, `adehabitatHR`, `adehabitatHS`, `adehabitatLT`,
`amt`, `animalTrack`, `anipaths`, `argosfilter`, `argosTrack`,
`BayesianAnimalTracker`, `BBMM`, `bcpa`, `bsam`, `caribou`, `crawl`,
`ctmcmove`, `ctmm`, `diveMove`, `drtracker`, `EMbC`, `feedr`,
`FLightR`, `GeoLight`, `GGIR`, `hab`, `HMMoce`, `Kftrack`, `m2b`,
`marcher`, `migrateR`, `mkde`, `momentuHMM`, `move`, `moveHMM`,
`movement`, `movementAnalysis`, `moveNT`, `moveVis`, `moveWindSpeed`,
`nparACT`, `pathtrackr`, `pawacc`, `PhysicalActivity`, `probgls`,
`rbl`, `recurse`, `rhr`, `rpostgisLT`, `rsMove`, `SDLfilter`,
`SGAT/TripEstimation`, `sigloc`, `SimilarityMeasures`, `SiMRiv`,
`smam`, `SwimR`, `T-LoCoH`, `telemetr`, `trackdem`, `trackeR`,
`Trackit`, `TrackReconstruction`, `TrajDataMining`, `trajectories`,
`trip`, `TwGeos`/`BAStag`, `TwilightFree`, `ukfsst`/`kfsst`, `VTrack`
and `wildlifeDI`.
### Participation in the survey
The survey was designed to be completely anonymous, meaning that we
had no way to know who participated. There was no prior selection of
the participants and no probabilistic sampling was involved. The
survey was advertised on Twitter, on mailing lists (r-sig-geo and
r-sig-ecology), through individual emails to researchers, and on the
[lab's website](https://mablab.org/post/2018-08-31-r-movement-review/).
The survey was granted an exemption by the Institutional Review Board
at the University of Florida (IRB02 Office, Box 112250, University of
Florida, Gainesville, FL 32611-2250).
A total of `r data %>% filter(!is.na(completion)) %>% nrow()` people
participated in the survey, and `r data_all %>% nrow()` answered all
four questions. To answer all questions, a participant had to have
tried at least one of the packages. In the following sections, we
analyze only completed surveys.
### Survey representativeness
To get a rough idea of how representative the survey was of the
population of package users, we compared the number of participants
who used each package to the average number of monthly downloads of
each package.
The numbers of downloads were calculated using the R package
`cran.stats`, which computes the number of independent downloads of
each package per day (subtracting downloads triggered by
dependencies). It only provides statistics for packages on CRAN,
downloaded via the RStudio CRAN mirror; total downloads are likely to
be an order of magnitude higher. We computed the average number of
downloads per month, from September 2017 to August 2018; fewer months
were considered for packages that were younger than one year old.
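As a rough cross-check, the `cranlogs` package (on CRAN) provides raw
daily download counts from the RStudio mirror, without the dependency
adjustment that `cran.stats` performs; a minimal sketch, with two
packages picked here purely for illustration:
```{r cranlogs-example, echo = TRUE, eval = FALSE}
## Raw daily downloads from the RStudio CRAN mirror via 'cranlogs'
## (no correction for downloads triggered by dependencies)
library("cranlogs")
dl <- cran_downloads(packages = c("moveHMM", "momentuHMM"),
    from = "2017-09-01", to = "2018-08-31")
## average downloads per month over the 12-month window
aggregate(count ~ package, data = dl, FUN = function(x) sum(x)/12)
```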
There is no perfect match between the number of users and the number
of downloads per package, but a correlation of 0.85 for the 49
packages on CRAN provides evidence that the survey represented the
users of tracking packages reasonably well. Moreover, most of the
packages with very few users in the survey despite relatively high
download statistics were accelerometry packages for human patients,
suggesting that we did not reach that subpopulation of users through
Twitter and emails.
A log-log plot of both metrics is shown in the figure below.
```{r representativity, fig.width = 14, fig.height = 8}
data_question <- data_all[, grep("q1", colnames(data_all))]
colnames(data_question) <- t(packages)
# I'm dropping trajr that was added in the end and only got 1
# response
data_question <- data_question[, -grep("trajr", colnames(data_question))]
packages_new <- packages[-which(packages == "trajr"), ]
categories <- c("Never", "Rarely", "Sometimes", "Often")
use_counts <- t(sapply(1:ncol(data_question), function(x) {
data_line <- factor(data_question[, x], levels = categories)
count_p_use <- as.numeric(table(data_line))
return(count_p_use)
}))
use_counts <- data.frame(use_counts)
colnames(use_counts) <- categories
rownames(use_counts) <- t(packages_new)
use_counts$Package <- row.names(use_counts)
use_counts <- use_counts %>% mutate(users = Rarely + Sometimes +
Often)
funciones <- read.csv(paste0(data_dir, "pkg-info.csv"))
matrix_fun <- left_join(use_counts, funciones)
matrix_fun <- matrix_fun[(!is.na(matrix_fun$monthly.downloads)),
]
ggplot(matrix_fun, aes(x = users, y = monthly.downloads)) +
geom_point(size = 3) +
geom_text_repel(label = matrix_fun$Package, segment.size = 0.6,
force = 3, segment.alpha = 0.5, hjust = 0, box.padding = 0.5,
min.segment.length = 0.1, size = 6, direction = "both") +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10") +
xlab("Number of users") + ylab("Monthly downloads")
```
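The correlation reported above can be recomputed from the `matrix_fun`
object built in this chunk; a minimal sketch (on raw counts; packages
with zero survey users would need special handling on a log scale):
```{r downloads-correlation, echo = TRUE, eval = FALSE}
## Pearson correlation between the number of survey users and the
## average monthly downloads, for the CRAN packages in 'matrix_fun'
with(matrix_fun, cor(users, monthly.downloads))
```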
## The questions
### User level
Let's first look at the participants' level of R use. The options
were:
* Beginner: You only use existing packages and occasionally write some
lines of code.
* Intermediate: You use existing packages but you also write and
optimize your own functions.
* Advanced: You commonly use version control or contribute to the
development of packages.
```{r user-experience, fig.width=7, fig.height=5, fig.cap="Level of R use"}
data_question_4 <- data_all[, grep("q4", colnames(data_all))]
categories <- c("Beginner", "Intermediate", "Advanced")
data_question_4 <- factor(unlist(lapply(strsplit(data_question_4,
"\\:"), "[[", 1)), levels = categories)
use_counts <- as.numeric(table(data_question_4))
prop <- round(as.numeric(prop.table(table(data_question_4))) *
100, 1)
use_counts <- data.frame(levels = categories, total = use_counts)
use_counts$levels <- factor(as.character(use_counts$levels),
levels = categories)
ggplot(data = use_counts, aes(x = levels, y = total)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_text(aes(label = total), size = 6, vjust = 1.5, hjust = .5,
color = "white") +
xlab("Level of use") + ylab("Total respondents")
```
Most participants considered themselves intermediate R users
(`r prop[2]`%), meaning that they could write their own functions. The
others were beginners (`r prop[1]`%) or advanced (`r prop[3]`%) users.
### Package use
The first question about package use was: How often do you use each of
these packages? (Never, Rarely, Sometimes, Often)
The bar plots below show that most packages were unknown to (or at
least had never been used by) the survey participants. The
`adehabitat` packages (HR, LT and HS) were the most used
packages. These packages provide collections of tools to estimate
home ranges, handle and analyze trajectories, and analyze habitat
selection, respectively. At the bottom of the plot, `smam` (for
animal movement models), `PhysicalActivity`, `nparACT`, `GGIR` (these
three for accelerometry data on human patients) and `feedr` (to handle
radio telemetry data) had no users among the participants. For that
reason, those five packages do not appear in the analysis of the next
questions.
```{r relevance, fig.width = 16, fig.height = 15, fig.cap = "Package use"}
data_question <- data_all[, grep("q1", colnames(data_all))]
colnames(data_question) <- t(packages)
# I'm dropping trajr that was added in the end and only got 1
# response
data_question <- data_question[, -grep("trajr", colnames(data_question))]
packages_new <- packages[-which(packages == "trajr"), ]
categories <- c("Never", "Rarely", "Sometimes", "Often")
use_counts <- t(sapply(1:ncol(data_question), function(x) {
data_line <- factor(data_question[, x], levels = categories)
count_p_use <- as.numeric(table(data_line))
return(count_p_use)
}))
use_counts <- data.frame(use_counts)
colnames(use_counts) <- categories
rownames(use_counts) <- t(packages_new)
use_counts$package <- row.names(use_counts)
df1 <- melt(use_counts, id.vars = "package", variable_name = "response")
g <- unlist(by(df1, df1$package, function(x) sum(x$value[x$response !=
"Never"])))
df1$package <- factor(df1$package, levels = names(sort(g, decreasing = FALSE)))
color.pallete <- brewer.pal(5, "YlGnBu")
color.pallete[1] <- "lightgray"
ggplot(data = df1) +
geom_col(aes(x = package, y = value, fill = response)) +
coord_flip() +
scale_fill_manual(values = color.pallete) +
ylab("Count")
```
If you want to check the numbers for specific packages, the complete
table is below:
```{r relevance-table}
use_counts[, 1:length(categories)]
## kable(use_counts[, 1:length(categories)]) %>%
## kable_styling(bootstrap_options = c("striped", "hover",
## "condensed", "responsive"))
```
There is actually not much difference in the number of packages used
across the levels of R users, as the boxplots below show:
```{r boxplot-users-packages, fig.width = 7, fig.height = 5, fig.cap = "Packages per user level"}
Total <- Total[-discard_rows]
categories <- c("Beginner", "Intermediate", "Advanced")
new_df <- cbind.data.frame(Total, user = data_question_4)
new_df$user <- factor(as.character(new_df$user), levels = categories)
ggplot(new_df, aes(x = user, y = Total)) +
geom_boxplot() +
scale_y_continuous(breaks = 0:max(new_df$Total)) +
xlab("User level") + ylab("Package count")
```
### Package documentation
Without proper user testing and peer editing, package documentation
can leave large gaps in understanding and limit the usefulness of the
package. If functions and workflows are not expressly defined, a
package's capacity to help users is undermined.
In this survey, we asked the participants how helpful the
documentation was for each of the packages they reported using.
Documentation includes what is contained in the manual and help
pages, vignettes, published manuscripts, and other material about the
package provided by the authors. The participants had to answer using
one of the following options:
* Not enough: "It's not enough to let me know how to do what I need"
* Basic: "It's enough to let me get started with simple use of the
functions but not to go further (e.g. use all arguments in the
functions, or put extra variables)"
* Good: "I did everything I wanted and needed to do with it"
* Excellent: "I ended up doing even more than what I planned because
of the excellent information in the documentation"
* Don't remember: "I honestly can't remember…"
```{r documentation, fig.width = 16, fig.height = 15, fig.cap = "Bar plots of absolute frequency of each category of package documentation"}
data_question <- data_all[, grep("q2", colnames(data_all))]
colnames(data_question) <- t(packages)
# I'm dropping trajr that was added in the end and only got 1
# response
data_question <- data_question[, -grep("trajr", colnames(data_question))]
packages_new <- packages[-which(packages == "trajr"), ]
categories <- c("Not enough", "Basic", "Good", "Excellent", "Don't remember")
use_counts <- t(sapply(1:ncol(data_question), function(x) {
data_line <- factor(data_question[, x], levels = categories)
count_p_use <- as.numeric(table(data_line, useNA = "no"))
return(count_p_use)
}))
use_counts <- data.frame(use_counts)
colnames(use_counts) <- categories
rownames(use_counts) <- t(packages_new)
total_package <- rowSums(use_counts)
use_counts$package <- row.names(use_counts)
use_counts <- use_counts[total_package > 0, ]
df1 <- melt(use_counts, id.vars = "package", variable_name = "response")
g <- unlist(by(df1, df1$package, function(x) sum(x$value)))
color.pallete <- brewer.pal(5, "YlGnBu")
color.pallete[1] <- "lightgray"
df1$package <- factor(df1$package, levels = names(sort(g, decreasing = FALSE)))
df1$response <- factor(df1$response, levels = rev(c("Excellent",
"Good", "Basic", "Not enough", "Don't remember")))
ggplot(data = df1) +
geom_col(aes(x = package, y = value, fill = response)) +
coord_flip() +
scale_fill_manual(values = color.pallete) +
background_grid(major = "x", minor = "x", colour.major = "grey80",
colour.minor = "grey80", size.major = 0.5) +
ylab("Count")
```
```{r usage-counts}
df2 <- use_counts[, c("Not enough", "Basic", "Good", "Excellent")]
use_per <- df2/apply(df2, 1, sum) * 100
use_per$good_excellent <- signif(apply(use_per[, c("Good", "Excellent")],
1, sum), 4)
use_per$counts <- apply(df2[, c("Good", "Excellent")], 1, sum)
```
Remember that participants could only give their opinion on the
documentation of packages they had used. Hence, the packages with
many users got many documentation answers. The figure above allows a
closer look at the proportion of each type of response for each
package.
To identify packages with remarkably good documentation, let's first
consider only those packages with at least 10 responses on the
quality of documentation (not counting "Don't remember" answers).
There are 26 such packages (see the table of responses below). Among
them, `momentuHMM` had more than 50% of the responses
(`r round(use_per["momentuHMM","Excellent"],2)`%;
`r use_counts["momentuHMM","Excellent"]`) as "excellent documentation",
meaning that the documentation was so good that, thanks to it, more
than half of its users discovered additional features of the package
and were able to do more analyses than they initially
planned. Moreover, 10 packages had more than 75% of the responses as
either "good" or "excellent":
`momentuHMM` (`r use_per["momentuHMM","good_excellent"]`%;
`r use_per["momentuHMM","counts"]`),
`moveHMM` (`r use_per["moveHMM","good_excellent"]`%;
`r use_per["moveHMM","counts"]`),
`adehabitatLT` (`r use_per["adehabitatLT","good_excellent"]`%;
`r use_per["adehabitatLT","counts"]`),
`adehabitatHR` (`r use_per["adehabitatHR","good_excellent"]`%;
`r use_per["adehabitatHR","counts"]`),
`EMbC` (`r use_per["EMbC","good_excellent"]`%; `r use_per["EMbC","counts"]`),
`wildlifeDI` (`r use_per["wildlifeDI","good_excellent"]`%;
`r use_per["wildlifeDI","counts"]`),
`ctmm` (`r use_per["ctmm","good_excellent"]`%; `r use_per["ctmm","counts"]`),
`GeoLight` (`r use_per["GeoLight","good_excellent"]`%;
`r use_per["GeoLight","counts"]`),
`move` (`r use_per["move","good_excellent"]`%; `r use_per["move","counts"]`),
`recurse` (`r use_per["recurse","good_excellent"]`%;
`r use_per["recurse","counts"]`). The two leading packages, `momentuHMM`
and `moveHMM`, focus on the use of Hidden Markov models which allow
identifying different patterns of behavior called states.
One way to visualize the quality of documentation is to relate the
rating to the number of respondents who declared using each package
(this is **Figure 3** of the manuscript). This figure shows the
proportion of good and excellent documentation for packages with at
least 10 respondents; light green corresponds to packages with
standard documentation only, blue is for packages with vignettes, and
purple is for packages that also have peer-reviewed articles
published:
```{r ms-fig-3, fig.width = 14, fig.height = 8}
# question about documentation
data_question <- data_all[, grep("q2", colnames(data_all))]
colnames(data_question) <- t(packages)
# I'm dropping trajr that was added in the end and only got 1
# response
data_question <- data_question[, -grep("trajr", colnames(data_question))]
packages_new <- packages[-which(packages == "trajr"), ]
categories <- c("Not enough", "Basic", "Good", "Excellent", "Don't remember")
use_counts <- t(sapply(1:ncol(data_question), function(x) {
data_line <- factor(data_question[, x], levels = categories)
count_p_use <- as.numeric(table(data_line, useNA = "no"))
return(count_p_use)
}))
use_counts <- data.frame(use_counts)
colnames(use_counts) <- categories
rownames(use_counts) <- t(packages_new)
total_package <- rowSums(use_counts)
use_counts$Package <- row.names(use_counts)
use_counts <- use_counts[total_package > 0, ]
matrix_fun <- left_join(use_counts, funciones)
# not all of the packages from the survey are included in the
# review
packages_paper <- read.csv(paste0(data_dir, "pkg-list-paper.csv"))
matrix_fun <- inner_join(packages_paper, matrix_fun)
# clarifying documentation options
matrix_fun$documentation <- c("Manual")
matrix_fun$documentation[matrix_fun$Vignettes == "Yes" & matrix_fun$Papers ==
"No"] <- c("Manual+Vignette")
matrix_fun$documentation[matrix_fun$Vignettes == "Yes" & matrix_fun$Papers ==
"Yes"] <- c("Manual+Vignette+Paper")
matrix_fun$num_answers <- rowSums(matrix_fun[2:5])
matrix_fun$Good_Exc <- (rowSums(matrix_fun[4:5]))/matrix_fun$num_answers *
100
# discarding the packages we did not get users from:
matrix_fun_2 <- matrix_fun[matrix_fun$num_answers > 0, ] # missing: 'feedR' 'smam'
matrix_fun_3 <- matrix_fun[matrix_fun$num_answers >= 10, ]
ggplot(matrix_fun_3, aes(x = num_answers, y = Good_Exc, col = documentation)) +
geom_point(size = 3) +
geom_text_repel(label = matrix_fun_3$Package, segment.size = 1,
force = 1, segment.alpha = 0.4, hjust = 0, box.padding = 0.4,
min.segment.length = 0.25, size = 6) +
xlab("Number of respondents") + ylab("Good or Excellent Rating") +
## scale_colour_manual(values = color.pallete) +
scale_color_viridis(discrete = TRUE, end = .75, option = "D", direction = -1)+
background_grid(major = "xy", colour.major = "grey80") +
theme(legend.position = "none")
## ggsave("figures/Fig3.pdf", width = 14, height = 8)
```
If you want to check the numbers for specific packages, the complete
table is below:
```{r documentation-table}
use_counts[, 1:length(categories)]
## kable(use_counts[, 1:length(categories)]) %>%
## kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
### Package relevance
Participants were asked how relevant each of the packages they used
was for their work. They had to answer using one of the following
options:
* Not relevant: "I tried the package but really didn't find it a good
use for my work"
* Slightly relevant: "It helps in my work, but not for the core of it"
* Important: "It's important for the core of my work, but if it didn't
exist, there are other packages or solutions to obtain something
similar"
* Essential: "I wouldn't have done the key part of my work without
this package"
```{r importance, fig.width = 16, fig.height = 15, fig.cap = "Bar plots of absolute frequency of each category of package relevance"}
data_question <- data_all[, grep("q3", colnames(data_all))]
colnames(data_question) <- t(packages)
# I'm dropping trajr that was added in the end and only got 1
# response
data_question <- data_question[, -grep("trajr", colnames(data_question))]
packages_new <- packages[-which(packages == "trajr"), ]
categories <- c("Not relevant", "Slightly relevant", "Important",
"Essential")
use_counts <- t(sapply(1:ncol(data_question), function(x) {
data_line <- factor(data_question[, x], levels = categories)
count_p_use <- as.numeric(table(data_line, useNA = "no"))
return(count_p_use)
}))
use_counts <- data.frame(use_counts)
colnames(use_counts) <- categories
rownames(use_counts) <- t(packages_new)
total_package <- rowSums(use_counts)
use_counts$package <- row.names(use_counts)
use_counts <- use_counts[total_package > 0, ]
df1 <- melt(use_counts, id.vars = "package", variable_name = "response")
g <- unlist(by(df1, df1$package, function(x) sum(x$value)))
color.pallete <- brewer.pal(5, "YlGnBu")
color.pallete[1] <- "lightgray"
df1$package <- factor(df1$package, levels = names(sort(g, decreasing = FALSE)))
df1$response <- factor(df1$response, levels = rev(c("Essential",
"Important", "Slightly relevant", "Not relevant")))
ggplot(data = df1) +
geom_col(aes(x = package, y = value, fill = response)) +
coord_flip() +
scale_fill_manual(values = color.pallete) +
background_grid(major = "x", minor = "x", colour.major = "grey80",
colour.minor = "grey80", size.major = 0.5) +
ylab("Count")
```
```{r importance-counts}
df2 <- use_counts[, c("Not relevant", "Slightly relevant", "Important",
"Essential")]
use_per <- df2/apply(df2, 1, sum) * 100
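## Note: the 'good_excellent' name is reused from the documentation
## section; here it actually holds the Important + Essential share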
use_per$good_excellent <- signif(apply(use_per[, c("Important",
"Essential")], 1, sum), 4)
use_per$counts <- apply(df2[, c("Important", "Essential")], 1,
sum)
```
The two bar plots (above and below) show the absolute and relative
frequencies of the answers for each package, respectively. We
identified the packages that were highly relevant for their users,
considering only those packages with at least 10 responses. Among
these 33 packages, three were regarded as either "Important" or
"Essential" by more than 75% of their users:
`bsam` (`r use_per['bsam','good_excellent']`%; `r use_per['bsam','counts']`),
`adehabitatHR` (`r use_per['adehabitatHR','good_excellent']`%;
`r use_per['adehabitatHR','counts']`), and
`adehabitatLT` (`r use_per['adehabitatLT','good_excellent']`%;
`r use_per['adehabitatLT','counts']`). `bsam` allows fitting Bayesian
state-space models to animal tracking data.
```{r importance-percentage, fig.width = 16, fig.height = 10, fig.cap = "Bar plots of relative frequency of each category of package relevance (for packages with more than 5 'Important' or 'Essential' responses)"}
package_levels <- row.names(use_per[order(use_per$good_excellent,
use_per$Essential, use_per$Important), ])
use_per2 <- use_per
use_per2[is.na(use_per2)] <- 0
use_per2$package <- row.names(use_per2)
use_per2 <- subset(use_per2, counts > 5)
df1 <- melt(subset(use_per2, select = -c(good_excellent, counts)),
id.vars = "package", variable_name = "response")
df1$package <- factor(df1$package, levels = package_levels)
df1$response <- factor(df1$response, levels = c("Not relevant",
"Slightly relevant", "Important", "Essential"))
color.pallete <- brewer.pal(4, "YlGnBu")
color.pallete[1:2] <- c("lightgray", "darkgray")
ggplot(data = df1) +
geom_col(aes(x = package, y = value, fill = response)) +
coord_flip() +
scale_fill_manual(values = color.pallete) +
ylab("Percentage")
```
If you want to check the numbers for specific packages, the complete
table is below:
```{r importance-table}
use_counts[, 1:length(categories)]
## kable(use_counts[, 1:length(categories)]) %>%
## kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
## Summary
* Most packages had very few users among the participants. The vast
landscape of packages could be leading users to: 1) rely on "old"
and established packages (like the `adehabitat` family) that gather
most functions for common movement analyses, and 2) search for other
packages only when doing more specific analyses. Moreover, many
packages contain functions that other packages have implemented too
(see the review manuscript for more details), so this redundancy
could spread users across packages.
* After the `adehabitat` family of packages, several packages for
modeling animal movement (`momentuHMM`, `moveHMM`, `crawl` and
`ctmm`) proved very popular, which could indicate an increase in
research on movement models.
* Few of the packages had remarkably good documentation (>75% of
"good" or "excellent" ratings), and, at the other end of the
spectrum, a couple of packages got less than 50% of "good" or
"excellent" ratings.
* Most packages were relevant for the work of their users, which is a
positive feature!