-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathOLD R-data-analysis.Rpres
683 lines (454 loc) · 18.6 KB
/
OLD R-data-analysis.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
Intermediate data manipulation and analysis using R
========================================================
author: Ben Bond-Lamberty
date: March 2022
font-family: 'Helvetica'
autosize: true
A short workshop covering reproducibility and data management; data reshaping and joining; and summarizing and manipulation.
Virginia Commonwealth University
http://rpubs.com/bpbond/r-data-vcu
The plan
========================================================
* Reproducible research and data management (~15 minutes)
* Filtering and reshaping data (~45 minutes; the `gapminder` dataset)
* Summarizing and manipulating data (~60 minutes; the `babynames` dataset)
Feedback: <a href="mailto:bondlamberty@pnnl">[email protected]</a> or [@BenBondLamberty](https://twitter.com/BenBondLamberty).
Requirements
========================================================
This workshop assumes an intermediate knowledge of R.
If you want to do the hands-on exercises (encouraged!), make sure up-to-date versions of the following packages are installed:
* `dplyr`
* `tidyr`
* `gapminder`
* `babynames`
Note the first two constitute a [particular and popular dialect of R](http://tidyverse.org), but the principles we'll go over are broadly applicable.
Reproducibility and data management
========================================================
type: section
Reproducibility
========================================================
We are in the era of collaborative 'big data', but even if you work by yourself with 'little data' you have to have some skills to deal with those data.
**Most fundamentally, your results have to be reproducible.**
>Your most important collaborator is your future self. It’s important to make a workflow that you can use time and time again, and even pass on to others in such a way that you don’t have to be there to walk them through it. [Source](http://berkeleysciencereview.com/reproducible-collaborative-data-science/)
Reproducibility
========================================================
Even if you don't buy it, **prepare yourself for the future**. Funders, journals, governments, colleagues are all pushing for more reproducibility and openness. It's a slow but steady ratchet.
NSF, DOE, Wellcome, Gates, etc. are increasingly requiring data management plans; data deposition; publication in open-access journals.
Reproducibility generally means *scripts* tied to *open source software* with effective *data management* and *archiving*.
You can't reproduce
========================================================
...what you've lost. What if you need access to a file as it existed 1, 10, or 100, or 1000 days ago?
- Incremental backups (minimum)
- Don't depend on yourself! Has to be _automatic_.
- Version control. A *repository* holds files and tracks changes: what, by whom, why
***
<img src="images/tardis.jpg" width="400" />
Version control
========================================================
**Git** (and website **GitHub**) are the most popular version control tools for use with R, and many other languages:
- version control
- sharing code with collaborators in a *repository*
- issue tracking
- public or private
***
<img src="images/git_2x.png" width="400" />
Data management during analysis
========================================================
Version control and scripts address two of the biggest problems with managing code and data: tracking *changes over time*, and understanding/reproducing *analytical steps*.
Ideally, _every_ step in your analysis is programmatic--done by a script--so it can be 'read': understood and reproduced later.
Reproducibility is a process
========================================================
*Don't let the perfect be the enemy of the good.* Upgrade and improve your workflow and skills over time.
>Organizing analyses so that they are reproducible is not easy. It requires diligence and a considerable investment of time: to learn new computational tools, and to organize and document analyses as you go.
>But partially reproducible is better than not at all reproducible. Just try to make your next paper or project better organized than the last.
A great and practical guide: http://kbroman.org/steps2rr/
<img src="images/repository.png" />
Reproducible research example
========================================================
A typical project/paper directory for me, slightly idealized:
```
1-download.R
2-prepdata.R
3-analyze_data.R
4-manuscript_report.Rmd
logs/
output/
rawdata/
```
This directory contains *scripts* that are backed up both *locally* and *remotely*. It is under *version control*, so it's easy to track changes over time.
There's also [drake](https://github.com/ropensci/drake), but that's a topic for another day.
Hands-on: setting up
========================================================
type: prompt
incremental: false
If you're doing the exercises and problems, you'll need these
packages:
- `dplyr` - fast, flexible tool for working with data frames
- `tidyr` - reshaping and cleaning data
- `ggplot2` - popular package for visualizing data
We'll also use these data package:
- `babynames` - names provided to the SSA 1880-2013
- `gapminder` - life expectancy, GDP per capita, and population for 142 countries
Reshaping datasets
========================================================
In honor of the late [Hans Rosling](https://en.wikipedia.org/wiki/Hans_Rosling), we'll use the `gapminder` dataset today.
```{r}
library(dplyr)
library(gapminder)
gapminder
```
```{r, echo=FALSE}
library(ggplot2)
theme_set(theme_bw())
```
Pipelines
========================================================
The `magrittr` package (used by both `dplyr` and `tidyr`) provides the `%>%` operator, which allows us to _pipe_ an object forward into a function or call expression.
Note that `x %>% f` is _usually_ equivalent to `f(x)`.
```{r, eval=FALSE}
print(gapminder)
gapminder %>% print
gagminder %>% head
gapminder %>% head(n=20)
gapminder %>%
print %>%
summary # what is non-piped equivalent?
summary(print(gapminder))
```
dplyr
========================================================
The `dplyr` package uses _verbs_ (functions) to operate on _tibbles_ (data frames).
```{r, eval=FALSE}
some_data_frame %>%
do_something() %>%
do_something_else() %>%
getting_dizzy_from_so_much_doing()
```
Let's go over some of those possible `do_something` steps.
filter
========================================================
Very commonly used.
```{r, eval=FALSE}
gapminder %>% filter(country == "Egypt")
gapminder %>% filter(country == "Egypt", year > 2000)
gapminder %>% filter(country %in% c("Egypt", "New Zealand", "Chad"))
```
filter
========================================================
```{r}
gapminder %>%
filter(year == 1977) %>%
ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
```
select
========================================================
Also extremely useful. Note different notations for selecting columns:
```{r, eval=FALSE}
select(gapminder, pop, year)
gapminder %>% select(pop, year)
gapminder %>% select(-lifeExp, -gdpPercap)
gapminder %>% select(-1)
```
There are lots of other cool ways to select columns--see `?select`.
Reshaping data
========================================================
type: prompt
incremental: true
Let's focus on a single country's data for a bit. Write a pipeline that picks out Egypt data only, removes the continent and country columns, and assigns the result to a variable `Egypt`.
```{r}
gapminder %>%
filter(country == "Egypt") %>%
select(-continent, -country) ->
Egypt
```
```{r, echo=FALSE}
Egypt
```
Reshaping data
========================================================
Put this into long (or _tidy_) format--where every row is a different observation. For this we use `tidyr::gather`, which asks: what's the data source? Name of variable column (i.e. that will get old names of columns)? Name of data column? And what columns to operate on?
```{r}
library(tidyr)
Egypt %>% gather(variable, value, lifeExp, pop, gdpPercap)
```
Reshaping data
========================================================
```{r}
library(ggplot2)
Egypt %>%
gather(variable, value, -year) %>%
ggplot(aes(year, value)) + geom_line() +
facet_wrap(~variable, scales = "free")
```
Reshaping data
========================================================
Experiment. Why do these do what they do?
```{r, eval=FALSE}
Egypt %>% gather(variable, value, lifeExp)
Egypt %>% gather(variable, value, -lifeExp)
```
Why?
Reshaping data
========================================================
We can also spread our data out into a table form, like what you'd see in a spreadsheet, using `spread`:
```{r, eval=FALSE}
Egypt %>%
gather(variable, value, -year) %>%
spread(year, value)
```
`spread` is easy. It asks,
* What goes across the new column names?
* What's the data column to use?
Uniting, separating, mutating, and renaming
========================================================
These functions can be very useful.
```{r, eval=FALSE}
gapminder %>% unite(coco, country, continent)
gapminder %>%
unite(coco, country, continent) %>%
separate(coco,
into = c("country", "continent"),
sep = "_",
extra = "merge")
gapminder %>% mutate(logpop = log(pop))
gapminder %>% rename(population = pop)
```
Summarizing data
========================================================
type: section
Summarizing and manipulating data
========================================================
Thinking back to the typical data pipeline, we often want to summarize data by groups as an intermediate or final step. For example, for each subgroup we might want to:
* Compute mean, max, min, etc. (`n`->1)
* Compute rolling mean and other window functions (`n`->`n`)
* Fit models and extract their parameters, goodness of fit, etc.
Specific examples:
* `gapminder`: what's the year of maximum GDP for each country?
* `babynames`: what's the most common name over time?
Split-apply-combine
========================================================
These are generally known as *split-apply-combine* problems.
<img src="images/split_apply_combine.png" width="600" />
From https://github.com/ramnathv/rblocks/issues/8
dplyr
========================================================
The `dplyr` package specializes in data frames, but also allows you to work with remote, out-of-memory databases, using exactly the same tools, because it abstracts away *how* your data is stored.
`dplyr` is very fast for most, though not all, operations on _data frames_ (tabular data).
Verbs
========================================================
`dplyr` provides functions for each basic *verb* of data manipulation. These tend to have analogues in base R, but use a consistent, compact syntax, and are high performance.
* `filter()` - subset rows; like `base::subset()`
* `arrange()` - reorder rows; like `order()`
* `select()` - select (or drop) columns
* `mutate()` - add new columns
* `summarise()` - like base R's `aggregate`
Why use dplyr?
========================================================
Why use `dplyr`?
* Clean, concise, and consistent syntax.
* High performance in most cases.
* Same code can work with data frames or remote databases.
* Very popular; lots of help/documentation
Why not?
* Package still changing--your code might 'break'
* Programming can be trickier in some circumstances
* Not as fast in certain cases
* The `data.table` package is also worth checking out for its speed.
Grouping
========================================================
`dplyr` verbs become particularly powerful when used in conjunction with *groups* we define in the dataset. This doesn't change the data but instead groups it, ready for the next operation we perform.
```{r}
library(dplyr)
gapminder %>%
group_by(country)
```
Summarising
========================================================
```{r}
gapminder %>%
group_by(country) %>%
summarise(maxpop = max(pop))
```
Summarising
========================================================
```{r}
gapminder %>%
group_by(country) %>%
summarise(maxpop = max(pop))
```
Summarising
========================================================
We can apply a function to multiple columns, or multiple functions to a column (or both):
```{r, eval=FALSE}
gapminder %>%
select(-continent, -year) %>%
group_by(country) %>%
summarise_all(max)
```
```{r, eval=FALSE}
gapminder %>%
select(country, pop) %>%
group_by(country) %>%
summarise_all(max)
```
```{r, eval=FALSE}
gapminder %>%
group_by(country) %>%
summarise_if(is.numeric, max)
```
Summarising
========================================================
We can build up a long pipeline to, e.g., summarise min, mean, max for all numeric variables and end up with a table with min-mean-max as columns headers, and variable (gdpPercap, lifeExp, pop) rows.
```{r, eval = FALSE}
gapminder %>%
select(-year) %>%
group_by(country) %>%
summarise_if(is.numeric, funs(min, max, mean)) %>%
gather(variable, value, -country) %>%
separate(variable, into = c("variable", "stat")) %>%
spread(stat, value)
```
Introducing `babynames`
========================================================
Explore `babynames` a bit. How many rows, columns does it have? How many unique names?
```{r}
library(babynames)
babynames
```
Summarizing babynames
========================================================
What does this calculate?
```{r}
babynames %>%
group_by(year, sex) %>%
summarise(prop = max(prop),
name = name[which.max(prop)])
```
Summarizing babynames
========================================================
<img src="images/popular_babynames.png" width="800" />
https://en.wikipedia.org/wiki/Linda_(1946_song)
Hands-on: the `babynames` dataset
========================================================
type: prompt
incremental: false
Load the dataset using `library(babynames)`.
Read its help page. Look at its structure (rows, columns, summary).
Use `dplyr` to calculate the total number of names in the SSA database for each year. Hint: `n()`.
Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).
Summarizing babynames
========================================================
```{r, eval=FALSE}
babynames %>%
filter(name == "Benjamin") %>%
qplot(year, n, color = sex, data = .)
babynames %>%
group_by(year, sex) %>%
mutate(rank = row_number(desc(n))) %>%
filter(name == "Benjamin") %>%
qplot(year, rank, color = sex, data = .)
```
Summarizing babynames
========================================================
```{r, echo=FALSE}
babynames %>%
group_by(year, sex) %>%
mutate(rank = row_number(desc(n))) %>%
filter(name %in% c("Benjamin", "Rachel")) %>%
qplot(year, rank, color = name, shape = sex, data = ., main = "Benjamin and Rachel")
```
Important things we didn't talk about
========================================================
- getting your data _into_ R
- working with non-text/tabular data
- lists
- joins (merging datasets)
- handling dates and timestamps
- plotting (though we had many examples)
But the tools we've covered here can do a lot!
Split-apply-combine: Max
========================================================
Compute relative growth for each treatment.
```{r, eval=FALSE}
tree_diameter_data %>%
# Each day's diameter is in a separate column. Reshape
gather(date, diameter, -treatment) %>%
# Make sure the data are in chronological order
arrange(date) %>%
# For each tree, compute growth from beginning of the season
group_by(tree_id) %>%
mutate(growth = diameter - first(diameter)) %>%
# For each treatment and species,
# compute average growth over all trees
group_by(treatment, species) %>%
summarise(mean_growth = mean(growth)) ->
tree_growth_by_trt
```
Split-apply-combine: Kayla
========================================================
How does the spatial variability of soil respiration evolve over time?
```{r, eval=FALSE}
licor_sr_data %>%
# Select the columns we're interested in
select(date, flux, collar_number, t10, sm) %>%
# We are fundamentally interested in treatment; use a 'join'
# to pull in that information from a table of collars/treatments
left_join(collar_treatments, by = "collar_number") %>%
# Compute variability between collars by month and treatment
group_by(month(date), treatment) %>%
summarise(spatial_variability = sd(flux) / mean(flux)) ->
tree_growth_by_trt
```
Split-apply-combine: Lisa
========================================================
Calculate canopy and subcanopy assimilation for aspen in three size classes, weighting appropriately.
```{r, eval=FALSE}
licor_leaf_data %>%
# We only have a tree number; pull in complete information
left_join(tree_database, by = "tree_number") %>%
filter(species == "POGR") %>%
# Split the trees into small, medium, big
mutate(size_class = cut(dbh, breaks = 3)) %>%
# We want to weight the computation by leaf area, which scales
# nonlinearly with diameter
group_by(size_class, canopy_position) %>%
summarise(flux = weighted.mean(flux, w = dbh ^ 2)) ->
leaf_assimilation
```
Last thoughts
========================================================
>The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians.
>
>-- Bow Cowgill
All the source code for this presentation is available at https://github.com/bpbond/R-data-picarro (under the `vcu` branch)
>source code for this presentation
Wait, what?
_This presentation was generated from an RMarkdown document._
<img src="images/no_pkg_error.png" width="800" />
Resources
========================================================
type: section
Resources
========================================================
* [CRAN](http://cran.r-project.org) - The Comprehensive R Archive Network.
* [GitHub](https://github.com/JGCRI) - The JGCRI organization page on GitHub.
* [RStudio](http://www.rstudio.com) - the integrated development environment for R. Makes many things hugely easier.
* [Advanced R](http://adv-r.had.co.nz) - the companion website for “Advanced R”, a book in Chapman & Hall’s R Series. Detailed, in depth look at many of the issues covered here.
Resources
========================================================
R has many contributed *packages* across a wide variety of scientific fields. Almost anything you want to do will have packages to support it.
[CRAN](http://cran.r-project.org) also provides "Task Views". For example:
***
- Bayesian
- Clinical Trials
- Differential Equations
- Finance
- Genetics
- HPC
- Meta-analysis
- Optimization
- [**Reproducible Research**](http://cran.r-project.org/web/views/ReproducibleResearch.html)
- Spatial Statistics
- Time Series