-
Notifications
You must be signed in to change notification settings - Fork 12
/
30-dplyr.Rmd
1337 lines (1029 loc) · 42.1 KB
/
30-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Manipulating and analyzing data with dplyr {#sec-dplyr}
**Learning Objectives**
- Describe the purpose of the **`dplyr`** and **`tidyr`** packages.
- Select certain columns in a data frame with the **`dplyr`** function
`select`.
- Select certain rows in a data frame according to filtering
conditions with the **`dplyr`** function `filter` .
- Link the output of one **`dplyr`** function to the input of another
function with the 'pipe' operator `%>%` or `|>`.
- Add new columns to a data frame that are functions of existing
columns with `mutate`.
- Use the split-apply-combine concept for data analysis.
- Use `summarize`, `group_by`, and `count` to split a data frame into
groups of observations, apply summary statistics for each group, and
then combine the results.
- Describe the concept of a wide and a long table format and for which
purpose those formats are useful.
- Reshape a data frame from long to wide format and back with the
`pivot_wider()` and `pivot_longer()` commands from the
**`tidyr`** package.
## Data Manipulation using **`dplyr`** and **`tidyr`**
Bracket subsetting is handy, but it can be cumbersome and difficult to
read, especially for complicated operations. Enter
**`dplyr`**. **`dplyr`** is a package for making tabular data
manipulation easier. It pairs nicely with **`tidyr`** which enables
you to swiftly convert between different data formats for plotting and
analysis.
Packages in R are basically sets of additional functions that let you
do more stuff. The functions we've been using so far, like `str()` or
`data.frame()`, come built into R; packages give you access to more of
them. Before you use a package for the first time you need to install
it on your machine, and then you should import it in every subsequent
R session when you need it. You should already have installed the
**`tidyverse`** package. This is an "umbrella-package" that installs
several packages useful for data analysis which work together well
such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
The [**`tidyverse`** packages](https://www.tidyverse.org/) address 3
common issues that arise when doing data analysis with some of
functions that come with R:
1. The results from a base R function sometimes depend on the type of
data.
2. Using R expressions in a non standard way, which can be confusing
for new learners.
3. Hidden arguments, having default operations that new learners are
not aware of.
Let's start by loading several of the **tidyverse** packages with:
```{r, message = FALSE, purl = FALSE}
library("tidyverse")
```
The [Data Transformation Cheat
Sheet](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
provides an overview of the `dplyr` grammar, offering more details and
functions that we will see in this chapter. The [Tidy Data
Tutor](https://tidydatatutor.com/) is a wonderful tool to visually
describe what the tidy data operations do. The following
[tutorial](https://www.andrewheiss.com/blog/2024/04/04/group_by-summarize-ungroup-animations/)
offers nice animations of `mutate()`, `summarize()`, `group_by()`, and
`ungroup()`.
## What are **`dplyr`** and **`tidyr`**?
The package **`dplyr`** provides easy tools for the most common data
manipulation tasks. It is built to work directly with data frames,
with many common tasks optimized by being written in a compiled
language (C++). An additional feature is the ability to work directly
with data stored in an external database. The benefits of doing this
are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the
query are returned.
This addresses a common problem with R in that all operations are
conducted in-memory and thus the amount of data you can work with is
limited by available memory. The database connections essentially
remove that limitation in that you can connect to a database of many
hundreds of GB, conduct queries on it directly, and pull back into R
only what you need for analysis.
The package **`tidyr`** addresses the common problem of wanting to
reshape your data for plotting and use by different R
functions. Sometimes we want data sets where we have one row per
measurement. Sometimes we want a data frame where each measurement
type has its own column, and rows are instead more aggregated groups -
like plots or aquaria. Moving back and forth between these formats is
nontrivial, and **`tidyr`** gives you tools for this and more
sophisticated data manipulation.
To learn more about **`dplyr`** and **`tidyr`** after the workshop,
you may want to check out this [handy data transformation with
**`dplyr`**
cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
and this [one about
**`tidyr`**](https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf).
We'll read in our data using the `read_csv()` function, from the
tidyverse package **`readr`**, instead of `read.csv()`.
```{r dplyr_setp, results = 'hide', purl = FALSE}
rna <- read_csv("data/rnaseq.csv")
## inspect the data
str(rna)
## preview the data
# View(rna)
```
Notice that the class of the data is now `tbl_df`
This is referred to as a "tibble". Tibbles tweak some of the behaviors
of the data frame objects we introduced in the previous episode. The
data structure is very similar to a data frame. For our purposes the
only differences are that:
1. In addition to displaying the data type of each column under its name, it
only prints the first few rows of data and only as many columns as fit on one
screen.
2. Columns of class `character` are never converted into factors.
We're going to learn some of the most common **`dplyr`** functions:
- `select()`: subset columns
- `filter()`: subset rows on conditions
- `mutate()`: create new columns by using information from other columns
- `group_by()` and `summarize()`: create summary statisitcs on grouped data
- `arrange()`: sort results
- `count()`: count discrete values
## Selecting columns and filtering rows
To select columns of a data frame, use `select()`. The first argument
to this function is the data frame (`rna`), and the subsequent
arguments are the columns to keep.
```{r}
select(rna, gene, sample, tissue, expression)
```
To select all columns *except* certain ones, put a "-" in front of
the variable to exclude it.
```{r}
select(rna, -organism, -strain)
```
This will select all the variables in `rna` except `organism`
and `strain`.
To choose rows based on a specific criteria, use `filter()`:
```{r, purl = FALSE}
filter(rna, sex == "Male")
filter(rna, sex == "Male" & infection == "NonInfected")
```
## Pipes
What if you want to select and filter at the same time? There are
three ways to do this: use intermediate steps, nested functions, or
pipes.
With intermediate steps, you create a temporary data frame and use
that as input to the next function, like this:
```{r, purl = FALSE}
rna2 <- filter(rna, sex == "Male")
rna3 <- select(rna2, gene, sample, tissue, expression)
rna3
```
This is readable, but can clutter up your workspace with lots of
objects that you have to name individually. With multiple steps, that
can be hard to keep track of.
You can also nest functions (i.e. one function inside of another),
like this:
```{r, purl = FALSE}
rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression)
rna3
```
This is handy, but can be difficult to read if too many functions are
nested, as R evaluates the expression from the inside out (in this
case, filtering, then selecting).
The last option, *pipes*, are a recent addition to R. Pipes let you
take the output of one function and send it directly to the next,
which is useful when you need to do many things to the same dataset.
Pipes in R look like `%>%` (as made available via the **`magrittr`**
package, installed automatically with **`dplyr`**) or `|>` (as
available in base R). If you use RStudio, you can type the pipe with
<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or
<kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac.
```{r, purl = FALSE}
rna |>
filter(sex == "Male") |>
select(gene, sample, tissue, expression)
```
Or
```{r, purl = FALSE}
rna |>
filter(sex == "Male") |>
select(gene, sample, tissue, expression)
```
In the above code, we use the pipe to send the `rna` dataset first
through `filter()` to keep rows where `sex` is Male, then through
`select()` to keep only the `gene`, `sample`, `tissue`, and
`expression`columns. Since `%>%` (and `|>`) takes the object on its
left and passes it as the first argument to the function on its right,
we don't need to explicitly include the data frame as an argument to
the `filter()` and `select()` functions any more.
Some may find it helpful to read the pipe like the word "then". For
instance, in the above example, we took the data frame `rna`, *then*
we `filter`ed for rows with `sex == "Male"`, *then* we `select`ed
columns `gene`, `sample`, `tissue`, and `expression`. The **`dplyr`**
functions by themselves are somewhat simple, but by combining them
into linear workflows with the pipe, we can accomplish more complex
manipulations of data frames.
If we want to create a new object with this smaller version of the
data, we can assign it a new name:
```{r, purl = FALSE}
rna3 <- rna |>
filter(sex == "Male") |>
select(gene, sample, tissue, expression)
rna3
```
`r msmbstyle::question_begin()`
Using pipes, subset the `rna` data to genes with an expression higher
than 50000 in male mice at time 0, and retain only the columns `gene`,
`sample`, `time`, `expression` and `age`
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, eval=TRUE, purl=FALSE}
rna |>
filter(expression > 50000, sex == "Male", time == 0 ) |>
select(gene, sample, time, expression, age)
```
`r msmbstyle::solution_end()`
## Mutate
Frequently you'll want to create new columns based on the values in existing
columns, for example to do unit conversions, or to find the ratio of values in two
columns. For this we'll use `mutate()`.
To create a new column of time in hours:
```{r, purl = FALSE}
rna |>
mutate(time_hours = time * 24) |>
select(time, time_hours)
```
You can also create a second new column based on the first new column within the same call of `mutate()`:
```{r, purl = FALSE}
rna |>
mutate(time_hours = time * 24, time_mn = time_hours * 60) |>
select(time, time_hours, time_mn)
```
If this runs off your screen and you just want to see the first few rows, you
can use a pipe to view the `head()` of the data. (Pipes work with non-`dplyr`
functions, too, as long as the **`dplyr`** or `magrittr` package is loaded).
```{r, purl = FALSE}
rna |>
mutate(time_hours = time * 24, time_mn = time_hours * 60) |>
select(time, time_hours, time_mn) |>
head()
```
Let's imagine we are interested in the human homologs of the mouse
genes analysed in this dataset. This information can be found in the
last column of the `rna` tibble, named `hsapiens_homolog_associated_gene_name`.
```{r, purl = FALSE}
rna |>
select(gene, hsapiens_homolog_associated_gene_name)
```
Some mouse gene have no human homologs. These can be retrieved using a `filter()`
in the chain, and the `is.na()` function that determines whether something is an `NA`.
```{r, purl = FALSE}
rna |>
select(gene, hsapiens_homolog_associated_gene_name) |>
filter(is.na(hsapiens_homolog_associated_gene_name))
```
If we want to keep only mouse gene that have a human homolog, we can
insert a `!` symbol that negates the result, so we're asking for
every row where `hsapiens_homolog_associated_gene_name` *is not* an
`NA`.
The first few rows of the output are full of `NA`s, so if we wanted to remove
those we could insert a `filter()` in the chain:
```{r, purl = FALSE}
rna |>
select(gene, hsapiens_homolog_associated_gene_name) |>
filter(!is.na(hsapiens_homolog_associated_gene_name))
```
`r msmbstyle::question_begin()` Create a new data frame from the `rna`
data that meets the following criteria: contains only the `gene`,
`chromosome_name`, `phenotype_description`, `sample`, and `expression`
columns and a new column giving the log expression the gene. This
data frame must only contain gene located on autosomes and associated
with a `phenotype_description`.
**Hint**: think about how the commands should be ordered to produce
this data frame!
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, eval=TRUE, purl=FALSE}
rna |>
filter(chromosome_name != "X", chromosome_name != "Y") |>
mutate(log_expression = log(expression)) |>
select(gene, chromosome_name, phenotype_description, sample, log_expression) |>
filter(!is.na(phenotype_description))
```
`r msmbstyle::solution_end()`
## Pulling a variable
The tidy functions that we have seen (and will see below) always take
a tidy table (typically a `tibble`) as input, and return a new,
transformed one, possibly composed of a single column/variables. This
is an important feature that enables the piping of commands one into
another. Sometimes however, one needs to extract the variable that
composes the column of a table. This can be done with the `pull()`
function.
Below, note the difference between `select()`, that here returns a
one-column `tibble` ...
```{r}
mini_rna <- head(rna)
mini_rna |> select(gene)
```
... and `pull()`, that returns a vector:
```{r}
mini_rna |> pull(gene)
```
After using `pull()`, on can't pipe the output into any of the
functions that expect to get a `tibble`.
## Split-apply-combine data analysis
Many data analysis tasks can be approached using the
*split-apply-combine* paradigm: split the data into groups, apply some
analysis to each group, and then combine the results. **`dplyr`**
makes this very easy through the use of the `group_by()` function.
```{r}
rna |>
group_by(gene)
```
The `group_by()` function doesn't perform any data processing, it
groups the data into subsets: in the example above, our initial
`tibble` of `r nrow(rna)` observations is split into
`r length(unique(rna$gene))` groups based on the `gene` variable.
Once the data have been combined, subsequent operations will be
applied on each group independently. To remove this grouping, simply
use the `ungroup()` function.
```{r}
grouped_rna <- rna |>
group_by(gene)
grouped_rna
ungroup(rna)
```
### The `summarize()` function
`group_by()` is often used together with `summarize()`, which
collapses each group into a single-row summary of that group.
`group_by()` takes as arguments the column names that contain the
**categorical** variables for which you want to calculate the summary
statistics. So to compute the mean `expression` by gene:
```{r}
rna |>
group_by(gene) |>
summarize(mean_expression = mean(expression))
```
You may also have noticed that the output from these calls doesn't run off the
screen anymore. It's one of the advantages of `tbl_df` over data frame.
You can also group by multiple columns:
```{r}
rna |>
group_by(gene, infection, time) |>
summarize(mean_expression = mean(expression))
```
Here, again, the output from these calls doesn't run off the screen
anymore. If you want to display more data, you can use the `print()` function
at the end of your chain with the argument `n` specifying the number of rows to
display:
```{r, purl = FALSE}
rna |>
group_by(gene, infection, time) |>
summarize(mean_expression = mean(expression)) |>
print(n = 15)
```
Once the data is grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add
columns indicating the median `expression` by gene and by condition:
```{r, purl = FALSE}
rna |>
group_by(gene, infection, time) |>
summarize(mean_expression = mean(expression),
median_expression = median(expression))
```
It is sometimes useful to rearrange the result of a query to inspect the values.
For instance, we can sort on `mean_expression` to put the genes lowly expressed first:
```{r, purl = FALSE}
rna |>
group_by(gene, infection, time) |>
summarize(mean_expression = mean(expression),
median_expression = median(expression)) |>
arrange(mean_expression)
```
To sort in descending order, we need to add the `desc()` function:
```{r, purl = FALSE}
rna |>
group_by(gene, infection, time) |>
summarize(mean_expression = mean(expression),
median_expression = median(expression)) |>
arrange(desc(mean_expression))
```
### Counting
When working with data, we often want to know the number of observations found
for each factor or combination of factors. For this task, **`dplyr`** provides
`count()`. For example, if we wanted to count the number of rows of data for
each infected and non infected, we would do:
```{r, purl = FALSE}
rna |>
count(infection)
```
The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, `rna |> count()` is equivalent to:
```{r, purl = FALSE}
rna |>
group_by(infection) |>
summarise(count = n())
```
For convenience, `count()` provides the `sort` argument:
```{r, purl = FALSE}
rna |>
count(infection, sort = TRUE)
```
Previous example shows the use of `count()` to count the number of rows/observations
for *one* factor (i.e., `infection`).
If we wanted to count *combination of factors*, such as `infection` and `time`,
we would specify the first and the second factor as the arguments of `count()`:
```{r purl = FALSE}
rna |>
count(infection, time)
```
With the above code, we can proceed with `arrange()` to sort the table
according to a number of criteria so that we have a better comparison.
For instance, we might want to arrange the table above by time:
```{r purl = FALSE}
rna |>
count(infection, time) |>
arrange(time)
```
or by counts:
```{r purl = FALSE}
rna |>
count(infection, time) |>
arrange(n)
```
`r msmbstyle::question_begin()`
1. How many genes were analysed in each sample?
2. Use `group_by()` and `summarize()` to evaluate the sequencing depth
(the sum of all expressions) in each sample. Which sample has the
highest sequencing depth?
3. Calculate the mean expression level of gene "Dok3" by timepoints.
4. Pick one sample and evaluate the number of genes by biotype
5. Identify genes associated with "abnormal DNA methylation" phenotype description,
and calculate their mean expression (in log) at time 0, time 4 and time 8.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
1. How many genes/observations were analysed in each sample?
```{r, purl=FALSE}
rna |>
count(sample)
```
2. Use `group_by()` and `summarize()` to evaluate the sequencing depth
(the sum of all counts) in each sample. Which sample has the highest sequencing depth?
```{r, purl=FALSE}
rna |>
group_by(sample) |>
summarize(seq_depth = sum(expression)) |>
arrange(desc(seq_depth))
```
3. Calculate the mean expression level of gene "Dok3" by timepoints.
```{r, purl=FALSE}
rna |>
filter(gene == "Dok3") |>
group_by(time) |>
summarize(mean = mean(expression)) |>
arrange(time)
```
4. Pick one sample and evaluate the number of genes by biotype
```{r, purl=FALSE}
rna |>
filter(sample == "GSM2545336") |>
group_by(gene_biotype) |>
count(gene_biotype) |>
arrange(desc(n))
```
5. Identify genes associated with "abnormal DNA methylation" phenotype description,
and calculate their mean expression (in log) at time 0, time 4 and time 8.
```{r, purl=FALSE}
rna |>
filter(phenotype_description == "abnormal DNA methylation") |>
group_by(gene, time) |>
summarize(mean_expression = mean(log(expression))) |>
arrange()
```
`r msmbstyle::solution_end()`
It is important to be able to conceptually visualise how these
different functions operate on data. Being able to do that allows to
[mentally run a set of commands and get a feeling whether this will
work, rather that randomly try things
out](https://lgatto.github.io/exam-qst/). The [Tidy Data
Tutor](https://tidydatatutor.com/) is a wonderful tool to do exactly
that.
## Reshaping data
In `rna`, the rows contain expression values that are associated with
a combination of 2 other variables: `gene` and `sample`. All the
other columns correspond to variables describing either the sample
(age, sex, organism...) or the gene (gene_biotype, ENTREZ_ID,
product...). The variables for the same gene and sample pair have the
same value in all the rows. This structure is called a *long format*,
as one column contains all the values, and other columns describe the
context of the value. These data also tend to be quite long.
In certain cases, the *long format* is not really human-readable or
the most appropriate for the task, and another format, called *wide
format* is preferred. It is also generally as a more compact way of
representing the data (although we don't need to worry about the size
of the data here). This is typically the case with gene expression
values that scientists are used to look as matrices of quantitative
data, were rows represent genes (or more generally features or
variables) and columns represent samples.
To convert the gene expression values from `rna` into a wide-format,
we need to create a new table where each row is composed of expression
values associated with each gene. In practical terms this means the
values of the `sample` column in `rna` would become the names of
column variables, and the cells would contain the expression values
measured on each gene.
The key point here is that we are still following a tidy data
structure (a single value per cell), but we have **reshaped** the data
according to the observations of interest: expression levels per gene
instead of recording them per gene and per sample.
With this new table, it would become therefore straightforward to
explore the relationship between the gene expression levels within,
and between, the samples.
The opposite transformation would be to transform column names into
values of a new variable.
We can do both these of transformations with two `tidyr` functions,
`pivot_longer()` and `pivot_wider()` (see
[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for
details).
### Pivoting the data into a wider format
For simplicity, let's first select the 3 first columns of `rna` and
use `pivot_wider()` to transform data in a wide-format.
```{r, purl = FALSE}
rna_exp <- rna |>
select(gene, sample, expression)
rna_exp
```
`pivot_wider` takes the following three main arguments:
1. the data to be transformed;
2. the `names_from` column name whose values will become new column
names;
3. the `values_from` column name whose values will fill the new
columns.
![](./figs/pivot_wider.png)
`pivot_wider()` generates a new table with 1474 gene for 22 samples -
one row for each gene, one column for each sample. We can also
directly pipe the data into the `pivot_wider()`, as illustrated below:
```{r, purl = FALSE}
rna_wide <- rna_exp |>
pivot_wider(names_from = sample,
values_from = expression)
rna_wide
```
We can now easily compare the gene expression levels in different
samples.
Note that the `pivot_wider()` function comes with an optional
`values_fill` argument that can be useful when dealing with missing
values. Let's imagine that for some reason, we had some missing
expression values for some genes in certain samples. In the following
example, the gene Cyp2d22 has only one expression value, in GSM2545338
sample.
```{r}
rna_with_missing_values <- rna |>
select(gene, sample, expression) |>
filter(gene %in% c("Asl", "Apod", "Cyp2d22")) |>
filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) |>
arrange(sample) |>
filter(!(gene == "Cyp2d22" & sample != "GSM2545338"))
rna_with_missing_values
```
By default, the `pivot_wider()` function will add `NA` for missing values.
```{r, purl = FALSE}
rna_with_missing_values |>
pivot_wider(names_from = sample,
values_from = expression)
```
But in some cases, we may wish to fill in the missing values by setting `values_fill`
to a specific value.
```{r, purl = FALSE}
rna_with_missing_values |>
pivot_wider(names_from = sample,
values_from = expression,
values_fill = 0)
```
### Pivoting data into a longer format
The opposing situation could occur if we had been provided with data
in the form of `rna_wide`, where the sample IDs are column names, but
we wished to treat them as values of a `sample` variable instead.
In this situation we are using the column names and turn them into a
pair of new variables and need to arrange the expression values
accordingly in a new variable. This can be done with the
`pivot_longer()` function. It takes the following four main arguments:
1. the data to be transformed;
2. the new `names_to` column we wish to create and populate with the
current column names;
3. the new `values_to` column we wish to create and populate with
current values;
4. the names of the columns to be used to populate the `names_to` and
`values_to` variables (or altarnatively, those to drop using a
`-`).
To recreate `rna_long` from `rna_long` we would create a key
called `sample` and value called `expression` and use all columns
except `gene` for the key variable. Here we drop `gene` column
with a minus sign.
Notice how the new variable names are to be quoted here.
```{r}
rna_long <- rna_wide |>
pivot_longer(names_to = "sample",
values_to = "expression",
-gene)
rna_long
```
![](./figs/pivot_longer.png)
Note that if we had missing values in the wide-format, the `NA` would be
included in the new wide format. Pivoting to wider and longer formats can
be a useful way to balance out a dataset so every replicate has the same composition.
```{r}
wide_with_NA <- rna_with_missing_values |>
pivot_wider(names_from = sample,
values_from = expression)
wide_with_NA
wide_with_NA |>
pivot_longer(names_to = "sample",
values_to = "expression",
-gene)
```
We could also have used a specification for what columns to
include. This can be useful if you have a large number of identifying
columns, and it's easier to specify what to gather than what to leave
alone. Here the `starts_with()` function can help to retrieve sample
names without having to list them all!
Another possibility would be to use the `:` operator!
```{r}
rna_wide |>
pivot_longer(names_to = "sample",
values_to = "expression",
cols = starts_with("GSM"))
rna_wide |>
pivot_longer(names_to = "sample",
values_to = "expression",
GSM2545336:GSM2545380)
```
`r msmbstyle::question_begin()`
Subset genes located on X and Y chromosomes from the `rna` data.frame
and create a new (wide) data.frame with `sex` as columns,
`chromosome_name` as rows, containing the mean expression of genes
located in each chromosome as the values, as shown below.
![](./figs/Exercise_pivot_W.png)
You will need to summarize before reshaping!
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Let's have a look to variables of interest
```{r, answer=TRUE, purl=FALSE}
rna |>
filter(chromosome_name == "Y" | chromosome_name == "X") |>
select(gene, sample, sex, expression, chromosome_name) |>
arrange(gene)
```
```{r, answer=TRUE, purl=FALSE}
rna_1 <- rna |>
filter(chromosome_name == "Y" | chromosome_name == "X") |>
group_by(sex, chromosome_name) |>
summarize(mean = mean(expression)) |>
pivot_wider(names_from = sex,
values_from = mean)
rna_1
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Now take that data frame and transform it with `pivot_longer()` so
each row is a unique `chromosome_name` by `gender` combination.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, answer=TRUE, purl=FALSE}
rna_1 |>
pivot_longer(names_to = "gender",
values_to = "mean",
-chromosome_name)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Use the `rna` dataset to create an new table were each row represents
the mean expression levels of genes and columns represent the
different timepoints.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
rna |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression)) |>
pivot_wider(names_from = time,
values_from = mean_exp)
```
Notice that this generates a tibble with some column names starting by a number.
If we wanted to select the column corresponding to the timepoints,
we could not use the column names directly... What happens when we select the colum 4?
```{r}
rna |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression)) |>
pivot_wider(names_from = time,
values_from = mean_exp) |>
select(gene, 4)
```
To select the timepoint 4, we would have to quote the column name, with backticks "`"
```{r}
rna |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression)) |>
pivot_wider(names_from = time,
values_from = mean_exp) |>
select(gene, `4`)
```
Another possibility would be to rename the column,
choosing a name that doesn't start by a number :
```{r}
rna |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression)) |>
pivot_wider(names_from = time,
values_from = mean_exp) |>
rename("time0" = `0`, "time4" = `4`, "time8" = `8`) |>
select(gene, time4)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Use the previous data frame containing mean expression levels per
timepoint and create a new column containing fold-changes between
timepoint 8 and timepoint 0, and fold-changes between timepoint 8 and
timepoint 4. Convert this table in a long-format table gathering the
foldchanges calculated.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
rna_fc <- rna |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression)) |>
pivot_wider(names_from = time,
values_from = mean_exp) |>
mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`)
rna_fc
```
```{r}
rna_fc |>
pivot_longer(names_to = "comparisons",
values_to = "Fold_changes",
time_8_vs_0:time_8_vs_4)
```
`r msmbstyle::solution_end()`
## Exporting data
Now that you have learned how to use **`dplyr`** to extract
information from or summarize your raw data, you may want to export
these new data sets to share them with your collaborators or for
archival.
Similar to the `read_csv()` function used for reading CSV files into
R, there is a `write_csv()` function that generates CSV files from
data frames.
Before using `write_csv()`, we are going to create a new folder,
`data_output`, in our working directory that will store this generated
dataset. We don't want to write generated datasets in the same
directory as our raw data. It's good practice to keep them
separate. The `data` folder should only contain the raw, unaltered
data, and should be left alone to make sure we don't delete or modify
it. In contrast, our script will generate the contents of the
`data_output` directory, so even if the files it contains are deleted,
we can always re-generate them.
In preparation for our next lesson on plotting, we are going to
prepare a table representing for each gene, the fold-changes (in log
values) between timepoint 8 and timepoint 0, and the fold-changes
between timepoint 8 and timepoint 0.
```{r}
rna_fc <- rna |>
mutate(expression_log = log(expression)) |>
group_by(gene, time) |>
summarize(mean_exp = mean(expression_log)) |>
pivot_wider(names_from = time,