-
Notifications
You must be signed in to change notification settings - Fork 0
/
06-tidy.qmd
856 lines (610 loc) · 34.9 KB
/
06-tidy.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
# Data Tidying {#sec-tidy .incomplete-chapter}
```{r, include=FALSE, eval = FALSE}
# setup - doesn't need to run every time
# creates and saves tidy_data.csv and untidy_data.csv
set.seed(90210)
tidy_data <- crossing(
customer_id = 1:5,
year = 2018:2020
) |>
mutate(items = sample(0:10, nrow(.), T)) |>
group_by(year) |>
mutate(price_per_item = round(rnorm(1, 5, 2), 2)) |>
ungroup() |>
mutate(totalprice = items * price_per_item)
untidy_data <- tidy_data |>
mutate(itemsprice = glue::glue("{items} ({price_per_item})")) |>
select(-items, -price_per_item) |>
pivot_wider(id_cols = customer_id,
names_from = year,
values_from = c(itemsprice, totalprice)
)
write_csv(tidy_data, "data/tidy_data.csv")
write_csv(untidy_data, "data/untidy_data.csv")
```
## Intended Learning Outcomes {#sec-ilo-tidy - .ilo}
- [ ] Be able to reshape data between long and wide formats
- [ ] Separate, change, reorder, and rename columns
- [ ] Use pipes to chain together functions
## Functions used {#functions-tidy}
* built-in (you can always use these without loading any packages)
* base:: `library()`, `c()`, `list()`
* tidyverse (you can use all these with `library(tidyverse)`)
* readr:: `read_csv()`, `type_convert()`
* dplyr:: `filter()`, `group_by()`, `summarise()`, `select()`, `mutate()`, `glimpse()`
* tidyr:: `pivot_longer()`, `pivot_wider()`, `separate()`
* ggplot2:: `ggplot()`, `aes()`, `geom_histogram()`
* stringr:: `str_replace_all()`
## Set-up {#sec-setup-tidy -}
1. Open your `reprores` project
1. Create a new quarto file called `06-tidy.qmd`
1. Update the YAML header
1. Replace the setup chunk with the one below:
```{r}
#| echo: fenced
#| label: setup
#| include: false
library(tidyverse) # for data wrangling
```
You'll also need to make a folder called "data" and download two data files into it:
<a href="https://psyteachr.github.io/ads-v2/data/tidy_data.csv" download>tidy_data.csv</a> and
<a href="https://psyteachr.github.io/ads-v2/data/untidy_data.csv" download>untidy_data.csv</a>.
Download the [Data tidying cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf).
## Data Structures
The data you work with will likely come in many different formats and structures. Some of these structures may be driven by how the software you use outputs the data, but data structures may also differ because of human intervention or attempts at organisation, some of which may not be particularly helpful.
Data cleaning and tidying will likely be the most time consuming and difficult task you perform. Whilst you can create code recipes for analyses and visualisations, as Hadley Whickham puts it, ["every messy dataset is messy in its own way"](https://vita.had.co.nz/papers/tidy-data.pdf), which means that you will often have to solve new problems that are specific to the dataset. Additionally, moving between data structures that are intuitive to read by humans and those that are useful for a computer requires a conceptual shift that only comes with practice.
This is all a verbose way of saying that what lies ahead in this chapter is unlikely to sink in on the first attempt and you will need to practice with different examples (preferably with data you know well) before you truly feel comfortable with it.
First, some terminology.
An `r glossary("observation")` is all the information about a single "thing" in a single condition, such as at one point in time. These things can be customers, sales, orders, feedback questionnaires, tweets, or really anything. Observations should have a way to identify them, such as a unique ID or a unique combination of values like country and year.
A `r glossary("variable")` is one type of information about the observation. For example, if the observation is a sale, the variables you might have about the sale are the sale ID, the customer's ID, the date of the sale, the price paid, and method of payment.
A `r glossary("value")` is the data for one variable for one observation. For example, the value of the date variable from the observation of a sale might be `2021-08-20`.
::: {.try}
The following table is data that shows the number of items each customer bought each year.
```{r, echo = FALSE}
tidy_data <- read_csv("data/tidy_data.csv", show_col_types = FALSE)
tidy_data |>
filter(customer_id < 3) |>
group_by(customer_id, year) |>
summarise(
items = sum(items),
.groups = "drop"
)
```
- What is `items`? `r mcq(c(x = "Observation", answer = "Variable", x = "Value"))`
- How many observations are there in this dataset? `r fitb(6)`
- What is `8`? `r mcq(c(x = "Observation", x = "Variable", answer = "Value"))`
`r hide("Explain these answers")`
- There are three variables, `customer_id`, `year`, and `items`.
- There are six observations, one for each of two customers for each of three years.
- `8` is a value because it is a single data point for one variable for one observation.
`r unhide()`
:::
### Untidy data
First, let's have a look at an example of a messy, or untidy, dataset. Each row has all of the data relating to one customer.
```{r}
untidy_data <- read_csv("data/untidy_data.csv", show_col_types = FALSE)
```
```{r tbl-untidy-data, echo = FALSE}
#| tbl-cap: Untidy table
untidy_data
```
* The `itemsprice_{year}` columns contain two values (number of items and price per item)
* The `totalprice_{year}` columns contain the total amount spent by that customer that year, i.e., items * price.
* There is data for three different years in the dataset.
Let's say you wanted to calculate the total price per customer over the three years and the total number of items bought per customer. You can't perform mathematical operations on the `itemsprice_{year}` columns because they are `r glossary("character")` `r glossary("data type", "data types")`.
You would probably normally use Excel to
1. split `itemsprice_2018` column into `item_2018` and `price_2018` columns
2. split `itemsprice_2019` column into `item_2019` and `price_2019` columns
3. split `itemsprice_2020` column into `item_2018` and `price_2020` columns
4. add `item_2018 + item_2019 + item_2020` to get the total number of items bought per customer
5. add `totalprice_2018 + totalprice_2019 + totalprice_2020` to get the total price per customer
::: {.try}
Think about how many steps in Excel this would be if there were 10 years in the table, or a different number of years each time you encountered data like this.
:::
### Tidy data
There are three rules for "`r glossary("tidy data")`, which is data in a format that makes it easier to combine data from different tables, create summary tables, and visualise your data.
- Each observation must have its own row
- Each variable must have its own column
- Each value must have its own cell
This is the tidy version:
```{r}
tidy_data <- read_csv("data/tidy_data.csv", show_col_types = FALSE)
```
```{r tbl-tidy-data, echo = FALSE}
#| tbl-cap: Tidy table
tidy_data
```
* There are now five variables (columns) because there are five different types of information we have for each observation: the customer id, the year, number of items bought, price per item, and total price.
* Each row is a customer's orders in a particular year.
* The number of items (`items`) and price per item (`price_per_item`) are in separate columns, so now you can perform mathematical operations on them.
To calculate the total price per customer over the three years and the total number of items bought per customer in R, you could then:
1. group the table by customer_id
2. sum the `items` column to get the total number of items bought per customer
3. sum the `totalprice` column to get the total price per customer
```{r, message=FALSE}
tidy_data |>
group_by(customer_id) |>
summarise(
total_items = sum(items),
total_price = sum(totalprice)
)
```
::: {.callout-note}
If there were 10 years in the table, or a different number of years each time you encountered data like this, the code for producing the tables and plots above never changes.
:::
If you have control over how the data are recorded, it will make your life easier to record it in a tidy format from the start. However, we don't always have control, so this class will also teach you how to convert untidy tables into tidy tables.
## Reshaping Data {#sec-reshaping-data}
Data tables can be in `r glossary("wide")` format or `r glossary("long")` format (or a mix of the two). Wide data are where all of the observations about one thing are in the same row, while long data are where each observation is on a separate row. You often need to convert between these formats to do different types of summaries or visualisation. You may have done something similar using pivot tables in Excel.
```{r img-pivot-table, echo=FALSE, fig.cap="Converting between wide and long formats using pivot tables in Excel."}
knitr::include_graphics("images/tidy/excel_pivot_table.png")
```
It can be easier to just consider one type of measurement at a time. `untidy_data` has two types of measurements, total price and price per item. Let's look at just the `totalprice` data first.
We can select just the columns we want using the `dplyr::select()` function. This function's first argument is the data table you want to select from, then each argument after that is either the name of a column in that table, or `new_name = old_name`. This is a useful function for changing the column names and order of columns, as well as selecting a subset of columns.
::: {.callout-warning}
Note that because the names of the columns are numbers, they need to be wrapped in backticks otherwise you'll get an error like:
```
Error: unexpected '=' in:
" customer_id,
2018 ="
```
:::
```{r}
# select just the customer ID and 3 total price columns
wide_totalprice <- select(
.data = untidy_data,
customer_id,
`2018` = totalprice_2018,
`2019` = totalprice_2019,
`2020` = totalprice_2020
)
```
```{r tbl-wide-data, echo = FALSE}
#| tbl-cap: Wide table
wide_totalprice
```
This is in wide format, where each row is a customer, and represents the data from several years. This is a really intuitive way for humans to read a table, but it's not as easy to process with code.
The same data can be represented in a long format by creating a new column that specifies what `year` the observation is from and a new column that specifies the `totalprice` of that observation. This is easier to use to make summaries and plots.
```{r long-data}
long_totalprice <- pivot_longer(
data = wide_totalprice,
cols = `2018`:`2020`,
names_to = "year",
values_to = "totalprice")
```
```{r tbl-long-data, echo = FALSE}
#| tbl-cap: Long table
long_totalprice
```
It also makes it very easy to use with `ggplot()`. Run the following plot, and consider how you'd make it with the wide version (it is likely impossible).
```{r long-ggplot-demo, fig.cap = "Most plots are easier to make with data in a long format."}
ggplot(long_totalprice, aes(x = totalprice, fill = year)) +
geom_histogram(binwidth = 10, color = "black")
```
::: {.try}
Create a long version of the following table of how many million followers each band has on different social media platforms. You don't need to use code, just sketch it in a notebook or make a table in a spreadsheet.
| band | twitter | instagram |
|:-------------------|:--------|:----------|
| The Beatles | 3.8 | 3.8 |
| The Rolling Stones | 3.4 | 3.1 |
| One Direction | 31.3 | 22.8 |
`r hide("Answer")`
Your answer doesn't need to have the same column headers or be in the same order.
| account | social_media | followers |
|:-------------------|:-------------|:----------|
| The Beatles | twitter | 3.8 |
| The Beatles | instagram | 3.8 |
| The Rolling Stones | twitter | 3.4 |
| The Rolling Stones | instagram | 3.1 |
| One Direction | twitter | 31.3 |
| One Direction | instagram | 322.8 |
`r unhide()`
:::
::: {.callout-note}
If you're a researcher and you're used to thinking about IVs and DVs, you may find it easier to remember that each IV and DV should have its own column, rather than a column for each level of the IV.
:::
The pivot functions allow you to transform a data table from wide to long or long to wide.
### Wide to long
The function `pivot_longer()` converts a wide data table to a longer format by converting the headers from specified columns into the values of new columns, and combining the values of those columns into a new condensed column.
This function has several arguments:
- `cols`: the columns you want to make long; you can refer to them by their names, like `` c(`2018`, `2019`, `2020`) `` or `` `2018`:`2020` `` or by their numbers, like `c(2, 3, 4)` or `2:4`
- `names_to`: what you want to call the new columns that the `cols` column header names will go into
- `values_to`: what you want to call the new column that contains the values in the `cols`
With the pivot functions, it can be easier to show than tell - run the below code and then compare `wide_totalprice` with `long_totalprice` and the pivot code and try to map each argument to what has changed.
```{r pivot-longer}
long_totalprice <- pivot_longer(
data = wide_totalprice,
cols = `2018`:`2020`, # columns to make long
names_to = "year", # new column name for headers
values_to = "totalprice" # new column name for values
)
```
### Long to wide
We can also go from long to wide format using the `pivot_wider()` function. Instead of returning to the original table with a row for each customer and a column for each year, this new wide table will have a row for each year and a column for each customer. It can be awkward to have numbers for column names, so we use `names_prefix` to add "C_" before each new column name.
- `id_cols`: the column(s) that uniquely identify each new row
- `names_from`: the column(s) that contain your new column headers
- `names_prefix`: A prefix to add to the values in the names column
- `values_from`: the column that contains the values for the new columns
```{r pivot-wider}
wide_by_customer <- pivot_wider(
data = long_totalprice,
id_cols = year, # identifying column(s)
names_from = customer_id, # the new column names
names_prefix = "C_", # prefix for new column names
values_from = totalprice # the new column values
)
```
```{r tbl-wide-pivot, echo = FALSE}
#| tbl-cap: Data made wider with pivot_wider()
wide_by_customer
```
## Multi-step tidying {#sec-multistep}
You often need to go from wide, to long, to an intermediate shape in order to get your data into a format that is useful for plotting, where there is a column for each variable that you want to represent with an aesthetic.
Our full `untidy_data` table has seven columns: a customer ID, three columns for `itemsprice` and 3 columns for `totalprice`.
```{r, echo = FALSE}
untidy_data
```
We want to get it into the tidy format below where each row is an observation of one customer per year, with the columns of `customer_id`, `year`, `item`, `price_per_item` and `totalprice`. Before trying to reshape any dataset, you should be able to visualise what it will look like. Sketching out your tables on a piece of paper can really help make these transformations make sense.
```{r, echo = FALSE}
tidy_data
```
### One observation per row
The original table has observations from each customer over three years. This is too many observations per row, so first we'll start by making the table long. We need to make 6 rows for each customer, one for each category (item price/total price) and year combination, with columns for the customer ID, year, category, and value.
Because we'll be combining columns with numeric (`totalprice`) and character (`itemsprice`) data, we need to make the new `value` column a character data type using `values_transform`, since numbers can be represented as characters (like `"3.5"`), but character strings can't be represented as numbers.
The argument `names_sep` is set to the character string used to join names if `names_from` is more than one column. Alternatively, you can use the argument `names_pattern`, which can be more powerful but also a little harder to understand how to set up.
```{r}
longer_data <- pivot_longer(
data = untidy_data,
cols = itemsprice_2018:totalprice_2020, # columns to make long
names_to = c("category", "year"), # new column names for cols
names_sep = "_", # how to split cols into new columns
# names_pattern = "(.*)_(.*)", # alternative to names_sep
values_to = "value", # new column name for values
# make sure new columns are the right data type
names_transform = list(year = as.integer),
values_transform = list(value = as.character)
)
```
```{r tbl-wide-to-long, echo = FALSE}
#| tbl-cap: Untidy data converted from wide to long.
longer_data
```
### One variable per column
Now this table is long, but not tidy. The `value` column contains data from two different variables. We need to make the table wider, but not as wide as before. We want to keep the `year` column and make new columns called `itemsprice` and `totalprice` with the relevant customer's `value` for that variable and year.
```{r}
wider_data <- pivot_wider(
data = longer_data,
id_cols = c(customer_id, year),
names_from = category,
values_from = value
)
```
```{r tbl-intermediate, echo = FALSE}
#| tbl-cap: Data converted from long to an intermediate shape.
wider_data
```
::: {.callout-note}
Techinically, you can skip setting the `id_cols` argument, because all of the columns apart from the `names_from` column and the `values_from` column identify the observation (e.g., each observation is identified by the unique combination of `customer_id` and `year`). You only have to set the `id_cols` argument when this is not the case.
:::
### One value per cell
The cells in the `itemsprice` column actually contain two different values. We need to split it into two columns for the variables `items`, and `price_per_item`. You can split a column into parts with the function `tidyr::separate()`. There is a space between the number of items and the brackets, so we can split it along this space -- if you are in charge of how data is stored, ensuring data is entered consistently makes this much easier.
```{r}
split_data <- separate(
data = wider_data,
col = itemsprice, # the column to split
into = c("items", "price_per_item"), # the new columns to create
sep = " ", # split col by space
remove = TRUE, # whether to remove to old col
convert = TRUE # whether to fix the data type of the new columns
)
```
```{r tbl-split, echo = FALSE}
#| tbl-cap: The itemsprice column split into items and price_per_item using separate().
split_data
```
::: {.callout-warning}
If the new columns should have a different `r glossary("data type")` from the old column, set `convert = TRUE` to automatically fix them. This is common when you have columns that contain multiple numbers, separated by commas or semicolons. These are character types before they are separated, but should be numeric types after so that you can do mathematical operations like sum them.
:::
### Altering data
The column `price_per_item` is still a character column because it has parentheses. There are a few ways to fix this. You can use the `dplyr::mutate()` function to change a column or add a new one.
Here, we'll use `stringr::str_replace_all()` to replace all of the "(" and ")" with "".
```{r}
mutated_data <- mutate(
.data = split_data,
price_per_item = stringr::str_replace_all(
string = price_per_item,
pattern = "[()]",
replacement = ""
)
)
```
```{r tbl-mutate, echo = FALSE}
#| tbl-cap: Mutating data to remove the parentheses from price_per_item.
mutated_data
```
### Fixing data types
The `price_per_item` and `totalprice` columns are still characters, so you can't do things like calculate the sum of `totalprice`.
```{r}
# check the data types
glimpse(mutated_data)
```
Once the data are clean and tidy, you can fix all of your column data types in one step using `readr::type_convert()`. This is good practice when you've finished cleaning a data set. If the automatic type detection doesn't work as expected, this usually means that you still have non-numeric characters in a column where there were only supposed to be numbers. You can also manually set the column types in the same way as for `readr::read_csv()` (see @sec-data)).
```{r, message = FALSE}
tidy_data <- type_convert(
df = mutated_data,
trim_ws = TRUE # removes spaces before and after values
)
# check the data types
glimpse(tidy_data)
```
## Pipes {#sec-pipes}
::: {style="width: 200px; float: right;"}
<img src="images/tidy/pipe_sticker.png" style="width: 100%"/>
:::
We've already introduced pipes in @sec-pipes-first but this type of data processing is where they really start to shine, as they can significantly reduce the amount of code you write.
As a recap, a pipe takes the result of the previous function and sends it to the next function as its first argument, which means that you do not need to create intermediate objects. Below is all the code we've used in this chapter, and in the process we created five objects. This can get very confusing in longer scripts.
```{r, message = FALSE}
untidy_data <- read_csv("data/untidy_data.csv",
show_col_types = FALSE)
longer_data <- pivot_longer(
data = untidy_data,
cols = itemsprice_2018:totalprice_2020,
names_to = c("category", "year"),
names_sep = "_",
values_to = "value",
names_transform = list(year = as.integer),
values_transform = list(value = as.character)
)
wider_data <- pivot_wider(
data = longer_data,
id_cols = c(customer_id, year),
names_from = category,
values_from = value
)
split_data <- separate(
data = wider_data,
col = itemsprice,
into = c("items", "price_per_item"),
sep = " ",
remove = TRUE,
convert = TRUE
)
mutated_data <- mutate(
.data = split_data,
price_per_item = stringr::str_replace_all(
string = price_per_item,
pattern = "[()]",
replacement = ""
)
)
tidy_data <- type_convert(
df = mutated_data,
trim_ws = TRUE
)
```
::: {.callout-warning}
You *can* give each object the same name and keep replacing the old data object with the new one at each step. This will keep your environment clean, but it makes debugging code much harder.
:::
For longer series of steps like the one above, using pipes can eliminate many intermediate objects. This also makes it easier to add an intermediate step to your process without having to think of a new table name and edit the table input to the next step (which is really easy to accidentally miss).
```{r, message = FALSE}
tidy_data <- read_csv(file = "data/untidy_data.csv",
show_col_types = FALSE) |>
pivot_longer(
cols = itemsprice_2018:totalprice_2020,
names_to = c("category", "year"),
names_sep = "_",
values_to = "value",
names_transform = list(year = as.integer),
values_transform = list(value = as.character)
) |>
pivot_wider(
id_cols = c(customer_id, year),
names_from = category,
values_from = value
) |>
separate(
col = itemsprice,
into = c("items", "price_per_item"),
sep = " ",
remove = TRUE,
convert = TRUE
) |>
mutate(
price_per_item = stringr::str_replace_all(
string = price_per_item,
pattern = "[()]",
replacement = ""
)
) |>
type_convert(
trim_ws = TRUE
)
```
You can read the code above like this:
1. Read the data with `read_csv()`
- `file`: from the file at r path("data/untidy_data.csv")\`,
- `show_col_types`: do not show the colukmn types message; **and then**
2. Reshape the data longer with `pivot_longer()`
- `cols`: take the columns from `itemsprice_2018` to `totalprice_2020`,
- `names_to`: create new columns called "category" and "year" from the `cols` header names,
- `names_sep`: separate the column names using "\_"
- `values_to`: create a new column called "value" from the `cols` values,
- `names_transform`: transform the `year` column to integers,
- `values_transform`: transform the `value` column to characters; **and then**
3. Reshape the data wider with `pivot_wider()`
- `id_cols`: each row should be an observation of a unique `customer_id` and `year`,
- `names_from`: get the new column names from the values in the `category` column,
- `values_from`: get the new column values from the values in the `value` column; **and then**
4. Split multiple values in the same column with `separate()`
- `col`: separate the column `itemsprice`,
- `into`: into new columns called "items" and "price_per_item",
- `sep`: separate the values at each " ",
- `remove`: do remove the old column,
- `convert`: do convert the new columns into the right data types; **and then**
5. Change a column with `mutate()`
- `price_per_item`: replace the existing column `price_per_item` with the result of a search and replace with `str_replace_all()`:
- `string`: the strings to modify come from the `price_per_item` columns,
- `pattern`: search for left or right parentheses,
- `replacement`: replace them with ""; **and then**,
6. Fix data types with `type_convert()`
- `trim_ws`: remove spaces, tabs, and line breaks from the start and end of each value
Don't feel like you always need to get all of your data wrangling code into a single pipeline. You should make intermediate objects whenever you need to break up your code because it's getting too complicated or if you need to debug something.
::: {.callout-note}
You can debug a pipe by highlighting from the beginning to just before the pipe you want to stop at. Try this by highlighting from `data <-` to the end of the `separate` function and typing command-enter (mac) or control-enter (PC). What does `data` look like now?
:::
## Exercises
Let's try a couple of examples.
* Save your current script, close it, and open a new quarto script named "patient-survey.qmd".
* Download a copy of [wide_exercise-1.csv](data/wide_exercise-1.csv) and [wide_exercise-2.csv](data/wide_exercise-2.csv) into your data folder.
* In the set-up code chunk, load the <pkg>tidyverse</pkg> then load the two data files in using `read_csv()` and name the objects `wide1` and `wide2`
```{r, message = FALSE, webex.hide = TRUE}
library(tidyverse)
wide1 <- read_csv("data/wide_exercise-1.csv")
wide2 <- read_csv("data/wide_exercise-2.csv")
```
The two datasets represent simulated data from a patient satisfaction survey. We'll do them one at a time, as they differ in complexity.
### Survey 1
`wide1` has data from 50 patients who were asked five questions about their most recent experience at a health centre. The results from this questionnaire are typically reported as a single overall satisfaction score, which is calculated by taking the mean of the five responses. Additionally, the survey also records whether the patient was attending the clinic for the first time, or as a repeat patient.
* Use your method of choice to look at the dataset and familiarise yourself with its structure and data.
As noted, it's important to think through what your tidied data should look like. Often, the problem with data wrangling in R isn't actually the code, it's a lack of understanding of the data that's being worked on.
1. How many **variables** should the long-form version of `wide` have? `r fitb(4)`
2. How many **observations** should the long-form version of `wide1` have? `r fitb(250)`
`r hide("Explain these answers")`
1. There should be four variables, as there are 4 types of data: patient id, whether they are a repeat patient, the question they were asked, and their response.
2. There will be 250 observations or rows of data because each patient will have 5 rows of data (one per question) and there are 50 patients (50 * 5 = 250).
`r unhide()`
### Tidy 1
Transform `wide1` to long-form using `pivot_longer()` and store it in an object named `tidy1`
```{r webex.hide = TRUE}
tidy1 <- wide1 |>
pivot_longer(cols = q1:q5,
names_to = "question",
values_to = "response")
```
### Survey 2
`wide2` also has data from 50 patients, however, there are now two measures included in the questionnaire. There are still five questions that relate to satisfaction, but there are also five questions that relate to whether the patient would recommend the medical practice to a friend. Both measures are typically reported by calculating an overall mean for each of the five items.
* Use your method of choice to look at the dataset and familiarise yourself with its structure and data.
This is not as simple as the first exercise because there's actually two potential ways you might tidy this data, depending on what you want to do with it and how you conceptualise the two different measurements. It's important to recognise that many of your coding problems will not have just one solution.
#### Tidy 2a
For the first option, we're going to treat the "satisfaction" and "recommendation" measurements as two categories of the same variable. This will be a fully long-form data set with five variables `id`, `repeat_patient`, `question` (the question number), `category` (whether it's `sat` or `rec`), and `response` (the numerical rating).
* How many **observations** should the fully long-form version of `wide2` have? `r fitb(500)`
`r hide("Explain this answer")`
There will be 500 rows of data because each participant will have 10 rows: 5 for the satisfaction questions and five for the recommendation questions.
`r unhide()`
Transform `wide2` to full long-form using `pivot_longer()` and store it in an object named `tidy2a`.
This exercise requires multiple steps and you may need to look at the help documentation.
`r hide("Hint 1")`
`data |> pivot_longer() |> separate()`
`r unhide()`
`r hide("Hint 2")`
`into = c("col1", "col2")`
`r unhide()`
```{r webex.hide = TRUE}
tidy2a <- wide2 |>
pivot_longer(cols = q1_sat:q5_rec,
names_to = "question",
values_to = "response") |>
separate(col = "question", into = c("question", "category"))
```
```{r webex.hide = "Alternative solution"}
# combine pivot_longer and separate by setting two values for names_to
# must include names_sep to determine how to separate the column names
tidy2a <- wide2 |>
pivot_longer(cols = q1_sat:q5_rec,
names_to = c("question", "category"),
names_sep = "_",
values_to = "response")
```
#### Tidy 2b
The second option is to treat the satisfaction and recommendation scores as two distinct variables. This only makes sense if the satisfaction and recommendation scores for each question number are related to each other (e.g., q1 is about the same thing for both questions), making them part of the same observation.
This version should also have five variables, but it won't be fully long-form, it'll be a slight mix of the two that we're going to call "semi-long". The variables in the semi-long version will be `id`, `repeat`, `question` (the question number), `sat` (the response for the satisfaction question), and `rec` (the response for the recommendation question).
* How many **observations** should the semi-long version of `wide2` have? `r fitb(250)`
`r hide("Explain this answer")`
There will be 250 rows of data because, just like `tidy1`, each participant will have 5 rows, one for each of the five questions. The different responses to the satisfaction and recommendation questions are in different variables.
`r unhide()`
This also takes multiple steps.
`r hide("Hint 1")`
You can reuse the code from `tidy2a`, you just need to add on an extra line that makes the data slightly wider.
`r unhide()`
`r hide("Hint 2")`
`data |> pivot_longer() |> separate() |> pivot_wider()`
`r unhide()`
```{r webex.hide = TRUE}
tidy2b <- wide2 |>
pivot_longer(cols = q1_sat:q5_rec,
names_to = "question",
values_to = "response") |>
separate(col = "question", into = c("question", "category")) |>
pivot_wider(names_from = "category", values_from = "response")
```
### Analysis and visualisation
Using `group_by()` and `summarise()`, calculate the mean score for each participant for both satisfaction and recommendation. Do this for both versions of the dataset so that you can see how the structure of the dataset changes the approach you need to take.
```{r eval = FALSE, webex.hide = TRUE}
tidy2a |>
group_by(id, category) |>
summarise(mean = mean(response),
.groups = "drop")
tidy2b |>
group_by(id) |>
summarise(mean_satisfaction = mean(sat),
mean_rec = mean(rec))
```
Replicate the following:
#### Plot 1
Scatterplot showing the relationship between satisfaction and recommendation scores, by whether the patient is a repeat patient.
`r hide("Hint")`
`geom_jitter()`
`r unhide()`
```{r echo = FALSE, message=FALSE, warning=FALSE}
ggplot(tidy2b, aes(x = sat, y = rec, colour = repeat_patient)) +
geom_jitter() +
geom_smooth(method = "lm") +
labs(x = "Satisfaction score", y = "Recommendation score", title = "Satisfaction and recommendation scores") +
theme_classic()
```
```{r webex.hide = TRUE, eval = FALSE}
ggplot(tidy2b, aes(x = sat, y = rec, colour = repeat_patient)) +
geom_jitter() +
geom_smooth(method = "lm") +
labs(x = "Satisfaction score", y = "Recommendation score", title = "Satisfaction and recommendation scores") +
theme_classic()
```
#### Plot 2
Boxplots showing satisfaction and recommends scores for new and repeat patients separately.
```{r echo = FALSE}
ggplot(tidy2a, aes(x = repeat_patient, y = response, fill = repeat_patient)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~category)+
theme_bw() +
scale_fill_brewer(palette = "Dark2")
```
```{r webex.hide = TRUE, eval = FALSE}
ggplot(tidy2a, aes(x = repeat_patient, y = response, fill = repeat_patient)) +
geom_boxplot(show.legend = FALSE) +
facet_wrap(~category)+
theme_bw() +
scale_fill_brewer(palette = "Dark2")
```
#### Plot 3
Histogram showing the distribution of all responses, across questions and categories.
```{r echo = FALSE}
ggplot(tidy2a, aes(x = response)) +
geom_histogram(binwidth = 1, colour = "black", fill = "Grey") +
labs(x = "Responses across all questions and categories") +
theme_bw()
```
```{r webex.hide = TRUE, eval = FALSE}
ggplot(tidy2a, aes(x = response)) +
geom_histogram(binwidth = 1, colour = "black", fill = "Grey") +
labs(x = "Responses across all questions and categories") +
theme_bw()
```
If your head hurts a bit at this point, rest assured it's absolutely normal. As we said at the start, reshaping and tidying data is a conceptual leap and there's no shortcut to the fact it just takes a bit of time and practice with different datasets - you will get there eventually!
## Glossary {#sec-glossary-tidy -}
```{r, echo = FALSE}
glossary_table()
```
## Further resources {#sec-resources-tidy -}
- [Data tidying cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf)
- [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)
- [Chapter 5: Tidy Data](https://r4ds.hadley.nz/data-tidy)) in *R for Data Science*
- [Chapter 4.3: Pipes](https://r4ds.hadley.nz/workflow-style.html#sec-pipes) in *R for Data Science*