{{< include _setup.qmd >}}
# Visualization {#sec-viz}
::: {.callout-note title="learning goals"}
- Analyze the principles behind informative visualizations
- Incorporate visualization into an analysis workflow
- Learn to make "the design plot": a standard visualization of experimental data
- Select different visualizations of variability and distribution
- Connect visualization concepts to measurement principles
:::
What makes visualizations so useful, and what role do they play in the experimenter's toolkit?
Simply put, data visualization is the act of "making the invisible visible."
Our visual systems are remarkably powerful pattern detectors, and relationships that aren't at all clear when scanning through rows of raw data can immediately jump out at us when presented in an appropriate graphical form [@zacks2020designing].
Good visualizations aim to deliberately harness this power and put it to work at every stage of the research process, from the quick sanity checks we run when first reading in our data to the publication-quality figures we design when we are ready to communicate our findings.
Yet, our powerful pattern detectors can also be a liability; if we're not careful, we can easily be fooled into seeing patterns that are unreliable or even misleading.
As psychology moves into an era of bigger data and more complex behaviors, we become increasingly reliant on **data visualization literacy**\index{data visualization literacy} [@borner2019data] to make sense of what is going on. Further, as a researcher reporting about your data, creating appropriate visualizations that are aligned with your analyses (as well as your design and preregistration) is an important part of [transparency]{.smallcaps} and [bias reduction]{.smallcaps} in your reporting.
::: {.callout-note title="case study"}
### Mapping a pandemic {-}
In 1854, a deadly outbreak of cholera\index{cholera} was sweeping through London.
The scientific consensus at the time was that diseases like cholera\index{cholera} spread through breathing poisonous and foul-smelling vapors, an idea known as the "miasma theory" [@halliday2001death].
An obstetrician and anesthesiologist named John Snow\index{Snow, John} proposed an alternative theory:\newpage
rather than spreading through foul air, he thought that cholera\index{cholera} was spreading through a polluted water supply [@snow1855mode].\index{Snow, John}
To make a public case for this idea, he started counting cholera deaths.
He marked each case on a map of the area and indicated the locations of the water pumps for reference.
Furthermore, a boundary could be drawn around the region closest to each water pump, a technique now known as a Voronoi diagram.\index{Voronoi diagram}
The resulting illustration clearly reveals that cases clustered around an area called Golden Square, which received water from a pump on Broad Street (@fig-viz-cholera).
Although the precise causal role of these maps in Snow's own thinking is disputed, and it is likely that he produced them well after the incident [@brody2000map], they nonetheless played a significant role in the history of data visualization [@friendly2021history].
![Mapping out a cholera epidemic [@snow1854report]. The dotted line shows the region for which the Broad Street pump is nearest.](images/viz/snow_cholera_voronoi.png){#fig-viz-cholera width="45%" fig-alt="An old map of Broad Street, with bars for number of cases at each address and dotted line around the area of most bars."}
\vspace{-1em}
Nearly two centuries later, as the COVID-19 pandemic\index{COVID-19 pandemic} swept through the world, governmental agencies like the Centers for Disease Control and Prevention (CDC)\index{Centers for Disease Control and Prevention (CDC)} produced maps of the outbreak that became much more familiar (@fig-viz-covid).
![A map showing the counts of COVID\index{COVID-19 pandemic} deaths by state since March 2020 as of November 2024 (from the CDC\index{Centers for Disease Control and Prevention (CDC)} COVID Data Tracker, <https://covid.cdc.gov/covid-data-tracker>). _Usage does not constitute endorsement or recommendation by the US Government, Department of Health and Human Services, or Centers for Disease Control and Prevention._](images/viz/cdc-covid-map.png){#fig-viz-covid width="70%" fig-alt="A map of the US with each state's color indicating number of COVID deaths. TX, FL, CA, NY have darkest colors."}
\vspace{-1em}
These maps make abstract statistics visible: by assigning darker colors to higher cumulative death rates, we can see at a glance which areas have been most affected.
And we're not limited by the spatial layout of a map.
We're now also used to seeing the horizontal axis correspond to *time* and the vertical axis correspond to some value at that time.
Curves like the following, showing the weekly counts of new deaths, allow us to see other patterns, like the *rate of change*.
Even though more and more cases accumulate every day, we can see at a glance the different "waves" of cases and when they peaked (@fig-viz-cases).
![Weekly counts of new reported COVID\index{COVID-19 pandemic} deaths in the US between March 2020 and November 2024 (from the CDC\index{Centers for Disease Control and Prevention (CDC)} COVID Data Tracker, <https://covid.cdc.gov/covid-data-tracker>). _Usage does not constitute endorsement or recommendation by the US Government, Department of Health and Human Services, or Centers for Disease Control and Prevention._](images/viz/cdc-covid-trend.png){#fig-viz-cases width="65%" fig-alt="A plot of weekly COVID new deaths over time (8/2020 to 11/2024), with several waves where hospitalizations peak."}
\vspace{-1em}
While these visualizations capture purely descriptive statistics, we often want our visualizations to answer more specific questions.
For example, we may ask about the effectiveness of vaccinations: How do case rates differ across vaccinated and unvaccinated populations?
In this case, we may talk about "breaking out" a curve by some other variable, like vaccination status (@fig-viz-cases2).
![Rates of COVID\index{COVID-19 pandemic} cases by vaccination status (from the CDC\index{Centers for Disease Control and Prevention (CDC)} COVID Data Tracker, <https://covid.cdc.gov/covid-data-tracker>). _Usage does not constitute endorsement or recommendation by the US Government, Department of Health and Human Services, or Centers for Disease Control and Prevention._](images/viz/vaccination.png){#fig-viz-cases2 width="55%" fig-alt="A plot of COVID cases over time: curve for vaccinated individuals stays low; for unvaccinated higher, rises rapidly 8/2021."}
\vspace{-1em}
From this visualization, we can see that unvaccinated individuals are about six times more likely to test positive. At the same time, these visualizations were produced using *observational* data, which makes it challenging to draw causal inferences.
For example, people were not randomly assigned to vaccination conditions, and those who avoided vaccination may differ in other ways from those who sought it out.
Additionally, you may have noticed that these visualizations typically do not give a sense of the raw data, the sample sizes of each group, or uncertainty about the estimates.
In this chapter, we will explore how to use visualizations to communicate the results of carefully controlled psychology experiments, which license stronger causal inferences.
:::
## Basic principles of (confirmatory) visualization
In this section, we begin by introducing a few simple guidelines to keep in mind when making informative visualizations in the context of experimental psychology.^[For the purposes of understanding the examples in this chapter, it should be sufficient to work through the tutorials on data manipulation and visualization in @sec-tidyverse and @sec-ggplot.]
Remember that our needs may be distinct from those of other fields, such as journalism or public policy.
You may have seen beautiful and engaging full-page graphics with small print and a wealth of information.
The art of designing and producing these graphics is typically known as **infoviz**\index{infoviz} and should be distinguished from what we call **statistical visualization** [@gelman2013].
Roughly, infoviz\index{infoviz} aims to construct rich and immersive worlds to visually explore: a reader can spend hours poring over the most intricate graphics and continue to find new and intriguing patterns.
Statistical visualization, on the other hand, aims to crisply convey the logic of a specific inference at a glance.
These visualizations are the production-ready figures that anchor the results section of a paper and accompany the key, preregistered analyses of interest.
In this section, we review several basic principles of making statistical visualizations.
We then return below to the role of visualization in more exploratory analyses.
### Principle 1: Show the design
There are so many different kinds of graphs (bar graphs, line graphs, scatter plots, and pie charts) and so many different possible attributes of those graphs (colors, sizes, line types).
How do we begin to decide how to navigate these decisions?
The first principle guiding good statistical visualizations is to *show the design* of your experiment.
The first confirmatory plot you should have in mind for your experiment is the **design plot**\index{design plot}.
Analogous to the "default" or "saturated" model in @sec-models, the design plot\index{design plot} should show the key dependent variable of the experiment, broken down by all of the key manipulations.
Critically, design plots\index{design plot} should neither omit particular manipulations because they didn't yield an effect or include extra covariates because they seemed interesting after looking at the data.
Both of these steps are the visual analogue of $p$-hacking!\index{p-hacking} Instead, the design plot\index{design plot} is the "preregistered analysis" of your visualization: it illustrates a first look at the estimated causal effects from your experimental manipulations. In the words of @coppock2019, "Visualize as You Randomize"!
It can sometimes be a challenge to represent the full pattern of manipulations from an experiment in a single plot. Below we give some tricks for maximizing the legible information in your plot. But if you have tried these and the design plot\index{design plot} still looks crowded and messy, that could indicate that your experiment is manipulating too many things at once!
There are strong (unwritten) conventions about how your confirmatory analysis is expected to map onto graphical elements, and following these conventions can minimize confusion.
Start with the variables you manipulate, and make sure they are clearly visible.
Conventionally, the primary manipulation of interest (e.g., condition) goes on the x-axis, and the primary measurement of interest (e.g., responses) goes on the y-axis.
Other critical variables of interest (e.g., secondary manipulations and demographics) are then assigned to other "visual variables" (e.g., color, shape, or size).
::: {.callout-note title="code"}
The visualization library `ggplot` (see @sec-ggplot) makes the mapping from variables in the data to visual elements explicit. Part of a ggplot call is an `aes()` (short for aesthetics) mapping:
```{r, opts.label='code'}
aes(
x = ...,
y = ...,
color = ...,
linetype = ...,
)
```
The aesthetics argument serves as a statement of how data are mapped to "marks" on the plot. This transparent mapping makes it very easy to explore different plot types by changing that `aes()` statement, as we'll see below.
:::
As an example, we will consider the data from @stiller2015 that we explored back in @sec-models.
Because this experiment was a developmental study, the primary independent variable of interest was the age group of participants (ages two, three, or four).
So age gets assigned to the horizontal (x) axis.
The dependent variable is accuracy: the proportion of trials that a participant made the correct response (out of four trials).
So accuracy goes on the vertical (y) axis.
Now, we have two other variables that we might want to show: the condition (experimental vs control) and the type of stimuli (houses, beds, and plates of pasta).
When we think about it, though, only condition is central to exposing the design.
While we might be interested in whether some types of stimuli are systematically easier or harder than others, condition is more central for understanding the *logic* of the study.
\clearpage
::: {.callout-note title="code"}
As a reminder, here's our code for loading the @stiller2015 data:
```{r viz-prepost1, echo=TRUE}
repo <- "https://raw.githubusercontent.com/langcog/experimentology/main"
sgf <- read_csv(file.path(repo, "data/tidyverse/stiller_scales_data.csv")) |>
mutate(age_group = cut(age, 2:5, include.lowest = TRUE),
condition = condition |>
fct_recode("Experimental" = "Label", "Control" = "No Label"))
sgf_cond_means <- sgf |>
group_by(condition, age_group) |>
summarize(rating = mean(correct))
```
:::
### Principle 2: Facilitate comparison
![Principles of visual perception can guide visualization choices. Based on @mackinlay1986 [@cleveland1984].](images/viz/viz_hierarchy.png){#fig-viz-hierarchy .column-margin fig-alt="A diagram of visual features listed from more to less accurate: position, length, angle/slope, area, volume, color/density."}
Now that you've mapped elements of your design to the figure's axes, how do you decide which graphical elements to display?
You might think: well, in principle, these assignments are all arbitrary anyway.
As long as we clearly label our choices, it shouldn't matter whether we use lines, points, bars, colors, textures, or shapes.
It's true that there are many ways to show the same data.
But being thoughtful about our choices can make it much easier for readers to interpret our findings.
The second principle of statistical visualizations is to *facilitate comparison* along the dimensions relevant to our scientific questions.
It is easier for our visual system to accurately compare the location of elements (e.g., noticing that one point is a certain distance away from another) than to compare their areas or colors (e.g., noticing that one point is bigger or brighter than another).
@Fig-viz-hierarchy shows an ordering of visual variables based on how accurate our visual system is in making comparisons.
For example, we *could* start by plotting the accuracy of each age group as colors (@fig-viz-prepost2).
```{r fig-viz-prepost2}
#| fig-cap: A first visualization of the @stiller2015 data.
#| fig-alt: A plot of condition (experimental/control) against age group (2-3, 3-4, 4-5); color of each cell corresponds to accuracy.
#| cap-location: margin
#| out-width: 65%
ggplot(sgf_cond_means, aes(x = age_group, y = condition, fill = rating)) +
geom_tile() +
labs(x = "Age group", y = "Condition", fill = "Rating")
```
\clearpage
::: {.callout-note title="code"}
To make this (bad) visualization, we used a `ggplot` function called `geom_tile()`.
```{r fig-viz-prepost2, opts.label='code'}
```
`geom_tile()` is commonly used to make heat maps (<https://en.wikipedia.org/wiki/Heat_map>): for each value of some pair of variables (x, y), it shows a color representing the magnitude of a third variable (z).
While a heat map is a silly way to visualize the @stiller2015 data, consider using `geom_tile()` when you have a pair of continuous variables, each taking a large range of values.
In these cases, bar plots and line plots tend to get extremely cluttered, making it hard to see the relationship between the variables.
Heat maps help these relationships to pop out as clear "hot" and "cold" regions.
For example, in @barnett2022, a heatmap was used to show a specific range of parameters where an effect of interest emerged (see @fig-viz-heatmap).
![A heatmap showing a specific range of continuous parameters where an effect emerged. @barnett2022, Figure 3 (licensed under CC BY 4.0).](images/viz/viz_heatmap_example.png){#fig-viz-heatmap width="35%" fig-alt="A heatmap of stick length against agent bias; there is a bright spot in the bottom right-hand corner."}
:::
Or we could plot the accuracies as sizes/areas (@fig-viz-prepost3).
```{r fig-viz-prepost3}
#| fig-cap: Iterating on the Stiller data using size.
#| fig-alt: A plot the same as 15.6 but instead of 6 colored cells there are 6 dots of various sizes; it's difficult to compare values.
#| cap-location: margin
#| out-width: 65%
ggplot(sgf_cond_means, aes(x = age_group, y = condition, size = rating)) +
geom_point() +
labs(x = "Age group", y = "Condition", size = "Rating")
```
\clearpage
::: {.callout-note title="code"}
To make this (bad) visualization, we mapped the rating to the `size` aesthetic in our `aes()` call.
```{r fig-viz-prepost3, opts.label='code'}
```
:::
These plots allow us to see that one condition is (qualitatively) bigger than others, but it's hard to tell how much bigger.
Additionally, this way of plotting the data places equal emphasis on age and condition, but we may instead have in mind particular contrasts, like the *change* across ages and how that change differs across conditions.
An alternative is to show six bars: three on the left showing the "experimental" phase and three on the right showing the "control" phase.
Maybe the age groups then are represented as different colors, as in @fig-viz-prepost4.
```{r fig-viz-prepost4}
#| fig-cap: A bar graph of the Stiller data.
#| fig-alt: A plot of Mean accuracy against Condition; Age groups represented by bars of different colors.
#| cap-location: margin
#| out-width: 80%
ggplot(sgf_cond_means, aes(x = condition, y = rating, fill = age_group)) +
geom_col(position = "dodge") +
labs(x = "Condition", y = "Mean accuracy", fill = "Age group")
```
::: {.callout-note title="code"}
We make bar plots using the `ggplot` function `geom_col()`.
By default, it creates "stacked" bar plots, where all values associated with the same x value (here, `condition`) get stacked up on top of one another.
Stacked bar plots can be useful if, for example, you're plotting proportions that sum up to 1, or want to show how some big count is broken down into subcategories.
It's also common to use `geom_area()` for this purpose, which connects adjacent regions.
But the more common bar plot used in psychology puts the bars next to one another, or "dodges" them.
To accomplish this, we use the `position = "dodge"` argument:
```{r fig-viz-prepost4, opts.label='code'}
```
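For contrast, here is a minimal sketch (ours, not from the original figure code) of the stacked default, counting correct and incorrect responses within each condition of the `sgf` data:
```{r, opts.label='code'}
# A minimal sketch of the stacked default: counts of correct vs incorrect
# responses stacked within each condition
sgf |>
  count(condition, correct) |>
  ggplot(aes(x = condition, y = n, fill = factor(correct))) +
  geom_col() +
  labs(x = "Condition", y = "Count", fill = "Correct")
```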
:::
\clearpage
This plot is slightly better: it's easier to compare the heights of bars than the "blueness" of squares, and mapping age to color draws our eye to those contrasts.
However, we can do even better by noticing that our experiment was designed to test an *interaction*.\index{interaction effect}
The statistic of interest is a difference of differences.
To what extent does the developmental change in performance on the experimental condition\index{experimental condition} differ from developmental change in performance on the control condition?
Some researchers have become proficient at reading off interactions\index{interaction effect} from bar plots, but doing so requires a complex set of eye movements.
We have to look at the pattern across the bars on the left, and then jump over to the bars on the right, and implicitly judge one difference against the other: the actual statistic isn't explicitly shown anywhere!
What could help facilitate this comparison?
Consider the line plot in @fig-viz-prepost5.
```{r fig-viz-prepost5}
#| fig-cap: A line graph of the Stiller data promotes comparison.
#| fig-alt: A plot of Mean Accuracy against Age group; there are two lines, corresponding to the two conditions.
#| cap-location: margin
#| out-width: 80%
ggplot(sgf_cond_means, aes(x = age_group, y = rating, color = condition)) +
geom_point() +
geom_line(aes(group = condition)) +
labs(x = "Age group", y = "Mean accuracy", color = "Condition")
```
::: {.callout-note title="code"}
Using a combination of `geom_point()` and `geom_line()`:
```{r fig-viz-prepost5, opts.label='code'}
```
:::
The interaction\index{interaction effect} contrast we want to interpret is highlighted visually in this plot.
It is much easier to compare slopes of lines than mentally compute a difference of differences between bars.
Here are a few corollaries of this principle (adapted from [a presentation by Karl Broman](https://biostat.wisc.edu/~kbroman/presentations/IowaState2013/graphs_combined.pdf)).
* It is easier to compare values that are *adjacent* to one another. This is especially important when there are many different conditions included on the same plot. If particular sets of conditions are of theoretical interest, place them close to one another. Otherwise, sort conditions by a meaningful value (rather than alphabetically, which is usually the default for plotting software).
* When possible, color-code labels and place them directly next to data rather than in a separate legend. Legends force readers to glance back and forth to remember what different colors or lines mean.
* When making histograms or density plots, it is challenging to compare distributions when they are placed side by side. Instead, facilitate comparison by vertically aligning the distributions, or by making them transparent and placing them on the same axes (see the sketch at the end of this list).
* If the scale makes it hard to see important differences, consider transforming the data (e.g., taking the logarithm).
* When making bar plots, be very careful about the vertical y-axis. A classic "misleading visualization" mistake is to cut off the bottom of the bars by placing the endpoint of the y-axis at some arbitrary value near the smallest data point. This is misleading because people interpret bar plots in terms of the relative *area* of the bars (i.e., the amount of ink taken up by the bar), not just their absolute y-values.
* If a key variable from your design is mapped to color, choose the color scale carefully. For example, if the variable is binary or categorical, choose visually distinct colors to maximize contrast (e.g., black, blue, and orange). If the variable is ordinal\index{ordinal scale} or continuous, use a color gradient. If there is a natural midpoint (e.g., if some values are negative and some are positive), consider using a diverging scale (e.g., different colors at each extreme). Remember also that a portion of your audience may be colorblind.^[Palettes like `viridis` have been designed to be colorblind-friendly and also perceptually uniform (i.e., the perceived difference between 0.1 and 0.2 is approximately the same as the difference between 0.8 and 0.9).]
<!-- Finally, if the same manipulation or variable appears across multiple figures in your paper, keep the color mapping consistent: it is confusing if "red" means something different from figure to figure. -->
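To make the distribution-comparison corollary concrete, here is a minimal sketch (ours, assuming the `sgf` data frame loaded above): it computes participant-level accuracies and overlays one transparent density per condition on the same axes.
```{r, opts.label='code'}
# A minimal sketch: overlaid transparent densities on shared axes,
# using participant-level accuracies computed from the sgf data above
sgf |>
  group_by(condition, subid) |>
  summarize(acc = mean(correct)) |>
  ggplot(aes(x = acc, fill = condition)) +
  geom_density(alpha = 0.4) +
  labs(x = "Participant accuracy", fill = "Condition")
```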
### Principle 3: Show the data
Looking at older papers, you may be alarmed to notice how little information is contained in the graphs.
The worst offenders might show just two bars, representing average values for two conditions.
This kind of plot adds very little beyond a sentence in the text reporting the means, but it can also be seriously misleading.
It hides real variation in the data, making a noisy effect based on a few data points look the same as a more systematic one based on a larger sample.
Additionally, it collapses the *distribution* of the data, making a multimodal distribution look the same as a unimodal one.
The third principle of modern statistical visualization is to *show the data* and visualize variability in some form.
\clearpage
The most minimal form of this principle is to *always include error bars*.^[And be sure to tell the reader what the error bars represent---a 95% confidence interval?\index{confidence interval (CI)} A standard error of the mean?---without this information, error bars are hard to interpret (see the [Depth]{.smallcaps} box below).]
Error bars turn a purely descriptive visualization into an inferential one.
They represent a minimal form of uncertainty about the possible statistics that might have been observed, not just the one that was actually observed.
@Fig-viz-show-data1 shows the data with (bootstrapped) error bars.
<!-- As such, they allow the viewer to interpret the means as *estimates*. -->
```{r fig-viz-show-data1}
#| fig-cap: Error bars (95% CIs) added to the Stiller data line graph.
#| fig-alt: A plot the same as 15.10 but with error bars at each point.
#| cap-location: margin
#| out-width: 80%
sgf_cond_boot <- sgf |>
group_by(condition, age_group) |>
tidyboot::tidyboot_mean(correct)
ggplot(sgf_cond_boot, aes(x = age_group, y = empirical_stat, color = condition)) +
geom_point() +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0) +
geom_line(aes(group = condition)) +
labs(x = "Age group", y = "Mean accuracy", color = "Condition")
```
::: {.callout-note title="code"}
A common problem arises when we want to add error bars to a dodged bar plot.
Naively, we'd expect we could just dodge the error bars in the same way we dodged the bars themselves:
```{r, opts.label='code'}
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), position = "dodge")
```
But this doesn't work! The reason is somewhat technical: the error bars are much narrower than the bars, so we need to manually specify how much to dodge them with `position_dodge()`:
```{r, opts.label='code'}
geom_col(position = position_dodge()) +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
position = position_dodge(width = 0.9))
```
This does the trick!
:::
But we can do even better.
By overlaying the distribution of the actual data points on the same plot, we can give the reader information about not just the statistical inferences but also the underlying data supporting those inferences.
In the case of the @stiller2015 study, data points for individual trials are binary (correct or incorrect).
It's technically possible to show individual responses as dots at 0s and 1s, but this doesn't tell us much (we'll just get a big clump of 0s and a big clump of 1s).
The question to ask yourself when "showing the data" is: What are the theoretically meaningful *units* of variation in the data?
This question is closely related to our discussion of mixed-effects models in @sec-models, when we considered which random effects we should include.
Here, a reader is likely to wonder how much variance was found across *different children* in a given age group.
To show such variation, we aggregate to calculate an accuracy score for each participant.^[While participant-level variation is a good default, the relevant level of aggregation may differ across designs. For example, collective behavior studies may choose to show the data point for each *group*. This choice of unit is also important when generating error bars: if you have a small number of participants but many observations per participant, you are faced with a choice. You may either bootstrap over the flat list of all individual observations (yielding very small error bars), or you may first aggregate within participants (yielding larger error bars that account for the fact that repeated observations from the same participant are not independent).]
There are many ways of showing the resulting distribution of participant-level data.
For example, a boxplot shows the median (a horizontal line) in the center of a box extending from the lower quartile (25%) to the upper quartile (75%).
Lines then extend out to the biggest and smallest values (excluding outliers, which are shown as dots).
@Fig-viz-show-data2 gives the boxplots for the Stiller data, which don't look that informative---perhaps because of the coarseness of individual participant averages due to the small number of trials.
\vspace{3.5em}
```{r fig-viz-show-data2}
#| fig-cap: A boxplot of the Stiller data.
#| fig-alt: A plot with same axes as 15.11 but instead of lines there are boxes with whiskers; color of boxes corresponds to condition.
#| cap-location: margin
#| out-width: 80%
sgf_subj_means <- sgf |>
group_by(condition, age_group, subid) |>
summarize(rating = mean(correct))
ggplot(sgf_subj_means, aes(x = age_group, y = rating, color = condition)) +
geom_boxplot(alpha = 0.8) +
labs(x = "Age group", y = "Mean accuracy", color = "Condition")
```
::: {.callout-note title="code"}
In `ggplot`, we can make box plots using `geom_boxplot()`:
```{r, opts.label='code'}
geom_boxplot(alpha = 0.8)
```
A common problem to run into is that `geom_boxplot()` requires the variable assigned to `x` to be discrete.
If you have discrete levels of a numeric variable (e.g., age groups), make sure you've actually converted that variable to a `factor`.
Otherwise, if it's still coded as `numeric`, `ggplot` will collapse all of the levels together!
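For example, a hypothetical one-line fix before plotting:
```{r, opts.label='code'}
# Hypothetical fix: convert the numeric grouping variable to a factor
# so that geom_boxplot() treats it as discrete
mutate(age_group = factor(age_group))
```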
:::
\clearpage
It is also common to show the raw data as jittered values with low transparency.
In @fig-viz-show-data4, we jitter the points because many participants have the same numbers (e.g., 50%), and if they overlap it is hard to see how many points there are.
```{r fig-viz-show-data4}
#| fig-cap: Jittered points representing the data distribution of the Stiller data.
#| fig-alt: A plot the same as 15.10 but with clouds of points; points cluster around possible accuracy values for an individual.
#| cap-location: margin
#| out-width: 80%
sgf_subj_means <- sgf |>
group_by(condition, age_group, subid) |>
summarize(rating = mean(correct))
sgf_subj_ci <- sgf_subj_means |>
group_by(condition, age_group) |>
tidyboot::tidyboot_mean(rating) |>
rename(rating = empirical_stat)
ggplot(sgf_subj_ci, aes(x = age_group, y = rating, color = condition)) +
geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper)) +
geom_line(aes(group = condition)) +
geom_jitter(data = sgf_subj_means, alpha = 0.25, width = 0.1, height = .03) +
labs(x = "Age group", y = "Mean accuracy")
```
::: {.callout-note title="code"}
Adding the jittered points is simple using `geom_jitter()`, but we are starting to have a fairly complex plot, so maybe it's worth taking stock of how we get there.
To plot both *condition* means and *participant* means, we need to create two different data frames. Here `sgf_subj_means` is a data frame of means for each participant; `sgf_subj_ci` is a data frame with means and confidence intervals *across* participants. For this purpose, we use the `tidyboot` package [@braginsky2018] and the `tidyboot_mean()` function, which gives us bootstrapped 95% confidence intervals for the means.
```{r fig-viz-show-data4, opts.label='code'}
```
The most noteworthy aspect of this code is that `geom_jitter()` isn't just using a different aesthetic; it also takes a different dataframe altogether! Mixing dataframes can be an important tool for creating complex plots.
:::
\clearpage
Perhaps the format that takes this principle the furthest is the so-called raincloud plot [@allen2019raincloud] shown in @fig-viz-raincloud.
A raincloud plot combines the raw data ("rain") with a smoothed density ("cloud") and a boxplot giving the median and quartiles of the distribution.
```{r fig-viz-raincloud}
#| fig-cap: A raincloud plot of the Stiller data.
#| fig-alt: A plot with axes flipped to show densities as horizontal ridges with small boxplots underneath and clouds of jittered points.
#| cap-location: margin
#| fig-height: 6
#| out-width: 80%
library(ggridges)
library(ggstance)
ggplot(sgf_subj_means, aes(y = age_group, x = rating, color = condition)) +
geom_density_ridges(aes(fill = condition), alpha = 0.2, scale = 0.7,
jittered_points = TRUE, point_alpha = 0.7,
position = position_raincloud(width = 0.05, height = 0.15,
ygap = 0.1)) +
geom_boxploth(width = 0.1, alpha = 0.2, outlier.shape = NA, show.legend = FALSE) +
scale_y_discrete(expand = expansion(mult = c(0.2, 0.4))) +
guides(fill = "none", color = guide_legend(reverse = TRUE)) +
labs(x = "Mean accuracy", y = "Age group", color = "Condition") +
theme(legend.position = "top")
```
::: {.callout-note title="code"}
This raincloud plot requires two additional plotting packages: `ggridges` [@wilke2023] for the densities and `ggstance` [@henry2022] for the horizontal boxplots.
```{r fig-viz-raincloud, opts.label='code'}
```
:::
::: {.callout-note title="depth"}
### Visualizing uncertainty with error bars {-}
One common misconception is that error bars are a measure of variance *in the data*, like the standard deviation of the response variable.
Instead, they typically represent a measure of precision extracted from the statistical model.
In older papers, for example, it was common to use the standard error of the mean (SEM; see @sec-inference).
Remember that this is not the standard deviation of the data distribution but of the *sampling distribution*\index{sampling distribution} of the mean that is being estimated.
Given the central limit theorem,\index{central limit theorem} which tells us that this sampling distribution is asymptotically normal, it was straightforward to estimate the standard error analytically using the empirical standard deviation of the data divided by the square root of the sample size: `sd(x) / sqrt(length(x))`.
Error bars based on the SEM often looked misleadingly small, as they only represent a 68% interval of the sampling distribution\index{sampling distribution} and go to zero quickly as a function of sample size.
As a result, it became more common to show the 95\% confidence interval\index{confidence interval (CI)} instead: [mean $-$ 1.96 $\times$ SEM, mean $+$ 1.96 $\times$ SEM].
While these analytic equations remain common, an increasingly popular alternative is to *bootstrap* confidence intervals (see the [Depth]{.smallcaps} box in @sec-inference for more on bootstrapping).
<!-- A deep theoretical understanding of the bootstrap technique is outside the scope of this text, but you can think of it as a way of deriving a sampling distribution from your dataset using *simulations* instead of mathematical derivations about the properties of the sampling distribution. -->
The bootstrap is a powerfully generic technique, especially when you want to show error bars for summary statistics that are more complex than means, where we do not have such convenient asymptotic guarantees and "closed-form" equations.
An example would be if you're working with a skewed response variable or a dataset with clear outliers and you want to estimate medians and quartiles.
Or, suppose you want to estimate proportions from categorical data, or a more ad hoc statistic like the AUC (area underneath the curve) in a hierarchical design where it is not clear how to aggregate across items or participants in a mixed-effects model.
Analytic estimators of confidence intervals\index{confidence interval (CI)} can in principle be derived for these statistics, subject to different assumptions, but it is often more transparent and reliable in practice to use the bootstrap.
As long as you can write a code snippet to compute a value from a dataset, you can use the bootstrap.
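As a minimal sketch (ours, not the `tidyboot` workflow used elsewhere in this chapter), here is a percentile bootstrap for a median, applied to a hypothetical skewed sample `x`:
```{r, opts.label='code'}
# A minimal percentile-bootstrap sketch for a median (hypothetical data x)
set.seed(1)
x <- rexp(50)  # hypothetical skewed sample
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))  # percentile 95% CI
```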
```{r fig-viz-boot}
#| output: false
#| fig-height: 2.25
#| fig-width: 6.3
boot <- sgf |>
group_by(condition, age_group, subid) |>
summarize(rating = mean(correct)) |>
group_by(condition, age_group) |>
tidyboot::tidyboot_mean(rating) |>
select(-n, -mean) |>
mutate(method = "Bootstrapped 95% CI")
ci_sem <- sgf |>
group_by(condition, age_group, subid) |>
summarize(rating = mean(correct)) |>
group_by(condition, age_group) |>
summarize(empirical_stat = mean(rating),
sem = sd(rating)/sqrt(length(rating)),
ci_lower = mean(rating) - sem,
ci_upper = mean(rating) + sem) |>
mutate(method = "Standard error")
ci_ci <- ci_sem |>
mutate(ci_lower = empirical_stat - 1.96 * sem,
ci_upper = empirical_stat + 1.96 * sem) |>
mutate(method = "Analytic 95% CI")
bind_rows(boot, ci_sem, ci_ci) |>
ggplot(aes(x = age_group, y = empirical_stat, color = condition)) +
geom_point() +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0) +
geom_line(aes(group = condition)) +
facet_wrap(~ method) +
labs(x = "Age group", y = "Mean accuracy", color = "Condition") +
theme(strip.text = element_text(size = rel(0.7)))
```
::: {#fig-viz-boot}
::: {.content-visible when-format="html"}
![](015-viz_files/figure-html/fig-viz-boot-1.png){width="80%" fig-alt="3 plots same as 15.11 but with various error bars: larger for analytic/bootstrapped 95% CI than for standard error."}
:::
::: {.content-visible when-format="pdf"}
![](015-viz_files/figure-pdf/fig-viz-boot-1.png){width="80%"}
:::
Three different error bars for the Stiller data: bootstrapped 95% confidence intervals (left), standard error of the mean (middle), and analytically computed confidence intervals (right).
:::
\vspace{-1em}
As we can see, the bootstrapped 95% CI looks similar to the analytic 95% CI derived from the standard error, except the upper and lower limits are slightly asymmetric (reflecting outliers in one direction or another).
Of course, the bootstrap is not a silver bullet and can be abused in particularly small samples.
This is because the bootstrap is fundamentally limited to the sample we run it on.
It can be expected to be reasonably accurate if the sample is reasonably representative of the population.
But at the end of the day, as they say, "There's no such thing as a free lunch."
In other words, we cannot magically pull more information out of a small sample without making additional assumptions about the data generating process.\index{data generating process}
:::
### Principle 4: Maximize information, minimize ink
![This figure uses a lot of ink to show three numbers, for a "ddi" of 0.2 [from the *Washington Post*, 1978, see @wainer1984display].](images/viz/bad-viz1.png){#fig-viz-bad .column-margin width="90%" fig-alt="A plot showing labeled 3D bars over 3 years; they have a Japanese flag pattern and are in front of an American flag backdrop."}
Now that we have the basic graphical elements in place to show our design and data, it might seem like the rest is purely a matter of aesthetic preference, like choosing a pretty color scheme or font. Not so.
There are well-founded principles to make the difference between an effective visualization and a confusing or obfuscating one.
Simply put, we should try to use the simplest possible presentation of the maximal amount of information: we should maximize the "data-ink ratio."
To calculate the amount of information shown, @tufte2001visual suggests a measure called the "data density index," the "numbers plotted per square inch."
The worst offenders have a very low density while also using a lot of excess ink (e.g., @fig-viz-bad and @fig-viz-bad2).
![This figure uses complicated 3D ribbons to compare distributions across four countries [from @roeder1994dna]. How could the same data have been presented more legibly?](images/viz/roeder_badviz.png){#fig-viz-bad2 .margin-caption fig-alt="2 plots showing binned frequency data for 4 countries using 3D ribbons in slightly different shades of gray."}
The defaults in modern visualization libraries like `ggplot` prevent some of the worst offenses but are still often suboptimal.
For example, consider whether the visual complexity introduced by the default grey background and grid lines in @fig-viz-themes0 is justified, or whether a more minimal theme would be sufficient.
```{r viz-themes0}
#| label: fig-viz-themes0
#| fig-cap: Standard "gray"-themed Stiller figure.
#| fig-alt: A plot same as 15.13 but with a grey background and vertical/horizontal guidelines; color legend on the right.
#| cap-location: margin
#| out-width: 85%
ggplot(sgf_subj_ci, aes(x = age_group, y = rating, color = condition)) +
geom_line(aes(group = condition)) +
geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper)) +
geom_jitter(alpha = 0.25, width = 0.05, height=.03, data = sgf_subj_means) +
scale_color_hue() +
labs(x = "Age group", y = "Mean accuracy", color = "Condition") +
theme_gray()
```
<!-- \clearpage -->
@Fig-viz-themes shows a slightly more "styled" version of the same plot with labels directly on the plot and a lighter-weight theme.^[See the `ggthemes` package [@arnold2023] for a good collection of themes.]
```{r viz-themes}
#| label: fig-viz-themes
#| fig-cap: "Custom-themed Stiller figure with direct labels."
#| fig-alt: A plot same as 15.18 but with no background/guidelines; lines directly labeled with their condition in corresponding colors.
#| cap-location: margin
#| fig-width: 5.5
#| out-width: 85%
library(directlabels)
ggplot(sgf_subj_ci, aes(x = age_group, y = rating, color = condition)) +
geom_line(aes(group = condition)) +
geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper)) +
geom_jitter(alpha = 0.25, width = 0.05, height=.03, data = sgf_subj_means) +
geom_dl(aes(label = condition),
method = list("last.points", dl.trans(x = x + 0.5), fontfamily = .font)) +
scale_x_discrete(expand = expansion(mult = c(0.1, 0.3))) +
labs(x = "Age group", y = "Mean accuracy") +
guides(color = "none")
```
::: {.callout-note title="code"}
To produce the plot in @fig-viz-themes, we've added a few styling elements including:
* The nice and minimal custom theme, with a larger font size.
* A more accessible color palette (`scale_colour_ptol()`) from the `ggthemes` package [@arnold2023].
* Direct labels using `geom_dl()` from the `directlabels` package [@hocking2023].
<!-- This last item is a bit trickier to implement. To do so, we select one set of points to label and feed these in as a filtered data frame. A few manual arguments customize the placement of the labels (you can spend hours tweaking this sort of thing by hand). -->
```{r, opts.label='code'}
geom_dl(aes(label = condition), method = list("last.points", dl.trans(x = x + 0.5)))
```
:::
Here are a few final tips for making good confirmatory visualizations:
* Make sure the font size of all text in your figures is legible and no smaller than other text in your paper (e.g., 10 pt). This change may require, for example, making the axis breaks sparser, rotating text, or changing the aspect ratio of the figure.
* Another important tool to keep in your visualization arsenal is the **facet plot**\index{facet plot}. When your experimental design becomes more complex, consider breaking variables out into a *grid* of facets instead of packing more and more colors and line styles onto the same axis (see the sketch after this list). In other words, while higher information density is typically a good thing, you want to aim for the sweet spot before a plot becomes too dense and confusing. Remember principle 2: when there is too much going on in every square inch, it is difficult to guide your reader's eye to the comparisons that actually matter, and spreading the data out across facets gives you additional control over the salient patterns.
* Sometimes these principles come into conflict, and you may need to prioritize legibility over, for example, showing all of the data. For example, suppose there is an outlier orders of magnitude away from the summary statistics. If the axis limits are zoomed out to show that point, then most of the plot will be blank space! It is reasonable to decide that it is not worth compressing the key statistical question of your visualization into the bottom centimeter just to show one point. It may suffice to truncate the axes and note in the caption that a single point was excluded.
* Fix the axis labels! A common mistake is to keep the default shorthand you used to name variables in your plotting software instead of more descriptive labels (e.g., "RT" instead of "Reaction Time"). Use consistent terminology for different manipulations and measures in the main text and figures. If anything might be unclear in the figure, explain it in the caption.
* Different audiences may require different levels of detail. Sometimes it is better to collapse over secondary variables (even if they are included in your statistical models) to control the density of the figure and draw attention to the key question of interest.
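As an illustration of the facet-plot tip above, here is a minimal sketch (ours) that spreads the Stiller conditions across side-by-side panels, using the participant-level means computed earlier:
```{r, opts.label='code'}
# A minimal facet-plot sketch: one panel per condition
# instead of one color per condition
ggplot(sgf_subj_means, aes(x = age_group, y = rating)) +
  geom_jitter(width = 0.1, height = 0.03, alpha = 0.5) +
  facet_wrap(vars(condition)) +
  labs(x = "Age group", y = "Mean accuracy")
```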
## Exploratory visualization
So far in this chapter we have focused on principles of *confirmatory* data visualization: how to make production-quality figures that convey the key preregistered analyses without hiding sources of variability or misleading readers about the reliability of the results.
Yet, this is only one role that data visualization plays when doing science.
An equally important role is called *exploratory visualization*: the more routine practice of understanding one's own data by visualizing it.
This role is analogous to the sense of exploratory data analyses discussed in @sec-prereg.
We typically do not preregister exploratory visualizations, and when we decide to include them in a paper they are typically in the service of a secondary argument (e.g., checking the robustness of an effect or validating that some assumption is satisfied).
This kind of visualization plays a ubiquitous role in a researcher's day-to-day activities.
While confirmatory visualization is primarily audience-driven and concerned with visual communication, exploratory visualization is first and foremost a "cognitive tool" for the researcher.
The first time we load in a new dataset, we start up a new feedback loop---we ask ourselves questions and answer them by making visualizations.
These visualizations then raise further questions and are often our best tool for debugging our code.
In this section, we consider some best practices for exploratory visualization.
### Examining distributional information
The primary advantage of exploratory visualization---the reason it is uniquely important for data science---is that it gives us access to holistic information about the distribution of the data that cannot be captured in any single summary statistic.
The most famous example is known as "Anscombe's quartet," a set of four datasets with identical statistics (@fig-viz-anscombe).
They have the same means, the same variances, the same correlation, the same regression line, and the same $R^2$ value.
Yet, when they are plotted, they reveal striking structural differences.
The first looks like a noisy linear relationship---the kind of idealized relationship we imagine when we imagine a regression line.
But the second is a perfect quadratic arc,\index{quadratic effect} the third is a perfectly noiseless line with a single outlier, and the fourth is nearly categorical: every observation except one shares exactly the same x-value.
```{r fig-viz-anscombe}
#| fig-cap: Anscombe's quartet [@anscombe1973graphs].
#| fig-alt: "4 plots with same line of best fit but different points: scattered; quadratic; linear with 1 outlier; all but 1 same value."
#| fig-width: 8
#| fig-height: 2.5
#| cap-location: margin
ans <- datasets::anscombe
ans_long <- ans |>
mutate(i = 1:n()) |>
pivot_longer(-i, names_to = c("name", "k"), names_sep = 1, values_to = "value") |>
pivot_wider()
ggplot(ans_long, aes(x = x, y = y)) +
facet_wrap(vars(k), nrow = 1,
labeller = as_labeller(c("1" = "I", "2" = "II",
"3" = "III", "4" = "IV"))) +
geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, colour = pal$grey, linewidth = 0.5) +
geom_point(alpha = 0.5, size = 1.5) +
scale_y_continuous(breaks = seq(5, 15, 5), limits = c(3, 15)) +
xlim(c(3, 19))
```
If our analyses are supposed to help us distinguish between different data-generating processes, corresponding to different psychological theories, it is clear that these four datasets would correspond to dramatically different theories even though they share the same statistics.
Of course, there are arbitrarily many datasets with the same statistics, and most of these differences don't matter (this is why they are called "summary" statistics, after all!).
@Fig-viz-datasaurus and @tbl-datasaurus show just how bad things can get when we rely on summary statistics.
When we operationalize a theory's predictions in terms of a single statistic (e.g., a difference between groups or a regression coefficient), we can lose track of everything else that may be going on.
Good visualizations force us to zoom out and take in the bigger picture.
<!-- # | fig-height: 6 -->
\clearpage
```{r fig-viz-datasaurus}
#| fig-cap: Originally inspired by the Datasaurus\index{Datasaurus} plot constructed by Alberto Cairo using DrawMyData (<http://robertgrantstats.co.uk/drawmydata.html>), we can construct an arbitrary number of different graphs with exactly the same statistics, such as the Datasaurus Dozen [@matejka2017same;@murray2021generating].
#| fig-alt: A scatterplot that looks like a dinosaur; a grid of 12 plots looking like many different shapes including star and circle.
#| fig-width: 8
#| fig-height: 8
#| out-width: 80%
#| cap-location: margin
dino <- ggplot(datasauRus::datasaurus_dozen |> filter(dataset == "dino"),
aes(x = x, y = y)) +
coord_equal(ratio = 0.7) +
geom_point(size = 0.7) +
theme_void() +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm"))
not_dino <- ggplot(datasauRus::datasaurus_dozen |> filter(dataset != "dino"),
aes(x = x, y = y)) +
facet_wrap(vars(dataset), ncol = 4) +
coord_equal(ratio = 0.7) +
geom_point(size = 0.7) +
theme(strip.text = element_blank(),
plot.margin = unit(c(0, 0, 0, 0), "cm"))
cowplot::plot_grid(dino, not_dino, nrow = 2, rel_heights = c(1, 3))
```
\vspace{-2em}
<!-- ::: {.margin-caption} -->
\footnotesize
```{r tbl-datasaurus}
#| tbl-cap: Summary statistics for the Datasaurus Dozen\index{Datasaurus} datasets [@matejka2017same].
datasauRus::datasaurus_dozen |>
group_by(dataset) |>
summarize(mean_x = mean(x), mean_y = mean(y),
sd_x = sd(x), sd_y = sd(y),
cor_xy = cor(x, y)) |>
mutate(across(c(starts_with("mean"), starts_with("sd")),
\(x) round(x, 1)),
cor_xy = round(cor_xy, 3))
```
\normalsize
\vspace{-1em}
<!-- ::: -->
::: {.callout-note title="accident report"}
### [Distributional] gorillas in our midst {-}
Many data scientists don't bother checking what their data looks like before proceeding to test specific hypotheses.
@yanai2020 cleverly designed an artificial dataset for their students to test for such oversight.
Each row of the dataset contained an individual's body mass index (BMI) and the number of steps they walked on a given day.
While the spreadsheet looked innocuous, the data was constructed such that simply plotting the raw data revealed a picture of a gorilla.
One group of 19 students was given an explicit set of hypotheses to test (e.g., about the relationship between BMI and steps).
Fourteen of these students failed to notice a gorilla, suggesting that they evaluated these hypotheses without ever visualizing their data.
Another group of 14 students were simply asked what, if anything, they could conclude (without being given explicit hypotheses).
More of these students apparently made the visualization, but five of them still failed to notice the gorilla (@fig-viz-gorilla)!
```{r fig-viz-gorilla}
#| output: false
#| fig-width: 4
#| fig-height: 4
# from helper/download_gorilla.R
go <- read_rds("data/viz/gorillas.rds") |>
as_tibble() |>
mutate(group = group |> fct_recode("Female" = "F", "Male" = "M"))
ggplot(go, aes(x = steps, y = bmi, colour = group)) +
geom_point(size = 0.5) +
labs(x = "Steps", y = "Body Mass Index", colour = "") +
theme(legend.position = c(.9, 1), legend.key.width = unit(.05, "lines"))
```
::: {#fig-viz-gorilla}
::: {.content-visible when-format="html"}
![](015-viz_files/figure-html/fig-viz-gorilla-1.png){width=35% fig-alt="A scatterplot of Body Mass Index against Steps with different colors corresponding to Male and Female; looks like a gorilla."}
:::
::: {.content-visible when-format="pdf"}
![](015-viz_files/figure-pdf/fig-viz-gorilla-1.png){width=35%}
:::
A dataset constructed by @yanai2020 that revealed a picture of a gorilla when the raw data were plotted.
:::
\vspace{-1em}
While it may not be surprising that a group of students would take the shortest path to completing their assignment, similar concerns have been raised in far more serious cases in which experienced researchers failed to notice obviously fraudulent data.
For example, when the Data Colada bloggers [-@datacolada] made a simple histogram of the car mileage data reported in @shu2012signing [released publicly by @kristal2020signing], they were immediately able to observe that it followed a perfectly uniform distribution, truncated at exactly 50,000 miles (@fig-viz-uniform).
Given a little thought, this pattern should be extremely puzzling.
Over a given period of time, we would typically expect something more bell-shaped: a small number of people will drive very little (e.g., 1,000 miles), a small number of people will drive a lot (e.g., 50,000 miles), and most people will fall between these tails.
So it is highly surprising to find exactly the same number of drivers in every mileage bin.
While further specialized analyses revealed additional evidence of fraud (e.g., based on patterns of rounding and pairs of duplicated data points), this humble histogram was already enough to set off alarm bells.
A recurring regret expressed by the coauthors of this paper is that they never thought to make this visualization before reporting their statistical tests.
```{r fig-viz-uniform}
#| output: false
#| fig-width: 5
#| fig-height: 3
# from helper/driving.R
driving_diffs <- read_rds("data/viz/driving_diffs.rds")
ggplot(driving_diffs, aes(x = miles)) +
geom_histogram(fill = "white", colour = "black") +
xlim(c(0, 100000)) +
labs(x = "Implied miles driven", y = "Count")
```
::: {#fig-viz-uniform}
::: {.content-visible when-format="html"}
![](015-viz_files/figure-html/fig-viz-uniform-1.png){width="50%" fig-alt="A histogram showing counts of implied miles driven; there are no bars above 50000 implied miles driven."}
:::
::: {.content-visible when-format="pdf"}
![](015-viz_files/figure-pdf/fig-viz-uniform-1.png){width="50%"}
:::
A suspiciously uniform distribution abruptly cutting off at 50K miles. Ring the alarm! Adapted from @datacolada.
:::
\vspace{-1em}
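To see why this shape is so suspicious, here is a minimal simulation (with made-up parameters, assuming the `tidyverse` is loaded as elsewhere in this chapter) contrasting bell-shaped mileage data with a uniform distribution truncated at 50,000 miles:
```{r}
#| eval: false
# Made-up parameters: 10,000 simulated drivers per distribution.
set.seed(1)
sim <- tibble(
  miles = c(pmax(rnorm(1e4, mean = 20000, sd = 10000), 0), # bell-shaped
            runif(1e4, min = 0, max = 50000)),             # uniform
  shape = rep(c("Bell-shaped", "Uniform"), each = 1e4))
ggplot(sim, aes(x = miles)) +
  geom_histogram(binwidth = 2500, fill = "white", colour = "black") +
  facet_wrap(vars(shape)) +
  labs(x = "Miles driven", y = "Count")
```
The bell-shaped panel piles most simulated drivers in the middle of the range; the uniform panel puts the same number in every bin, which is the pattern in the reported data.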
Our data are always messier than we expect.
There might be a bug in our coding scheme, a column might be mislabeled, or it might contain a range of values that we didn't expect.
Maybe our design wasn't perfectly balanced, or something went wrong with a particular participant's keyboard presses.
Most of the time, it's not tractable to manually scroll through our raw data looking for such problems.
Visualization is our first line of defense for the all-important process of running "data diagnostics."
If there is a weird artifact in our data, it will pop out if we just make the right visualizations.
:::
### Data diagnostics
So, which visualizations should we start with?
The best practice is to always start by making histograms of the raw data.
As an example, let's consider the rich and interesting dataset shared by @blake2015ontogeny in their article "Ontogeny of Fairness in Seven Societies."
This study examines the emergence of children's reasoning about fairness---both when it benefits them and when it harms them---across cultures.
::: {.callout-note title="code"}
If you want to follow along with this example at home, you can load the data from our repository!
```{r viz-blake1, echo=TRUE}
repo <- "https://github.com/langcog/experimentology/raw/main/"
fairness_raw <- read_csv(file.path(repo, "data/viz/ontogeny_of_fairness.csv"))
fairness <- fairness_raw |>
  mutate(trial_num = trial |> str_remove("t") |> as.numeric(), # "t1" -> 1
         trial_type = eq.uneq |> fct_recode("Equal" = "E", "Unequal" = "U"),
         condition = condition |> fct_recode("Advantageous" = "AI",
                                             "Disadvantageous" = "DI"),
         age = floor(actor.age.years),      # age in whole years
         reject = decision == "reject") |>  # TRUE if the offer was rejected
  select(subj_id = actor.id, age, country, condition, trial_num, trial_type, reject) |>
  arrange(country, condition, subj_id, trial_num)
```
:::
In this study, pairs of children played the "inequity game": they sat across from one another and were given a particular allocation of snacks.
On some trials, each participant was allocated the same amount ("equal" trials); on others, they were allocated different amounts ("unequal" trials).
One participant was chosen to be the "actor" and got to choose whether to accept or reject the allocation: in the case of rejection, neither participant got anything.
The critical manipulation was between two forms of inequity.
Some pairs were assigned to the "disadvantageous" condition, where the actor was allocated less than their partner on unequal trials (e.g., one vs four).
Others were assigned to the "advantageous" condition, where they were allocated more (e.g., four vs one).
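As a quick sanity check (a sketch using the `fairness` data frame built in the code note above), we can cross-tabulate condition and trial type to confirm that the coding matches this description:
```{r}
#| eval: false
# Each condition should contain a mix of equal and unequal trials
# (trial_type may also include NAs for unusable trials).
fairness |> count(condition, trial_type)
```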
The confirmatory **design plot**\index{design plot} for this study would focus on contrasting developmental trajectories for advantageous vs disadvantageous inequity.
However, this is a complex, multivariate dataset, including 866 pairs from different age groups and from testing sites across the world that used subtly different protocols.
How might we go about the process of exploratory visualization for this dataset?
### Plot data collection details
Let's start by getting a handle on some of the basic sample characteristics.
How many participants were in each age bin (@fig-viz-blake3)?
```{r viz-blake3}
fairness_by_age <- fairness |>
count(age, subj_id) |>
count(age)
```
```{r fig-viz-blake3}
#| fig-cap: Participants by age in the Blake data.
#| fig-alt: A histogram showing counts of participants in each age bin from 4 to 15; most participants around ages 8/9 and fewer older.
#| cap-location: margin
#| out-width: 80%
ggplot(fairness_by_age, aes(x = age, y = n)) +
geom_col() +
xlim(0, 18) +
labs(x = "Age (years)", y = "Count")
```
::: {.callout-note title="code"}
Exploratory histograms are often a combination of an aggregation step and a plotting step.
In the aggregation step, we make use of the convenience function `count()`, which gives the number (`n`) of rows in a particular grouping.
Here we `count()` twice: the first call collapses the data to one row per participant, and the second counts the number of participants within each age group.
```{r viz-blake3, opts.label='code'}
```
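To see concretely what the two `count()` calls do, here is a toy sketch with made-up data:
```{r}
#| eval: false
# Hypothetical mini-dataset: three participants, four trials total.
toy <- tibble(age = c(4, 4, 4, 5), subj_id = c("s1", "s1", "s2", "s3"))
toy |> count(age, subj_id)               # one row per participant (n = trials)
toy |> count(age, subj_id) |> count(age) # number of participants per age
```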
And then we plot using `ggplot()`:
```{r fig-viz-blake3, opts.label='code'}
```
An alternative (perhaps more elegant) workflow here would be to use a histogram:
```{r, opts.label='code'}
fairness_by_age <- fairness |>
count(age, subj_id)
ggplot(fairness_by_age, aes(x = age)) +
geom_histogram(binwidth = 1) +
labs(x = "Age (years)", y = "Count")
```
:::
\clearpage
How many participants were from each country (@fig-viz-blake5)?
```{r fig-viz-blake5}
#| fig-cap: Participants by country in the Blake data.
#| fig-alt: A histogram showing counts of participants in each country; largest count of participants in USA and smallest in Mexico.
#| cap-location: margin
#| out-width: 80%
fairness |>
count(country, subj_id) |>
count(country) |>
mutate(country = fct_reorder(country, -n)) |>
ggplot(aes(x = country, y = n)) +
geom_col() +
labs(x = "Country", y = "Count")
```
::: {.callout-note title="code"}
Here we are going to make things even terser and use a pipe chain that *includes* the `ggplot()` call, just so we are writing only a single call to produce our plot.
It's up to you whether you think this enhances the readability of your code or decreases it.
We find that it's sometimes useful when you don't plan on keeping the intermediate data frame for any other use than plotting.
```{r fig-viz-blake5, opts.label='code'}
```
If you use this technique, be careful to use the pipe (`|>` or `%>%`) between function calls but `+` between `ggplot` layers!
The only other trick to point out here is that we use `fct_reorder()` to sort the levels of the `country` factor in descending order of count.
This function comes from the very useful `forcats` package [@wickham2023] in the `tidyverse`, which contains all sorts of functions for working with factors.
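For instance, here is a toy sketch of `fct_reorder()` with made-up values; negating the ordering variable flips the sort from ascending to descending:
```{r}
#| eval: false
country <- factor(c("Canada", "India", "USA"))
n <- c(5, 20, 10)
levels(fct_reorder(country, -n))  # "India" "USA" "Canada": largest count first
```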
:::
\clearpage
Are ages roughly similar across each country (@fig-viz-blake6)?
```{r fig-viz-blake6}
#| fig-cap: Age distribution across countries in the Blake data.
#| fig-alt: A grid of histograms of participant counts by age, one panel per country; the USA distribution is fairly uniform while other countries peak around ages 8 and 9.
#| cap-location: margin
#| fig-height: 3.5
#| out-width: 80%
fairness |>
count(country, age, subj_id) |>
count(country, age) |>
mutate(country = fct_reorder(country, -n)) |>
ggplot(aes(x = age, y = n)) +
facet_wrap(vars(country), ncol = 4) +
geom_col() +
xlim(0, 18) +
labs(x = "Age (years)", y = "Count")
```
::: {.callout-note title="code"}
This next plot simply combines the grouping factors of each of the last two plots, and uses `facet_wrap()` to show a separate histogram by country:
```{r fig-viz-blake6, opts.label='code'}
```
:::
These exploratory visualizations help us read off some descriptive properties of the sample.
For example, we can see that age ranges differ somewhat across sites: the maximum age is 11 in India but 15 in Mexico.
We can also see that age groups are fairly imbalanced: in Canada, there are 18 eleven-year-olds but only 5 six-year-olds.
None of these properties are problematic, but seeing them gives us a degree of awareness that could shape our downstream analytic decisions.
For example, if we did not appropriately model random effects, our estimates would be dominated by the countries with larger sample sizes.
And if we were planning to compare specific groups of six-year-olds (for some reason), this analysis would be underpowered.
### Exploring distributions
Now that we have a handle on the sample, let's get a sense of the dependent variable: the participant's decision to accept or reject the allocation.
Before we start taking means, let's look at how the "rejection rate" variable is distributed.
We'll aggregate at the participant level and check the overall frequency of different rejection rates (@fig-viz-blake7).
```{r fig-viz-blake7}
#| fig-cap: Rejection rates in the Blake data.
#| fig-alt: A histogram showing binned counts of proportion of offers rejected; most common proportion is 0%; few proportions above 50%.
#| cap-location: margin
#| out-width: 80%
fairness_by_subj <- fairness |>
filter(!is.na(trial_type)) |>
group_by(subj_id) |>
summarize(mean_reject = mean(reject, na.rm = TRUE))
ggplot(fairness_by_subj, aes(x = mean_reject)) +
geom_histogram(binwidth = .05) +
labs(x = "Proportion of offers rejected", y = "Count")
```
::: {.callout-note title="code"}
Rejection rate is a continuous variable, so we switch to using a histogram in this case, choosing 0.05 as a reasonable bin width to see the distribution.
```{r fig-viz-blake7, opts.label='code'}
```
:::
We notice that many participants (`r round(mean(fairness_by_subj$mean_reject == 0),2)*100`%) never reject in the entire experiment.
This kind of "zero-inflated" distribution is not uncommon in psychology, and may warrant special consideration when designing the statistical model.
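One quick way to gauge the excess of zeros (a sketch, reusing `fairness` and `fairness_by_subj` from above and assuming 16 trials per participant, which we check below) is to compare the observed share of all-zero participants with what independent trials at the overall rejection rate would predict:
```{r}
#| eval: false
p_reject <- mean(fairness$reject, na.rm = TRUE)    # overall rejection rate
expected_zeros <- (1 - p_reject)^16                # chance of 16 straight accepts
observed_zeros <- mean(fairness_by_subj$mean_reject == 0)
c(expected = expected_zeros, observed = observed_zeros)
```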
We also notice that there is clumping around certain values.
This clumping leads us to check how many trials each participant is completing (@fig-viz-blake8).
```{r fig-viz-blake8}
#| fig-cap: Trials per participant in the Blake data.
#| fig-alt: A histogram showing counts of participants by number of trials; most participants have 16, some have 12, a few have 14/15.
#| cap-location: margin
#| out-width: 80%
fairness |>
filter(!is.na(trial_type)) |>
count(subj_id) |>
count(n) |>
ggplot(aes(x = n, y = nn)) +
geom_col() +
labs(x = "Number of trials", y = "Count of participants")
```
::: {.callout-note title="code"}
This histogram is very similar to the ones above; however, we now use `count()` twice on the trials themselves: the first call gives the number of trials per participant, and the second tallies how many participants share each trial count.
```{r fig-viz-blake8, opts.label='code'}
```
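The `nn` in the plot comes from a dplyr naming rule: when the variable being counted is itself called `n`, the new count column is stored as `nn` to avoid a collision. A toy sketch:
```{r}
#| eval: false
# Three hypothetical participants with 16, 16, and 12 trials.
tibble(n = c(16, 16, 12)) |> count(n)
# returns columns n (trial count) and nn (number of participants):
#   n = 12, nn = 1;  n = 16, nn = 2
```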
:::