-
Notifications
You must be signed in to change notification settings - Fork 10
/
ggplot2.Rmd
556 lines (419 loc) · 18.2 KB
/
ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
---
title: "ggplot2 versions of simple plots"
output:
html_document:
fig_caption: no
number_sections: yes
toc: yes
toc_float: false
collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2021, but may undergo further edits.</span></p>
# Introduction #
The **R** package `ggplot2` by Hadley Wickham provides an alternative approach to the "base" graphics in **R** for constructing plots and maps, and is inspired by Lee Wilkinson's *The Grammar of Graphics* book (Springer, 2nd Ed. 2005). The basic premise of the *Grammar of Graphics* book, and of the underlying design of the package, is that data graphics, like a language, are built upon some basic components that are assembled or layered on top of one another. In the case of English (e.g Garner, B., 2016, *The Chicago Guide to Grammar, Usage and Punctuation*, Univ. Chicago Press), there are eight "parts of speech" from which sentences are assembled:
- *Nouns* (e.g. *computer*, *classroom*, *data*, ...), including gerunds (e.g. *readings*)
- *Pronouns* (e.g. *he*, *she*, *they*, ...)
- *Adjectives* (e.g. *good*, *green*, *my*, *year-to-year*, ..., including articles, e.g. *a*, *the*)
- *Verbs* (e.g. *analyze*, *write*, *discuss*, *computing*, ...)
- *Adverbs* (e.g. "-ly" words, *very*, *loudly*, *bigly*, *near*, *far*, ...)
- *Prepositions* (e.g. *on*, *in*, *to*, ...)
- *Conjunctives* (e.g. *and*, *but*, ...)
- *Interjections* (e.g. *damn*)
(but as Garner notes, those categories "aren't fully settled..." p. 18).
In the case of graphics (e.g. Wickham, H., 2016, *ggplot2 -- Elegant Graphics for Data Analysis, 2nd. Ed.*, Springer, available online from [[http://library.uoregon.edu]](http://library.uoregon.edu)) the components are:
- *Data* (e.g. in a dataframe)
- *Coordinate Systems* (the coordinate system of a plot)
- *Geoms* and their "aesthetic mappings" (the geometric objects like points, lines, polygons that represent the data)
These functions return *Layers* (made up of geometric elements) that build up the plot. In addition, plots are composed of
- *Stats* (or "statistical transformations", i.e. curves)
- *Scales* (that map data into the "aesthetic space", i.e. axes and legends)
- *Facets* (e.g subsets of data in individual plots)
- *Legends* and
- *Themes* (that control things like background colors, fonts, etc.)
```{r source, echo=FALSE}
load(".RData")
```
# Univariate and bivariate plots #
## Simple plots ##
Load the `ggplot2` package:
```{r load ggplot2}
# ggplot2 versions of plots
library(ggplot2)
```
Here is the **R** base graphics simple index plot:
```{r index}
# index plot
attach(sumcr)
plot(Length)
```
The `ggplot2` package contains a method, invoked by the `qplot()` function for creating quick (hence the name) versions of the base-graphics plots. The general recommendation these days is to not rely on `qplot()` that much, but to learn the full capability of `ggplot2` at the beginning.
```{r index_gg}
qplot(seq(1:length(Length)), Length)
```
Here's the stripchart, plain and stacked:
```{r stripcharts}
# stripchart
stripchart(Length)
stripchart(Length, method="stack")
```
... and the `qplot()` versions:
```{r stripcharts_gg1}
qplot(Length, rep(1,length(Length)))
```
The full `ggplot2()` almost reproduces the base-graphics version. Note the use of the `aes()` (aesthetic) function for describing the basic plot, which then has the dotplot added using the `geom_dotplot()` function. The `theme_bw()` term specifies the simple black-and-white theme. As in the case of a histogram, the `binwidth` parameter controls the granularity of the plot, and `coord_fixed() changes the aspect ratio of the plot. (The white space can be cropped when saving the image.)
```{r stripchargs_gg2}
ggplot(data=sumcr, aes(x = Length)) +
geom_dotplot(binwidth = 0.25, method = "histodot") +
coord_fixed(ratio = 4.0)
```
## Dotcharts/Dotplots ##
Here's the base-graphics Cleveland-style dotchart.
```{r dotchart}
# Dot charts
dotchart(WidthWS, labels=as.character(Location), cex=0.5)
```
... and the `ggplot2` equivalent:
```{r dotchart_gg}
ggplot(sumcr, aes(x=WidthWS, y=Location)) + geom_point(size=1)
ggplot(sumcr, aes(x=WidthWS, y=Location)) + geom_point(size=1) + theme_bw()
```
Note the subtle difference produced by the use of the `theme_bw()` term. The plots are basically x-y plots (specified by the `aes(x=WidthWS, y=Location)` arguements), with the points themselves added by the `geom_point()` term.
Here's the dotchart, with rows ordered by the `WidthWS` values:
```{r dotchart_ordered}
index <- order(WidthWS)
dotchart(WidthWS[index], labels=as.character(Location[index]), cex=0.5)
```
... and the `ggplot2` equivalent--note the use of the `theme()` term to modify the label size:
```{r dotchart_ordered_gg}
ggplot(sumcr, aes(x=WidthWS, y=reorder(Location, WidthWS))) +
geom_point(size=1) +
theme(axis.text.y = element_text(size=4, face="italic"))
```
## Boxplots ##
Here is the standard boxplot:
```{r boxplot}
# Boxplot
boxplot(WidthWS ~ HU, range=0)
```
... and the `qplot()` version:
```{r boxplot_q}
qplot(HU, WidthWS, data=sumcr, geom="boxplot")
```
```{r boxplot_gg}
ggplot(data = sumcr, aes(x=HU, y=WidthWS)) + geom_boxplot() + theme_bw()
```
The actual data points can be added to the boxplot, by specifying an additional graphical element using the `geom_point()` term.
```{r boxplot_gg2}
ggplot(sumcr, aes(x=HU, y=WidthWS)) +
geom_boxplot() +
geom_point(colour = "blue")
```
Detach the `sumcr` dataframe.
```{r detach1}
detach(sumcr)
```
## Histograms and density plots ##
The base-graphics histogram, with `breaks = 20` looks like this:
```{r histogram}
# histograms
attach(scanvote)
hist(Yes, breaks=20)
```
Here's the `ggplot2` version, with the biwidth specified explicitly (e.g., 1.0)
```{r hist_gg1}
ggplot(scanvote, aes(x=Yes)) + geom_histogram(binwidth = 1)
```
Here's another `ggplot` version, with more control over appearance, and with the edge of the first bin explicitly specified.
```{r histogram_gg2}
ggplot(scanvote, aes(x=Yes)) +
geom_histogram(binwidth=1, fill="white", color="red", boundary=25.5)
```
Here's a version with the individual data points added--note how it is built up from a basic plot by progressively adding graphical elements:
```{r histogram_gg4}
ggplot(scanvote, aes(x=Yes)) +
geom_histogram(binwidth=1, fill="white", color="red", boundary=25.5) +
geom_point(y=0)
```
Here's a base-graphics density plot:
```{r density}
# density plots
Yes.density <- density(Yes)
plot(Yes.density)
```
... and a version the `ggplot2` density plot:
```{r density_gg}
ggplot(scanvote, aes(x=Yes)) + geom_density() + ylim(0.0, 0.05)
```
The combined histogram, density and rug plot via the base graphics:
```{r histdensrug}
Yes.density <- density(Yes)
hist(Yes, breaks=20, probability=TRUE)
lines(Yes.density)
rug(Yes)
```
And here's a `ggplot` version, with three distinct layers:
```{r histdenrug_gg}
ggplot(scanvote, aes(x=Yes)) +
geom_histogram(data=scanvote, aes(x=Yes, y=..density..), binwidth=1,
fill="white", color="red", boundary=25.5) +
geom_line(stat = "density") +
geom_rug(data=scanvote, aes(x=Yes))
```
geom_density(data=scanvote, aes(x=Yes, y=..density..)) +
```{r detach3}
detach(scanvote)
```
[[Back to top]](ggplot2.html)
# Bivariate plots #
## Scatter diagrams ##
There are `ggplot2` equivalents of the common "base-package" bivariate plots, that offer finer control of the appearance of the plots than can usually be acheived with the base plots.
Here's the base plot version of a scatter diagram:
```{r scat1}
# scatter plots
# use Oregon climate-station data
attach(orstationc)
plot(elev,tann)
```
... and the ``qplot` version:
```{r scat2}
qplot(elev, tann, data=orstationc) + theme_gray()
```
```{r scat3}
detach(orstationc)
```
Here are the two scatter plots of yes votes vs population in the Scandinavian data set, using base graphic:
```{r scat4}
# use Scandinavian EU vote data [scanvote.csv]
attach(scanvote)
plot(Pop, Yes) # arithmetic axis
plot(log10(Pop), Yes) # logrithmic axis
```
The `ggplot2` versions of those plots can be compares side-by-side by generating the plots and saving them as objects (`plot` and `plot2`), and then using a function `grid.arrange()` from the `gridExtra` package to plot the two objects:
```{r scat5}
library(gridExtra)
plot1 <- ggplot(scanvote, aes(x=Pop, y=Yes)) + geom_point()
plot2 <- ggplot(scanvote, aes(x=log10(Pop), y=Yes)) + geom_point()
grid.arrange(plot1, plot2, ncol=2)
```
Note that `ggplot2` allows one to transform the coordinate system as well as the values of the individual variables. Here a log10 scaling of the x-axis is done. Note that the overall appearance of the plots is identical--it's the x-axes that differ.
```{r scat6}
plot1 <- ggplot(scanvote, aes(x=log10(Pop), y=Yes)) + geom_point()
plot2 <- ggplot(scanvote, aes(x=Pop, y=Yes)) + coord_trans(x="log10") + geom_point()
grid.arrange(plot1, plot2, ncol=2)
```
The same kind of "point decoration" offered by the base package can be done using `ggplot2`:
```{r scat7}
plot1 <- ggplot(scanvote, aes(x=log10(Pop), y=Yes, color=Country)) + geom_point()
plot2 <- ggplot(scanvote, aes(x=log10(Pop), y=Yes, shape=Country)) + geom_point()
grid.arrange(plot1, plot2, ncol=2)
```
Labeling of points using `ggplot` is possible, and again offers more fine control, such as some ability to deal with text overlap.
Here is the base plotting version of a text-labelled scatter plot:
```{r scat8}
plot(log10(Pop),Yes, type="n")
text(log10(Pop),Yes, labels=as.character(Country)) # text
```
A `ggplot` version of the same ia shown below. The `check_overlap()` function attempts to move the labels to avoid overplotting. To see where the datapoints actualy lie, they are added as dots.
```{r scat9}
ggplot(scanvote, aes(x=log10(Pop), y=Yes, label=as.character(Country))) +
geom_text(check_overlap = TRUE, size = 3) +
geom_point(size = 0.5) +
theme_gray()
```
```{r scat10}
ggplot(scanvote, aes(x=log10(Pop), y=Yes, label=as.character(Country))) +
geom_label(size = 3) +
geom_point(size = 1, color = "red") +
theme_gray()
```
Note the difference between the plots--the first plots a simple text string using `geom_text()` while the second plots a more formal label using `geom_label()`.
```{r}
detach(scanvote)
```
The grid extra package obviously makes it efficient to plot different variables side by side, which is often a better approach than plotting them on the same plot:
```{r scat11}
# use Oregon climate-station data [orstationc.csv]
attach(orstationc)
left <- ggplot(orstationc, aes(x=elev, y=tann)) + geom_point() + theme_gray()
right <- ggplot(orstationc, aes(x=elev, y=pann)) + geom_point() + theme_gray()
grid.arrange(left, right, ncol=2)
```
```{r}
detach(orstationc)
```
Lastly, here's a scatter plot smoothing example, beginning with the base-package version:
```{r scat12}
# use Oregon climate station annual temperature data
attach(ortann)
plot(elevation, tann)
lines(lowess(elevation,tann))
```
The `ggplot2` version uses `stat_smooth()` to get the lowess curve (and confidence band):
```{r scat13}
ggplot(data = ortann, aes(x = elevation, y = tann)) +
geom_point() +
stat_smooth(span = 0.9) +
theme_bw()
```
```{r}
detach(ortann)
```
[[Back to top]](ggplot2.html)
# Multivariate plots #
## Labelled scatter plots #
In addition to the simple catagorial (or factor) labelling of plots, the `ggplot2` package supports more elaborate, or computed, decoration or coloring of plots to examine the role of additional variables on bivariate relationships.
Here's the `ggplot2` version of the specmap data. Recall that the relationship between ice volume and insolation anomalies (or differences from the present day, which are considered to be the pacemake of the ice ages) is weak when examined superficially.
```{r mv1}
attach(specmap)
plot(O18 ~ Insol, pch=16, cex=0.6)
cor(O18, Insol)
```
As we saw earlier, on approach is to look at the time series plot of O18 values, and use insolation to colorize the individual points.
```{r mv2}
library(RColorBrewer)
library(classInt) # class-interval recoding library
# first block -- setup
plotvals <- Insol
nclr <- 8
plotclr <- brewer.pal(nclr,"PuOr")
plotclr <- plotclr[nclr:1] # reorder colors
class <- classIntervals(plotvals, nclr, style="quantile")
colcode <- findColours(class, plotclr)
# second block -- plot
plot(O18 ~ Age, ylim=c(2.5,-2.5), type="l")
points(O18 ~ Age, pch=16, col=colcode, cex=1.5)
```
Here's a `ggplot2` version, where the colorization is extended to the lines as well as the symbols.
```{r mv3}
ggplot() +
geom_line(data = specmap, aes(x=Age, y=O18, color=Insol)) +
geom_point(data = specmap, aes(x=Age, y=O18, color=Insol), size = 3) +
scale_y_reverse() +
scale_colour_gradientn(colours=rev(brewer.pal(nclr,"PuOr")))
```
As we saw earlier, postive insolation anomalies are associated with trends toward lower ice volume (negative O18 values, note the y-axis labelling), and negative anomalies with increasing ice volume.
```{r}
detach(specmap)
```
Symbol color and shape can be used to encode the values of two additional variables, as demonstrated using the Summit Cr. data:
```{r mv4}
attach(sumcr)
plot(WidthWS ~ CumLen, pch=as.integer(Reach), col=as.integer(HU))
legend(25, 2, c("Reach A", "Reach B", "Reach C"), pch=c(1,2,3), col=1)
legend(650, 2, c("Glide", "Pool", "Riffle"), pch=1, col=c(1,2,3))
```
Here is the `ggplot2` version:
```{r mv5}
ggplot(sumcr, aes(x=CumLen, y=WidthWS, shape=Reach, color=HU)) + geom_point(size = 3)
```
```{r}
detach(sumcr)
```
Symbol size can be used to indicate the value of a third variable, forming a "bubble" or "balloon" plot, to indicate the value of station elevation on a simple longitude by latitude plot of the Oregon climate-station data:
```{r mv6}
plot(orstationc$lon, orstationc$lat, type="n")
symbols(orstationc$lon, orstationc$lat, circles=orstationc$elev, inches=0.1, add=T)
```
Here's the equivalent plot using `ggplot2`
```{r mv7}
ggplot(orstationc, aes(x=lon, y=lat, size=elev)) + geom_point(shape=21, color="black", fill="lightblue")
```
## Trellis/Lattice-like plots ##
Multipanel Trellis or Lattice plots can be constucted, using the "faceting" approach in`ggplot2`. Here's an example of a coplot:
```{r mv8}
library(lattice)
attach(sumcr)
coplot(WidthWS ~ DepthWS | Reach, pch=14+as.integer(Reach), cex=1.5,
number=3, columns=3, overlap=0,
panel=function(x,y,...) {
panel.smooth(x,y,span=.8,iter=5,...)
abline(lm(y ~ x), col="blue")
} )
```
Below is a `ggplot2` version.
```{r mv9}
ggplot(sumcr, aes(x=DepthWS, y=WidthWS)) +
stat_smooth(method = "lm") +
geom_point() +
facet_wrap(~ Reach)
```
```{r}
detach(sumcr)
```
Multiple-variable conditioning plots can also be created. Here are styles of plot using the Yellowsone N.P. precipitation-ratio data. The first plot examines the July vs January precipitation ratios, conditioned by elevation, and the second conditioned by latitude and longitude.
```{r mv10}
# create some conditioning variables
attach(yellpratio)
Elevation <- equal.count(Elev,4,.25)
Latitude <- equal.count(Lat,2,.25)
Longitude <- equal.count(Lon,2,.25)
# January vs July Precipitation Ratios by Elevation
plot2 <- xyplot(APJan ~ APJul | Elevation,
layout = c(2, 2),
panel = function(x, y) {
panel.grid(v=2)
panel.xyplot(x, y)
panel.abline(lm(y~x))
},
xlab = "APJul",
ylab = "APJan")
print(plot2)
```
```{r mv11}
# January vs July Precipitation Ratios by Latitude and Longitude
plot3 <- xyplot(APJan ~ APJul | Latitude*Longitude,
layout = c(2, 2),
panel = function(x, y) {
panel.grid(v=2)
panel.xyplot(x, y)
panel.loess(x, y, span = .8, degree = 1, family="gaussian")
panel.abline(lm(y~x))
},
xlab = "APJul",
ylab = "APJan")
print(plot3)
```
Here's the `ggplot2` version of the first of the two plots above. First, create an elevation-zone factor, simply sorting the elevations into four equal-frequency classes. Then, produce the plot
```{r mv12}
# create an elevation zones factor
yellpratio$Elev_zones <- cut(Elevation, 4, labels=c("Elev1 (lowest)", "Elev2", "Elev3", "Elev4 (highest)"))
```
```{r mv13}
ggplot(yellpratio, aes(x=APJul, y=APJan)) +
stat_smooth(method = "lm") +
geom_point() +
facet_wrap( ~ Elev_zones)
```
Note that the panels are arranged in a different order. Elevation increases from lower left to upper right in the `lattice` plot, while it increases from upper left to lower right in the `ggplot2` version.
Here is a two-conditioning variables plot. The arrangement of the panels is a little different from that produced by the `lattice` plot, and the slices are not overlapping.
```{r mv14}
# create conditioning variables
Loc_factor <- rep(0, length(yellpratio$Lat))
Loc_factor[(yellpratio$Lat >= median(yellpratio$Lat) & yellpratio$Lon < median(yellpratio$Lon))] <- 1
Loc_factor[(yellpratio$Lat >= median(yellpratio$Lat) & yellpratio$Lon >= median(yellpratio$Lon))] <- 2
Loc_factor[(yellpratio$Lat < median(yellpratio$Lat) & yellpratio$Lon < median(yellpratio$Lon))] <- 3
Loc_factor[(yellpratio$Lat < median(yellpratio$Lat) & yellpratio$Lon >= median(yellpratio$Lon))] <- 4
# convert to factor, and add level lables
yellpratio$Loc_factor <- as.factor(Loc_factor)
levels(yellpratio$Loc_factor) = c("Low Lon/High Lat", "High Lon/High Lat", "Low Lon/Low Lat", "High Lon/Low Lat")
```
```{r mv15}
ggplot(yellpratio, aes(x=APJul, y=APJan)) +
stat_smooth(method = "loess", span = 0.9, col="red") +
stat_smooth(method = "lm") +
geom_point() +
facet_wrap(~Loc_factor)
```
[[Back to top]](ggplot2.html)
# Readings #
- the main reference on `ggplot()` is Wickham, H., 2016, *Ggplot2, Elegant Graphics for Data Analysis*, eBook available on line at UO Library (search topics for "ggplot2")
- Peng (*EDA with R*): Ch. 14, 15
- Kabakoff (*Data Visualization with R*): Ch. 3 - 6
- Irizarry (*Introduction to Data Science*): Ch. 6 - 10
[[Back to top]](ggplot2.html)