-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-histograms.Rmd
224 lines (155 loc) · 5.53 KB
/
07-histograms.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
# Histograms and Box and whisker plots
Histograms and boxplots are two methods of displaying information about the distribution of a dataset.
The good news is that most of the commands that we have covered in the previous section on scatterplots and generic, and they can also be used when producing histograms and boxplots.
### Histograms
We've already plotted one histogram in the previous section, so let's look again at that plot.
First, we will look at the data.
```{r histograms-1",echo=-c(1:2)}
set.seed(1337)
normal_data <- rnorm(
100,
mean=0,
sd=1
)
hist(
normal_data
)
```
Simple enough. But if like me, you are a little pedantic about your plots, you may also want to make some changes...
```{r histograms-2}
hist(
normal_data,
main = "",
xlab = "Value",
col="gray80",
cex = 1.4,
cex.axis = 1.2,
cex.lab = 1.2
)
```
Bear in mind that we need to make some assumptions when we draw a histogram - i.e. where to choose the break points for the bins (columns). It's usually fine to leave R to make the decision for us, but we can see the result of making changes to the break points below.
```{r histograms-3, fig.height=20,echo=FALSE}
par(mfrow=c(3,1))
for (i in c(5,25,50)) {
hist(
normal_data,
xlab = "Value",
ylim=c(0,30),
col="gray80",
cex = 1.4,
cex.axis = 1.2,
cex.lab = 1.2,
breaks = i,
main = paste("breaks =",i,sep=" ")
)
}
par(mfrow=c(1,1))
```
You may also at some point wish to compare two histograms by overlaying one over the other. This can be done with a few tweaks, as seen in the example below. Note that we have again had to specify the number of breaks. Hence we have made some assumptions about how to display this data.
```{r histograms-4}
set.seed(1337)
# create two normally distributed vectors of random data
data_a <- rnorm(
100,
mean = 0,
sd = 2
)
data_b <- rnorm(
100,
mean = 2,
sd = 2
)
# Plot first graph
p1<-hist(
data_a,
ylim = c(0,25),
col = rgb(1,0,0,0.5),
main = "",
breaks = 10,
xlab = "Value",
cex = 1.4,
cex.axis = 1.2,
cex.lab = 1.2
)
# Note that you must specify an 'alpha' of less than 1 (the fourth argument in
# the rgb command), otherwise, your two colours will be completely solid.
# Plot second graph - note use of 'add = TRUE' to overlay onto the first plot
p2<-hist(
data_b,
breaks = 10,
add = TRUE,
col = rgb(0,0,1,0.5),
cex = 1.4,
cex.axis = 1.2,
cex.lab = 1.2
)
```
### Box and whisker plots
Boxplots similarly display information about the distribution of data, and the have the benefit of not having to make assumptions about bin width. They are not completely free of assumptions however.
A basic boxplot is shown below.
```{r histograms-5, echo=-1}
set.seed(1337)
boxplot_data <- rnorm(
n = 500,
sd = 10,
mean = 1
)
boxplot(
x = boxplot_data,
col = "aliceblue",
cex = 1.4,
cex.axis = 1.2,
cex.lab = 1.2
)
```
In this example the thick line at the centre shows the median, the top and bottom of the box shows the upper and lower quartile, whilst the 'whisker' show the upper and low adjacent values. These are the highest and lowest values within 1.5 times the interquartile range (upper - lower quartile) of the upper and lower quartile respectively. Any values beyond the adjacent values are considered 'outliers' and represented by points.
Note that this is not the only way of representing data in box and whisker plots. The other possibilities are listed here: <http://en.wikipedia.org/wiki/Box_plot>. For this reason, you should always mention what your boxplot is actually showing in the caption.
#### Multiple box plots
One of the great strengths of the boxplot, is that it allows you to plot several different data sets side by side, in a much more simple and easily comparable manner than with histograms.
<div class="ex">
### Exercise
In the next example we will use a real dataset which you are required to download from the internet and load into R using the methods we have learnt so far.
The data are available here: <https://db.tt/JEoe3mQZ>
Save this file to your desktop and import it to R, so that R recognises the column headings. Call the new variable `car_prices`.
</div>
This data compares the asking two prices of two cars: one diesel, and one petrol, from 216 car adverts. In the following example we will produce boxplots which compare the asking prices. We need to use the `split()` command for this.
The `split()` command literally splits a dataset according to a particular categorical variable.
```{r histograms-6", echo=-1, message=FALSE, warning=FALSE}
car_prices <- read.table("discount.csv",sep=",",header=T)
# Check the data
str(
car_prices
)
# Split it according to fuel type - note the $ operator - this is required in this case
split(
car_prices$price,
car_prices$fuel
)
# This can simply be wrapped by a call to the boxplot() function
boxplot(
split(
car_prices$price,
car_prices$fuel
),
col = c("#66C2A5","#FC8D62")
)
```
We can make additional splits to this dataset if we like, by using a `list` of factors to split the data by.
```{r histograms-8,message=FALSE,warning=FALSE}
boxplot(
split(
car_prices$price,
list(
car_prices$fuel,
car_prices$type
)
),
col = c("#66C2A5","#FC8D62"),
ylab=bquote(Asking~Price~("£"))
)
# Using bquote() is required here to display the £ sign correctly
```
<div class="ex">
### Exercise
Using the methods above, produce a `boxplot` of the `car_prices` data, splitting teh data by seller `type`
</div>