forked from kylebittinger/usedist
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
238 lines (181 loc) · 7.14 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
fig.path = "tools/readme/"
)
```
```{r echo=FALSE, message=FALSE}
devtools::load_all()
set.seed(0)
```
# usedist
This package provides useful functions for distance matrix objects in R.
[![Travis-CI Build Status](https://travis-ci.org/kylebittinger/usedist.svg?branch=master)](https://travis-ci.org/kylebittinger/usedist)
## Installation
You can install usedist from github with:
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("kylebittinger/usedist")
```
## Introduction to distance matrices in R
In R, the `dist()` function is used to compute a distance matrix. But the
result you get back isn't really a matrix, it's a `"dist"` object. Under
the hood, the `"dist"` object is stored as a simple vector. When it's
printed out, R knows how to make it look like a matrix. Let's make a distance
object representing the distances between six rows of data.
Here is our data matrix, `X`:
```{r}
X <- matrix(rnorm(30), nrow=6)
rownames(X) <- c("A", "B", "C", "D", "E", "F")
X
```
And here is our `"dist"` object, `d`, representing the distance between rows of
`X`:
```{r}
d <- dist(X)
d
```
These `"dist"` objects are great, but R does not provide a set of functions to
work with them conveniently. That's where the `usedist` package comes in.
## Working with "dist" objects
The `usedist` package provides some basic functions for altering or selecting
distances from a `"dist"` object.
```{r eval=FALSE}
library(usedist)
```
To start, we can make a new `"dist"` object, containing the distances between
rows B, C, F, and D. Our new object contains the rows *in the order we
specified*:
```{r}
dist_subset(d, c("B", "C", "F", "D"))
```
This is especially helpful when arranging a distance matrix to match a data
frame, for instance with the `adonis()` function in `vegan`.
We can extract distances between specified pairs of rows. For example,
we'll pull out the distances for rows A-to-D, B-to-E, and C-to-F. To extract
specific distance values, we use `dist_get()`. This function takes two vectors
of row labels: one vector for the rows of origin, and another for the rows of
destination.
```{r}
origin_row <- c("A", "B", "C")
destination_row <- c("D", "E", "F")
dist_get(d, origin_row, destination_row)
```
If rows are arranged in groups, we might like to have a data frame listing the
distances alongside the groups for each pair of rows. The `dist_groups()`
function makes a data frame from the groups, and also adds in a nice label that
you might use for plots.
```{r}
item_groups <- rep(c("Control", "Treatment"), each=3)
dist_groups(d, item_groups)
```
You might have your own distance function that you'd like to use, beyond the
options available in `dist()` or `vegan::vegdist()`. For example, the RMS
distance is kind of like the Euclidean distance, but you take the mean of the
squared differences instead of the sum inside the square root. Let's define the
distance function:
```{r}
rms_distance <- function (r1, r2) {
sqrt(mean((r2- r1) ^ 2))
}
```
Then, we can pass it to `dist_make()` to create a new distance matrix of RMS
distances.
```{r}
dist_make(X, rms_distance)
```
## Centroid functions
The `usedist` package contains functions for computing the distance to group
centroid positions. This is accomplished without finding the location of the
centroids themselves, though it is assumed that some high-dimensional Euclidean
space exists where the centroids can be situated. References for the formulas
used can be found in the function documentation.
To illustrate, let's create a set of points in 2-dimensional space. Four
points will be centered around the origin, and four around the point (3, 0).
```{r centroid_example}
pts <- data.frame(
x = c(-1, 0, 0, 1, 2, 3, 3, 4),
y = c(0, 1, -1, 0, 0, 1, -1, 0),
Item = LETTERS[1:8],
Group = rep(c("Control", "Treatment"), each=4))
library(ggplot2)
ggplot(pts, aes(x=x, y=y)) +
geom_point(aes(color=Group)) +
geom_text(aes(label=Item), hjust=1.5) +
coord_equal()
```
Our goal is to figure out distances for the group centroids using only the
distances between points. First, we need to put the data in matrix format.
```{r}
pts_matrix <- as.matrix(pts[,c("x", "y")])
rownames(pts_matrix) <- pts$Item
```
Now, we'll compute the point-to-point distances with `dist()`.
```{r}
pts_distances <- dist(pts_matrix)
pts_distances
```
The function `dist_between_centroids()` will calculate the distance between
the centroids of the two groups. Here, we expect to get a distance of 3.
```{r}
dist_between_centroids(
pts_distances, c("A", "B", "C", "D"), c("E", "F", "G", "H"))
```
The function is only using the distance matrix; it doesn't know where the
individual points are in space.
We can use another function, `dist_to_centroids()`, to calculate the distance
from each individual point to the group centroids. Again, this works without
knowing the point locations, only the distances between points. In our example,
the distances within the Control group and within the Treatment group should
all be equal to 1.
```{r}
dist_to_centroids(pts_distances, pts$Group)
```
You can use the Pythagorean theorem to check that the other distances are
correct. The distance between point "G" and the centroid for the *Control*
group should be sqrt(3^2^ + 1^2^) = sqrt(10) = 3.162278.
## Long format data
Many times, the data is not stored as a matrix, but is represented in "long"
format as a data frame. In this case, one column of the data frame gives the
row label for the matrix, another indicates the column label, and a
third provides the value. To get a real data matrix, we have to "pivot" the
data frame and convert to matrix form. Because this is such a common operation,
`usedist` includes a convenience function, `pivot_to_matrix()`.
Here is an example of data in long format:
```{r}
data_long <- data.frame(
row_id = c("A", "A", "A", "B", "B", "C", "C"),
column_id = c("x", "y", "z", "x", "y", "y", "z"),
value = rpois(7, 12))
data_long
```
The data table has no value for row "B" and column "z". By default, a
value of 0 is filled in for missing combinations when we convert to matrix
format. Here is how we convert:
```{r}
data_matrix <- pivot_to_matrix(data_long, row_id, column_id, value)
data_matrix
```
Note that we provide bare column names in the call to
`pivot_to_matrix()`. This function requires some extra packages to
be installed. They are listed as suggestions for `usedist`. If the
additional packages are not installed on your system, you'll get an error
message with the missing packages listed.
The matrix format is what we need for distance calculations. If you want to
convert from long format and use a custom distance function, you can combine
`pivot_to_matrix()` with `dist_make()`:
```{r}
dist_make(data_matrix, rms_distance)
```
## Parallelization
Distance calculations can get computationally expensive with large sample sizes.
With the installation of future.apply package, you cna compute the distances in
parallel to save time.
```{r message=F, warning=F}
library(future.apply)
future::plan(future::multisession)
dist_make(data_matrix, rms_distance)
```