-
Notifications
You must be signed in to change notification settings - Fork 51
/
021_vectors.Rmd
executable file
·222 lines (180 loc) · 8.46 KB
/
021_vectors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 2.1. Vectors and matrices
[qf-subset]: http://www.quantumforest.com/2013/05/subsetting-data/
[gs-matrix]: https://gastonsanchez.wordpress.com/2012/08/30/cheat-sheet-of-matrix-operations-in-r/
In R, almost everything is a *vector*. If it's not a vector, it might be a special kind of vector. When something is not a vector, then it's not data and is probably some form of *function*.
What is a vector? A vector is an ordered list of $k$ things. The first thing is item #1 of the vector, the second thing is item #2 and so on. Here's a sequence of $k = 4$ integer values:
```{r seq}
# A sequence.
1:4
```
We are going to store the sequence into an object called `x` and verify that it is indeed a vector. And of course it is. It's a vector of dimension 4: it has four items.
```{r vector-check}
# Store it.
x <- c(1:4)
# Is it a vector?
is.vector(x)
```
So vectors are just lists of values. The values must be of the same type: you cannot mix numbers and text, for instance. If you do, R will convert (recycle) all items of the vector to a single type, or "mode", like `numeric` or `character` if there are text elements in the vector.
```{r vector-items}
# A vector of numbers.
x <- c(4, 6, 8)
# A vector of strings, i.e. characters, i.e. text.
y <- c("alea", "jacta", "est")
# A 'mixed' vector will be automatically recycled.
z <- c(4, "a")
```
You can see the different types of your objects in the Workspace. The examples below are also vectors:
```{r vector-ubiquity}
# Also a vector.
y <- "This is a vector."
# Check.
is.vector(y)
# Also a vector.
z <- 1
# Check.
is.vector(z)
# What about...
m <- cbind(1:3, 4:6)
# What type (or class) is that?
class(m)
```
Okay, that's a *matrix*. A vector is a $1 \times k$ unidimensional matrix. So in fact, in R, everything is a matrix, or some special kind of a matrix. Welcome to R. Note that we are going to introduce a lot of terminology below, but this is not a scientific computing class, and we are not computer scientists. We just need to make sure that we grasp the essentials.
## Working with vectors
Let's combine a fictional set of student grades into a vector called `grades`. Let's compute the mean, the median, and finally a whole bunch of summary statistics about the resulting object.
```{r exam}
# Create vector of random exam grades.
exam <- c(7, 13, 19, 8, 12)
# Check result.
exam
# Compute mean of exam vector.
mean(exam)
# Compute median of exam vector.
median(exam)
# Descriptive statistics for exam vector.
summary(exam)
```
Note that R uses a precise taxonomy of objects, and that even though we just used the word "list" to describe a vector, a list is not the same as a vector. If you ask R if the `exam` object is a vector, the result will be `TRUE`, but not if you ask whether `exam` is a list.
```{r exam-vector-object}
# Recall the exam object.
exam
# It's a vector.
is.vector(exam)
# Not a very long one.
length(exam)
```
Select from a vector by adding item numbers in square brackets to its name. The item numbers also form a vector.
```{r exam-vector-item-selection}
# Select its first element; in context: show grade of first student.
exam[1]
# Select more than one element by listing which values you want.
exam[1:3]
# Select a vector of values: show grades of students no. 1, 2, 3 and 5.
exam[c(1:3, 5)]
# In context, it does makes more sense to simply exclude student no. 4's grade.
exam[-4]
# Vectors can be logical.
exam >= 10
# Select a logical vector of values.
exam[exam >= 10]
```
What is shown here is important: in R, everything is an object. Here, `exam` is an object, just like the list of values that you want to show from it. This logic is everywhere in R. It is harder to learn than less flexible syntaxes, but it provides a lot of power to the user. Here's a simple example that creates a logical vector to indicate whether the exam grade is under average:
```{r exam-logical}
# Create a logical vector.
is_under_average <- exam < 10
# Get grades under average.
exam[is_under_average]
# Get grades above average.
exam[!is_under_average]
```
The magic here is that `exam` and `is_under_average` are *two separate elements*, although that brings marvel only to the eyes of those who are fascinated by vectorization: in practice, you will often end up binding the two objects together into a multidimensional vector like a matrix or a data frame, which we cover in the rest of this session.
## Working with matrixes
Let's see what a different object looks like. If we want to organize some data into a "row by column" structure, we need to use a matrix--a table of numerical values, if you prefer. The `as.matrix` command will turn `grades` into a matrix.
```{r exam-as-a-matrix}
# The exam object is not a matrix.
is.matrix(exam)
# If you make it into a matrix, the vector becomes a column of values.
as.matrix(exam)
```
Let's keep that mind and create five more imaginary student essay grades that we are going to combine with the exam grades, using the `cbind` command to form a single matrix out of vector objects. The 'c' in `cbind` means that the combination is done 'vertically', by forming columns out of the vectors.
```{r exam-essay-matrix}
# Show length of grades vector.
length(exam)
# Create a random grades vector of same length.
essay <- as.integer(20 * runif(5))
# Check result.
essay
# Form a matrix.
grades <- cbind(exam, essay)
# Check result.
grades
```
Let's finally compute the average for every student (rows), and read the average for exams and essays (columns).
```{r }
# Compute student average.
final <- rowMeans(grades)
# Combine to grades matrix.
grades <- cbind(grades, final)
# Check result.
grades
# Compute mean exam and essay grades.
colMeans(grades)
```
You can further explore the matrix by inspecting its dimensions and providing names to its rows and columns. These operations are covered in Gaston Sanchez's [cheat sheet of matrix operations][gs-matrix], which you can add to your course bookmarks.
```{r matrix-dimensions}
# How many rows?
nrow(grades)
# How many columns?
ncol(grades)
# Do they have names?
dimnames(grades)
# Create a student id sequence 1, 2, 3, ...
id <- c(1:nrow(grades))
# Check result.
id
# Assign it to row names.
rownames(grades) <- id
# Check result.
grades
```
Let's now select elements from the matrix: as with vectors, you have to use square brackets, but since there two dimensions, you have to specify at least one of them. The syntax is "r x c" by convention: the first value specifies rows, the second columns. Understanding this system is crucial because it will guide many [subsetting][qf-subset] operations when we work with datasets.
```{r matrix-exploration}
# First row, second cell.
grades[1,2]
# First row.
grades[1, ]
# First two rows.
grades[1:2, ]
# Third column.
grades[, 3]
# Descriptive statistics for final grades.
summary(grades[, 3])
# Descriptive statistics for all matrix columns.
summary(grades)
```
## Encoding vectors as factors
Let's anticipate on an issue that will come up often when we explore larger datasets. To illustrate the issue, let's code if the student passes or fails the test. The `ifelse` command works pretty much like the `IF` command in spreadsheet editors.
```{r grades-pass}
# Create a text object based on a logical condition.
pass <- ifelse(grades[, 3] >= 10, "Pass", "Fail")
# Check result.
pass
```
The `pass` vector contains text items (strings). Try to combine it to the numeric `grades` matrix to see what happens, and find the issue with that operation ("hint").
```{r cbind-mixed-types, eval=FALSE}
cbind(grades, pass)
```
To solve the issue, you are going to convert the text column to numeric values, while keeping the textual information as 'captions' or 'labels' for the numbers. R calls that operation factor encoding, and the 'labels' are called 'levels'.
```{r factor}
# Understand what happens when you factor a string (text) variable.
factor(pass)
# The numeric matrix is preserved if 'pass' is added as a factor.
cbind(grades, factor(pass))
# Final operation. The '=' assignment gives a name to the column.
grades <- cbind(grades, pass = factor(pass))
# Marvel at your work.
grades
```
Notice that factors are alphabetically ordered, and that this influences how they are ordered in graphs. In the simple example above, there is no particular issue with the fact that "1" stands for "Fail" and "2" for "Pass", but if we want to order a list of countries in any other way than alphabetically, this is when we will be (re)ordering factors.
> __Next__: [Variables and factors](022_variables.html).