# Lists
This chapter covers an additional R data type called the list. Lists are
somewhat similar to atomic vectors (they are "generalized vectors"!),
but can store more types of data and more details _about_ that data
(at some cost). Lists are another way to create R's version of a
[**Map**](https://en.wikipedia.org/wiki/Associative_array) data structure, a common and extremely useful way of organizing data in a computer program. Moreover, lists are used to create _data frames_, the primary data storage type for working with sets of real data in R. This chapter covers how to create lists and access their elements, as well as how to apply functions to lists or vectors.
## What is a List?
A **list** is a lot like an atomic vector. It is also a
_one-dimensional, positional, ordered collection of data_. Exactly as
with atomic vectors, list elements preserve their order and
have a well-defined position in the list. However, lists have a few major differences from vectors:
1. Unlike a vector, you can store elements of _different types_ in a
list: e.g., a list can contain numeric data _and_ character string data,
functions, and even other lists.
2. Because lists can contain any type of data, they are much less efficient
than vectors. The vectorized operations that handle atomic vectors on
the fly usually fail with lists, or may run substantially slower. Hence one should prefer atomic
vectors over lists where possible.
3. Elements in a list can also be named, but unlike with vectors,
there is a convenient shorthand, the `$` construct, for extracting named elements from lists.
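The first difference can be illustrated with a minimal sketch. The variable names `as_vector` and `as_list` are made up for this example: `c()` coerces mixed values to one common type, while `list()` keeps each element's own type.

```r
# c() coerces all values to one common type (here: character)
as_vector <- c("Ada", 78000, TRUE)
class(as_vector)          # "character"

# list() keeps the original type of every element
as_list <- list("Ada", 78000, TRUE)
sapply(as_list, class)    # "character" "numeric" "logical"
```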
Lists are extremely useful for organizing data. They allow you to group
together data like a person's name (characters), job title (characters),
salary (number), and whether they are in a union (logical)—and you
don't have to remember whether the person's name or title was the first
element! In this sense lists can be used as a quick alternative to
formal classes, objects that can store heterogeneous data in a
consistent way. This is one of the primary uses of lists.
## Creating Lists
You create a list by using the `list()` function and passing it any number of **arguments** (separated by commas) that you want to make up that list, similar to the `c()` function for vectors.
However, if your list contains heterogeneous elements, it is usually a
good idea to specify the **names** (or **tags**) for each element in the list, in the same
way you can give names to vector elements in `c()`: put the name tag (which is like a variable name), followed by an equal symbol (**`=`**), followed by the value you want to go in the list and be associated with that tag. For example:
```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000,
               in_union = TRUE)
person
## $first_name
## [1] "Ada"
##
## $job
## [1] "Programmer"
##
## $salary
## [1] 78000
##
## $in_union
## [1] TRUE
```
This creates a list of 4 elements: `"Ada"` which is tagged with
`first_name`, `"Programmer"` which is tagged with `job`, `78000` which
is tagged with `salary`, and `TRUE` which is tagged with `in_union`.
The output lists all component names following the dollar sign `$` (more
about it below), and prints the components themselves right after the names.
- Note that you can have _vectors_ as elements of a list. In fact, each
of these scalar values is really a vector (of length 1), as indicated
by the `[1]` preceding its value!
- The use of the `=` symbol here is an example of assigning a value to a specific named argument. You can actually use this syntax for _any_ function (e.g., rather than listing arguments in order, you can explicitly "assign" a value to each argument), but it is more common to just use the normal order of the arguments if there aren't very many.
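The named-argument syntax from the second note can be sketched with base R's `substr()`, whose parameters are named `x`, `start`, and `stop`. The example string and the variable names `by_position` and `by_name` are arbitrary choices for illustration.

```r
# positional arguments: values are matched by their order
by_position <- substr("Programmer", 1, 3)

# named arguments: the order no longer matters
by_name <- substr(stop = 3, start = 1, x = "Programmer")

by_position   # "Pro"
by_name       # "Pro"
```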
Note that if you need to, you can get a _vector_ of element tags using the `names()` function:
```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)
names(person) # [1] "first_name" "job" "salary" "in_union"
```
This is useful for understanding the structure of variables that may have come from other data sources.
It is possible to create a list without tagging the elements, and assign
names later if you wish:
```r
person_alt <- list("Ada", 78000, TRUE)
person_alt
## [[1]]
## [1] "Ada"
##
## [[2]]
## [1] 78000
##
## [[3]]
## [1] TRUE
names(person_alt) <- c("name", "income", "membership")
person_alt
## $name
## [1] "Ada"
##
## $income
## [1] 78000
##
## $membership
## [1] TRUE
```
Note that the name tags are missing before we assign names. Instead of
names, we see the positions of the components in double brackets, like
`[[1]]` (more about it [below](#lists-indexing-by-position)).
Making nameless lists and assigning names later is usually a harder and more error-prone way to make lists
manually, but when you create lists automatically in your code, it may be the only option.
Finally, empty lists of a given length can also be created using the
general `vector()` function. For instance, `vector("list", 5)` creates
a list of five `NULL` elements. This is a good approach if you just want an empty list to be filled in a loop later.
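As a minimal sketch of the pre-allocate-then-fill pattern (the list name `squares` and the computation are invented for this example), the list is created with `vector()` and filled with double-bracket assignment inside a `for` loop:

```r
# pre-allocate an empty list of length 3 (all elements are NULL)
squares <- vector("list", 3)
length(squares)   # 3

# fill the list element by element with double brackets
for (i in seq_along(squares)) {
  squares[[i]] <- i^2
}
squares[[3]]      # 9
```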
## Accessing List Elements
There are four ways to access elements in lists. Three of these mirror
atomic vector indexing, although with important differences; the fourth,
the [`$`-construct](#lists-dollar-shortcut), is unique to lists.
### Indexing by position {#lists-indexing-by-position}
You can always access list elements by their position. Positional indexing
of lists is in many ways similar to that of atomic vectors, with one major caveat:
indexing with single brackets extracts not the components themselves but a _sublist_ that
contains just those components:
```{r, eval=FALSE}
# note: this list is not an atomic vector, even though its elements have the same type
animals <- list("Aardvark", "Baboon", "Camel")
animals[c(1,3)]
## [[1]]
## [1] "Aardvark"
##
## [[2]]
## [1] "Camel"
```
You can see that the result is a list with two components, "Aardvark"
and "Camel", picked from positions 1 and 3 in the original list.
The fact that single brackets return a list when applied to a list is
actually a smart design choice. First, they cannot return a vector in
general: the requested components may be of different types and
simply not fit into an atomic vector. Second, single-bracket indexing
of vectors actually returns a _subvector_. We just tend to
overlook that a "scalar" is actually a length-1 vector. But however
smart this design decision may be, people tend to learn it the
hard way. When confronted with weird errors, check that what you think
should be a vector is in fact a vector and not a list.
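One way to run that check is with the base R predicates `is.list()` and `is.character()`. The sketch below reuses the `animals` list; `first` and `animal_vec` are names made up for the example.

```r
animals <- list("Aardvark", "Baboon", "Camel")
first <- animals[1]            # single brackets: a list of length 1
is.list(first)                 # TRUE
is.character(first)            # FALSE

# the same operation on an atomic vector returns an atomic vector
animal_vec <- c("Aardvark", "Baboon", "Camel")
is.character(animal_vec[1])    # TRUE
```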
The good news is that there is an easy way to extract components. A
single element, and not just a length-one-sublist, is extracted by
double brackets. For instance,
```{r, eval=FALSE}
animals[[2]]
## [1] "Baboon"
```
returns a length-1 character vector.
Unfortunately, the good news ends here. You can extract individual
elements in this way, but you cannot get a vector of individual list
components: `animals[[1:2]]` will give you a _subscript out of bounds_ error.
As above, this is a design choice: as list components may be of
different types, you may not be able to mold them into a single vector.
<p class="alert alert-info">
There are ways to merge components into a vector, provided they are of the
same type. For instance, `Reduce(c, animals)` will convert the animals
into a vector of a suitable type. So will `as.character(animals)`.
</p>
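A short sketch of these conversions follows. Note that `unlist()` is not mentioned in the text above; it is included here as an assumption about what readers may find useful, being base R's general function for flattening a list into a vector.

```r
animals <- list("Aardvark", "Baboon", "Camel")

# three ways to turn a list of same-type components into a character vector
from_reduce <- Reduce(c, animals)
from_as     <- as.character(animals)
from_unlist <- unlist(animals)     # not from the text above, see the note

from_unlist   # "Aardvark" "Baboon" "Camel"
```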
### Indexing by Name
If the list is named, one can use a character vector to extract its
components, in exactly the same way as we used the numeric positions
above. For instance:
```r
person <- list(first_name = "Bob", last_name = "Wong", salary = 77000, in_union = TRUE)
person[c("first_name", "salary")]
## $first_name
## [1] "Bob"
##
## $salary
## [1] 77000
person[["first_name"]] # [1] "Bob"
person[["salary"]] # [1] 77000
```
As with positional indexing, single brackets return a sublist
while double brackets return the corresponding component itself.
### Indexing by Logical Vector
As with atomic vectors, we can use logical indices with
lists too. There are a few differences though:
* One can only extract sublists, not individual components.
`person[c(TRUE, TRUE, FALSE, FALSE)]` will give you a sublist with
the first and last name. `person[[c(TRUE, FALSE, FALSE, FALSE)]]` will
fail.
* Many operators are vectorized but they are not "listified". You
cannot do math like `*` or `+` with lists. Hence the
powerful logical indexing operations like `x[x > 0]` are in general not possible
with lists. This substantially reduces the potential use cases of
logical indexing.
We can, for instance, extract all components with certain names from the
list:
```{r, eval=FALSE}
# speed: cruise speed, Mach
planes <- list("Airbus 380" = c(seats = 575, speed = 0.85),
               "Boeing 787" = c(seats = 290, speed = 0.85),
               "Airbus 350" = c(seats = 325, speed = 0.85))
# extract the components whose names start with "Airbus"
planes[startsWith(names(planes), "Airbus")]
## $`Airbus 380`
## seats speed
## 575.00 0.85
##
## $`Airbus 350`
## seats speed
## 325.00 0.85
```
<p class="alert alert-info">
However, certain vectorized operations, such as `>` or `==`, also work with lists
that contain single numeric values as their elements. It seems to be
hard to come up with general rules, so we recommend not relying on this
behaviour in code.
</p>
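A more predictable route is to compute the logical index explicitly, one value per component, and then filter with single brackets. The sketch below is one possible approach, not the only one: it uses base R's `vapply()` (not covered in this chapter), and the threshold of 300 seats and the name `is_large` are arbitrary choices for illustration.

```r
planes <- list("Airbus 380" = c(seats = 575, speed = 0.85),
               "Boeing 787" = c(seats = 290, speed = 0.85),
               "Airbus 350" = c(seats = 325, speed = 0.85))

# compute one TRUE/FALSE per component; logical(1) declares the expected result type
is_large <- vapply(planes, function(p) p[["seats"]] > 300, logical(1))

# filter the list with the logical vector
names(planes[is_large])   # "Airbus 380" "Airbus 350"
```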
### Extracting named elements with `$` {#lists-dollar-shortcut}
Finally, there is a very convenient `$`-shortcut alternative for
extracting individual components.
If you printed out one of the named lists above, for instance `person`, you would see the following:
```{r, eval=FALSE}
person <- list(name = "Ada", job = "Programmer")
print(person)
## $name
## [1] "Ada"
##
## $job
## [1] "Programmer"
```
Notice that the output lists each name tag prepended with a dollar sign
(**`$`**), and then on the following line the vector that is the
element itself. You can retrieve individual components in a similar
fashion: the **dollar notation** is one of the easiest ways of accessing
list elements. You refer to a particular element in the list by its
tag, writing the name of the list, followed by a `$`, followed by the element's tag:
```{r, eval=FALSE}
person$name # [1] "Ada"
person$job # [1] "Programmer"
```
Obviously, this only works for named lists. There is no dollar-notation analogue for atomic vectors, even for named
vectors. The `$` extractor only exists for lists (and for data structures
that are derived from lists, like data frames).
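To see the restriction in practice, the sketch below tries `$` on a named atomic vector. The names `person_vec` and `result` are invented for the example, and `tryCatch()` is used only so that the error does not stop the script.

```r
# a named atomic vector, not a list
person_vec <- c(first_name = "Ada", job = "Programmer")
person_vec[["job"]]       # "Programmer": brackets work for named vectors

# `$` on an atomic vector raises an error; tryCatch() captures it here
result <- tryCatch(person_vec$job, error = function(e) "error")
result                    # "error"
```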
You can almost read the dollar sign as an "apostrophe s" (possessive) in English: so `person$salary` would mean "the `person` list**'s** `salary` value".
Dollar notation allows list elements to almost be treated as variables in their own right—for example, you specify that you're talking about the `salary` variable in the `person` list, rather than the `salary` variable in some other list (or not in a list at all).
```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)
# use elements as function or operation arguments
paste(person$job, person$first_name) # [1] "Programmer Ada"
# assign values to list element
person$job <- "Senior Programmer" # a promotion!
print(person$job) # [1] "Senior Programmer"
# assign value to list element from itself
person$salary <- person$salary * 1.15 # a 15% raise!
print(person$salary) # [1] 89700
```
Dollar notation is a drop-in replacement for double-bracket extraction,
provided you know the name of the component. If you do not, as is
often the case when programming, you have to rely on the double-bracket approach.
### Single vs. Double Brackets vs. Dollar
List indexing may be confusing: we have single and double brackets,
indexing by position and name, and finally the dollar notation. Which
is the right thing to do? As is so often the case, it depends.
* **Dollar notation** is the quickest and easiest way to extract a single
named component when you know its name.
* **Double brackets** are a more verbose alternative to the dollar notation. They return a single component, exactly as the dollar notation does. However, they also allow one to decide later which component to extract. (This is terribly useful in programs!) For instance,
we can decide if we want to use someone's first or last name:
```{r, eval=FALSE}
person <- list(first_name = "Bob", last_name = "Wong", salary = 77000)
name_to_use <- "last_name" # choose name (e.g., based on formality)
person[[name_to_use]] # [1] "Wong"
name_to_use <- "first_name" # change name to use
person[[name_to_use]] # [1] "Bob"
```
<p class="alert alert-info">
Note: you can often hear that double brackets return a vector. This is only true if the corresponding element is a vector. But they always return the element!
</p>
* **Single brackets** are the most powerful and universal way of indexing. They work in a very similar fashion to vector indexing. The main caveat here is that single brackets _return a sub-list_, not a vector. (But note that in case of vectors, single-bracket indexing returns a _sub-vector_.) They allow indexing by position, by name, and by logical vector.
In some sense it is **filtering** by whatever vector is inside the brackets (which may have just a single element). In R, single brackets _always_ mean filtering the collection, where the collection may be either an atomic vector or a list. So if you put single brackets after a collection, you get a filtered version of the same collection, containing the desired elements. The type of the collection, list or atomic vector, is not affected.
<p class="alert alert-warning">**Watch out**: In vectors, single-bracket notation returns a vector; in lists, single-bracket notation returns a list!
</p>
We recap this section with an example:
```r
animal <- list(class='A', count=201, endangered=TRUE, species='rhinoceros')
## SINGLE brackets return a list
animal[1]
## $class
## [1] "A"
## can use any vector as the argument to single brackets, just like with vectors
animal[c("species", "endangered")]
## $species
## [1] "rhinoceros"
##
## $endangered
## [1] TRUE
## DOUBLE brackets return the element (here it's a vector)!
animal[[1]] # [1] "A"
## Dollar notation is equivalent to the double brackets
animal$class # [1] "A"
```
Finally, all these methods can also be used for assignment. Just put any of these constructs on the left side of the assignment operator `<-`.
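A brief sketch of assignment through each of the three constructs follows; the new values are arbitrary. With single brackets the replacement is given as a list here, mirroring the fact that single brackets work with sublists.

```r
animal <- list(class = "A", count = 201, endangered = TRUE)

animal$count <- 202                    # dollar notation
animal[["class"]] <- "B"               # double brackets
animal["endangered"] <- list(FALSE)    # single brackets, replacement given as a list

animal$count        # 202
animal$class        # "B"
animal$endangered   # FALSE
```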
## Modifying Lists
As with atomic vectors, you can assign new values to existing elements. However, lists also provide dedicated syntax to _remove_ elements. (Remember, you can always "unselect" an element of a vector, including a list, by using a negative positional index.)
You can add elements to a list simply by assigning a value to a tag (or index) in the list that doesn't yet exist:
```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)
# has no `age` element
person$age # NULL
# assign a value to the `age` tag to add it
person$age <- 40
person$age # [1] 40
# assign using index
person[[10]] <- "Tenth field"
# elements 6-9 will be NULL
```
This parallels atomic vectors fairly closely.
You can remove elements by assigning the special value `NULL` to their tag or index:
```r
a_list <- list('A', 201, TRUE)
a_list[[2]] <- NULL # remove element #2
print(a_list)
# [[1]]
# [1] "A"
#
# [[2]]
# [1] TRUE
```
Atomic vectors have no analogue of this.
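The two ways of dropping an element can be compared in a small sketch (`shorter` is a name made up for the example): a negative index returns a new, shorter list and leaves the original alone, while assigning `NULL` changes the list itself.

```r
a_list <- list("A", 201, TRUE)

# negative index: returns a new list without element 2
shorter <- a_list[-2]
length(shorter)       # 2
length(a_list)        # 3: the original is unchanged

# NULL assignment: removes element 2 from the list itself
a_list[[2]] <- NULL
length(a_list)        # 2
```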
## The `lapply()` Function
A large number of common R functions (e.g., `paste()`, `round()`, etc.) and most common operators (like `+`, `>`, etc.) are _vectorized_, so you can pass vectors as arguments, and the function will be applied to each item in the vector. It "just works". With lists, this usually fails. You need to put in a bit more effort if you want to apply a
function to each item in a list.
The effort involves either an explicit loop, or an implicit loop through a function called **`lapply()`** (for _**l**ist apply_). We will discuss the latter approach here.
`lapply()` takes two arguments: the first is the list (or vector; vectors will do as well) you want to work with, and the second is the function you want to "apply" to each item in that list. For example:
```{r, eval=FALSE}
# list, not a vector
people <- list("Sarah", "Amit", "Zhang")
# apply the `toupper()` function to each element in `people`
lapply(people, toupper)
## [[1]]
## [1] "SARAH"
##
## [[2]]
## [1] "AMIT"
##
## [[3]]
## [1] "ZHANG"
```
You can pass even more arguments to `lapply()`; those will be assumed to belong to the function you are applying:
```{r, eval=FALSE}
# apply the `paste()` function to each element in `people`,
# with an additional argument `"dances!"` to each call
lapply(people, paste, "dances!")
## [[1]]
## [1] "Sarah dances!"
##
## [[2]]
## [1] "Amit dances!"
##
## [[3]]
## [1] "Zhang dances!"
```
The last unnamed argument, `"dances!"`, is taken as the second argument to `paste()`. So behind the scenes, `lapply()` runs a loop over `paste("Sarah", "dances!")`, `paste("Amit", "dances!")` and so on.
- Notice that the second argument to `lapply()` is just the function: not the name of the function as a character string (it's not quoted in `""`). You're also not actually _calling_ that function when you write its name in `lapply()` (you don't put the parentheses `()` after its name). See more in [section _How to Use Functions_](#how-to-use-functions).
After the function, you can put any additional arguments you want the applied function to be called with: for example, how many digits to round to, or what value to paste to the end of a string.
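To make the "behind the scenes" loop concrete, here is a rough sketch of an explicit loop that produces the same result as `lapply(people, paste, "dances!")`. The name `dancers` is invented for the example, and this is an illustration of the idea rather than how `lapply()` is actually implemented.

```r
people <- list("Sarah", "Amit", "Zhang")

# explicit loop: pre-allocate the result, then fill it element by element
dancers <- vector("list", length(people))
for (i in seq_along(people)) {
  dancers[[i]] <- paste(people[[i]], "dances!")
}

# the loop gives the same result as the lapply() call
identical(dancers, lapply(people, paste, "dances!"))   # TRUE
```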
Note that the `lapply()` function returns a _new_ list; the original one is unmodified. This makes it a [**mapping**](https://en.wikipedia.org/wiki/Map_(parallel_pattern)) operation. It is an operation, not the same thing as the _map_ data structure. In a mapping operation, the code applies the same **elemental function** to all elements in a list.
You commonly use `lapply()` with your own custom functions which define what you want to do to a single element in that list:
```r
# A function that prepends "Hello" to any item
greet <- function(item) {
  return(paste("Hello", item))
}
# a list of people
people <- list("Sarah", "Amit", "Zhang")
# greet each name
greetings <- lapply(people, greet)
greetings
## [[1]]
## [1] "Hello Sarah"
##
## [[2]]
## [1] "Hello Amit"
##
## [[3]]
## [1] "Hello Zhang"
```
Additionally, `lapply()` is a member of the "`*apply()`" family of functions: a set of functions that each start with a different letter and may apply to a different data structure, but otherwise all work in a similar fashion. For example, `lapply()` returns a list, while `sapply()` (**s**implified apply) simplifies the list into a vector, if possible. If you are interested in parallel programming, we recommend checking out the function `parLapply()` and its friends in the _parallel_ package.
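The difference between `lapply()` and `sapply()` can be seen in a small sketch that counts the characters in each name with base R's `nchar()`. The names `as_list` and `as_vector` are chosen for the example.

```r
people <- list("Sarah", "Amit", "Zhang")

as_list   <- lapply(people, nchar)   # a list of three numbers
as_vector <- sapply(people, nchar)   # simplified into an integer vector

is.list(as_list)   # TRUE
as_vector          # 5 4 5
```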
## Resources {-}
- [R Tutorial: Lists](http://www.r-tutor.com/r-introduction/list)
- [R Tutorial: Named List Members](http://www.r-tutor.com/r-introduction/list/named-list-members)
- [StackOverflow: Single vs. double brackets](http://stackoverflow.com/questions/1169456/in-r-what-is-the-difference-between-the-and-notations-for-accessing-the)