
# Lists

This chapter covers an additional R data type called the **list**. Lists are
somewhat similar to atomic vectors (they are "generalized vectors"!),
but can store more types of data and more details _about_ that data
(at some cost).  Lists are another way to create R's version of a
[**Map**](https://en.wikipedia.org/wiki/Associative_array) data structure, a common and extremely useful way of organizing data in a computer program. Moreover, lists are used to create _data frames_, the primary data storage type for working with sets of real data in R. This chapter covers how to create lists and access their elements, as well as how to apply functions to lists or vectors.

## What is a List?

A **List** is a lot like an atomic vector.  It is also a
_one-dimensional, positional, ordered collection of data_.  Exactly as in
the case of atomic vectors, list elements preserve their order and
have a well-defined position in the list.  However, lists differ from vectors in a few major ways:

1. Unlike a vector, you can store elements of _different types_ in a
list: e.g., a list can contain numeric data _and_ character string data,
functions, and even other lists.

2. Because lists can contain any type of data, they are much less efficient
than vectors.  The vectorized operations that handle atomic vectors on
the fly usually fail for lists, or work substantially slower.  Hence one should prefer atomic
vectors over lists where possible.

3. Elements in a list can also be named, but unlike with atomic vectors,
there is a convenient shorthand, the `$`-construct, for extracting named elements from lists.
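
The first and third differences can be illustrated with a minimal sketch. The variable names `as_vector` and `as_list` are made up for this example: `c()` coerces mixed values to one common type, whereas `list()` keeps each element's own type and lets you reach it by name.

```r
# c() coerces mixed values to a single common type (here: character)
as_vector <- c("Ada", 78000, TRUE)
class(as_vector)          # "character"

# list() keeps every element's own type
as_list <- list(name = "Ada", salary = 78000, in_union = TRUE)
class(as_list$name)       # "character"
class(as_list$salary)     # "numeric"
class(as_list$in_union)   # "logical"
```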

Lists are extremely useful for organizing data. They allow you to group
together data like a person's name (characters), job title (characters),
salary (number), and whether they are in a union (logical)&mdash;and you
don't have to remember whether the person's name or title was the first
element!  In this sense lists can be used as a quick alternative to
formal classes, objects that can store heterogeneous data in a
consistent way.  This is one of the primary uses of lists.


## Creating Lists

You create a list by using the `list()` function and passing it any number of **arguments** (separated by commas) that you want to make up that list, similar to the `c()` function for vectors.

However, if your list contains heterogeneous elements, it is usually a
good idea to specify the **names** (or **tags**) for each element in the list, in the same
way you can give names to vector elements in `c()`: put the name tag (which is like a variable name), followed by an equal symbol (**`=`**), followed by the value you want to go in the list and be associated with that tag. For example:

```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000,
               in_union = TRUE)
person
## $first_name
## [1] "Ada"
## 
## $job
## [1] "Programmer"
## 
## $salary
## [1] 78000
## 
## $in_union
## [1] TRUE
```

This creates a list of 4 elements: `"Ada"` which is tagged with
`first_name`, `"Programmer"` which is tagged with `job`, `78000` which
is tagged with `salary`, and `TRUE` which is tagged with `in_union`.
The output lists all component names following the dollar sign `$` (more
about it below), and prints the components themselves right after the names.

- Note that you can have _vectors_ as elements of a list. In fact, each
  of these scalar values is really a vector (of length 1), as indicated
  by the `[1]` preceding its value!

- The use of the `=` symbol here is an example of assigning a value to a specific named argument. You can actually use this syntax for _any_ function (e.g., rather than listing arguments in order, you can explicitly "assign" a value to each argument), but it is more common to just use the normal order of the arguments if there aren't very many.

Note that if you need to, you can get a _vector_ of element tags using the `names()` function:

```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)
names(person)  # [1] "first_name" "job"  "salary"  "in_union"
```
This is useful for understanding the structure of variables that may have come from other data sources.

It is possible to create a list without tagging the elements, and assign
names later if you wish:

```r
person_alt <- list("Ada", 78000, TRUE)
person_alt

## [[1]]
## [1] "Ada"
## 
## [[2]]
## [1] 78000
## 
## [[3]]
## [1] TRUE

names(person_alt) <- c("name", "income", "membership")
person_alt

## $name
## [1] "Ada"
## 
## $income
## [1] 78000
## 
## $membership
## [1] TRUE
```

Note that the name tags are missing before we assign names. Instead of
names, we see the position of each component in double brackets, like
`[[1]]` (more about it [below](#lists-indexing-by-position)).

Making nameless lists and assigning names later is usually a harder and more error-prone way to make lists
manually, but when you create lists automatically in your code, it may be the only option.

Finally, empty lists of a given length can also be created using the
general `vector()` function.  For instance, `vector("list", 5)` creates
a list of five `NULL` elements.  This is a good approach if you just want an empty list to be filled in a loop later.
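
As a minimal sketch of that pattern (the name `squares` and the computation are made up for illustration, and the double-bracket syntax is explained in the next section):

```r
# pre-allocate a list of 3 NULL elements
squares <- vector("list", 3)
length(squares)  # 3

# fill the list inside a loop, one element at a time
for (i in seq_along(squares)) {
  squares[[i]] <- i^2
}
squares[[3]]  # 9
```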


## Accessing List Elements

There are four ways to access elements in lists.  Three of these mirror
atomic vector indexing, although with important differences.  The fourth, the
[`$`-construct](#lists-dollar-shortcut), is unique to lists.


### Indexing by Position {#lists-indexing-by-position}

You can always access list elements by their position.  Positional indexing is in many
ways similar to that of atomic vectors, with one major caveat:
indexing with single brackets extracts not the components themselves but a _sublist_ that
contains just those components:
```{r, eval=FALSE}
# note: this list is not an atomic vector, even though its elements have the same type
animals <- list("Aardvark", "Baboon", "Camel")
animals[c(1,3)]

## [[1]]
## [1] "Aardvark"
## 
## [[2]]
## [1] "Camel"
```

You can see that the result is a list with two components, "Aardvark"
and "Camel", picked from positions 1 and 3 in the original list.

The fact that single brackets return a list when applied to a list is
actually a smart design choice.  First, they cannot return a vector in
general: the requested components may be of different types and
simply not fit into an atomic vector.  Second, single-bracket indexing
of a vector also returns a _subvector_.  We just tend to
overlook that a "scalar" is actually a length-1 vector.  But however
smart this design decision may be, people tend to learn it the hard
way.  When confronted with weird errors, check that what you think
should be a vector is in fact a vector and not a list.
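
One way to do such a check is `is.list()`. The sketch below uses made-up names (`nums_vec`, `nums_list`) to show that single brackets keep the type of the collection they are applied to:

```r
nums_vec <- c(10, 20, 30)       # atomic vector
nums_list <- list(10, 20, 30)   # list

# single brackets keep the type of the collection
is.list(nums_vec[1])    # FALSE: a length-1 numeric vector
is.list(nums_list[1])   # TRUE: a length-1 list

nums_vec[1] + 1         # 11
# nums_list[1] + 1      # would stop with an error: arithmetic does not work on lists
```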

The good news is that there is an easy way to extract components.  A
single element, and not just a length-one-sublist, is extracted by
double brackets.  For instance,
```{r, eval=FALSE}
animals[[2]]

## [1] "Baboon"
```
returns a length-1 character vector.

Unfortunately, the good news ends here.  You can extract individual
elements in this way, but you cannot get a vector of individual list
components: `animals[[1:2]]` fails with a _subscript out of bounds_ error.
(With a vector index, double brackets index recursively: R looks for the
second element _inside_ the first component.)
As above, this is a design choice: as list components may be of
different types, you may not be able to mold them into a single vector.

<p class="alert alert-info">
There are ways to merge components into a vector, given they are of the
same type.  For instance `Reduce(c, animals)` will convert the animals
into a vector of suitable type.  Ditto with `as.character(animals)`.
</p>
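
A short sketch of these conversions, using the `animals` list from above. The base R function `unlist()` is not mentioned in the text; it is added here as a third, commonly used option:

```r
animals <- list("Aardvark", "Baboon", "Camel")

Reduce(c, animals)        # "Aardvark" "Baboon" "Camel"
as.character(animals)     # same result
unlist(animals)           # same result

# the result is an atomic character vector, not a list
is.list(unlist(animals))  # FALSE
```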


### Indexing by Name

If the list is named, one can use a character vector to extract its
components, in exactly the same way as we used the numeric positions
above.  For instance:

```r
person <- list(first_name = "Bob", last_name = "Wong", salary = 77000, in_union = TRUE)

person[c("first_name", "salary")]

## $first_name
## [1] "Bob"
## 
## $salary
## [1] 77000

person[["first_name"]]  # [1] "Bob"
person[["salary"]]  # [1] 77000
```
As in the case of positional indexing, single brackets return a sublist
while double brackets return the corresponding component itself.


### Indexing by Logical Vector

As in the case of atomic vectors, we can use logical indices with
lists too.  There are a few differences though:

* One can only extract sublists, not individual components.
  `person[c(TRUE, TRUE, FALSE, FALSE)]` will give you a sublist with
  the first and last name.  `person[[c(TRUE, FALSE, FALSE, FALSE)]]` will
  fail.
* Many operators are vectorized but they are not "listified".  You
  cannot do math like `*` or `+` with lists.  Hence
  powerful logical indexing operations like `x[x > 0]` are in general not possible
  with lists.  This substantially reduces the potential use cases of
  logical indexing.

Logical indexing is still useful in combination with names.  For instance, we can extract all components whose names match a certain pattern from the
list:
```{r, eval=FALSE}
planes <- list("Airbus 380" = c(seats = 575, speed = 0.85),  # speed: cruise speed, Mach
               "Boeing 787" = c(seats = 290, speed = 0.85),
               "Airbus 350" = c(seats = 325, speed = 0.85))
# extract the components whose names start with "Airbus"
planes[startsWith(names(planes), "Airbus")]
## $`Airbus 380`
##  seats  speed 
## 575.00   0.85 
## 
## $`Airbus 350`
##  seats  speed 
## 325.00   0.85 
```
<p class="alert alert-info">
However, certain vectorized operations, such as `>` or `==`, also work with lists
that contain single numeric values as their elements.  It seems to be
hard to come up with general rules, so we recommend not relying on this
behaviour in code.
</p>
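
A more predictable way to filter a list by its contents is to compute the logical vector explicitly, one component at a time. The sketch below is one possible approach, not the only one: it relies on `sapply()`, a member of the `*apply()` family described at the end of this chapter, and the helper name `is_large` is made up for illustration.

```r
planes <- list("Airbus 380" = c(seats = 575, speed = 0.85),
               "Boeing 787" = c(seats = 290, speed = 0.85),
               "Airbus 350" = c(seats = 325, speed = 0.85))

# one TRUE/FALSE per component: does the plane have more than 300 seats?
is_large <- sapply(planes, function(plane) plane[["seats"]] > 300)

# use the logical vector to pick the matching sublist
names(planes[is_large])  # "Airbus 380" "Airbus 350"
```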


### Extracting Named Elements with `$` {#lists-dollar-shortcut}

Finally, there is a very convenient `$`-shortcut alternative for
extracting individual components.
If you print out a named list, for instance a small `person` list, you see the following:

```{r, eval=FALSE}
person <- list(name = "Ada", job = "Programmer")
print(person)
## $name
## [1] "Ada"
## 
## $job
## [1] "Programmer"
```

Notice that the output lists each name tag prepended with a dollar sign
(**`$`**), and then on the following line the vector that is the
element itself.  You can retrieve individual components in a similar
fashion: the **dollar notation** is one of the easiest ways of accessing
list elements.  You refer to a particular element in the list by its
tag, writing the name of the list, followed by a `$`, followed by the element's tag:
```{r, eval=FALSE}

person$name  # [1] "Ada"
person$job  # [1] "Programmer"
```
Obviously, this only works for named lists.  There is no dollar notation analogue for atomic vectors, not even for named
vectors.  The `$` extractor only exists for lists (and for data structures
that are derived from lists, like data frames).  
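
The following sketch illustrates the difference with a made-up named vector `salaries`. Because `$` raises an error on an atomic vector, the call is wrapped in `tryCatch()` so that the script keeps running:

```r
salaries <- c(ada = 78000, bob = 77000)   # a named atomic vector

salaries[["ada"]]   # 78000: brackets work with names

# `$` is not defined for atomic vectors; capture the error instead of stopping
result <- tryCatch(salaries$ada, error = function(e) "error")
result              # "error"
```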

You can almost read the dollar sign as an "apostrophe s" (possessive) in English: so `person$salary` would mean "the `person` list**'s** `salary` value".

Dollar notation allows list elements to almost be treated as variables in their own right&mdash;for example, you specify that you're talking about the `salary` variable in the `person` list, rather than the `salary` variable in some other list (or not in a list at all).

```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)

# use elements as function or operation arguments
paste(person$job, person$first_name)   # [1] "Programmer Ada"

# assign values to list element
person$job <- "Senior Programmer"  # a promotion!
print(person$job)  # [1] "Senior Programmer"

# assign value to list element from itself
person$salary <- person$salary * 1.15  # a 15% raise!
print(person$salary)  # [1] 89700
```

Dollar notation is a drop-in replacement for double-bracket extraction,
provided you know the name of the component when you write the code.  If you do not, as is
often the case when programming, you have to rely on the double-bracket approach.


### Single vs. Double Brackets vs. Dollar

List indexing may be confusing: we have single and double brackets,
indexing by position and by name, and finally the dollar notation.  Which
is the right one to use?  As is so often the case, it depends.

* **Dollar notation** is the quickest and easiest way to extract a single
  named component when you know its name.
  
* **Double brackets** are a more verbose alternative to the dollar notation.  They return a single component, exactly as the dollar notation does.  However, they also allow one to decide later which component to extract.  (This is terribly useful in programs!)  For instance,
we can decide whether we want to use someone's first or last name:
```{r, eval=FALSE}
person <- list(first_name = "Bob", last_name = "Wong", salary = 77000)
name_to_use <- "last_name"  # choose name (e.g., based on formality)
person[[name_to_use]]  # [1] "Wong"

name_to_use <- "first_name"  # change name to use
person[[name_to_use]]  # [1] "Bob"
```
<p class="alert alert-info">
Note: you will often hear that double brackets return a vector.  This is only true if the corresponding element is a vector.  But they always return the element itself!
</p>


* **Single brackets** are the most powerful and universal way of indexing.  They work in a very similar fashion to vector indexing.  The main caveat here is that they _return a sub-list_, not a vector.  (But note that in the case of vectors, single-bracket indexing returns a _sub-vector_.)  They allow indexing by position, by name, and by logical vector.  
In some sense this is **filtering** by whatever vector is inside the brackets (which may have just a single element).  In R, single brackets _always_ mean filtering the collection, where the collection may be either an atomic vector or a list.  So if you put single brackets after a collection, you get a filtered version of the same collection, containing the desired elements.  The type of the collection, list or atomic vector, is not affected.

<p class="alert alert-warning">**Watch out**: In vectors, single-bracket notation returns a vector; in lists, single-bracket notation returns a list!
</p>

We recap this section with an example:

```r
animal <- list(class='A', count=201, endangered=TRUE, species='rhinoceros')

## SINGLE brackets return a list
animal[1]
## $class
## [1] "A"

## can use any vector as the argument to single brackets, just like with vectors
animal[c("species", "endangered")]
## $species
## [1] "rhinoceros"
## 
## $endangered
## [1] TRUE

## DOUBLE brackets return the element itself (here it's a vector)!
animal[[1]]  # [1] "A"

## Dollar notation is equivalent to the double brackets
animal$class  # [1] "A"

```

Finally, all these methods can also be used for assignment.  Just put any of these constructs on the left side of the assignment operator `<-`.
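
A minimal sketch of assignment through each of the three constructs, using a shortened version of the `animal` list. Note that with single brackets the right-hand side is given as a list here, mirroring the fact that single brackets operate on sublists:

```r
animal <- list(class = "A", count = 201, endangered = TRUE)

animal$count <- 202                    # dollar notation
animal[["class"]] <- "B"               # double brackets
animal["endangered"] <- list(FALSE)    # single brackets: assign a sublist

animal$count        # 202
animal$class        # "B"
animal$endangered   # FALSE
```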


## Modifying Lists

As with atomic vectors, you can assign new values to existing elements.  However, lists also offer dedicated syntax to _remove_ elements.  (Remember, you can always "unselect" an element in a vector or a list by using a negative positional index.)

You can add elements to a list simply by assigning a value to a tag (or index) in the list that doesn't yet exist:
```r
person <- list(first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE)

# has no `age` element
person$age  # NULL

# assign a value to the `age` tag to add it
person$age <- 40
person$age  # [1] 40

# assign using index
person[[10]] <- "Tenth field"
# elements 6-9 will be NULL
```
This parallels the behavior of atomic vectors fairly closely.

You can remove elements by assigning the special value `NULL` to their tag or index:

```r
a_list <- list('A', 201, TRUE)
a_list[[2]] <- NULL  # remove element #2
print(a_list)
            # [[1]]
            # [1] "A"
            #
            # [[2]]
            # [1] TRUE
```
Atomic vectors have no analogue of this syntax.
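
To contrast the two removal approaches mentioned in this section, here is a small sketch (the name `shorter` is made up). A negative index returns a shortened copy and leaves the original untouched, while assigning `NULL` changes the list itself:

```r
a_list <- list("A", 201, TRUE)

# negative index: returns a copy without element 2
shorter <- a_list[-2]
length(shorter)   # 2
length(a_list)    # 3, the original is unchanged

# assigning NULL: removes element 2 from the list itself
a_list[[2]] <- NULL
length(a_list)    # 2
```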


## The `lapply()` Function

A large number of common R functions (e.g., `paste()`, `round()`, etc.) and most common operators (like `+`, `>`, etc.) are _vectorized_, so you can pass vectors as arguments and the function will be applied to each item in the vector.  It "just works".  With lists, this usually fails.  You need to put in a bit more effort if you want to apply a 
function to each item in a list.
The effort involves either an explicit loop, or an implicit loop through a function called **`lapply()`** (for _**l**ist apply_).  We will discuss the latter approach here.

`lapply()` takes two arguments: the first is the list (an atomic vector will do as well) you want to work with, and the second is the function you want to "apply" to each item in that list. For example:
```{r, eval=FALSE}
# list, not a vector
people <- list("Sarah", "Amit", "Zhang")

# apply the `toupper()` function to each element in `people`
lapply(people, toupper)
## [[1]]
## [1] "SARAH"
##
## [[2]]
## [1] "AMIT"
##
## [[3]]
## [1] "ZHANG"
```
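
For comparison, the explicit-loop alternative mentioned above could look roughly like the following sketch. The name `shouted` is made up for illustration, and this is only meant to convey what `lapply()` does conceptually, not how it is implemented:

```r
people <- list("Sarah", "Amit", "Zhang")

# pre-allocate the result, then fill it element by element
shouted <- vector("list", length(people))
for (i in seq_along(people)) {
  shouted[[i]] <- toupper(people[[i]])
}

# same result as the one-line lapply() call
identical(shouted, lapply(people, toupper))  # TRUE
```
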
You can add even more arguments to `lapply()`; they are passed on to the function you are applying:
```{r, eval=FALSE}
# apply the `paste()` function to each element in `people`,
# with an additional argument `"dances!"` passed to each call
lapply(people, paste, "dances!")
## [[1]]
## [1] "Sarah dances!"
## 
## [[2]]
## [1] "Amit dances!"
## 
## [[3]]
## [1] "Zhang dances!"
```
The last unnamed argument, `"dances!"`, is taken as the second argument to `paste()`.  So behind the scenes, `lapply()` runs a loop over `paste("Sarah", "dances!")`, `paste("Amit", "dances!")` and so on.

- Notice that the second argument to `lapply()` is just the function itself: not the name of the function as a character string (it's not quoted in `""`).  You're also not actually _calling_ that function when you write its name in `lapply()` (you don't put the parentheses `()` after its name).  See more in [section _How to Use Functions_](#how-to-use-functions). 
After the function, you can put any additional arguments you want the applied function to be called with: for example, how many digits to round to, or what value to paste to the end of a string.
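
The rounding case can be sketched as follows. The list `measurements` is made up for illustration; `digits` is the name of the second argument of base R's `round()`:

```r
measurements <- list(3.14159, 2.71828, 1.41421)

# extra arguments are passed on to round(), here by position...
lapply(measurements, round, 2)

# ...or, more readably, by name
rounded <- lapply(measurements, round, digits = 2)
rounded[[1]]  # 3.14
```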

Note that the `lapply()` function returns a _new_ list; the original one is unmodified.  This makes it a [**mapping**](https://en.wikipedia.org/wiki/Map_(parallel_pattern)) operation.  A mapping _operation_ is not the same thing as the _map_ data structure.  In a mapping operation the code applies the same **elemental function** to all the elements in a list.

You commonly use `lapply()` with your own custom functions which define what you want to do to a single element in that list:

```r
# A function that prepends "Hello" to any item
greet <- function(item) {
  return(paste("Hello", item))
}

# a list of people
people <- list("Sarah", "Amit", "Zhang")

# greet each name, then print the result
greetings <- lapply(people, greet)
greetings
## [[1]]
## [1] "Hello Sarah"
##
## [[2]]
## [1] "Hello Amit"
##
## [[3]]
## [1] "Hello Zhang"
```

Additionally, `lapply()` is a member of the "`*apply()`" family of functions: a set of functions that each start with a different letter and may apply to a different data structure, but otherwise all work in a similar fashion.  For example, `lapply()` returns a list, while `sapply()` (**s**implified apply) simplifies the list into a vector, if possible.  If you are interested in parallel programming, we recommend that you check out the function `parLapply()` and its friends in the _parallel_ package.
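
A brief sketch of the difference between `lapply()` and `sapply()`, reusing the `people` list from above. The last line shows a case where simplification is not possible because the results have different lengths:

```r
people <- list("Sarah", "Amit", "Zhang")

lapply(people, nchar)   # a list of three length-1 vectors
sapply(people, nchar)   # an integer vector: 5 4 5

# sapply() falls back to a list when the results cannot be simplified
is.list(sapply(list(1:2, 1:3), rev))  # TRUE
```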


## Resources {-}
- [R Tutorial: Lists](http://www.r-tutor.com/r-introduction/list)
- [R Tutorial: Named List Members](http://www.r-tutor.com/r-introduction/list/named-list-members)
- [StackOverflow: Single vs. double brackets](http://stackoverflow.com/questions/1169456/in-r-what-is-the-difference-between-the-and-notations-for-accessing-the)