data-tables.Rmd

# Thinking Big: Data Tables {#data-tables}

Data frames are core elements for data handling in R.  However, they
suffer from several limitations.  One of the major issue with data
frames is that they are memory hungry and slow.  This is not an issue
when working with relatively small datasets (say, up to 100,000
rows).  However, when your dataset size exceed gigabytes, dataframes
may be infeasibly slow and too memory hungry.


## Background: Passing By Value And Passing By Reference

R is (mostly) a pass-by-value language.  This means that when you
modify the data, at every step a new copy of the complete modified
object is created, stored in memory, and the former object is freed
(carbage-collected) if not in use any more.

The main advantage of this approach is consistency: we have the
guarantee that functions do not modify their inputs.  However, in case
of large objects, copying may be slow, and even more, it requires
at least twice as much memory before the old object is freed.  In case
of more complex process pipelines, the memory consumption may be even more
than twice of the size of the original object.

Data tables implement a number of pass-by-reference functions.  In
pass-by-reference, the function is not given a fresh copy of the
inputs, but is instead told where the object is in memory.  Instead of
copying gigabytes of data, only a single tiny memory pointer is
passed.  But this also means the function now is accessing and
modifying the original object, not a copy of it.  This may sometimes
lead to bugs and unexpected behavior, but professional use of
pass-by-reference approach may improve the speed and lower the memory
footprint substantially.


## Data Tables: Introduction

Data tables and most of the related goodies live in _data.table_
library, so you either have to load the library or specify the
namespace when using the functions.


### Replacement for Data Frames (Sort of)

Data tables are designed to be largely a replacement to data frames.
The syntax is similar and they are largely replaceable.  For instance,
we can create and play with a data table as
```{r}
library(data.table)
dt <- data.table(id=1:5, x=rnorm(5), y=runif(5))
dt
```
The result looks almost identical to a similar data frame (the only
difference are the colons after the row numbers).  Behind the scenes
these objects are almost identical too--both objects are lists of
vectors.  This structural similarity allows to use data tables
as drop-in replacements for dataframes, at least in some
circumstances.  For instance, we can extract variables with `$`:
```{r}
dt$x
```
or rows with row indices:
```{r}
dt[c(2,4),]
```

However, data tables use unquoted variables names (like _dplyr_) by
default:
```{r}
dt[,x]
```
In case we need to store the variable name into another variable, with
have to use the additional argument `with`:
```{r}
var <- "x"
dt[, var, with=FALSE]
```
Note also that instead of getting a vector, now we get a data.table
with a single column "x" in the first.  This behavior is the main culprit that when
replacing data frames with data tables one may need to change quite a
bit of code.


### Fast Reading and Writing

Many data frame users may appreciate the fact that the data
input-output function `fread` and `fwrite` run at least a magnitude
faster on large files.  These are largely replacement for `read.table`
and `write.table`, however they syntax differs noticeably in places.
In particular, `fread` accepts either a file name, http-url, or a _shell command
that prints output_; it automatically detects the column separator,
but it
does not automatically open compressed files.  The latter is not a big
deal when using unix where one can just issue
```r
data <- fread("bzcat data.csv.bz2")
```
However, the decompression is not that simple on windows and hence it
is hard to write platform-independent code that opens compressed
files.^[Automatic decompression is a feature request for data tables]

If your computer has enough memory and speed is not an issue, your
interest for data tables may end here.  You can just transform data
table into a data frame with `setDF` (and the other way around with `setDT`).  Let's transform our data table to data
frame:
```{r}
setDF(dt)
dt
```
Do you see that the colons after row names are gone?  This means `dt`
now is a data frame.

Note that this function behaves very differently from what we have
learned earlier: it modifies the object _in place_ (by reference).  We
do not have to assign the result into a new variable using a construct
like `df <- setDF(dt)` (but we still can write like this, handy when
using magrittr pipes).  This is a manifestation of the power of
data.tables: the object is not copied but the same object is modified
in memory instead.  `setDF` and `setDT` are very efficient, even huge
tables are converted instantly with virtually no need for any
additional memory.

However, big powers come hand-in-hand with big responsibility:
it is easy to forget that `setDF` modifies the function argument.


## Indexing: The Major Powerhorse of Data Tables

Data tables' indexing is much more powerful than that of data frames.
The single-bracket indexing is a powerful (albeit confusing) set of
functions.  It's general syntax is as follows:
```r
dt[i, j, by]
```
where `i` specifies what to do with rows (for instance, select certain
rows), `j` tells what to do with columns (such as select columns,
compute new columns, aggregate columns), and `by` contains the
grouping variables. 

Let's demonstrate this with the _flights_ data from _nycflights13_
package.  We load the data and transform it into data.table:
```{r}
data(flights, package="nycflights13")
setDT(flights)
head(flights)
```

### i: Select Observations

Obviously, we can always just tell which observations we want:
```{r}
flights[c(1:3),]
```
picks the first three lines from the data.  Maybe more interestingly,
we can use the special variable `.N` (the number of rows), to get the
penultimate row:
```{r}
flights[.N-1,]
```

We can select observations with logical index vector in the same way as in data frames:
```{r}
head(flights[origin == "EWR" & dest == "SEA",], 3)
```
will create a new data table including only flights from Newark to
Seattle.  However, note that we just use `origin`, and not
`flights$origin` as were the case with data frames.  Data tables
evaluate the arguments as if inside `with`-function.

The first, integer indexing corresponds to dplyr's `slice` function
while the other one is equivalent to `filter`.


### j: Work with Columns

`j` is perhaps the most powerful (and most confusing) of all arguments
for data table indexing.  It allows both to select and do more complex
tasks.  Lets start with selection:
```{r}
head(flights[, dest], 3)
```
selects only the `dest` variable from the data.  Note this results in
a vector, not in a single-variable data table.  If you want to get
that, you can do
```{r}
head(flights[, .(dest)], 3)
```
`.()` is just an alias for `list()`, encoded differently in data
tables to improve readability and make it easier to type.  If we want to select
more that one variable, we can use the latter syntax:
```{r}
head(flights[, .(origin, dest)], 3)
```
Selection supports a number of goodies, such as ranges of variables
with `:` (for instance, `dep_time:arr_delay`) and excluding variables
with `!` or `-` (for instance, `-year`).

Obviously we can combine both `i` and `j`: let's select origin and
departure delay for flights to Seattle:
```{r}
head(flights[dest == "SEA", .(origin, dep_delay)], 3)
```
The example so far broadly corresponds to dplyr's `select`.

But `j` is not just for selecting.  It is also for computing.  Let's
find the mean arrival delay for flights to Seattle:
```{r}
flights[dest == "SEA", mean(arr_delay, na.rm=TRUE)]
```
Several variables can be returned by wrapping, and optionally named,
these in `.()`.  For instance, find the average departure and arrival
delay for all flights to Seattle, given the flight was delayed on
arrival, and name these `dep` and `arr`:
```{r}
flights[dest == "SEA" & arr_delay > 0,
        .(dep = mean(dep_delay, na.rm=TRUE), arr = mean(arr_delay, na.rm=TRUE))]
```
The result is a data table with two variables.

We can use the special variable `.N` to count the rows:
```{r}
flights[dest == "SEA" & arr_delay > 0, .N]
```
will tell us how many flights to Seattle were delayed at arrival.

Handling the case where the variable names are stored in other
variables is not that hard, but still adds a layer of complexity.  We
can specify the variables in `.SDcols` parameter.  This
parameter determines which columns go into `.SD` (=Subset Data)
special variable.  Afterwards we make an `lapply` expression in `j`:
```{r}
flights[dest == "SEA" & arr_delay > 0,
        lapply(.SD, function(x) mean(x, na.rm=TRUE)),
        .SDcols = c("arr_delay", "dep_delay")]
```
Let's repeat: `.SDcols` determines which variables will go into the
special `.SD` list (default: all).  `lapply` in `j` computes mean
values of each of the variables in the `.SD` list.  This procedure
feels complex, although it is internally optimized.

These examples correspond to dplyr's `aggregate` function.  One can
argue, however, that data tables' syntax is more confusing and harder
to read.  Note also that the functionality data tables offer here is
optimized for speed and memory efficiency but still return a new
object.  Aggregation does not work by reference.


### Group in `by`

Finally, all of the above can by computed by groups using `by`.  Let's
compute the average delays above by carrier and origin:

```{r}
flights[dest == "SEA" & arr_delay > 0,
        .(dep = mean(dep_delay, na.rm=TRUE), arr = mean(arr_delay, na.rm=TRUE)),
        by = .(carrier, origin)]
```
We just had to specify the `by` argument that lists the grouping
variables.  If more than one, these should be wrapped in a list with
`.()` function.

We can use the `.N` variable to get the group size.  How many flights
did each carrier from each origin?
```{r}
flights[, .N, by=.(carrier, origin)] %>%
   head(3)
```
Finally, we can also use quoted variables for grouping too just be
replacing `.()` with `c()`:
```{r}
flights[, .N, by=c("carrier", "origin")] %>%
   head(3)
```

In dplyr context, the examples here include `group_by` and `summarize` verbs.

Read more about the basic usage in data.table the vignette [Data analysis using data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html).


## `:=`--Create variables by reference

While summarizing we compute values in `j`, these will always create a
new data.table.  Reducing operations are not possible to do in-place.
But computing new variables can be done in-place.

In place variable computations (without summarizing) can be done with
`:=` assignment operator in `j`.  Let's compute a new
variable--speed--for each flight.  We can do this as follows:
```{r}
flights[, speed := distance/(air_time/60)]
flights %>% head(3)
```
We see the new variable, speed, included as the last variable in the
data.  Note we did this operation _by reference_, i.e. we did not
assign the result to a new data table.  The existing table was
modified in place.

The same assignment operator also permits us to remove variables by
setting these to `NULL`.  Let's remove speed:
```{r}
flights[, speed := NULL]
flights %>% head(3)
```
Indeed, there is no speed any more.

Assigning more that one variable by reference may feel somewhat more
intimidating:
```{r}
flights[, c("speed", "meanDelay") := .(distance/(air_time/60), (arr_delay + dep_delay)/2)]
flights %>% head(3)
```

Assignment works together with both selection and grouping.  For
instance, we may want to replace negative delay by zeros:
```{r}
flights[ arr_delay < 0, arr_delay := 0][, arr_delay] %>%
   head(20)
```
Indeed, we only see positive numbers and zeros.  But be careful: now
we have overwritten the `arr_delay` in the original data.  We cannot
restore the previous state any more without re-loading the dataset.

As an example of
groupings, let's compute the maximum departure delay by origin:
```{r}
flights[, maxDelay := max(dep_delay, na.rm=TRUE), by=origin] %>%
   head(4)
```
We can see that `by` caused the delay to be computed for each group,
however, the data is not summarized, just the max delay is added to
every single row.

Finally, if you _do not_ want to modify the original data, you should
use `copy` function.  This makes a deep copy of the data, and you can
modify the copy afterwards:
```{r}
fl <- copy(flights)
fl <- fl[, .(origin, dest)]
head(fl, 3)
head(flights, 3)
```
As you see, the `flights` data has not changed.

These operations correspond to the dplyr's `mutate` verb.  However,
`mutate` always makes a copy of the original dataset, something that
may well make your analysis slow and sluggish with large data.


Read more in vignette [Data.table reference semantics](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html)


## keys

Data tables allow fast lookup based on _key_.  In it's simplest
version, a key is a column (or several columns) which is used to
pre-sort the data table.  Pre-sorting makes it much faster to look up
certain values, perform grouping operations and merges.  As data can
only be sorted according to one rule at a time, there can only be one
key on data.table (but a key may be based on several variables).

Let's set origin and destination as keys for the data table:
```{r}
data(flights, pkg="nycflights13")
setDT(flights, key=c("origin", "dest"))
fl <- flights[,.(origin, dest, arr_delay)]
                           # focus on a few variables only
head(fl, 3)
```
We see that both origin and destination are alphabetically ordered.
Note that when selecting variables, the resulting data table `fl` will
have the same keys as the original one.

When set, we can easily subset by key by just feeding the key values
in `i`:
```{r}
fl["LGA"] %>%
   head(5)
```
will extract all LaGuardia-originating flights.  In terms of output,
this is equivalent to `fl[origin == "LGA"]`, just much more
efficient.  When you want to
extract flights based on origin-destination pair, you can just add
both key columns:
```{r}
fl[.("EWR", "SEA")] %>%
   head(4)
```
Again, this can be achieved in other ways, just keys are more
efficient.  Finally, if we want to extract based on the second key,
the syntax is more confusing:
```{r}
fl[.(unique(origin), "SEA")] %>%
   head(4)
```
We have to tell the `[` that we want to extract all observations
where the first key is everything, and the second one is "SEA". 

Read more in the vignette [Keys and fast binary search based subset](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html).


## Resources

* [Data Table CRAN page](https://cran.r-project.org/web/packages/data.table/index.html).
  Vignettes are a very valuable source of information.