Skip to content

Latest commit

 

History

History
97 lines (80 loc) · 2.95 KB

Aggregate.md

File metadata and controls

97 lines (80 loc) · 2.95 KB

Aggregate

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
Two forms:

aggregate(x, by, FUN, ..., simplify = TRUE)
  • x: is the data frame.
  • by: a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.
  • FUN: a function to compute the summary statistics which can be applied to all data subsets.
## S3 method for class 'formula'
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
  • formula: a formula, such as y ~ x or cbind(y1, y2) ~ x1 + x2, where the y variables are numeric data to be split into groups according to the grouping x variables (usually factors).
  • data: a data frame (or list) from which the variables in formula should be taken.
  • FUN: same as above

~ in aggregate() separates, to the left side, what is being "aggregated", and to the right side, what is being used to "aggregate" the items. (Source).

f = price ~ carat

We start by creating the formula f using the strange looking tilde operator. That tells the R interpreter that we're defining a symbolic formula, rather than an expression to be evaluated immediately. So, our definition of formula f says, "price is a function of carat". (Source).

Examples

How many of each value?

> values <- data.frame(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
> values
   value
1      a
2      a
3      a
4      a
5      a
6      b
7      b
8      b
9      c
10     c
11     c
12     c
> aggregate(values, list(unique.values = values$value), length)
  unique.values value
1             a     5
2             b     3
3             c     4

(Source)

For each data subset in values$value, we apply the length function to compute the summary statistic.

One column as a function of another column

> usage = data.frame(user_id = c(1, 1, 1, 1, 2, 3, 3), timestamp = c(1451606518, 1451606519, 1451607549, 1451608511, 1451608555, 1451608773, 1451679101))
> usage
  user_id  timestamp
1       1 1451606518
2       1 1451606519
3       1 1451607549
4       1 1451608511
5       2 1451608555
6       3 1451608773
7       3 1451679101
> aggregate(timestamp ~ user_id, usage, length)
  user_id timestamp
1       1         4
2       2         1
3       3         2

# We could also do it like this:
> aggregate(usage["timestamp"], usage["user_id"], length)
  user_id timestamp
1       1         4
2       2         1
3       3         2

Note that NA values are automatically removed:

> d <- data.frame(list(kids = c("Jack", "Jill"), ages = c(12, NA)))
> d
  kids ages
1 Jack   12
2 Jill   NA
> aggregate(ages ~ kids, d, length)
  kids ages
1 Jack    1