Aggregation (GroupedData)

Note
Aggregating DataFrames with groupBy is still an experimental feature. It has been available since Apache Spark 1.3.0.

You can use a DataFrame to compute aggregates over a collection of (grouped) rows.

DataFrame offers the following aggregate operator:

  • groupBy

It returns GroupedData.
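
A minimal sketch (assuming the df built in the Test Setup section below):

// groupBy returns GroupedData (renamed RelationalGroupedDataset in Spark 2.0)
val grouped = df.groupBy("name")
// calling an aggregate method, e.g. count, turns it back into a DataFrame
grouped.count.show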

groupBy Operator

Note
The following session uses the data setup described in the Test Setup section below.

scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa|      100| 0.12|
| aaa|      200| 0.29|
| bbb|      200| 0.53|
| bbb|      300| 0.42|
+----+---------+-----+

scala> df.groupBy("name").count.show
+----+-----+
|name|count|
+----+-----+
| aaa|    2|
| bbb|    2|
+----+-----+

scala> df.groupBy("name").max("score").show
+----+----------+
|name|max(score)|
+----+----------+
| aaa|      0.29|
| bbb|      0.53|
+----+----------+

scala> df.groupBy("name").sum("score").show
+----+----------+
|name|sum(score)|
+----+----------+
| aaa|      0.41|
| bbb|      0.95|
+----+----------+

scala> df.groupBy("productId").sum("score").show
+---------+------------------+
|productId|        sum(score)|
+---------+------------------+
|      300|              0.42|
|      100|              0.12|
|      200|0.8200000000000001|
+---------+------------------+
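
The shortcuts above compute one aggregate at a time; agg computes several in one pass. A minimal sketch against the same Test Setup data (the max_score and sum_score aliases are my own):

import org.apache.spark.sql.functions.{max, sum}

df.groupBy("name")
  .agg(max("score") as "max_score", sum("score") as "sum_score")
  .show

// Expected result (matches the per-column outputs above):
// +----+---------+---------+
// |name|max_score|sum_score|
// +----+---------+---------+
// | aaa|     0.29|     0.41|
// | bbb|     0.53|     0.95|
// +----+---------+---------+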

GroupedData

GroupedData is the result of executing the groupBy operator on a DataFrame.

It offers the following operators to work on groups of rows (see the pivot sketch after the list):

  • agg

  • count

  • mean

  • max

  • avg

  • min

  • sum

  • pivot
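
Of these, pivot (available since Spark 1.6) is the least obvious, so here is a minimal sketch against the Test Setup data that turns the distinct productId values into columns:

df.groupBy("name").pivot("productId").sum("score").show

// Expected shape ((name, productId) pairs without a row come out null):
// +----+----+----+----+
// |name| 100| 200| 300|
// +----+----+----+----+
// | aaa|0.12|0.29|null|
// | bbb|null|0.53|0.42|
// +----+----+----+----+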

Test Setup

This is a setup for learning GroupedData. Paste it into the Spark shell using :paste.

import spark.implicits._

case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
  Token("aaa", 200, 0.29) ::
  Token("bbb", 200, 0.53) ::
  Token("bbb", 300, 0.42) :: Nil
val df = data.toDF.cache  // (1)

1. Cache the DataFrame so the following queries won’t load the data over and over again.
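
Note that cache is lazy: nothing is stored until the first action runs. A quick check (my own addition, not part of the original setup):

df.count  // the first action materializes the cache; later queries reuse it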