Note: Executing aggregation on DataFrames by means of groupBy is still an experimental feature. It has been available since Apache Spark 1.3.0.
You can use DataFrame to compute aggregates over a collection of (grouped) rows. DataFrame offers the following aggregate operators (a minimal sketch of rollup and cube follows):

- groupBy
- rollup
- cube

Each method returns GroupedData.
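A minimal sketch of rollup and cube, assuming the df defined in the Test Setup section below. rollup computes subtotals along the grouping hierarchy, while cube additionally computes subtotals for every combination of the grouping columns:

// Counts per (name, productId), per name, plus a grand total row
df.rollup("name", "productId").count.show

// All of the above, plus per-productId subtotals
df.cube("name", "productId").count.show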
Note: The following session uses the data setup as described in the Test Setup section below.
scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> df.groupBy("name").count.show
+----+-----+
|name|count|
+----+-----+
| aaa| 2|
| bbb| 2|
+----+-----+
scala> df.groupBy("name").max("score").show
+----+----------+
|name|max(score)|
+----+----------+
| aaa| 0.29|
| bbb| 0.53|
+----+----------+
scala> df.groupBy("name").sum("score").show
+----+----------+
|name|sum(score)|
+----+----------+
| aaa| 0.41|
| bbb| 0.95|
+----+----------+
scala> df.groupBy("productId").sum("score").show
+---------+------------------+
|productId| sum(score)|
+---------+------------------+
| 300| 0.42|
| 100| 0.12|
| 200|0.8200000000000001|
+---------+------------------+
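The 0.8200000000000001 above is ordinary floating-point rounding: score is a Double, and 0.29 + 0.53 has no exact binary representation. If you want rounded output, one option is the round function from org.apache.spark.sql.functions; a minimal sketch, assuming the df from the Test Setup below (the "sum(score)" alias merely mimics the default column name):

import org.apache.spark.sql.functions.{round, sum}

df.groupBy("productId")
  .agg(round(sum("score"), 2) as "sum(score)")
  .show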
GroupedData is the result of executing groupBy (as well as rollup and cube) on a DataFrame. It offers the following operators to work on groups of rows (see the sketch after this list):
- agg
- count
- mean
- max
- avg
- min
- sum
- pivot
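The session above already shows count, max and sum. A minimal sketch of agg and pivot (the latter is available since Spark 1.6), assuming the df from the Test Setup below and the standard aggregate functions from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{avg, max}

// Several aggregates computed in one pass over each group
df.groupBy("name").agg(max("score"), avg("score")).show

// One column per distinct productId value, holding the sum of score
df.groupBy("name").pivot("productId").sum("score").show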
This is the Test Setup for learning GroupedData. Paste it into the Spark shell using :paste.
import spark.implicits._
case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
  Token("aaa", 200, 0.29) ::
  Token("bbb", 200, 0.53) ::
  Token("bbb", 300, 0.42) :: Nil
val df = data.toDF.cache // (1)
(1) Cache the DataFrame so the following queries won't load the data over and over again.
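When you are done experimenting, the cached data can be released with unpersist (a standard DataFrame method):

df.unpersist()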