Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] Updating Rollup tutorial #16762

Merged
merged 21 commits into from
Jul 26, 2024
Merged
Changes from 15 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
189 changes: 70 additions & 119 deletions docs/tutorials/tutorial-rollup.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,18 +24,21 @@ sidebar_label: Aggregate data with rollup
-->


Apache Druid can summarize raw data at ingestion time using a process we refer to as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../multi-stage-query/concepts.md#rollup) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial will demonstrate the effects of rollup on an example dataset.
This tutorial demonstrates how to apply rollup during ingestion and highlights its effects during query execution. The examples in the tutorial use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to execute SQL statements.

For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](index.md) and have it running on your local machine.
## Prerequisites

It will also be helpful to have finished [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials.
Before proceeding, download Druid as described in [Quickstart (local)](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster.

## Example data
You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first.

For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.

## Load the example data

For this tutorial, you use a small sample of network flow event data representing IP traffic.
The data contains packet and byte counts from a source IP address to a destination IP address.

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
Expand All @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```

A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.

We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.

```json
{
"type" : "index_parallel",
"spec" : {
"dataSchema" : {
"dataSource" : "rollup-tutorial",
"dimensionsSpec" : {
"dimensions" : [
"srcIP",
"dstIP"
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"metricsSpec" : [
{ "type" : "count", "name" : "count" },
{ "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
{ "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "week",
"queryGranularity" : "minute",
"intervals" : ["2018-01-01/2018-01-03"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index_parallel",
"inputSource" : {
"type" : "local",
"baseDir" : "quickstart/tutorial",
"filter" : "rollup-data.json"
},
"inputFormat" : {
"type" : "json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index_parallel",
"partitionsSpec": {
"type": "dynamic"
},
"maxRowsInMemory" : 25000
}
}
}
Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query:

```sql
INSERT INTO "rollup_tutorial"
WITH "inline_data" AS (
SELECT *
FROM TABLE(EXTERN('{
"type":"inline",
"data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}',
'{"type":"json"}'))
EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT)
)
SELECT
FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
"srcIP",
"dstIP",
SUM("bytes") AS "bytes",
SUM("packets") AS "packets",
COUNT(*) AS "count"
FROM "inline_data"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```

Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`.

Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`.
In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity.
For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up.
edgar2020 marked this conversation as resolved.
Show resolved Hide resolved

edgar2020 marked this conversation as resolved.
Show resolved Hide resolved
We will see how these definitions are used after we load this data.
After the ingestion completes, you can query the data.

## Load the example data
## Query the example data

From the apache-druid-{{DRUIDVERSION}} package root, run the following command:
In the web console, open a new tab in the **Query** view. Run the following query to view the ingested data:

```bash
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
```sql
SELECT * FROM "rollup_tutorial"
```

After the script completes, we will query the data.
Returns the following:

## Query the example data
| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` |
| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |
| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` |
| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` |

Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.

```bash
$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql> select * from "rollup-tutorial";
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │
│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
Retrieved 5 rows in 1.18s.

dsql>
```
Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this.
edgar2020 marked this conversation as resolved.
Show resolved Hide resolved
edgar2020 marked this conversation as resolved.
Show resolved Hide resolved

## View rollup in action

Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`:

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```

These three rows have been "rolled up" into the following row:
Druid combines the three rows into one during rollup:

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |

The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`.
Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression buckets (floors) the timestamp column of the original input by minute.

Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.
The input rows are grouped because they have the same values for their dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the sum aggregation of the grouped rows for `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final rolled-up row.

Likewise, these two events that occurred during `2018-01-01T01:02` have been rolled up:
Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`:

```json
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
```

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
The rows are grouped into the following during rollup:

For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took place, because this was the only event that occurred during `2018-01-01T01:03`:
| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` |

In the original input data, only one event occurs over the course of minute `2018-01-01T01:03`:

```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
```

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
Therefore, no rollup takes place:

| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |


Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
edgar2020 marked this conversation as resolved.
Show resolved Hide resolved
Loading