[docs] Updating Rollup tutorial #16762

Merged · 21 commits · Jul 26, 2024
docs/tutorials/tutorial-rollup.md · 179 changes: 65 additions & 114 deletions


Apache Druid® can summarize raw data at ingestion time using a process known as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial demonstrates the effects of rollup on an example dataset.

For this tutorial, you should have Druid downloaded as described in
the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/index.md) (MSQ) task engine to execute SQL statements.

It is also helpful to have finished the [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials.

## Example data

For this tutorial, you use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```

This tutorial guides you through ingesting this data using rollup.

## Load the example data

Load the sample dataset using INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location.
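
You can also inspect external data before ingesting it by running a `SELECT` directly against `EXTERN`. The following sketch, which assumes the Query view is using the MSQ task engine, previews a single sample row without writing anything to Druid:

```sql
-- A minimal preview sketch: reading from EXTERN without INSERT ingests nothing.
-- The inline source below holds only the first sample row, for brevity.
SELECT *
FROM TABLE(EXTERN(
  '{"type":"inline","data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}"}',
  '{"type":"json"}'
))
EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT)
```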

In the Druid web console, go to the Query view and run the following query:

```sql
INSERT INTO "rollup_tutorial"
WITH "inline_data" AS (
SELECT *
FROM TABLE(EXTERN('{
"type":"inline",
"data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}',
'{"type":"json"}'))
EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT)
)
SELECT
FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
"srcIP",
"dstIP",
SUM("bytes") AS "bytes",
COUNT(*) AS "count",
SUM("packets") AS "packets"
FROM "inline_data"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
```

Note that the query uses the `FLOOR` function to bucket `__time` to a granularity of `MINUTE`. The query defines the dimensions of the rollup by grouping on columns 1, 2, and 3, which correspond to the `timestamp`, `srcIP`, and `dstIP` columns. It defines the metrics of the rollup by aggregating the `bytes` and `packets` columns.
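
The ordinals in `GROUP BY 1, 2, 3` refer to positions in the `SELECT` list. As a sketch, the `SELECT` portion of the ingestion query above could be written equivalently with explicit grouping expressions:

```sql
-- Equivalent form of the ingestion query's SELECT, grouping on explicit
-- expressions instead of ordinals. "inline_data" is the CTE defined above.
SELECT
  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
  "srcIP",
  "dstIP",
  SUM("bytes") AS "bytes",
  COUNT(*) AS "count",
  SUM("packets") AS "packets"
FROM "inline_data"
GROUP BY
  FLOOR(TIME_PARSE("timestamp") TO MINUTE),
  "srcIP",
  "dstIP"
```
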
After the ingestion completes, you can query the data.

## Query the example data

Open a new tab in the Query view and run the following query to see what data was ingested:

```sql
SELECT * FROM "rollup_tutorial"
```

Returns the following:

| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` |
| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |
| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` |
| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` |
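
The rolled-up metrics are ordinary numeric columns, so you can aggregate them further at query time. For example, this illustrative query, which is not part of the original tutorial flow, totals the traffic per source IP:

```sql
-- Aggregate the rolled-up rows further, here per source IP.
SELECT
  "srcIP",
  SUM("bytes") AS "total_bytes",
  SUM("packets") AS "total_packets"
FROM "rollup_tutorial"
GROUP BY "srcIP"
```

Based on the table above, this should return 412,401 total bytes and 750 total packets for `1.1.1.1`, and 103,106 total bytes and 173 total packets for `7.7.7.7`.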


Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`:

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```

During rollup, Druid combines the three rows into a single row, summing the `packets` values (20 + 255 + 11 = 286) and the `bytes` values (9024 + 21133 + 5780 = 35937):

| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |

The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`.

Before the grouping occurs, the timestamps of the original input data are floored to the minute by the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression in the query.
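
As a quick sanity check, you can evaluate the flooring expression on a literal timestamp in the Query view; a FROM-less `SELECT` like the following should return `2018-01-01T01:01:00.000Z`:

```sql
-- Floors a literal event time to the start of its minute.
SELECT FLOOR(TIME_PARSE('2018-01-01T01:01:35Z') TO MINUTE) AS "floored_time"
```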

Consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`:

```json
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
```

Druid rolls up the two rows into the following row, summing the `packets` values (38 + 377 = 415) and the `bytes` values (6289 + 359971 = 366260):

| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` |

In the original input data, only one event occurs over the course of minute `2018-01-01T01:03`:

```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
```

Therefore no rollup takes place:

| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
| -- | -- | -- | -- | -- | -- |
| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |

Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
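
Because of this, counting original events on a rolled-up datasource means summing the `count` metric rather than counting stored rows. This illustrative query contrasts the two:

```sql
-- COUNT(*) counts rows as stored after rollup;
-- SUM("count") recovers the number of original input events.
SELECT
  COUNT(*) AS "rolled_up_rows",
  SUM("count") AS "original_events"
FROM "rollup_tutorial"
```

For this dataset, the query should return 5 rolled-up rows covering 9 original events.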