From 9748b038ea72d07b9aa852072d888b9cbbe1ee63 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 19 Jul 2024 09:16:58 -0700 Subject: [PATCH 01/21] updating tutorial to use inline ingesting --- docs/tutorials/tutorial-rollup.md | 151 ++++++++++-------------------- 1 file changed, 48 insertions(+), 103 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index b2c74d7e5b37..68963c0244e6 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -29,7 +29,7 @@ Apache Druid can summarize raw data at ingestion time using a process we refer t This tutorial will demonstrate the effects of rollup on an example dataset. For this tutorial, we'll assume you've already downloaded Druid as described in -the [single-machine quickstart](index.md) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/) (MSQ) task engine to execute SQL statements. It will also be helpful to have finished [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials. @@ -49,101 +49,57 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} -``` - -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. We will see how these definitions are used after we load this data. ## Load the example data -From the apache-druid-{{DRUIDVERSION}} package root, run the following command: - -```bash -bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081 +Load a sample dataset using INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location. 
+ +In the Druid web console, go to the Query view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', '{"type":"json"}')) EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + COUNT(*) AS "count", + SUM("packets") AS "packets" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` After the script completes, we will query the data. ## Query the example data -Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested. - -```bash -$ bin/dsql -Welcome to dsql, the command-line client for Druid SQL. -Type "\h" for help. -dsql> select * from "rollup-tutorial"; -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │ -│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -Retrieved 5 rows in 1.18s. 
- -dsql> +Open a new tab in the Query view and run the following query to see what data was ingested: + +```sql +SELECT * FROM "rollup_tutorial" ``` +Returns the following: + +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | +| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | +| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | + + Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`: ```json @@ -154,13 +110,9 @@ Let's look at the three events in the original input data that occurred during ` These three rows have been "rolled up" into the following row: -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. @@ -173,13 +125,9 @@ Likewise, these two events that occurred during `2018-01-01T01:02` have been rol {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971} ``` -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took place, because this was the only event that occurred during `2018-01-01T01:03`: @@ -187,12 +135,9 @@ For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204} ``` -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row. 
+ From a2472f8098b174c97fd6de2d16f4e425bc21ab4d Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 19 Jul 2024 10:57:35 -0700 Subject: [PATCH 02/21] updating table and example sentences --- docs/tutorials/tutorial-rollup.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 68963c0244e6..3731d7cfed91 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -100,7 +100,7 @@ Returns the following: | `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`: +Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`: ```json {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024} @@ -108,7 +108,7 @@ Let's look at the three events in the original input data that occurred during ` {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780} ``` -These three rows have been "rolled up" into the following row: +Apache Druid combines the three rows into the following using rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | @@ -118,23 +118,27 @@ The input rows have been grouped by the timestamp and dimension columns `{timest Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec. -Likewise, these two events that occurred during `2018-01-01T01:02` have been rolled up: +Likewise, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: ```json {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289} {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971} ``` +The rows have been grouped into the following: + | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | -For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took place, because this was the only event that occurred during `2018-01-01T01:03`: +In the original input data, only one event occurs over the course of minute `2018-01-01T01:03`: ```json {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204} ``` +Therefor no rollup takes place: + | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | From a9d4b0a8ac9e9eddd5ce3d2606e4621f0567ce26 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 19 Jul 2024 13:11:42 -0700 Subject: [PATCH 03/21] modifying paragraphs --- docs/tutorials/tutorial-rollup.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 3731d7cfed91..923cef61e4d8 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -24,7 +24,7 @@ sidebar_label: Aggregate data with rollup --> -Apache Druid can summarize raw data at ingestion time using a process we refer to as "rollup". 
Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. +Apache Druid® can summarize raw data at ingestion time using a process we refer to as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. This tutorial will demonstrate the effects of rollup on an example dataset. @@ -49,9 +49,7 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. - -We will see how these definitions are used after we load this data. +We will see how to ingest this data using rollup. ## Load the example data @@ -65,7 +63,9 @@ WITH "inline_data" AS ( SELECT * FROM TABLE(EXTERN('{ "type":"inline", - "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', '{"type":"json"}')) EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) ) SELECT FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, @@ -79,6 +79,8 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` +Note that the query uses the `FLOOR` 
function to give the `__time` a granularity of `MINUTE`. The query defines the dimmensions of the rollup by grouping columns 1, 2, and 3, which corresponds to the `timestamp`, `srcIP`, and `dstIP` columns. The query defines the metrics of the rollup by aggregating the `bytes` and `packets` columns. + After the script completes, we will query the data. ## Query the example data @@ -108,7 +110,7 @@ Consider the three events in the original input data that occur over the course {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780} ``` -Apache Druid combines the three rows into the following using rollup: +Apache Druid combines the three rows into the following during rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | @@ -116,16 +118,16 @@ Apache Druid combines the three rows into the following using rollup: The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. -Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec. +Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` function in the query. -Likewise, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: +Consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: ```json {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289} {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971} ``` -The rows have been grouped into the following: +The rows have been grouped into the following during rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | @@ -137,7 +139,7 @@ In the original input data, only one event occurs over the course of minute `201 {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204} ``` -Therefor no rollup takes place: +Therefore no rollup takes place: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | From 2b76cf7759477b28d76601ce535a85f6008f4730 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 19 Jul 2024 13:40:32 -0700 Subject: [PATCH 04/21] applying drive by comments --- docs/tutorials/tutorial-rollup.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 923cef61e4d8..faad2e605f54 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -24,18 +24,18 @@ sidebar_label: Aggregate data with rollup --> -Apache Druid® can summarize raw data at ingestion time using a process we refer to as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. +Apache Druid® can summarize raw data at ingestion time using a process known as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -This tutorial will demonstrate the effects of rollup on an example dataset. 
+This tutorial demonstrate the effects of rollup on an example dataset. -For this tutorial, we'll assume you've already downloaded Druid as described in -the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/) (MSQ) task engine to execute SQL statements. +For this tutorial, you should have Druid downloaded as described in +the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/index.md) (MSQ) task engine to execute SQL statements. -It will also be helpful to have finished [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials. +It is helpful to have finished [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials. ## Example data -For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second. +For this tutorial, you use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second. ```json {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024} @@ -49,11 +49,11 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -We will see how to ingest this data using rollup. +The tutorial guides you through how to ingest this data using rollup. ## Load the example data -Load a sample dataset using INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location. +Load the sample dataset using INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location. In the Druid web console, go to the Query view and run the following query: @@ -79,9 +79,9 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` -Note that the query uses the `FLOOR` function to give the `__time` a granularity of `MINUTE`. The query defines the dimmensions of the rollup by grouping columns 1, 2, and 3, which corresponds to the `timestamp`, `srcIP`, and `dstIP` columns. The query defines the metrics of the rollup by aggregating the `bytes` and `packets` columns. +Note that the query uses the `FLOOR` function to give the `__time` a granularity of `MINUTE`. The query defines the dimensions of the rollup by grouping columns 1, 2, and 3, which corresponds to the `timestamp`, `srcIP`, and `dstIP` columns. The query defines the metrics of the rollup by aggregating the `bytes` and `packets` columns. -After the script completes, we will query the data. +After the ingestion completes, you can query the data. 
## Query the example data @@ -110,7 +110,7 @@ Consider the three events in the original input data that occur over the course {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780} ``` -Apache Druid combines the three rows into the following during rollup: +Druid combines the three rows into the following during rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | From 1e12fbf8b8cfa0aa5195c4d9c2897373c21c80cf Mon Sep 17 00:00:00 2001 From: Benedict Jin Date: Mon, 22 Jul 2024 14:38:47 +0800 Subject: [PATCH 05/21] Update docs/tutorials/tutorial-rollup.md --- docs/tutorials/tutorial-rollup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index faad2e605f54..05025a3af535 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -26,7 +26,7 @@ sidebar_label: Aggregate data with rollup Apache Druid® can summarize raw data at ingestion time using a process known as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -This tutorial demonstrate the effects of rollup on an example dataset. +This tutorial demonstrates the effects of rollup on an example dataset. For this tutorial, you should have Druid downloaded as described in the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/index.md) (MSQ) task engine to execute SQL statements. From ac8f4c8084fae075f84556b2c90cf1517f9a5d95 Mon Sep 17 00:00:00 2001 From: Edgar Melendrez Date: Tue, 23 Jul 2024 13:05:25 -0700 Subject: [PATCH 06/21] Update docs/tutorials/tutorial-rollup.md Co-authored-by: Victoria Lim --- docs/tutorials/tutorial-rollup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 05025a3af535..28923c240139 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -24,7 +24,7 @@ sidebar_label: Aggregate data with rollup --> -Apache Druid® can summarize raw data at ingestion time using a process known as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. +Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. This tutorial demonstrates the effects of rollup on an example dataset. From 9e07eedee52c089b8855b2b13483fb582773f52a Mon Sep 17 00:00:00 2001 From: Edgar Melendrez Date: Tue, 23 Jul 2024 13:06:30 -0700 Subject: [PATCH 07/21] Update docs/tutorials/tutorial-rollup.md Co-authored-by: Victoria Lim --- docs/tutorials/tutorial-rollup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 28923c240139..e5af32cc6fef 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -31,7 +31,7 @@ This tutorial demonstrates the effects of rollup on an example dataset. For this tutorial, you should have Druid downloaded as described in the [single-machine quickstart](index.md) and have it running on your local machine. 
The examples in the tutorial use the [multi-stage query](../multi-stage-query/index.md) (MSQ) task engine to execute SQL statements. -It is helpful to have finished [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials. +Before proceeding, it's recommended to complete the tutorials to [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md). ## Example data From 2ec1fcca82106a8673bf9d98b8d1d60107be32f4 Mon Sep 17 00:00:00 2001 From: Edgar Melendrez Date: Tue, 23 Jul 2024 13:26:52 -0700 Subject: [PATCH 08/21] Apply suggestions from code review Co-authored-by: Victoria Lim Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/tutorials/tutorial-rollup.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index e5af32cc6fef..9478e11102de 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -35,7 +35,8 @@ Before proceeding, it's recommended to complete the tutorials to [Load a file](. ## Example data -For this tutorial, you use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second. +For this tutorial, you use a small sample of network flow event data, representing IP traffic. +The data contains packet and byte counts from a source IP address to a destination IP address. ```json {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024} @@ -49,13 +50,13 @@ For this tutorial, you use a small sample of network flow event data, representi {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -The tutorial guides you through how to ingest this data using rollup. +The tutorial demonstrates how to apply rollup at ingestion and shows the effect of rollup at query time. ## Load the example data -Load the sample dataset using INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location. +Load the sample dataset using the EXTERN function to read data provided inline with the query. -In the Druid web console, go to the Query view and run the following query: +In the Druid web console, go to the **Query** view and run the following query: ```sql INSERT INTO "rollup_tutorial" @@ -79,13 +80,15 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` -Note that the query uses the `FLOOR` function to give the `__time` a granularity of `MINUTE`. The query defines the dimensions of the rollup by grouping columns 1, 2, and 3, which corresponds to the `timestamp`, `srcIP`, and `dstIP` columns. The query defines the metrics of the rollup by aggregating the `bytes` and `packets` columns. +Note that the query uses the `FLOOR` function to combine rows based on MINUTE granularity. +In the query, you group by dimensions, the `timestamp`, `srcIP`, and `dstIP` columns. +You apply aggregations for the metrics, specifically to sum the `bytes` and `packets` columns and to add a column to count the number of rows that get rolled up. After the ingestion completes, you can query the data. ## Query the example data -Open a new tab in the Query view and run the following query to see what data was ingested: +Open a new tab in the Query view. 
Run the following query to view the ingested data: ```sql SELECT * FROM "rollup_tutorial" @@ -110,15 +113,15 @@ Consider the three events in the original input data that occur over the course {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780} ``` -Druid combines the three rows into the following during rollup: +Druid combines the three rows into one during rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | -The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. +The input rows were grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. -Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` function in the query. +Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression in the query. Consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: @@ -139,7 +142,7 @@ In the original input data, only one event occurs over the course of minute `201 {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204} ``` -Therefore no rollup takes place: +Therefore, no rollup takes place: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | From e475568a2c99523c10acd35200060a6ac02e8a1c Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Tue, 23 Jul 2024 14:55:13 -0700 Subject: [PATCH 09/21] most sugggestions/comments addressed --- docs/tutorials/tutorial-rollup.md | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 9478e11102de..7014d47c9cfb 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -24,16 +24,18 @@ sidebar_label: Aggregate data with rollup --> -Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. +Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../ingestion/rollup.md) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -This tutorial demonstrates the effects of rollup on an example dataset. +This tutorial demonstrates the effects of rollup on an example dataset. See [ingesting with rollup](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) to learn more. -For this tutorial, you should have Druid downloaded as described in -the [single-machine quickstart](index.md) and have it running on your local machine. The examples in the tutorial use the [multi-stage query](../multi-stage-query/index.md) (MSQ) task engine to execute SQL statements. +## Prerequisites -Before proceeding, it's recommended to complete the tutorials to [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md). 
+Before proceeding, download Druid as described in [Quickstart (local)](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster. -## Example data +You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. + + +## Load the example data For this tutorial, you use a small sample of network flow event data, representing IP traffic. The data contains packet and byte counts from a source IP address to a destination IP address. @@ -52,11 +54,9 @@ The data contains packet and byte counts from a source IP address to a destinati The tutorial demonstrates how to apply rollup at ingestion and shows the effect of rollup at query time. -## Load the example data - -Load the sample dataset using the EXTERN function to read data provided inline with the query. +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to read data provided inline with the query. -In the Druid web console, go to the **Query** view and run the following query: +In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: ```sql INSERT INTO "rollup_tutorial" @@ -73,8 +73,8 @@ SELECT "srcIP", "dstIP", SUM("bytes") AS "bytes", - COUNT(*) AS "count", - SUM("packets") AS "packets" + SUM("packets") AS "packets", + COUNT(*) AS "count" FROM "inline_data" GROUP BY 1, 2, 3 PARTITIONED BY DAY @@ -88,7 +88,7 @@ After the ingestion completes, you can query the data. ## Query the example data -Open a new tab in the Query view. Run the following query to view the ingested data: +In the web console, open a new tab in the **Query** view. Run the following query to view the ingested data: ```sql SELECT * FROM "rollup_tutorial" @@ -119,7 +119,7 @@ Druid combines the three rows into one during rollup: | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | -The input rows were grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. +The input rows were grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. The `count` metric shows how many rows in the original input data contributed to the final "rolled up" row. Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression in the query. @@ -148,5 +148,4 @@ Therefore, no rollup takes place: | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | -Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row. 
From 22a647ff4d4eafcbad41345206a7edead08b1de5 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Tue, 23 Jul 2024 15:26:32 -0700 Subject: [PATCH 10/21] dividing the section into two --- docs/tutorials/tutorial-rollup.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 7014d47c9cfb..17e7c1879b58 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -104,6 +104,9 @@ Returns the following: | `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | | `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | +Notice there are only six rows as opposed to the nine rows of the example data. The next section covers how ingestion with rollup acomplishes this. + +## View rollup in action Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`: From 4832c63544a08f9eae2a93e2f68c0ff26a03b347 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Tue, 23 Jul 2024 15:44:45 -0700 Subject: [PATCH 11/21] changes after viewing localy --- docs/tutorials/tutorial-rollup.md | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 17e7c1879b58..460e13287a1d 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -26,7 +26,7 @@ sidebar_label: Aggregate data with rollup Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../ingestion/rollup.md) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -This tutorial demonstrates the effects of rollup on an example dataset. See [ingesting with rollup](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) to learn more. +The tutorial demonstrates how to apply rollup at ingestion and shows the effect of rollup at query time. See [ingesting with rollup](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) to learn more. ## Prerequisites @@ -52,11 +52,7 @@ The data contains packet and byte counts from a source IP address to a destinati {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -The tutorial demonstrates how to apply rollup at ingestion and shows the effect of rollup at query time. - -Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to read data provided inline with the query. - -In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: ```sql INSERT INTO "rollup_tutorial" @@ -80,9 +76,8 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` -Note that the query uses the `FLOOR` function to combine rows based on MINUTE granularity. -In the query, you group by dimensions, the `timestamp`, `srcIP`, and `dstIP` columns. 
-You apply aggregations for the metrics, specifically to sum the `bytes` and `packets` columns and to add a column to count the number of rows that get rolled up. +In the query, you group by dimensions, the `timestamp`, `srcIP`, and `dstIP` columns. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +You apply aggregations for the metrics, specifically to sum the `bytes` and `packets` columns and to add a column that counts the number of rows that get rolled up. After the ingestion completes, you can query the data. @@ -104,7 +99,7 @@ Returns the following: | `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | | `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Notice there are only six rows as opposed to the nine rows of the example data. The next section covers how ingestion with rollup acomplishes this. +Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this. ## View rollup in action @@ -126,7 +121,7 @@ The input rows were grouped by the timestamp and dimension columns `{timestamp, Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression in the query. -Consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: +Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: ```json {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289} From ef29e4cd1b754ae3ea0af4377534de758f806ce7 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Tue, 23 Jul 2024 15:51:22 -0700 Subject: [PATCH 12/21] restructuring paragraph --- docs/tutorials/tutorial-rollup.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 460e13287a1d..c83e69dc4d25 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -117,9 +117,9 @@ Druid combines the three rows into one during rollup: | -- | -- | -- | -- | -- | -- | | `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | -The input rows were grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. The `count` metric shows how many rows in the original input data contributed to the final "rolled up" row. +Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression buckets (floors) the timestamp column of the original input by minute. -Before the grouping occurs, the timestamps of the original input data are bucketed (floored) by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression in the query. +The input rows are then grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final "rolled up" row. 
Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: @@ -128,7 +128,7 @@ Now, consider the two events in the original input data that occur over the cour {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971} ``` -The rows have been grouped into the following during rollup: +The rows are grouped into the following during rollup: | `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | | -- | -- | -- | -- | -- | -- | From 4186130ea410b1c0a3740844858cae7ae33b3fe9 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Tue, 23 Jul 2024 16:08:36 -0700 Subject: [PATCH 13/21] felt paragraph needed more detail --- docs/tutorials/tutorial-rollup.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index c83e69dc4d25..66afc5645c2e 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -76,8 +76,8 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` -In the query, you group by dimensions, the `timestamp`, `srcIP`, and `dstIP` columns. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. -You apply aggregations for the metrics, specifically to sum the `bytes` and `packets` columns and to add a column that counts the number of rows that get rolled up. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled up. After the ingestion completes, you can query the data. @@ -119,7 +119,7 @@ Druid combines the three rows into one during rollup: Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression buckets (floors) the timestamp column of the original input by minute. -The input rows are then grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final "rolled up" row. +The input rows are grouped because they have the same values for their dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the sum aggregation of the grouped rows for `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final "rolled up" row. Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: From 83ff0c32e874af454411e8f022d8c31416263f79 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Thu, 25 Jul 2024 14:00:23 -0700 Subject: [PATCH 14/21] apply suggested improvments --- docs/tutorials/tutorial-rollup.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 66afc5645c2e..fb3ff2556d88 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -24,20 +24,20 @@ sidebar_label: Aggregate data with rollup --> -Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../ingestion/rollup.md) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. 
+Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../multi-stage-query/concepts.md#rollup) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -The tutorial demonstrates how to apply rollup at ingestion and shows the effect of rollup at query time. See [ingesting with rollup](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) to learn more. +This tutorial demonstrates how to apply rollup during ingestion and highlights its effects during query execution. The examples in the tutorial use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to executes SQL statements. ## Prerequisites Before proceeding, download Druid as described in [Quickstart (local)](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster. -You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. +You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. ## Load the example data -For this tutorial, you use a small sample of network flow event data, representing IP traffic. +For this tutorial, you use a small sample of network flow event data representing IP traffic. The data contains packet and byte counts from a source IP address to a destination IP address. ```json @@ -77,7 +77,7 @@ PARTITIONED BY DAY ``` In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. -For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled up. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. After the ingestion completes, you can query the data. @@ -119,7 +119,7 @@ Druid combines the three rows into one during rollup: Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression buckets (floors) the timestamp column of the original input by minute. -The input rows are grouped because they have the same values for their dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the sum aggregation of the grouped rows for `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final "rolled up" row. +The input rows are grouped because they have the same values for their dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the sum aggregation of the grouped rows for `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final rolled-up row. 
Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: From 452953f8db018170a1b4633212660e53efc32a48 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Thu, 25 Jul 2024 14:16:10 -0700 Subject: [PATCH 15/21] typo found --- docs/tutorials/tutorial-rollup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index fb3ff2556d88..4be007a713de 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -26,7 +26,7 @@ sidebar_label: Aggregate data with rollup Apache Druid® can summarize raw data at ingestion time using a process known as "rollup." [Rollup](../multi-stage-query/concepts.md#rollup) is a first-level aggregation operation over a selected set of columns that reduces the size of stored data. -This tutorial demonstrates how to apply rollup during ingestion and highlights its effects during query execution. The examples in the tutorial use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to executes SQL statements. +This tutorial demonstrates how to apply rollup during ingestion and highlights its effects during query execution. The examples in the tutorial use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to execute SQL statements. ## Prerequisites From 9e6ef00d3e4e919d2a3653945d9e0ca8fc1fc53b Mon Sep 17 00:00:00 2001 From: Edgar Melendrez Date: Fri, 26 Jul 2024 13:01:27 -0700 Subject: [PATCH 16/21] Update docs/tutorials/tutorial-rollup.md Co-authored-by: Victoria Lim --- docs/tutorials/tutorial-rollup.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 4be007a713de..861baf4999ca 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -76,8 +76,11 @@ GROUP BY 1, 2, 3 PARTITIONED BY DAY ``` -In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. -For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. +Note the following aspects of the ingestion statement: +* You transform the timestamp field using the `FLOOR` function to round timestamps down to the minute. +* You group by the dimensions `timestamp`, `srcIP`, and `dstIP`. +* You create the `bytes` and `packets` metrics, which are summed from their respective input fields. +* You also create the `count` metric that records the number of rows that get rolled-up per each row in the datasource. After the ingestion completes, you can query the data. From 9048c4a376f6dae70a8aad170890a363b6c18b4a Mon Sep 17 00:00:00 2001 From: Edgar Melendrez Date: Fri, 26 Jul 2024 13:04:29 -0700 Subject: [PATCH 17/21] Apply suggestions from code review Co-authored-by: Victoria Lim --- docs/tutorials/tutorial-rollup.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 861baf4999ca..f5e1e82174fd 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -82,6 +82,9 @@ Note the following aspects of the ingestion statement: * You create the `bytes` and `packets` metrics, which are summed from their respective input fields. 
* You also create the `count` metric that records the number of rows that get rolled-up per each row in the datasource. +With rollup, Druid combines rows with identical timestamp and dimension values after the timestamp truncation. +Druid computes and stores the metric values using the specified aggregation function over each set of rolled-up rows. + After the ingestion completes, you can query the data. ## Query the example data @@ -102,7 +105,7 @@ Returns the following: | `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | | `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this. +Notice there are only six rows as opposed to the nine rows in the example data. In the next section, you explore the components of the rolled-up rows. ## View rollup in action From 985c36a51b33d088316182adf62482a1af848481 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 26 Jul 2024 13:31:44 -0700 Subject: [PATCH 18/21] applying suggestions adding learn more section --- docs/tutorials/tutorial-rollup.md | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index f5e1e82174fd..623f8dee12dd 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -82,8 +82,7 @@ Note the following aspects of the ingestion statement: * You create the `bytes` and `packets` metrics, which are summed from their respective input fields. * You also create the `count` metric that records the number of rows that get rolled-up per each row in the datasource. -With rollup, Druid combines rows with identical timestamp and dimension values after the timestamp truncation. -Druid computes and stores the metric values using the specified aggregation function over each set of rolled-up rows. +With rollup, Druid combines rows with identical timestamp and dimension values after the timestamp truncation. Druid computes and stores the metric values using the specified aggregation function over each set of rolled-up rows. After the ingestion completes, you can query the data. @@ -105,7 +104,7 @@ Returns the following: | `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | | `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Notice there are only six rows as opposed to the nine rows in the example data. In the next section, you explore the components of the rolled-up rows. +Notice there are only five rows as opposed to the nine rows in the example data. In the next section, you explore the components of the rolled-up rows. ## View rollup in action @@ -153,3 +152,17 @@ Therefore, no rollup takes place: | `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | +## Learn More + +See the following topics for more information: + +* [SQL-based ingestion query examples](https://druid.apache.org/docs/latest/multi-stage-query/examples/#insert-with-rollup) for another example of data rollup during ingestion. + +* [SQL-based ingestion concepts](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) for more details on the concept of rollup. + +* [Data rollup](https://druid.apache.org/docs/latest/ingestion/rollup/) for suggestions and best practices when performing rollup. 
+ + +* [Druid schema model](https://druid.apache.org/docs/latest/ingestion/schema-model/) to go over more details on timestamp, dimensions, and metrics. + + From 5402d33bc39d46116ce91de81b5d40804dabfea4 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 26 Jul 2024 13:34:29 -0700 Subject: [PATCH 19/21] switch to using relative links --- docs/tutorials/tutorial-rollup.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 623f8dee12dd..c9b1bc4e1b6a 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -156,13 +156,13 @@ Therefore, no rollup takes place: See the following topics for more information: -* [SQL-based ingestion query examples](https://druid.apache.org/docs/latest/multi-stage-query/examples/#insert-with-rollup) for another example of data rollup during ingestion. +* [SQL-based ingestion query examples](../multi-stage-query/examples.md/#insert-with-rollup) for another example of data rollup during ingestion. -* [SQL-based ingestion concepts](https://druid.apache.org/docs/latest/multi-stage-query/concepts/#rollup) for more details on the concept of rollup. +* [SQL-based ingestion concepts](../multi-stage-query/concepts/#rollup) for more details on the concept of rollup. -* [Data rollup](https://druid.apache.org/docs/latest/ingestion/rollup/) for suggestions and best practices when performing rollup. +* [Data rollup](../ingestion/rollup/) for suggestions and best practices when performing rollup. -* [Druid schema model](https://druid.apache.org/docs/latest/ingestion/schema-model/) to go over more details on timestamp, dimensions, and metrics. +* [Druid schema model](../ingestion/schema-model.md) to go over more details on timestamp, dimensions, and metrics. From 3a02d1a127db124d6281de102f21bee267f06e21 Mon Sep 17 00:00:00 2001 From: edgar2020 Date: Fri, 26 Jul 2024 14:58:57 -0700 Subject: [PATCH 20/21] fixed Learn more section --- docs/tutorials/tutorial-rollup.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index c9b1bc4e1b6a..300c9a4d88e5 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -152,17 +152,11 @@ Therefore, no rollup takes place: | `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | -## Learn More +## Learn more See the following topics for more information: -* [SQL-based ingestion query examples](../multi-stage-query/examples.md/#insert-with-rollup) for another example of data rollup during ingestion. - -* [SQL-based ingestion concepts](../multi-stage-query/concepts/#rollup) for more details on the concept of rollup. - -* [Data rollup](../ingestion/rollup/) for suggestions and best practices when performing rollup. - - +* [SQL-based ingestion query examples](../multi-stage-query/examples.md#insert-with-rollup) for another example of data rollup during ingestion. +* [SQL-based ingestion concepts](../multi-stage-query/concepts.md#rollup) for more details on the concept of rollup. +* [Data rollup](../ingestion/rollup.md) for suggestions and best practices when performing rollup. * [Druid schema model](../ingestion/schema-model.md) to go over more details on timestamp, dimensions, and metrics. 
- - From cd95e04cdf46e7a2c03ebb50f6f0ab8baee5568d Mon Sep 17 00:00:00 2001 From: Victoria Lim Date: Fri, 26 Jul 2024 15:05:07 -0700 Subject: [PATCH 21/21] reorder and minor edits to learn more links --- docs/tutorials/tutorial-rollup.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 300c9a4d88e5..464197d551c9 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -156,7 +156,7 @@ Therefore, no rollup takes place: See the following topics for more information: -* [SQL-based ingestion query examples](../multi-stage-query/examples.md#insert-with-rollup) for another example of data rollup during ingestion. -* [SQL-based ingestion concepts](../multi-stage-query/concepts.md#rollup) for more details on the concept of rollup. -* [Data rollup](../ingestion/rollup.md) for suggestions and best practices when performing rollup. -* [Druid schema model](../ingestion/schema-model.md) to go over more details on timestamp, dimensions, and metrics. +* [Data rollup](../ingestion/rollup.md) for suggestions and best practices when performing rollup. +* [SQL-based ingestion concepts](../multi-stage-query/concepts.md#rollup) for information on rollup using SQL-based ingestion. +* [SQL-based ingestion query examples](../multi-stage-query/examples.md#insert-with-rollup) for another example of data rollup. +* [Druid schema model](../ingestion/schema-model.md) to learn about the primary timestamp, dimensions, and metrics.