better documentation for the differences between arrays and mvds
clintropolis committed Oct 25, 2023
1 parent 65b69cd commit 4c5a5db
Showing 8 changed files with 486 additions and 86 deletions.
14 changes: 10 additions & 4 deletions docs/multi-stage-query/concepts.md
@@ -88,6 +88,9 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in mind that segments g
with dimension-based pruning but those generated with `INSERT` cannot. For more information about the requirements
for dimension-based pruning, see [Clustering](#clustering).

To insert [ARRAY types](../querying/arrays.md), set the context flag `"arrayIngestMode":"array"`, which allows
ARRAY types to be stored in segments. This flag is not enabled by default.
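
As a minimal sketch, the flag belongs in the query context sent with the statement. The fragment below assumes the query is submitted as a JSON request body that carries a `context` object alongside the `query` string; only the flag name and value come from the text above.

```json
{
  "context": {
    "arrayIngestMode": "array"
  }
}
```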

For more information about the syntax, see [INSERT](./reference.md#insert).

<a name="replace"></a>
@@ -192,10 +195,13 @@ To perform ingestion with rollup:
2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation
functions to write their internal state to the generated segments, instead of the finalized end result, and enables
further aggregation at query time.
3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set [`groupByEnableMultiValueUnnesting:
false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and
remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
`GROUP BY` operator.
3. To ingest [Druid multi-value dimensions](../querying/multi-value-dimensions.md), wrap all multi-value strings
in `MV_TO_ARRAY(...)` in the grouping clause and set [`groupByEnableMultiValueUnnesting: false`](reference.md#context-parameters) in your context.
This ensures that multi-value strings are left alone and remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
`GROUP BY` operator. To INSERT these arrays as multi-value strings, wrap the expressions in the SELECT clause with
`ARRAY_TO_MV` to coerce the ARRAY back to a VARCHAR.
4. To ingest [ARRAY types](../querying/arrays.md), set the context flag `"arrayIngestMode":"array"`, which allows
ARRAY types to be stored in segments. This flag is not enabled by default.
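
The context settings named in the steps above can be collected into a single object. This is a sketch only: it assumes the context is passed as a JSON `context` object in the request that submits the query, and it combines every flag the steps mention. Include `arrayIngestMode` only when storing ARRAY columns as in the last step; it is not needed when the arrays are coerced back to multi-value strings with `ARRAY_TO_MV`.

```json
{
  "context": {
    "finalizeAggregations": false,
    "groupByEnableMultiValueUnnesting": false,
    "arrayIngestMode": "array"
  }
}
```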

When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes
rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata`
47 changes: 45 additions & 2 deletions docs/multi-stage-query/examples.md
@@ -79,7 +79,7 @@ CLUSTERED BY channel

## INSERT with rollup

This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup).
This example inserts data into a table named `kttm_rollup` and performs data rollup. The ARRAY inputs are stored in a [multi-value dimension](../querying/multi-value-dimensions.md). This example implements the recommendations described in [Rollup](./concepts.md#rollup).

<details><summary>Show the query</summary>

@@ -102,7 +102,50 @@ SELECT
agent_type,
browser,
browser_version,
MV_TO_ARRAY("language") AS "language", -- Multi-value string dimension
ARRAY_TO_MV(MV_TO_ARRAY("language")) AS "language", -- Multi-value string dimension
os,
city,
country,
forwarded_for AS ip_address,

COUNT(*) AS "cnt",
SUM(session_length) AS session_length,
APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types
FROM kttm_data
WHERE os = 'iOS'
GROUP BY 1, 2, 3, 4, 5, 6, MV_TO_ARRAY("language"), 8, 9, 10, 11
PARTITIONED BY HOUR
CLUSTERED BY browser, session
```
</details>

## INSERT with rollup and ARRAY types

This example inserts data into a table named `kttm_rollup_arrays` and performs data rollup. The ARRAY inputs are stored in an [ARRAY column](../querying/arrays.md). This example also implements the recommendations described in [Rollup](./concepts.md#rollup). Set the context flag `"arrayIngestMode":"array"`, which allows
ARRAY types to be stored in segments.

<details><summary>Show the query</summary>

```sql
INSERT INTO "kttm_rollup_arrays"

WITH kttm_data AS (
SELECT * FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
'{"type":"json"}',
'[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"array<string>"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]'
)
))

SELECT
FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
session,
agent_category,
agent_type,
browser,
browser_version,
"language", -- array
os,
city,
country,