Skip to content

Commit

Permalink
explicit outputType for ExpressionPostAggregator, better documentatio…
Browse files Browse the repository at this point in the history
…n for the differences between arrays and mvds (apache#15245)

* better documentation for the differences between arrays and mvds
* add outputType to ExpressionPostAggregator to make docs true
* add output coercion if outputType is defined on ExpressionPostAgg
* updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables
  • Loading branch information
clintropolis authored Nov 2, 2023
1 parent 22443ab commit d261587
Show file tree
Hide file tree
Showing 33 changed files with 1,117 additions and 420 deletions.
9 changes: 2 additions & 7 deletions docs/multi-stage-query/concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -192,18 +192,13 @@ To perform ingestion with rollup:
2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation
functions to write their internal state to the generated segments, instead of the finalized end result, and enables
further aggregation at query time.
3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set [`groupByEnableMultiValueUnnesting:
false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and
remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
`GROUP BY` operator.
3. See [ARRAY types](../querying/arrays.md#sql-based-ingestion-with-rollup) for information about ingesting `ARRAY` columns
4. See [multi-value dimensions](../querying/multi-value-dimensions.md#sql-based-ingestion-with-rollup) for information to ingest multi-value VARCHAR columns

When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes
rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata`
queries](../querying/segmentmetadataquery.md) to retrieve rollup-related information.

If you see the error "Encountered multi-value dimension `x` that cannot be processed with
groupByEnableMultiValueUnnesting set to false", then wrap that column in `MV_TO_ARRAY(x) AS x`.

The following [aggregation functions](../querying/sql-aggregations.md) are supported for rollup at ingestion time:
`COUNT` (but switch to `SUM` at query time), `SUM`, `MIN`, `MAX`, `EARLIEST` and `EARLIEST_BY` ([string only](known-issues.md#select-statement)),
`LATEST` and `LATEST_BY` ([string only](known-issues.md#select-statement)), `APPROX_COUNT_DISTINCT`, `APPROX_COUNT_DISTINCT_BUILTIN`,
Expand Down
10 changes: 4 additions & 6 deletions docs/multi-stage-query/examples.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ CLUSTERED BY channel

## INSERT with rollup

This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup).
This example inserts data into a table named `kttm_rollup` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup).

<details><summary>Show the query</summary>

Expand All @@ -91,7 +91,7 @@ SELECT * FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
'{"type":"json"}',
'[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"string"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]'
'[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]'
)
))

Expand All @@ -101,8 +101,7 @@ SELECT
agent_category,
agent_type,
browser,
browser_version,
MV_TO_ARRAY("language") AS "language", -- Multi-value string dimension
browser_version
os,
city,
country,
Expand All @@ -113,11 +112,10 @@ SELECT
APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types
FROM kttm_data
WHERE os = 'iOS'
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
PARTITIONED BY HOUR
CLUSTERED BY browser, session
```

</details>

## INSERT for reindexing an existing datasource
Expand Down
34 changes: 18 additions & 16 deletions docs/multi-stage-query/reference.md

Large diffs are not rendered by default.

253 changes: 253 additions & 0 deletions docs/querying/arrays.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
---
id: arrays
title: "Arrays"
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->


Apache Druid supports SQL standard `ARRAY` typed columns for `VARCHAR`, `BIGINT`, and `DOUBLE` types (native types `ARRAY<STRING>`, `ARRAY<LONG>`, and `ARRAY<DOUBLE>`). Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimension](multi-value-dimensions.md), which have significantly different behavior than standard arrays.

This document describes inserting, filtering, and grouping behavior for `ARRAY` typed columns.
Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details
about the functions available to use with ARRAY columns and types in SQL.

The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns:

```json lines
{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a", "b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null, "b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [], "arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a", "b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null, "arrayLong":[], "arrayDouble":null}
```

## Ingesting arrays

### Native batch and streaming ingestion
When using native [batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), arrays can be ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects) type dimension schema which is shared with [type-aware schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery).

When ingesting from TSV or CSV data, you can specify the array delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type. JSON data does not require `inputFormat` configuration.

The following shows an example `dimensionsSpec` for native ingestion of the data used in this document:

```
"dimensions": [
{
"type": "auto",
"name": "label"
},
{
"type": "auto",
"name": "arrayString"
},
{
"type": "auto",
"name": "arrayLong"
},
{
"type": "auto",
"name": "arrayDouble"
}
],
```

### SQL-based ingestion

Arrays can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include a query context parameter [`"arrayIngestMode":"array"`](../multi-stage-query/reference.md#context-parameters).

For example, to insert the data used in this document:
```sql
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
'{"type":"json"}',
'[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY<STRING>"},{"name":"arrayLong", "type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
)
)
)
SELECT
TIME_PARSE("timestamp") AS "__time",
"label",
"arrayString",
"arrayLong",
"arrayDouble"
FROM "ext"
PARTITIONED BY DAY
```

### SQL-based ingestion with rollup
These input arrays can also be grouped for rollup:

```sql
REPLACE INTO "array_example_rollup" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
'{"type":"json"}',
'[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY<STRING>"},{"name":"arrayLong", "type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
)
)
)
SELECT
TIME_PARSE("timestamp") AS "__time",
"label",
"arrayString",
"arrayLong",
"arrayDouble",
COUNT(*) as "count"
FROM "ext"
GROUP BY 1,2,3,4,5
PARTITIONED BY DAY
```


## Querying arrays

### Filtering

All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on array typed columns. Filters follow these rules for array types:

- All filters match against the entire array value for the row
- Native value filters like [equality](filters.md#equality-filter) and [range](filters.md#range-filter) match on entire array values, as do SQL constructs that plan into these native filters
- The [`IS NULL`](filters.md#null-filter) filter will match rows where the entire array value is null
- [Array specific functions](sql-array-functions.md) like `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the behavior specified by those functions
- All other filters do not directly support ARRAY types and will result in a query error

#### Example: equality
```sql
SELECT *
FROM "array_example"
WHERE arrayLong = ARRAY[1,2,3]
```

```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

#### Example: null
```sql
SELECT *
FROM "array_example"
WHERE arrayLong IS NULL
```

```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row2","arrayString":"[null,\"b\"]","arrayLong":null,"arrayDouble":"[999.0,null,5.5]"}
```

#### Example: range
```sql
SELECT *
FROM "array_example"
WHERE arrayString >= ARRAY['a','b']
```

```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

#### Example: ARRAY_CONTAINS
```sql
SELECT *
FROM "array_example"
WHERE ARRAY_CONTAINS(arrayString, 'a')
```

```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

### Grouping

When grouping on an array with SQL or a native [groupBy query](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements.

#### Example: SQL grouping query with no filtering
```sql
SELECT label, arrayString
FROM "array_example"
GROUP BY 1,2
```
results in:
```json lines
{"label":"row1","arrayString":"[\"a\",\"b\"]"}
{"label":"row2","arrayString":"[null,\"b\"]"}
{"label":"row3","arrayString":"[]"}
{"label":"row4","arrayString":"[\"a\",\"b\"]"}
{"label":"row5","arrayString":null}
```

#### Example: SQL grouping query with a filter
```sql
SELECT label, arrayString
FROM "array_example"
WHERE arrayLong = ARRAY[1,2,3]
GROUP BY 1,2
```

results:
```json lines
{"label":"row3","arrayString":"[]"}
{"label":"row4","arrayString":"[\"a\",\"b\"]"}
```

#### Example: UNNEST
```sql
SELECT label, strings
FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings)
GROUP BY 1,2
```

results:
```json lines
{"label":"row1","strings":"a"}
{"label":"row1","strings":"b"}
{"label":"row2","strings":null}
{"label":"row2","strings":"b"}
{"label":"row4","strings":"a"}
{"label":"row4","strings":"b"}
```

## Differences between arrays and multi-value dimensions
Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensions.md). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL compliant behavior.

Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.

To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.

You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:

```sql
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'mytable'
```

Arrays are type `ARRAY`, multi-value strings are type `VARCHAR`.
Loading

0 comments on commit d261587

Please sign in to comment.