Commit 80dc45e (1 parent 4a9d6a4)

explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245) (#15307)

* better documentation for the differences between arrays and mvds
* add outputType to ExpressionPostAggregator to make docs true
* add output coercion if outputType is defined on ExpressionPostAgg
* updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables

Showing 33 changed files with 1,117 additions and 420 deletions.

---
id: arrays
title: "Arrays"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
-->

Apache Druid supports SQL standard `ARRAY` typed columns for `VARCHAR`, `BIGINT`, and `DOUBLE` types (native types `ARRAY<STRING>`, `ARRAY<LONG>`, and `ARRAY<DOUBLE>`). Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior than standard arrays.

This document describes inserting, filtering, and grouping behavior for `ARRAY` typed columns.
Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details about the functions available to use with ARRAY columns and types in SQL.

The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes three array-typed columns:

```json lines
{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a", "b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null, "b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [], "arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a", "b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null, "arrayLong":[], "arrayDouble":null}
```

|
||
## Ingesting arrays | ||
|
||
### Native batch and streaming ingestion | ||
When using native [batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), arrays can be ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects) type dimension schema which is shared with [type-aware schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery). | ||
|
||
When ingesting from TSV or CSV data, you can specify the array delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type. JSON data does not require `inputFormat` configuration. | ||
|
||
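For example, a TSV `inputFormat` for the example data might look like the following sketch. This is illustrative only: the pipe character used as `listDelimiter` is an assumption, and the `columns` list simply mirrors the example data in this document.

```json
"inputFormat": {
  "type": "tsv",
  "columns": ["timestamp", "label", "arrayString", "arrayLong", "arrayDouble"],
  "listDelimiter": "|"
}
```
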
The following shows an example `dimensionsSpec` for native ingestion of the data used in this document:

```json
"dimensions": [
  {
    "type": "auto",
    "name": "label"
  },
  {
    "type": "auto",
    "name": "arrayString"
  },
  {
    "type": "auto",
    "name": "arrayLong"
  },
  {
    "type": "auto",
    "name": "arrayDouble"
  }
],
```

### SQL-based ingestion

Arrays can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context parameter [`"arrayIngestMode":"array"`](../multi-stage-query/reference.md#context-parameters).

For example, to insert the data used in this document:
```sql
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
      '{"type":"json"}',
      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY<STRING>"},{"name":"arrayLong", "type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
    )
  )
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  "arrayString",
  "arrayLong",
  "arrayDouble"
FROM "ext"
PARTITIONED BY DAY
```

### SQL-based ingestion with rollup

These input arrays can also be grouped for rollup:

```sql
REPLACE INTO "array_example_rollup" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
      '{"type":"json"}',
      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY<STRING>"},{"name":"arrayLong", "type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
    )
  )
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  "arrayString",
  "arrayLong",
  "arrayDouble",
  COUNT(*) as "count"
FROM "ext"
GROUP BY 1,2,3,4,5
PARTITIONED BY DAY
```

## Querying arrays

### Filtering

All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on array typed columns. Filters follow these rules for array types:

- All filters match against the entire array value for the row
- Native value filters like [equality](filters.md#equality-filter) and [range](filters.md#range-filter) match on entire array values, as do SQL constructs that plan into these native filters
- The [`IS NULL`](filters.md#null-filter) filter will match rows where the entire array value is null
- [Array specific functions](sql-array-functions.md) like `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the behavior specified by those functions
- All other filters do not directly support ARRAY types and will result in a query error

#### Example: equality

```sql
SELECT *
FROM "array_example"
WHERE arrayLong = ARRAY[1,2,3]
```
results in:
```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

#### Example: null

```sql
SELECT *
FROM "array_example"
WHERE arrayLong IS NULL
```
results in:
```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row2","arrayString":"[null,\"b\"]","arrayLong":null,"arrayDouble":"[999.0,null,5.5]"}
```

#### Example: range

```sql
SELECT *
FROM "array_example"
WHERE arrayString >= ARRAY['a','b']
```
results in:
```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

#### Example: ARRAY_CONTAINS

```sql
SELECT *
FROM "array_example"
WHERE ARRAY_CONTAINS(arrayString, 'a')
```
results in:
```json lines
{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
```

### Grouping

When grouping on an array with SQL or a native [groupBy query](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements.

#### Example: SQL grouping query with no filtering

```sql
SELECT label, arrayString
FROM "array_example"
GROUP BY 1,2
```
results in:
```json lines
{"label":"row1","arrayString":"[\"a\",\"b\"]"}
{"label":"row2","arrayString":"[null,\"b\"]"}
{"label":"row3","arrayString":"[]"}
{"label":"row4","arrayString":"[\"a\",\"b\"]"}
{"label":"row5","arrayString":null}
```

#### Example: SQL grouping query with a filter

```sql
SELECT label, arrayString
FROM "array_example"
WHERE arrayLong = ARRAY[1,2,3]
GROUP BY 1,2
```
results in:
```json lines
{"label":"row3","arrayString":"[]"}
{"label":"row4","arrayString":"[\"a\",\"b\"]"}
```

#### Example: UNNEST

```sql
SELECT label, strings
FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings)
GROUP BY 1,2
```
results in:
```json lines
{"label":"row1","strings":"a"}
{"label":"row1","strings":"b"}
{"label":"row2","strings":null}
{"label":"row2","strings":"b"}
{"label":"row4","strings":"a"}
{"label":"row4","strings":"b"}
```

## Differences between arrays and multi-value dimensions

Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensions.md). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL-compliant behavior.

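For illustration, here is a minimal sketch of converting in both directions within a query. The table `mytable` and its multi-value `VARCHAR` column `tags` are hypothetical names, not part of the example data above:

```sql
SELECT
  "tags",                                        -- multi-value VARCHAR ("tags" is a hypothetical column)
  MV_TO_ARRAY("tags") AS "tagsAsArray",          -- multi-value VARCHAR to ARRAY<STRING>
  ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tagsAsMv" -- ARRAY<STRING> back to multi-value VARCHAR
FROM "mytable"
```
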
Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.

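As a sketch, a `dimensionsSpec` that relies on type-aware schema discovery to discover array inputs as `ARRAY` columns can be as small as the following; with no explicit dimension list, all columns are then discovered automatically:

```json
"dimensionsSpec": {
  "useSchemaDiscovery": true
}
```
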
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.

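For example, a sketch of forcing `arrayString` into a multi-value dimension during SQL-based ingestion. Prepend the same `WITH "ext" AS (...)` definition used in the ingestion examples above; the target table `mvd_example` and output column `mvString` are illustrative names:

```sql
REPLACE INTO "mvd_example" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  -- stored as a multi-value VARCHAR rather than ARRAY<STRING>
  ARRAY_TO_MV("arrayString") AS "mvString"
FROM "ext"
PARTITIONED BY DAY
```
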
You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:

```sql
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'mytable'
```

Arrays are reported as type `ARRAY`; multi-value strings are reported as type `VARCHAR`.