From 4c5a5db9ab08dae9aef42a7623dbbd5b834c2cad Mon Sep 17 00:00:00 2001 From: Clint Wylie Date: Tue, 24 Oct 2023 20:48:13 -0700 Subject: [PATCH] better documentation for the differences between arrays and mvds --- docs/multi-stage-query/concepts.md | 14 +- docs/multi-stage-query/examples.md | 47 ++++- docs/multi-stage-query/reference.md | 35 ++-- docs/querying/arrays.md | 228 ++++++++++++++++++++++++ docs/querying/multi-value-dimensions.md | 213 +++++++++++++++++----- docs/querying/sql-data-types.md | 25 +-- docs/querying/sql.md | 9 +- website/sidebars.json | 1 + 8 files changed, 486 insertions(+), 86 deletions(-) create mode 100644 docs/querying/arrays.md diff --git a/docs/multi-stage-query/concepts.md b/docs/multi-stage-query/concepts.md index 2969ba9722a9..13e0c4280fe3 100644 --- a/docs/multi-stage-query/concepts.md +++ b/docs/multi-stage-query/concepts.md @@ -88,6 +88,9 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in mind that segments g with dimension-based pruning but those generated with `INSERT` cannot. For more information about the requirements for dimension-based pruning, see [Clustering](#clustering). +To insert [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows +ARRAY types to be stored in segments. This flag is not enabled by default. + For more information about the syntax, see [INSERT](./reference.md#insert). @@ -192,10 +195,13 @@ To perform ingestion with rollup: 2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation functions to write their internal state to the generated segments, instead of the finalized end result, and enables further aggregation at query time. -3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set [`groupByEnableMultiValueUnnesting: - false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and - remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the - `GROUP BY` operator. +3. To ingest [Druid multi-value dimensions](../querying/multi-value-dimensions.md), wrap all multi-value strings + in `MV_TO_ARRAY(...)` in the grouping clause and set [`groupByEnableMultiValueUnnesting: false`](reference.md#context-parameters) in your context. + This ensures that multi-value strings are left alone and remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the + `GROUP BY` operator. To INSERT these arrays as multi-value strings, wrap the expressions in the SELECT clause with + `ARRAY_TO_MV` to coerce the ARRAY back to a VARCHAR +4. To ingest [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows + ARRAY types to be stored in segments. This flag is not enabled by default. When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata` diff --git a/docs/multi-stage-query/examples.md b/docs/multi-stage-query/examples.md index 51a645448daf..d440f8b93f81 100644 --- a/docs/multi-stage-query/examples.md +++ b/docs/multi-stage-query/examples.md @@ -79,7 +79,7 @@ CLUSTERED BY channel ## INSERT with rollup -This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup). 
+This example inserts data into a table named `kttm_rollup` and performs data rollup. The ARRAY input is stored as a [multi-value dimension](../querying/multi-value-dimensions.md). This example implements the recommendations described in [Rollup](./concepts.md#rollup).
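+
+Following the recommendations in [Rollup](./concepts.md#rollup), a sketch of the query context to submit alongside this statement (however you run it) is:
+
+```json
+{
+  "finalizeAggregations": false,
+  "groupByEnableMultiValueUnnesting": false
+}
+```
+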
Show the query @@ -102,7 +102,50 @@ SELECT agent_type, browser, browser_version, - MV_TO_ARRAY("language") AS "language", -- Multi-value string dimension + ARRAY_TO_MV(MV_TO_ARRAY("language")) AS "language", -- Multi-value string dimension + os, + city, + country, + forwarded_for AS ip_address, + + COUNT(*) AS "cnt", + SUM(session_length) AS session_length, + APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types +FROM kttm_data +WHERE os = 'iOS' +GROUP BY 1, 2, 3, 4, 5, 6, MV_TO_ARRAY("language"), 8, 9, 10, 11 +PARTITIONED BY HOUR +CLUSTERED BY browser, session +``` +
+ +## INSERT with rollup and ARRAY types + +This example inserts data into a table named `kttm_rollup_arrays` and performs data rollup. The ARRAY inputs are stored in an [ARRAY column](../querying/arrays.md). This example also implements the recommendations described in [Rollup](./concepts.md#rollup). Be sure to set context flag `"arrayIngestMode":"array"` which allows +ARRAY types to be stored in segments. + +
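+Since this example writes ARRAY typed columns, the query context must include `arrayIngestMode` along with the rollup-related parameters described in [Rollup](./concepts.md#rollup); a minimal sketch:
+
+```json
+{
+  "arrayIngestMode": "array",
+  "finalizeAggregations": false
+}
+```
+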
Show the query + +```sql +INSERT INTO "kttm_rollup_arrays" + +WITH kttm_data AS ( +SELECT * FROM TABLE( + EXTERN( + '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', + '{"type":"json"}', + '[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"array"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]' + ) +)) + +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + session, + agent_category, + agent_type, + browser, + browser_version, + "language", -- array os, city, country, diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 9ec50de0a9b1..67509d6f17f1 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -232,23 +232,26 @@ If you're using the web console, you can specify the context parameters through The following table lists the context parameters for the MSQ task engine: -| Parameter | Description | Default value | -|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---| -| `maxNumTasks` | SELECT, INSERT, REPLACE

The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.

May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority. | 2 | -| `taskAssignment` | SELECT, INSERT, REPLACE

Determines how many tasks to use. Possible values include:
  • `max`: Uses as many tasks as possible, up to `maxNumTasks`.
  • `auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 512 MiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When calculating the size of files, the weighted size is used, which considers the file format and compression format used if any. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.
| `max` | -| `finalizeAggregations` | SELECT, INSERT, REPLACE

Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true | -| `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE

Algorithm to use for JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified. See [Joins](#joins) for more details. | `broadcast` | -| `rowsInMemory` | INSERT or REPLACE

Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./known-issues.md) around memory usage. | 100,000 | +| Parameter | Description | Default value | +|---|---|---| +| `maxNumTasks` | SELECT, INSERT, REPLACE

The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.

May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority. | 2 | +| `taskAssignment` | SELECT, INSERT, REPLACE

Determines how many tasks to use. Possible values include:
  • `max`: Uses as many tasks as possible, up to `maxNumTasks`.
  • `auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 512 MiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When calculating the size of files, the weighted size is used, which considers the file format and compression format used if any. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.
| `max` | +| `finalizeAggregations` | SELECT, INSERT, REPLACE

Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true | +| `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE

Algorithm to use for JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified. See [Joins](#joins) for more details. | `broadcast` | +| `rowsInMemory` | INSERT or REPLACE

Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./known-issues.md) around memory usage. | 100,000 | | `segmentSortOrder` | INSERT or REPLACE

Normally, Druid sorts rows in individual segments using `__time` first, followed by the [CLUSTERED BY](#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in segments using this column list first, followed by the CLUSTERED BY order.

You provide the column list as comma-separated values or as a JSON array in string form. If your query includes `__time`, then this list must begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within each time chunk, Druid assigns rows to segments based on `country`, and then within each of those segments, Druid sorts those rows by `__time` first, then `city`, then `country`. | empty list | -| `maxParseExceptions`| SELECT, INSERT, REPLACE

Maximum number of parse exceptions that are ignored while executing the query before it stops with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value to -1. | 0 | -| `rowsPerSegment` | INSERT or REPLACE

The number of rows per segment to target. The actual number of rows per segment may be somewhat higher or lower than this number. In most cases, use the default. For general information about sizing rows per segment, see [Segment Size Optimization](../operations/segment-optimization.md). | 3,000,000 | -| `indexSpec` | INSERT or REPLACE

An [`indexSpec`](../ingestion/ingestion-spec.md#indexspec) to use when generating segments. May be a JSON string or object. See [Front coding](../ingestion/ingestion-spec.md#front-coding) for details on configuring an `indexSpec` with front coding. | See [`indexSpec`](../ingestion/ingestion-spec.md#indexspec). | -| `durableShuffleStorage` | SELECT, INSERT, REPLACE

Whether to use durable storage for shuffle mesh. To use this feature, configure the durable storage at the server level using `druid.msq.intermediate.storage.enable=true`). If these properties are not configured, any query with the context variable `durableShuffleStorage=true` fails with a configuration error.

| `false` | -| `faultTolerance` | SELECT, INSERT, REPLACE

Whether to turn on fault tolerance mode or not. Failed workers are retried based on [Limits](#limits). Cannot be used when `durableShuffleStorage` is explicitly set to false. | `false` | -| `selectDestination` | SELECT

Controls where the final result of the select query is written.
Use `taskReport`(the default) to write select results to the task report. This is not scalable since task reports size explodes for large results
Use `durableStorage` to write results to durable storage location. For large results sets, its recommended to use `durableStorage` . To configure durable storage see [`this`](#durable-storage) section. | `taskReport` | -| `waitTillSegmentsLoad` | INSERT, REPLACE

If set, the ingest query waits for the generated segment to be loaded before exiting, else the ingest query exits without waiting. The task and live reports contain the information about the status of loading segments if this flag is set. This will ensure that any future queries made after the ingestion exits will include results from the ingestion. The drawback is that the controller task will stall till the segments are loaded. | `false` | -| `includeSegmentSource` | SELECT, INSERT, REPLACE

Controls the sources, which will be queried for results in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only non-realtime (published and used) segments will be downloaded from deep storage. If this value is `REALTIME`, results will also be included from realtime tasks. | `NONE` | -| `rowsPerPage` | SELECT

The number of rows per page to target. The actual number of rows per page may be somewhat higher or lower than this number. In most cases, use the default.
This property comes into effect only when `selectDestination` is set to `durableStorage` | 100000 | +| `maxParseExceptions`| SELECT, INSERT, REPLACE

Maximum number of parse exceptions that are ignored while executing the query before it stops with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value to -1. | 0 | +| `rowsPerSegment` | INSERT or REPLACE

The number of rows per segment to target. The actual number of rows per segment may be somewhat higher or lower than this number. In most cases, use the default. For general information about sizing rows per segment, see [Segment Size Optimization](../operations/segment-optimization.md). | 3,000,000 | +| `indexSpec` | INSERT or REPLACE

An [`indexSpec`](../ingestion/ingestion-spec.md#indexspec) to use when generating segments. May be a JSON string or object. See [Front coding](../ingestion/ingestion-spec.md#front-coding) for details on configuring an `indexSpec` with front coding. | See [`indexSpec`](../ingestion/ingestion-spec.md#indexspec). | +| `durableShuffleStorage` | SELECT, INSERT, REPLACE

Whether to use durable storage for shuffle mesh. To use this feature, configure the durable storage at the server level using `druid.msq.intermediate.storage.enable=true`. If this property is not configured, any query with the context variable `durableShuffleStorage=true` fails with a configuration error.

| `false` | +| `faultTolerance` | SELECT, INSERT, REPLACE

Whether to turn on fault tolerance mode. Failed workers are retried based on [Limits](#limits). Cannot be used when `durableShuffleStorage` is explicitly set to false. | `false` |

Controls where the final result of the select query is written.
Use `taskReport` (the default) to write select results to the task report. This is not scalable, since the task report size explodes for large result sets.
Use `durableStorage` to write results to a durable storage location. For large result sets, it's recommended to use `durableStorage`. To configure durable storage, see the [durable storage](#durable-storage) section. | `taskReport` |

If set, the ingest query waits for the generated segments to be loaded before exiting; otherwise, the ingest query exits without waiting. The task and live reports contain information about the status of loading segments if this flag is set. This ensures that any queries made after the ingestion exits will include results from the ingestion. The drawback is that the controller task will stall until the segments are loaded. | `false` |

Controls the sources that are queried for results, in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only non-realtime (published and used) segments are downloaded from deep storage. If this value is `REALTIME`, results are also included from realtime tasks. | `NONE` |

The number of rows per page to target. The actual number of rows per page may be somewhat higher or lower than this number. In most cases, use the default.
This property comes into effect only when `selectDestination` is set to `durableStorage`. | 100,000 |

Controls how ARRAY type values are stored in Druid segments. When set to `'array'` (recommended for SQL compliance), Druid will store all ARRAY typed values in [ARRAY typed columns](../querying/arrays.md), and supports storing both VARCHAR and numeric typed arrays. When set to `'mvd'` (the default, for backwards compatibility), Druid only supports VARCHAR typed arrays, and will store them as [multi-value string columns](../querying/multi-value-dimensions.md). When set to `none`, Druid will throw an exception when trying to store any type of arrays, used to help migrate operators from `'mvd'` mode to `'array'` mode and force query writers to make an explicit choice between ARRAY and multi-value VARCHAR typed columns. | `'mvd'` (for backwards compatibility, recommended to use `array` for SQL compliance)| + + ## Joins Joins in multi-stage queries use one of two algorithms based on what you set the [context parameter](#context-parameters) `sqlJoinAlgorithm` to: diff --git a/docs/querying/arrays.md b/docs/querying/arrays.md new file mode 100644 index 000000000000..1511f6e92916 --- /dev/null +++ b/docs/querying/arrays.md @@ -0,0 +1,228 @@ +--- +id: arrays +title: "Array columns" +--- + + + + +Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimension](multi-value-dimensions.md), which have significantly different behavior than standard arrays. + +This document describes inserting, filtering, and grouping behavior for `ARRAY` typed columns. +Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details +about the functions available to use with ARRAY columns and types in SQL. + +The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns. + +```json lines +{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a", "b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]} +{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null, "b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]} +{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [], "arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]} +{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a", "b"], "arrayLong":[1, 2, 3], "arrayDouble":[]} +{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null, "arrayLong":[], "arrayDouble":null} +``` + +## Overview + +When using [native ingestion](../ingestion/native-batch.md), arrays can be ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects) type dimension schema which is shared with [type-aware schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery). + +When ingesting from TSV or CSV data, you can specify the array delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type. JSON data does not require `inputFormat` configuration. 
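+
+For example, a sketch of a CSV `inputFormat` that splits a column on a pipe delimiter into an array (the delimiter choice here is illustrative, not part of the example data below):
+
+```
+"inputFormat": {
+  "type": "csv",
+  "findColumnsFromHeader": true,
+  "listDelimiter": "|"
+}
+```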
+ +The following shows an example `dimensionsSpec` for native ingestion of the data used in this document: + +``` +"dimensions": [ + { + "type": "auto", + "name": "label" + }, + { + "type": "auto", + "name": "arrayString" + }, + { + "type": "auto", + "name": "arrayLong" + }, + { + "type": "auto", + "name": "arrayDouble" + } +], +``` + +Arrays can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md), but must include a query context parameter `"arrayIngestMode":"array"`. + +For example, to insert the data used in this document: +```sql +REPLACE INTO "array_example" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}', + '{"type":"json"}', + '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY"},{"name":"arrayLong", "type":"ARRAY"},{"name":"arrayDouble", "type":"ARRAY"}]' + ) + ) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "label", + "arrayString", + "arrayLong", + "arrayDouble" +FROM "ext" +PARTITIONED BY DAY +``` + +These input arrays can also be grouped for rollup: + +```sql +REPLACE INTO "array_example_rollup" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}', + '{"type":"json"}', + '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY"},{"name":"arrayLong", "type":"ARRAY"},{"name":"arrayDouble", "type":"ARRAY"}]' + ) + ) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "label", + "arrayString", + "arrayLong", + "arrayDouble" +FROM "ext" +GROUP BY 1,2,3,4,5 +PARTITIONED BY DAY +``` + + +## Querying ARRAYS + +### Filtering + +All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on array typed columns. 
Filters follow these rules for array types: + +- Value filters, like "equality", "range" match on entire array values +- The "null" filter will match rows where the entire array value is null +- Array specific functions like ARRAY_CONTAINS and ARRAY_OVERLAP follow the behavior specified by those functions +- All other filters do not directly support ARRAY types + +#### Example: equality +```sql +SELECT * +FROM "array_example" +WHERE arrayLong = ARRAY[1,2,3] +``` + +```json lines +{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"} +{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"} +``` + +#### Example: null +```sql +SELECT * +FROM "array_example" +WHERE arrayLong is null +``` + +```json lines +{"__time":"2023-01-01T00:00:00.000Z","label":"row2","arrayString":"[null,\"b\"]","arrayLong":null,"arrayDouble":"[999.0,null,5.5]"} +``` + +#### Example: range +```sql +SELECT * +FROM "array_example" +WHERE arrayString >= ARRAY['a','b'] +``` + +```json lines +{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"} +{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"} +``` + +#### Example: ARRAY_CONTAINS +```sql +SELECT * +FROM "array_example" +WHERE ARRAY_CONTAINS(arrayString, 'a') +``` + +```json lines +{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"} +{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"} +``` + +### Grouping + +When grouping on an array with SQL or a native [groupBy queries](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements. 
+ +#### Example: SQL grouping query with no filtering +```sql +SELECT label, arrayString +FROM "array_example" +GROUP BY 1,2 +``` +results in: +```json lines +{"label":"row1","arrayString":"[\"a\",\"b\"]"} +{"label":"row2","arrayString":"[null,\"b\"]"} +{"label":"row3","arrayString":"[]"} +{"label":"row4","arrayString":"[\"a\",\"b\"]"} +{"label":"row5","arrayString":null} +``` + +#### Example: SQL grouping query with a filter +```sql +SELECT label, arrayString +FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings) +WHERE arrayLong = ARRAY[1,2,3] +GROUP BY 1,2 +``` + +results: +```json lines +{"label":"row3","arrayString":"[]"} +{"label":"row4","arrayString":"[\"a\",\"b\"]"} +``` + +#### Example: UNNEST +```sql +SELECT label, strings +FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings) +GROUP BY 1,2 +``` + +results: +```json lines +{"label":"row1","strings":"a"} +{"label":"row1","strings":"b"} +{"label":"row2","strings":null} +{"label":"row2","strings":"b"} +{"label":"row4","strings":"a"} +{"label":"row4","strings":"b"} +``` \ No newline at end of file diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md index f1081d3f4323..5dcf54579a46 100644 --- a/docs/querying/multi-value-dimensions.md +++ b/docs/querying/multi-value-dimensions.md @@ -30,21 +30,36 @@ array of values instead of a single value, such as the `tags` values in the foll {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} ``` -This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see +It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md), which behave like standard SQL arrays. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior). + +This document describes inserting, filtering, and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see [segments documentation](../design/segments.md#multi-value-columns). Examples in this document -are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql-multivalue-string-functions.md) for details -about using multi-value string dimensions in SQL. +are in the form of both [SQL](sql.md) and [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql-multivalue-string-functions.md) for details +about the functions available for using multi-value string dimensions in SQL. + +The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`. + +```json lines +{"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]} +{"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]} +{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]} +{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []} +``` ## Overview -At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as multi-value dimensions. 
+When using [native ingestion](../ingestion/native-batch.md), the Druid web console data loader can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. -For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `parseSpec` configuration. +For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `inputFormat` configuration. -The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`: +The following shows an example `dimensionsSpec` for native ingestion of the data used in this document: ``` "dimensions": [ + { + "type": "string", + "name": "label" + }, { "type": "string", "name": "tags", @@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling. +Multi-value dimensions can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data. + +For example, to insert the data used in this document: +```sql +REPLACE INTO "mvd_example" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', + '{"type":"json"}', + '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY"}]' + ) + ) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "label", + ARRAY_TO_MV("tags") AS "tags" +FROM "ext" +PARTITIONED BY DAY +``` -## Querying multi-value dimensions - -The following sections describe filtering and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`. 
- +These input arrays can also be grouped prior to converting into a multi-value dimension: +```sql +REPLACE INTO "mvd_example_rollup" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', + '{"type":"json"}', + '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY"}]' + ) + ) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "label", + ARRAY_TO_MV("tags") AS "tags", + COUNT(*) AS "count" +FROM "ext" +GROUP BY 1, 2, "tags" +PARTITIONED BY DAY ``` -{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1 -{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2 -{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3 -{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4 + +Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping. + + +The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY` must be used since the multi-stage engine only supports grouping on multi-value dimensions as arrays, and so they must be coerced first. These arrays then must be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`. + +```sql +REPLACE INTO "mvd_example_rollup" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', + '{"type":"json"}' + ) + ) EXTEND ("timestamp" VARCHAR, "label" VARCHAR, "tags" VARCHAR) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "label", + ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tags", + COUNT(*) AS "count" +FROM "ext" +GROUP BY 1, 2, MV_TO_ARRAY("tags") +PARTITIONED BY DAY ``` -:::info - Be sure to remove the comments before trying out the sample data. -::: + +## Querying multi-value dimensions ### Filtering @@ -88,28 +162,22 @@ dimensions. Filters follow these rules on multi-value dimensions: - Logical expression filters behave the same way they do on single-value dimensions: "and" matches a row if all underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row if the underlying filter does not match the row. - + The following example illustrates these rules. 
This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3: +```sql +SELECT * +FROM "mvd_example_rollup" +WHERE tags = 't1' OR tags = 't3' ``` -{ - "type": "or", - "fields": [ - { - "type": "selector", - "dimension": "tags", - "value": "t1" - }, - { - "type": "selector", - "dimension": "tags", - "value": "t3" - } - ] -} + +returns +```json lines +{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1} +{"__time":"2011-01-13T00:00:00.000Z","label":"row2","tags":"[\"t3\",\"t4\",\"t5\"]","count":1} ``` -This "and" filter would match only row1 of the dataset above: +Native queries can also perform filtering that would be considered a "contradiction" in SQL, such as this "and" filter which would match only row1 of the dataset above: ``` { @@ -129,26 +197,73 @@ This "and" filter would match only row1 of the dataset above: } ``` -This "selector" filter would match row4 of the dataset above: +which returns +```json lines +{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1} +``` +Multi-value dimensions also consider an empty row as `null`, consider: +```sql +SELECT * +FROM "mvd_example_rollup" +WHERE tags is null ``` -{ - "type": "selector", - "dimension": "tags", - "value": null -} + +which results in: +```json lines +{"__time":"2011-01-14T00:00:00.000Z","label":"row4","tags":null,"count":1} ``` ### Grouping -topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values +When grouping on a multi-value dimension with SQL or a native [topN](topnquery.md) or [groupBy](groupbyquery.md) queries, _all_ values from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST` operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: -`t1`, `t2`, and `t3`. If you only need to include values that match your filter, you can use a -[filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. +`t1`, `t2`, and `t3`. + +If you only need to include values that match your filter, you can use the SQL functions [`MV_FILTER_ONLY`/`MV_FILTER_NONE`](sql-multivalue-string-functions.md), +[filtered virtual column](virtual-columns.md#list-filtered-virtual-column), or [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. 
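+
+For example, a sketch using `MV_FILTER_ONLY` against the example `mvd_example_rollup` table to group on only the filtered value:
+
+```sql
+SELECT label, MV_FILTER_ONLY(tags, ARRAY['t3']) AS tags
+FROM "mvd_example_rollup"
+WHERE tags = 't3'
+GROUP BY 1,2
+```
+
+which should return only the `t3` groups:
+
+```json lines
+{"label":"row1","tags":"t3"}
+{"label":"row2","tags":"t3"}
+```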
+ +#### Example: SQL grouping query with no filtering +```sql +SELECT label, tags +FROM "mvd_example_rollup" +GROUP BY 1,2 +``` +results in: +```json lines +{"label":"row1","tags":"t1"} +{"label":"row1","tags":"t2"} +{"label":"row1","tags":"t3"} +{"label":"row2","tags":"t3"} +{"label":"row2","tags":"t4"} +{"label":"row2","tags":"t5"} +{"label":"row3","tags":"t5"} +{"label":"row3","tags":"t6"} +{"label":"row3","tags":"t7"} +{"label":"row4","tags":null} +``` + +#### Example: SQL grouping query with a filter +```sql +SELECT label, tags +FROM "mvd_example_rollup" +WHERE tags = 't3' +GROUP BY 1,2 +``` + +results: +```json lines +{"label":"row1","tags":"t1"} +{"label":"row1","tags":"t2"} +{"label":"row1","tags":"t3"} +{"label":"row2","tags":"t3"} +{"label":"row2","tags":"t4"} +{"label":"row2","tags":"t5"} +``` -## Example: GroupBy query with no filtering +#### Example: native GroupBy query with no filtering See [GroupBy querying](groupbyquery.md) for details. @@ -236,7 +351,7 @@ This query returns the following result: Notice that original rows are "exploded" into multiple rows and merged. -## Example: GroupBy query with a selector query filter +#### Example: native GroupBy query with a selector query filter See [query filters](filters.md) for details of selector query filter. @@ -314,11 +429,11 @@ This query returns the following result: ``` You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is -applied on the row before explosion. For multi-value dimensions, a selector filter for "t3" would match row1 and row2, +applied on the row before explosion. For multi-value dimensions, a filter for value "t3" would match row1 and row2, after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside the multiple values matches the query filter. -## Example: GroupBy query with selector query and dimension filters +#### Example: native GroupBy query with selector query and dimension filters To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below. @@ -379,7 +494,7 @@ Having specs are applied at the outermost level of groupBy query processing. ## Disable GroupBy on multi-value columns -You can disable the implicit unnesting behavior for groupBy by setting groupByEnableMultiValueUnnesting: false in your -query context. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety +You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your +[query context](query-context.md). In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included. \ No newline at end of file diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 2b30407bc3b9..10137b3c6ee8 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -75,6 +75,17 @@ Casts between two SQL types that have different Druid runtime types generate a r If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid a substitutes [NULL](#null-values). 
When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid instead substitutes a default value, including when NULL values cast to non-nullable types. For example, if `druid.generic.useDefaultValueForNull = true`, a null VARCHAR cast to BIGINT is converted to a zero. +## Arrays + +Druid supports [ARRAY types](arrays.md), which behave as standard SQL arrays, where results are grouped by matching entire arrays. The [`UNNEST` operator](./sql-array-functions.md#unn) can be used to perform operations on individual array elements, translating each element into a separate row. + +ARRAY typed columns can be stored in segments with class JSON based ingestion using the 'auto' typed dimension schema shared with [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. For [SQL based ingestion](../multi-stage-query/index.md), the query context parameter `arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In Druid 28, the default mode for this parameter is `'mvd'` for backwards compatibility, which instead can only handle `ARRAY` which it stores in [multi-value string columns](#multi-value-strings). + +You can convert multi-value dimensions to standard SQL arrays explicitly with `MV_TO_ARRAY` or implicitly using [array functions](./sql-array-functions.md). You can also use the array functions to construct arrays from multiple columns. + +Druid serializes `ARRAY` results as a JSON string of the array by default, which can be controlled by the context parameter +`sqlStringifyArrays`. When set to `false`, the arrays will instead be returned as regular JSON arrays instead of in stringified form. + ## Multi-value strings Druid's native type system allows strings to have multiple values. These [multi-value string dimensions](multi-value-dimensions.md) are reported in SQL as type VARCHAR and can be @@ -86,20 +97,12 @@ You can treat multi-value string dimensions as arrays using special Grouping by multi-value dimensions observes the native Druid multi-value aggregation behavior, which is similar to an implicit SQL UNNEST. See [Grouping](multi-value-dimensions.md#grouping) for more information. :::info - Because the SQL planner treats multi-value dimensions as VARCHAR, there are some inconsistencies between how they are handled in Druid SQL and in native queries. For instance, expressions involving multi-value dimensions may be incorrectly optimized by the Druid SQL planner. For example, `multi_val_dim = 'a' AND multi_val_dim = 'b'` is optimized to +Because the SQL planner treats multi-value dimensions as VARCHAR, there are some inconsistencies between how they are handled in Druid SQL and in native queries. For instance, expressions involving multi-value dimensions may be incorrectly optimized by the Druid SQL planner. For example, `multi_val_dim = 'a' AND multi_val_dim = 'b'` is optimized to `false`, even though it is possible for a single row to have both `'a'` and `'b'` as values for `multi_val_dim`. - The SQL behavior of multi-value dimensions may change in a future release to more closely align with their behavior in native queries, but the [multi-value string functions](./sql-multivalue-string-functions.md) should be able to provide nearly all possible native functionality. 
+The SQL behavior of multi-value dimensions may change in a future release to more closely align with their behavior in native queries, but the [multi-value string functions](./sql-multivalue-string-functions.md) should be able to provide nearly all possible native functionality. ::: -## Arrays - -Druid supports ARRAY types constructed at query time. ARRAY types behave as standard SQL arrays, where results are grouped by matching entire arrays. This is in contrast to the implicit UNNEST that occurs when grouping on multi-value dimensions directly or when used with multi-value functions. - -You can convert multi-value dimensions to standard SQL arrays explicitly with `MV_TO_ARRAY` or implicitly using [array functions](./sql-array-functions.md). You can also use the array functions to construct arrays from multiple columns. - -You can use [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. - ## Multi-value strings behavior The behavior of Druid [multi-value string dimensions](multi-value-dimensions.md) varies depending on the context of @@ -170,7 +173,7 @@ You can interact with nested data using [JSON functions](./sql-json-functions.md COMPLEX types have limited functionality outside the specialized functions that use them, so their behavior is undefined when: * Grouping on complex values. -* Filtering directly on complex values, such as `WHERE json is NULL`. +* Filtering directly on complex values. * Used as inputs to aggregators without specialized handling for a specific complex type. In many cases, functions are provided to translate COMPLEX value types to STRING, which serves as a workaround solution until COMPLEX type functionality can be improved. diff --git a/docs/querying/sql.md b/docs/querying/sql.md index 952c7c419a27..d64485dda43d 100644 --- a/docs/querying/sql.md +++ b/docs/querying/sql.md @@ -87,7 +87,7 @@ documentation. ## UNNEST -The UNNEST clause unnests array values. It's the SQL equivalent to the [unnest datasource](./datasource.md#unnest). The source for UNNEST can be an array or an input that's been transformed into an array, such as with helper functions like MV_TO_ARRAY or ARRAY. +The UNNEST clause unnests ARRAY typed values. The source for UNNEST can be an array type column, or an input that's been transformed into an array, such as with helper functions like [`MV_TO_ARRAY`](./sql-multivalue-string-functions.md) or [`ARRAY`](./sql-array-functions.md). The following is the general syntax for UNNEST, specifically a query that returns the column that gets unnested: @@ -98,7 +98,7 @@ SELECT column_alias_name FROM datasource CROSS JOIN UNNEST(source_expression1) A * The `datasource` for UNNEST can be any Druid datasource, such as the following: * A table, such as `FROM a_table`. * A subset of a table based on a query, a filter, or a JOIN. For example, `FROM (SELECT columnA,columnB,columnC from a_table)`. -* The `source_expression` for the UNNEST function must be an array and can come from any expression. If the dimension you are unnesting is a multi-value dimension, you have to specify `MV_TO_ARRAY(dimension)` to convert it to an implicit ARRAY type. You can also specify any expression that has an SQL array datatype. For example, you can call UNNEST on the following: +* The `source_expression` for the UNNEST function must be an array and can come from any expression. UNNEST works directly on Druid ARRAY typed columns. 
If the column you are unnesting is a multi-value VARCHAR, you must specify `MV_TO_ARRAY(dimension)` to convert it to an ARRAY type. You can also specify any expression that has an SQL array datatype. For example, you can call UNNEST on the following: * `ARRAY[dim1,dim2]` if you want to make an array out of two dimensions. * `ARRAY_CONCAT(dim1,dim2)` if you want to concatenate two multi-value dimensions. * The `AS table_alias_name(column_alias_name)` clause is not required but is highly recommended. Use it to specify the output, which can be an existing column or a new one. Replace `table_alias_name` and `column_alias_name` with a table and column name you want to alias the unnested results to. If you don't provide this, Druid uses a nondescriptive name, such as `EXPR$0`. @@ -115,8 +115,9 @@ For examples, see the [Unnest arrays tutorial](../tutorials/tutorial-unnest-arra The UNNEST function has the following limitations: - The function does not remove any duplicates or nulls in an array. Nulls will be treated as any other value in an array. If there are multiple nulls within the array, a record corresponding to each of the nulls gets created. -- Arrays inside complex JSON types are not supported. -- You cannot perform an UNNEST at ingestion time, including SQL-based ingestion using the MSQ task engine. +- Arrays of complex objects inside complex JSON types are not supported. + +UNNEST is the SQL equivalent of the [unnest datasource](./datasource.md#unnest). ## WHERE diff --git a/website/sidebars.json b/website/sidebars.json index 1062b3dfee97..2042f9ec47da 100644 --- a/website/sidebars.json +++ b/website/sidebars.json @@ -125,6 +125,7 @@ "querying/joins", "querying/lookups", "querying/multi-value-dimensions", + "querying/arrays", "querying/nested-columns", "querying/multitenancy", "querying/caching",