diff --git a/docs/multi-stage-query/concepts.md b/docs/multi-stage-query/concepts.md index 13e0c4280fe3..8a421028a968 100644 --- a/docs/multi-stage-query/concepts.md +++ b/docs/multi-stage-query/concepts.md @@ -195,21 +195,13 @@ To perform ingestion with rollup: 2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation functions to write their internal state to the generated segments, instead of the finalized end result, and enables further aggregation at query time. -3. To ingest [Druid multi-value dimensions](../querying/multi-value-dimensions.md), wrap all multi-value strings - in `MV_TO_ARRAY(...)` in the grouping clause and set [`groupByEnableMultiValueUnnesting: false`](reference.md#context-parameters) in your context. - This ensures that multi-value strings are left alone and remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the - `GROUP BY` operator. To INSERT these arrays as multi-value strings, wrap the expressions in the SELECT clause with - `ARRAY_TO_MV` to coerce the ARRAY back to a VARCHAR -4. To ingest [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows - ARRAY types to be stored in segments. This flag is not enabled by default. +3. See [ARRAY types](../querying/arrays.md#sql-based-ingestion-with-rollup) for information about ingesting `ARRAY` columns. +4. See [multi-value dimensions](../querying/multi-value-dimensions.md#sql-based-ingestion-with-rollup) for information about ingesting multi-value VARCHAR columns. When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata` queries](../querying/segmentmetadataquery.md) to retrieve rollup-related information. -If you see the error "Encountered multi-value dimension `x` that cannot be processed with -groupByEnableMultiValueUnnesting set to false", then wrap that column in `MV_TO_ARRAY(x) AS x`. - The following [aggregation functions](../querying/sql-aggregations.md) are supported for rollup at ingestion time: `COUNT` (but switch to `SUM` at query time), `SUM`, `MIN`, `MAX`, `EARLIEST` and `EARLIEST_BY` ([string only](known-issues.md#select-statement)), `LATEST` and `LATEST_BY` ([string only](known-issues.md#select-statement)), `APPROX_COUNT_DISTINCT`, `APPROX_COUNT_DISTINCT_BUILTIN`, diff --git a/docs/multi-stage-query/examples.md b/docs/multi-stage-query/examples.md index d440f8b93f81..14914cab1158 100644 --- a/docs/multi-stage-query/examples.md +++ b/docs/multi-stage-query/examples.md @@ -79,7 +79,7 @@ CLUSTERED BY channel ## INSERT with rollup -This example inserts data into a table named `kttm_rollup` and performs data rollup. The ARRAY inputs are stored in a [multi-value dimension](../querying/multi-value-dimensions.md). This example implements the recommendations described in [Rollup](./concepts.md#rollup). +This example inserts data into a table named `kttm_rollup` and performs data rollup. It implements the recommendations described in [Rollup](./concepts.md#rollup).
Show the query @@ -91,7 +91,7 @@ SELECT * FROM TABLE( EXTERN( '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', '{"type":"json"}', - '[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"string"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]' + '[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]' ) )) @@ -101,8 +101,7 @@ SELECT agent_category, agent_type, browser, - browser_version, - ARRAY_TO_MV(MV_TO_ARRAY("language")) AS "language", -- Multi-value string dimension + browser_version, os, city, country, @@ -113,56 +112,12 @@ SELECT APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types FROM kttm_data WHERE os = 'iOS' -GROUP BY 1, 2, 3, 4, 5, 6, MV_TO_ARRAY("language"), 8, 9, 10, 11 +GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 PARTITIONED BY HOUR CLUSTERED BY browser, session ```
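Once the rolled-up table exists, queries should aggregate over the stored columns rather than the raw inputs. The following is a minimal sketch of such a query, assuming the `kttm_rollup` table created above: per [Rollup](./concepts.md#rollup), use `SUM` over the stored `cnt` column in place of `COUNT(*)`, and pass the stored sketch column to `APPROX_COUNT_DISTINCT_DS_HLL`, which further aggregates the intermediate sketch state that `finalizeAggregations: false` preserved at ingestion time.

```sql
SELECT
  browser,
  SUM("cnt") AS event_count,              -- SUM at query time replaces COUNT(*) from ingestion
  SUM(session_length) AS session_length,  -- additive metrics continue to roll up with SUM
  APPROX_COUNT_DISTINCT_DS_HLL(unique_event_types) AS unique_event_types -- merges stored HLL sketches
FROM kttm_rollup
GROUP BY 1
```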
-## INSERT with rollup and ARRAY types - -This example inserts data into a table named `kttm_rollup_arrays` and performs data rollup. The ARRAY inputs are stored in an [ARRAY column](../querying/arrays.md). This example also implements the recommendations described in [Rollup](./concepts.md#rollup). Be sure to set context flag `"arrayIngestMode":"array"` which allows -ARRAY types to be stored in segments. - -
Show the query - -```sql -INSERT INTO "kttm_rollup_arrays" - -WITH kttm_data AS ( -SELECT * FROM TABLE( - EXTERN( - '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', - '{"type":"json"}', - '[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"array"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]' - ) -)) - -SELECT - FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, - session, - agent_category, - agent_type, - browser, - browser_version, - "language", -- array - os, - city, - country, - forwarded_for AS ip_address, - - COUNT(*) AS "cnt", - SUM(session_length) AS session_length, - APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types -FROM kttm_data -WHERE os = 'iOS' -GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 -PARTITIONED BY HOUR -CLUSTERED BY browser, session -``` - -
- ## INSERT for reindexing an existing datasource This example aggregates data from a table named `w000` and inserts the result into `w002`. diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 56bc09e5af55..56e66017423f 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -237,6 +237,7 @@ The following table lists the context parameters for the MSQ task engine: | `maxNumTasks` | SELECT, INSERT, REPLACE

The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.

May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority. | 2 | | `taskAssignment` | SELECT, INSERT, REPLACE

Determines how many tasks to use. Possible values include: | `max` | | `finalizeAggregations` | SELECT, INSERT, REPLACE

Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true | +| `arrayIngestMode` | INSERT, REPLACE

Controls how ARRAY type values are stored in Druid segments. When set to `array` (recommended for SQL compliance), Druid stores all ARRAY typed values in [ARRAY typed columns](../querying/arrays.md) and supports storing both VARCHAR and numeric typed arrays. When set to `mvd` (the default, for backwards compatibility), Druid supports only VARCHAR typed arrays and stores them as [multi-value string columns](../querying/multi-value-dimensions.md). When set to `none`, Druid throws an exception when a query attempts to store any type of array; this mode helps operators migrate from `mvd` mode to `array` mode by forcing query writers to make an explicit choice between ARRAY and multi-value VARCHAR typed columns. | `mvd` (for backwards compatibility; `array` is recommended for SQL compliance) | | `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE

Algorithm to use for JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified. See [Joins](#joins) for more details. | `broadcast` | | `rowsInMemory` | INSERT or REPLACE

Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./known-issues.md) around memory usage. | 100,000 | | `segmentSortOrder` | INSERT or REPLACE

Normally, Druid sorts rows in individual segments using `__time` first, followed by the [CLUSTERED BY](#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in segments using this column list first, followed by the CLUSTERED BY order.

You provide the column list as comma-separated values or as a JSON array in string form. If your query includes `__time`, then this list must begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within each time chunk, Druid assigns rows to segments based on `country`, and then within each of those segments, Druid sorts those rows by `__time` first, then `city`, then `country`. | empty list | @@ -249,7 +250,6 @@ The following table lists the context parameters for the MSQ task engine: | `waitUntilSegmentsLoad` | INSERT, REPLACE

If set, the ingest query waits for the generated segment to be loaded before exiting, else the ingest query exits without waiting. The task and live reports contain the information about the status of loading segments if this flag is set. This will ensure that any future queries made after the ingestion exits will include results from the ingestion. The drawback is that the controller task will stall till the segments are loaded. | `false` | | `includeSegmentSource` | SELECT, INSERT, REPLACE

Controls the sources, which will be queried for results in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only non-realtime (published and used) segments will be downloaded from deep storage. If this value is `REALTIME`, results will also be included from realtime tasks. | `NONE` | | `rowsPerPage` | SELECT

The number of rows per page to target. The actual number of rows per page may be somewhat higher or lower than this number. In most cases, use the default.
This property comes into effect only when `selectDestination` is set to `durableStorage` | 100000 | -| `arrayIngestMode` | INSERT, REPLACE

Controls how ARRAY type values are stored in Druid segments. When set to `'array'` (recommended for SQL compliance), Druid will store all ARRAY typed values in [ARRAY typed columns](../querying/arrays.md), and supports storing both VARCHAR and numeric typed arrays. When set to `'mvd'` (the default, for backwards compatibility), Druid only supports VARCHAR typed arrays, and will store them as [multi-value string columns](../querying/multi-value-dimensions.md). When set to `none`, Druid will throw an exception when trying to store any type of arrays, used to help migrate operators from `'mvd'` mode to `'array'` mode and force query writers to make an explicit choice between ARRAY and multi-value VARCHAR typed columns. | `'mvd'` (for backwards compatibility, recommended to use `array` for SQL compliance)| ## Joins diff --git a/docs/querying/arrays.md b/docs/querying/arrays.md index 1511f6e92916..ddd9369347da 100644 --- a/docs/querying/arrays.md +++ b/docs/querying/arrays.md @@ -1,6 +1,6 @@ --- id: arrays -title: "Array columns" +title: "Arrays" --- -Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimension](multi-value-dimensions.md), which have significantly different behavior than standard arrays. +Apache Druid supports SQL standard `ARRAY` typed columns for `VARCHAR`, `BIGINT`, and `DOUBLE` types (native types `ARRAY<STRING>`, `ARRAY<LONG>`, and `ARRAY<DOUBLE>`). Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior than standard arrays. This document describes inserting, filtering, and grouping behavior for `ARRAY` typed columns. Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details about the functions available to use with ARRAY columns and types in SQL. -The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns. +The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns: ```json lines {"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a", "b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]} @@ -39,9 +39,10 @@ The following sections describe inserting, filtering, and grouping behavior base {"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null, "arrayLong":[], "arrayDouble":null} ``` -## Overview +## Ingesting arrays -When using [native ingestion](../ingestion/native-batch.md), arrays can be ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects) type dimension schema which is shared with [type-aware schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery). +### Native batch and streaming ingestion +When using native [batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), arrays can be ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects) type dimension schema, which is shared with [type-aware schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery). 
When ingesting from TSV or CSV data, you can specify the array delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type. JSON data does not require `inputFormat` configuration. @@ -68,7 +69,9 @@ The following shows an example `dimensionsSpec` for native ingestion of the data ], -Arrays can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md), but must include a query context parameter `"arrayIngestMode":"array"`. +### SQL-based ingestion + +Arrays can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context parameter [`"arrayIngestMode":"array"`](../multi-stage-query/reference.md#context-parameters). For example, to insert the data used in this document: ```sql @@ -93,6 +96,7 @@ FROM "ext" PARTITIONED BY DAY ``` +### SQL-based ingestion with rollup These input arrays can also be grouped for rollup: ```sql @@ -112,23 +116,25 @@ SELECT "label", "arrayString", "arrayLong", - "arrayDouble" + "arrayDouble", + COUNT(*) AS "count" FROM "ext" GROUP BY 1,2,3,4,5 PARTITIONED BY DAY ``` -## Querying ARRAYS +## Querying arrays ### Filtering All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on array typed columns. Filters follow these rules for array types: -- Value filters, like "equality", "range" match on entire array values -- The "null" filter will match rows where the entire array value is null -- Array specific functions like ARRAY_CONTAINS and ARRAY_OVERLAP follow the behavior specified by those functions -- All other filters do not directly support ARRAY types +- All filters match against the entire array value for the row +- Native value filters like [equality](filters.md#equality-filter) and [range](filters.md#range-filter) match on entire array values, as do SQL constructs that plan into these native filters +- The [`IS NULL`](filters.md#null-filter) filter will match rows where the entire array value is null +- [Array-specific functions](sql-array-functions.md) like `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the behavior specified by those functions +- All other filters do not directly support ARRAY types and will result in a query error #### Example: equality ```sql @@ -146,7 +152,7 @@ WHERE arrayLong = ARRAY[1,2,3] ```sql SELECT * FROM "array_example" -WHERE arrayLong is null +WHERE arrayLong IS NULL ``` ```json lines @@ -179,7 +185,7 @@ WHERE ARRAY_CONTAINS(arrayString, 'a') ### Grouping -When grouping on an array with SQL or a native [groupBy queries](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements. +When grouping on an array with SQL or a native [groupBy query](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements. 
#### Example: SQL grouping query with no filtering ```sql @@ -199,7 +205,7 @@ results in: #### Example: SQL grouping query with a filter ```sql SELECT label, arrayString -FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings) +FROM "array_example" WHERE arrayLong = ARRAY[1,2,3] GROUP BY 1,2 ``` @@ -225,4 +231,23 @@ results: {"label":"row1","strings":"a"} {"label":"row1","strings":"b"} {"label":"row2","strings":"a"} {"label":"row2","strings":"b"} {"label":"row4","strings":"a"} {"label":"row4","strings":"b"} -``` \ No newline at end of file +``` + +## Differences between arrays and multi-value dimensions +Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensions.md). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL-compliant behavior. + +Use care during ingestion to ensure you get the type you want. + +To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers. + +To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings. + +You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like: + +```sql +SELECT COLUMN_NAME, DATA_TYPE +FROM INFORMATION_SCHEMA.COLUMNS +WHERE TABLE_NAME = 'mytable' +``` + +Arrays are type `ARRAY`; multi-value strings are type `VARCHAR`. \ No newline at end of file diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md index 5dcf54579a46..12e575521ab3 100644 --- a/docs/querying/multi-value-dimensions.md +++ b/docs/querying/multi-value-dimensions.md @@ -30,7 +30,7 @@ array of values instead of a single value, such as the `tags` values in the foll {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} ``` -It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md), which behave like standard SQL arrays. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior). +It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md). While array types behave like standard SQL arrays, multi-value dimensions do not. You can find additional details about their behavior in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior). This document describes inserting, filtering, and grouping behavior for multi-value dimensions. 
For information about the internal representation of multi-value dimensions, see [segments documentation](../design/segments.md#multi-value-columns). Examples in this document @@ -46,9 +46,10 @@ The following sections describe inserting, filtering, and grouping behavior base {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []} ``` -## Overview +## Ingestion -When using [native ingestion](../ingestion/native-batch.md), the Druid web console data loader can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. +### Native batch and streaming ingestion +When using native [batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), the Druid web console data loader can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `inputFormat` configuration. @@ -76,7 +77,8 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling. -Multi-value dimensions can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data. +### SQL-based ingestion +Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The multi-stage query engine does not directly handle classic Druid multi-value dimensions. Instead, a special pair of functions enables handling these types: `MV_TO_ARRAY`, which converts multi-value dimensions into `VARCHAR ARRAY`, and `ARRAY_TO_MV`, which coerces them back into `VARCHAR`. The multi-value handling modes described above are not available when using the multi-stage query engine to insert data. For example, to insert the data used in this document: ```sql @@ -99,6 +101,7 @@ FROM "ext" PARTITIONED BY DAY ``` +### SQL-based ingestion with rollup These input arrays can also be grouped prior to converting into a multi-value dimension: ```sql REPLACE INTO "mvd_example_rollup" OVERWRITE ALL @@ -122,10 +125,10 @@ GROUP BY 1, 2, "tags" PARTITIONED BY DAY ``` -Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping. +Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause since we only wish to coerce the type _after_ grouping. -The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY` must be used since the multi-stage engine only supports grouping on multi-value dimensions as arrays, and so they must be coerced first. These arrays then must be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`. 
+The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, you must use `MV_TO_ARRAY` to coerce the values first, since the multi-stage query engine only supports grouping on multi-value dimensions as arrays. These arrays must then be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`. ```sql REPLACE INTO "mvd_example_rollup" OVERWRITE ALL @@ -497,4 +500,23 @@ Having specs are applied at the outermost level of groupBy query processing. You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your [query context](query-context.md). In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any -multi-valued dimensions that were inadvertently included. \ No newline at end of file +multi-valued dimensions that were inadvertently included. + +## Differences between arrays and multi-value dimensions +Avoid confusing [string arrays](arrays.md) with multi-value dimensions. Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL-compliant behavior. + +Use care during ingestion to ensure you get the type you want. + +To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../development/extensions-core/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers. + +To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings. + +You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like: + +```sql +SELECT COLUMN_NAME, DATA_TYPE +FROM INFORMATION_SCHEMA.COLUMNS +WHERE TABLE_NAME = 'mytable' +``` + +Arrays are type `ARRAY`; multi-value strings are type `VARCHAR`. \ No newline at end of file diff --git a/docs/querying/sql-data-types.md b/docs/querying/sql-data-types.md index 10137b3c6ee8..66a5b0d5be9f 100644 --- a/docs/querying/sql-data-types.md +++ b/docs/querying/sql-data-types.md @@ -77,14 +77,14 @@ When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid instead ## Arrays -Druid supports [ARRAY types](arrays.md), which behave as standard SQL arrays, where results are grouped by matching entire arrays. The [`UNNEST` operator](./sql-array-functions.md#unn) can be used to perform operations on individual array elements, translating each element into a separate row. 
+Druid supports [`ARRAY` types](arrays.md), which behave as standard SQL arrays, where results are grouped by matching entire arrays. The [`UNNEST` operator](./sql.md#unnest) can be used to perform operations on individual array elements, translating each element into a separate row. -ARRAY typed columns can be stored in segments with class JSON based ingestion using the 'auto' typed dimension schema shared with [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. For [SQL based ingestion](../multi-stage-query/index.md), the query context parameter `arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In Druid 28, the default mode for this parameter is `'mvd'` for backwards compatibility, which instead can only handle `ARRAY` which it stores in [multi-value string columns](#multi-value-strings). +`ARRAY` typed columns can be stored in segments with classic JSON-based ingestion using the `auto` typed dimension schema shared with [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. For [SQL-based ingestion](../multi-stage-query/index.md), the query context parameter `arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In Druid 28, the default mode for this parameter is `"mvd"` for backwards compatibility, which instead can only handle `ARRAY<STRING>`, which it stores in [multi-value string columns](#multi-value-strings). You can convert multi-value dimensions to standard SQL arrays explicitly with `MV_TO_ARRAY` or implicitly using [array functions](./sql-array-functions.md). You can also use the array functions to construct arrays from multiple columns. Druid serializes `ARRAY` results as a JSON string of the array by default, which can be controlled by the context parameter -`sqlStringifyArrays`. When set to `false`, the arrays will instead be returned as regular JSON arrays instead of in stringified form. +[`sqlStringifyArrays`](sql-query-context.md). When set to `false` and using JSON [result formats](../api-reference/sql-api.md#responses), the arrays are returned as regular JSON arrays rather than in stringified form. ## Multi-value strings @@ -128,10 +128,11 @@ separately while processing. When converted to ARRAY or used with [array functions](./sql-array-functions.md), multi-value strings behave as standard SQL arrays and can no longer be manipulated with non-array functions. -Druid serializes multi-value VARCHAR results as a JSON string of the array, if grouping was not applied on the value. +By default, Druid serializes multi-value VARCHAR results as a JSON string of the array, if grouping was not applied on the value. If the value was grouped, due to the implicit UNNEST behavior, all results will always be standard single value -VARCHAR. ARRAY typed results will be serialized into stringified JSON arrays if the context parameter -`sqlStringifyArrays` is set, otherwise they remain in their array format. +VARCHAR. Serialization of ARRAY typed results is controlled by the context parameter [`sqlStringifyArrays`](sql-query-context.md). When set +to `false` and using JSON [result formats](../api-reference/sql-api.md#responses), the arrays are returned +as regular JSON arrays rather than in stringified form. ## NULL values