change arrayIngestMode default to array (#16789)
* change arrayIngestMode default to array

* remove arrayIngestMode flag option none

* fix space

* fix test
clintropolis authored Jul 25, 2024
1 parent 7e3fab5 commit 5da69a0
Showing 7 changed files with 61 additions and 119 deletions.
71 changes: 32 additions & 39 deletions docs/querying/arrays.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,46 +71,10 @@ The following shows an example `dimensionsSpec` for native ingestion of the data

### SQL-based ingestion

#### `arrayIngestMode`

Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context
parameter `arrayIngestMode: array`.

When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
tables.

When `arrayIngestMode` is `mvd`, SQL `VARCHAR ARRAY` are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior
may change to `array` in a future release.

When `arrayIngestMode` is `none`, Druid throws an exception when trying to store any type of arrays. This mode is most
useful when set in the system default query context with `druid.query.default.context.arrayIngestMode = none`, in cases
where the cluster administrator wants SQL query authors to explicitly provide one or the other in their query context.

The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
`arrayIngestMode: mvd`.

| SQL type | Stored type when `arrayIngestMode: array` | Stored type when `arrayIngestMode: mvd` (default) |
|---|---|---|
|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|

In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
[multi-value strings](multi-value-dimensions.md).

When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
mixing arrays and multi-value strings in the same column.
Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md).

#### Examples

Set [`arrayIngestMode: array`](#arrayingestmode) in your query context to run the following examples.

```sql
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
Expand Down Expand Up @@ -169,6 +133,35 @@ GROUP BY 1,2,3,4,5
PARTITIONED BY DAY
```

#### `arrayIngestMode`

For backwards-compatible behavior with Druid versions older than 31, you can set the `arrayIngestMode` query context flag.

When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
tables and is the default for Druid 31 and newer.

When `arrayIngestMode` is `mvd` (legacy), SQL `VARCHAR ARRAY` values are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
mode is not recommended and will be removed in a future release, but is provided for backwards compatibility.

The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
`arrayIngestMode: mvd`.

| SQL type | Stored type when `arrayIngestMode: array` (default) | Stored type when `arrayIngestMode: mvd` |
|---|---|---|
|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|

In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
[multi-value strings](multi-value-dimensions.md).
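
The mapping in the table above can be sketched as a small lookup. This is an illustrative sketch only, not Druid's implementation: the class `ArrayIngestModeSketch`, its `Mode` enum, and the `storedType` method are hypothetical names, and the returned strings simply echo the table cells.

```java
// Hypothetical sketch of the table above; none of these names come from Druid.
public class ArrayIngestModeSketch {
    // The two modes that remain after this change.
    public enum Mode { ARRAY, MVD }

    // Assumption: mirrors the documented default for Druid 31 and newer.
    public static final Mode DEFAULT_MODE = Mode.ARRAY;

    // Returns the stored type for a SQL array type under the given mode.
    public static String storedType(String sqlType, Mode mode) {
        switch (sqlType) {
            case "VARCHAR ARRAY":
                return mode == Mode.MVD ? "multi-value STRING" : "ARRAY<STRING>";
            case "BIGINT ARRAY":
                return numericStoredType("ARRAY<LONG>", sqlType, mode);
            case "DOUBLE ARRAY":
                return numericStoredType("ARRAY<DOUBLE>", sqlType, mode);
            default:
                throw new IllegalArgumentException("Not a SQL array type covered by the table: " + sqlType);
        }
    }

    // Numeric arrays are "not possible (validation error)" in mvd mode.
    private static String numericStoredType(String arrayType, String sqlType, Mode mode) {
        if (mode == Mode.MVD) {
            throw new IllegalStateException(sqlType + " cannot be loaded under arrayIngestMode: mvd");
        }
        return arrayType;
    }

    public static void main(String[] args) {
        System.out.println(storedType("VARCHAR ARRAY", DEFAULT_MODE)); // prints ARRAY<STRING>
        System.out.println(storedType("VARCHAR ARRAY", Mode.MVD));     // prints multi-value STRING
    }
}
```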

When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
mixing arrays and multi-value strings in the same column.
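
As a rough illustration of the two accepted forms of `skipTypeVerification`, the sketch below normalizes either a comma-separated list or a JSON array in string form into a set of column names, then applies the rule described above. It is a simplified sketch under stated assumptions, not Druid's parser: the class and method names are hypothetical, and the hand-rolled JSON handling does not cover escaped quotes or commas inside column names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; Druid's actual parsing and validation code may differ.
public class SkipTypeVerificationSketch {
    // Accepts "dim3,dim4" or "[\"dim3\",\"dim4\"]" and returns the column names.
    public static Set<String> parseColumns(String raw) {
        Set<String> columns = new LinkedHashSet<>();
        if (raw == null) {
            return columns;
        }
        String s = raw.trim();
        // JSON-array form: drop the surrounding brackets (simplified handling).
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        for (String part : s.split(",")) {
            String name = part.trim();
            // Strip the double quotes that the JSON form puts around each name.
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            if (!name.isEmpty()) {
                columns.add(name);
            }
        }
        return columns;
    }

    // A column passes unless it would mix string arrays with multi-value
    // strings and is not named in skipTypeVerification.
    public static boolean passesValidation(String column, boolean wouldMixTypes, Set<String> skipped) {
        return !wouldMixTypes || skipped.contains(column);
    }
}
```

Here `parseColumns("dim3, dim4")` and `parseColumns("[\"dim3\",\"dim4\"]")` yield the same two column names, so a statement that would mix types in `dim3` passes only when `dim3` appears in the set.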

## Querying arrays

Expand Down Expand Up @@ -284,9 +277,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio

Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.

To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions. Multi-value dimensions can only contain strings.

You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:

Expand Down
4 changes: 2 additions & 2 deletions docs/querying/multi-value-dimensions.md
Original file line number Diff line number Diff line change
Expand Up @@ -507,9 +507,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio

Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter [`"arrayIngestMode": "array"`](arrays.md#arrayingestmode). Arrays may contain strings or numbers.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.

To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any [`arrayIngestMode`](arrays.md#arrayingestmode). Multi-value dimensions can only contain strings.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings.

You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,6 @@
*/
public enum ArrayIngestMode
{
/**
* Disables the ingestion of arrays via MSQ's INSERT queries.
*/
NONE,

/**
* String arrays are ingested as MVDs. This is to preserve the legacy behaviour of Druid and will be removed in the
* future, since MVDs are not true array types and the behaviour is incorrect.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -131,19 +131,9 @@ public static ColumnType getDimensionType(
} else if (queryType.getType() == ValueType.ARRAY) {
ValueType elementType = queryType.getElementType().getType();
if (elementType == ValueType.STRING) {
if (arrayIngestMode == ArrayIngestMode.NONE) {
throw InvalidInput.exception(
"String arrays can not be ingested when '%s' is set to '%s'. Set '%s' in query context "
+ "to 'array' to ingest the string array as an array, or ingest it as an MVD by explicitly casting the "
+ "array to an MVD with the ARRAY_TO_MV function.",
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
StringUtils.toLowerCase(arrayIngestMode.name()),
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
);
} else if (arrayIngestMode == ArrayIngestMode.MVD) {
if (arrayIngestMode == ArrayIngestMode.MVD) {
return ColumnType.STRING;
} else {
assert arrayIngestMode == ArrayIngestMode.ARRAY;
return queryType;
}
} else if (elementType.isNumeric()) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,7 @@ public class MultiStageQueryContext
public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;

public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.ARRAY;

public static final String NEXT_WINDOW_SHUFFLE_COL = "__windowShuffleCol";

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -122,30 +122,7 @@ public void setup() throws IOException
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testInsertStringArrayWithArrayIngestModeNone(String contextName, Map<String, Object> context)
{

final Map<String, Object> adjustedContext = new HashMap<>(context);
adjustedContext.put(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "none");

testIngestQuery().setSql(
"INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
.setQueryContext(adjustedContext)
.setExpectedExecutionErrorMatcher(CoreMatchers.allOf(
CoreMatchers.instanceOf(ISE.class),
ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString(
"String arrays can not be ingested when 'arrayIngestMode' is set to 'none'"))
))
.verifyExecutionError();
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
Expand All @@ -172,7 +149,7 @@ public void testReplaceMvdWithStringArray(String contextName, Map<String, Object
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
Expand Down Expand Up @@ -200,7 +177,7 @@ public void testReplaceStringArrayWithMvdInArrayMode(String contextName, Map<Str
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
Expand Down Expand Up @@ -228,7 +205,7 @@ public void testReplaceStringArrayWithMvdInMvdMode(String contextName, Map<Strin
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
Expand Down Expand Up @@ -277,7 +254,7 @@ public void testReplaceMvdWithStringArraySkipValidation(String contextName, Map<
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
Expand Down Expand Up @@ -316,25 +293,40 @@ public void testReplaceMvdWithMvd(String contextName, Map<String, Object> contex
}

/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to mvd (default) and the only array type to be
* ingested is string array
* Tests the behaviour of INSERT query when arrayIngestMode is set to array (default)
*/
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testInsertOnFoo1WithMultiValueToArrayGroupByWithDefaultContext(String contextName, Map<String, Object> context)
{
RowSignature rowSignature = RowSignature.builder()
.add("__time", ColumnType.LONG)
.add("dim3", ColumnType.STRING)
.add("dim3", ColumnType.STRING_ARRAY)
.build();

List<Object[]> expectedRows = new ArrayList<>(
ImmutableList.of(
new Object[]{0L, null},
new Object[]{0L, new Object[]{"a", "b"}}
)
);
if (!useDefault) {
expectedRows.add(new Object[]{0L, new Object[]{""}});
}
expectedRows.addAll(
ImmutableList.of(
new Object[]{0L, new Object[]{"b", "c"}},
new Object[]{0L, new Object[]{"d"}}
)
);

testIngestQuery().setSql(
"INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
.setExpectedDataSource("foo1")
.setExpectedRowSignature(rowSignature)
.setQueryContext(context)
.setExpectedSegment(ImmutableSet.of(SegmentId.of("foo1", Intervals.ETERNITY, "test", 0)))
.setExpectedResultRows(expectedMultiValueFooRowsToArray())
.setExpectedResultRows(expectedRows)
.verifyResults();
}

Expand Down Expand Up @@ -603,13 +595,6 @@ public void testInsertArraysAsArrays(String contextName, Map<String, Object> con
.verifyResults();
}

@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testSelectOnArraysWithArrayIngestModeAsNone(String contextName, Map<String, Object> context)
{
testSelectOnArrays(contextName, context, "none");
}

@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testSelectOnArraysWithArrayIngestModeAsMVD(String contextName, Map<String, Object> context)
Expand Down Expand Up @@ -1128,20 +1113,4 @@ public void testScanExternArrayWithNonConvertibleType(String contextName, Map<St
.setExpectedResultRows(expectedRows)
.verifyResults();
}

private List<Object[]> expectedMultiValueFooRowsToArray()
{
List<Object[]> expectedRows = new ArrayList<>();
expectedRows.add(new Object[]{0L, null});
if (!useDefault) {
expectedRows.add(new Object[]{0L, ""});
}

expectedRows.addAll(ImmutableList.of(
new Object[]{0L, ImmutableList.of("a", "b")},
new Object[]{0L, ImmutableList.of("b", "c")},
new Object[]{0L, "d"}
));
return expectedRows;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -221,17 +221,12 @@ public void useAutoColumnSchemes_set_returnsCorrectValue()
@Test
public void arrayIngestMode_unset_returnsDefaultValue()
{
Assert.assertEquals(ArrayIngestMode.MVD, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
Assert.assertEquals(ArrayIngestMode.ARRAY, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
}

@Test
public void arrayIngestMode_set_returnsCorrectValue()
{
Assert.assertEquals(
ArrayIngestMode.NONE,
MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "none")))
);

Assert.assertEquals(
ArrayIngestMode.MVD,
MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "mvd")))
Expand Down
