change arrayIngestMode default to array (#16789)

* change arrayIngestMode default to array
* remove arrayIngestMode flag option none
* fix space
* fix test
apache · Jul 25, 2024 · 5da69a0
1 parent 7e3fab5
commit 5da69a0
Showing 7 changed files with 61 additions and 119 deletions.
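
As a reading aid for the hunks that follow, here is a minimal standalone sketch of the behavior after this commit: an unset `arrayIngestMode` resolves to `array`, `mvd` remains as the legacy opt-in, and `none` is no longer a valid value. The class `ArrayIngestModeSketch` and the helper `storedTypeForStringArray` are hypothetical names invented for illustration; only the enum constants, the context key, and the default mirror the diff. The real lookup lives in `MultiStageQueryContext` and may report invalid values differently.

```java
import java.util.Locale;
import java.util.Map;

public class ArrayIngestModeSketch
{
  /** The two modes that remain after this commit; NONE has been removed. */
  enum ArrayIngestMode { MVD, ARRAY }

  /** Same key string as MultiStageQueryContext.CTX_ARRAY_INGEST_MODE in the diff. */
  static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";

  /** New default introduced by this commit (previously MVD). */
  static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.ARRAY;

  /** Hypothetical stand-in for the real context lookup: falls back to the default when unset. */
  static ArrayIngestMode getArrayIngestMode(Map<String, Object> context)
  {
    Object value = context.get(CTX_ARRAY_INGEST_MODE);
    if (value == null) {
      return DEFAULT_ARRAY_INGEST_MODE;
    }
    // Assumption: an unknown value such as "none" fails here with IllegalArgumentException;
    // the actual Druid error type and message may differ.
    return ArrayIngestMode.valueOf(value.toString().toUpperCase(Locale.ROOT));
  }

  /** Simplified view of the string-array branch in DimensionSchemaUtils.getDimensionType. */
  static String storedTypeForStringArray(ArrayIngestMode mode)
  {
    return mode == ArrayIngestMode.MVD ? "STRING" : "ARRAY<STRING>";
  }

  public static void main(String[] args)
  {
    // No context value: the new default applies.
    System.out.println(storedTypeForStringArray(getArrayIngestMode(Map.of())));
    // Explicit legacy opt-in.
    System.out.println(storedTypeForStringArray(
        getArrayIngestMode(Map.of(CTX_ARRAY_INGEST_MODE, "mvd"))));
  }
}
```

Running `main` prints `ARRAY<STRING>` followed by `STRING`, which matches the "Stored type" table for `VARCHAR ARRAY` in the updated `arrays.md`.
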
diff --git a/docs/querying/arrays.md b/docs/querying/arrays.md
@@ -71,46 +71,10 @@ The following shows an example `dimensionsSpec` for native ingestion of the data
 
 ### SQL-based ingestion
 
-#### `arrayIngestMode`
-
-Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context
-parameter `arrayIngestMode: array`.
-
-When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
-tables.
-
-When `arrayIngestMode` is `mvd`, SQL `VARCHAR ARRAY` are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
-This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
-as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
-is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior
-may change to `array` in a future release.
-
-When `arrayIngestMode` is `none`, Druid throws an exception when trying to store any type of arrays. This mode is most
-useful when set in the system default query context with `druid.query.default.context.arrayIngestMode = none`, in cases
-where the cluster administrator wants SQL query authors to explicitly provide one or the other in their query context.
-
-The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
-`arrayIngestMode: mvd`.
-
-| SQL type | Stored type when `arrayIngestMode: array` | Stored type when `arrayIngestMode: mvd` (default) |
-|---|---|---|
-|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
-|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
-|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
-
-In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
-[multi-value strings](multi-value-dimensions.md).
-
-When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
-to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
-validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
-a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
-mixing arrays and multi-value strings in the same column.
+Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md).
 
 #### Examples
 
-Set [`arrayIngestMode: array`](#arrayingestmode) in your query context to run the following examples.
-
 ```sql
 REPLACE INTO "array_example" OVERWRITE ALL
 WITH "ext" AS (
@@ -169,6 +133,35 @@ GROUP BY 1,2,3,4,5
 PARTITIONED BY DAY
 ```
 
+#### `arrayIngestMode`
+
+For seamless backwards compatible behavior with Druid versions older than 31, there is an `arrayIngestMode` query context flag.
+
+When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
+tables and the default configuration for Druid 31 and newer.
+
+When `arrayIngestMode` is `mvd` (legacy), SQL `VARCHAR ARRAY` are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
+This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
+as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
+mode is not recommended and will be removed in a future release, but is provided for backwards compatibility.
+
+The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
+`arrayIngestMode: mvd`.
+
+| SQL type | Stored type when `arrayIngestMode: array` (default) | Stored type when `arrayIngestMode: mvd` |
+|---|---|---|
+|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
+|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
+|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
+
+In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
+[multi-value strings](multi-value-dimensions.md).
+
+When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
+to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
+validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
+a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
+mixing arrays and multi-value strings in the same column.
 
 ## Querying arrays
 
@@ -284,9 +277,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
 
 Use care during ingestion to ensure you get the type you want.
 
-To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.
+To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
 
-To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.
+To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions. Multi-value dimensions can only contain strings.
 
 You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
 

diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md
@@ -507,9 +507,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
 
 Use care during ingestion to ensure you get the type you want.
 
-To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter [`"arrayIngestMode": "array"`](arrays.md#arrayingestmode). Arrays may contain strings or numbers.
+To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
 
-To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any [`arrayIngestMode`](arrays.md#arrayingestmode). Multi-value dimensions can only contain strings.
+To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings.
 
 You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
 

diff --git a/...sions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/ArrayIngestMode.java b/...sions-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/ArrayIngestMode.java
@@ -25,11 +25,6 @@
  */
 public enum ArrayIngestMode
 {
-  /**
-   * Disables the ingestion of arrays via MSQ's INSERT queries.
-   */
-  NONE,
-
   /**
    * String arrays are ingested as MVDs. This is to preserve the legacy behaviour of Druid and will be removed in the
    * future, since MVDs are not true array types and the behaviour is incorrect.

diff --git a/...-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/DimensionSchemaUtils.java b/...-core/multi-stage-query/src/main/java/org/apache/druid/msq/util/DimensionSchemaUtils.java
@@ -131,19 +131,9 @@ public static ColumnType getDimensionType(
     } else if (queryType.getType() == ValueType.ARRAY) {
       ValueType elementType = queryType.getElementType().getType();
       if (elementType == ValueType.STRING) {
-        if (arrayIngestMode == ArrayIngestMode.NONE) {
-          throw InvalidInput.exception(
-              "String arrays can not be ingested when '%s' is set to '%s'. Set '%s' in query context "
-              + "to 'array' to ingest the string array as an array, or ingest it as an MVD by explicitly casting the "
-              + "array to an MVD with the ARRAY_TO_MV function.",
-              MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
-              StringUtils.toLowerCase(arrayIngestMode.name()),
-              MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
-          );
-        } else if (arrayIngestMode == ArrayIngestMode.MVD) {
+        if (arrayIngestMode == ArrayIngestMode.MVD) {
           return ColumnType.STRING;
         } else {
-          assert arrayIngestMode == ArrayIngestMode.ARRAY;
           return queryType;
         }
       } else if (elementType.isNumeric()) {

diff --git a/...ore/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java b/...ore/multi-stage-query/src/main/java/org/apache/druid/msq/util/MultiStageQueryContext.java
@@ -165,7 +165,7 @@ public class MultiStageQueryContext
   public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;
 
   public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
-  public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;
+  public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.ARRAY;
 
   public static final String NEXT_WINDOW_SHUFFLE_COL = "__windowShuffleCol";
 

diff --git a/extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQArraysTest.java b/extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQArraysTest.java
@@ -122,30 +122,7 @@ public void setup() throws IOException
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
-   * string arrays
-   */
-  @MethodSource("data")
-  @ParameterizedTest(name = "{index}:with context {0}")
-  public void testInsertStringArrayWithArrayIngestModeNone(String contextName, Map<String, Object> context)
-  {
-
-    final Map<String, Object> adjustedContext = new HashMap<>(context);
-    adjustedContext.put(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "none");
-
-    testIngestQuery().setSql(
-                         "INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
-                     .setQueryContext(adjustedContext)
-                     .setExpectedExecutionErrorMatcher(CoreMatchers.allOf(
-                         CoreMatchers.instanceOf(ISE.class),
-                         ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString(
-                             "String arrays can not be ingested when 'arrayIngestMode' is set to 'none'"))
-                     ))
-                     .verifyExecutionError();
-  }
-
-  /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -172,7 +149,7 @@ public void testReplaceMvdWithStringArray(String contextName, Map<String, Object
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -200,7 +177,7 @@ public void testReplaceStringArrayWithMvdInArrayMode(String contextName, Map<Str
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -228,7 +205,7 @@ public void testReplaceStringArrayWithMvdInMvdMode(String contextName, Map<Strin
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -277,7 +254,7 @@ public void testReplaceMvdWithStringArraySkipValidation(String contextName, Map<
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -316,25 +293,40 @@ public void testReplaceMvdWithMvd(String contextName, Map<String, Object> contex
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to mvd (default) and the only array type to be
-   * ingested is string array
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to array (default)
    */
   @MethodSource("data")
   @ParameterizedTest(name = "{index}:with context {0}")
   public void testInsertOnFoo1WithMultiValueToArrayGroupByWithDefaultContext(String contextName, Map<String, Object> context)
   {
     RowSignature rowSignature = RowSignature.builder()
                                             .add("__time", ColumnType.LONG)
-                                            .add("dim3", ColumnType.STRING)
+                                            .add("dim3", ColumnType.STRING_ARRAY)
                                             .build();
 
+    List<Object[]> expectedRows = new ArrayList<>(
+        ImmutableList.of(
+            new Object[]{0L, null},
+            new Object[]{0L, new Object[]{"a", "b"}}
+        )
+    );
+    if (!useDefault) {
+      expectedRows.add(new Object[]{0L, new Object[]{""}});
+    }
+    expectedRows.addAll(
+        ImmutableList.of(
+            new Object[]{0L, new Object[]{"b", "c"}},
+            new Object[]{0L, new Object[]{"d"}}
+        )
+    );
+
     testIngestQuery().setSql(
                          "INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
                      .setExpectedDataSource("foo1")
                      .setExpectedRowSignature(rowSignature)
                      .setQueryContext(context)
                      .setExpectedSegment(ImmutableSet.of(SegmentId.of("foo1", Intervals.ETERNITY, "test", 0)))
-                     .setExpectedResultRows(expectedMultiValueFooRowsToArray())
+                     .setExpectedResultRows(expectedRows)
                      .verifyResults();
   }
 
@@ -603,13 +595,6 @@ public void testInsertArraysAsArrays(String contextName, Map<String, Object> con
                      .verifyResults();
   }
 
-  @MethodSource("data")
-  @ParameterizedTest(name = "{index}:with context {0}")
-  public void testSelectOnArraysWithArrayIngestModeAsNone(String contextName, Map<String, Object> context)
-  {
-    testSelectOnArrays(contextName, context, "none");
-  }
-
   @MethodSource("data")
   @ParameterizedTest(name = "{index}:with context {0}")
   public void testSelectOnArraysWithArrayIngestModeAsMVD(String contextName, Map<String, Object> context)
@@ -1128,20 +1113,4 @@ public void testScanExternArrayWithNonConvertibleType(String contextName, Map<St
                      .setExpectedResultRows(expectedRows)
                      .verifyResults();
   }
-
-  private List<Object[]> expectedMultiValueFooRowsToArray()
-  {
-    List<Object[]> expectedRows = new ArrayList<>();
-    expectedRows.add(new Object[]{0L, null});
-    if (!useDefault) {
-      expectedRows.add(new Object[]{0L, ""});
-    }
-
-    expectedRows.addAll(ImmutableList.of(
-        new Object[]{0L, ImmutableList.of("a", "b")},
-        new Object[]{0L, ImmutableList.of("b", "c")},
-        new Object[]{0L, "d"}
-    ));
-    return expectedRows;
-  }
 }
diff --git a/...multi-stage-query/src/test/java/org/apache/druid/msq/util/MultiStageQueryContextTest.java b/...multi-stage-query/src/test/java/org/apache/druid/msq/util/MultiStageQueryContextTest.java
@@ -221,17 +221,12 @@ public void useAutoColumnSchemes_set_returnsCorrectValue()
   @Test
   public void arrayIngestMode_unset_returnsDefaultValue()
   {
-    Assert.assertEquals(ArrayIngestMode.MVD, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
+    Assert.assertEquals(ArrayIngestMode.ARRAY, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
   }
 
   @Test
   public void arrayIngestMode_set_returnsCorrectValue()
   {
-    Assert.assertEquals(
-        ArrayIngestMode.NONE,
-        MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "none")))
-    );
-
     Assert.assertEquals(
         ArrayIngestMode.MVD,
         MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "mvd")))