sql compatible three-valued logic native filters (#15058)
* SQL-compatible three-valued logic for native filters when `druid.expressions.useStrictBooleans=true`, `druid.generic.useDefaultValueForNull=false`, and the new `druid.generic.useThreeValueLogicForNativeFilters=true`
* log.warn if non-default configurations are used, to guide operators toward SQL compliant behavior
clintropolis authored Oct 12, 2023
1 parent 265c811 commit d0f6460
Showing 164 changed files with 4,360 additions and 1,548 deletions.
3 changes: 2 additions & 1 deletion docs/configuration/index.md
@@ -798,8 +798,9 @@ Support for 64-bit floating point columns was released in Druid 0.11.0, so if yo
Prior to version 0.13.0, Druid string columns treated `''` and `null` values as interchangeable, and numeric columns were unable to represent `null` values, coercing `null` to `0`. Druid 0.13.0 introduced a mode which enabled SQL compatible null handling, allowing string columns to distinguish empty strings from nulls, and numeric columns to contain null rows.

|Property|Description|Default|
-|---|---|---|
+|--------|-----------|-------|
|`druid.generic.useDefaultValueForNull`|Set to `false` to store and query data in SQL compatible mode. When set to `true` (legacy mode), `null` values will be stored as `''` for string columns and `0` for numeric columns.|`false`|
+|`druid.generic.useThreeValueLogicForNativeFilters`|Set to `true` to use SQL compatible three-valued logic when processing native Druid filters, when `druid.generic.useDefaultValueForNull=false` and `druid.expressions.useStrictBooleans=true`. When set to `false`, Druid uses two-valued logic for filter processing, even when `druid.generic.useDefaultValueForNull=false` and `druid.expressions.useStrictBooleans=true`. See [Boolean logic](../querying/sql-data-types.md#boolean-logic) for more details.|`true`|
|`druid.generic.ignoreNullsForStringCardinality`|When set to `true`, `null` values will be ignored for the built-in cardinality aggregator over string columns. Set to `false` to include `null` values while estimating cardinality of only string columns using the built-in cardinality aggregator. This setting takes effect only when `druid.generic.useDefaultValueForNull` is set to `true` and is ignored in SQL compatibility mode. Additionally, empty strings (equivalent to null) are not counted when this is set to `true`. |`false`|
This mode does have a storage size and query performance cost, see [segment documentation](../design/segments.md#handling-null-values) for more details.

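For reference, here is a minimal sketch of the SQL compatible mode described by the table above, as the three settings would appear in `runtime.properties` (the values shown are the documented defaults):

```properties
# SQL compatible null handling and three-valued logic (all defaults)
druid.generic.useDefaultValueForNull=false
druid.expressions.useStrictBooleans=true
druid.generic.useThreeValueLogicForNativeFilters=true
```
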
2 changes: 2 additions & 0 deletions docs/querying/filters.md
@@ -33,6 +33,8 @@ sidebar_label: "Filters"
A filter is a JSON object indicating which rows of data should be included in the computation for a query. It’s essentially the equivalent of the WHERE clause in SQL.
Filters are commonly applied on dimensions, but can be applied on aggregated metrics, for example, see [Filtered aggregator](./aggregations.md#filtered-aggregator) and [Having filters](./having.md).

+By default, Druid uses SQL compatible three-valued logic when filtering. See [Boolean logic](./sql-data-types.md#boolean-logic) for more details.

Apache Druid supports the following types of filters.

## Selector filter
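
To illustrate the new default documented above: under three-valued logic, a negated filter no longer matches rows where the inner filter evaluates to UNKNOWN. A hedged sketch using the long-standing `not` and `selector` filter types (the column `status` and value `active` are hypothetical):

```json
{
  "type": "not",
  "field": {
    "type": "selector",
    "dimension": "status",
    "value": "active"
  }
}
```

Under two-valued logic this filter matches rows where `status` is NULL; under three-valued logic those rows evaluate to UNKNOWN and are excluded, mirroring SQL's `status <> 'active'`.
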
12 changes: 6 additions & 6 deletions docs/querying/sql-data-types.md
@@ -152,14 +152,14 @@ values are treated as zeroes. This was the default prior to Druid 28.0.0.

## Boolean logic

-The [`druid.expressions.useStrictBooleans`](../configuration/index.md#expression-processing-configurations)
-runtime property controls Druid's boolean logic mode. For the most SQL compliant behavior, set this to `true` (the default).
+By default, Druid uses [SQL three-valued logic](https://en.wikipedia.org/wiki/Three-valued_logic#SQL) for filter processing
+and boolean expression evaluation. This behavior relies on three settings:

-When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for
-[expressions](math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters.
-However, even in this mode, Druid uses two-valued logic for filter types other than `expression`.
+* [`druid.generic.useDefaultValueForNull`](../configuration/index.md#sql-compatible-null-handling) must be set to `false` (the default): this runtime property allows NULL values to exist in numeric columns and expressions, and lets string typed columns distinguish between NULL and the empty string
+* [`druid.expressions.useStrictBooleans`](../configuration/index.md#expression-processing-configurations) must be set to `true` (the default): this runtime property controls Druid's boolean logic mode for expressions and coerces all expression boolean values to be represented as `1` for true and `0` for false
+* [`druid.generic.useThreeValueLogicForNativeFilters`](../configuration/index.md#sql-compatible-null-handling) must be set to `true` (the default): this runtime property decouples three-valued logic handling from `druid.generic.useDefaultValueForNull` and `druid.expressions.useStrictBooleans`, for backwards compatibility with older versions of Druid that did not fully support SQL compatible null handling

-When `druid.expressions.useStrictBooleans = false` (legacy mode), Druid uses two-valued logic.
+If any of these settings is configured with a non-default value, Druid uses two-valued logic for non-expression filters. Expression filters are controlled independently by `druid.expressions.useStrictBooleans`: when it is set to `false`, Druid uses two-valued logic for expressions.
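
As a plain-Java illustration of the semantics described above (not Druid code; `Boolean` with `null` stands in for SQL's UNKNOWN):

```java
// Minimal Kleene three-valued logic sketch: NOT(UNKNOWN) = UNKNOWN,
// UNKNOWN AND FALSE = FALSE, UNKNOWN OR TRUE = TRUE.
class ThreeValuedLogicSketch
{
  static Boolean not(Boolean a)
  {
    return a == null ? null : !a;
  }

  static Boolean and(Boolean a, Boolean b)
  {
    if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
      return false; // FALSE dominates AND regardless of UNKNOWN
    }
    return (a == null || b == null) ? null : true;
  }

  static Boolean or(Boolean a, Boolean b)
  {
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return true; // TRUE dominates OR regardless of UNKNOWN
    }
    return (a == null || b == null) ? null : false;
  }

  public static void main(String[] args)
  {
    // A filter only keeps rows whose predicate is definitely TRUE, so both
    // `x = 'a'` and `NOT (x = 'a')` drop rows where x is NULL.
    Boolean xEqualsA = null;                 // NULL = 'a' evaluates to UNKNOWN
    System.out.println(not(xEqualsA));       // null (UNKNOWN): row dropped
    System.out.println(and(xEqualsA, true)); // null (UNKNOWN): row dropped
  }
}
```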

## Nested columns

@@ -19,7 +19,7 @@

package org.apache.druid.segment;

-import com.google.common.base.Predicate;
+import org.apache.druid.query.filter.DruidPredicateFactory;
import org.apache.druid.query.filter.ValueMatcher;
import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
import org.apache.druid.segment.data.IndexedInts;
@@ -55,7 +55,7 @@ public ValueMatcher makeValueMatcher(@Nullable String value)
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
// Map column doesn't match with any string
return false;
@@ -70,12 +70,12 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}

@Override
-public ValueMatcher makeValueMatcher(Predicate<String> predicate)
+public ValueMatcher makeValueMatcher(DruidPredicateFactory predicateFactory)
{
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
return false;
}
@@ -21,6 +21,7 @@

import com.google.common.base.Preconditions;
import com.google.common.base.Predicate;
+import org.apache.druid.query.filter.DruidPredicateFactory;
import org.apache.druid.query.filter.ValueMatcher;
import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
import org.apache.druid.segment.data.IndexedInts;
@@ -68,9 +69,10 @@ public ValueMatcher makeValueMatcher(@Nullable String value)
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
-return Objects.equals(value, getObject());
+final Object rowValue = getObject();
+return (includeUnknown && rowValue == null) || Objects.equals(value, rowValue);
}

@Override
@@ -84,14 +86,17 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}

@Override
-public ValueMatcher makeValueMatcher(Predicate<String> predicate)
+public ValueMatcher makeValueMatcher(DruidPredicateFactory predicateFactory)
{
+final Predicate<String> predicate = predicateFactory.makeStringPredicate();
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
-return predicate.apply((String) getObject());
+final String rowValue = (String) getObject();
+final boolean matchNull = includeUnknown && predicateFactory.isNullInputUnknown();
+return (matchNull && rowValue == null) || predicate.apply(rowValue);
}

@Override
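
The `includeUnknown` flag threaded through the matchers above is what lets composite filters implement three-valued logic. A self-contained sketch of the composition idea (`SimpleMatcher` and `not` are illustrative names, not Druid's actual classes):

```java
// Sketch: how matches(includeUnknown) composes under negation so that
// UNKNOWN rows are excluded from both a filter and its NOT.
interface SimpleMatcher
{
  boolean matches(boolean includeUnknown);
}

class NotCompositionSketch
{
  static SimpleMatcher not(SimpleMatcher base)
  {
    // Flip the flag: rows the base matcher only matches "optimistically"
    // (as unknowns) must not become definite matches of the negation.
    return includeUnknown -> !base.matches(!includeUnknown);
  }

  public static void main(String[] args)
  {
    final String rowValue = null; // an UNKNOWN input

    // Mirrors the equality matcher in the diff above: unknown rows match
    // only when the caller explicitly asks to include them.
    SimpleMatcher eqFoo = includeUnknown ->
        (includeUnknown && rowValue == null) || "foo".equals(rowValue);

    System.out.println(eqFoo.matches(false));      // false: NULL = 'foo' is UNKNOWN
    System.out.println(not(eqFoo).matches(false)); // false: NOT UNKNOWN is UNKNOWN too
  }
}
```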
@@ -22,6 +22,7 @@
import com.google.common.collect.ImmutableList;
import org.apache.druid.collections.DefaultBlockingPool;
import org.apache.druid.collections.StupidPool;
+import org.apache.druid.common.config.NullHandling;
import org.apache.druid.data.input.MapBasedRow;
import org.apache.druid.jackson.DefaultObjectMapper;
import org.apache.druid.java.util.common.DateTimes;
@@ -34,32 +35,32 @@
import org.apache.druid.query.TableDataSource;
import org.apache.druid.query.aggregation.CountAggregatorFactory;
import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.filter.EqualityFilter;
+import org.apache.druid.query.filter.InDimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
import org.apache.druid.query.groupby.GroupByQuery;
import org.apache.druid.query.groupby.GroupByQueryConfig;
import org.apache.druid.query.groupby.GroupByQueryQueryToolChest;
import org.apache.druid.query.groupby.GroupByQueryRunnerFactory;
import org.apache.druid.query.groupby.GroupingEngine;
import org.apache.druid.query.groupby.ResultRow;
import org.apache.druid.query.spec.MultipleIntervalSegmentSpec;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.incremental.IncrementalIndex;
import org.apache.druid.testing.InitializedNullHandlingTest;
import org.apache.druid.timeline.SegmentId;
import org.junit.Assert;
import org.junit.Before;
-import org.junit.Rule;
import org.junit.Test;
-import org.junit.rules.ExpectedException;

import java.io.IOException;
import java.nio.ByteBuffer;
+import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class MapVirtualColumnGroupByTest extends InitializedNullHandlingTest
{
-@Rule
-public ExpectedException expectedException = ExpectedException.none();

private QueryRunner<ResultRow> runner;

@Before
@@ -132,11 +133,14 @@ public void testWithMapColumn()
null
);

-expectedException.expect(UnsupportedOperationException.class);
-expectedException.expectMessage("Map column doesn't support getRow()");
-runner.run(QueryPlus.wrap(query)).toList();
+Throwable t = Assert.assertThrows(
+UnsupportedOperationException.class,
+() -> runner.run(QueryPlus.wrap(query)).toList()
+);
+Assert.assertEquals("Map column doesn't support getRow()", t.getMessage());
}
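
The migration above replaces JUnit 4's `ExpectedException` rule with `Assert.assertThrows`, which scopes the expected failure to a single lambda and returns the exception for further assertions. A generic sketch of the pattern (test name and message are illustrative):

```java
import org.junit.Assert;
import org.junit.Test;

public class AssertThrowsPatternTest
{
  @Test
  public void testThrowingCall()
  {
    // assertThrows (JUnit 4.13+) fails unless the lambda throws the given
    // type, and hands back the exception so details can be asserted on.
    Throwable t = Assert.assertThrows(
        IllegalStateException.class,
        () -> { throw new IllegalStateException("boom"); }
    );
    Assert.assertEquals("boom", t.getMessage());
  }
}
```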


@Test
public void testWithSubColumn()
{
@@ -166,4 +170,124 @@ public void testWithSubColumn()

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
new EqualityFilter("params.key3", ColumnType.STRING, "value3", null),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected = ImmutableList.of(
new MapBasedRow(
DateTimes.of("2011-01-12T00:00:00.000Z"),
MapVirtualColumnTestBase.mapOf("count", 1L, "params.key3", "value3")
)
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithPredicateFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
new InDimFilter("params.key3", ImmutableList.of("value1", "value3"), null),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected = ImmutableList.of(
new MapBasedRow(
DateTimes.of("2011-01-12T00:00:00.000Z"),
MapVirtualColumnTestBase.mapOf("count", 1L, "params.key3", "value3")
)
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithNotFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
NotDimFilter.of(new EqualityFilter("params.key3", ColumnType.STRING, "value3", null)),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected;
if (NullHandling.sqlCompatible()) {
expected = Collections.emptyList();
} else {
expected = ImmutableList.of(
new MapBasedRow(DateTimes.of("2011-01-12T00:00:00.000Z"), MapVirtualColumnTestBase.mapOf("count", 2L))
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());
}

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithNotPredicateFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
NotDimFilter.of(new InDimFilter("params.key3", ImmutableList.of("value1", "value3"), null)),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected;
if (NullHandling.sqlCompatible()) {
expected = Collections.emptyList();
} else {
expected = ImmutableList.of(
new MapBasedRow(DateTimes.of("2011-01-12T00:00:00.000Z"), MapVirtualColumnTestBase.mapOf("count", 2L))
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());
}

Assert.assertEquals(expected, result);
}
}
@@ -279,7 +279,7 @@ public void testQuantileOnCastedString()
10.1,
20.2,
Double.NaN,
-10.1,
+2.0,
Double.NaN
}
);
@@ -98,6 +98,8 @@ public Filter toFilter()
dimension,
new DruidPredicateFactory()
{
+private final boolean isNullUnknown = !bloomKFilter.testBytes(null, 0, 0);
+
@Override
public Predicate<String> makeStringPredicate()
{
@@ -165,6 +167,12 @@ public boolean applyNull()
}
};
}

+@Override
+public boolean isNullInputUnknown()
+{
+return isNullUnknown;
+}
},
extractionFn,
filterTuning
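
The hoisted `isNullUnknown` field above probes the bloom filter once for a null entry, and `isNullInputUnknown()` reports whether a null input is genuinely UNKNOWN or a definite match. A standalone sketch of the idea (a `HashSet` stands in for the bloom filter; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a predicate that knows whether null input is UNKNOWN. If the
// set was built containing null, a null input is a definite match and
// must not be treated as unknown by three-valued logic.
class NullAwarePredicateSketch
{
  private final Set<String> values;
  private final boolean nullIsUnknown;

  NullAwarePredicateSketch(Set<String> values)
  {
    this.values = values;
    // Probe once at construction, like the hoisted field in the diff.
    this.nullIsUnknown = !values.contains(null);
  }

  boolean isNullInputUnknown()
  {
    return nullIsUnknown;
  }

  boolean matches(String input, boolean includeUnknown)
  {
    if (input == null) {
      // Definite match if null is a member; otherwise match only when
      // the caller asks to include UNKNOWN rows.
      return !nullIsUnknown || includeUnknown;
    }
    return values.contains(input);
  }

  public static void main(String[] args)
  {
    Set<String> withNull = new HashSet<>();
    withNull.add("a");
    withNull.add(null);

    NullAwarePredicateSketch p = new NullAwarePredicateSketch(withNull);
    System.out.println(p.isNullInputUnknown()); // false: null is a member
    System.out.println(p.matches(null, false)); // true: definite match
  }
}
```
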
@@ -39,6 +39,8 @@
import org.apache.druid.query.lookup.LookupExtractor;
import org.apache.druid.segment.IndexBuilder;
import org.apache.druid.segment.StorageAdapter;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
import org.apache.druid.segment.filter.BaseFilterTest;
import org.apache.druid.segment.incremental.IncrementalIndexSchema;
import org.junit.AfterClass;
@@ -69,24 +71,20 @@ public class BloomDimFilterTest extends BaseFilterTest
)
);

+private static final RowSignature ROW_SIGNATURE = RowSignature.builder()
+.add("dim0", ColumnType.STRING)
+.add("dim1", ColumnType.STRING)
+.add("dim2", ColumnType.STRING)
+.add("dim6", ColumnType.STRING)
+.build();

private static final List<InputRow> ROWS = ImmutableList.of(
-PARSER.parseBatch(ImmutableMap.of(
-"dim0",
-"0",
-"dim1",
-"",
-"dim2",
-ImmutableList.of("a", "b"),
-"dim6",
-"2017-07-25"
-)).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "1", "dim1", "10", "dim2", ImmutableList.of(), "dim6", "2017-07-25"))
-.get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "2", "dim1", "2", "dim2", ImmutableList.of(""), "dim6", "2017-05-25"))
-.get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "3", "dim1", "1", "dim2", ImmutableList.of("a"))).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "4", "dim1", "def", "dim2", ImmutableList.of("c"))).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "5", "dim1", "abc")).get(0)
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "0", "", ImmutableList.of("a", "b"), "2017-07-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "1", "10", ImmutableList.of(), "2017-07-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "2", "2", ImmutableList.of(""), "2017-05-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "3", "1", ImmutableList.of("a")),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "4", "def", ImmutableList.of("c")),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "5", "abc")
);

private static DefaultObjectMapper mapper = new DefaultObjectMapper();