sql compatible three-valued logic native filters (#15058)
* SQL-compatible three-valued logic for native filters when `druid.expressions.useStrictBooleans=true`, `druid.generic.useDefaultValueForNull=false`, and the new `druid.generic.useThreeValueLogicForNativeFilters=true`
* log.warn if non-default configurations are used, to guide operators toward SQL compliant behavior
clintropolis authored Oct 12, 2023
1 parent 265c811 commit d0f6460
Showing 164 changed files with 4,360 additions and 1,548 deletions.
3 changes: 2 additions & 1 deletion docs/configuration/index.md
@@ -798,8 +798,9 @@ Support for 64-bit floating point columns was released in Druid 0.11.0, so if yo
Prior to version 0.13.0, Druid string columns treated `''` and `null` values as interchangeable, and numeric columns were unable to represent `null` values, coercing `null` to `0`. Druid 0.13.0 introduced a mode which enabled SQL compatible null handling, allowing string columns to distinguish empty strings from nulls, and numeric columns to contain null rows.

|Property|Description|Default|
-|---|---|---|
+|--------|-----------|-------|
|`druid.generic.useDefaultValueForNull`|Set to `false` to store and query data in SQL compatible mode. When set to `true` (legacy mode), `null` values will be stored as `''` for string columns and `0` for numeric columns.|`false`|
+|`druid.generic.useThreeValueLogicForNativeFilters`|Set to `true` to use SQL compatible three-valued logic when processing native Druid filters, when `druid.generic.useDefaultValueForNull=false` and `druid.expressions.useStrictBooleans=true`. When set to `false`, Druid uses two-valued logic for filter processing, even when `druid.generic.useDefaultValueForNull=false` and `druid.expressions.useStrictBooleans=true`. See [Boolean logic](../querying/sql-data-types.md#boolean-logic) for more details.|`true`|
|`druid.generic.ignoreNullsForStringCardinality`|When set to `true`, `null` values will be ignored for the built-in cardinality aggregator over string columns. Set to `false` to include `null` values while estimating cardinality of only string columns using the built-in cardinality aggregator. This setting takes effect only when `druid.generic.useDefaultValueForNull` is set to `true` and is ignored in SQL compatibility mode. Additionally, empty strings (equivalent to null) are not counted when this is set to `true`. |`false`|
This mode does have a storage size and query performance cost, see [segment documentation](../design/segments.md#handling-null-values) for more details.

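For reference, here is a minimal sketch of the SQL compatible mode described by the table above, as the three settings would appear in `runtime.properties` (the values shown are the documented defaults):

```properties
# SQL compatible null handling and three-valued logic (all defaults)
druid.generic.useDefaultValueForNull=false
druid.expressions.useStrictBooleans=true
druid.generic.useThreeValueLogicForNativeFilters=true
```
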
2 changes: 2 additions & 0 deletions docs/querying/filters.md
@@ -33,6 +33,8 @@ sidebar_label: "Filters"
A filter is a JSON object indicating which rows of data should be included in the computation for a query. It’s essentially the equivalent of the WHERE clause in SQL.
Filters are commonly applied on dimensions, but can be applied on aggregated metrics, for example, see [Filtered aggregator](./aggregations.md#filtered-aggregator) and [Having filters](./having.md).

+By default, Druid uses SQL compatible three-valued logic when filtering. See [Boolean logic](./sql-data-types.md#boolean-logic) for more details.

Apache Druid supports the following types of filters.

## Selector filter
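
To illustrate the new default documented above: under three-valued logic, a negated filter no longer matches rows where the inner filter evaluates to UNKNOWN. A hedged sketch using the long-standing `not` and `selector` filter types (the column `status` and value `active` are hypothetical):

```json
{
  "type": "not",
  "field": {
    "type": "selector",
    "dimension": "status",
    "value": "active"
  }
}
```

Under two-valued logic this filter matches rows where `status` is NULL; under three-valued logic those rows evaluate to UNKNOWN and are excluded, mirroring SQL's `status <> 'active'`.
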
12 changes: 6 additions & 6 deletions docs/querying/sql-data-types.md
@@ -152,14 +152,14 @@ values are treated as zeroes. This was the default prior to Druid 28.0.0.

## Boolean logic

-The [`druid.expressions.useStrictBooleans`](../configuration/index.md#expression-processing-configurations)
-runtime property controls Druid's boolean logic mode. For the most SQL compliant behavior, set this to `true` (the default).
+By default, Druid uses [SQL three-valued logic](https://en.wikipedia.org/wiki/Three-valued_logic#SQL) for filter processing
+and boolean expression evaluation. This behavior relies on three settings:

-When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for
-[expressions](math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters.
-However, even in this mode, Druid uses two-valued logic for filter types other than `expression`.
+* [`druid.generic.useDefaultValueForNull`](../configuration/index.md#sql-compatible-null-handling) must be set to `false` (the default): this runtime property allows NULL values to exist in numeric columns and expressions, and lets string typed columns distinguish between NULL and the empty string
+* [`druid.expressions.useStrictBooleans`](../configuration/index.md#expression-processing-configurations) must be set to `true` (the default): this runtime property controls Druid's boolean logic mode for expressions and coerces all expression boolean values to be represented as `1` for true and `0` for false
+* [`druid.generic.useThreeValueLogicForNativeFilters`](../configuration/index.md#sql-compatible-null-handling) must be set to `true` (the default): this runtime property decouples three-valued logic handling from `druid.generic.useDefaultValueForNull` and `druid.expressions.useStrictBooleans`, for backwards compatibility with older versions of Druid that did not fully support SQL compatible null handling

-When `druid.expressions.useStrictBooleans = false` (legacy mode), Druid uses two-valued logic.
+If any of these settings is configured with a non-default value, Druid uses two-valued logic for non-expression filters. Expression filters are controlled independently by `druid.expressions.useStrictBooleans`: when it is set to `false`, Druid uses two-valued logic for expressions.
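
As a plain-Java illustration of the semantics described above (not Druid code; `Boolean` with `null` stands in for SQL's UNKNOWN):

```java
// Minimal Kleene three-valued logic sketch: NOT(UNKNOWN) = UNKNOWN,
// UNKNOWN AND FALSE = FALSE, UNKNOWN OR TRUE = TRUE.
class ThreeValuedLogicSketch
{
  static Boolean not(Boolean a)
  {
    return a == null ? null : !a;
  }

  static Boolean and(Boolean a, Boolean b)
  {
    if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
      return false; // FALSE dominates AND regardless of UNKNOWN
    }
    return (a == null || b == null) ? null : true;
  }

  static Boolean or(Boolean a, Boolean b)
  {
    if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
      return true; // TRUE dominates OR regardless of UNKNOWN
    }
    return (a == null || b == null) ? null : false;
  }

  public static void main(String[] args)
  {
    // A filter only keeps rows whose predicate is definitely TRUE, so both
    // `x = 'a'` and `NOT (x = 'a')` drop rows where x is NULL.
    Boolean xEqualsA = null;                 // NULL = 'a' evaluates to UNKNOWN
    System.out.println(not(xEqualsA));       // null (UNKNOWN): row dropped
    System.out.println(and(xEqualsA, true)); // null (UNKNOWN): row dropped
  }
}
```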

## Nested columns

@@ -19,7 +19,7 @@

package org.apache.druid.segment;

-import com.google.common.base.Predicate;
+import org.apache.druid.query.filter.DruidPredicateFactory;
import org.apache.druid.query.filter.ValueMatcher;
import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
import org.apache.druid.segment.data.IndexedInts;
@@ -55,7 +55,7 @@ public ValueMatcher makeValueMatcher(@Nullable String value)
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
// Map column doesn't match with any string
return false;
@@ -70,12 +70,12 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}

@Override
-public ValueMatcher makeValueMatcher(Predicate<String> predicate)
+public ValueMatcher makeValueMatcher(DruidPredicateFactory predicateFactory)
{
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
return false;
}
@@ -21,6 +21,7 @@

import com.google.common.base.Preconditions;
import com.google.common.base.Predicate;
+import org.apache.druid.query.filter.DruidPredicateFactory;
import org.apache.druid.query.filter.ValueMatcher;
import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
import org.apache.druid.segment.data.IndexedInts;
@@ -68,9 +69,10 @@ public ValueMatcher makeValueMatcher(@Nullable String value)
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
-return Objects.equals(value, getObject());
+final Object rowValue = getObject();
+return (includeUnknown && rowValue == null) || Objects.equals(value, rowValue);
}

@Override
@@ -84,14 +86,17 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
}

@Override
-public ValueMatcher makeValueMatcher(Predicate<String> predicate)
+public ValueMatcher makeValueMatcher(DruidPredicateFactory predicateFactory)
{
+final Predicate<String> predicate = predicateFactory.makeStringPredicate();
return new ValueMatcher()
{
@Override
-public boolean matches()
+public boolean matches(boolean includeUnknown)
{
-return predicate.apply((String) getObject());
+final String rowValue = (String) getObject();
+final boolean matchNull = includeUnknown && predicateFactory.isNullInputUnknown();
+return (matchNull && rowValue == null) || predicate.apply(rowValue);
}

@Override
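
The `includeUnknown` flag threaded through the matchers above is what lets composite filters implement three-valued logic. A self-contained sketch of the composition idea (`SimpleMatcher` and `not` are illustrative names, not Druid's actual classes):

```java
// Sketch: how matches(includeUnknown) composes under negation so that
// UNKNOWN rows are excluded from both a filter and its NOT.
interface SimpleMatcher
{
  boolean matches(boolean includeUnknown);
}

class NotCompositionSketch
{
  static SimpleMatcher not(SimpleMatcher base)
  {
    // Flip the flag: rows the base matcher only matches "optimistically"
    // (as unknowns) must not become definite matches of the negation.
    return includeUnknown -> !base.matches(!includeUnknown);
  }

  public static void main(String[] args)
  {
    final String rowValue = null; // an UNKNOWN input

    // Mirrors the equality matcher in the diff above: unknown rows match
    // only when the caller explicitly asks to include them.
    SimpleMatcher eqFoo = includeUnknown ->
        (includeUnknown && rowValue == null) || "foo".equals(rowValue);

    System.out.println(eqFoo.matches(false));      // false: NULL = 'foo' is UNKNOWN
    System.out.println(not(eqFoo).matches(false)); // false: NOT UNKNOWN is UNKNOWN too
  }
}
```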
@@ -22,6 +22,7 @@
import com.google.common.collect.ImmutableList;
import org.apache.druid.collections.DefaultBlockingPool;
import org.apache.druid.collections.StupidPool;
+import org.apache.druid.common.config.NullHandling;
import org.apache.druid.data.input.MapBasedRow;
import org.apache.druid.jackson.DefaultObjectMapper;
import org.apache.druid.java.util.common.DateTimes;
@@ -34,32 +35,32 @@
import org.apache.druid.query.TableDataSource;
import org.apache.druid.query.aggregation.CountAggregatorFactory;
import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.filter.EqualityFilter;
+import org.apache.druid.query.filter.InDimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
import org.apache.druid.query.groupby.GroupByQuery;
import org.apache.druid.query.groupby.GroupByQueryConfig;
import org.apache.druid.query.groupby.GroupByQueryQueryToolChest;
import org.apache.druid.query.groupby.GroupByQueryRunnerFactory;
import org.apache.druid.query.groupby.GroupingEngine;
import org.apache.druid.query.groupby.ResultRow;
import org.apache.druid.query.spec.MultipleIntervalSegmentSpec;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.incremental.IncrementalIndex;
import org.apache.druid.testing.InitializedNullHandlingTest;
import org.apache.druid.timeline.SegmentId;
import org.junit.Assert;
import org.junit.Before;
-import org.junit.Rule;
import org.junit.Test;
-import org.junit.rules.ExpectedException;

import java.io.IOException;
import java.nio.ByteBuffer;
+import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class MapVirtualColumnGroupByTest extends InitializedNullHandlingTest
{
-@Rule
-public ExpectedException expectedException = ExpectedException.none();

private QueryRunner<ResultRow> runner;

@Before
@@ -132,11 +133,14 @@ public void testWithMapColumn()
null
);

-expectedException.expect(UnsupportedOperationException.class);
-expectedException.expectMessage("Map column doesn't support getRow()");
-runner.run(QueryPlus.wrap(query)).toList();
+Throwable t = Assert.assertThrows(
+UnsupportedOperationException.class,
+() -> runner.run(QueryPlus.wrap(query)).toList()
+);
+Assert.assertEquals("Map column doesn't support getRow()", t.getMessage());
}
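
The migration above replaces JUnit 4's `ExpectedException` rule with `Assert.assertThrows`, which scopes the expected failure to a single lambda and returns the exception for further assertions. A generic sketch of the pattern (test name and message are illustrative):

```java
import org.junit.Assert;
import org.junit.Test;

public class AssertThrowsPatternTest
{
  @Test
  public void testThrowingCall()
  {
    // assertThrows (JUnit 4.13+) fails unless the lambda throws the given
    // type, and hands back the exception so details can be asserted on.
    Throwable t = Assert.assertThrows(
        IllegalStateException.class,
        () -> { throw new IllegalStateException("boom"); }
    );
    Assert.assertEquals("boom", t.getMessage());
  }
}
```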


@Test
public void testWithSubColumn()
{
@@ -166,4 +170,124 @@ public void testWithSubColumn()

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
new EqualityFilter("params.key3", ColumnType.STRING, "value3", null),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected = ImmutableList.of(
new MapBasedRow(
DateTimes.of("2011-01-12T00:00:00.000Z"),
MapVirtualColumnTestBase.mapOf("count", 1L, "params.key3", "value3")
)
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithPredicateFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
new InDimFilter("params.key3", ImmutableList.of("value1", "value3"), null),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected = ImmutableList.of(
new MapBasedRow(
DateTimes.of("2011-01-12T00:00:00.000Z"),
MapVirtualColumnTestBase.mapOf("count", 1L, "params.key3", "value3")
)
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithNotFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
NotDimFilter.of(new EqualityFilter("params.key3", ColumnType.STRING, "value3", null)),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected;
if (NullHandling.sqlCompatible()) {
expected = Collections.emptyList();
} else {
expected = ImmutableList.of(
new MapBasedRow(DateTimes.of("2011-01-12T00:00:00.000Z"), MapVirtualColumnTestBase.mapOf("count", 2L))
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());
}

Assert.assertEquals(expected, result);
}

@Test
public void testWithSubColumnWithNotPredicateFilter()
{
final GroupByQuery query = new GroupByQuery(
new TableDataSource(QueryRunnerTestHelper.DATA_SOURCE),
new MultipleIntervalSegmentSpec(ImmutableList.of(Intervals.of("2011/2012"))),
VirtualColumns.create(ImmutableList.of(new MapVirtualColumn("keys", "values", "params"))),
NotDimFilter.of(new InDimFilter("params.key3", ImmutableList.of("value1", "value3"), null)),
Granularities.ALL,
ImmutableList.of(new DefaultDimensionSpec("params.key3", "params.key3")),
ImmutableList.of(new CountAggregatorFactory("count")),
null,
null,
null,
null,
null
);

final List<ResultRow> result = runner.run(QueryPlus.wrap(query)).toList();
final List<ResultRow> expected;
if (NullHandling.sqlCompatible()) {
expected = Collections.emptyList();
} else {
expected = ImmutableList.of(
new MapBasedRow(DateTimes.of("2011-01-12T00:00:00.000Z"), MapVirtualColumnTestBase.mapOf("count", 2L))
).stream().map(row -> ResultRow.fromLegacyRow(row, query)).collect(Collectors.toList());
}

Assert.assertEquals(expected, result);
}
}
@@ -279,7 +279,7 @@ public void testQuantileOnCastedString()
10.1,
20.2,
Double.NaN,
-10.1,
+2.0,
Double.NaN
}
);
@@ -98,6 +98,8 @@ public Filter toFilter()
dimension,
new DruidPredicateFactory()
{
+private final boolean isNullUnknown = !bloomKFilter.testBytes(null, 0, 0);
+
@Override
public Predicate<String> makeStringPredicate()
{
@@ -165,6 +167,12 @@ public boolean applyNull()
}
};
}

+@Override
+public boolean isNullInputUnknown()
+{
+return isNullUnknown;
+}
},
extractionFn,
filterTuning
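
The hoisted `isNullUnknown` field above probes the bloom filter once for a null entry, and `isNullInputUnknown()` reports whether a null input is genuinely UNKNOWN or a definite match. A standalone sketch of the idea (a `HashSet` stands in for the bloom filter; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a predicate that knows whether null input is UNKNOWN. If the
// set was built containing null, a null input is a definite match and
// must not be treated as unknown by three-valued logic.
class NullAwarePredicateSketch
{
  private final Set<String> values;
  private final boolean nullIsUnknown;

  NullAwarePredicateSketch(Set<String> values)
  {
    this.values = values;
    // Probe once at construction, like the hoisted field in the diff.
    this.nullIsUnknown = !values.contains(null);
  }

  boolean isNullInputUnknown()
  {
    return nullIsUnknown;
  }

  boolean matches(String input, boolean includeUnknown)
  {
    if (input == null) {
      // Definite match if null is a member; otherwise match only when
      // the caller asks to include UNKNOWN rows.
      return !nullIsUnknown || includeUnknown;
    }
    return values.contains(input);
  }

  public static void main(String[] args)
  {
    Set<String> withNull = new HashSet<>();
    withNull.add("a");
    withNull.add(null);

    NullAwarePredicateSketch p = new NullAwarePredicateSketch(withNull);
    System.out.println(p.isNullInputUnknown()); // false: null is a member
    System.out.println(p.matches(null, false)); // true: definite match
  }
}
```
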
@@ -39,6 +39,8 @@
import org.apache.druid.query.lookup.LookupExtractor;
import org.apache.druid.segment.IndexBuilder;
import org.apache.druid.segment.StorageAdapter;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
import org.apache.druid.segment.filter.BaseFilterTest;
import org.apache.druid.segment.incremental.IncrementalIndexSchema;
import org.junit.AfterClass;
@@ -69,24 +71,20 @@ public class BloomDimFilterTest extends BaseFilterTest
)
);

+private static final RowSignature ROW_SIGNATURE = RowSignature.builder()
+.add("dim0", ColumnType.STRING)
+.add("dim1", ColumnType.STRING)
+.add("dim2", ColumnType.STRING)
+.add("dim6", ColumnType.STRING)
+.build();

private static final List<InputRow> ROWS = ImmutableList.of(
-PARSER.parseBatch(ImmutableMap.of(
-"dim0",
-"0",
-"dim1",
-"",
-"dim2",
-ImmutableList.of("a", "b"),
-"dim6",
-"2017-07-25"
-)).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "1", "dim1", "10", "dim2", ImmutableList.of(), "dim6", "2017-07-25"))
-.get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "2", "dim1", "2", "dim2", ImmutableList.of(""), "dim6", "2017-05-25"))
-.get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "3", "dim1", "1", "dim2", ImmutableList.of("a"))).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "4", "dim1", "def", "dim2", ImmutableList.of("c"))).get(0),
-PARSER.parseBatch(ImmutableMap.of("dim0", "5", "dim1", "abc")).get(0)
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "0", "", ImmutableList.of("a", "b"), "2017-07-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "1", "10", ImmutableList.of(), "2017-07-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "2", "2", ImmutableList.of(""), "2017-05-25"),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "3", "1", ImmutableList.of("a")),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "4", "def", ImmutableList.of("c")),
+BaseFilterTest.makeSchemaRow(PARSER, ROW_SIGNATURE, "5", "abc")
);

private static DefaultObjectMapper mapper = new DefaultObjectMapper();