-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rework cursor creation #16533
rework cursor creation #16533
Conversation
changes: * Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * added new interfaces `CursorMaker` and new method `asCursorMaker` to `CursorFactory`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * Deprecated `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * updated all `CursorFactory` implementations to implement `asCursorMaker` * updated all query engines to use `asCursorMaker`
processing/src/main/java/org/apache/druid/segment/CursorFactory.java
Fixed
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/CursorFactory.java
Fixed
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/CursorFactory.java
Fixed
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/QueryableIndexCursorMaker.java
Fixed
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/FilteredStorageAdapter.java
Fixed
Show fixed
Hide fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method needs to be removed as it is never used to pass the intellij-inspections
check in Github Action
processing/src/main/java/org/apache/druid/segment/CursorBuildSpec.java
Outdated
Show resolved
Hide resolved
* CursorMaker no longer handles query granularity directly * Sequence<Cursor> of makeCursors is now just a Cursor from makeCursor * added CursorGranularizer to bucket queries by granularity instead, updated all engines that support query granularity to use this instead (topN still needs some more work) * remove Cursor.getTime * remove implementations of CursorFactory methods that are not asCursorMaker, replacing with defaults that throw exceptions
frameWriter.addSelection(); | ||
in.advance(); | ||
for (; !cursor.isDoneOrInterrupted() && remainingRowsToFetch > 0; remainingRowsToFetch--) { | ||
writer.addSelection(); |
Check warning
Code scanning / CodeQL
Dereferenced variable may be null Warning
writer
this
processing/src/main/java/org/apache/druid/query/rowsandcols/StorageAdapterRowsAndColumns.java
Fixed
Show fixed
Hide fixed
processing/src/test/java/org/apache/druid/segment/virtual/ExpressionVectorSelectorsTest.java
Fixed
Show fixed
Hide fixed
processing/src/test/java/org/apache/druid/segment/virtual/ExpressionVectorSelectorsTest.java
Fixed
Show fixed
Hide fixed
processing/src/test/java/org/apache/druid/segment/virtual/ExpressionVectorSelectorsTest.java
Fixed
Show fixed
Hide fixed
...g/src/test/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapterTest.java
Fixed
Show fixed
Hide fixed
...g/src/test/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapterTest.java
Fixed
Show fixed
Hide fixed
...ti-stage-query/src/main/java/org/apache/druid/msq/querykit/scan/ScanQueryFrameProcessor.java
Fixed
Show fixed
Hide fixed
...ti-stage-query/src/main/java/org/apache/druid/msq/querykit/scan/ScanQueryFrameProcessor.java
Fixed
Show fixed
Hide fixed
processing/src/test/java/org/apache/druid/segment/virtual/ExpressionVectorSelectorsTest.java
Fixed
Show fixed
Hide fixed
in.advance(); | ||
try(final FrameWriter writer = frameWriterFactory.newFrameWriter(columnSelectorFactory)) { | ||
while (!cursor.isDoneOrInterrupted()) { | ||
writer.addSelection(); |
Check warning
Code scanning / CodeQL
Dereferenced variable may be null Warning
writer
this
The change in apache#16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go.
* ScanQueryFrameProcessor: Close CursorHolders as we go along. The change in #16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go. * Fix other call sites. * Fix reference. * Improvements.
…pache#17175) Fixes a mistake introduced in apache#16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone
…17152) * ScanQueryFrameProcessor: Close CursorHolders as we go along. The change in apache#16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go. * Fix other call sites. * Fix reference. * Improvements.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in #16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in #16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter. Co-authored-by: Gian Merlino <[email protected]>
changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533
* topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag
* topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag
* topn with granularity regression fixes (#17565) * topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag * remove unused
Description
This PR reworks
Cursor
andVectorCursor
creation fromCursorFactory
(StorageAdapter
) to be more efficient and push additional details about the query into the process with a newly introducedCursorBuildSpec
. A newCursorHolder
interface has been added that is constructed by another new interface,CursorHolderFactory
with a methodmakeCursorHolder
, which accepts theCursorBuildSpec
and replaces the arguments ofCursorFactory.makeCursors
andCursorFactory.makeVectorCursors
.The primary goal here is to make it easy to push these details down to cursor creation for an upcoming feature under development, projections, which are effectively a type of materialized views that live within segments and will be created at ingestion time. More details on this coming later once I finish writing the proposal, but basically we need to know stuff like what grouping columns, filters, aggregations, etc, are involved in a query in order to be able to automatically select the appropriate projection available within the segment, or using the regular default segment rows if no matching projections are available.
In addition to the primary goal, this refactor also allows cursor creation to be more efficient. Since
CursorHolder
is created from theCursorBuildSpec
, thecanVectorize
,asCursor
, andasVectorCursor
methods can all share and re-use the same resources.Finally, since this PR is introducing new cursor building interfaces, it is also taking the opportunity to downplay the prominence of time ordering, particularly query granularity and its role in cursors.
CursorFactory.makeCursors
previously returned aSequence<Cursor>
, corresponding to query granularity buckets. It has been transitioned to no longer do this inCursorHolder.asCursor
, instead returning a singleCursor
, equivalent to usingALL
granularity with the previous method but without being wrapped in aSequence
. Query granularity in engines is now handled using a newCursorGranularizer
which is equivalent to how the vectorized engine works, simplifying query engines which do not care about query granularity, and making things easier for engines in the future to have segments which are not sorted by time.StorageAdapter
is a@PublicApi
, and extendsCursorFactory
, so a default implementation ofCursorHolderFactory
that is backed by the oldcanVectorize
,makeCursors
, andmakeVectorCursor
methods is provided. That said, all of the built-in query engines and tests have been migrated to usemakeCursorHolder
, allCursorFactory
implementations have been migrated toCursorHolderFactory
, and the old methods ofCursorFactory
have been marked as@Deprecated
, and implementations ofCursorFactory
have been completely removed.summary of changes
CursorBuildSpec
which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments ofCursorFactory.canVectorize
,CursorFactory.makeCursor
, andCursorFactory.makeVectorCursor
CursorHolder
and new interfaceCursorHolderFactory
as a replacement forCursorFactory
, with methodmakeCursorHolder
, which takes aCursorBuildSpec
as an argument and replacesCursorFactory.canVectorize
,CursorFactory.makeCursor
, andCursorFactory.makeVectorCursor
CursorFactory.makeCursors
previously returned aSequence<Cursor>
corresponding to the query granularity buckets, with a separateCursor
per bucket.CursorHolder.asCursor
instead returns a singleCursor
(equivalent to 'ALL' granularity), and a newCursorGranularizer
has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with itsVectorCursorGranularizer
), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.CursorFactory
,CursorFactory.canVectorize
,CursorFactory.makeCursors
, andCursorFactory.makeVectorCursor
StorageAdapter
implementations to implementmakeCursorHolder
, transitioned directCursorFactory
implementations to instead implementCursorMakerFactory
.StorageAdapter
being aCursorMakerFactory
is intended to be a transitional thing, ideally will not be released in favor of movingCursorMakerFactory
to be fetched directly fromSegment
, however this PR was already large enough so this will be done in a follow-up.makeCursorHolder
, granularity based engines to useCursorGranularizer
.Release note
(for developers)
Use notes from #16985 instead
There is a change to theStorageAdapter
interface, which is a 'public' api for extension writers to add additional types of segments to participate in Druids query engines.StorageAdapter
extends theCursorFactory
interface, whose methodscanVectorize
,makeCursors
, andmakeVectorCursor
have been deprecated. This interface is replaced with a newCursorHolderFactory
, whichStorageAdapter
implements, with methodmakeCursorHolder
, which accepts aCursorBuildSpec
and returns a new interfaceCursorHolder
which defines no argument versions ofcanVectorize
,asCursor
, andasVectorCursor
.A default implementation ofmakeCursorHolder
is provided that uses the existing deprecated methods, so no immediate action is needed forStorageAdapter
implementors on upgrade, but implementors should plan in the future to migrate to implementingCursorHolderFactory
directly. Custom query engines can no longer use these deprecated methods, as they have been removed from allStorageAdapter
implementations. UseCursorHolder
instead.CursorHolder.asCursor
is equivalent to callingCursorFactory.getCursors
withALL
granularity. Any custom engines which need to bucket the cursor by query granularity should instead useCursorGranularizer
, which can be used to produce a set of intervals and advance the cursor within those intervals.(for users)
Due to some internal refactoring, TopN queries no longer can use the "pooled" algorithms internally for query granularities other than 'ALL', unless the columns value cardinality is smaller than the internally computed "numValuesPerPass", which is based on processing buffer size and projected aggregator size. This is not completely obvious from a users perspective, so the only indicator might be a performance difference for TopN queries with other query granularities. SQL will only ever plan to a TopN query with 'ALL' granularity so it is not affected.
Key changed/added classes in this PR
CursorHolderFactory
CursorHolder
CursorBuildSpec
CursorGranularizer
Cursor
Offset
CursorFactory
(deprecated)GroupingEngine
(to useCursorHolder
andCursorGranularizer
)TimeseriesQueryEngine
(to useCursorHolder
andCursorGranularizer
)TopNQueryEngine
(to useCursorHolder
andCursorGranularizer
)CursorHolder
CursorFactory
methods now useCursorHolder
This PR has: