Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework cursor creation #16533

Merged
merged 89 commits into from
Aug 16, 2024
Merged

rework cursor creation #16533

merged 89 commits into from
Aug 16, 2024

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Jun 1, 2024

Description

This PR reworks Cursor and VectorCursor creation from CursorFactory (StorageAdapter) to be more efficient and push additional details about the query into the process with a newly introduced CursorBuildSpec. A new CursorHolder interface has been added that is constructed by another new interface, CursorHolderFactory with a method makeCursorHolder, which accepts the CursorBuildSpec and replaces the arguments of CursorFactory.makeCursors and CursorFactory.makeVectorCursors.

The primary goal here is to make it easy to push these details down to cursor creation for an upcoming feature under development, projections, which are effectively a type of materialized views that live within segments and will be created at ingestion time. More details on this coming later once I finish writing the proposal, but basically we need to know stuff like what grouping columns, filters, aggregations, etc, are involved in a query in order to be able to automatically select the appropriate projection available within the segment, or using the regular default segment rows if no matching projections are available.

In addition to the primary goal, this refactor also allows cursor creation to be more efficient. Since CursorHolder is created from the CursorBuildSpec, the canVectorize, asCursor, and asVectorCursor methods can all share and re-use the same resources.

Finally, since this PR is introducing new cursor building interfaces, it is also taking the opportunity to downplay the prominence of time ordering, particularly query granularity and its role in cursors. CursorFactory.makeCursors previously returned a Sequence<Cursor>, corresponding to query granularity buckets. It has been transitioned to no longer do this in CursorHolder.asCursor, instead returning a single Cursor, equivalent to using ALL granularity with the previous method but without being wrapped in a Sequence. Query granularity in engines is now handled using a new CursorGranularizer which is equivalent to how the vectorized engine works, simplifying query engines which do not care about query granularity, and making things easier for engines in the future to have segments which are not sorted by time.

StorageAdapter is a @PublicApi, and extends CursorFactory, so a default implementation of CursorHolderFactory that is backed by the old canVectorize, makeCursors, and makeVectorCursor methods is provided. That said, all of the built-in query engines and tests have been migrated to use makeCursorHolder, all CursorFactory implementations have been migrated to CursorHolderFactory, and the old methods of CursorFactory have been marked as @Deprecated, and implementations of CursorFactory have been completely removed.

summary of changes

  • Added CursorBuildSpec which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of CursorFactory.canVectorize, CursorFactory.makeCursor, and CursorFactory.makeVectorCursor
  • added new interface CursorHolder and new interface CursorHolderFactory as a replacement for CursorFactory, with method makeCursorHolder, which takes a CursorBuildSpec as an argument and replaces CursorFactory.canVectorize, CursorFactory.makeCursor, and CursorFactory.makeVectorCursor
  • CursorFactory.makeCursors previously returned a Sequence<Cursor> corresponding to the query granularity buckets, with a separate Cursor per bucket. CursorHolder.asCursor instead returns a single Cursor (equivalent to 'ALL' granularity), and a new CursorGranularizer has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its VectorCursorGranularizer), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.
  • Deprecated CursorFactory, CursorFactory.canVectorize, CursorFactory.makeCursors, and CursorFactory.makeVectorCursor
  • updated all StorageAdapter implementations to implement makeCursorHolder, transitioned direct CursorFactory implementations to instead implement CursorMakerFactory. StorageAdapter being a CursorMakerFactory is intended to be a transitional thing, ideally will not be released in favor of moving CursorMakerFactory to be fetched directly from Segment, however this PR was already large enough so this will be done in a follow-up.
  • updated all query engines to use makeCursorHolder, granularity based engines to use CursorGranularizer.

Release note

(for developers)

Use notes from #16985 instead

There is a change to the StorageAdapter interface, which is a 'public' api for extension writers to add additional types of segments to participate in Druids query engines. StorageAdapter extends the CursorFactory interface, whose methods canVectorize, makeCursors, and makeVectorCursor have been deprecated. This interface is replaced with a new CursorHolderFactory, which StorageAdapter implements, with method makeCursorHolder, which accepts a CursorBuildSpec and returns a new interface CursorHolder which defines no argument versions of canVectorize, asCursor, and asVectorCursor.

A default implementation of makeCursorHolder is provided that uses the existing deprecated methods, so no immediate action is needed for StorageAdapter implementors on upgrade, but implementors should plan in the future to migrate to implementing CursorHolderFactory directly. Custom query engines can no longer use these deprecated methods, as they have been removed from all StorageAdapter implementations. Use CursorHolder instead. CursorHolder.asCursor is equivalent to calling CursorFactory.getCursors with ALL granularity. Any custom engines which need to bucket the cursor by query granularity should instead use CursorGranularizer, which can be used to produce a set of intervals and advance the cursor within those intervals.

(for users)
Due to some internal refactoring, TopN queries no longer can use the "pooled" algorithms internally for query granularities other than 'ALL', unless the columns value cardinality is smaller than the internally computed "numValuesPerPass", which is based on processing buffer size and projected aggregator size. This is not completely obvious from a users perspective, so the only indicator might be a performance difference for TopN queries with other query granularities. SQL will only ever plan to a TopN query with 'ALL' granularity so it is not affected.


Key changed/added classes in this PR
  • CursorHolderFactory
  • CursorHolder
  • CursorBuildSpec
  • CursorGranularizer
  • Cursor
  • Offset
  • CursorFactory (deprecated)
  • GroupingEngine (to use CursorHolder and CursorGranularizer)
  • TimeseriesQueryEngine (to use CursorHolder and CursorGranularizer)
  • TopNQueryEngine (to use CursorHolder and CursorGranularizer)
  • All other built-in query engines to use CursorHolder
  • All tests that use CursorFactory methods now use CursorHolder

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors`
* added new interfaces `CursorMaker` and new method `asCursorMaker` to `CursorFactory`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors`
* Deprecated `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors`
* updated all `CursorFactory` implementations to implement `asCursorMaker`
* updated all query engines to use `asCursorMaker`
Copy link
Member

@asdf2014 asdf2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method needs to be removed as it is never used to pass the intellij-inspections check in Github Action

* CursorMaker no longer handles query granularity directly
* Sequence<Cursor> of makeCursors is now just a Cursor from makeCursor
* added CursorGranularizer to bucket queries by granularity instead, updated all engines that support query granularity to use this instead (topN still needs some more work)
* remove Cursor.getTime
* remove implementations of CursorFactory methods that are not asCursorMaker, replacing with defaults that throw exceptions
frameWriter.addSelection();
in.advance();
for (; !cursor.isDoneOrInterrupted() && remainingRowsToFetch > 0; remainingRowsToFetch--) {
writer.addSelection();

Check warning

Code scanning / CodeQL

Dereferenced variable may be null Warning

Variable
writer
may be null at this access as suggested by
this
null guard.
in.advance();
try(final FrameWriter writer = frameWriterFactory.newFrameWriter(columnSelectorFactory)) {
while (!cursor.isDoneOrInterrupted()) {
writer.addSelection();

Check warning

Code scanning / CodeQL

Dereferenced variable may be null Warning

Variable
writer
may be null at this access as suggested by
this
null guard.
@clintropolis clintropolis deleted the cursor-rework branch August 16, 2024 19:46
gianm added a commit to gianm/druid that referenced this pull request Sep 25, 2024
The change in apache#16533 added CursorHolders to the processor-level Closer.
This is problematic when running on an input channel: it means we started
keeping around all CursorHolders for all frames we process and closing them
when the channel is complete, rather than closing them as we go along.

This patch restores the prior behavior, where resources are closed as we go.
gianm added a commit that referenced this pull request Sep 25, 2024
* ScanQueryFrameProcessor: Close CursorHolders as we go along.

The change in #16533 added CursorHolders to the processor-level Closer.
This is problematic when running on an input channel: it means we started
keeping around all CursorHolders for all frames we process and closing them
when the channel is complete, rather than closing them as we go along.

This patch restores the prior behavior, where resources are closed as we go.

* Fix other call sites.

* Fix reference.

* Improvements.
abhishekagarwal87 pushed a commit that referenced this pull request Sep 27, 2024
…17175)

Fixes a mistake introduced in #16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone
clintropolis added a commit to clintropolis/druid that referenced this pull request Sep 27, 2024
…pache#17175)

Fixes a mistake introduced in apache#16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone
clintropolis added a commit that referenced this pull request Sep 30, 2024
…17175) (#17176)

Fixes a mistake introduced in #16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Oct 4, 2024
…17152)

* ScanQueryFrameProcessor: Close CursorHolders as we go along.

The change in apache#16533 added CursorHolders to the processor-level Closer.
This is problematic when running on an input channel: it means we started
keeping around all CursorHolders for all frames we process and closing them
when the channel is complete, rather than closing them as we go along.

This patch restores the prior behavior, where resources are closed as we go.

* Fix other call sites.

* Fix reference.

* Improvements.
gianm added a commit to gianm/druid that referenced this pull request Oct 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in apache#16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.
gianm added a commit that referenced this pull request Nov 6, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in #16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.
@clintropolis clintropolis mentioned this pull request Nov 14, 2024
jtuglu-netflix pushed a commit to jtuglu-netflix/druid that referenced this pull request Nov 20, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in apache#16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.
clintropolis pushed a commit to clintropolis/druid that referenced this pull request Nov 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in apache#16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.
clintropolis pushed a commit to clintropolis/druid that referenced this pull request Nov 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in apache#16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.
clintropolis added a commit that referenced this pull request Nov 24, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which
enables caching of minDataTimestamp and maxDataTimestamp.

Fixes a performance regression introduced in #16533, where these fields
stopped being cached across cursors. Prior to that patch, they were
cached in the QueryableIndexStorageAdapter.

Co-authored-by: Gian Merlino <[email protected]>
clintropolis added a commit to clintropolis/druid that referenced this pull request Dec 13, 2024
changes:
* fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533
* fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533
cryptoe pushed a commit that referenced this pull request Dec 17, 2024
* topn with granularity regression fixes

changes:
* fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533
* fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533

* move defensive check outside of loop

* more test

* extra layer of safety

* move check outside of loop

* fix spelling

* add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all

* add comment, revert IT context changes and add new context flag
clintropolis added a commit to clintropolis/druid that referenced this pull request Dec 17, 2024
* topn with granularity regression fixes

changes:
* fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533
* fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533

* move defensive check outside of loop

* more test

* extra layer of safety

* move check outside of loop

* fix spelling

* add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all

* add comment, revert IT context changes and add new context flag
cryptoe pushed a commit that referenced this pull request Dec 18, 2024
* topn with granularity regression fixes (#17565)

* topn with granularity regression fixes

changes:
* fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533
* fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533

* move defensive check outside of loop

* more test

* extra layer of safety

* move check outside of loop

* fix spelling

* add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all

* add comment, revert IT context changes and add new context flag

* remove unused
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants