Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segregate advance and advanceUninterruptibly flow in postJoinCursor to allow for interrupts in advance #15222

Merged

Conversation

gargvishesh
Copy link
Contributor

@gargvishesh gargvishesh commented Oct 20, 2023

Fixes: #14514

Description

Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled.

With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.

…r by calling baseCursor.advance and baseCursor.advanceInterruptibly resp.
@pranavbhole
Copy link
Contributor

pranavbhole commented Oct 20, 2023

can we add test for this scenario if possible unless it is covered already in some other test cases?

Copy link
Contributor

@soumyava soumyava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an example in CalciteJoinQueryTest of the form that was mentioned in the issue ? Since the PostJoinCursor is used used to wrap around the Unnest Cursor can you also please check if we are unnesting a high cardinality array column or a MVD the problem mentioned in the issue does not appear anymore

@@ -99,15 +108,15 @@ public Filter getPostJoinFilter()
@Override
public void advance()
{
advanceUninterruptibly();
BaseQuery.checkInterrupted();
baseCursor.advance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be good to put some comments as in why this change is being done in the code for a future developer to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

@@ -99,15 +108,15 @@ public Filter getPostJoinFilter()
@Override
public void advance()
{
advanceUninterruptibly();
BaseQuery.checkInterrupted();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a note on why the checkInterrupted() is not needed anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

@soumyava
Copy link
Contributor

I dunno why the checkstyle is failing though. If using IntelliJ it'll be good to add this https://raw.githubusercontent.com/apache/incubator-druid/master/dev/druid_intellij_formatting.xml to IntelliJ

@LakshSingla
Copy link
Contributor

Error:  /home/runner/work/druid/druid/processing/src/main/java/org/apache/druid/segment/join/PostJoinCursor.java:22:8: Unused import - org.apache.druid.query.BaseQuery. [UnusedImports]

My intellij doesn't clean up unused imports automatically. I run mvn checkstyle:checkstyle to confirm that checkstyle is passing. Sometimes deeply nested stuff won't get flagged up in Intellij but will show up as a check style failure.

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add few test cases. LGTM otherwise.

}
}

private void advanceToMatchUninterruptibly()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave a comment here that this can be a long-running CPU call. Which is why advanceUninterruptibly is not directly used in advance() call unlike other cursors. Please link the github issue in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@gargvishesh gargvishesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pranavbhole
I've added a test now to mimic the original issue of postJoinCursor filter condition not matching any tuples coming out of high-cardinality joins, thereby sending it into an uninterruptible, cpu-intensive and long-running call.

@soumyava
For unnest cases, the scenario (with prior implementation), though possible, is less likely due to absence of (join's) output row_count multiplication property, leading to a much faster exhaustion of input if postJoinCursor filter condition is never met. But in principle it can potentially happen with any long-running advanceUninterruptibly() call at any level -- join itself not finding any matching rows for high cardinality tables, for example. Probably maxRowsPerSegment would act as the limit in such cases.

}
}

private void advanceToMatchUninterruptibly()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -99,15 +108,15 @@ public Filter getPostJoinFilter()
@Override
public void advance()
{
advanceUninterruptibly();
BaseQuery.checkInterrupted();
baseCursor.advance();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

@@ -99,15 +108,15 @@ public Filter getPostJoinFilter()
@Override
public void advance()
{
advanceUninterruptibly();
BaseQuery.checkInterrupted();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

@@ -0,0 +1,230 @@
package org.apache.druid.segment.join;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the license header here


cursor.advance();
}
}
Copy link
Contributor

@soumyava soumyava Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to leave a blank line at the end of each file. Not sure though. If checkstyle oks it am good with it

Copy link
Contributor

@soumyava soumyava Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is needed, the checkstyle is complaining. It is good to run a mvn checkstyle:checkstyle beforehand as Laksh have suggested earlier. Another thing I do is in IntelliJ when adding a new section of code have the druid coding style and run a reformat code that fixes most of the codestyling for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall follow. Thanks!

@Override
public boolean isDone()
{
return cursor.isDone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this return false? May not matter but given its infinite cursor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 197 to 199
catch (InterruptedException e) {
throw new RuntimeException(e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : catch clause can be removed and exception declared on method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 193 to 195
sleep(5000);
joinCursorThread.interrupt();
sleep(5000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 10 second test is expensive. Rather, we should use a count latch or something to wait for X records to be processed. The latch wait can have 5 second timeout. Similarly, the 5 second sleep after interrupting should be removed and we can wait in a loop or use count latch again in ExceptionHandler.

I also like https://github.com/awaitility/awaitility/wiki/Usage#simple in case you want to bring that in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with a latch, but thanks for the pointer to this library.

if (exceptionHandler.getException() == null) {
sleep(1);
} else {
assertTrue(exceptionHandler.getException() instanceof QueryInterruptedException);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should just return the call here and also add assertFalse at the end of the method. So we are checking that we did get QueryInterruptedException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done - thanks.

@abhishekagarwal87 abhishekagarwal87 added this to the 28.0 milestone Oct 30, 2023
@gargvishesh gargvishesh changed the title Segregate advance and advanceUninterruptibly flow in postJoinCursor Segregate advance and advanceUninterruptibly flow in postJoinCursor to allow for interrupts in advance Oct 30, 2023
@abhishekagarwal87 abhishekagarwal87 merged commit a27598a into apache:master Oct 30, 2023
82 checks passed
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Oct 30, 2023
…o allow for interrupts in advance (apache#15222)

Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled.

With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.
AmatyaAvadhanula pushed a commit that referenced this pull request Nov 2, 2023
…o allow for interrupts in advance (#15222) (#15278)

Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled.

With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.

Co-authored-by: Vishesh Garg <[email protected]>
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
…o allow for interrupts in advance (apache#15222)

Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled.

With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The class PostJoinCursor causes the CPU to spike to 100% and cannot be terminated.
5 participants