-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Peon not fail gracefully #14880
Fix Peon not fail gracefully #14880
Conversation
{ | ||
if (Thread.currentThread().isInterrupted()) { | ||
// clears the interrupted status so the subsequent cleanup work can continue without interruption | ||
Thread.interrupted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exact sure whether this has a wider impact on somewhere else. fyi @kfaraz
// Only certain tasks, primarily from unit tests, are not subclasses of AbstractTask. | ||
if (task instanceof AbstractTask) { | ||
((AbstractTask) task).waitForCleanupToFinish(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why bake that assumption into the code? This class shouldn't be aware of what an AbstractTask
is. If required, you should add a method to Task
interface like waitForTermination
and put the contract in the interface clearly how should that be implemented and how should it be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @abhishekagarwal87 , this PR is ready to be reviewed. Thanks.
1bac985
to
1a30735
Compare
1d8ee8a
to
2ec2fc2
Compare
Hi @abhishekagarwal87 @kfaraz , does this PR look good to you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it wouldn't be better to just call cleanup
from inside stopGracefully
rather than adding a latch and waiting on it.
try { | ||
if (cleanupCompletionLatch != null) { | ||
// block until the cleanup process completes | ||
return cleanupCompletionLatch.await(30, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 30 seconds typically enough time for the cleanup to finish?
I think 5 minutes might be better as the cleanup seems to be doing a bunch of things - update status, update location, push task reports, .
It might also be good to
- log a warning message if the cleanup could not be finished in time and the latch returned
false
. - log an info message for the time taken to finish the cleanup (or is it already being done in
cleanup
method?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to 100 seconds, it won't take that long to finish cleanUp. warning message is also added.
the time taken to shutdown is recorded in SingleTaskBackgroundRunner#stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, typically it would be fast, but in case it isn't, I believe it is better to wait a little longer since this change is going to affect all tasks. The cleanup seems to submit a couple of actions to the Overlord which could be slow in getting processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to 5 mins
indexing-service/src/main/java/org/apache/druid/indexing/common/task/AbstractTask.java
Outdated
Show resolved
Hide resolved
Task can only do |
Don't think the build failure is relevant to my change. |
Sorry for the late reply.
I see your point. Currently, cleanup is called from Ideally the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge this for now. But we should revisit the stopGracefully
method for a more concrete fix later.
* fix Peon not fail gracefully * move methods to Task interface * fix checkstyle * extract to interface * check runThread nullability * fix merge conflict * minor refine * minor refine * fix unit test * increase latch waiting time
Description
During testing of the K8s task runner, peon pods faced challenges in achieving a graceful termination upon receiving a SIGTERM signal. Typically, they would encounter an
InterruptedException
and subsequently fail to push theirstatus.json
. As a result, the Druid console would display a vague "task status not found" message. Such graceful terminations may arise when a pod is manually deleted by K8s or is terminated due to out-of-memory conditions.The primary issues behind this behavior are:
cleanUp
operation.DiscoveryServiceLocator
andOverlordClient
to execute its cleanUp work. However, since they all exist within the same lifecycle scope (NORMAL scope), there's a risk that the dependent services might be stopped before the task'scleanUp
process concludes. To rectify this, we've transitioned theSingleTaskBackgroundRunner
to the SERVER scope, ensuring the task stops prior to its dependent services, allowing for a successful cleanUp.Note: although the issue is from running K8sTaskRunner, but it's a general fix to peon jvms when receiving a sigint (instruction to gracefully stop) that we didn't notice before since it rarely happened within MiddleManager's
Release note
Fix Peon not fail gracefully.
Key changed/added classes in this PR
AbstractTask
addcleanupCompletionLatch
SingleTaskBackgroundRunner
wait for the latch to finish when stopping.CliPeon
moveSingleTaskBackgroundRunner
to the SERVER scope.This PR has: