
Fix Peon not fail gracefully #14880

Merged 10 commits into apache:master on Sep 29, 2023

Conversation

@YongGang (Contributor) commented Aug 20, 2023

Description

During testing of the K8s task runner, peon pods struggled to terminate gracefully upon receiving a SIGTERM signal. Typically, they would encounter an InterruptedException and then fail to push their status.json; as a result, the Druid console would display a vague "task status not found" message. Such terminations may arise when a pod is manually deleted in K8s or is terminated due to out-of-memory conditions.

The primary issues behind this behavior are:

  1. When attempting to halt a task, the thread is interrupted. This interruption inadvertently prevents the task's cleanUp operation from completing.
  2. The task relies on services like DiscoveryServiceLocator and OverlordClient to execute its cleanUp work. However, since they all live within the same lifecycle scope (the NORMAL scope), the dependent services might be stopped before the task's cleanUp process concludes. To rectify this, we've moved the SingleTaskBackgroundRunner to the SERVER scope, ensuring the task stops before its dependent services and can complete its cleanUp (see the ordering sketch below).
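
To make the ordering argument in point 2 concrete, here is a toy sketch of staged startup and reverse-order shutdown. This is plain Java and not Druid's actual Lifecycle API; the stage lists and names are illustrative only.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy model: stages start in order NORMAL -> SERVER and stop in reverse,
// so the SERVER-stage task runner stops while NORMAL-stage services are still up.
public class StagedShutdownSketch
{
  public static void main(String[] args)
  {
    List<String> normalStage = List.of("OverlordClient", "DiscoveryServiceLocator");
    List<String> serverStage = List.of("SingleTaskBackgroundRunner");

    Deque<String> stopOrder = new ArrayDeque<>();
    for (String s : normalStage) {
      System.out.println("start " + s);
      stopOrder.push(s);
    }
    for (String s : serverStage) {
      System.out.println("start " + s);
      stopOrder.push(s);
    }

    // Reverse-order stop: the task runner (and the task's cleanUp) goes first,
    // while the Overlord client and service locator are still reachable.
    while (!stopOrder.isEmpty()) {
      System.out.println("stop " + stopOrder.pop());
    }
  }
}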

Note: although the issue was observed while running the K8sTaskRunner, this is a general fix for peon JVMs receiving a SIGTERM (an instruction to gracefully stop). We hadn't noticed it before because it rarely happened within MiddleManagers.

Release note

Fixed Peons not failing gracefully on receiving a SIGTERM.


Key changed/added classes in this PR
  • In AbstractTask, add a cleanupCompletionLatch (sketched below).
  • In SingleTaskBackgroundRunner, wait for the latch when stopping.
  • In CliPeon, move SingleTaskBackgroundRunner to the SERVER scope.
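
As a rough, self-contained sketch of how the latch handshake between the first two items works (the body below is illustrative and simplified, not the actual code added in this PR):

import java.util.concurrent.CountDownLatch;

// Illustrative only; the real classes are AbstractTask and SingleTaskBackgroundRunner.
abstract class SketchTask
{
  // Counted down once cleanup has finished, so the runner's stop()
  // can block on it instead of racing the task during shutdown.
  protected final CountDownLatch cleanupCompletionLatch = new CountDownLatch(1);

  public final void run() throws Exception
  {
    try {
      runTask();
    }
    finally {
      // If stopGracefully() interrupted the run thread, clear the flag so
      // the cleanup work (status push, report push, ...) is not cut short.
      if (Thread.currentThread().isInterrupted()) {
        Thread.interrupted();
      }
      try {
        cleanUp();
      }
      finally {
        // SingleTaskBackgroundRunner.stop() waits on this latch before
        // letting the rest of the lifecycle shut down.
        cleanupCompletionLatch.countDown();
      }
    }
  }

  protected abstract void runTask() throws Exception;

  protected abstract void cleanUp();
}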

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

if (Thread.currentThread().isInterrupted()) {
  // clears the interrupted status so the subsequent cleanup work can continue without interruption
  Thread.interrupted();
}

@YongGang (Contributor, Author) commented Aug 21, 2023:

I'm not exactly sure whether this has a wider impact somewhere else. FYI @kfaraz

Comment on lines 189 to 192
// Only certain tasks, primarily from unit tests, are not subclasses of AbstractTask.
if (task instanceof AbstractTask) {
  ((AbstractTask) task).waitForCleanupToFinish();
}

A reviewer (Contributor) commented:

Why bake that assumption into the code? This class shouldn't be aware of what an AbstractTask is. If required, you should add a method to the Task interface, like waitForTermination, and state the contract clearly in the interface: how it should be implemented and how it should be called.

@YongGang (Contributor, Author) replied:

Updated.

@YongGang (Contributor, Author) commented:

Hi @abhishekagarwal87, this PR is ready to be reviewed. Thanks.

@YongGang (Contributor, Author) commented:

Hi @abhishekagarwal87 @kfaraz, does this PR look good to you?

@kfaraz (Contributor) left a review comment:

I wonder if it wouldn't be better to just call cleanup from inside stopGracefully rather than adding a latch and waiting on it.

try {
  if (cleanupCompletionLatch != null) {
    // block until the cleanup process completes
    return cleanupCompletionLatch.await(30, TimeUnit.SECONDS);

A reviewer (Contributor) commented:

Is 30 seconds typically enough time for the cleanup to finish?
I think 5 minutes might be better as the cleanup seems to be doing a bunch of things - update status, update location, push task reports, etc.

It might also be good to

  • log a warning message if the cleanup could not be finished in time and the latch returned false.
  • log an info message for the time taken to finish the cleanup (or is it already being done in the cleanup method?)

@YongGang (Contributor, Author) replied Sep 20, 2023:

I updated it to 100 seconds; it won't take that long to finish cleanUp. A warning message is also added.
The time taken to shut down is recorded in SingleTaskBackgroundRunner#stop.

A reviewer (Contributor) replied:

Yes, typically it would be fast, but in case it isn't, I believe it is better to wait a little longer since this change is going to affect all tasks. The cleanup seems to submit a couple of actions to the Overlord which could be slow in getting processed.

@YongGang (Contributor, Author) replied:

Updated to 5 minutes.
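
For reference, a minimal sketch of what the runner-side timed wait with a warning might look like; the class name, logger, and message wording here are illustrative, not the PR's exact code.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Illustrative only; the PR keeps the latch and the wait inside the task/runner classes.
class CleanupWaitSketch
{
  private static final Logger LOG = Logger.getLogger(CleanupWaitSketch.class.getName());

  boolean waitForCleanupToFinish(CountDownLatch cleanupCompletionLatch)
  {
    try {
      // Bounded wait: give cleanUp time to push status and reports, but don't hang forever.
      final boolean finished = cleanupCompletionLatch.await(5, TimeUnit.MINUTES);
      if (!finished) {
        LOG.warning("Task cleanup did not finish in time; the task status may not have been pushed.");
      }
      return finished;
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}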

@YongGang (Contributor, Author) commented:

I wonder if it wouldn't be better to just call cleanup from inside stopGracefully rather than adding a latch and waiting on it.

A task can only do cleanUp once it has stopped running; otherwise we don't know its exact status. For a task like AbstractBatchIndexTask, the thread is interrupted to make it stop running, but the timing of that is not deterministic. Here we use the latch to give the task time to stop and do its cleanUp work.

@YongGang (Contributor, Author) commented:

I don't think the build failure is relevant to my change.

@kfaraz (Contributor) commented Sep 27, 2023:

Sorry for the late reply.

A task can only do cleanUp once it has stopped running; otherwise we don't know its exact status. For a task like AbstractBatchIndexTask, the thread is interrupted to make it stop running, but the timing of that is not deterministic. Here we use the latch to give the task time to stop and do its cleanUp work.

I see your point. Currently, cleanup is called from the run method of the task, either after it has finished running or if an exception occurred while running, whereas stopGracefully is called when we are trying to terminate the task from a different thread.

Ideally, the stopGracefully method itself should have ensured that the cleanup has completed, avoiding the need for a new waitForCleanupToFinish method. But I see that all Task implementations handle stopGracefully in different ways, so fixing that might be more change than necessary right now.

@kfaraz (Contributor) left a review comment:

We can merge this for now. But we should revisit the stopGracefully method for a more concrete fix later.

@suneet-s suneet-s merged commit 86087ce into apache:master Sep 29, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
* fix Peon not fail gracefully

* move methods to Task interface

* fix checkstyle

* extract to interface

* check runThread nullability

* fix merge conflict

* minor refine

* minor refine

* fix unit test

* increase latch waiting time

5 participants