
Fix Peon not fail gracefully #14880

Merged 10 commits into apache:master on Sep 29, 2023

Conversation

@YongGang (Contributor) commented Aug 20, 2023

Description

During testing of the K8s task runner, peon pods struggled to terminate gracefully upon receiving a SIGTERM signal. Typically, they would encounter an InterruptedException and then fail to push their status.json; as a result, the Druid console would display a vague "task status not found" message. Such terminations may arise when a pod is manually deleted in K8s or is terminated due to out-of-memory conditions.

The primary issues behind this behavior are:

  1. When attempting to halt a task, the thread is interrupted. This interruption inadvertently prevents the task's cleanUp operation from completing.
  2. The task relies on services like DiscoveryServiceLocator and OverlordClient to execute its cleanUp work. However, since they all live within the same lifecycle scope (the NORMAL scope), the dependent services might be stopped before the task's cleanUp process concludes. To rectify this, we've moved the SingleTaskBackgroundRunner to the SERVER scope, ensuring the task stops before its dependent services and can complete its cleanUp (see the ordering sketch below).
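
To make the ordering argument in point 2 concrete, here is a toy sketch of staged startup and reverse-order shutdown. This is plain Java and not Druid's actual Lifecycle API; the stage lists and names are illustrative only.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy model: stages start in order NORMAL -> SERVER and stop in reverse,
// so the SERVER-stage task runner stops while NORMAL-stage services are still up.
public class StagedShutdownSketch
{
  public static void main(String[] args)
  {
    List<String> normalStage = List.of("OverlordClient", "DiscoveryServiceLocator");
    List<String> serverStage = List.of("SingleTaskBackgroundRunner");

    Deque<String> stopOrder = new ArrayDeque<>();
    for (String s : normalStage) {
      System.out.println("start " + s);
      stopOrder.push(s);
    }
    for (String s : serverStage) {
      System.out.println("start " + s);
      stopOrder.push(s);
    }

    // Reverse-order stop: the task runner (and the task's cleanUp) goes first,
    // while the Overlord client and service locator are still reachable.
    while (!stopOrder.isEmpty()) {
      System.out.println("stop " + stopOrder.pop());
    }
  }
}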

Note: although the issue was observed while running the K8sTaskRunner, this is a general fix for peon JVMs receiving a SIGTERM (an instruction to gracefully stop). We hadn't noticed it before because it rarely happened within MiddleManagers.

Release note

Fixed Peons not failing gracefully on receiving a SIGTERM.


Key changed/added classes in this PR
  • In AbstractTask, add a cleanupCompletionLatch (sketched below).
  • In SingleTaskBackgroundRunner, wait for the latch when stopping.
  • In CliPeon, move SingleTaskBackgroundRunner to the SERVER scope.
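
As a rough, self-contained sketch of how the latch handshake between the first two items works (the body below is illustrative and simplified, not the actual code added in this PR):

import java.util.concurrent.CountDownLatch;

// Illustrative only; the real classes are AbstractTask and SingleTaskBackgroundRunner.
abstract class SketchTask
{
  // Counted down once cleanup has finished, so the runner's stop()
  // can block on it instead of racing the task during shutdown.
  protected final CountDownLatch cleanupCompletionLatch = new CountDownLatch(1);

  public final void run() throws Exception
  {
    try {
      runTask();
    }
    finally {
      // If stopGracefully() interrupted the run thread, clear the flag so
      // the cleanup work (status push, report push, ...) is not cut short.
      if (Thread.currentThread().isInterrupted()) {
        Thread.interrupted();
      }
      try {
        cleanUp();
      }
      finally {
        // SingleTaskBackgroundRunner.stop() waits on this latch before
        // letting the rest of the lifecycle shut down.
        cleanupCompletionLatch.countDown();
      }
    }
  }

  protected abstract void runTask() throws Exception;

  protected abstract void cleanUp();
}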

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

if (Thread.currentThread().isInterrupted()) {
  // clears the interrupted status so the subsequent cleanup work can continue without interruption
  Thread.interrupted();
}

@YongGang (Contributor, Author) commented Aug 21, 2023:

I'm not exactly sure whether this has a wider impact somewhere else. FYI @kfaraz

Comment on lines 189 to 192
// Only certain tasks, primarily from unit tests, are not subclasses of AbstractTask.
if (task instanceof AbstractTask) {
  ((AbstractTask) task).waitForCleanupToFinish();
}

A reviewer (Contributor) commented:

Why bake that assumption into the code? This class shouldn't be aware of what an AbstractTask is. If required, you should add a method to the Task interface, like waitForTermination, and state the contract clearly in the interface: how it should be implemented and how it should be called.

@YongGang (Contributor, Author) replied:

Updated.

@YongGang (Contributor, Author) commented:

Hi @abhishekagarwal87, this PR is ready to be reviewed. Thanks.

@YongGang (Contributor, Author) commented:

Hi @abhishekagarwal87 @kfaraz, does this PR look good to you?

@kfaraz (Contributor) left a review comment:

I wonder if it wouldn't be better to just call cleanup from inside stopGracefully rather than adding a latch and waiting on it.

try {
  if (cleanupCompletionLatch != null) {
    // block until the cleanup process completes
    return cleanupCompletionLatch.await(30, TimeUnit.SECONDS);

A reviewer (Contributor) commented:

Is 30 seconds typically enough time for the cleanup to finish?
I think 5 minutes might be better as the cleanup seems to be doing a bunch of things - update status, update location, push task reports, etc.

It might also be good to

  • log a warning message if the cleanup could not be finished in time and the latch returned false.
  • log an info message for the time taken to finish the cleanup (or is it already being done in the cleanup method?)

@YongGang (Contributor, Author) replied Sep 20, 2023:

I updated it to 100 seconds; it won't take that long to finish cleanUp. A warning message is also added.
The time taken to shut down is recorded in SingleTaskBackgroundRunner#stop.

A reviewer (Contributor) replied:

Yes, typically it would be fast, but in case it isn't, I believe it is better to wait a little longer since this change is going to affect all tasks. The cleanup seems to submit a couple of actions to the Overlord which could be slow in getting processed.

@YongGang (Contributor, Author) replied:

Updated to 5 minutes.
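
For reference, a minimal sketch of what the runner-side timed wait with a warning might look like; the class name, logger, and message wording here are illustrative, not the PR's exact code.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Illustrative only; the PR keeps the latch and the wait inside the task/runner classes.
class CleanupWaitSketch
{
  private static final Logger LOG = Logger.getLogger(CleanupWaitSketch.class.getName());

  boolean waitForCleanupToFinish(CountDownLatch cleanupCompletionLatch)
  {
    try {
      // Bounded wait: give cleanUp time to push status and reports, but don't hang forever.
      final boolean finished = cleanupCompletionLatch.await(5, TimeUnit.MINUTES);
      if (!finished) {
        LOG.warning("Task cleanup did not finish in time; the task status may not have been pushed.");
      }
      return finished;
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}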

@YongGang (Contributor, Author) commented:

I wonder if it wouldn't be better to just call cleanup from inside stopGracefully rather than adding a latch and waiting on it.

A task can only do cleanUp once it has stopped running; otherwise we don't know its exact status. For a task like AbstractBatchIndexTask, the thread is interrupted to make it stop running, but the timing of that is not deterministic. Here we use the latch to give the task time to stop and do its cleanUp work.

@YongGang (Contributor, Author) commented:

I don't think the build failure is relevant to my change.

@kfaraz (Contributor) commented Sep 27, 2023:

Sorry for the late reply.

A task can only do cleanUp once it has stopped running; otherwise we don't know its exact status. For a task like AbstractBatchIndexTask, the thread is interrupted to make it stop running, but the timing of that is not deterministic. Here we use the latch to give the task time to stop and do its cleanUp work.

I see your point. Currently, cleanup is called from the run method of the task, either after it has finished running or if an exception occurred while running, whereas stopGracefully is called when we are trying to terminate the task from a different thread.

Ideally, the stopGracefully method itself should have ensured that the cleanup has completed, avoiding the need for a new waitForCleanupToFinish method. But I see that all Task implementations handle stopGracefully in different ways, so fixing that might be more change than necessary right now.

@kfaraz (Contributor) left a review comment:

We can merge this for now. But we should revisit the stopGracefully method for a more concrete fix later.

@suneet-s suneet-s merged commit 86087ce into apache:master Sep 29, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
* fix Peon not fail gracefully

* move methods to Task interface

* fix checkstyle

* extract to interface

* check runThread nullability

* fix merge conflict

* minor refine

* minor refine

* fix unit test

* increase latch waiting time

5 participants