fix: controller: ensure the workflow reconciles task results properly when failing to receive timely updates from the API server #14026

Open · wants to merge 1 commit into main
Conversation

@bom-d-van (Contributor) commented Dec 23, 2024:

There seems to be no related open issue, so I'm just submitting a PR; it's related to #12537.

Motivation

Error scenario: a pod for a step in a workflow has completed, and its task result has been properly created and finalized by its wait container (judging from the exit status of the wait container); however, the task result informer in the controller leader has not received any updates about it (due to an overloaded API server or etcd).

Currently, the Argo workflow controller doesn't handle the above scenario properly: it marks the workflow node succeeded and shows no artifact outputs (even though they have already been uploaded to the repository).

We ran into this situation in our production instance (running v3.5.8).

It's not easy to reproduce this problem, but a manual fault injection in `workflow/controller/taskresult.go:func (woc *wfOperationCtx) taskResultReconciliation()` can simulate the situation, and I did reproduce the issue on release v3.6.2:

+++ workflow/controller/taskresult.go
@@ -1,7 +1,9 @@
 package controller

 import (
+	"os"
 	"reflect"
+	"strings"
 	"time"

 	log "github.com/sirupsen/logrus"
@@ -62,6 +64,12 @@ func (woc *wfOperationCtx) taskResultReconciliation() {
 	objs, _ := woc.controller.taskResultInformer.GetIndexer().ByIndex(indexes.WorkflowIndex, woc.wf.Namespace+"/"+woc.wf.Name)
 	woc.log.WithField("numObjs", len(objs)).Info("Task-result reconciliation")

+	if strings.Contains(woc.wf.Name, "-xhu-debug-") {
+		if _, err := os.Stat("/tmp/xhu-debug-control"); err != nil {
+			return
+		}
+	}

Modifications

The change forcefully marks the workflow as having an incomplete TaskResult in `assessNodeStatus`.
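
For illustration, a minimal sketch of the guard this description refers to (not the PR's actual diff): the function name and the `taskResultObserved` input are hypothetical, and `MarkTaskResultIncomplete` is assumed to be the `WorkflowStatus` helper that records the incompleteness (its counterpart `IsTaskResultIncomplete` is quoted later in this thread).

```go
package controller

import (
	apiv1 "k8s.io/api/core/v1"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// markPendingTaskResult sketches the idea: when a pod has succeeded but the
// controller's informer has not yet delivered its WorkflowTaskResult
// (taskResultObserved would be derived from the informer cache), the node's
// task result is forcibly flagged incomplete so the workflow is not declared
// finished without the node's outputs.
func markPendingTaskResult(wfStatus *wfv1.WorkflowStatus, node *wfv1.NodeStatus, pod *apiv1.Pod, taskResultObserved bool) {
	if pod == nil || pod.Status.Phase != apiv1.PodSucceeded {
		return // only the successful-pod case is covered by this fix
	}
	if taskResultObserved {
		return // the informer has caught up; nothing to do
	}
	// Assumed WorkflowStatus helper: the controller's completion logic treats
	// nodes with incomplete task results as not finished yet.
	wfStatus.MarkTaskResultIncomplete(node.ID)
}
```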

This fix doesn't handle the case where a pod failed: there are too many potential failure scenarios (e.g. the wait container might not be able to insert a task result), and a retry is probably needed when there are failures. Besides, the loss for a failed pod is probably not as great as for a successful one.

Verification

It's covered by the updated test case, and my manual verification details are included below (using the fault injection above).

Test workflow to verify the fix (its generateName `test-xhu-debug-` matches the `-xhu-debug-` substring checked by the injected code, so task-result reconciliation stays disabled until `/tmp/xhu-debug-control` is created in the controller container):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-xhu-debug-
spec:
  entrypoint: bash-script-example
  activeDeadlineSeconds: 3600
  templates:
  - name: bash-script-example
    steps:
    - - name: generate
        template: gen-random-int-bash
    - - name: print
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate.outputs.result}}"  # The result of the here-script

  - name: gen-random-int-bash
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art
        path: /tmp/hello_world.txt
    script:
      image: reg.deeproute.ai/deeproute-public/nicolaka/netshoot:v0.13
      command: [bash]
      source: |                                         # Contents of the here-script
        touch /tmp/hello_world.txt
        cat /dev/urandom | od -N2 -An -i | awk -v f=1 -v r=100 '{printf "%i\n", f + r * $1 / 65536}'

  - name: print-message
    inputs:
      parameters:
      - name: message
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art2
        path: /tmp/hello_world.txt
    container:
      image: reg.deeproute.ai/deeproute-public/nicolaka/netshoot:v0.13
      command: [sh, -c]
      args: ["touch /tmp/hello_world.txt; echo result was: {{inputs.parameters.message}}"]

With the manual fault injection, without the fix:

(screenshot)

With the manual fault injection, with the fix:

(screenshot)

bom-d-van force-pushed the controller/reconciliation-of-delayed-task-result-public branch from 63cd4c4 to 747f1e6 on December 23, 2024 07:05
bom-d-van marked this pull request as ready for review on December 23, 2024 07:35
@tczhao (Member) left a comment:
Great finding.
A delay in the taskresult informer will result in the issue you are seeing.
However, if you experience k8s API pressure and the retry inside the informer fails, you will see a much bigger problem: e.g. Argo uses the cache from the informer, and the k8s API pressure may result in this cache being out of sync; it will never re-sync until you restart the controller.

@shuangkun has more context related to this change, so I will ask him to review instead.

Comment on lines +1452 to +1455
// this fix doesn't handle the case when a pod failed, there are too many
// potentially failure scenarios (like the wait container might not be able to
// insert a task result). plus, a retry is probably needed when there are
// failures. the loss is probably not as great as a successful one.
A Member commented:
We have retries for transient errors in the wait container to handle this:

func (we *WorkflowExecutor) reportResult(ctx context.Context, result wfv1.NodeResult) error {
	return retryutil.OnError(wait.Backoff{
		Duration: time.Second,
		Factor:   2,
		Jitter:   0.1,
		Steps:    5,
		Cap:      30 * time.Second,
@tczhao requested a review from shuangkun on December 25, 2024 05:36
@bom-d-van (Contributor, Author) commented Dec 25, 2024:

> Great finding. A delay in the taskresult informer will result in the issue you are seeing. However, if you experience k8s API pressure and the retry inside the informer fails, you will see a much bigger problem: e.g. Argo uses the cache from the informer, and the k8s API pressure may result in this cache being out of sync; it will never re-sync until you restart the controller.
>
> @shuangkun has more context related to this change, so I will ask him to review instead.

Yep, but in a way, a controller restart could help us resolve the issue when things go terribly wrong.
Arguably, at least for us, it's still better than having partially succeeded workflows.

@shuangkun (Member) commented:
It seems that the problem of output not being displayed cannot be solved

@bom-d-van (Contributor, Author) commented:
> It seems that the problem of output not being displayed cannot be solved

That's strange; I don't have this problem in my simulation. Once the controller was able to receive the task result from the API server, everything just continued to work as normal.

Could you share your argo-workflows setup and test workflow configs? @shuangkun

@shuangkun (Member) commented Jan 15, 2025:

If the pod has been successfully created, but the taskresult has not been updated by the watch, and TaskResultStatus is marked true, will the previously fixed problem (#12537) occur again? When the next task references the output of the previous step, the node information has not yet been updated from the taskresults, and the reference will report an error.

@shuangkun (Member) commented:
Here is a check that avoids starting the next new pod when the taskresult has not been updated:

if newState.Succeeded() && woc.wf.Status.IsTaskResultIncomplete(node.ID) {

@isubasinghe (Member) commented:
I think we need a rethink of how we handle taskresults/pod statuses together.

We (as open source contributors, not assigning any blame to anyone here) keep putting out these fires, but we should solve this through some other means.

Solving this properly is probably only achievable in a larger time frame and as a result it probably makes sense to go ahead with this change.

I'm mostly saying this because we hit another bug relating to a similar problem.

I'm currently working on a state machine in TLA+ (well, PlusCal) to model this problem; after this is done, we should extract an algorithm from this state machine and implement it in the controller.

@isubasinghe (Member) commented:
If anyone wants to work on the TLA+ spec with me, let me know.

@bom-d-van (Contributor, Author) commented Jan 16, 2025:

> If the pod has been successfully created, but the taskresult has not been updated by the watch, and TaskResultStatus is marked true, will the previously fixed problem (#12537) occur again?

@shuangkun Should be good. This is actually fixing one more edge case when checking for the task result (at least the unit tests still pass properly).

@bom-d-van (Contributor, Author) commented Jan 16, 2025:

> I think we need a rethink of how we handle taskresults/pod statuses together.
>
> We (as open source contributors, not assigning any blame to anyone here) keep putting out these fires, but we should solve this through some other means.
>
> Solving this properly is probably only achievable in a larger time frame and as a result it probably makes sense to go ahead with this change.
>
> I'm mostly saying this because we hit another bug relating to a similar problem.
>
> I'm currently working on a state machine in TLA+ (well, PlusCal) to model this problem; after this is done, we should extract an algorithm from this state machine and implement it in the controller.

@isubasinghe In a way, maybe not using the TaskResult CRD as the middleman for task result reconciliation would work better. Would having the argoexec/wait container talk directly to the controller be better?

@shuangkun (Member) commented:
> > I think we need a rethink of how we handle taskresults/pod statuses together.
> > We (as open source contributors, not assigning any blame to anyone here) keep putting out these fires, but we should solve this through some other means.
> > Solving this properly is probably only achievable in a larger time frame and as a result it probably makes sense to go ahead with this change.
> > I'm mostly saying this because we hit another bug relating to a similar problem.
> > I'm currently working on a state machine in TLA+ (well, PlusCal) to model this problem; after this is done, we should extract an algorithm from this state machine and implement it in the controller.
>
> @isubasinghe In a way, maybe not using the TaskResult CRD as the middleman for task result reconciliation would work better. Would having the argoexec/wait container talk directly to the controller be better?

TaskResults were probably introduced for security reasons (dropping the pod patch permission) and for transmitting complex outputs. Is there a better cloud-native approach?
