Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve get archived workflow query performance during controller estimation. Fixes #13382 #13394

Merged

Conversation

linzhengen
Copy link
Contributor

@linzhengen linzhengen commented Jul 25, 2024

Fixes #13382

Motivation

Modifications

  • Added a query for during estimation
    • Avoided using SQL JSON extraction as it is slow (removed unnecessary columns)
    • Changed the filter for phase from exists to phase = 'Succeeded'

Verification

SQL Debug Log

2024/07/26 18:32:00 	Session ID:     00001
	Query:          SELECT "name", "namespace", "uid", "startedat", "finishedat" FROM "argo_archived_workflows" WHERE (("clustername" = $1 AND "namespace" = $2 AND "instanceid" = $3) AND "phase" = $4 AND "namespace" = $5 AND exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/cron-workflow' and value = 'hello-world')) ORDER BY "startedat" DESC LIMIT 1
	Arguments:      []interface {}{"default", "argo", "", "Succeeded", "argo"}
	Stack:          
		fmt.(*pp).handleMethods@/Users/linzhengen/.gvm/gos/go1.22.3/src/fmt/print.go:673
		fmt.(*pp).printArg@/Users/linzhengen/.gvm/gos/go1.22.3/src/fmt/print.go:756
		fmt.(*pp).doPrint@/Users/linzhengen/.gvm/gos/go1.22.3/src/fmt/print.go:1209
		fmt.Append@/Users/linzhengen/.gvm/gos/go1.22.3/src/fmt/print.go:289
		log.(*Logger).Print.func1@/Users/linzhengen/.gvm/gos/go1.22.3/src/log/log.go:261
		log.(*Logger).output@/Users/linzhengen/.gvm/gos/go1.22.3/src/log/log.go:238
		log.(*Logger).Print@/Users/linzhengen/.gvm/gos/go1.22.3/src/log/log.go:260
		github.com/argoproj/argo-workflows/v3/persist/sqldb.(*workflowArchive).GetWorkflowForEstimator@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/persist/sqldb/workflow_archive.go:408
		github.com/argoproj/argo-workflows/v3/workflow/controller/estimation.(*estimatorFactory).NewEstimator@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/estimation/estimator_factory.go:78
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).getEstimator@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2411
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).estimateWorkflowDuration@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2417
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).markWorkflowPhase@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2355
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).markWorkflowRunning@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2451
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:293
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:865
		github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:788
		k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157
		k8s.io/apimachinery/pkg/util/wait.BackoffUntil@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158
		k8s.io/apimachinery/pkg/util/wait.JitterUntil@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135
		k8s.io/apimachinery/pkg/util/wait.Until@/Users/linzhengen/.gvm/pkgsets/go1.22.3/global/src/github.com/argoproj/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
		runtime.goexit@/Users/linzhengen/.gvm/gos/go1.22.3/src/runtime/asm_arm64.s:1222
	Time taken:     0.01311s
	Context:        context.Background
EXPLAIN
  • Before
EXPLAIN SELECT
  name,
  namespace,
  uid,
  phase,
  startedat,
  finishedat,
  coalesce((workflow::json) -> 'metadata' ->> 'labels', '{}')      as labels,
  coalesce((workflow::json) -> 'metadata' ->> 'annotations', '{}') as annotations,
  coalesce((workflow::json) -> 'status' ->> 'progress', '')        as progress
FROM
  "argo_archived_workflows"
WHERE
  (
    (
      "clustername" = 'a'
      AND "instanceid" = 'a'
    )
    AND "namespace" = 'a'
    AND exists (
      select
        1
      from
        argo_archived_workflows_labels
      where
        clustername = argo_archived_workflows.clustername
        and uid = argo_archived_workflows.uid
        and name = 'workflows.argoproj.io/cron-workflow'
        and value = 'uuid-my-cwf-name'
    )
    AND exists (
      select
        1
      from
        argo_archived_workflows_labels
      where
        clustername = argo_archived_workflows.clustername
        and uid = argo_archived_workflows.uid
        and name = 'workflows.argoproj.io/phase'
        and value = 'Succeeded'
    )
  )
ORDER BY
  "startedat" DESC
LIMIT
  1
Limit  (cost=18.90..18.90 rows=1 width=119)
  ->  Sort  (cost=18.90..18.90 rows=1 width=119)
        Sort Key: argo_archived_workflows.startedat DESC
        ->  Nested Loop  (cost=1.55..18.89 rows=1 width=119)
              Join Filter: ((argo_archived_workflows_labels.uid)::text = (argo_archived_workflows_labels_1.uid)::text)
              ->  Nested Loop  (cost=0.99..12.54 rows=1 width=164)
                    Join Filter: ((argo_archived_workflows.uid)::text = (argo_archived_workflows_labels.uid)::text)
                    ->  Index Scan using argo_archived_workflows_labels_pkey on argo_archived_workflows_labels  (cost=0.56..6.33 rows=1 width=45)
                          Index Cond: (((clustername)::text = 'a'::text) AND ((name)::text = 'workflows.argoproj.io/cron-workflow'::text))
                          Filter: ((value)::text = 'uuid-my-cwf-name'::text)
                    ->  Index Scan using argo_archived_workflows_i2 on argo_archived_workflows  (cost=0.43..6.20 rows=1 width=127)
                          Index Cond: (((clustername)::text = 'a'::text) AND ((instanceid)::text = 'a'::text))
                          Filter: ((namespace)::text = 'a'::text)
              ->  Index Scan using argo_archived_workflows_labels_pkey on argo_archived_workflows_labels argo_archived_workflows_labels_1  (cost=0.56..6.33 rows=1 width=45)
                    Index Cond: (((clustername)::text = 'a'::text) AND ((name)::text = 'workflows.argoproj.io/phase'::text))
                    Filter: ((value)::text = 'Succeeded'::text)
  • After
EXPLAIN SELECT
  name,
  namespace,
  uid,
  phase,
  startedat,
  finishedat
FROM
  "argo_archived_workflows"
WHERE
  (
    (
      "clustername" = 'a'
      AND "instanceid" = 'a'
    )
    AND "namespace" = 'a'
    AND "phase" = 'Succeeded'
    AND exists (
      select
        1
      from
        argo_archived_workflows_labels
      where
        clustername = argo_archived_workflows.clustername
        and uid = argo_archived_workflows.uid
        and name = 'workflows.argoproj.io/cron-workflow'
        and value = 'uuid-my-cwf-name'
    )
  )
ORDER BY
  "startedat" DESC
LIMIT
  1
Limit  (cost=12.56..12.56 rows=1 width=119)
  ->  Sort  (cost=12.56..12.56 rows=1 width=119)
        Sort Key: argo_archived_workflows.startedat DESC
        ->  Nested Loop  (cost=0.99..12.55 rows=1 width=119)
              Join Filter: ((argo_archived_workflows.uid)::text = (argo_archived_workflows_labels.uid)::text)
              ->  Index Scan using argo_archived_workflows_i2 on argo_archived_workflows  (cost=0.43..6.20 rows=1 width=127)
                    Index Cond: (((clustername)::text = 'a'::text) AND ((instanceid)::text = 'a'::text))
                    Filter: (((namespace)::text = 'a'::text) AND ((phase)::text = 'Succeeded'::text))
              ->  Index Scan using argo_archived_workflows_labels_pkey on argo_archived_workflows_labels  (cost=0.56..6.33 rows=1 width=45)
                    Index Cond: (((clustername)::text = 'a'::text) AND ((name)::text = 'workflows.argoproj.io/cron-workflow'::text))
                    Filter: ((value)::text = 'uuid-my-cwf-name'::text)

@linzhengen linzhengen marked this pull request as ready for review July 25, 2024 11:21
@linzhengen linzhengen force-pushed the fix/improve_archived_api_performance branch 2 times, most recently from 867c8ed to 2b00f0e Compare July 25, 2024 12:40
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variant is a regression to past behavior and doesn't match suggested solutions. See in-line comment below for more details.

persist/sqldb/workflow_archive.go Outdated Show resolved Hide resolved
@linzhengen linzhengen marked this pull request as draft July 26, 2024 04:08
@linzhengen linzhengen force-pushed the fix/improve_archived_api_performance branch from 2b00f0e to cdc7b51 Compare July 26, 2024 07:31
@linzhengen linzhengen changed the title fix: improve archived api performance. Fixes #13382 fix: improve query performance for get archived workflow in controller during estimation. Fixes #13382 Jul 26, 2024
@linzhengen linzhengen force-pushed the fix/improve_archived_api_performance branch from cdc7b51 to fe3085d Compare July 26, 2024 09:47
@linzhengen linzhengen marked this pull request as ready for review July 26, 2024 10:21
@linzhengen linzhengen requested a review from agilgur5 July 26, 2024 10:21
@agilgur5 agilgur5 changed the title fix: improve query performance for get archived workflow in controller during estimation. Fixes #13382 fix: improve get archived workflow query performance during controller estimation. Fixes #13382 Jul 26, 2024
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Jul 26, 2024
@linzhengen linzhengen force-pushed the fix/improve_archived_api_performance branch from fe3085d to 45fce7d Compare July 26, 2024 22:11
@linzhengen
Copy link
Contributor Author

This CI has failed:
https://github.com/argoproj/argo-workflows/actions/runs/10118236440/job/27984839387

For fanout(0:1), the message Pod was active on the node longer than the specified deadline is not being output.
Is this an issue with v1.26.15+k3s1?

Events:
  Type     Reason               Age   From                 Message
  ----     ------               ----  ----                 -------
  Normal   WorkflowRunning      13s   workflow-controller  Workflow Running
  Normal   WorkflowNodeRunning  13s   workflow-controller  Running node active-deadline-fanout-template-level-v2ljb[0]
  Normal   WorkflowNodeRunning  13s   workflow-controller  Running node active-deadline-fanout-template-level-v2ljb
  Normal   WorkflowNodeRunning  8s    workflow-controller  Running node active-deadline-fanout-template-level-v2ljb[0].fanout(3:4): Pod was active on the node longer than the specified deadline
  Warning  WorkflowNodeFailed   8s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb[0].fanout(3:4): Pod was active on the node longer than the specified deadline
  Normal   WorkflowNodeRunning  8s    workflow-controller  Running node active-deadline-fanout-template-level-v2ljb[0].fanout(2:3): Pod was active on the node longer than the specified deadline
  Warning  WorkflowNodeFailed   8s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb[0].fanout(2:3): Pod was active on the node longer than the specified deadline
  Normal   WorkflowNodeRunning  8s    workflow-controller  Running node active-deadline-fanout-template-level-v2ljb[0].fanout(1:2): Pod was active on the node longer than the specified deadline
  Warning  WorkflowNodeFailed   8s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb[0].fanout(1:2): Pod was active on the node longer than the specified deadline
  Warning  WorkflowFailed       3s    workflow-controller  child 'active-deadline-fanout-template-level-v2ljb-1669964074' failed
  Warning  WorkflowNodeFailed   3s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb: child 'active-deadline-fanout-template-level-v2ljb-1669964074' failed
  Warning  WorkflowNodeFailed   3s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb[0]: child 'active-deadline-fanout-template-level-v2ljb-1669964074' failed
  Normal   WorkflowNodeRunning  3s    workflow-controller  Running node active-deadline-fanout-template-level-v2ljb[0].fanout(0:1)
  Warning  WorkflowNodeFailed   3s    workflow-controller  Failed node active-deadline-fanout-template-level-v2ljb[0].fanout(0:1)

@agilgur5
Copy link

Maybe a test flake (although I don't recall that one failing before 🤔), I re-ran that job

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiachengxu Could you help review this?

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Jul 27, 2024
Copy link
Member

@jiachengxu jiachengxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me! I think this is a reasonable mitigation for controller estimation, and for solving the slow archived workflows for API, JSONB migration or materialized view might improve more.

Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some small stylistic comments below, but otherwise this LGTM.

persist/sqldb/workflow_archive.go Outdated Show resolved Hide resolved
persist/sqldb/workflow_archive.go Show resolved Hide resolved
workflow/controller/estimation/estimator_factory.go Outdated Show resolved Hide resolved
…just below GetWorkflow, error log more match the naming

Signed-off-by: linzhengen <[email protected]>
@linzhengen linzhengen force-pushed the fix/improve_archived_api_performance branch from 45fce7d to 77f08af Compare July 29, 2024 07:40
@linzhengen
Copy link
Contributor Author

@agilgur5
Thank you for the review. I agree with all points raised and fixed the code.
Please review again.

And the Windows Unit Tests failed. Could you re-run it?

@linzhengen linzhengen requested a review from agilgur5 July 29, 2024 08:05
@agilgur5
Copy link

And the Windows Unit Tests failed. Could you re-run it?

These aren't required to pass (yet) so no need this time, can merge without them

@agilgur5 agilgur5 merged commit 7cc20bb into argoproj:main Jul 29, 2024
27 of 28 checks passed
@agilgur5
Copy link

Thanks for fixing this!

@linzhengen linzhengen deleted the fix/improve_archived_api_performance branch July 29, 2024 14:18
agilgur5 pushed a commit that referenced this pull request Jul 30, 2024
…r estimation. Fixes #13382 (#13394)

Signed-off-by: linzhengen <[email protected]>
(cherry picked from commit 7cc20bb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3.5: Slow Archived Workflows query in Controller during estimation
4 participants