3.5: Slow Archived Workflows query in Controller during estimation #13382

Closed
4 tasks done
linzhengen opened this issue Jul 23, 2024 · 9 comments · Fixed by #13394
Labels
  • area/controller — Controller issues, panics
  • area/workflow-archive
  • P2 — Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important
  • solution/suggested — A solution to the bug has been suggested. Someone needs to implement it.
  • type/bug
  • type/regression — Regression from previous behavior (a specific type of bug)

Comments

@linzhengen (Contributor) commented Jul 23, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

The following slow-query logs appear frequently in the workflow-controller. After downgrading to v3.4.17, the slow queries no longer occurred.
I have not yet verified v3.5.8, but based on the code differences it seems likely the same issue exists there as well.

  • Query Insights for the Past 7 Days
    • Avg execution time (ms): 1,788.33
    • SQL
      SELECT name, namespace, uid, phase, startedat, finishedat, coalesce((workflow::json)->$4->>$5, $6) as labels, coalesce((workflow::json)->$7->>$8, $9) as annotations, coalesce((workflow::json)->$10->>$11, $12) as progress FROM "argo_archived_workflows" WHERE (("clustername" = $1 AND "instanceid" = $2) AND "namespace" = $3 AND exists (select $13 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = $14 and value = $15) AND exists (select $16 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = $17 and value = $18)) ORDER BY "startedat" DESC LIMIT $19
  • DB: Google Cloud SQL (PostgreSQL)
  • Data Size:
    • argo_archived_workflows: ~1.7 million rows
    • argo_archived_workflows_labels: ~10 million rows
  • workflow-controller-configmap
    • archiveTTL: 90d
workflow-controller 	Query:          SELECT name, namespace, uid, phase, startedat, finishedat, coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels, coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations, coalesce((workflow::json)->'status'->>'progress', '') as progress FROM "argo_archived_workflows" WHERE (("clustername" = $1 AND "instanceid" = $2) AND "namespace" = $3 AND exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/cron-workflow' and value = 'uuid-my-cwf-name') AND exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value = 'Succeeded')) ORDER BY "startedat" DESC LIMIT 1
workflow-controller 	Arguments:      []interface {}{"default", "", "my-namespace"}
workflow-controller 	Stack:
workflow-controller 		fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
workflow-controller 		fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
workflow-controller 		fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1211
workflow-controller 		fmt.Append@/usr/local/go/src/fmt/print.go:289
workflow-controller 		log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
workflow-controller 		log.(*Logger).output@/usr/local/go/src/log/log.go:238
workflow-controller 		log.(*Logger).Print@/usr/local/go/src/log/log.go:260
workflow-controller 		github.com/argoproj/argo-workflows/v3/persist/sqldb.(*workflowArchive).ListWorkflows@/go/src/github.com/argoproj/argo-workflows/persist/sqldb/workflow_archive.go:166
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller/estimation.(*estimatorFactory).NewEstimator@/go/src/github.com/argoproj/argo-workflows/workflow/controller/estimation/estimator_factory.go:79
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).getEstimator@/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2401
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).estimateWorkflowDuration@/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2407
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).markWorkflowPhase@/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2347
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).markWorkflowRunning@/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2441
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate@/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:286
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem@/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:811
workflow-controller 		github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker@/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:734
workflow-controller 		k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1@/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
workflow-controller 		k8s.io/apimachinery/pkg/util/wait.BackoffUntil@/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
workflow-controller 		k8s.io/apimachinery/pkg/util/wait.JitterUntil@/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
workflow-controller 		k8s.io/apimachinery/pkg/util/wait.Until@/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
workflow-controller 		runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1650
workflow-controller 	Error:          upper: slow query
workflow-controller 	Time taken:     2.71444s
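
For anyone who wants to dig into this further, the plan for the slow query can be captured directly in PostgreSQL with EXPLAIN. The sketch below is a minimal example built from the logged query and arguments above; the clustername, instanceid, namespace, and label values are placeholders from this environment and should be substituted with your own.

-- Capture the actual plan and buffer usage for the archived-workflows query.
-- Values are taken from the controller log above; replace them with your own.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  name, namespace, uid, phase, startedat, finishedat,
  coalesce((workflow::json)->'metadata'->>'labels', '{}') AS labels,
  coalesce((workflow::json)->'metadata'->>'annotations', '{}') AS annotations,
  coalesce((workflow::json)->'status'->>'progress', '') AS progress
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND "instanceid" = ''
  AND "namespace" = 'my-namespace'
  AND EXISTS (
    SELECT 1 FROM argo_archived_workflows_labels
    WHERE clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = 'workflows.argoproj.io/cron-workflow'
      AND value = 'uuid-my-cwf-name')
  AND EXISTS (
    SELECT 1 FROM argo_archived_workflows_labels
    WHERE clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = 'workflows.argoproj.io/phase'
      AND value = 'Succeeded')
ORDER BY "startedat" DESC
LIMIT 1;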

Version(s)

v3.5.7

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Referring to the above log

Logs from the workflow controller

Referring to the above log

Logs from your workflow's wait container

Referring to the above log
@agilgur5 changed the title from "Slow Query Occurred in workflow-controller" to "Slow Archived Workflows Query in controller" Jul 23, 2024
@agilgur5 changed the title from "Slow Archived Workflows Query in controller" to "Slow Archived Workflows query in controller" Jul 23, 2024
@agilgur5 added the area/workflow-archive, type/regression, and area/controller labels Jul 23, 2024
@agilgur5

Root cause sounds like a duplicate of #13295

I'm surprised that it impacts the Controller though; it looks like the estimator is listing out archived workflows?

@agilgur5 added the P2 label Jul 23, 2024
@linzhengen (Contributor, Author)

Root cause sounds like a duplicate of #13295

The executed query is different, but it looks like the same cause.

it looks like the estimator is listing out archived workflows?

I think so. Here is the source code.

workflows, err := f.wfArchive.ListWorkflows(
    utils.ListOptions{
        Namespace:         wf.Namespace,
        LabelRequirements: requirements,
        Limit:             1,
        Offset:            0,
    })

@agilgur5

Huh, I didn't know there was a fallback to archived workflows there. TIL.

How often does this slow query happen? Since it's a limit 1, I'm even more surprised it takes that long. Is the Workflow very big?

@linzhengen (Contributor, Author)

How often does this slow query happen?

Looking at Query Insights, it appears to occur roughly as often as workflows are executed.

Is the Workflow very big?

The workflow is not big, just a 2-step workflow.
This issue did not occur in v3.4.17, so I am curious about the differences.

@tooptoop4 (Contributor)

@linzhengen try #13131

@agilgur5

The workflow is not big, just a 2-step workflow

Does it have a big loop or anything? As in, does it produce a large graph?

If not that's bizarre that a 2-step workflow is causing that... 🤔

This issue did not occur in v3.4.17, so I am curious about the differences.

From #13295, it looks like the JSON extraction is not re-used and happens once for each of those fields. That, combined with JSON extraction being slow in general, seems to cause the slowdown.

The list query changed in 3.5 for the combined Workflow List UI+API. The estimator should probably not be using the exact same query; I don't think all of those fields are necessary for it.
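
To make that concrete, a trimmed-down query for estimation could keep the same filters but drop the per-row JSON extraction, selecting only the columns the estimator actually uses. This is only an illustrative sketch of the suggestion, not necessarily what the eventual fix (#13394) implements:

-- Hypothetical estimator-only query: same filters as the logged query, no JSON columns.
-- $1..$4: clustername, instanceid, namespace, and the cron-workflow UID label value.
SELECT name, namespace, uid, phase, startedat, finishedat
FROM "argo_archived_workflows"
WHERE "clustername" = $1
  AND "instanceid" = $2
  AND "namespace" = $3
  AND EXISTS (
    SELECT 1 FROM argo_archived_workflows_labels
    WHERE clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = 'workflows.argoproj.io/cron-workflow'
      AND value = $4)
  AND EXISTS (
    SELECT 1 FROM argo_archived_workflows_labels
    WHERE clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = 'workflows.argoproj.io/phase'
      AND value = 'Succeeded')
ORDER BY "startedat" DESC
LIMIT 1;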

@agilgur5 added the solution/suggested label Jul 23, 2024
@agilgur5 changed the title from "Slow Archived Workflows query in controller" to "3.5: Slow Archived Workflows query in controller" Jul 23, 2024
@linzhengen (Contributor, Author)

So that and the JSON extraction being slow in general seems to cause the slowdown.

Here are sample query plans from Google Cloud SQL Query Insights for the two versions.
In 3.5 there is a high cost in the Nested Loop before the Index Scan, so JSON extraction seems likely to be the cause.

  • v3.4.17

SELECT
  "name",
  "namespace",
  "uid",
  "phase",
  "startedat",
  "finishedat"
FROM
  "argo_archived_workflows"
WHERE
  (("clustername" = $1
      AND "instanceid" = $2)
    AND "namespace" = $3
    AND (EXISTS (
      SELECT
        $4
      FROM
        argo_archived_workflows_labels
      WHERE
        clustername = argo_archived_workflows.clustername
        AND uid = argo_archived_workflows.uid
        AND name = $5
        AND value = $6)
      AND EXISTS (
      SELECT
        $7
      FROM
        argo_archived_workflows_labels
      WHERE
        clustername = argo_archived_workflows.clustername
        AND uid = argo_archived_workflows.uid
        AND name = $8
        AND value = $9)))
ORDER BY
  "startedat" DESC
LIMIT
  $10
(screenshot: v3.4.17 query plan)
  • v3.5.7

SELECT
  name,
  namespace,
  uid,
  phase,
  startedat,
  finishedat,
  COALESCE((workflow::json)->$4->>$5, $6) AS labels,
  COALESCE((workflow::json)->$7->>$8, $9) AS annotations,
  COALESCE((workflow::json)->$10->>$11, $12) AS progress
FROM
  "argo_archived_workflows"
WHERE
  (("clustername" = $1
      AND "instanceid" = $2)
    AND "namespace" = $3
    AND NOT EXISTS (
    SELECT
      $13
    FROM
      argo_archived_workflows_labels
    WHERE
      clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = $14)
    AND EXISTS (
    SELECT
      $15
    FROM
      argo_archived_workflows_labels
    WHERE
      clustername = argo_archived_workflows.clustername
      AND uid = argo_archived_workflows.uid
      AND name = $16
      AND value = $17))
ORDER BY
  "startedat" DESC
LIMIT
  $18
(screenshot: v3.5.7 query plan)

@linzhengen (Contributor, Author) commented Jul 24, 2024

try #13131

Adding an option to skip workflow duration estimation might be one way to work around the issue.

@agilgur5 changed the title from "3.5: Slow Archived Workflows query in controller" to "3.5: Slow Archived Workflows query in Controller during estimation" Jul 24, 2024
linzhengen added commits to linzhengen/argo-workflows that referenced this issue Jul 25–26, 2024
@agilgur5 added this to the v3.5.x patches milestone Jul 27, 2024
linzhengen added a commit to linzhengen/argo-workflows that referenced this issue Jul 29, 2024
agilgur5 pushed a commit that referenced this issue Jul 29, 2024
@agilgur5 commented Jul 29, 2024

try #13131

Adding an option to skip workflow duration estimation might be one way to work around the issue.

I think that's a good feature to have as well, but that PR did not pass CI. If you want to get CI passing in a new PR for that change, I can review it too.

agilgur5 pushed a commit that referenced this issue Jul 30, 2024
…r estimation. Fixes #13382 (#13394)

Signed-off-by: linzhengen <[email protected]>
(cherry picked from commit 7cc20bb)