
Memory (RAM) increases exponentially when submitting a large number of workflows and is not cleared even after the workflows complete #14084

JishinJames opened this issue Jan 15, 2025 · 6 comments
Labels
area/controller Controller issues, panics type/bug

Comments

@JishinJames

JishinJames commented Jan 15, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Description:

Memory (RAM) increases exponentially when submitting a large number of workflows and is not cleared even after the workflows complete.

Environment:

Argo Workflows version: 3.5.12
Parallel workflows: 6000+

What happened:

The Argo Workflows Controller's memory consumption increases exponentially, and under this excessive memory usage the controller crashes frequently. It does not log any specific error messages before these crashes, which makes it hard to pinpoint the underlying cause.
We therefore profiled the memory of the controller pod and ran a few experiments to check whether the problem could be mitigated.
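Roughly how the profiles were collected (a sketch only; it assumes the controller's Go pprof endpoints are reachable, and the port used here is an assumption, so adjust it to however pprof is exposed in your deployment):

# Forward the controller's pprof port locally (port number is an assumption)
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &

# Inspect the heap profile interactively
go tool pprof http://localhost:6060/debug/pprof/heap

# A goroutine profile can be pulled the same way
go tool pprof http://localhost:6060/debug/pprof/goroutine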

How to reproduce it (as minimally and precisely as possible):
1. Set up an environment with 300+ nodes.
2. Launch 5000+ workflows in parallel (see the sketch below).
3. Monitor the RAM usage of the Argo Workflows Controller and note any unexpected crashes.
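For example, a loop along these lines can be used to generate the load (a sketch; it assumes the manifest below is saved as test-workflow.yaml and uses generateName instead of a fixed name so the submissions do not collide, and that the controller pod carries the app=workflow-controller label):

# Submit 5000 copies of the test workflow
for i in $(seq 1 5000); do
  kubectl -n argo-dev create -f test-workflow.yaml
done

# Watch the controller's memory while the workflows run
kubectl -n argo top pod -l app=workflow-controller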

[screenshot attached]

Version(s)

v3.5.8, v3.5.10, v3.5.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-workflow
spec:
  templates:
    - name: execution
      inputs: {}
      outputs: {}
      metadata: {}
      steps:
        - - name: generate-file-test
            template: generate-file
            arguments: {}
        - - name: read-file-test
            template: read-file
            arguments:
              artifacts:
                - name: generated-file
                  from: >-
                    {{steps.generate-file-test.outputs.artifacts.generated-file}}
        - - name: remove-file-test
            template: generate-and-remove-file
            arguments: {}
    - name: generate-file
      inputs: {}
      outputs:
        artifacts:
          - name: generated-file
            path: /data/test-file.txt
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - >
            mkdir -p /data;

            echo "This is a test file created by Argo Workflow" >
            /data/test-file.txt;

            cat /data/test-file.txt
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
    - name: read-file
      inputs:
        artifacts:
          - name: generated-file
            path: /data/test-file.txt
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - |
            echo "Getting file from Volume"
            ls -al /data
            cat /data/test-file.txt
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
    - name: generate-and-remove-file
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - |
            mkdir -p /data;
            export FILE_PATH='/data/test-file.txt'
            echo "This is a test file created by Argo Workflow" > $FILE_PATH;
            cat $FILE_PATH

            if [ ! -f $FILE_PATH ]; then
              echo "File $FILE_PATH does not exist"
              exit 1
            else
              rm -f $FILE_PATH
              echo "File has been removed."
              ls -al /data
            fi
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
  entrypoint: execution
  arguments:
    parameters:
      - name: wf-message
        value: This is a testing workflow on CPU node!
  volumes:
    - name: temp-storage
      emptyDir: {}
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

time="2025-01-15T11:10:57Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2025-01-15T11:10:57Z" level=info msg="cron config" cronSyncPeriod=10s
time="2025-01-15T11:10:57Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2025-01-15T11:10:57.406Z" level=warning msg="Non-transient error: <nil>"
time="2025-01-15T11:10:57.414Z" level=warning msg="Non-transient error: <nil>"
W0115 11:10:58.545737       1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed
time="2025-01-15T11:10:59.186Z" level=error msg="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"delete-wrong-ids\" in namespace \"argo-image-search\"" conditionType=SpecError namespace=argo-image-search workflow=delete-wrong-ids-dev
time="2025-01-15T11:10:59.186Z" level=error msg="invalid cron workflow" cronWorkflow=argo-image-search/delete-wrong-ids-dev error="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"delete-wrong-ids\" in namespace \"argo-image-search\""
time="2025-01-15T11:10:59.389Z" level=error msg="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"test-lucene-filter-workflow\" in namespace \"argo-image-search\"" conditionType=SpecError namespace=argo-image-search workflow=test-lucene-filter-workflow
time="2025-01-15T11:10:59.389Z" level=error msg="invalid cron workflow" cronWorkflow=argo-image-search/test-lucene-filter-workflow error="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"test-lucene-filter-workflow\" in namespace \"argo-image-search\""
time="2025-01-15T11:14:24.354Z" level=error msg="unable to substitute parameters for metric 'after_executed': failed to resolve {{workflow.failures}}" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.361Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.362Z" level=error msg="was unable to obtain node for stress-testing-2ngbw"
time="2025-01-15T11:14:24.362Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-1482424255"
time="2025-01-15T11:14:24.395Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.396Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3197284649"
time="2025-01-15T11:14:24.581Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.581Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-273366853"
time="2025-01-15T11:14:24.641Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.641Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3098736065"
time="2025-01-15T11:14:24.707Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.708Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3285503829"
time="2025-01-15T11:14:24.778Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.778Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3101850513"
time="2025-01-15T11:14:24.864Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.864Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3403075349"
time="2025-01-15T11:14:24.927Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.927Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3835149785"
time="2025-01-15T11:14:24.991Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t

time="2025-01-15T11:14:34.437Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-1842043933"

Logs from in your workflow's wait container

no logs in wait container
@tczhao
Member

tczhao commented Jan 15, 2025

We have a different usage pattern: not that many workflows (10 at a time), but each workflow is gigantic. Our controller memory increases a lot during a workflow run but drops back down afterwards.

I don't have any comment on how much memory a controller should use. However, on the topic of memory not clearing: have you configured ttlStrategy or archiving? If a workflow is not cleared from etcd, it will persist in both etcd and the Argo controller, causing high memory usage even after the workflow completes.
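For reference, a ttlStrategy along these lines in the Workflow spec (or in the controller's workflowDefaults) tells the controller to delete the Workflow object after it finishes; the values here are only examples:

ttlStrategy:
  secondsAfterCompletion: 300   # delete the Workflow object 5 minutes after it finishes
  secondsAfterSuccess: 300      # applies only to successful workflows
  secondsAfterFailure: 600      # applies only to failed workflows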

@shuangkun
Member

shuangkun commented Jan 15, 2025

Can you see any logs?

kubectl logs workflow-controller -n argo -p

Or:

kubectl get pod workflow-controller -n argo -o yaml

and check for any OOM conditions.
I once encountered controller crashes in a large-scale scenario at "woc.log.Fatalf".
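For example, something like this prints the last termination reason of the controller container, and OOMKilled there would confirm an OOM (the pod name is a placeholder):

kubectl -n argo get pod <workflow-controller-pod> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'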

@shuangkun added the area/controller label on Jan 15, 2025
@JishinJames
Author

JishinJames commented Jan 16, 2025

Hi @tczhao, thanks for replying.

We have this configured:
persistence:
  archive: true
  archiveTTL: 30d
  nodeStatusOffLoad: true
  postgresql:
    database: ********
    port: 5432
    ssl: true
    sslMode: require
    tableName: argo_workflows

@JishinJames
Author

JishinJames commented Jan 16, 2025

Hi @shuangkun, thanks for the reply.
There are no errors related to OOM in any of the logs. The pod only restarts when it exceeds the memory allocated to the node.

I have attached logs here.

@tczhao
Member

tczhao commented Jan 17, 2025

Try kubectl get workflows and see if the archived workflows are still present in the cluster.
If they are present, you need to configure https://argo-workflows.readthedocs.io/en/latest/fields/#ttlstrategy
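For example, counting completed Workflow objects that are still sitting in the cluster (the completed label used here is the one the controller normally sets on finished workflows; treat it as an assumption for your version):

kubectl get workflows -A -l workflows.argoproj.io/completed=true | wc -l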

@JishinJames
Author

JishinJames commented Jan 17, 2025

We don't see any archived workflows from the above command.

We tried to check what is accumulating in heap memory and found that these goroutines accumulate the most.

[profile screenshot attached]
