
Memory (RAM) increases exponentially when submitting a large number of workflows and is not cleared even after the workflows complete #14084

JishinJames opened this issue Jan 15, 2025 · 6 comments
Labels
area/controller Controller issues, panics type/bug

Comments

@JishinJames

JishinJames commented Jan 15, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Description:

Memory (RAM) increases exponentially when submitting a large number of workflows and is not cleared even after the workflows complete.

Environment:

Argo Workflows version: 3.5.12
Parallel workflows: 6000+

What happened:

The Argo Workflows Controller's memory consumption increases exponentially, and under this excessive memory usage the controller crashes frequently. It does not log any specific error messages before these crashes, which makes it hard to pinpoint the underlying cause.
We therefore profiled the memory of the controller pod and ran a few experiments to check whether the problem could be mitigated.
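Roughly how the profiles were collected (a sketch only; it assumes the controller's Go pprof endpoints are reachable, and the port used here is an assumption, so adjust it to however pprof is exposed in your deployment):

# Forward the controller's pprof port locally (port number is an assumption)
kubectl -n argo port-forward deploy/workflow-controller 6060:6060 &

# Inspect the heap profile interactively
go tool pprof http://localhost:6060/debug/pprof/heap

# A goroutine profile can be pulled the same way
go tool pprof http://localhost:6060/debug/pprof/goroutine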

How to reproduce it (as minimally and precisely as possible):
1. Set up an environment with 300+ nodes.
2. Launch 5000+ workflows in parallel (see the sketch below).
3. Monitor the RAM usage of the Argo Workflows Controller and note any unexpected crashes.
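For example, a loop along these lines can be used to generate the load (a sketch; it assumes the manifest below is saved as test-workflow.yaml and uses generateName instead of a fixed name so the submissions do not collide, and that the controller pod carries the app=workflow-controller label):

# Submit 5000 copies of the test workflow
for i in $(seq 1 5000); do
  kubectl -n argo-dev create -f test-workflow.yaml
done

# Watch the controller's memory while the workflows run
kubectl -n argo top pod -l app=workflow-controller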

[screenshot attached]

Version(s)

v3.5.8, v3.5.10, v3.5.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-workflow
spec:
  templates:
    - name: execution
      inputs: {}
      outputs: {}
      metadata: {}
      steps:
        - - name: generate-file-test
            template: generate-file
            arguments: {}
        - - name: read-file-test
            template: read-file
            arguments:
              artifacts:
                - name: generated-file
                  from: >-
                    {{steps.generate-file-test.outputs.artifacts.generated-file}}
        - - name: remove-file-test
            template: generate-and-remove-file
            arguments: {}
    - name: generate-file
      inputs: {}
      outputs:
        artifacts:
          - name: generated-file
            path: /data/test-file.txt
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - >
            mkdir -p /data;

            echo "This is a test file created by Argo Workflow" >
            /data/test-file.txt;

            cat /data/test-file.txt
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
    - name: read-file
      inputs:
        artifacts:
          - name: generated-file
            path: /data/test-file.txt
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - |
            echo "Getting file from Volume"
            ls -al /data
            cat /data/test-file.txt
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
    - name: generate-and-remove-file
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: alpine:3.7
        command:
          - sh
          - '-c'
        args:
          - |
            mkdir -p /data;
            export FILE_PATH='/data/test-file.txt'
            echo "This is a test file created by Argo Workflow" > $FILE_PATH;
            cat $FILE_PATH

            if [ ! -f $FILE_PATH ]; then
              echo "File $FILE_PATH does not exist"
              exit 1
            else
              rm -f $FILE_PATH
              echo "File has been removed."
              ls -al /data
            fi
        resources: {}
        volumeMounts:
          - name: temp-storage
            mountPath: /data
  entrypoint: execution
  arguments:
    parameters:
      - name: wf-message
        value: This is a testing workflow on CPU node!
  volumes:
    - name: temp-storage
      emptyDir: {}
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

time="2025-01-15T11:10:57Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2025-01-15T11:10:57Z" level=info msg="cron config" cronSyncPeriod=10s
time="2025-01-15T11:10:57Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2025-01-15T11:10:57.406Z" level=warning msg="Non-transient error: <nil>"
time="2025-01-15T11:10:57.414Z" level=warning msg="Non-transient error: <nil>"
W0115 11:10:58.545737       1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed
time="2025-01-15T11:10:59.186Z" level=error msg="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"delete-wrong-ids\" in namespace \"argo-image-search\"" conditionType=SpecError namespace=argo-image-search workflow=delete-wrong-ids-dev
time="2025-01-15T11:10:59.186Z" level=error msg="invalid cron workflow" cronWorkflow=argo-image-search/delete-wrong-ids-dev error="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"delete-wrong-ids\" in namespace \"argo-image-search\""
time="2025-01-15T11:10:59.389Z" level=error msg="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"test-lucene-filter-workflow\" in namespace \"argo-image-search\"" conditionType=SpecError namespace=argo-image-search workflow=test-lucene-filter-workflow
time="2025-01-15T11:10:59.389Z" level=error msg="invalid cron workflow" cronWorkflow=argo-image-search/test-lucene-filter-workflow error="cannot validate Workflow: WorkflowTemplate Informer cannot find WorkflowTemplate of name \"test-lucene-filter-workflow\" in namespace \"argo-image-search\""
time="2025-01-15T11:14:24.354Z" level=error msg="unable to substitute parameters for metric 'after_executed': failed to resolve {{workflow.failures}}" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.361Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.362Z" level=error msg="was unable to obtain node for stress-testing-2ngbw"
time="2025-01-15T11:14:24.362Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-1482424255"
time="2025-01-15T11:14:24.395Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.396Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3197284649"
time="2025-01-15T11:14:24.581Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.581Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-273366853"
time="2025-01-15T11:14:24.641Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.641Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3098736065"
time="2025-01-15T11:14:24.707Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.708Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3285503829"
time="2025-01-15T11:14:24.778Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.778Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3101850513"
time="2025-01-15T11:14:24.864Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.864Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3403075349"
time="2025-01-15T11:14:24.927Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t
time="2025-01-15T11:14:24.927Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-3835149785"
time="2025-01-15T11:14:24.991Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-dev workflow=stress-testing-86d2t

time="2025-01-15T11:14:34.437Z" level=error msg="was unable to obtain node for stress-testing-2ngbw-1842043933"

Logs from in your workflow's wait container

no logs in wait container
@tczhao
Member

tczhao commented Jan 15, 2025

We have a different usage pattern: not that many workflows (10 at a time), but each workflow is gigantic. Our controller memory increases a lot during a workflow run but drops back down afterwards.

I don't have any comment on how much memory a controller should use. However, on the topic of memory not clearing: have you configured ttlStrategy or archiving? If a workflow is not cleared from etcd, it will persist in both etcd and the Argo controller, causing high memory usage even after the workflow completes.
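For reference, a ttlStrategy along these lines in the Workflow spec (or in the controller's workflowDefaults) tells the controller to delete the Workflow object after it finishes; the values here are only examples:

ttlStrategy:
  secondsAfterCompletion: 300   # delete the Workflow object 5 minutes after it finishes
  secondsAfterSuccess: 300      # applies only to successful workflows
  secondsAfterFailure: 600      # applies only to failed workflows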

@shuangkun
Member

shuangkun commented Jan 15, 2025

Can you see any logs?

kubectl logs workflow-controller -n argo -p

Or:

kubectl get pod workflow-controller -n argo -o yaml

and check for any OOM conditions.
I once encountered controller crashes in a large-scale scenario at "woc.log.Fatalf".
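For example, something like this prints the last termination reason of the controller container, and OOMKilled there would confirm an OOM (the pod name is a placeholder):

kubectl -n argo get pod <workflow-controller-pod> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'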

@shuangkun added the area/controller label on Jan 15, 2025
@JishinJames
Author

JishinJames commented Jan 16, 2025

Hi @tczhao, thanks for replying.

We have this configured:
persistence:
  archive: true
  archiveTTL: 30d
  nodeStatusOffLoad: true
  postgresql:
    database: ********
    port: 5432
    ssl: true
    sslMode: require
    tableName: argo_workflows

@JishinJames
Author

JishinJames commented Jan 16, 2025

Hi @shuangkun, thanks for the reply.
There are no errors related to OOM in any of the logs. The pod only restarts when it exceeds the memory allocated to the node.

I have attached logs here.

@tczhao
Member

tczhao commented Jan 17, 2025

Try kubectl get workflows and see if the archived workflows are still present in the cluster.
If they are present, you need to configure https://argo-workflows.readthedocs.io/en/latest/fields/#ttlstrategy
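For example, counting completed Workflow objects that are still sitting in the cluster (the completed label used here is the one the controller normally sets on finished workflows; treat it as an assumption for your version):

kubectl get workflows -A -l workflows.argoproj.io/completed=true | wc -l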

@JishinJames
Author

JishinJames commented Jan 17, 2025

We don't see any archived workflows from the above command.

We tried to check what is accumulating in heap memory and found that these goroutines accumulate the most.

[profile screenshot attached]
