Lock message misleading when workflow is not at front of queue #14080

Open
2 of 4 tasks
moserke opened this issue Jan 14, 2025 · 1 comment

Comments

moserke commented Jan 14, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When a workflow uses multiple synchronization locks, one a mutex and one a semaphore, the Pending message it shows can be misleading.

A workflow must wait for all of its locks to be available, and the queue is processed in order. Say there are 5 jobs in the queue that all rely on the same mutex, and 5 more jobs behind them that use a different mutex, with all 10 relying on the same semaphore. The jobs with a different mutex will not run until all of the jobs relying on the shared mutex finish.

This is not the issue; it is expected and fully explained in the docs. However, the message the system returns for the jobs not relying on that mutex is fmt.Sprintf("Waiting for %s lock. Lock status: %d/%d", s.name, s.limit-len(s.lockHolder), s.limit) from https://github.com/argoproj/argo-workflows/blob/v3.6.2/workflow/sync/semaphore.go#L176.

What is misleading is the case where the semaphore has plenty of room, but the workflow is not at the front of the queue. You end up with a pending message that says it is waiting for the lock yet shows plenty of free capacity, when what it really should say is something like "Waiting for position in queue".

This would help when troubleshooting why so few jobs are running even though the semaphore has lots of room.
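A minimal sketch of the kind of message split being suggested. pendingMessage and its atFront parameter are hypothetical names made up for this example and are not the function in semaphore.go. Only the second format string is taken from the linked line.

```go
package main

import "fmt"

// pendingMessage is a hypothetical helper, not Argo's implementation.
// name is the lock name, limit the semaphore size, held the number of
// current holders, and atFront whether this workflow heads the queue.
func pendingMessage(name string, limit, held int, atFront bool) string {
	if !atFront && held < limit {
		// Capacity is free; the workflow is only waiting for its turn.
		return fmt.Sprintf("Waiting for position in queue for %s lock. Lock status: %d/%d",
			name, limit-held, limit)
	}
	// Same wording as the existing message quoted in this issue.
	return fmt.Sprintf("Waiting for %s lock. Lock status: %d/%d", name, limit-held, limit)
}

func main() {
	// "lots_of_jobs" is a placeholder lock name for the example.
	fmt.Println(pendingMessage("lots_of_jobs", 200, 1, false))
	fmt.Println(pendingMessage("lots_of_jobs", 200, 200, true))
}
```

The split keys off queue position only when capacity is actually free, so a full semaphore still reports the existing message.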

Version(s)

v3.6.2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Example workflow-level synchronization config, where the semaphore limit is set to, say, 200 and 5 jobs with the same mutex key are queued in front of a job with a different key:


  synchronization:
    semaphore:
      configMapKeyRef:
        name: my_config_map
        key: lots_of_jobs
      namespace: default
    mutex:
      name: my_uuid
      namespace: default

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun
Member

Could you submit a simple PR to fix this log? @moserke
