Lock message misleading when workflow is not at front of queue #14080

moserke · 2025-01-14T20:43:05Z

Pre-requisites

I have double-checked my configuration
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When running a workflow with multiple synchronization locks, where one is a mutex and one is a semaphore, you can get into a case where the Pending message shown is misleading.

A workflow must wait for all of its locks to be available, and the queue is processed in order. Suppose 5 jobs in the queue all rely on the same mutex, and 5 more jobs behind them use a different mutex, while all 10 rely on the same semaphore. The jobs with the different mutex will not run until all of the jobs relying on the first mutex finish.
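To make the scenario concrete, here is a toy model of in-order queue processing. It is only a sketch of the behavior described above, not the controller's code; the `job` type and the `admitInOrder` function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// job is a hypothetical queued workflow that needs one semaphore slot
// and one named mutex before it can start.
type job struct {
	name  string
	mutex string
}

// admitInOrder walks the queue from the front and starts jobs until it
// reaches one that cannot take all of its locks. Everything behind that
// job keeps waiting, even if its own locks are free.
func admitInOrder(queue []job, semaphoreFree int, heldMutexes map[string]bool) []string {
	var started []string
	for _, j := range queue {
		if semaphoreFree == 0 || heldMutexes[j.mutex] {
			break // the front of the queue is blocked, so nothing behind it runs
		}
		heldMutexes[j.mutex] = true
		semaphoreFree--
		started = append(started, j.name)
	}
	return started
}

func main() {
	var queue []job
	for i := 1; i <= 5; i++ {
		queue = append(queue, job{name: fmt.Sprintf("same-mutex-%d", i), mutex: "A"})
	}
	queue = append(queue, job{name: "other-mutex-1", mutex: "B"})

	// 200 semaphore slots are free, yet only the first job starts.
	fmt.Println("started:", admitInOrder(queue, 200, map[string]bool{}))
}
```

In this model `other-mutex-1` stays pending with 199 semaphore slots free, which is the situation where the current message is confusing.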

This behavior is not the issue; it is expected and fully explained in the docs. The issue is the message the system returns for the jobs that do not rely on that mutex, which is fmt.Sprintf("Waiting for %s lock. Lock status: %d/%d", s.name, s.limit-len(s.lockHolder), s.limit) from https://github.com/argoproj/argo-workflows/blob/v3.6.2/workflow/sync/semaphore.go#L176.

What is misleading is the case where the semaphore has plenty of room, but the workflow is not yet at the front of the queue. You end up with a pending message that says the workflow is waiting for the lock while showing plenty of free lock capacity, when what it really should say is "Waiting for position in queue" or something similar.

This would make it easier to troubleshoot why few jobs are running even though the semaphore has lots of room.
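A rough sketch of what the distinction could look like is below. It assumes a simplified semaphore with a FIFO pending list; `sketchSemaphore`, its fields, and `waitingMessage` are hypothetical and do not mirror the actual types in semaphore.go. Only the existing message format is taken from the line linked above, and the new wording is just the suggestion from this report.

```go
package main

import "fmt"

// sketchSemaphore is a hypothetical, simplified stand-in for a queued
// semaphore. It is not the type used by the workflow controller.
type sketchSemaphore struct {
	name    string
	limit   int
	holders map[string]bool // keys currently holding a slot
	pending []string        // waiting keys, front of the queue first
}

// waitingMessage reports why holder is still pending. If another key is
// at the front of the queue, it says so instead of implying that the
// semaphore itself is full.
func (s *sketchSemaphore) waitingMessage(holder string) string {
	free := s.limit - len(s.holders)
	if len(s.pending) > 0 && s.pending[0] != holder {
		return fmt.Sprintf("Waiting for position in queue for %s lock. Lock status: %d/%d", s.name, free, s.limit)
	}
	return fmt.Sprintf("Waiting for %s lock. Lock status: %d/%d", s.name, free, s.limit)
}

func main() {
	s := &sketchSemaphore{
		name:    "lots_of_jobs",
		limit:   200,
		holders: map[string]bool{"wf-a": true},
		pending: []string{"wf-b", "wf-c"},
	}
	fmt.Println(s.waitingMessage("wf-b")) // front of the queue: existing wording
	fmt.Println(s.waitingMessage("wf-c")) // behind wf-b: queue-position wording
}
```

With this split, a workflow that sees 199/200 slots free can tell from the message that it is waiting on queue order rather than on semaphore capacity.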

Version(s)

v3.6.2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Example workflow-level synchronization config, where the semaphore limit is set to, say, 200 and 5 jobs with the same mutex key are queued in front of a job with a different key:


  synchronization:
    semaphore:
      configMapKeyRef:
        name: my_config_map
        key: lots_of_jobs
      namespace: default
    mutex:
      name: my_uuid
      namespace: default

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded


shuangkun · 2025-01-15T04:02:58Z

Could you submit a simple PR to fix this log? @moserke

moserke added the type/bug label Jan 14, 2025

shuangkun added the area/mutex-semaphore label Jan 15, 2025
