The "forbid" concurrent option fails to verify if jobs are already running and allows another job to start #1569

binhdt2611 · 2024-08-27T23:35:11Z

Describe the bug
The "forbid" concurrency option is not working as expected. I have jobs set to "forbid". The issue happened when a node in our cluster crashed due to overload and the dkron service restarted itself. Once the node was back up, it started running jobs that were still running on other agent servers.

When all nodes are running and none has crashed, the "forbid" option works as expected.

To Reproduce

On an agent called dkron-marketplace-5, the dkron log shows it receiving a signal from the cluster to run a job:

Aug 28 00:55:03 dkron-marketplace-5 dkron[250326]: time="2024-08-28T00:55:03+12:00" level=info msg="agent: Calling AgentRun" job_name=marketplace-import-orders-16-amazon node="172.30.3.25:6868"

The job was then selected to run on another agent, dkron-marketplace-9:

Aug 28 00:55:03 dkron-marketplace-9 dkron[55533]: time="2024-08-28T00:55:03+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-9

dkron-marketplace-5 then became overloaded, and the dkron service was restarted at 01:01:34:

Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=error msg="grpc: error dialing." error="failed to build resolver: passthrough: received empty target in Build()" method=ExecutionDone node=dkron-marketplace-5 server_addr=
Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=fatal msg="agent: error applying SetExecutionType" error="node is not the leader" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Dkron agent starting" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-marketplace-5

Once dkron had started and rejoined the cluster, it started the job, which was still running on dkron-marketplace-9:

Aug 28 01:05:00 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:05:00+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-5
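
For triage, the two "Starting job" entries above can be pulled out of the journal mechanically. The following is a minimal sketch, not part of dkron: the regular expression assumes the log format shown in this report, and `starts_by_job` is a hypothetical helper name. It only lists start events; the excerpts here contain no completion line, so the overlap itself still rests on the observation that the job was still running on dkron-marketplace-9.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: lines follow the logfmt layout seen in this report.
START_RE = re.compile(
    r'time="(?P<time>[^"]+)".*?msg="grpc_agent: Starting job" '
    r'job=(?P<job>\S+) node=(?P<node>\S+)'
)


def starts_by_job(lines):
    """Group 'Starting job' events by job name as (timestamp, node) pairs."""
    starts = defaultdict(list)
    for line in lines:
        m = START_RE.search(line)
        if m:
            starts[m["job"]].append((datetime.fromisoformat(m["time"]), m["node"]))
    for entries in starts.values():
        entries.sort()  # chronological order
    return dict(starts)


# The two start events quoted in this report.
LOG_LINES = [
    'Aug 28 00:55:03 dkron-marketplace-9 dkron[55533]: time="2024-08-28T00:55:03+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-9',
    'Aug 28 01:05:00 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:05:00+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-5',
]

for job, entries in starts_by_job(LOG_LINES).items():
    for when, node in entries:
        print(job, when.isoformat(), node)
```

Any job that shows starts on more than one node within its usual run time is a candidate for the same problem.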

Expected behavior
The job shouldn't be allowed to run concurrently like this, because it is set to "forbid", which allows it to run on only one node at a time.
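
To make the expectation concrete, here is a small sketch of the guard I would expect "forbid" to behave like. It is an illustration only, not dkron's actual implementation: `RunningExecutions`, `try_start`, and `done` are hypothetical names, and the sketch assumes the set of running executions is visible cluster-wide and survives the restart of any single node.

```python
from dataclasses import dataclass, field


@dataclass
class RunningExecutions:
    """Hypothetical cluster-wide view: job name -> nodes currently running it."""

    by_job: dict = field(default_factory=dict)

    def try_start(self, job, node, concurrency="forbid"):
        """Record a start and return True, or return False if 'forbid' blocks it."""
        running_on = self.by_job.setdefault(job, set())
        if concurrency == "forbid" and running_on:
            return False  # the job is already running somewhere in the cluster
        running_on.add(node)
        return True

    def done(self, job, node):
        """Mark the execution of `job` on `node` as finished."""
        self.by_job.get(job, set()).discard(node)


state = RunningExecutions()
job = "marketplace-import-orders-16-amazon"

print(state.try_start(job, "dkron-marketplace-9"))  # first start is allowed
print(state.try_start(job, "dkron-marketplace-5"))  # expected: refused while still running
state.done(job, "dkron-marketplace-9")
print(state.try_start(job, "dkron-marketplace-5"))  # allowed once the first run has finished
```

In the scenario above, the second call corresponds to the 01:05:00 start on dkron-marketplace-5, which is the one I would expect to be refused.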

Specifications

OS: Ubuntu 20.04
Version: 3.2.6

vcastellm · 2024-10-27T16:41:13Z

Thanks for reporting. We'll investigate the issue.

vcastellm added the bug label Oct 27, 2024
