The "forbid" concurrent option fails to verify if jobs are already running and allows another job to start #1569

Open
binhdt2611 opened this issue Aug 27, 2024 · 1 comment

binhdt2611 commented Aug 27, 2024

Describe the bug
The "forbid" concurrency option is not working as expected. I have jobs set to "forbid". The issue occurred when a node in our cluster crashed from overload and the dkron service restarted itself. Once it was back up, it started running jobs that were still running on other agent servers.

When all nodes are running without crashing, the "forbid" concurrency option works as expected.

To Reproduce

  1. On an agent called dkron-marketplace-5, the dkron log shows that it received a signal from the cluster to run a job.
Aug 28 00:55:03 dkron-marketplace-5 dkron[250326]: time="2024-08-28T00:55:03+12:00" level=info msg="agent: Calling AgentRun" job_name=marketplace-import-orders-16-amazon node="172.30.3.25:6868"
  2. The job was then selected to run on another agent called dkron-marketplace-9.
Aug 28 00:55:03 dkron-marketplace-9 dkron[55533]: time="2024-08-28T00:55:03+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-9
  3. dkron-marketplace-5 became overloaded, logged a fatal error at 01:00:02, and the dkron service restarted at 01:01:34.
Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=error msg="grpc: error dialing." error="failed to build resolver: passthrough: received empty target in Build()" method=ExecutionDone node=dkron-marketplace-5 server_addr=
Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=fatal msg="agent: error applying SetExecutionType" error="node is not the leader" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Dkron agent starting" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-marketplace-5
  4. Once dkron had started and rejoined the cluster, it started the job, which was still running on dkron-marketplace-9.
Aug 28 01:05:00 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:05:00+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-5

Expected behavior
These jobs shouldn't be allowed to run concurrently like this, because they are set to "forbid", which allows a job to run on only one node at a time.
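
To make the expectation concrete, here is a minimal sketch in Go of the check I would expect before a job starts. It is not Dkron's actual code: `ExecutionRegistry`, `TryStart`, and `Done` are hypothetical names, and the in-memory map only stands in for whatever cluster-wide state the servers share. The point is that the decision should depend on shared state about running executions, not on what the restarted node remembers locally.

```go
package main

import (
	"fmt"
	"sync"
)

// ExecutionRegistry is a hypothetical stand-in for cluster-wide state
// that records how many executions of each job are currently running.
// In a real cluster this would live in a replicated store, not in memory.
type ExecutionRegistry struct {
	mu      sync.Mutex
	running map[string]int // job name -> number of running executions
}

// NewExecutionRegistry returns an empty registry.
func NewExecutionRegistry() *ExecutionRegistry {
	return &ExecutionRegistry{running: make(map[string]int)}
}

// TryStart reports whether a new execution of job may start.
// With concurrency "forbid" it refuses when any execution of the job
// is already recorded as running; otherwise it records the new one.
func (r *ExecutionRegistry) TryStart(job, concurrency string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if concurrency == "forbid" && r.running[job] > 0 {
		return false
	}
	r.running[job]++
	return true
}

// Done records that one execution of job has finished.
func (r *ExecutionRegistry) Done(job string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.running[job] > 0 {
		r.running[job]--
	}
}

func main() {
	reg := NewExecutionRegistry()
	job := "marketplace-import-orders-16-amazon"

	// First start (as on dkron-marketplace-9 at 00:55:03).
	fmt.Println("first start allowed:", reg.TryStart(job, "forbid"))
	// Second start while the first is still running (as on
	// dkron-marketplace-5 at 01:05:00) should be refused.
	fmt.Println("second start allowed:", reg.TryStart(job, "forbid"))
	// After the first execution finishes, a new one may start.
	reg.Done(job)
	fmt.Println("start after done allowed:", reg.TryStart(job, "forbid"))
}
```

In the scenario above, the start at 01:05:00 on dkron-marketplace-5 corresponds to the second `TryStart` call, which I would expect to be refused while the execution on dkron-marketplace-9 is still recorded as running.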

**Specifications:**

  • OS: Ubuntu 20.04
  • Version: 3.2.6
@vcastellm vcastellm added the bug label Oct 27, 2024
@vcastellm
Member

Thanks for reporting. We'll investigate the issue.
