
RETRY_JOIN fails after server comes back up - it's always DNS! #1253

Open
fopina opened this issue Jan 10, 2023 · 25 comments

Comments

@fopina
Contributor

fopina commented Jan 10, 2023

Describe the bug

After both the server and the agents are up and the cluster is running smoothly, if the server goes down and comes back up with a different IP (but the same hostname), the agents do not reconnect.

To Reproduce

  • Use the following as docker-compose.yml:
version: '3.7'

x-common: &base
  image: dkron/dkron:3.2.1
  command: agent

networks:
  vpcbr:
    ipam:
     config:
       - subnet: 10.5.0.0/16
         gateway: 10.5.0.1

services:
  server:
    <<: *base
    environment:
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_NODE_NAME: dkron1
      DKRON_BOOTSTRAP_EXPECT: 1
    ports:
      - 8888:8080
    networks:
      vpcbr:
        ipv4_address: 10.5.0.20

  agents:
    <<: *base
    environment:
      DKRON_RETRY_JOIN: server
    networks:
      vpcbr:
    deploy:
      replicas: 3
  • Wait for everything to be up and running (and http://localhost:8888 shows all 4 nodes in the cluster)
  • docker kill the server
  • Verify that all agents log level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=...
  • Modify the server IP in docker-compose.yml to 10.5.0.22 (see the fragment below) and run docker compose up -d to re-create the server
  • Verify that agents never reconnect
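The change in that step is only the server's fixed address; the rest of the service definition stays as in the file above:

  server:
    networks:
      vpcbr:
        ipv4_address: 10.5.0.22   # was 10.5.0.20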

Expected behavior
Agents would eventually retry joining using the hostname, picking up the new IP.

Additional context
I understand serf or raft might be tricky with DNS, but in this case the server does start up with proper access to its data/log, no corruption. And if I restart the agents, they reconnect just fine.
It seems the retry just keeps using the IP from the first join instead of re-resolving the hostname.

To reproduce the issue I'm forcing the IP change here, but when running in docker swarm (and I assume in k8s as well) a new IP upon service re-creation is expected unless fixed IPs are used.

Is this something easy to fix?

@fopina
Contributor Author

fopina commented Jan 22, 2023

Details pushed to https://github.com/fopina/delme/tree/main/dkron_retry_join_does_not_reresolve_dns to make reproducing easier

I'd assume that with an HA / multi-server setup this won't happen, as the server changing IP will retry-join the other servers by itself (and the new IP will then be shared with everyone), but I haven't tested that. I think this is still a valid bug, as the single-server setup is documented.

@fopina
Contributor Author

fopina commented Jan 23, 2023

After taking a look at dkron/retry_join.go, I think the agents are not stuck in a retry loop with an outdated IP; they're not retrying at all. So that is not the right place to fix it...

But looking at agent -h, there is an interesting option:

      --serf-reconnect-timeout string   This is the amount of time to attempt to reconnect to a failed node before
                                        giving up and considering it completely gone. In Kubernetes, you might need
                                        this to about 5s, because there is no reason to try reconnects for default
                                        24h value. Also Raft behaves oddly if node is not reaped and returned with
                                        same ID, but different IP.
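For reference, in the repro compose file that flag can be set through the agents' environment, roughly as below; the DKRON_SERF_RECONNECT_TIMEOUT variable name is an assumption based on dkron's usual flag-to-environment-variable mapping (as with DKRON_RETRY_JOIN):

  agents:
    <<: *base
    environment:
      DKRON_RETRY_JOIN: server
      DKRON_SERF_RECONNECT_TIMEOUT: 5s   # assumed env name for --serf-reconnect-timeout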

With that option set, the agents do "see" the server, but they still do not retry the join:

agents_3  | time="2023-01-23T00:22:28Z" level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=e9eca2dc43f1
...
agents_3  | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-update node=e9eca2dc43f1
agents_3  | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-reap node=e9eca2dc43f1

@fopina
Contributor Author

fopina commented Jan 27, 2023

@yvanoers just in case you're still around, would you have any comment on this one? I've tried debugging, but I believe the part that handles reconnection (and does not re-resolve DNS) is within the serf library, not dkron.

I couldn't find any workaround at all... I've tried setting -serf-reconnect-timeout to a low value (as recommended for Kubernetes), but then it's even worse: the agents remove the server and never see it again (even if it comes back up with the same IP).

@yvanoers
Collaborator

I'm not that well-versed in the internals of serf, but you could very well be right that this is a serf-related issue.
Maybe @vcastellm has more readily available knowledge. I would have to dig into it, which I am willing to do, except my available time has been rather sparse lately.

@vcastellm
Member

This is an old, known issue. It's caused by how Raft handles nodes; it affects any dynamic-IP system like k8s, and it should be fixable. I need to dig into it. It's really annoying, so expect me to try to allocate time for this soon.

@fopina
Contributor Author

fopina commented Jan 30, 2023

That's awesome! I made an attempt to trace it but failed...

Even when trying multiple servers, as the workaround I mentioned, it still doesn't work. Using the low serf reconnect timeout kicks the server out and never lets it back in...

@vcastellm
Member

I took a deeper look into this: it's not related to Raft but to what you mentioned. Serf is not re-resolving the hostname but keeps using the existing IP. It's always DNS :)

I need to investigate a bit more to come up with a workaround that doesn't involve restarting the agents.

@fopina
Contributor Author

fopina commented Mar 21, 2023

Gentle reminder this is still happening 🩸 😄

@jaccky

jaccky commented Jul 18, 2023

Hi,
we have the same issue (the DNS name is not used by raft; it uses IP addresses, which in k8s change unexpectedly). After pods are restarted, the new IP addresses are not taken into account by raft.

Did anyone find a solution to this?

@fopina
Contributor Author

fopina commented Jul 18, 2023

I didn't, and it's really annoying. I ended up setting up log alerts (as I have logs in Loki) and killing all agents when the issue starts popping up...

Really bad workaround, but in my case I prefer to break some ongoing jobs rather than not run any until I manually restart...

@jaccky

jaccky commented Jul 19, 2023

Thanks @fopina for your reply!
I have a question for you: with a simple kill of the agents, are you able to stabilize the dkron cluster while retaining all the data (schedules)?
I don't understand how this can happen...

@fopina
Contributor Author

fopina commented Jul 19, 2023

Agents have no data; it's all in the server(s).
Killing the agents makes them restart and re-resolve the server hostnames.
The impact is that any job running on them fails (gets killed as well).

@vcastellm
Member

@fopina can you check against v4-beta?

@fopina
Contributor Author

fopina commented Feb 12, 2024

I already did, @vcastellm: #1442 (comment)

It didn't work though :/

@jaccky

jaccky commented Feb 15, 2024

Hi,
we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes.
Various restarts of the nodes always resulted in a working cluster with an elected leader.
So the issue seems to be finally solved!

@fopina
Contributor Author

fopina commented Feb 15, 2024

@jaccky @vcastellm maybe that is what the other issue/PR refers to (missing leader elections), though this issue I opened is not about the leader.

I have a single-server setup, and if the container restarts, the worker nodes will not reconnect (as the server changed IP but not hostname).

The server itself comes back up and resumes as leader (as a single node).
And that I did test, and it wasn't fixed in beta4.

It sounds similar, but maybe it's in a slightly different place in the code? (One is about server nodes reconnecting to a server that changed IP, and mine is about worker nodes reconnecting to the same server name after its IP changed.)
Forcing name resolution seems like it should be the solution there as well, but maybe in another code path.

@ivan-kripakov-m10
Contributor

ivan-kripakov-m10 commented Feb 17, 2024

@fopina Hey there! Have you faced any issues when running more than one dkron server?

AFAIK, retry join is a finite process in dkron. Here's what typically happens when deploying dkron in such a configuration:

  1. Your dkron agent successfully joins the cluster and starts listening to serf events.
  2. If the server is killed, the agent receives a member leave event, but no rejoin process is initiated.
  3. When you deploy a new dkron server node with the same ID but a different IP, the agent does not retry joining in the serf layer, and the dkron server doesn't attempt to find agents and join them to its own serf cluster.

While a DNS solution might work, there could be other approaches to consider. For example, if the agent receives a server leave event and there are no known dkron server nodes left, the agent could initiate the retry-join process again.

I'm not very familiar with the dkron backend, so I'd like to ask @vcastellm to validate this information.

@fopina
Contributor Author

fopina commented Feb 17, 2024

Hi @ivan-kripakov-m10.

I believe that is not correct; the nodes do keep trying to rejoin at the serf layer, but they only keep the resolved IP and do not re-resolve it.

Regarding multiple server nodes: yes, I used to run a 3-server-node cluster, but the leader election / raft issues were so frequent that the HA setup had more downtime than a single server node hehe
Also, as my single server node is a service in a swarm cluster, if the host goes down it's reassigned to another node, with very little downtime.
I just need to solve the rejoin of the workers hehe

@ivan-kripakov-m10
Contributor

ivan-kripakov-m10 commented Feb 18, 2024

@fopina thanks for the reply!

Just to clarify, retry join is not a feature of the serf layer itself. Instead, it's an abstraction within dkron. You can find the implementation details in the dkron source code at this link: retry_join.

This method is invoked only when a dkron server or agent starts up

@ivan-kripakov-m10
Contributor

ivan-kripakov-m10 commented Feb 18, 2024

So, I reproduced the issue in a k8s environment. I started one dkron server and one dkron agent, then removed the retry join property from the dkron server configuration. Here's how the configuration looked:

- "--retry-join=\"provider=k8s label_selector=\"\"app.kubernetes.io/instance={{ .Release.Name }}\"\" namespace=\"\"{{ .Release.Namespace }}\"\"\""

After removing the retry join property and restarting the dkron server, the dkron agent produced the following logs (like yours):

time="2024-02-18T16:29:10Z" level=info msg="agent: Received event" event=member-leave node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:29:10Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.0.20:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b

The issue is not reproducible when the retry join property is present in the dkron server configuration. With this property, the dkron server is able to discover the dkron agent. Consequently, the dkron agent simply receives an update event rather than only a member leave event. Below are the logs from the dkron agent:

time="2024-02-18T16:25:11Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.4.97:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="agent: Received event" event=member-update node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="Updating LAN server" node=dkron-agent-5ffc84b448-4ft7b server="dkron-server-0 (Addr: 10.0.3.155:6868) (DC: dc1)"

It appears that you can try adding the dkron-agent DNS name to the retry-join configuration in the dkron-server as a workaround.

@fopina
Contributor Author

fopina commented Feb 18, 2024

@ivan-kripakov-m10 could you highlight the differences between your test and the configuration I posted in the issue itself?

It's using retry-join and a DNS name.
Maybe it has indeed been fixed in v4 and I tested it wrong this time.

@ivan-kripakov-m10
Contributor

ivan-kripakov-m10 commented Feb 18, 2024

@fopina no, the issue itself is not fixed in v4 yet :(
I'm suggesting a workaround: adding DKRON_RETRY_JOIN with the dkron agents' hosts to the dkron server configuration.

services.server.environment.DKRON_RETRY_JOIN: {{dkron-agents-dns-names}}

@fopina
Contributor Author

fopina commented Feb 18, 2024

@ivan-kripakov-m10 oh, got it! Good point, I'll test it in my setup; it might be worth it even if it causes some network “noise”!

@ivan-kripakov-m10
Contributor

So, I did a bit of digging into how serf works and whether DNS names can be used with it. Here's what I found:

  1. Dkron uses serf.join method.
  2. Serf, in turn, hands off its tasks to the memberlist library (source).
  3. This library resolves IPs and carries on with them (source).

At first glance, it seems we can't solve this problem in the serf layer and will have to implement something within dkron.

@fopina
Contributor Author

fopina commented Feb 21, 2024

@ivan-kripakov-m10 thank you very much!

As I'm using docker swarm, adding DKRON_RETRY_JOIN: tasks.agents to the server service was enough!
tasks.agents resolves to ONE OF the healthy replicas, and apparently that's enough, as the replicas are still connected amongst themselves and cluster membership gets updated in all of them! The resulting server service is sketched below.
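A sketch of the server service from the compose file in the issue with that workaround applied (only the DKRON_RETRY_JOIN line is new; tasks.agents is Swarm's built-in DNS name for the replicas of the agents service):

  server:
    <<: *base
    environment:
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_NODE_NAME: dkron1
      DKRON_BOOTSTRAP_EXPECT: 1
      DKRON_RETRY_JOIN: tasks.agents   # server retry-joins the agents by DNS name
    ports:
      - 8888:8080
    networks:
      vpcbr:
        ipv4_address: 10.5.0.20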

@vcastellm I think this issue still makes sense (agents DO retry to join, but without re-resolving the hostname, so it looks like a bug), but feel free to close it; Ivan's workaround is more than acceptable.
