Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An endpoint to determine whether a node is a raft leader #2582

Closed
nenkoru opened this issue Dec 9, 2024 · 7 comments
Closed

An endpoint to determine whether a node is a raft leader #2582

nenkoru opened this issue Dec 9, 2024 · 7 comments

Comments

@nenkoru
Copy link
Contributor

nenkoru commented Dec 9, 2024

Hello there!

As per thread on the forums[1] I think it might be worth implementing what Hashicorp Vault does with its /sys/health endpoint[2] to return 200 status code if the node is a raft leader.
This would allow for auto-failover when using HAProxy L7 load-balancer to a next node that has become a leader of the raft.

I have gone ahead and hacked together a PR #2581 which adds another webapi component to the controller which allows for determining whether a controller is a raft leader.
However I need a response whether this idea is viable enough to be even considered to be merged upstream.

[1] https://openziti.discourse.group/t/ziti-controller-ha-setup-behind-a-haproxy-load-balancer/3535
[2] https://developer.hashicorp.com/vault/api-docs/system/health

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 10, 2024

Tested it with the following HAproxy config. It works nicely and as intended. HAProxy automatically detects a backend node to direct traffic to using this /sys/health endpoint.
Tested by manually shutting down leader node -> after a few seconds traffic starts to be directed to a newly elected leader.

global
  # Global Configuration

defaults
  timeout connect 5000
  timeout client 50000
  timeout server 50000

frontend main
  mode tcp
  bind *:443
  use_backend ctrl

backend ctrl
  mode tcp
  option httpchk GET /sys/health
  http-check expect status 200
  server ctrl1 127.0.0.1:1281 check check-ssl verify none
  server ctrl2 127.0.0.1:1282 check check-ssl verify none
  server ctrl3 127.0.0.1:1283 check check-ssl verify none

@plorenz
Copy link
Member

plorenz commented Dec 10, 2024

Overall, seems like a useful thing to have. The only thing to note is that there is an existing health check implementation in controller/health.go.

My preference would be to consolidate this functionality in with that code and distinguish between overall health and raft later status with a parameter or path.

Thoughts?

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 10, 2024

Overall, seems like a useful thing to have. The only thing to note is that there is an existing health check implementation in controller/health.go.

My preference would be to consolidate this functionality in with that code and distinguish between overall health and raft later status with a parameter or path.

Thoughts?

I guess this change would be breaking backward compatibility with this 429 status code being returned if the node isn't the leader. If backward compatibility not to be the obstacle I think its safe to migrate this logic directly into health-check handler.
This health-check can't be used for routers, correct?

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 11, 2024

Apart from #2586 issue, it suprisingly works well.
Not only this allows for auto faillover in case of a disaster, but it also allows for the ziti desktop edge(for mac at least) to do not even notice there was a disaster with the controller it was using!

@plorenz
Copy link
Member

plorenz commented Dec 18, 2024

This health-check can't be used for routers, correct?

Hmm, I was going to say "Correct, router has a separate health check implemented, with router specific logic.", but then I went and looked again. There's a common http impl, but router and controller have different checks. It's been a bit since I've looked at this.

@plorenz
Copy link
Member

plorenz commented Jan 14, 2025

Changelog and panic fix here: #2650

@nenkoru
Copy link
Contributor Author

nenkoru commented Jan 15, 2025

So final HAProxy config might look like the following.

global
  # Global Configuration

defaults
  timeout connect 5000
  timeout client 50000
  timeout server 50000

frontend main
  mode tcp
  bind *:443
  use_backend ctrl

backend ctrl
  mode tcp
  option httpchk GET /health-checks/controller/raft
  http-check expect status 200
  server ctrl1 127.0.0.1:1281 check check-ssl verify none
  server ctrl2 127.0.0.1:1282 check check-ssl verify none
  server ctrl3 127.0.0.1:1283 check check-ssl verify none

plorenz added a commit that referenced this issue Jan 15, 2025
Add changelog and panic fix for health check update. Closes #2582
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants