Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add controller-isleader xweb component to respond with 200 if the controller in question is a raft leader #2581

Merged
merged 4 commits into from
Jan 14, 2025

Conversation

nenkoru
Copy link
Contributor

@nenkoru nenkoru commented Dec 9, 2024

closes(backref) #2582

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 10, 2024

moved from /sys/health to a /controller/health with changes made to make it more generic

@plorenz
Copy link
Member

plorenz commented Dec 13, 2024

Hi @nenkoru , before going too deep with this PR, we require a CLA before accepting contributions. Information on that is here: https://openziti.io/policies/CONTRIBUTING.html#code-contributions
Let us know if you're ok with submitting that.

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 17, 2024

Hi @nenkoru , before going too deep with this PR, we require a CLA before accepting contributions. Information on that is here: https://openziti.io/policies/CONTRIBUTING.html#code-contributions Let us know if you're ok with submitting that.

Hey, @plorenz!
Sent a signed ICLA to [email protected] yesterday.
Can't wait for processing to complete!

@plorenz
Copy link
Member

plorenz commented Dec 18, 2024

Hi @nenkoru , before going too deep with this PR, we require a CLA before accepting contributions. Information on that is here: https://openziti.io/policies/CONTRIBUTING.html#code-contributions Let us know if you're ok with submitting that.

Hey, @plorenz! Sent a signed ICLA to [email protected] yesterday. Can't wait for processing to complete!

CLA looks good 👍

Having looked at it again, I think it looks overall. The only 2 thing I'd like changed are:

  1. A flag to disable the 429 return so by default the results will be backwards compabible
  2. This should replace the existing health checker, rather than add to it. I'd rather have one flexible endpoint than multiple, since they are largely the same.

Let me know if those make sense.

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 18, 2024

Hi @nenkoru , before going too deep with this PR, we require a CLA before accepting contributions. Information on that is here: https://openziti.io/policies/CONTRIBUTING.html#code-contributions Let us know if you're ok with submitting that.

Hey, @plorenz! Sent a signed ICLA to [email protected] yesterday. Can't wait for processing to complete!

CLA looks good 👍

Having looked at it again, I think it looks overall. The only 2 thing I'd like changed are:

  1. A flag to disable the 429 return so by default the results will be backwards compabible
  2. This should replace the existing health checker, rather than add to it. I'd rather have one flexible endpoint than multiple, since they are largely the same.

Let me know if those make sense.

I think it gets complicated with moving this logic to a health-checks handler.
We need that controller env within a healtcheck, so we have to pass it into the structure of health-check, make it depend on the controller/env.
But router.go https://github.com/openziti/ziti/blob/main/router/router.go#L274 uses the same health-check and we have to pass the env, but it's impossible due to different types (controller vs router)

We might also opt for moving this heath-check to the actual gosundheit custom check https://github.com/openziti/ziti/blob/main/controller/health.go#L15, and try to find the result of the check within that health-check loop at https://github.com/openziti/ziti/blob/main/common/health/handler.go#L99 and decide whether to return 429 status code there.

Or we could just make health-checks specific for controller.

Thoughts?

@plorenz
Copy link
Member

plorenz commented Dec 19, 2024

Or we could just make health-checks specific for controller.

Yes, this was the direction I was thinking. Leave the router health check as it is and replace the use of the common health checker with this, controller specific one.

So the only changes required would be to

  1. remove the old of the existing health checker for the controller
  2. ensure that the new health check code is backwards compatible (uses the same paths, returns, etc, unless an indicator (request param or path) says that we should do the 429 return if not leader

@nenkoru
Copy link
Contributor Author

nenkoru commented Dec 24, 2024

Or we could just make health-checks specific for controller.

Yes, this was the direction I was thinking. Leave the router health check as it is and replace the use of the common health checker with this, controller specific one.

So the only changes required would be to

  1. remove the old of the existing health checker for the controller
  2. ensure that the new health check code is backwards compatible (uses the same paths, returns, etc, unless an indicator (request param or path) says that we should do the 429 return if not leader

Made changes accordingly. Naming was hard, but I named this as enableRaftControllerCheck. This flag makes it to do two things: Return 429 if not a raft leader and raft is enabled & Return isLeader and isRaftEnabled within raft. Making this essentially a feature flag.

Eg response:

< HTTP/1.1 429 Too Many Requests
< Content-Type: application/json
< Date: Tue, 24 Dec 2024 21:47:33 GMT
< Content-Length: 408
<
{
    "data": {
        "checks": [
            {
                "details": null,
                "healthy": true,
                "id": "bolt.read",
                "lastCheckDuration": "17.334µs",
                "lastCheckTime": "2024-12-24T21:47:24Z"
            }
        ],
        "healthy": true
    },
    "meta": {},
    "raft": {
        "isLeader": false,
        "isRaftEnabled": true
    }
}

Also I skimmed the zititest but couldn't an appropriate place to put tests for this behaviour.

I guess after tests created I could unmark this PR as being a draft and we could continue iterating on this!
Tell me what you think.

@plorenz
Copy link
Member

plorenz commented Jan 7, 2025

Hi @nenkoru

Sorry for the late response, I was out for a couple of weeks at the end of the year.

It's looking pretty good. I only have a couple of questions.

First, can you make sure your commits are signed, please? (see https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits).

Second, how would you feel about making the enableRaftControllerCheck based off either a query param or path difference? That would allow someone to use the same health check to monitor overall health as well as for detecting leadership changes. I can see someone wanting to do both.
Or were you thinking they'd run the health check at two different endpoints, one for each behavior?

@nenkoru
Copy link
Contributor Author

nenkoru commented Jan 13, 2025

Hi @nenkoru

Sorry for the late response, I was out for a couple of weeks at the end of the year.

It's looking pretty good. I only have a couple of questions.

First, can you make sure your commits are signed, please? (see https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits).

Second, how would you feel about making the enableRaftControllerCheck based off either a query param or path difference? That would allow someone to use the same health check to monitor overall health as well as for detecting leadership changes. I can see someone wanting to do both. Or were you thinking they'd run the health check at two different endpoints, one for each behavior?

Signed the commits, will do the path check for somethig like
/health-checks/controller/raft to output the overall health as well as controller's raft state(basically would append to the response json object).
Would also eliminate the need for enableRaftControllerCheck in yaml

@nenkoru
Copy link
Contributor Author

nenkoru commented Jan 13, 2025

Pushed the appropriate change

@nenkoru nenkoru marked this pull request as ready for review January 13, 2025 14:04
@nenkoru nenkoru requested review from a team as code owners January 13, 2025 14:04
@plorenz
Copy link
Member

plorenz commented Jan 14, 2025

Pushed the appropriate change

Looks good! I'll give it a quick test run tomorrow and then likely merge it up. Thank you!

@plorenz
Copy link
Member

plorenz commented Jan 14, 2025

Looks good. I see did see a panic when running this against a non-HA controller, so I'll add a check for that and also a CHANGELOG entry.
The other tweak I'm thinking about is the 429 return code. Was there a specific reason you picked that vs a 503 or something else in the 5xx range?

Copy link
Member

@plorenz plorenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@plorenz plorenz merged commit 12e3bb8 into openziti:main Jan 14, 2025
@nenkoru
Copy link
Contributor Author

nenkoru commented Jan 14, 2025

Looks good. I see did see a panic when running this against a non-HA controller, so I'll add a check for that and also a CHANGELOG entry. The other tweak I'm thinking about is the 429 return code. Was there a specific reason you picked that vs a 503 or something else in the 5xx range?

I picked it just to be consistent with the Hashicorp Vault[1] status code. I thought it might be good to just follow the leading Raft implementation in this case.

At glance 503 might be more suitable than 429, but I don't think it could make much of a difference so it's up for debate.

[1] https://developer.hashicorp.com/vault/api-docs/system/health#429

@plorenz
Copy link
Member

plorenz commented Jan 15, 2025

Looks good. I see did see a panic when running this against a non-HA controller, so I'll add a check for that and also a CHANGELOG entry. The other tweak I'm thinking about is the 429 return code. Was there a specific reason you picked that vs a 503 or something else in the 5xx range?

I picked it just to be consistent with the Hashicorp Vault[1] status code. I thought it might be good to just follow the leading Raft implementation in this case.

At glance 503 might be more suitable than 429, but I don't think it could make much of a difference so it's up for debate.

[1] https://developer.hashicorp.com/vault/api-docs/system/health#429

Ok, I left as 429, that seems like a good enough justification. Thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants