Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a /-/healthy endpoint for monitoring component health #2197

Merged

Conversation

ptodev
Copy link
Contributor

@ptodev ptodev commented Nov 29, 2024

PR Description

Adding a /-/healthy endpoint which returns an error if at least one component is not healthy. The /-/ready endpoint doesn't check for component health, so this could be useful for Kubernetes liveliness probes.

Which issue(s) this PR fixes

Fixes #2061

Notes to the Reviewer

Prometheus also has a /-/healthy endpoint, and so does Grafana Agent Static mode. However, both of these were not very useful because they always return HTTP 200. There is an open issue in the Prometheus repo to make it more useful.

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@ptodev ptodev requested review from clayton-cornell and a team as code owners November 29, 2024 13:22
@ptodev ptodev linked an issue Nov 29, 2024 that may be closed by this pull request
@ptodev ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 818a1b8 to c698048 Compare November 29, 2024 13:35
Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice, thanks! Just minor comments.

docs/sources/reference/_index.md Show resolved Hide resolved
docs/sources/reference/http/_index.md Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
internal/service/http/http.go Show resolved Hide resolved
internal/service/http/http_test.go Show resolved Hide resolved
Copy link
Contributor

@clayton-cornell clayton-cornell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few suggestions here to expand an clarify some doc things. :-)

docs/sources/reference/_index.md Show resolved Hide resolved
docs/sources/reference/http/_index.md Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
docs/sources/reference/http/_index.md Outdated Show resolved Hide resolved
@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Nov 29, 2024
@rfratto
Copy link
Member

rfratto commented Nov 29, 2024

Hi, I wanted to share some concerns with using component health for liveness probes.

Kubernetes intends for liveness probes to be an indication of when the pod is "stuck" and needs to be killed. Kubernetes will eventually kill the pod if the liveness probe fails enough times.

With this definition of liveness, component health is not a good indicator. Components like remote.s3 can report themselves as unhealthy if an object in S3 gets deleted, and custom components can fail if you give them an invalid configuration (such as someone rolls out a bad config from fleet management).

Killing the pod when components are unhealthy will not only not fix the problem, but since Alloy requires all components to be healthy on startup, it will also completely halt all telemetry collection until the problem is manually fixed.

@thampiotr
Copy link
Contributor

Hi, I wanted to share some concerns with using component health for liveness probes.

Oh yeah, I 100% agree and I would not recommend using it as liveness probe. But people still wanted it. I think adding this endpoint is harmless, because it can be used for other health checking purposes or diagnostics, not just as a liveness probe. We do know people who wanted to use it as liveness probe, but this would be doing it against our advice. They may have some special requirements though and it may work okay for them.

@ptodev
Copy link
Contributor Author

ptodev commented Dec 2, 2024

Hi @rfratto, thank you for opining! I agree that in the component health is not a good indicator for whether Alloy as a whole should be restarted.

Apparently, the issue which originally prompted this feature request is an informer timeout like the one referenced in #2161.
@thampiotr - if that timeout does happen, maybe the real issue is that the component should keep retrying to sync the informers rather than giving up? Judging by the code, the prometheus.operator components will only sync the informers when the config is updated?

I do think there is a benefit in the /healhy endpoint as a whole, although I do agree that we shouldn't recommend that users use it as a liveliness probe. It could be useful in situations where users update the Alloy config and then want to double check that it works as expected. They might do rolling deployments - e.g. 1% of Alloy instances first, and then if they are healthy, the deployment can continue to more instances. I mistakenly thought that liveliness probes in k8s can do similar things, but apparently that's not the case.

@ptodev
Copy link
Contributor Author

ptodev commented Dec 2, 2024

I opened another PR for the k8s informers to keep retrying, as mentioned in the comment above.

Copy link
Contributor

@clayton-cornell clayton-cornell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor tweaks to the docs and all good

@ptodev ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from a32e9b4 to 91659a0 Compare December 2, 2024 19:37
@ptodev ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 3e6c716 to cc86481 Compare December 11, 2024 11:51
@ptodev ptodev force-pushed the 2061-expose-alloy-overall-component-health-via-http-endpoint branch from 13e43bf to e095bb3 Compare December 11, 2024 12:05
@ptodev ptodev merged commit b97d2b6 into main Dec 11, 2024
17 of 18 checks passed
@ptodev ptodev deleted the 2061-expose-alloy-overall-component-health-via-http-endpoint branch December 11, 2024 15:54
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age type/docs Docs Squad label across all Grafana Labs repos
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose Alloy overall component health via HTTP endpoint
4 participants