Use a non-global metrics registry in Teleport #50913
I didn't want to add a metrics RFD, but it would be good to start using the process registry instead of the global one for the next features we build and the next metrics we add.
Looks good to me, but I'll let the experts approve first.
// and the global registry (used by some Teleport services and many dependencies).
gatherers := prometheus.Gatherers{
	prometheus.DefaultGatherer,
	process.metricsRegistry,
If conflicting metrics are registered I assume they'll be dropped, but unaffected metrics will keep working. Do you know if that's correct?
> If conflicting metrics are registered I assume they'll be dropped
Currently, registration conflicts in the global registry can cause:
- hard failure / error returned
- panics
- silent failure (metric does not get registered and we don't know about it)
Adding a local registry will not change the failure modes when metrics conflict within the same registry. However, we are adding a new failure mode: metrics conflicting between the local and global registries. In this case, the global registry will prevail (I did this for backward compatibility, as everything uses the global registry today).
As we start using the local registry more, we might create such hard-to-detect conflicts. The situation is not strictly worse than today (we already have some racy metric registration with silent failures going on 😬). To ensure no conflicts happen, we can prefix new metrics by wrapping the registry when passing it to the service.
I think we would benefit from a metrics guideline: setting the Teleport component in the metric subsystem would reduce the probability of conflicts.
Thanks, very informative.
Force-pushed from 853ac49 to 7be0540.
Force-pushed from d153fda to fa31b4a.
This PR adds a new non-global per-process metrics registry in Teleport.
Using the global registry and global metrics causes conflicts in tests, because we start multiple Teleport processes and/or other non-Teleport processes (tbot, the operator, ...) that all share them.
Having a new per-process metrics registry will allow Teleport services to register metrics scoped to their Teleport process. This will reduce the conflicts happening in tests.
To ensure backward compatibility, the Teleport metrics server serves both the process-scoped registry and the global registry.
Required for the autoupdate controller metrics PR.