Update scrape and remote_write libs for generic HostHealth rules #660
TODO
Issue
Within the tandem cosl PR, we have a centralized way (in Prometheus rather than in individual charms) to "inject" alerts on the fly, so we can extend Prometheus with generic up/absent rules.
Solution
The prometheus_scrape and prometheus_remote_write libs were updated to inject generic HostHealth rules into the alert rules.
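As a rough illustration of what "injecting" a generic rule group could look like, here is a minimal sketch. It is not the code from the libs: the function name `inject_generic_rules`, the constant `GENERIC_HOST_HEALTH`, the rule expressions, and the `for` durations are all assumptions made for illustration. Only the names HostHealth, HostDown, and HostUnavailable come from this PR.

```python
import copy

# Hypothetical generic rule group. The group and alert names appear in this PR;
# the expressions and "for" durations are assumptions for illustration only.
GENERIC_HOST_HEALTH = {
    "name": "HostHealth",
    "rules": [
        {"alert": "HostDown", "expr": "up < 1", "for": "5m",
         "labels": {"severity": "critical"}},
        {"alert": "HostUnavailable", "expr": "absent(up)", "for": "5m",
         "labels": {"severity": "critical"}},
    ],
}


def inject_generic_rules(alert_rules: dict) -> dict:
    """Return a copy of alert_rules with the generic HostHealth group added.

    The input is left untouched, and the group is added only once, so
    calling this repeatedly does not duplicate the generic rules.
    """
    merged = copy.deepcopy(alert_rules)
    groups = merged.setdefault("groups", [])
    if not any(g.get("name") == GENERIC_HOST_HEALTH["name"] for g in groups):
        groups.append(copy.deepcopy(GENERIC_HOST_HEALTH))
    return merged
```

Because the helper checks for an existing group of the same name, applying it twice to the same rules is harmless.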
Documentation for implementation and testing
Context
In tandem with:
Testing Instructions
Without Grafana Agent
Relate prom:metrics-endpoint avalanche and alertmanager prom:alertmanager
juju add-unit avalanche -n 1
Check for the HostHealth rules, or query Prometheus with:
juju show-unit prom/0 | yq -r '."prom/0"."relation-info"[1]."application-data"."alert_rules"' | jq
HostDown alert firing (showing the specific unit avalanche/0):
Host 'prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0' is down. VALUE = 0 LABELS = map[__name__:up instance:prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0 job:juju_prom_8b073ff8_avalanche_prometheus_scrape_avalanche-0 juju_application:avalanche juju_charm:avalanche-k8s juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc juju_unit:avalanche/0]
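To check the relation data without reading the whole JSON dump by eye, a small helper like the following could be used. This is a sketch under assumptions: the helper name `find_groups` is hypothetical, and it assumes the `alert_rules` value is a JSON document with a top-level `groups` list whose entries carry a `name` field. Real group names may carry a topology prefix, so the match is by substring.

```python
import json


def find_groups(alert_rules_json: str, needle: str) -> list[str]:
    """Return the names of rule groups whose name contains `needle`.

    Assumes the JSON has the shape {"groups": [{"name": ..., "rules": [...]}]}.
    """
    data = json.loads(alert_rules_json)
    return [
        group["name"]
        for group in data.get("groups", [])
        if needle in group.get("name", "")
    ]
```

Passing the output of the `juju show-unit ... | yq ...` command as the first argument and `"HostHealth"` as the second should return a non-empty list once the new libs are in place. Note that a substring match on `HostHealth` also matches `AggregatorHostHealth`.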
With Grafana Agent
Scrape
Relate gagent:metrics-endpoint avalanche, gagent:send-remote-write prom, and alertmanager prom:alertmanager
juju add-unit avalanche -n 1
Check for the HostHealth and AggregatorHostHealth rules, or query Grafana Agent with:
juju show-unit gagent/0 | yq -r '."gagent/0"."relation-info"[0]."application-data"."alert_rules"' | jq
HostDown alert firing (showing the specific unit avalanche/0)
Remote write
HostUnavailable alert firing in both the HostHealth and AggregatorHostHealth groups
Note
This does not show each unit. With 2 avalanche units, the alert labels are shown once per app:
alertname=HostUnavailable
juju_application=avalanche
juju_charm=avalanche-k8s
juju_model=prom
juju_model_uuid=8b073ff8-5456-492a-804b-7f9f15c996dc
severity=critical
Metrics not received from host ''. VALUE = 1 LABELS = map[juju_application:avalanche juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc]
Upgrade Notes
By fetching the new libs, you automatically get a set of new alerts. If a charm already ships its own up/absent alerts, this results in duplicate alerts and rules. Because up/absent alerts are ubiquitous, they are now handled by the libs modified in this PR, and any custom alerts duplicating this behaviour can be removed.
With the new design introduced in this PR, you get a separate HostUnavailable alert for Grafana Agent itself and for each unit that is aggregated by it.
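To help find custom alerts that duplicate the generic up/absent rules mentioned in these notes, a heuristic like the one below could be run over a charm's existing rules. It is only a sketch: the function name `find_redundant_alerts` and the patterns are assumptions, and the patterns only recognise the simplest forms (`up == 0`, `up < 1`, `absent(up)`, each with an optional label selector). Anything it reports should be reviewed by hand before removal.

```python
import re

# Assumed "generic" expression shapes; whitespace is stripped before matching.
_GENERIC_PATTERNS = [
    re.compile(r"^up(\{.*\})?(==0|<1)$"),
    re.compile(r"^absent\(up(\{.*\})?\)$"),
]


def find_redundant_alerts(alert_rules: dict) -> list[str]:
    """Return names of alerts whose expression looks like a plain up/absent check."""
    redundant = []
    for group in alert_rules.get("groups", []):
        for rule in group.get("rules", []):
            expr = re.sub(r"\s+", "", str(rule.get("expr", "")))
            if any(pattern.match(expr) for pattern in _GENERIC_PATTERNS):
                redundant.append(rule.get("alert", "<unnamed>"))
    return redundant
```

More complex expressions that merely contain `up` are deliberately not flagged, since they may encode charm-specific logic.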