Update scrape and remote_write libs for generic HostHealth rules #660

Open
MichaelThamm wants to merge 6 commits into main

Conversation

@MichaelThamm commented Dec 19, 2024

TODO

  • Revert back to cosl@main (PYDEPS, requirements.txt, tox.ini)

Issue

Within the tandem cosl PR, we have a centralized way (in Prometheus rather than in individual charms) to "inject" alerts on the fly, so we can extend Prometheus with generic up/absent rules.

Solution

The prometheus_scrape and prometheus_remote_write libs were updated to inject generic HostHealth rules into the alert rules they forward.
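
As a quick sanity check once the rules are in place (a minimal sketch, assuming a Prometheus unit reachable at 10.1.2.3 on the default port 9090; substitute your own unit address), you can list the loaded rule groups via the Prometheus rules API and look for the injected ones:

curl -s http://10.1.2.3:9090/api/v1/rules | jq -r '.data.groups[].name' | grep -i hosthealth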

Documentation for implementation and testing

Context

In tandem with:

Testing Instructions

Without Grafana Agent

  1. Deploy Prometheus, Avalanche, and Alertmanager
  2. Relate prom:metrics-endpoint to avalanche, and alertmanager to prom:alertmanager (example commands in the sketch after this list)
  3. Scale Avalanche with juju add-unit avalanche -n 1
  4. In the Prometheus UI, check for the new HostHealth rules or query Prometheus with juju show-unit prom/0 | yq -r '."prom/0"."relation-info"[1]."application-data"."alert_rules"' | jq
  5. Artificially take down one Avalanche unit with pebble
juju exec --unit avalanche/0 -- \
  PEBBLE_SOCKET=/charm/containers/avalanche/pebble.socket \
  pebble stop avalanche
  6. Check the Prometheus UI and notice the HostDown Alert firing (showing the specific unit avalanche/0)
  7. Artificially take down the second Avalanche unit with pebble and check for each Avalanche unit in the labels
  8. Check the alerts arriving in Alertmanager for:
    • Host 'prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0' is down. VALUE = 0 LABELS = map[__name__:up instance:prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0 job:juju_prom_8b073ff8_avalanche_prometheus_scrape_avalanche-0 juju_application:avalanche juju_charm:avalanche-k8s juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc juju_unit:avalanche/0]
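
For reference, a minimal sketch of the deploy and relate commands for steps 1-2 (the application names prom, avalanche, and alertmanager match the relations above; the charm channels are assumptions, adjust to your environment):

juju deploy prometheus-k8s prom --channel=edge --trust
juju deploy avalanche-k8s avalanche --channel=edge
juju deploy alertmanager-k8s alertmanager --channel=edge --trust
juju relate prom:metrics-endpoint avalanche
juju relate alertmanager prom:alertmanager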

With Grafana Agent

Scrape

  1. Deploy Grafana Agent, Prometheus, Avalanche, and Alertmanager
  2. Relate gagent:metrics-endpoint to avalanche, gagent:send-remote-write to prom, and alertmanager to prom:alertmanager (example commands in the sketch after this list)
  3. Scale Avalanche with juju add-unit avalanche -n 1
  4. In the Prometheus UI, check for the new HostHealth and AggregatorHostHealth rules or query Grafana Agent with juju show-unit gagent/0 | yq -r '."gagent/0"."relation-info"[0]."application-data"."alert_rules"' | jq
  5. Artificially take down one Avalanche unit with pebble
  6. Check the Prometheus UI and notice the HostDown Alert firing (showing the specific unit Avalanche/0)
  7. Artificially take down the second Avalanche unit with pebble and check for each Avalanche unit in the labels
  8. Check the alerts arriving in Alertmanager (similar result as without Grafana Agent).
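
A minimal sketch of the deploy and relate commands for steps 1-2 (the gagent application name and the grafana-agent-k8s charm/channel are assumptions; prom, avalanche, and alertmanager are deployed as in the previous scenario):

juju deploy grafana-agent-k8s gagent --channel=edge
juju relate gagent:metrics-endpoint avalanche
juju relate gagent:send-remote-write prom
juju relate alertmanager prom:alertmanager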

Remote write

  1. Artificially take down Grafana Agent with pebble
juju exec --unit gagent/0 -- \
  PEBBLE_SOCKET=/charm/containers/agent/pebble.socket \
  pebble stop agent
  2. Check the Prometheus UI and notice the HostUnavailable Alert firing in both the HostHealth and AggregatorHostHealth groups

Note

This does not show each unit; with 2 Avalanche units, the alert labels appear once per app:
alertname=HostUnavailable, juju_application=avalanche, juju_charm=avalanche-k8s, juju_model=prom, juju_model_uuid=8b073ff8-5456-492a-804b-7f9f15c996dc, severity=critical

  3. Check the alerts arriving in Alertmanager for (query sketch below):
    • Metrics not received from host ''. VALUE = 1 LABELS = map[juju_application:avalanche juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc]
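
A sketch for inspecting the firing alerts over the Alertmanager v2 API (assuming an Alertmanager unit reachable at 10.1.2.4 on the default port 9093; substitute your own unit address):

curl -s http://10.1.2.4:9093/api/v2/alerts | \
  jq '.[] | select(.labels.alertname == "HostUnavailable") | .labels'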

Upgrade Notes

By fetching the new libs you get a set of new alerts automatically. If a charm already had its own up/absent alerts, this will result in duplicated alerts and rules: up/absent alerts are ubiquitous and are now handled by the libs modified in this PR, so any custom alerts duplicating this behaviour can be removed (the sketch below shows one way to find them).
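
One way to spot charm-side rules that duplicate the new generic ones (a sketch; the rule directory src/prometheus_alert_rules is an assumption and varies per charm):

grep -rnE 'absent\(up|up == 0|up < 1' src/prometheus_alert_rules/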

With the new design introduced in this PR, you would get a separate HostUnavailable alert for Grafana Agent itself and each unit that is aggregated by it.

@MichaelThamm marked this pull request as ready for review December 20, 2024 18:15
@MichaelThamm requested a review from a team as a code owner December 20, 2024 18:15

- PYDEPS = ["cosl"]
+ PYDEPS = ["git+https://github.com/canonical/cos-lib.git@feature/generic-alerts#egg=cosl"]
Contributor

Reminder to revert before merging here and elsewhere.
