Update scrape and remote_write libs for generic HostHealth rules #660

Open
MichaelThamm wants to merge 6 commits into main

Conversation

@MichaelThamm commented Dec 19, 2024

TODO

  • Revert back to cosl@main (PYDEPS, requirements.txt, tox.ini)

Issue

Within the tandem cosl PR, we have a centralized way (in Prometheus rather than in individual charms) to "inject" alerts on the fly, so we can extend Prometheus with generic up/absent rules.

Solution

The prometheus_scrape and prometheus_remote_write libs were updated to inject generic HostHealth rules into the alert rules they forward.
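
As a quick sanity check once the rules are in place (a minimal sketch, assuming a Prometheus unit reachable at 10.1.2.3 on the default port 9090; substitute your own unit address), you can list the loaded rule groups via the Prometheus rules API and look for the injected ones:

curl -s http://10.1.2.3:9090/api/v1/rules | jq -r '.data.groups[].name' | grep -i hosthealth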

Documentation for implementation and testing

Context

In tandem with:

Testing Instructions

Without Grafana Agent

  1. Deploy Prometheus, Avalanche, and Alertmanager
  2. Relate prom:metrics-endpoint to avalanche, and alertmanager to prom:alertmanager (example commands in the sketch after this list)
  3. Scale Avalanche with juju add-unit avalanche -n 1
  4. In the Prometheus UI, check for the new HostHealth rules or query Prometheus with juju show-unit prom/0 | yq -r '."prom/0"."relation-info"[1]."application-data"."alert_rules"' | jq
  5. Artificially take down one Avalanche unit with pebble
juju exec --unit avalanche/0 -- \
  PEBBLE_SOCKET=/charm/containers/avalanche/pebble.socket \
  pebble stop avalanche
  6. Check the Prometheus UI and notice the HostDown Alert firing (showing the specific unit avalanche/0)
  7. Artificially take down the second Avalanche unit with pebble and check for each Avalanche unit in the labels
  8. Check the alerts arriving in Alertmanager for:
    • Host 'prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0' is down. VALUE = 0 LABELS = map[__name__:up instance:prom_8b073ff8-5456-492a-804b-7f9f15c996dc_avalanche_avalanche/0 job:juju_prom_8b073ff8_avalanche_prometheus_scrape_avalanche-0 juju_application:avalanche juju_charm:avalanche-k8s juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc juju_unit:avalanche/0]
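
For reference, a minimal sketch of the deploy and relate commands for steps 1-2 (the application names prom, avalanche, and alertmanager match the relations above; the charm channels are assumptions, adjust to your environment):

juju deploy prometheus-k8s prom --channel=edge --trust
juju deploy avalanche-k8s avalanche --channel=edge
juju deploy alertmanager-k8s alertmanager --channel=edge --trust
juju relate prom:metrics-endpoint avalanche
juju relate alertmanager prom:alertmanager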

With Grafana Agent

Scrape

  1. Deploy Grafana Agent, Prometheus, Avalanche, and Alertmanager
  2. Relate gagent:metrics-endpoint to avalanche, gagent:send-remote-write to prom, and alertmanager to prom:alertmanager (example commands in the sketch after this list)
  3. Scale Avalanche with juju add-unit avalanche -n 1
  4. In the Prometheus UI, check for the new HostHealth and AggregatorHostHealth rules or query Grafana Agent with juju show-unit gagent/0 | yq -r '."gagent/0"."relation-info"[0]."application-data"."alert_rules"' | jq
  5. Artificially take down one Avalanche unit with pebble
  6. Check the Prometheus UI and notice the HostDown Alert firing (showing the specific unit Avalanche/0)
  7. Artificially take down the second Avalanche unit with pebble and check for each Avalanche unit in the labels
  8. Check the alerts arriving in Alertmanager (similar result as without Grafana Agent).
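
A minimal sketch of the deploy and relate commands for steps 1-2 (the gagent application name and the grafana-agent-k8s charm/channel are assumptions; prom, avalanche, and alertmanager are deployed as in the previous scenario):

juju deploy grafana-agent-k8s gagent --channel=edge
juju relate gagent:metrics-endpoint avalanche
juju relate gagent:send-remote-write prom
juju relate alertmanager prom:alertmanager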

Remote write

  1. Artificially take down Grafana Agent with pebble
juju exec --unit gagent/0 -- \
  PEBBLE_SOCKET=/charm/containers/agent/pebble.socket \
  pebble stop agent
  2. Check the Prometheus UI and notice the HostUnavailable Alert firing in both the HostHealth and AggregatorHostHealth groups

Note

This does not show each unit; with 2 Avalanche units, the alert labels appear once per app:
alertname=HostUnavailable, juju_application=avalanche, juju_charm=avalanche-k8s, juju_model=prom, juju_model_uuid=8b073ff8-5456-492a-804b-7f9f15c996dc, severity=critical

  3. Check the alerts arriving in Alertmanager for (query sketch below):
    • Metrics not received from host ''. VALUE = 1 LABELS = map[juju_application:avalanche juju_model:prom juju_model_uuid:8b073ff8-5456-492a-804b-7f9f15c996dc]
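
A sketch for inspecting the firing alerts over the Alertmanager v2 API (assuming an Alertmanager unit reachable at 10.1.2.4 on the default port 9093; substitute your own unit address):

curl -s http://10.1.2.4:9093/api/v2/alerts | \
  jq '.[] | select(.labels.alertname == "HostUnavailable") | .labels'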

Upgrade Notes

By fetching the new libs you get a set of new alerts automatically. If a charm already had its own up/absent alerts, this will result in duplicated alerts and rules: up/absent alerts are ubiquitous and are now handled by the libs modified in this PR, so any custom alerts duplicating this behaviour can be removed (the sketch below shows one way to find them).
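
One way to spot charm-side rules that duplicate the new generic ones (a sketch; the rule directory src/prometheus_alert_rules is an assumption and varies per charm):

grep -rnE 'absent\(up|up == 0|up < 1' src/prometheus_alert_rules/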

With the new design introduced in this PR, you would get a separate HostUnavailable alert for Grafana Agent itself and each unit that is aggregated by it.

@MichaelThamm marked this pull request as ready for review December 20, 2024 18:15
@MichaelThamm requested a review from a team as a code owner December 20, 2024 18:15

- PYDEPS = ["cosl"]
+ PYDEPS = ["git+https://github.com/canonical/cos-lib.git@feature/generic-alerts#egg=cosl"]
Contributor

Reminder to revert before merging here and elsewhere.
