Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(alerting): detect flapping alerts #148

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

simskij
Copy link
Member

@simskij simskij commented Jul 9, 2024

Issue

We are currently not providing any alert rules that would detect when an alert is transitioning to and from pending in a fast pace, which means we could potentially loose out on information from recurring intermittent errors.

Solution

Detect whenever an alert is transitioning more than 5 times in the last hour.

Context

resolves #145

Testing Instructions

  1. Deploy COS
  2. Deploy a machine (juju deploy ubuntu)
  3. Deploy Grafana Agent and relate it to ubuntu.
  4. Relate the agent to COS
  5. Verify that the alert rule has made it over
  6. Turn off the host
  7. Wait until a scrape has occurred
  8. Turn on the host
  9. Wait until a scrape has occurred
  10. Repeat 6 to 9 5 times.
  11. Wait 5 minutes
  12. Verify that the alert is firing

Upgrade Notes

@simskij simskij requested a review from a team as a code owner July 9, 2024 11:16
- name: AlertRules
rules:
- alert: AlertRulesFlapping
expr: changes(ALERTS_FOR_STATE{severity = "warning"}[1h]) > 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!
Apparently there also ALERTS, but the value there is "1", whereas for ALERTS_FOR_STATE it's timestamp.
How did you find out this exists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Through @err404r :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALERTS is not working with changes for some reason, when I found this out I google and find out about ALERTS_FOR_STATE it is internal metrics which is used to track alert state in between prometheus restarts. According to official docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need to replicate the alert per severity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, should we move this alert to prometheus? Seems like a generic alert rule.
We probably don't want topology injected here.
(And also we don't want prometheus topology injected there.)

src/prometheus_alert_rules/flapping.rules Show resolved Hide resolved
src/prometheus_alert_rules/flapping.rules Show resolved Hide resolved
@err404r
Copy link

err404r commented Jul 18, 2024

Hi, during next two day I'm going to test this rule with inhibitions rules, so please do not merge I might be able to find some useful improvements

Base automatically changed from feat/up-alert to main October 23, 2024 05:47
@lucabello
Copy link
Contributor

@err404r Any news?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add detection of flapping alerts
5 participants