[SPIKE] Fine tune form & search monitors to lower noise #16101

Closed
4 tasks
Tracked by #16428
jilladams opened this issue Nov 13, 2023 · 9 comments
Labels
Monitoring Public Websites Scrum team in the Sitewide crew sitewide VA.gov frontend CMS team practice area

Comments

@jilladams
Contributor

jilladams commented Nov 13, 2023

User Story or Problem Statement

As an owner of VA.gov search, I want to be notified about anomalous traffic/errors with low noise, so we don't miss infrastructure issues.

Chat with Adrian regarding a reasonable level...

We want to be proactively aware of anything we should be worried about. What is the cutoff point?

Description or Additional Context

These 2 monitors generate almost nightly noise in the #public-websites-monitoring channel:

See here: Triaging the failed Search success rate monitor. Recording of the meeting that steps through Datadog and how to think about what we're seeing: https://us06web.zoom.us/rec/share/E5dbb6Ewhi7ILpS2qMwZ1WmEnRa1caieprd3UEQWmlTVGCSKSKCORZeFrodBHPI.o6MlyTOrLC1En0z1?startTime=1698786445000
Passcode: 55$kzl1+

We need to figure out if there are real underlying causes, or if the monitors just need to be fine tuned so they're less noisy.

Acceptance Criteria

  • Modify the Search rate anomaly monitor to check for true anomalies in traffic.
    • Do not include a warn threshold unless we will take action based on it.
    • Error threshold should be at a level that will trigger team action, and that action should be documented in ticket comments.
  • If more substantial work is needed, create new tickets for the actual work
@jilladams jilladams added VA.gov frontend CMS team practice area Needs refining Issue status Public Websites Scrum team in the Sitewide crew Monitoring labels Nov 13, 2023
@FranECross FranECross changed the title Fine tune search monitors to lower noise [SPIKE] Fine tune search monitors to lower noise Nov 15, 2023
@FranECross FranECross removed the Needs refining Issue status label Nov 15, 2023
@FranECross

FranECross commented Nov 29, 2023

@chriskim2311 to follow up with Adrian in the PW channel.

Thread here

@FranECross

Wait until after the meeting on 12/6 with Adrian and Steve before working on this.

@jilladams
Contributor Author

Based on the meeting with Mike Chelen and Steve Albers today from Code Yellow team:

  1. "VA.gov search success rate below threshold" monitor - https://vagov.ddog-gov.com/monitors/169139 - was tweaked:

    • Removed warning level. Their advice: don't warn unless you know you would do something about the warning. In this case, our behavior won't change unless we're truly having a rate limit / dip in traffic issue.
    • Lowered the error threshold to 97%. We haven't dipped below 98% in the last month except the 1 day we were down to 93%. 97% can be a new baseline for measuring whether we still have slight variations in traffic.
  2. "Search rate anomaly" - https://vagov.ddog-gov.com/monitors/169140?live=2h

    • Is currently throwing an alarm if we get down to a total count of 5 (warn) or 2 (error) successful submissions per 5 minutes. This would be an early warning system for a total failure. It's set up as a threshold alarm currently.
    • We may want to rewrite as a proper Anomaly monitor in Datadog.
    • We should try to tweak it as needed to reflect that we're looking for a big crash in Search. If we have trouble, we can ask for support in #public-datadog
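
To make the difference between the current threshold alarm and a "proper" anomaly check concrete, here is a minimal Python sketch. It is not Datadog's anomaly algorithm and does not use the Datadog API. The 5/2 counts come from the comment above; the function names, the inclusive comparison, the 3-standard-deviation rule, and all sample traffic numbers are assumptions made up for illustration.

```python
from statistics import mean, stdev

# Counts from the comment above: successful submissions per 5 minutes.
WARN_COUNT = 5
ERROR_COUNT = 2


def threshold_state(count: int) -> str:
    """Fixed-threshold check, roughly how the existing monitor is described.

    Assumption: "get down to" is treated as inclusive (<=).
    """
    if count <= ERROR_COUNT:
        return "error"
    if count <= WARN_COUNT:
        return "warn"
    return "ok"


def is_anomalous_drop(history: list[int], current: int, deviations: float = 3.0) -> bool:
    """Hypothetical anomaly check: flag a count far below the recent baseline.

    `history` holds counts from comparable earlier windows. The rule
    (mean minus `deviations` standard deviations) is an illustrative
    stand-in, not what Datadog computes.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current < baseline
    return current < baseline - deviations * spread


# Made-up sample data: quiet overnight traffic vs. a sharp daytime drop.
overnight_history, overnight_now = [6, 5, 7, 6, 6], 5
daytime_history, daytime_now = [200, 210, 190, 205, 195], 60

print(threshold_state(overnight_now), is_anomalous_drop(overnight_history, overnight_now))
print(threshold_state(daytime_now), is_anomalous_drop(daytime_history, daytime_now))
```

With these invented numbers, the fixed threshold warns on ordinary overnight traffic and stays silent on the daytime drop from about 200 to 60, while the baseline-relative check does the opposite. That is the behavior the rewrite is after: alert on a big crash in Search, not on normal low-traffic periods.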

@jilladams
Contributor Author

tl;dr: we still need this ticket, and need to refocus ACs around modifying the Search rate anomaly monitor to actually monitor for anomalies. Updating.

@FranECross FranECross changed the title [SPIKE] Fine tune search monitors to lower noise [SPIKE] Fine tune form & search monitors to lower noise Dec 14, 2023
@chriskim2311
Contributor

New Search Anomaly Monitor: https://vagov.ddog-gov.com/monitors/186811

@randimays
Contributor

@chriskim2311 and I took a look at this together and took some action on the monitors:

  • Search success rate: we left this one alone at 97% as it was recently adjusted and we need to keep an eye on it for noise/value
  • Old Search "anomaly" (that was actually a Datadog threshold monitor): we removed this one and replaced it with the new one Chris posted above
  • Find a Form success rate: adjusted from 99% to 97% (alert only). At that threshold we will reach out to Lighthouse to troubleshoot issues.
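
As a rough illustration of the alert-only setup described above, the sketch below computes a success rate and compares it to the 97% threshold. The 97% values and the Lighthouse follow-up come from this thread; the dictionary keys, function names, and the handling of a zero-traffic window are assumptions, and none of this reflects the actual Datadog monitor definitions.

```python
# Alert-only thresholds (percent) as stated in this thread. No warn level.
ALERT_THRESHOLDS = {
    "search_success_rate": 97.0,
    "find_a_form_success_rate": 97.0,
}

# Documented follow-up from the thread. Only Find a Form has one stated here.
ACTIONS = {
    "find_a_form_success_rate": "Reach out to Lighthouse to troubleshoot.",
}


def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage.

    Assumption: a window with no requests counts as 100%, so an idle
    period does not raise an alert by itself.
    """
    if total == 0:
        return 100.0
    return 100.0 * successes / total


def should_alert(monitor: str, successes: int, total: int) -> bool:
    """Alert only when the rate falls strictly below the threshold."""
    return success_rate(successes, total) < ALERT_THRESHOLDS[monitor]


# Example: the 93% day mentioned earlier would alert; 98% would not.
print(should_alert("search_success_rate", 93, 100))
print(should_alert("search_success_rate", 98, 100))
```

Keeping the threshold and its follow-up action side by side matches the acceptance criterion that an error threshold should exist only where the team's response is documented.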

@jilladams
Contributor Author

@chriskim2311
Contributor

@jilladams Done!

@jilladams
Contributor Author

Gonna go ahead and close this. If we find monitors stay noisy, we can revisit under a new ticket with more specifics.


4 participants