Spike/monitor: Triage FormsAPI monitoring for Sept 30 deadline #15195

Closed
1 task
wesrowe opened this issue Sep 11, 2023 · 4 comments
Labels
backend Find a form CMS managed product, owned by Public Websites team Needs refining Issue status non-quarter-prio (PW) not related to quarterly priorities Public Websites Scrum team in the Sitewide crew sitewide VA.gov frontend CMS team practice area

Comments

@wesrowe
Contributor

wesrowe commented Sep 11, 2023

Description

User story

AS A PO/PM
I WANT PW to manually monitor every error case we know of that Veterans can encounter on the FaF product
SO THAT Veterans can't have problems that we don't learn about within 4 hours.

AS A PO/PM/Forms Stakeholder
I WANT members of the PW team to notice within 4 business hours when an API issue is impacting the Veteran Forms experience
SO THAT we can quickly respond to (or ideally anticipate) any Forms API issues that might impact Veterans.

Key failure modes we want to "alert" on (either be alerted automatically or notice through manual checking):

  • Rate limits: the rate of requests to the Forms API is nearing the current limit set by Lighthouse
  • Any other API error, such as the ones already monitored in Sentry

Engineering notes / background

Very helpful Slack thread started by Jill on this topic: https://dsva.slack.com/archives/CE4304QPK/p1694461749806729

Monitoring tools notes:

  • Sentry email notifications have been broken for a long time, so Sentry can only be monitored manually (log in and check)
  • Datadog is the tool of choice and Platform just got more seats – this is our most likely route to automated notifications
  • Dave (wearing his Platform hat) stated that it's the responsibility of product teams to set up monitoring on vets-api itself, rather than asking other systems to do it.
  • Hence, for Forms rate-limit monitoring, he wants us to set up monitoring on vets-api to log the outbound request rate to Lighthouse's Forms API, rather than asking them to set up an alert for us.
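
A minimal sketch of what "log the outbound request rate" could look like, for illustration only. vets-api is a Ruby application, so this Python is not the actual implementation; the class name `OutboundRateTracker` and its methods are hypothetical. It keeps timestamps of outbound calls in a trailing 60-second window so that a requests-per-minute figure can be logged or emitted as a metric.

```python
from collections import deque


class OutboundRateTracker:
    """Counts outbound requests in a trailing time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()  # request times in seconds, oldest first

    def record(self, timestamp):
        """Register one outbound request made at `timestamp` (seconds)."""
        self._timestamps.append(timestamp)

    def requests_in_window(self, now):
        """Drop entries that fell out of the window, then count what remains."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Usage: record each outbound call, then read the trailing-minute count.
tracker = OutboundRateTracker()
for t in (0.0, 10.0, 59.0):
    tracker.record(t)
current_rate = tracker.requests_in_window(now=65.0)  # the call at t=0.0 has aged out
```

In practice the count would be sent to whatever metrics pipeline the team already uses rather than held in process memory; that wiring is outside this sketch.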

Possible ACs

  • Asked Lighthouse if there is a notification system in place to warn teams when an endpoint's rate limit is being approached (update: we think we can set it up in Datadog)
  • Asked clarifying questions in Slack thread
  • Requested Datadog write access for 2 members of PW team
  • Audited Sentry monitoring to make sure it is covering infrastructure-related failure cases we know about as much as possible
    • Added additional Sentry monitoring/reports as needed
  • Documented how a PW team member can check Sentry monitoring
  • Implemented in Datadog (if possible):
    • Trigger alert when requests-per-minute exceed 50% (?) of current Lighthouse limit, which is 500/min (?)
  • Tested alert at lower threshold to ensure it works as expected
  • Subscribed 4 (?) members of PW to monitoring alerts, including Jill, Wes, ___
  • Scheduled meeting with Dave to review monitoring approach
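
The alert condition in the Datadog AC above can be sketched as plain logic. This is not Datadog monitor configuration; the function name `should_alert` is hypothetical, and both the 500/min limit and the 50% fraction are still marked "(?)" above, so they are assumptions passed in as parameters.

```python
def should_alert(requests_per_minute, limit_per_minute=500, threshold_fraction=0.5):
    """Return True when the observed rate exceeds the given fraction of the limit.

    The defaults mirror the unconfirmed values in the AC (500/min, 50%).
    """
    return requests_per_minute > limit_per_minute * threshold_fraction


# With the assumed defaults, the alert threshold is 250 requests/min.
fires_at_251 = should_alert(251)
fires_at_250 = should_alert(250)

# Testing at a lower threshold, as the AC suggests (5% is an arbitrary test value).
fires_in_test = should_alert(30, threshold_fraction=0.05)
```

Keeping the limit and the fraction as parameters makes it easy to run the same check at a deliberately low threshold before switching to the real one.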

Acceptance criteria

  • Checked Sentry report every morning and afternoon for spikes in errors and escalated any spikes to the Lighthouse FormsAPI team (Kristen Brown)
@wesrowe wesrowe added Needs refining Issue status Public Websites Scrum team in the Sitewide crew VA.gov frontend CMS team practice area Drupal engineering CMS team practice area Find a form CMS managed product, owned by Public Websites team labels Sep 11, 2023
@jilladams jilladams mentioned this issue Sep 11, 2023
26 tasks
@jilladams
Contributor

Another thread re: Datadog limitations / access: https://dsva.slack.com/archives/C52CL1PKQ/p1694537162096629

Adrian noted this comment will help us get the Datadog access required to be able to make Monitors: department-of-veterans-affairs/va.gov-team#65206 (comment)

@wesrowe wesrowe added backend Drupal engineering CMS team practice area non-quarter-prio (PW) not related to quarterly priorities and removed Drupal engineering CMS team practice area labels Sep 12, 2023
@wesrowe wesrowe changed the title Spike/monitor: Devise a combination of automated and/or manual monitoring of request rate to the Forms API Spike/monitor: Triage FormsAPI monitoring for Sept 30 deadline Sep 13, 2023
@jilladams jilladams added Ruby and removed Drupal engineering CMS team practice area labels Sep 14, 2023
@jilladams
Contributor

Previous forms threat space thinking: #9724

@jilladams
Contributor

Notes from sync with Wes, Randi, Chris, Daniel, Christia:

  • Epic for monitoring across products
  • Ticket for Datadog monitor of 429 (rate limit) responses on Forms
  • Ticket for non-200, non-429 error responses on Forms
    • There is some murk about where the errors originate
    • We need to determine a reasonable threshold for when to start caring about these kinds of errors, since they happen sporadically.
  • Request to LH forms team: Would love to see some kind of audit for errors we've seen over time and why. We may be able to do more on prevention if we understand this.
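
To make the two proposed tickets concrete, here is a rough sketch of how responses might be bucketed, assuming the split described in the notes: 200, 429, and everything else. The names `classify_status`, `summarize`, and `other_error_share` are hypothetical, and no threshold is suggested because that is still undecided.

```python
from collections import Counter


def classify_status(status):
    """Bucket an HTTP status the way the notes above split the tickets."""
    if status == 200:
        return "ok"
    if status == 429:
        return "rate_limited"
    return "other_error"


def summarize(statuses):
    """Count responses per bucket."""
    return Counter(classify_status(s) for s in statuses)


def other_error_share(statuses):
    """Fraction of responses that are neither 200 nor 429 (0.0 when empty)."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return summarize(statuses)["other_error"] / len(statuses)


# Usage with made-up sample data.
sample = [200, 200, 429, 500, 503, 200]
counts = summarize(sample)
share = other_error_share(sample)
```

Tracking a share rather than a raw count is one way to avoid reacting to the sporadic errors mentioned above, since a handful of failures among many requests stays small.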

Info on the 8/7 outage and root causes / API rate limits:

  • 240 requests / min = Old rate limit (here, from Kristen Brown)
  • ~262 requests/min = Highest rate we saw, at the height of 8/7 drama (here)
  • 500/min = New rate limit as of 8/8, after our increase request
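
For a sense of scale, the figures above can be turned into utilization percentages. The peak is approximate (~262/min), so the results are approximate too; the helper name `utilization` is just for illustration.

```python
OLD_LIMIT_PER_MIN = 240   # rate limit before 8/8
NEW_LIMIT_PER_MIN = 500   # rate limit as of 8/8
PEAK_PER_MIN = 262        # approximate highest observed rate on 8/7


def utilization(rate, limit):
    """Observed rate as a fraction of the limit."""
    return rate / limit


print(f"Peak vs old limit: {utilization(PEAK_PER_MIN, OLD_LIMIT_PER_MIN):.1%}")  # 109.2%
print(f"Peak vs new limit: {utilization(PEAK_PER_MIN, NEW_LIMIT_PER_MIN):.1%}")  # 52.4%
```

So the 8/7 peak was roughly 9% over the old limit, and the same traffic would use a little over half of the new one.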

@wesrowe wesrowe removed the Ruby label Sep 20, 2023
@wesrowe wesrowe mentioned this issue Sep 20, 2023
30 tasks
@wesrowe
Contributor Author

wesrowe commented Sep 21, 2023

Closing, as deadline was addressed outside of it.

2 participants