
Discovery: Silent Failures impact on Platform #92876

Closed
8 tasks
Tracked by #78234
jennb33 opened this issue Sep 13, 2024 · 8 comments
Assignees
Labels
2024, discovery, engineering (Engineering topics), platform-product-team, zero-silent-failures (Work related to eliminating silent failures)

Comments

@jennb33
Contributor

jennb33 commented Sep 13, 2024

User Story

As the managers and developers of the VA Platform,
We need to do a light discovery on the impact of mitigating silent failures,
So that we can have a better idea of scope and level of effort required by the Platform team.

Issue Description

Although there has been an announcement in DSVA Slack that silent failures are to be eliminated by 11/14/2024, there has not been any direct instruction from the Platform team's OCTODE PO on the scope of work or the level of involvement expected from the team.
This ticket is for that light discovery.

Tasks

  • Get user stories/examples of the silent failures so that we can track the frequency and timeframe of the failures
  • Identify what could be causing submission failures
  • Identify what could be causing the other silent failures
  • Identify a strategy for Platform to fix the failures once more is understood
  • Identify the impact to Platform in terms of the level of effort required to execute those strategies
  • Update these tasks based on the actual scope from the OCTODE PO
  • Document all findings from the above and link the documentation here or in the comments

Success Metrics

The Platform Product team will have a better idea of the level of effort required for this fast-moving OCTODE initiative, and can plan and point the work once the scope is clarified by the OCTODE PO.

Acceptance Criteria

  • There will be a strategy for mitigating possible impacts to the Platform Product team, once the submission failures and silent failures have been documented and analyzed.

NOTES:

  1. The scope for this project and the Platform Product team's involvement has yet to be determined. This is for preparation, and is subject to change once a clear scope of work is authorized by the OCTODE PO.
  2. Silent failure in this instance is defined as any Veteran-facing item that fails where the Veteran would have an opportunity or action to remediate it. For example: a Veteran fills out a form, the submission fails, and the Veteran isn't informed that it failed. It does not include an instance where a content editor creates and submits new content, the submission doesn't succeed, and the editor isn't informed.
  3. Please review this guidance document for suggestions from OCTO.

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

@jennb33 jennb33 added the needs-grooming, engineering (Engineering topics), needs-refinement, 2024, and discovery labels, and removed the needs-grooming and needs-refinement labels, on Sep 13, 2024
@jennb33 jennb33 assigned rjohnson2011 and unassigned rmtolmach Sep 30, 2024
@jennb33
Contributor Author

jennb33 commented Oct 3, 2024

10/3/2024 update: @rjohnson2011 is reviewing the silent-failures OCTO Slack channel; analysis can begin after that.

@jennb33
Contributor Author

jennb33 commented Oct 7, 2024

10/7/2024 update: @rjohnson2011 will reference the spreadsheet mentioned in the Slack thread. He might need to use CAG or a pass card; more information to come. He suspects there will be a lot of information in the spreadsheet that can be leveraged.

@rjohnson2011
Contributor

rjohnson2011 commented Oct 8, 2024

To evaluate the impact of silent failures on the platform, the following sources were consulted to gather relevant examples:

  • DataDog Monitors
  • Silent Failures Tracking Playground
  • Sidekiq Death Queue

Silent Failures Board

A GitHub Project Board has been created to track silent failures as they are discovered and remediated. They are organized into category columns, including Lighthouse, Sidekiq, Backend System Exhaustion, and No Category. The areas of most interest to Platform are Backend System Exhaustion, Sidekiq, and No Category.

(Screenshot of the Silent Failures Board)

Backend Systems Exhaustion

The tickets for these are all tagged for the Benefits team. For example, this ticket - Spike | Error: The required external domain headers ExternalUID or ExternalKey were missing or empty - addresses a nested error that is currently being investigated and resolved by Benefits engineers:

(ns0:Client) ID: 78709F68-B500-4A34-8B52-C4D0C804403B: The required external domain headers ExternalUID or ExternalKey were missing or empty

The impact of this error is a few occurrences per day (generally when 11 phone-number digits are entered), but it once spiked to 700 in a single day (possibly from re-submissions by a user hitting this error) and has since mostly resolved itself.

Sidekiq Exhaustion

This is generally the largest cause of silent failures. When queues get overloaded, jobs can sit queued for unacceptably long times, so a user may believe their information has been submitted when it is in fact still pending. Most of these have the potential to resolve themselves once the queue is finally processed, but because the user doesn't know this, functionality can be impacted and behave unexpectedly in the meantime.

Out of all these Sidekiq silent failures, the EVSS::DocumentUpload service is the highest-impact silent-failure service currently on VA.gov. Based on this weekly summary notification, the service has failed between 21 and 236 times per week over the last 4 weeks.
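
For context, one common way to make a job like this non-silent is Sidekiq's sidekiq_retries_exhausted hook. A minimal sketch (class and notifier names are hypothetical, not the actual vets-api implementation):

```ruby
# Hypothetical sketch (not the actual vets-api implementation): use
# Sidekiq's retries-exhausted hook so a dead job is logged and the user
# is notified instead of failing silently.
class DocumentUploadJob
  include Sidekiq::Job

  sidekiq_options queue: :default, retry: 10

  # Called once all retries are used up and the job is about to move to
  # the dead set -- the point where the failure would otherwise go silent.
  sidekiq_retries_exhausted do |job, exception|
    user_uuid, document_id = job['args']
    Rails.logger.error(
      "DocumentUploadJob exhausted retries for document #{document_id}: #{exception.message}"
    )
    # Hypothetical follow-up job that emails/texts the user (e.g. via VA Notify).
    UploadFailureNotificationJob.perform_async(user_uuid, document_id)
  end

  def perform(user_uuid, document_id)
    # Upload the document to the downstream service here.
  end
end
```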

No Category

Many of the "No Category" silent failures deal with errors stemming from bad file attachments. These can have more serious consequences since without an error, the user can proceed without an upload while not knowing anything is wrong. This is a moderately scoped problem, happening a couple dozen times per day on average so having a relatively small impact on Platform accounting for the hundreds of thousands of forms that are submitted daily on VA.gov.

Discovering Silent Failures

The documentation on how to discover silent failures lists a number of examples, typically involving API calls to Sidekiq or the Lighthouse Benefits Intake API. All of these are currently being addressed by their respective teams, with failure impacts ranging from a handful of instances per week to hundreds per day.
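
As a reference point, one low-effort way to sample what is accumulating in the Sidekiq dead set (jobs whose retries were exhausted, i.e. potential silent failures) is a console tally by job class. A sketch using Sidekiq's public API, not a step required by the documentation:

```ruby
# Sketch: tally dead (retries-exhausted) Sidekiq jobs by class so the
# highest-volume candidates for silent failures stand out.
require 'sidekiq/api'

counts = Hash.new(0)
Sidekiq::DeadSet.new.each { |job| counts[job.klass] += 1 }

counts.sort_by { |_klass, n| -n }.first(10).each do |klass, n|
  puts format('%-65s %5d', klass, n)
end
```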

Status Reports

Weekly Status Reports on silent failures are being delivered to OCTO. The latest one (10/4/24) notes 29 products as part of 2,685 potential silent errors being investigated.

The top Sidekiq silent failures discovered are:

  • EVSS::DisabilityCompensationForm::SubmitForm526AllClaim
  • Sidekiq::Form526BackupSubmissionProcess::Submit
  • EVSS::DisabilityCompensationForm::SubmitUploads
  • EVSS::DocumentUpload
  • EVSS::RequestDecision

All of these are being addressed by their various teams and do not require any further remediation effort from Platform at this time.

Platform Specific Failures

Failures that are specific to Platform that are not currently being addressed by other teams can be best found in the #oncall DSVA slack channel. The alert VETS API SIDEKIQ - Retry queue size is high occurs on occasion, and there have been several efforts to adjust the capacity to process these jobs without failure. A ticket can be opened to further analyze the Sidekiq logs and determine the root cause, which seems to stem from the following error:

BackendServiceException: {:source=>"VaNotify::Service", :code=>"VANOTIFY_400"}
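
If that ticket is opened, a first pass could simply group the retry queue by error class from a Rails console to confirm whether the VANOTIFY_400 errors dominate the backlog. A sketch assuming the standard Sidekiq retry-set API (not confirmed output):

```ruby
# Sketch: group retrying Sidekiq jobs by error class/message to see how
# much of the "retry queue size is high" alert traces back to
# VaNotify::Service VANOTIFY_400 errors.
require 'sidekiq/api'

by_error = Hash.new(0)
Sidekiq::RetrySet.new.each do |job|
  key = "#{job.item['error_class']}: #{job.item['error_message'].to_s[0, 80]}"
  by_error[key] += 1
end

by_error.sort_by { |_error, n| -n }.each { |error, n| puts "#{n}\t#{error}" }
```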

@humancompanion-usds humancompanion-usds added the zero-silent-failures Work related to eliminating silent failures label Oct 8, 2024
@rmtolmach
Contributor

rmtolmach commented Oct 8, 2024

Review takeaways

  • Has discovery been done to determine whether there are any failures the Platform (backend) is responsible for that we're not alerting on?
  • I don't think we need to open a ticket to further analyze sidekiq logs. But I could be misunderstanding this.

Failures that are specific to Platform that are not currently being addressed by other teams can be best found in the #oncall DSVA slack channel.

  • We can look at these alerts with a lens of "should a user (Veteran, claimant, 3rd party, etc.) be aware that this occurred?", i.e., does the effect of this alert impact users?
  • The backend monitors dashboard aggregates all of the vets-api related monitors. 👀 looking through them I don't think we have any action items for them. Some of them are symptoms of something user-facing going on (the Sidekiq one Ryan mentioned), but nothing that requires our action.
  • But also, are there any failures the Platform (backend) is responsible for that we're not alerting on? Do we need to look through all of the sidekiq jobs to make sure there is a team responsible for each one? And if any are connected to orphaned products, we adopt them?? I don't love the sound of that.

The alert VETS API SIDEKIQ - Retry queue size is high occurs on occasion, and there have been several efforts to adjust the capacity to process these jobs without failure. A ticket can be opened to further analyze the Sidekiq logs and determine the root cause...

When the sidekiq retry queue error occurs, it's usually a different Sidekiq job each time, so as long as every Sidekiq job is being evaluated as part of this ZSF effort, we should be good. The Datadog alert is more of a symptom than a cause of potential silent failures.

Pasting this from the documentation to keep in mind:

Silent failures are any errors that occur within your application on any public facing VA platform (VA.gov, mobile application, VA Notify, et. al.) whereby the user (Veteran, claimant, 3rd party, etc.) interacting with your application is NOT made aware that an error has occurred.

@rjohnson2011
Contributor

rjohnson2011 commented Oct 8, 2024

Has discovery been done to determine whether there are any failures the Platform (backend) is responsible for that we're not alerting on?

I've looked through multiple DD dashboards owned by Platform as well as the #oncall alerts to try to discover anything that might be impacting users without them knowing. It's difficult to find an error that would be classified as 'silent' that's not already being accounted for or known. Most of our Sidekiq errors, for instance, resolve once scaling occurs and don't have a notable impact on the user.

I don't think we need to open a ticket to further analyze sidekiq logs. But I could be misunderstanding this.

I think that's valid. There is a concern that I'm 'trying' to find errors that aren't really there to fulfill the objective. We have extensive monitoring in place and any failures that occur in Platform are typically resolved quickly.

We can look at these alerts with a lens of "should a user (Veteran, claimant, 3rd party, etc.) be aware that this occurred?", i.e., does the effect of this alert impact users?

Again, most of these are scaling issues for which it would be difficult to create a user-facing alert. They have also slowed down considerably since we started autoscaling according to queue size.

The backend monitors dashboard aggregates all of the vets-api related monitors. 👀 looking through them I don't think we have any action items for them. Some of them are symptoms of something user-facing going on (the Sidekiq one Ryan mentioned), but nothing that requires our action.

I agree, nothing in the research jumped out as an action item that requires immediate action or is unaccounted for by the various teams where we could step in.

But also, are there any failures the Platform (backend) is responsible for that we're not alerting on? Do we need to look through all of the sidekiq jobs to make sure there is a team responsible for each one? And if any are connected to orphaned products, we adopt them?? I don't love the sound of that.

This is a large ask, but depending on the need for accountability this could be an initiative to make certain all Sidekiq jobs are healthy and monitored. I would argue our Sidekiq monitors and extensive DataDog error monitors allow us to look over this with a large degree of confidence. Errors tend to be noticed and bubble up quickly, typically first among developers in staging, even before users experience them at wide scale.

@rmtolmach
Contributor

But also, are there any failures the Platform (backend) is responsible for that we're not alerting on? Do we need to look through all of the sidekiq jobs to make sure there is a team responsible for each one? And if any are connected to orphaned products, we adopt them?? I don't love the sound of that.

This is a large ask, but depending on the need for accountability this could be an initiative to make certain all Sidekiq jobs are healthy and monitored.

There are 104 files in vets-api that end in _job.rb, meaning there are at least that many sidekiq jobs. Do we need to check all of them to see if there is proper error handling? 😰
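
If we do end up scoping that, a rough first cut could be a script that checks which job files declare any retries-exhausted handling at all. A sketch with assumed paths and a grep-level heuristic, not a real audit:

```ruby
# Rough sketch: count vets-api *_job.rb files that declare a
# sidekiq_retries_exhausted hook. A text-match heuristic only -- it says
# nothing about whether the handling is actually adequate.
job_files = Dir.glob('{app,lib,modules}/**/*_job.rb')

handled, unhandled = job_files.partition do |path|
  File.read(path).include?('sidekiq_retries_exhausted')
end

puts "#{job_files.size} job files found"
puts "#{handled.size} declare sidekiq_retries_exhausted"
puts "#{unhandled.size} do not, e.g.:"
unhandled.first(20).each { |path| puts "  #{path}" }
```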

@rmtolmach
Contributor

Question for @LindseySaari here in Slack before we close out this ticket.

@rmtolmach
Contributor

I created #94632 as follow-up based on the discovery done in this ticket. Closing!
