Discovery: Silent Failures impact on Platform #92876
Comments
10/3/2024 update: @rjohnson2011 is reviewing the silent failures OCTO Slack channel; then analysis can begin.
10/7/2024 update: @rjohnson2011 will reference the spreadsheet mentioned in the Slack thread. He might have to use CAG or a pass card; more information to come. He suspects the spreadsheet will contain a lot of information that can be leveraged.
To evaluate the impact of silent failures on the platform, the following sources were consulted to gather relevant examples: the DataDog monitors and the Silent Failures Tracking Playground.
Silent Failures Board
A GitHub Project Board has been created to track silent failures as they are discovered and remediated. They are organized into category columns, including Lighthouse, Sidekiq, Backend System Exhaustion, and No Category. The areas of most interest to Platform are Backend System Exhaustion, Sidekiq, and No Category.
Backend Systems Exhaustion
The tickets for these are all tagged for the Benefits team. For example, this ticket - Spike | Error: The required external domain headers ExternalUID or ExternalKey were missing or empty - addresses a nested error that's currently being investigated and resolved by Benefits engineers.
The impact of these errors is a few occurrences per day (generally when 11 phone-number digits are entered), but it once spiked to around 700 in a single day (possibly from re-submissions by a user hitting this error) and has since mostly resolved itself.
Sidekiq Exhaustion
This is generally the largest cause of silent failures. When queues get overloaded, jobs can sit queued for unacceptably long times, so a user believes their information has been submitted when it is in fact still pending. Most of these have the potential to eventually resolve themselves once the queue is processed, but because the user doesn't know this, functionality can be impacted and behave unexpectedly in the meantime. Of all the Sidekiq silent failures, the EVSS::DocumentUpload service is currently the highest-impact silent failure service on VA.gov. Based on its weekly summary notification, the service has failed between 21 and 236 times per week over the last 4 weeks.
No Category
Many of the "No Category" silent failures deal with errors stemming from bad file attachments. These can have more serious consequences because, without an error, the user can proceed without an upload while not knowing anything is wrong. This is a moderately scoped problem, happening a couple dozen times per day on average, so its impact on Platform is relatively small given the hundreds of thousands of forms submitted daily on VA.gov.
Discovering Silent Failures
The documentation on how to discover silent failures lists a number of examples, typically involving API calls to Sidekiq or the Lighthouse Benefits Intake API. All of these are currently being addressed by their respective teams, and their failure impact varies from a handful of instances per week to hundreds per day.
Status Reports
Weekly status reports on silent failures are being delivered to OCTO. The latest one (10/4/24) includes 29 products noted as part of 2,685 potential silent errors being investigated. The top Sidekiq silent failures discovered are all being addressed by their various teams and do not require any further remediation effort from Platform at this time.
Platform-Specific Failures
Failures specific to Platform that are not currently being addressed by other teams can best be found in the #oncall DSVA Slack channel. The alert "VETS API SIDEKIQ - Retry queue size is high" occurs on occasion, and there have been several efforts to adjust capacity so these jobs can be processed without failure. A ticket can be opened to further analyze the Sidekiq logs and determine the root cause of the error behind this alert.
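For context on why exhausted Sidekiq jobs stay silent: once a job runs out of retries it moves to the dead set with no user-facing signal. Below is a minimal sketch of the general Sidekiq pattern for surfacing exhaustion; the class name and the notification step are illustrative assumptions, not the actual EVSS::DocumentUpload implementation.

```ruby
require 'sidekiq'

# Hypothetical job for illustration only, not the EVSS::DocumentUpload code.
class ExampleDocumentUploadJob
  include Sidekiq::Job
  sidekiq_options retry: 16

  # Runs once all retries are exhausted, just before the job lands in the dead
  # set. Emitting a log, metric, or notification here keeps the failure from
  # being silent.
  sidekiq_retries_exhausted do |msg, exception|
    Sidekiq.logger.error(
      "#{msg['class']} exhausted retries for args=#{msg['args'].inspect}: #{exception&.message}"
    )
  end

  def perform(document_id)
    # ... upload logic would go here ...
  end
end
```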
Review takeaways
When the Sidekiq retry queue error occurs, it's usually a different Sidekiq job each time, so as long as every Sidekiq job is being evaluated as part of this ZSF effort, we should be good. The Datadog alert is more of a symptom than a cause of potential silent failures; the relevant guidance from the documentation is worth keeping in mind.
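One way to sanity-check the "different job each time" observation is to group the current retry set by job class from a Rails console. This is a sketch using the standard Sidekiq API, not something prescribed by the ticket:

```ruby
require 'sidekiq/api'

# Tally the jobs currently sitting in the retry queue by class, so a spike in
# the "Retry queue size is high" alert can be traced to specific jobs.
counts = Hash.new(0)
Sidekiq::RetrySet.new.each { |job| counts[job.klass] += 1 }

counts.sort_by { |_klass, count| -count }.each do |klass, count|
  puts format('%-60s %d', klass, count)
end
```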
I've looked through multiple DD dashboards owned by Platform as well as the #oncall alerts to try to discover anything that might be impacting users without them knowing. It's difficult to find an error that would be classified as 'silent' that isn't already being accounted for or known. Most of our Sidekiq errors, for instance, resolve once scaling occurs and don't have a notable impact on the user.
I think that's valid. There is a concern that I'm 'trying' to find errors that aren't really there just to fulfill the objective. We have extensive monitoring in place, and any failures that occur in Platform are typically resolved quickly.
Most of these, again, are scaling issues that would be difficult to create a user-facing alert for. They have also slowed down considerably since we started autoscaling based on queue size.
I agree; nothing in the research jumped out as an action item requiring immediate attention, or as something unaccounted for by the various teams where we could step in.
This is a large ask, but depending on the need for accountability, this could be an initiative to make certain all Sidekiq jobs are healthy and monitored. I would argue that our Sidekiq monitors and extensive DataDog error monitors allow us to look over this with a large degree of confidence. Errors tend to be noticed and bubble up quickly, typically first among developers in staging, even before users experience them at wide scale.
There are 104 files in vets-api that end in |
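If an "all Sidekiq jobs are monitored" initiative ever moves forward, a rough first-pass audit could be scripted along these lines. This is a hypothetical sketch: the glob patterns and file suffix are assumptions (the exact suffix referenced in the comment above isn't captured in this extract), and grepping the source for the exhaustion hook is only a heuristic.

```ruby
# Hypothetical first-pass audit: find Sidekiq job files in vets-api and flag
# any whose source never mentions a retries-exhausted hook.
job_files = (Dir.glob('app/sidekiq/**/*_job.rb') + Dir.glob('app/workers/**/*.rb')).uniq

missing_hook = job_files.reject { |path| File.read(path).include?('sidekiq_retries_exhausted') }

puts "Job files found: #{job_files.size}"
puts "Without an exhaustion hook: #{missing_hook.size}"
missing_hook.each { |path| puts "  #{path}" }
```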
Question for @LindseySaari here in Slack before we close out this ticket.
I created #94632 as a follow-up based on the discovery done in this ticket. Closing!
User Story
As the managers and developers of the VA Platform,
We need to do a light discovery on the impact of silent failures and how they are mitigated,
So that we can have a better idea of scope and level of effort required by the Platform team.
Issue Description
Although there has been an announcement in DSVA Slack that silent failures are to be eliminated by 11/14/2024, there has not been any direct instruction from the Platform team's OCTODE PO on the scope of work or the team's level of involvement.
This ticket is for light discovery.
Tasks
Success Metrics
The Platform Product team will have a better idea of the level of effort required for this fast-moving OCTODE initiative, and can plan and point the work once the scope is clarified by the OCTODE PO.
Acceptance Criteria
NOTES:
Validation
Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.