Discovery: Silent Failures impact on Platform #92876
Comments
10/3/2024 update: @rjohnson2011 is reviewing the silent failures OCTO Slack channel; then analysis can begin.
10/7/2024 update: @rjohnson2011 will reference the spreadsheet mentioned in the Slack thread. He might have to use CAG or a pass card; more information to come. He suspects the spreadsheet will contain a lot of information that can be leveraged.
To evaluate the impact of silent failures on the platform, the following sources were consulted to gather relevant examples: the DataDog monitors and the Silent Failures Tracking Playground.
Silent Failures Board
A GitHub Project Board has been created to track silent failures as they are discovered and remediated. They are organized into category columns, including Lighthouse, Sidekiq, Backend System Exhaustion, and No Category. The areas of most interest to Platform are Backend System Exhaustion, Sidekiq, and No Category.
Backend Systems Exhaustion
The tickets for these are all tagged for the Benefits team. For example, this ticket - Spike | Error: The required external domain headers ExternalUID or ExternalKey were missing or empty - addresses a nested error that's currently being investigated and resolved by Benefits engineers.
The impact of these errors is a few occurrences per day (generally when 11 phone-number digits are entered), but it once spiked to around 700 in a single day (possibly from re-submissions by a user hitting this error) and has since mostly resolved itself.
Sidekiq Exhaustion
This is generally the largest cause of silent failures. When queues get overloaded, jobs can sit queued for unacceptably long times, so a user believes their information has been submitted when it is in fact still pending. Most of these have the potential to eventually resolve themselves once the queue is processed, but because the user doesn't know this, functionality can be impacted and behave unexpectedly in the meantime. Of all the Sidekiq silent failures, the EVSS::DocumentUpload service is currently the highest-impact silent failure service on VA.gov. Based on its weekly summary notification, the service has failed between 21 and 236 times per week over the last 4 weeks.
No Category
Many of the "No Category" silent failures deal with errors stemming from bad file attachments. These can have more serious consequences because, without an error, the user can proceed without an upload while not knowing anything is wrong. This is a moderately scoped problem, happening a couple dozen times per day on average, so its impact on Platform is relatively small given the hundreds of thousands of forms submitted daily on VA.gov.
Discovering Silent Failures
The documentation on how to discover silent failures lists a number of examples, typically involving API calls to Sidekiq or the Lighthouse Benefits Intake API. All of these are currently being addressed by their respective teams, and their failure impact varies from a handful of instances per week to hundreds per day.
Status Reports
Weekly status reports on silent failures are being delivered to OCTO. The latest one (10/4/24) includes 29 products noted as part of 2,685 potential silent errors being investigated. The top Sidekiq silent failures discovered are all being addressed by their various teams and do not require any further remediation effort from Platform at this time.
Platform-Specific Failures
Failures specific to Platform that are not currently being addressed by other teams can best be found in the #oncall DSVA Slack channel. The alert "VETS API SIDEKIQ - Retry queue size is high" occurs on occasion, and there have been several efforts to adjust capacity so these jobs can be processed without failure. A ticket can be opened to further analyze the Sidekiq logs and determine the root cause of the error behind this alert.
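For context on why exhausted Sidekiq jobs stay silent: once a job runs out of retries it moves to the dead set with no user-facing signal. Below is a minimal sketch of the general Sidekiq pattern for surfacing exhaustion; the class name and the notification step are illustrative assumptions, not the actual EVSS::DocumentUpload implementation.

```ruby
require 'sidekiq'

# Hypothetical job for illustration only, not the EVSS::DocumentUpload code.
class ExampleDocumentUploadJob
  include Sidekiq::Job
  sidekiq_options retry: 16

  # Runs once all retries are exhausted, just before the job lands in the dead
  # set. Emitting a log, metric, or notification here keeps the failure from
  # being silent.
  sidekiq_retries_exhausted do |msg, exception|
    Sidekiq.logger.error(
      "#{msg['class']} exhausted retries for args=#{msg['args'].inspect}: #{exception&.message}"
    )
  end

  def perform(document_id)
    # ... upload logic would go here ...
  end
end
```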
Review takeaways
When the Sidekiq retry queue error occurs, it's usually a different Sidekiq job each time, so as long as every Sidekiq job is being evaluated as part of this ZSF effort, we should be good. The Datadog alert is more of a symptom than a cause of potential silent failures; the relevant guidance from the documentation is worth keeping in mind.
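One way to sanity-check the "different job each time" observation is to group the current retry set by job class from a Rails console. This is a sketch using the standard Sidekiq API, not something prescribed by the ticket:

```ruby
require 'sidekiq/api'

# Tally the jobs currently sitting in the retry queue by class, so a spike in
# the "Retry queue size is high" alert can be traced to specific jobs.
counts = Hash.new(0)
Sidekiq::RetrySet.new.each { |job| counts[job.klass] += 1 }

counts.sort_by { |_klass, count| -count }.each do |klass, count|
  puts format('%-60s %d', klass, count)
end
```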
I've looked through multiple DD dashboards owned by Platform as well as the #oncall alerts to try to discover anything that might be impacting users without them knowing. It's difficult to find an error that would be classified as 'silent' that isn't already being accounted for or known. Most of our Sidekiq errors, for instance, resolve once scaling occurs and don't have a notable impact on the user.
I think that's valid. There is a concern that I'm 'trying' to find errors that aren't really there just to fulfill the objective. We have extensive monitoring in place, and any failures that occur in Platform are typically resolved quickly.
Most of these, again, are scaling issues that would be difficult to create a user-facing alert for. They have also slowed down considerably since we started autoscaling based on queue size.
I agree; nothing in the research jumped out as an action item requiring immediate attention, or as something unaccounted for by the various teams where we could step in.
This is a large ask, but depending on the need for accountability, this could be an initiative to make certain all Sidekiq jobs are healthy and monitored. I would argue that our Sidekiq monitors and extensive DataDog error monitors allow us to look over this with a large degree of confidence. Errors tend to be noticed and bubble up quickly, typically first among developers in staging, even before users experience them at wide scale.
There are 104 files in vets-api that end in |
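If an "all Sidekiq jobs are monitored" initiative ever moves forward, a rough first-pass audit could be scripted along these lines. This is a hypothetical sketch: the glob patterns and file suffix are assumptions (the exact suffix referenced in the comment above isn't captured in this extract), and grepping the source for the exhaustion hook is only a heuristic.

```ruby
# Hypothetical first-pass audit: find Sidekiq job files in vets-api and flag
# any whose source never mentions a retries-exhausted hook.
job_files = (Dir.glob('app/sidekiq/**/*_job.rb') + Dir.glob('app/workers/**/*.rb')).uniq

missing_hook = job_files.reject { |path| File.read(path).include?('sidekiq_retries_exhausted') }

puts "Job files found: #{job_files.size}"
puts "Without an exhaustion hook: #{missing_hook.size}"
missing_hook.each { |path| puts "  #{path}" }
```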
Question for @LindseySaari here in Slack before we close out this ticket.
I created #94632 as a follow-up based on the discovery done in this ticket. Closing!
User Story
As the managers and developers of the VA Platform,
We need to do a light discovery on the impact of silent failures and how they are mitigated,
So that we can have a better idea of scope and level of effort required by the Platform team.
Issue Description
Although there has been an announcement in DSVA Slack that silent failures are to be eliminated by 11/14/2024, there has not been any direct instruction from the Platform team's OCTODE PO on the scope of work or the team's level of involvement.
This ticket is for light discovery.
Tasks
Success Metrics
The Platform Product team will have a better idea of the level of effort required for this fast-moving OCTODE initiative, and can plan and point the work once the scope is clarified by the OCTODE PO.
Acceptance Criteria
NOTES:
Validation
Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.