Incident Response
The VRO Platform is a moderately complex system that integrates with other systems and is hosted on shared infrastructure. Unplanned service failures and disruptions are inevitable. The first objective of the incident response process is to restore normal service operation as quickly as possible and minimize the incident's impact. This document describes the core steps for responding to incidents.
The VRO team defines an incident as an unplanned event or occurrence that disrupts normal operations, services, or functions on the VRO platform. An incident negatively impacts the availability, performance, security, or functionality of a VRO service and requires immediate attention to mitigate its effects and restore normalcy. Incidents can vary widely in scope and severity and can be caused by factors in or out of the VRO team's control.
Incident Type | Description |
---|---|
Service Outages | Complete or partial unavailability of infrastructure services. |
Performance Degradation | Noticeable slowdown or inefficiency in infrastructure services. |
Security Breaches | Unauthorized access, data breaches, or vulnerabilities affecting infrastructure integrity, confidentiality, or availability. |
Operational Failures | Failures in deployment pipelines, configuration management, or automated processes impacting normal operations. |
Resource Exhaustion | Over-utilization or exhaustion of resources leading to degraded service. |
Unexpected Behavior | Anomalies or unexpected behaviors in infrastructure services affecting development, testing, or deployment activities. |
The root cause of an incident is investigated and attributed to the team responsible for the issue's origin, not necessarily the team addressing it, to ensure accurate reporting and unbiased resolution efforts. The GitHub issue for an incident should be annotated with the RC label corresponding to the root cause, and the root cause should also be documented in the final Incident Report.
Root Cause Type | Description |
---|---|
VRO | An issue on the VRO platform that ties directly to our team's scope of responsibilities. These incidents are tracked to measure our MTTR (Mean Time to Resolve) metric. |
Partner Team Application or External VA | An issue with an application controlled by the partner team, or an issue caused by an external VA system not functioning as expected. |
LHDI | An issue on the LHDI platform or ArgoCD, Aqua, k8s, SecRel, and Vault. These are not within the VRO team's control, but the VRO team reports them to LHDI and works to resolve them in partnership with LHDI. |
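The MTTR calculation mentioned in the table can be sketched as follows. This is a minimal illustration, not the team's actual tooling: the `Incident` class, its field names, and the choice of "reported" and "resolved" timestamps as the start and end of the measurement are assumptions, since this page does not define exactly which timestamps feed the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    reported_at: datetime  # assumed start of the measurement
    resolved_at: datetime  # assumed end of the measurement
    root_cause: str        # "VRO", "LHDI", or "Partner Team or External VA"


def mean_time_to_resolve(incidents: List[Incident]) -> Optional[timedelta]:
    """Average resolution time over incidents whose root cause is VRO."""
    durations = [
        i.resolved_at - i.reported_at for i in incidents if i.root_cause == "VRO"
    ]
    if not durations:
        return None  # no VRO-attributed incidents, so no metric to report
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 11, 0), "VRO"),
    Incident(datetime(2024, 1, 9, 13, 0), datetime(2024, 1, 9, 15, 0), "VRO"),
    Incident(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 17, 0), "LHDI"),
]
print(mean_time_to_resolve(incidents))
```

Only the two VRO-attributed incidents count toward the average; the LHDI incident is excluded, matching the rule that MTTR tracks incidents within the VRO team's scope.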
By default, incident response is the responsibility of the VRO primary on-call engineer, a role that rotates with every VRO sprint period. Throughout the process, the on-call engineer might personally conduct each step or delegate tasks as needed; regardless, a single individual should be identified as being in charge of the incident response. If this responsibility needs to be transferred while an incident is active, the handoff should be explicitly communicated. Working hours are 9am - 5pm ET. Any incidents that occur or are reported outside of these hours will be addressed when the on-call engineer resumes their working schedule.
See the On Call Responsibilities page for more information.
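The working-hours rule above can be sketched as a small function that returns when a report would first be picked up. This is only an illustration under stated assumptions: timestamps are naive datetimes already expressed in ET, the working schedule is Monday through Friday, and holidays are ignored. None of those details are specified on this page, and `earliest_response_time` is a hypothetical name.

```python
from datetime import datetime, time, timedelta

WORKDAY_START = time(9, 0)   # 9am ET, from the text above
WORKDAY_END = time(17, 0)    # 5pm ET, from the text above


def earliest_response_time(reported_at: datetime) -> datetime:
    """Return when the on-call engineer would first be available.

    Assumptions: `reported_at` is already in ET, the working week is
    Monday-Friday, and holidays are not considered.
    """
    is_weekday = reported_at.weekday() < 5
    if is_weekday and WORKDAY_START <= reported_at.time() < WORKDAY_END:
        return reported_at  # reported during working hours
    candidate = reported_at
    # Stay on the same day only if the report arrived before a weekday's start.
    if not (is_weekday and reported_at.time() < WORKDAY_START):
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
        candidate += timedelta(days=1)
    return datetime.combine(candidate.date(), WORKDAY_START)


# A report on Friday evening is picked up on Monday at 9am.
print(earliest_response_time(datetime(2024, 1, 12, 18, 30)))
```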
The Report a VRO Incident Slack Workflow is the intake form and process to escalate all problems discovered on the VRO platform. This workflow can be used by any member of the VA OCTO Slack workspace, and should be used for reporting incidents discovered both internally (i.e., by the on-call engineer and other VRO team members) and externally (i.e., by a partner team or third party).
The workflow can be found in the Workflows folder in the #benefits-vro, #benefits-vro-support, and #benefits-vro-on-call channels, or by using the command shortcut /report a vro incident.
Below are two videos demonstrating the steps and tasks the VRO on-call engineer and the partner teams go through when using the workflow.
Partner Team Demo Video
ReportIncidents_PartnerTeamView.mp4
- 0:00: Find the Incident Report bookmark in #benefits-vro-support.
- 0:07: Fill out the form.
- 0:56: Observe the automated post.
- 1:08: Observe the acknowledgment from the responding engineer.
- 1:24: Post additional comments in the thread.
- 1:44: Receive status updates in the thread.
- 2:03: Receive notification when the incident is resolved.
VRO Team Demo Video
ReportIncidents_VROTeamView.mp4
- 0:01: Step 0: React with 👀 on the Incident Report Slack post.
- 0:07: Step 0: Click the Acknowledge button on the PagerDuty post.
- 0:14: Step 1: Post an initial update.
- 0:20: Step 2: Post internal notes to #benefits-vro-on-call.
- 0:40: Step 2: Post general updates to #benefits-vro-support.
- 1:12: Click the Next Step button as tasks are completed.
- 1:38: Step 3: Update the GitHub issue and the Incidents epic.
- 2:51: Click the Next Step button as tasks are completed.
- 3:03: Step 5: Log the incident on the wiki.
- 4:10: Step 5: Close the GitHub issue.
- 4:20: Click the Next Step button as tasks are completed.
- 4:26: Step 5: React with on the Incident Report Slack post.
Note
There are discrepancies between these demo videos and the current iteration of the workflow. Some of the language and formatting of the automated messages have changed since the videos were recorded, but the steps and tasks remain consistent.
Generated using draw.io. Source file: incidentResponse.drawio.v2.txt (remove the .txt extension to use in draw.io)
- Step 0: Acknowledge
- Step 1: Triage
- Step 2: Contain/Stabilize
- Step 3: Remediate (short-term)
- Step 4: Monitor
- Step 5: Create Incident Report
- Step 6: Post-mortem Review
- Step 7: Long-term Remediation
Step 0: Acknowledge
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Communication to the person escalating a potential incident that VRO will begin an investigation. | Reduce the likelihood of uncoordinated troubleshooting efforts, reduce panic, and establish consistent data points for calculating incident metrics | 2 minutes | Within 60 minutes of when the incident is reported |
Tasks
Tasks for this step vary depending on the source or person reporting the incident.
- 'Report a VRO Incident' Slack workflow
  - React with 👀 on the post in #benefits-vro-support
  - Click the Acknowledge button on the PagerDuty post in #benefits-vro-on-call
- Non-workflow Slack post
  - React with 👀 to the post
  - Use the Report a VRO Incident Slack workflow
  - React with 👀 on the post in #benefits-vro-support
- Email message or Slack DM from a known partner or stakeholder
  - Reply: "Thank you for reporting this incident. If you have access to the VA OCTO Slack, the resolution of this incident will be documented in the #benefits-vro-support channel."
  - Use the Report a VRO Incident Slack workflow
  - React with 👀 on the post in #benefits-vro-support
- Email message from an unknown party
  - Consult the VRO team and OCTO Enablement Team
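Because the acknowledgment tasks depend on where the report came from, they can be read as a lookup table. The sketch below encodes the four cases listed above; the source keys (`slack_workflow`, `slack_post`, and so on) and the function name are hypothetical labels introduced for illustration and are not part of the Slack workflow itself.

```python
from typing import Dict, List

# Task wording is taken from the list above; the dictionary keys are
# hypothetical labels for the four report sources.
WORKFLOW_REACTION = "React with 👀 on the post in #benefits-vro-support"
USE_WORKFLOW = "Use the Report a VRO Incident Slack workflow"

ACK_TASKS: Dict[str, List[str]] = {
    "slack_workflow": [
        WORKFLOW_REACTION,
        "Click the Acknowledge button on the PagerDuty post in #benefits-vro-on-call",
    ],
    "slack_post": ["React with 👀 to the post", USE_WORKFLOW, WORKFLOW_REACTION],
    "known_partner_message": [
        "Reply with the thank-you message quoted above",
        USE_WORKFLOW,
        WORKFLOW_REACTION,
    ],
    "unknown_email": ["Consult the VRO team and OCTO Enablement Team"],
}


def acknowledgment_tasks(source: str) -> List[str]:
    """Return the Step 0 tasks for a given report source."""
    if source not in ACK_TASKS:
        raise ValueError(f"Unknown report source: {source!r}")
    return list(ACK_TASKS[source])  # copy so callers cannot mutate the table


for task in acknowledgment_tasks("slack_post"):
    print("-", task)
```

Note that every path except the unknown-party email ends with the incident being logged through the Slack workflow, which keeps the data points for incident metrics consistent.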
Step 1: Triage
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Share a brief assessment that determines an initial severity level (SEV), the affected systems, and next steps. | Gain situational awareness, respond with appropriate urgency, and provide a prompt update | 10 minutes | Within 30 minutes of Step 0 |
Tasks
- Post an initial update in the thread of the automated workflow message on the #benefits-vro-support channel.
- Review the Severity Levels (SEV) below for guidance on appropriate message language and frequency for subsequent updates.
SEV 1: Critical - Core functionality is unavailable or buggy
Examples: a VRO app appears offline, a VRO app's transactions are failing, a VRO data service appears unresponsive, inaccurate data is transmitted
Priority: immediate investigation
Sample language:
"We are currently investigating this incident. We will provide detailed updates of our findings and actions every ~30 minutes until the issue is resolved."
SEV 2: High - Core functionality is degraded
Examples: increased latency; increased retry attempts
Priority: immediate investigation
Initial Message:
- Confirmation that investigation is underway
- Subsequent updates will be sent every 30-60 minutes after the investigation has begun
Sample language:
"We are currently investigating this incident. We will provide detailed updates of our findings and actions every ~30 - 60 minutes until the issue is resolved."
SEV 3: Medium - Core functionality metrics are affected, but without noticeable performance degradation
Examples: sustained increase in CPU utilization; sustained increase in open database connections
Priority: investigation within the next business day; continued passive monitoring in the meantime
Initial Message:
- Confirmation that investigation will begin within the next business day
- Subsequent updates will be sent every hour after the investigation has begun
Sample language:
"We will investigate this incident within the next business day. We will provide detailed updates on our findings and actions every hour until the issue is resolved."
SEV 4: Low - Non-core functionality is affected
Examples: gaps or increased latency in transmitting data to an analytics platform
Priority: investigation within the next 1-2 business days, limited solely to identifying the root cause; continued passive monitoring in the meantime
Initial Message:
- Confirmation of when the investigation begins
- The next updates will be sent as needed
Sample language:
"We will investigate this incident at our earliest convenience. We will provide updates on our findings and actions as needed."
- All subsequent updates should include:
  - Current status and progress made so far
  - Specific actions taken since the last update
  - Any changes to the estimated resolution time
  - Next update time
See VRO Services, Points of Contact, and Issue Escalation Paths for more information.
Important
During the resolution process of an incident, the root cause is not included in updates or communications with partner teams to avoid bias and ensure impartiality.
Tip
- Use succinct and specific language; avoid jargon or acronyms that non-VRO team members may not be familiar with.
- Use Slack's /remind feature to set up notifications of when the next update is due. For example: /remind me in 30 minutes to post an update
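The update cadence per severity level and the /remind tip can be combined in a short sketch. The intervals come from the SEV guidance above; using the upper bound of 60 minutes for SEV 2 is an assumption, as are the function and constant names.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minutes between status updates per SEV level, from the guidance above.
# SEV 2 uses the upper end of its 30-60 minute range (an assumption).
# None means "as needed": SEV 4 has no fixed cadence.
UPDATE_INTERVAL_MINUTES = {1: 30, 2: 60, 3: 60, 4: None}


def next_update_due(sev: int, last_update_at: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None for SEV 4."""
    minutes = UPDATE_INTERVAL_MINUTES[sev]
    if minutes is None:
        return None
    return last_update_at + timedelta(minutes=minutes)


def remind_command(sev: int) -> Optional[str]:
    """Build the Slack /remind text for the next update, or None for SEV 4."""
    minutes = UPDATE_INTERVAL_MINUTES[sev]
    if minutes is None:
        return None
    return f"/remind me in {minutes} minutes to post an update"


print(next_update_due(1, datetime(2024, 1, 8, 10, 0)))
print(remind_command(1))
```

For SEV 3, the cadence applies once the investigation has begun, which may be the next business day rather than immediately after the initial message.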
Step 2: Contain/Stabilize
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Work to prevent further damage and provide status updates. | Containing the situation might provide more immediate relief than implementing a remediation. | Varies | SEV 1, SEV 2: immediately after Step 1; SEV 3, SEV 4: as soon as VRO can prioritize |
Tasks
- Post internal notes to #benefits-vro-on-call
- Post updates in the thread of the #benefits-vro-support message at the frequency defined for the respective severity level.
- Click the Next Step button on the Report a VRO Incident Slack workflow.
Considerations
- Is there a configuration change to prevent requests to the buggy system?
- Would an increase in computing resources temporarily stabilize the system?
Step 3: Remediate (short-term)
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Get the system back to a minimally acceptable operating status. | Reduce the likelihood that the incident will reoccur in the short term, and increase the likelihood that the system will be stable. | Varies | SEV 1, SEV 2: immediately after Step 2; SEV 3, SEV 4: as soon as VRO can prioritize |
Tasks
- Apply a fix or temporary resolution based on the investigation.
- Post general updates to the thread in #benefits-vro-support at the frequency defined for the respective severity level.
- Update the GitHub issue that was created.
  - Add blue VRO-team label
  - Add root cause label
    - RC VRO
    - RC LHDI
    - RC Partner Team or External VA
  - Assign to the engineer(s) responding to the incident
  - Include SEV level
  - Add to the current sprint
  - Add to the Incidents Epic
  - Arrange to discuss the incident as a 16th-minute item during Daily Scrum
- Click the Next Step button on the Report a VRO Incident Slack workflow.
Considerations
- Should compute resources be recalibrated?
- Would a rollback or roll-forward of code/configuration be appropriate and feasible?
Step 4: Monitor
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Look for data points that indicate the incident is under control. As needed, return to Step 2 and Step 3. | Gain confidence that the incident is under control. | Minimum of 30 minutes | N/A |
Tasks
- (as needed) Post internal notes to #benefits-vro-on-call.
- Post general updates to the thread in #benefits-vro-support at the frequency defined for the respective severity level.
Step 5: Create Incident Report
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Document the incident in the public Incident Reports wiki page. | Build and maintain a record of incidents that can reveal patterns, inform engineering decisions, and be a general resource for partner teams and the enablement team | 30 minutes | Within 1 business day of completing Step 4 |
Tasks
- Create an Incident Report on the wiki page.
  - If the incident is a SEV 1 or SEV 2, reach out to partner teams to gather impact metrics (i.e., quantitative data of their applications' performance) and add them to the last column in the Incident Report.
- Log the MTTR metric.
- Close the GitHub issue that was created by the Incident Report workflow.
- Click the Complete button in the workflow thread on #benefits-vro-on-call.
Considerations
- How the incident was detected, including a timestamp
- Severity level
- Corrective measures taken
- Timestamp of when the system returned to operating status
- "Red herrings" that were encountered
- Follow-up tasks or action items
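The considerations above map naturally onto a structured record. The following is a minimal sketch of such a record, assuming hypothetical field names; it is not the wiki's actual Incident Report layout, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentReport:
    detection_method: str            # how the incident was detected
    detected_at: datetime            # timestamp of detection
    severity: int                    # SEV level, 1 through 4
    corrective_measures: List[str]   # what was done to fix or stabilize
    restored_at: datetime            # when the system returned to operating status
    red_herrings: List[str] = field(default_factory=list)
    follow_up_tasks: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be 1, 2, 3, or 4")
        if self.restored_at < self.detected_at:
            raise ValueError("restored_at cannot precede detected_at")

    def time_to_restore(self) -> timedelta:
        """Elapsed time between detection and return to operating status."""
        return self.restored_at - self.detected_at


# Made-up values, for illustration only.
report = IncidentReport(
    detection_method="Partner team used the Report a VRO Incident workflow",
    detected_at=datetime(2024, 1, 8, 10, 0),
    severity=2,
    corrective_measures=["Rolled back the latest configuration change"],
    restored_at=datetime(2024, 1, 8, 11, 15),
)
print(report.time_to_restore())
```

Validating the severity and timestamp order at construction time catches the most common data-entry mistakes before a report is logged.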
Ticket #3628 in our backlog tracks the creation of an Incident Report template for consistency.
Step 6: Post-mortem Review
This step is expected for SEV 1 and SEV 2 incidents, and at the team's discretion for SEV 3 and SEV 4 incidents.
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
As a more in-depth analysis, assess what happened, what went well, what did not go well, and measures to prevent a recurrence. Describe troubleshooting measures, including log snippets and command line tools. | Leverage the incident as a learning opportunity and surface further corrective measures | 4 hours | Within 5 business days of completing Step 5 |
Tasks
- Create a Post-mortem Review in the private wiki.
- Share this documentation with the VRO team and partner teams, and assess whether a team discussion is needed.
Considerations
- Follow principles of blameless post-mortems by focusing on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.
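The SLAs for the Incident Report (within 1 business day of completing Step 4) and the Post-mortem Review (within 5 business days of completing Step 5) are counted in business days. A minimal sketch of that arithmetic follows, assuming business days are Monday through Friday with no holiday calendar; this page does not define business days, and `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta

INCIDENT_REPORT_SLA_DAYS = 1  # within 1 business day of completing Step 4
POST_MORTEM_SLA_DAYS = 5      # within 5 business days of completing Step 5


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday-Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday (0) through Friday (4)
            remaining -= 1
    return current


# Step 4 completed on a Friday: the Incident Report is due the next Monday.
report_due = add_business_days(date(2024, 1, 12), INCIDENT_REPORT_SLA_DAYS)
# Step 5 completed that Monday: the Post-mortem Review is due a week later.
post_mortem_due = add_business_days(report_due, POST_MORTEM_SLA_DAYS)
print(report_due, post_mortem_due)
```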
Step 7: Long-term Remediation
Description | Purpose | Estimated time to complete | SLA |
---|---|---|---|
Determine measures that would reduce the likelihood of this incident recurring and/or give the team better visibility into conditions that led to this incident. | Consider remediation measures that could not be achieved in the short-term response. | Varies | Within 2 sprint cycles of completing Step 5 |
Tasks
- Document recommended remediation steps.
- Share these with the VRO team during backlog refinement.