VRO On‐Call Overview
Gabriel Zurita edited this page Nov 14, 2024 · 9 revisions
Welcome to your on-call shift! This guide provides an overview and links to essential resources to help you efficiently manage your responsibilities.
- Proactive Monitoring: Serve as a backstop to automated monitoring, proactively addressing anomalies to enhance service quality.
- Team Shield: Protect the development team from disruptions caused by unplanned work, allowing them to maintain focus and productivity.
- Rapid Response: Respond immediately to incidents, manage deployments, and communicate effectively with stakeholders.
For more details, see the On-Call Responsibilities page.
The on-call engineer's duties are outlined in priority order, particularly within the context of Incident Management:
- Production Issues:
  - Respond immediately to incidents and alerts from PagerDuty, #benefits-vro-on-call, or #benefits-vro-alerts, and prioritize their resolution.
  - Regularly check system metrics and verify the success of deployments.
  - Expedite hotfixes, root cause analysis (RCA), and related work.
  - Monitor key communication channels for support or incident-related discussions. See the VRO Communication Channels doc for Slack and Microsoft Teams channel details.
- Blockers:
  - Address any issues that may block team productivity, such as problems with QA environments, CI infrastructure, test failures, or deployment failures.
- Unplanned Work:
  - Track requests for additional support that arrive through Slack and other relevant team channels.
- Planned Work:
  - Handle routine production tasks during business hours, including non-urgent alerts and software release approvals.
  - Prioritize immediate response to critical incidents over less time-sensitive tasks.
- Support Role:
  - Assist the primary engineer and take over if they're unavailable.
  - Handle non-urgent tasks and routine production duties as needed, so the primary engineer can focus on critical incidents.
- Availability: On-call engineers should be available during working hours (9 AM to 5 PM ET) and respond to pages promptly, according to criticality.
- Timing: The on-call rotation aligns with the sprint schedule and covers each sprint from start to end.
- Handover: Document ongoing issues, communicate important updates, and ensure a smooth transition to the next engineer.
- Internal Contacts: See the Team Contact List for internal leads.
- External Contacts and Issue Escalation: For partner team support, refer to VRO Services, Points of Contact, and Issue Escalation Paths.
Note: Some of the resources below could be consolidated into single documents and simplified.
- Monitoring Tools:
  - PagerDuty Incident Dashboard
  - DataDog Dashboards: Links to dashboards and instructions on gaining access
- Incident Resources:
  - Incident Response Guide
  - Incident Reports: Log all SEV 1 and SEV 2 incidents using the Incident Report Slack Workflow and document details in the Incident Reports Wiki.
  - Post-Incident Reviews (Private Repo)
  - Metrics: Track MTTR and other metrics for continuous monitoring improvements. See the Metrics Documentation.
- Regular On-Call Task Resources:
  - On-Call Responsibilities
  - On-Call Runbooks
  - Deployments
  - Dependabot On-Call Responsibility: Instructions for managing Dependabot PRs
  - Recurring On-Call Sprint Work Issue Tracking: Log ongoing issues in recurring GitHub issues (e.g., #3384, #3439, #3499) to maintain visibility and track resolutions.
  - SecRel Resources:
- Tools:
  - Aqua (VA intranet)
  - Snyk: Scan results on the internal repository's Security tab
  - BEP Intake Form (VA intranet)
  - Benefits Web Services Page (VA intranet)
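
As a rough illustration of the MTTR metric mentioned under Incident Resources, the sketch below averages the time from when an incident was opened to when it was resolved. It assumes MTTR means "mean time to resolve"; the `Incident` record shape and the `mean_time_to_resolve` helper are hypothetical and are not taken from the Metrics Documentation, which remains the authoritative definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Incident:
    # Hypothetical record shape, for illustration only.
    opened_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open


def mean_time_to_resolve(incidents: Iterable[Incident]) -> Optional[timedelta]:
    """Average (resolved_at - opened_at) over resolved incidents.

    Returns None when no incident has been resolved yet.
    """
    durations = [
        i.resolved_at - i.opened_at
        for i in incidents
        if i.resolved_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


incidents = [
    Incident(datetime(2024, 11, 1, 9, 0), datetime(2024, 11, 1, 9, 30)),    # 30 minutes
    Incident(datetime(2024, 11, 5, 14, 0), datetime(2024, 11, 5, 15, 30)),  # 90 minutes
    Incident(datetime(2024, 11, 7, 10, 0), None),                           # still open, skipped
]
print(mean_time_to_resolve(incidents))  # 1:00:00
```

Open incidents are excluded here so that an unresolved page does not skew the average; a team could instead choose to count them up to the current time.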