2/13/2024 Prod Offline Incident Post Mortem #17242
Comments
Action items from 2/21/2024 meeting:
Slide deck from the meeting is attached to this comment.
@gracekretschmer-metrostar Hassan and I are blocked on investigating whether Apache failed because of database connections, or whether database connections were high because Apache failed. We can't continue until we get more access and until the recommendations from the post-mortem are added as tickets and the work in those tickets is done. As mentioned in the Post-mortem PR:
@gracekretschmer-metrostar @michelle-dooley I'm thinking I'd put these into issues in the backlog of our Project?
Submitted Post-mortem final draft to: https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/pull/1513
Sent postmortem to Erika to review. https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/pull/1513/commits/9708c431774d4091b7090d5e9304b948bf6e02eb
Approved
Background
On 2/13/2024, Tim Cosgrove (Accelerated Publishing PM) notified the CMS team that prod.cms.va.gov had been down since 11pm EST. No one was directly notified of the outage; Tim Cosgrove noticed outage notifications in the DSVA Slack channel #cms-notifications. The outgoing CMS Tech Lead (Nathan Douglas) identified that prod.cms.va.gov was down because it had been removed from service, and resolved the issue by performing an emergency prod deploy.
User Story or Problem Statement
How might we understand why prod.cms.va.gov was unexpectedly removed from service, and prevent that from happening again without anyone being notified?
Assumptions
prod.cms.va.gov and its replacement instance were removed from service because they failed health checks.
Tasks
Acceptance Criteria
Team
Please check the team(s) that will do this work.
CMS Team
Public Websites
Facilities
User support
Accelerated Publishing