2/13/2024 Prod Offline Incident Post Mortem #17242

Closed
5 of 9 tasks
gracekretschmer-metrostar opened this issue Feb 14, 2024 · 6 comments
Assignees
Labels
CMS Team CMS Product team that manages both editor exp and devops Epic Issue type

Comments

@gracekretschmer-metrostar

gracekretschmer-metrostar commented Feb 14, 2024

Background

On 2/13/2024, Tim Cosgrove (Accelerated Publishing PM) notified the CMS team that prod.cms.va.gov had been down since 11 p.m. EST. No one was directly notified of the outage; Tim Cosgrove noticed outage notifications in the DSVA Slack channel #cms-notifications. The outgoing CMS Tech Lead (Nathan Douglas) identified that prod.cms.va.gov was down because it had been removed from service, and resolved the issue by performing an emergency prod deploy.

User Story or Problem Statement

How might we understand why prod.cms.va.gov was unexpectedly removed from service, and prevent that from happening without our knowledge in the future?

Assumptions

Prod.cms.va.gov and its replacement instance were removed from service because they failed health checks.

Tasks

Acceptance Criteria

  • The reason why prod was removed from service is understood.
  • Opportunities to notify CMS staff before prod is removed from service are identified.
  • Opportunities to prevent prod from being removed from service are identified.
  • A post mortem report is submitted in GitHub.
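
To make the second criterion concrete, here is a minimal sketch of one way to warn staff before an instance is removed: count consecutive failed health checks and raise a warning at a lower threshold than the one that triggers removal. This assumes removal is driven by consecutive failures. The class name `HealthCheckTracker` and both thresholds are hypothetical and are not taken from the actual prod configuration.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckTracker:
    """Tracks consecutive failed health checks for a single instance.

    Both thresholds are hypothetical placeholders; the real removal
    threshold depends on how the environment is configured.
    """

    warn_after: int = 2    # hypothetical: notify staff after this many failures in a row
    remove_after: int = 5  # hypothetical: failures in a row before removal from service
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> str:
        """Record one check result and return the resulting state.

        Returns "ok", "degraded", "warn", or "remove".
        """
        if healthy:
            # Any passing check resets the failure streak.
            self.consecutive_failures = 0
            return "ok"

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.remove_after:
            return "remove"
        if self.consecutive_failures >= self.warn_after:
            # This is the window in which staff could be notified
            # before the instance is taken out of service.
            return "warn"
        return "degraded"
```

The gap between `warn_after` and `remove_after` is the time available to react, so in practice it would need to be wide enough for a person to receive and act on the notification.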

Team

Please check the team(s) that will do this work.

  • CMS Team
  • Public Websites
  • Facilities
  • User support
  • Accelerated Publishing
@gracekretschmer-metrostar

image

@gracekretschmer-metrostar

Action items from 2/21/2024 meeting:

Slide deck from the meeting is attached to this comment.

Prod Offline Postmortem 02212024 (Oddball Template).pptx

@7hunderbird

7hunderbird commented Feb 22, 2024

@gracekretschmer-metrostar Hassan and I are blocked on investigating whether Apache failed because of database connections, or whether database connections were high because Apache failed. To continue, we need more access, and the recommendations from the post-mortem need to be added as tickets and the work in those tickets completed.

As mentioned in the Post-mortem PR:

  • Discovery ticket to answer, "What failed in post-deploy scripts that prevented automatic healing of the instance?"
    • Hassan would like us to consider using a 2nd standby instance in the auto-scaling group
  • Discovery ticket to answer, "What logs can we turn on at the operating system level?"
  • Discovery ticket to answer, "What logs or monitoring do we want to turn on at the Amazon Web Services (AWS) level?"
  • Discovery ticket to answer, "What is the current version of Apache? What is available now? What benefits would upgrading Apache provide?"

@gracekretschmer-metrostar @michelle-dooley I'm thinking I'd file these as issues in the backlog of our Project?

@7hunderbird

Submitted Post-mortem final draft to: https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/pull/1513

@EWashb
Contributor

EWashb commented Apr 4, 2024

Approved


5 participants