Investigate Why Prod Taken Offline #17255

Closed
3 of 6 tasks
Tracked by #17244
gracekretschmer-metrostar opened this issue Feb 15, 2024 · 1 comment
Assignees
Labels
CMS Team CMS Product team that manages both editor exp and devops

Comments

gracekretschmer-metrostar commented Feb 15, 2024

User Story or Problem Statement

As CMS DevOps staff, I need to know why prod.cms.va.gov was unexpectedly taken offline on 2/13/2024, so that I can prevent it from happening again.

Description or Additional Context

On 2/13/2024, Tim Cosgrove (Accelerated Publishing PM) notified the CMS team that prod.cms.va.gov had been down since 11pm EST. No one was directly notified of the outage; Tim Cosgrove noticed outage notifications in the DSVA Slack channel #cms-notifications. The outgoing CMS Tech Lead (Nathan Douglas) identified that prod.cms.va.gov was down because it had been removed from service, and resolved the issue by performing an emergency prod deploy.

Acceptance Criteria

  • The reason why prod was removed from service is understood.
  • Opportunities to prevent prod from being removed from service are identified.

Team

Please check the team(s) that will do this work.

  • CMS Team
  • Public Websites
  • Facilities
  • Accelerated Publishing
gracekretschmer-metrostar added the CMS Team label Feb 15, 2024
gracekretschmer-metrostar changed the title from "[Discovery] Investigate Why Prod Taken Offline" to "Investigate Why Prod Taken Offline" Feb 15, 2024
@7hunderbird

With the help of @edmund-dunn and @Hassantariq-MetroStar, we investigated what happened and identified the cause. We will report the results in #17244.

In brief, too many database connections overloaded the prod VM and caused it to crash.
