Production Helm/AWS Secrets Manager Rollout #483

Open
9 tasks
sastels opened this issue Dec 16, 2024 · 8 comments
Assignees
Labels
Reliability Task related to reliability. Security | Sécurité Tech Debt An issue targeting an identified technical debt

Comments

@sastels

sastels commented Dec 16, 2024

Description

As developers, we need to safely deploy our migrated Helm code to staging and production so that we can decommission our Kustomize code and gain the benefits of Helm.

WHY are we building this?
We are building this because it falls in line with industry standards and ties in with our OKRs for Secrets Management.

WHAT are we building?
We are migrating all of our manifest code from Kustomize to Helm and reading secrets from AWS Secrets Manager rather than from encrypted env files. We need to create scripts that perform the migration safely and minimize downtime.

VALUE created by this solution?
The benefits include traceability of our deployment components (diffing with Helm), more readable manifest code, and more secure handling of variables and secrets.

Additional Information

The rollout plan looks like this:

  • Deploy all of the Helmfile releases with target group bindings DISABLED. The Kustomize and Helmfile deployments will then run side by side, but no traffic will reach the new deployments, so we can verify that everything is up and running (kubectl port-forward to the pods).
  • Once we have confirmed that everything looks good, we will run a small script that deletes the old target group bindings.
  • We will then enable the Helmfile target group bindings and deploy them.
  • We will delete the rest of the Kustomize deployments.
  • This way we should be able to limit downtime to 1-2 minutes.
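
The first step relies on the Helmfile release containing no target group bindings. A minimal sketch of that check, assuming the rendered manifests (for example the output of `helmfile template`) have already been parsed into dictionaries. The function name and the sample data are hypothetical and are not part of the rollout scripts:

```python
from typing import Iterable, List, Mapping, Optional


def find_target_group_bindings(manifests: Iterable[Optional[Mapping]]) -> List[str]:
    """Return "namespace/name" for every TargetGroupBinding in the rendered manifests."""
    found = []
    for doc in manifests:
        if not doc:  # skip empty YAML documents
            continue
        if doc.get("kind") == "TargetGroupBinding":
            meta = doc.get("metadata", {})
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "<unnamed>")
            found.append(f"{namespace}/{name}")
    return found


# Hypothetical sample data standing in for a rendered release.
rendered = [
    {"kind": "Deployment", "metadata": {"name": "admin", "namespace": "example-ns"}},
    {"kind": "Service", "metadata": {"name": "admin", "namespace": "example-ns"}},
]

leftover = find_target_group_bindings(rendered)
if leftover:
    print("Target group bindings still enabled:", ", ".join(leftover))
else:
    print("No target group bindings rendered; safe to deploy side by side.")
```

An empty result is only a pre-deploy sanity check; it does not replace verifying the pods with kubectl port-forward.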

Acceptance Criteria

Given some context, when (X) action occurs, then (Y) outcome is achieved

  • The rollout plan is scripted and reviewed by the team.
  • Verify alarms still work, so trigger a few manually.
  • Run a soak test in the target environment during the rollout so we can confirm and measure the downtime period.
  • Run performance tests to make sure the migration did not introduce any degradation.
  • Establish a release timeline to share with the GCNotify and support teams. Perform the rollout outside of business hours.
  • Communicate any downtime longer than 1 hour to users, and sync with the product manager on that communication.
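
To measure the downtime period from the soak test, one option is to reduce the probe results to the longest run of consecutive failures. A rough sketch, assuming the soak test can export `(timestamp_in_seconds, succeeded)` pairs sorted by time. That export format and the function name are assumptions, not something the soak test is known to provide:

```python
from typing import List, Optional, Tuple


def longest_outage_seconds(samples: List[Tuple[float, bool]]) -> float:
    """Longest span from the first failed probe of a run to the next successful probe.

    `samples` must be sorted by timestamp. An outage that is still ongoing at the
    end of the data is measured up to the last sample.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, succeeded in samples:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # a new outage begins
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # the outage has ended
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical probe results: one probe every 5 seconds.
probes = [(0.0, True), (5.0, True), (10.0, False), (15.0, False),
          (20.0, False), (25.0, True), (30.0, True)]
print(f"Longest outage: {longest_outage_seconds(probes)} seconds")
```

The result is bounded by the probe interval, so a shorter interval gives a tighter estimate of the real downtime.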

QA Steps

  • Run rollercoaster tests in the target environment after the rollout.
  • Verify that the pods' configuration (especially the scaling metrics) matches between Kustomize and Helmfile for the target environment.
  • Use the admin to run a few smoke tests such as sending an email and SMS.
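
The configuration comparison in the QA steps can be sketched as a key-by-key diff. This assumes the scaling settings of each deployment have been extracted into flat dictionaries. The keys below are the standard HorizontalPodAutoscaler field names, used purely as hypothetical sample data, and the function name is illustrative:

```python
from typing import Any, Dict, List


def diff_scaling_config(kustomize: Dict[str, Any], helmfile: Dict[str, Any]) -> List[str]:
    """List every setting whose value differs between the two deployments."""
    differences = []
    for key in sorted(set(kustomize) | set(helmfile)):
        old = kustomize.get(key, "<missing>")
        new = helmfile.get(key, "<missing>")
        if old != new:
            differences.append(f"{key}: kustomize={old!r} helmfile={new!r}")
    return differences


# Hypothetical values; the real ones come from the target environment.
kustomize_hpa = {"minReplicas": 2, "maxReplicas": 10, "targetCPUUtilizationPercentage": 50}
helmfile_hpa = {"minReplicas": 2, "maxReplicas": 8, "targetCPUUtilizationPercentage": 50}

for line in diff_scaling_config(kustomize_hpa, helmfile_hpa):
    print(line)
```

An empty list means the compared settings match. Settings that were never extracted into the dictionaries are not covered.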
sastels added the Reliability, Security | Sécurité, and Tech Debt labels Dec 16, 2024
sastels changed the title from "Copy of Staging/Production Helm/AWS Secrets Manager Rollout" to "Production Helm/AWS Secrets Manager Rollout" Dec 16, 2024
@P0NDER0SA

Two PRs in preparation for the production migration:
cds-snc/notification-manifests#3207
cds-snc/notification-manifests#3208

We have a list of things we need to modify because Production doesn't work exactly like the other environments. These mostly concern tagging, plus a few other items. We'll update the card with more specifics.

@sastels
Author

sastels commented Dec 18, 2024

Ben and Pond will hopefully look at this today.

@sastels
Author

sastels commented Dec 18, 2024

Also: change PR-bot to add manifest repo changes to release

@sastels
Author

sastels commented Dec 19, 2024

Blocked until we get some of the other pieces in place. Ben will make a card.

@ben851
Contributor

ben851 commented Jan 14, 2025

Helmfile release plan:

  1. Increase node count for EKS to accommodate the additional load
  2. Ensure the following branches are up to date:
    a. helm-migration-production-tgb-disabled
    b. helm-migration-production-tgb-enabled
  3. Run step1-helmfile-apply-no-tgb.sh production
  4. Ensure that it deploys successfully
  5. Let this stew for the rest of the week, making sure that the releases work
  6. Next Monday during working hours, increase the replica counts on Helmfile to match the existing production counts
  7. Next Monday during off hours, run step2-delete-kustomize-tgb-helm-apply.sh production (30s or so of downtime) to migrate us to Helm deployments
  8. Ensure that it is working as expected, and let it stew for 24-48 hrs
  9. Run step3-delete-kustomize-cleanup.sh production to remove the Kustomize code
  10. Decrease the EKS node count back to its original value

ESCAPE PLAN
If at any point things aren't going as planned, we can run the rever-environment script to get rid of the helmfile code and restore kustomize
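
The release plan and escape plan together amount to "run the steps in order and revert on the first failure". A minimal sketch of that control flow, assuming each step and the revert script can be wrapped as a Python callable. The `script_step` helper and `run_with_escape` are hypothetical and are not part of the existing scripts:

```python
import subprocess
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], object]]


def script_step(path: str, environment: str) -> Callable[[], object]:
    """Wrap a shell script so that a non-zero exit code raises CalledProcessError."""
    return lambda: subprocess.run([path, environment], check=True)


def run_with_escape(steps: List[Step], revert: Callable[[], object]) -> Tuple[List[str], bool]:
    """Run steps in order; on the first failure, call `revert` once and stop.

    Returns the names of the steps that completed and whether a revert happened.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            print(f"{name} failed ({error}); running the escape plan")
            revert()
            return completed, True
        completed.append(name)
    return completed, False
```

In practice the soak periods between steps mean the scripts are run by hand over several days, so this only illustrates the ordering and the single revert path. A step would be built with, for example, `script_step("./step1-helmfile-apply-no-tgb.sh", "production")`.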

@ben851
Contributor

ben851 commented Jan 14, 2025

Steps 1-4 have been completed

@ben851
Contributor

ben851 commented Jan 15, 2025

Will test out the release process this morning to verify it works.

@sastels
Author

sastels commented Jan 15, 2025

The second beat worker may have caused errors last night - Ben will get rid of it for now and proceed with testing the rollout.


3 participants