VRO Deployment Policy
This document intends to define guidelines surrounding VRO application deployments for platform and partner services, to specify deployment intervals, and to plan for potential issues which may arise.
Summary
- The VRO team will notify partner teams of a deployment of the most recent eligible build on the first Tuesday of a new sprint, allowing time to opt-in or opt-out of the deployment. Deployment to lower regions will occur as soon as 3PM ET that day, testing should occur on Wednesday, and production environment deployments will occur on Thursday.
- The on-call engineer(s) performing the deployment will notify stakeholders when a release begins, provide updates during the deployment should any issues arise, and provide signoff when complete.
- One-off deployments can be requested as needed and will also include initial notice and signoff communication.
VRO is moving to a two-week release cadence, beginning on the first Tuesday of a new VRO sprint. Partner teams will be prompted to either opt-in or opt-out of a deployment using a Slack Workflow. Deployments will be performed at a time deemed most appropriate by the on-call engineer performing the deployment. Depending on the engineer’s availability and their time zone, a release should typically be scheduled in the morning or late afternoon. This regular schedule ensures that updates are released consistently and allows VRO and partner teams to plan and prepare for each deployment. A two-week interval also reduces the likelihood of deployed images expiring, which can block service restarts. Images created and signed by SecRel as part of its weekday morning scheduled runs should be suitable for deployment if the develop branch has not been modified since the scan and all SecRel checks have passed.
- Deployment notifications are sent
- Partner teams either opt-in or opt-out of deployment of their services using Slack Workflows
- Workflows must be submitted by 3PM ET at the latest
- Not submitting either form will default to opting out of the deployment
- VRO will deploy to lower regions, dev/qa/sandbox, and communicate the deployment to partner teams
- Validations may begin
- Validations will continue if they have not already been completed
- If services are found to have defects, this is the time to fix them and re-validate
- Partner teams will be expected to provide signoff when validations are complete and their services are ready to deploy to higher environments
- If a defect cannot be resolved and validated by Thursday morning, the affected services will not be deployed
- Deployments to prod-test and prod will occur
- VRO will communicate when the deployments have been completed
- Partner teams will be expected to re-validate their services in these higher regions and provide signoff when complete
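The timeline and opt-in rules above can be sketched in a few lines of Python. This is only an illustration, not VRO tooling: the function names, the team identifiers, and the shape of `responses` are hypothetical, timestamps are assumed to already be expressed in Eastern Time, and treating a late submission as an opt-out is an assumption based on the 3PM ET cutoff.

```python
from datetime import date, datetime, timedelta


def release_schedule(first_tuesday):
    """Map each release phase to its date, given the sprint's first Tuesday."""
    if first_tuesday.weekday() != 1:  # Monday is 0, so Tuesday is 1
        raise ValueError("a standard release starts on a Tuesday")
    return {
        "lower_regions": first_tuesday,                    # dev/qa/sandbox
        "validation": first_tuesday + timedelta(days=1),   # Wednesday
        "production": first_tuesday + timedelta(days=2),   # prod-test and prod
    }


def resolve_opt_ins(teams, responses, deadline):
    """Return the set of teams whose services are included in the release.

    responses maps a team name to (choice, submitted_at), where choice is
    "opt-in" or "opt-out". Teams with no response default to opting out.
    """
    opted_in = set()
    for team in teams:
        response = responses.get(team)
        if response is None:
            continue  # no form submitted: treated as an opt-out
        choice, submitted_at = response
        # Assumption: a form submitted after the cutoff is not honored.
        if choice == "opt-in" and submitted_at <= deadline:
            opted_in.add(team)
    return opted_in
```

For example, with a cutoff of `datetime(2024, 6, 4, 15, 0)`, a team that opted in at 10AM is included, while a team that opted in at 3:30PM or never responded is not.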
Ad-hoc and emergency deployments can be requested through a link pinned in #benefits-vro and will require a commit hash and change log summary to be provided to schedule a deployment. The hash should have already passed SecRel scans and images should have already been created and signed, and changes should have already been validated in lower environments.
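As a rough sketch of the checks an on-call engineer might apply to an ad-hoc request, the snippet below collects the prerequisites named above into one function. The `AdHocRequest` type, its field names, and the 7 to 40 hexadecimal character rule for commit hashes are assumptions made for illustration; the actual request form may capture these details differently.

```python
import re
from dataclasses import dataclass

# Assumption: accept abbreviated or full Git SHA-1 hashes (7 to 40 hex chars).
HASH_PATTERN = re.compile(r"[0-9a-f]{7,40}")


@dataclass
class AdHocRequest:
    commit_hash: str
    change_log_summary: str
    secrel_passed: bool            # hash has passed SecRel scans
    images_signed: bool            # images were created and signed
    validated_in_lower_envs: bool  # changes were validated in lower environments


def request_problems(req):
    """Return a list of reasons the request is not ready to schedule."""
    problems = []
    if not HASH_PATTERN.fullmatch(req.commit_hash.strip().lower()):
        problems.append("commit hash is missing or malformed")
    if not req.change_log_summary.strip():
        problems.append("change log summary is missing")
    if not req.secrel_passed:
        problems.append("commit has not passed SecRel scans")
    if not req.images_signed:
        problems.append("images have not been created and signed")
    if not req.validated_in_lower_envs:
        problems.append("changes have not been validated in lower environments")
    return problems
```

An empty list means the request meets every stated prerequisite; the on-call engineer would still confirm that the requested deployment time is appropriate.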
Pre-Deployment
An automated Slack message will be sent on the first Tuesday of a new sprint to the #benefits-vro-on-call channel, prompting partner teams to either opt-in or opt-out of the deployment. The VRO engineer responsible for a deployment will complete a short form calling out any major bug fixes or features being released, which is then published to #benefits-vro. The on-call VRO engineer will post a message on Slack when the deployment begins. If the primary reason for a deployment is to refresh image signatures, that should be communicated as well.
Ad-hoc and emergency deployments should be communicated in a similar manner so partner teams remain informed. When a deployment is requested, the on-call engineer will be responsible for verifying the requested commit hash is ready to be deployed and the requested deployment time is appropriate. The on-call engineer will communicate when a deployment is scheduled and will follow the same communication standards regarding beginning and signing off a deployment.
Post-Deployment
Automatic Slack messages triggered by pod activity are published to #benefits-slack-alerts-cc, #benefits-slack-alerts-ee, and #benefits-slack-alerts-platform as part of the recent ArgoCD migration. Each channel receives alerts only for activity on its respective services, including pod sync and health degradation alerts. A deployment triggers a pod sync notification, but a pod sync notification does not necessarily indicate a deployment, since pods can be synced outside of a deployment context. Therefore, a signoff message should be sent over Slack after the deployment to production indicating the services have been validated as working. If a defect is introduced in a deployment and the service(s) must be hotfixed or rolled back, that will also be communicated. In the event of a rollback, an incident report should be created so the defect can be investigated and tracked. The process for reporting and responding to incidents is outlined on the VRO Incident Response wiki page.
Requirements and Blockers
GitHub is currently configured to require passing continuous integration tests in order to merge feature branches into develop. Engineers responsible for the code changes to be released will also be responsible for validating those changes in prod-test and ensuring unit and integration tests continue to pass. SecRel scans evaluate vulnerabilities against the develop branch of the code, and will only build and sign images if no new vulnerabilities are discovered. SecRel may find new vulnerabilities in some but not all services; if this is the case, a deployment may proceed with the unaffected services. Occasionally, SecRel may fail for other reasons, such as incorrect Aqua gate checks; in those cases the build may still be eligible for release.
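One way to picture the partial-release rule is to sort services into three buckets based on their SecRel result. The sketch below is hypothetical: `ScanOutcome`, the service names, and the idea of a "needs review" bucket for non-vulnerability failures are illustrative assumptions, not part of SecRel itself.

```python
from enum import Enum


class ScanOutcome(Enum):
    PASSED = "passed"
    NEW_VULNERABILITIES = "new_vulnerabilities"
    OTHER_FAILURE = "other_failure"  # e.g. an incorrect Aqua gate check


def partition_services(results):
    """Split services into (deployable, blocked, needs_review) name lists.

    results maps a service name to its ScanOutcome.
    """
    deployable, blocked, needs_review = [], [], []
    for service, outcome in sorted(results.items()):
        if outcome is ScanOutcome.PASSED:
            deployable.append(service)
        elif outcome is ScanOutcome.NEW_VULNERABILITIES:
            blocked.append(service)  # no image is built or signed
        else:
            # Assumption: a human decides whether the build is still eligible.
            needs_review.append(service)
    return deployable, blocked, needs_review
```

Keeping the third bucket separate reflects the point that a failure unrelated to vulnerabilities does not automatically block a release, but also should not pass silently.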
If any features cannot be validated or any new defects are discovered, the release captain and other responsible engineers will be responsible for discussing whether to continue with the deployment. In most cases, any new defects should result in the deployment for any affected services being canceled or delayed until any defects are resolved.
Performing the Deployment
This GitHub issue contains a diagram describing the development and release flow for new features. Generally, after new code is written and merged into the develop branch of the repository, images will be signed and pushed to dev and qa regions and then validated in each region.
The on-call VRO engineers for a given sprint will act as primary and secondary release captains and will be responsible for deploying software updates. On the first Tuesday of a new sprint, after receiving opt-in/opt-out confirmation from partner teams, the release captain will be responsible for first validating the images generated by the most recent SecRel run were built against the latest code in the develop branch. After that has been validated they may proceed to push applicable services to lower regions by updating image tags in the argocd-applications-vault repository. At this point the engineer(s) responsible for features included in the release should validate their changes in those lower regions - this should be complete by EOD Wednesday.
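The release captain's first check, that the most recent SecRel images were built against the latest develop code, could be expressed roughly as follows. The `SecRelRun` record and its fields are hypothetical stand-ins for whatever the SecRel pipeline actually reports, and the commit hashes would need to be looked up separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecRelRun:
    commit_sha: str          # commit the images were built from
    all_checks_passed: bool  # every SecRel check succeeded
    images_signed: bool      # images were created and signed


def images_are_eligible(run, develop_head_sha):
    """True if the run's images match the current head of develop and passed."""
    return (
        run.all_checks_passed
        and run.images_signed
        and run.commit_sha == develop_head_sha
    )
```

If develop has moved on since the scan, the comparison fails, and the images should not be treated as the most recent eligible build.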
On Thursday, the release captain will deploy the new images to prod-test and monitor the application health, while the engineer(s) responsible for new features being deployed validate them in prod-test. Once the features have been validated and the application is healthy, the release captain will deploy to production and the same validations should be re-run.
Post-Deployment Review
The release captain and/or the engineer(s) responsible for new features and bug fixes in a given release will be responsible for re-running validations in production while continuing to monitor application health and performance. A signoff message should be sent only after the services have been validated in production.
Each deployment may include:
- New features: new functionality added to the software;
- Bug fixes: corrections for known issues;
- Performance improvements: enhancements to improve the software's performance;
- Security updates: patches for security vulnerabilities;
- Configuration changes: updates to the software configuration or environment; and/or
- None of the above: the release is only being performed to update image signatures.
With each standard deployment, all VRO services and any requested partner team services will be released so image signatures remain updated. For urgent and ad-hoc deployments, the release captain should only update relevant services so as to minimize downtime of those without changes. Current understanding of SecRel image signing suggests regular deployments should be performed to mitigate service downtime due to image signatures expiring, and should therefore be performed regardless of whether there are new features or bug fixes staged for release. Given this requirement, all services should be updated at least once per month.
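To illustrate the monthly refresh guideline, the sketch below flags services whose last deployment is older than a chosen threshold. The 30-day value is an assumption standing in for "once per month"; the actual signature expiry window is not stated here, and the service names and the source of the deployment dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: 30 days approximates the "at least once per month" guideline.
MAX_DEPLOYMENT_AGE = timedelta(days=30)


def services_due_for_refresh(last_deployed, today):
    """Return sorted names of services last deployed more than 30 days ago.

    last_deployed maps a service name to the date of its latest deployment.
    """
    return sorted(
        name
        for name, deployed_on in last_deployed.items()
        if today - deployed_on > MAX_DEPLOYMENT_AGE
    )
```

A check like this would mainly matter for partner services that opted out of several consecutive releases, since standard deployments already refresh all VRO services.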
The VRO Incident Response wiki page details steps which must be taken to remediate and document any defects or incidents which occur with VRO’s platform, including during deployments. These steps should be followed for any of the potential issues outlined below.
SecRel Vulnerabilities
Issue: A SecRel scan fails with new vulnerabilities, preventing images from being created and signed for deployment.
Mitigation: Address the vulnerabilities manually by upgrading versions or requesting exceptions when appropriate. Once addressed, rerun the SecRel scan to generate signed images.
Test Failures
Issue: Code changes have introduced a new defect which causes integration tests or manual validations to fail with a new build.
Mitigation: Correct any defects and reschedule the deployment once tests are passing again.
Deployment Failures
Issue: Deployment process fails due to errors in the code or environment.
Mitigation: Ensure thorough testing in the prod-test environment and have a rollback plan in place. If the issue is minor enough to warrant a hotfix, it should be discussed with the deployment team whether one should be applied to the affected services.
Downtime
Issue: The software becomes unavailable during deployment.
Mitigation: Consider scheduling deployments during off-peak hours, and communicate planned downtime to partner teams in advance.
Performance Degradation
Issue: The software's performance is negatively impacted after deployment.
Mitigation: Conduct performance testing before deployment and monitor system performance closely post-deployment.
Configuration Issues
Issue: Incorrect configuration leads to functionality problems.
Mitigation: Validate configuration settings in the staging environment and use configuration management tools.