Improve node auto-scaling for Kubernetes #184
Comments
Have a preliminary PR to go into staging.
Deployed Karpenter in staging; had some issues with the initial deploy with kustomize. Need to go back to the scratch account to improve the install experience.
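For reference, a minimal sketch of what the kustomize entry point for the Karpenter install could look like; the directory layout, file names, and patch below are hypothetical, not the actual repo structure.

```yaml
# kustomization.yaml -- hypothetical layout for installing Karpenter with kustomize.
# The resource paths and patch file are illustrative only.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: karpenter

resources:
  - base/karpenter-crds.yaml   # CRDs rendered from the upstream chart
  - base/karpenter.yaml        # controller Deployment, RBAC, and webhooks

patches:
  - path: patches/controller-settings.yaml   # environment-specific overrides (e.g. staging)
```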
The following command must be run in prod before merging Karpenter to production:
The above is no longer true; I added the correct TF resources into common.
Deployed in staging and working! Will continue to test with the scaling work.
Ben is optimizing the scale-down configuration. The Celery worker SIGTERM issue might be fixed as well; we'll confirm with later tests.
Karpenter is working "OK" with the non-reconciliation timeout. It is not as efficient as it could be, but it's good enough for V1. Will aim for a release this week.
Just a thought: verify what's up with staging before the prod release.
Put in a PR to address the 502s in staging. This will not affect the Karpenter-only prod release, since it is the k8s API that is emitting these warnings. Karpenter can be released to prod today for testing and verification.
Ben will update the ADR on how to stabilize the deployments while using ephemeral nodes.
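One option Karpenter itself offers for this (a sketch, not necessarily what the ADR settles on) is annotating pods so that their nodes are not voluntarily disrupted. The annotation key depends on the Karpenter version in use (`karpenter.sh/do-not-evict` on the older alpha APIs, `karpenter.sh/do-not-disrupt` from v1beta1 onward). A hypothetical kustomize strategic-merge patch for an `api` Deployment:

```yaml
# Hypothetical strategic-merge patch: the annotation asks Karpenter not to
# voluntarily disrupt (consolidate/expire) nodes running these pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api               # placeholder workload name
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
```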
PR to fix Celery socket errors.
Ready for a (hopefully!) final test today.
We're still getting 502s in staging, so we need to look into it. The pod disruption budget configuration for the API doesn't seem to hold under Karpenter.
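For context, a pod disruption budget for the API would look roughly like the sketch below; the selector labels and the `minAvailable` value are assumptions, not the actual configuration.

```yaml
# Illustrative PodDisruptionBudget for the API pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # keep at least two API pods up during voluntary disruptions
  selector:
    matchLabels:
      app: api             # assumed label; must match the API Deployment's pod labels
```

Worth noting that a PDB only constrains voluntary disruptions such as Karpenter consolidation; spot reclamations can still take pods down regardless of the budget, which is consistent with the next step of moving the API off spot.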
Going to remove the API from spot instances.
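A hypothetical patch for that: Karpenter labels the nodes it provisions with `karpenter.sh/capacity-type` (`spot` or `on-demand`), so a nodeSelector can pin the API pods to on-demand capacity.

```yaml
# Hypothetical strategic-merge patch keeping the API off spot nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                # placeholder workload name
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
```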
Staging test was successful. The Celery pods did not spin up until CWAgent was ready. Created a PR to re-enable Celery on Karpenter in production.
Found a scenario in prod where CWAgent was not spinning up because the node had insufficient CPU. The Celery pods were reporting that they were ready, but they weren't, because they were stuck waiting for CWAgent. Need to add a couple of patches to fix this.
Implemented a fix in staging to prioritize CWAgent and also tuned the liveness and readiness probes. This is working but may be a bit too aggressive, so I've opened a new PR to increase the probe delay times.
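Roughly, the two knobs described above could look like the sketch below; the priority value, probe commands, and delay numbers are illustrative assumptions, not the values in the PR.

```yaml
# 1) PriorityClass so the CloudWatch agent is scheduled ahead of other workloads
#    on a freshly provisioned node. The agent's DaemonSet would reference it via
#    priorityClassName.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cwagent-priority
value: 1000000
globalDefault: false
description: "Give the CloudWatch agent CPU on new nodes before the workers start."
---
# 2) Probe tuning on the Celery worker (strategic-merge patch): longer initial
#    delays give the node and the agent time to settle before pods are marked
#    ready or restarted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker            # placeholder workload name
spec:
  template:
    spec:
      containers:
        - name: celery-worker
          readinessProbe:
            exec:
              command: ["celery", "inspect", "ping"]   # placeholder probe command
            initialDelaySeconds: 60
            periodSeconds: 10
          livenessProbe:
            exec:
              command: ["celery", "inspect", "ping"]   # placeholder probe command
            initialDelaySeconds: 120
            periodSeconds: 30
```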
Reverted the probe changes because they were not working. Will move this to review and monitor the system for a week.
Steve ran some tests with the visibility timeout in staging, reducing the timeout to 26 seconds. Will look into releasing this soon.
CWAgent seems happier now.
Almost ready to go. Waiting on Steve to merge his PR.
CWAgent is OOMing :/ Might have a fix. Karpenter spot instances restart every day, so if CWAgent lasts for 24 hours we should be good.
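If the fix ends up being more headroom for the agent, a patch like the hypothetical one below would be the usual shape; the names, namespace, and numbers are assumptions.

```yaml
# Hypothetical patch raising the CloudWatch agent's memory limit so it is not OOM-killed.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch     # namespace commonly used for the agent; assumed here
spec:
  template:
    spec:
      containers:
        - name: cloudwatch-agent
          resources:
            requests:
              memory: 200Mi
            limits:
              memory: 400Mi        # raise the ceiling so the agent is not OOM-killed
```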
PR for alarm created.
I've been looking into alerts and alarms based on this, and there doesn't seem to be a great way to create an alarm based on Karpenter logs. The error about being unable to provision nodes seems to occur daily, but it resolves in such a short amount of time that it has no effect. Looking further, we already have alarms for Celery having unavailable replicas, which will trigger if Karpenter is having issues. This should be sufficient.
I'm going to create an alarm for when Karpenter itself is not running.
Alarm created; moving to review.
Description
As a developer/operator of GC Notify, I would like the system to scale Kubernetes nodes based on load, so that we are not constantly running the maximum number of nodes while they sit mostly idle.
WHY are we building?
We are pushing changes to Notify that will increase our sending rate to meet OKRs. To accommodate this, we must increase the number of nodes available in Kubernetes. Since we only use these nodes during peak periods, they sit idle most of the time. This increases costs with no additional benefit, so it would be good to be able to scale these nodes on demand to maximize cost efficiency.
WHAT are we building?
There are two methods of autoscaling EKS nodes: the built-in Kubernetes functionality (the Cluster Autoscaler) or Karpenter. Karpenter is more flexible and allows us to take advantage of spot pricing on Amazon to further maximize cost efficiency.
We will install and configure Karpenter in Notify's EKS cluster.
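As a rough illustration (not Notify's actual configuration), a Karpenter NodePool that allows both spot and on-demand capacity, with consolidation enabled so idle nodes are scaled down, could look like the sketch below. The schema shown is the v1beta1 API; older Karpenter releases use a Provisioner resource with slightly different fields. The EC2NodeClass it references (AMI, subnets, security groups) is omitted.

```yaml
# Illustrative Karpenter NodePool (v1beta1 API); values are examples only.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]    # allow spot to take advantage of spot pricing
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default                      # AMI/subnet/security-group config lives here
  limits:
    cpu: "200"                             # cap the total capacity Karpenter may provision
  disruption:
    consolidationPolicy: WhenUnderutilized # scale down and repack idle or underused nodes
    expireAfter: 24h                       # recycle nodes daily
```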
VALUE created by our solution
Acceptance Criteria
QA Steps