Improve node auto-scaling for Kubernetes #184

Closed
8 tasks done
jimleroyer opened this issue Aug 10, 2023 · 28 comments
@jimleroyer
Member

jimleroyer commented Aug 10, 2023

Description

As a developer/operator of GC Notify, I would like the system to be able to scale Kubernetes nodes based on load so that we are not constantly running the maximum number of nodes when they are mostly idle.

WHY are we building?

We are pushing changes to Notify that will increase our sending rate to meet OKRs. To accommodate this, we must increase the number of nodes available in Kubernetes. Since these nodes are only needed during peak periods, they sit idle most of the time, which increases costs with no additional benefit. It would be good to be able to scale these nodes on demand to maximize cost efficiency.

WHAT are we building?

There are two methods of autoscaling EKS nodes: the standard Kubernetes Cluster Autoscaler or Karpenter. Karpenter is more flexible and allows us to take advantage of Spot pricing on AWS to further maximize cost efficiency.

We will install and configure Karpenter in Notify's EKS cluster.
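For context, a minimal sketch of the kind of Karpenter Provisioner this describes, using the v1alpha5 API current at the time; the instance types, CPU limit, and TTL below are placeholders, not Notify's actual configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Allow spot capacity to take advantage of Spot pricing.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    # Restrict provisioning to an approved list of node sizes (placeholder values).
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "100"            # hard cap on total provisioned CPU
  ttlSecondsAfterEmpty: 60  # remove nodes that sit empty for a minute
  providerRef:
    name: default           # matching AWSNodeTemplate with subnet/SG selectors
```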

VALUE created by our solution

  • Reduced cost even when not using the burst features, since we will be able to reduce the minimum node count in the main cluster
  • Increased performance, since we will be able to accommodate a higher number of Celery pods during burst periods
  • Easier to scale up in the future

Acceptance Criteria

  • Karpenter is installed and working in staging and production
  • Karpenter is configured to only use the approved node sizes
  • Predictable ramp-up curves that do not impact the immediate performance of Notify
  • We have node selectors for non-Celery pods to ensure they are not deployed to ephemeral nodes (see the sketch after this list)
  • An alarm is created for when Karpenter cannot provision new nodes
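A hedged sketch of the node-selection pattern referenced above: Karpenter-provisioned (ephemeral) nodes carry a label and taint, Celery workers opt in, and everything else stays off. All names, labels, and images are placeholders, not Notify's actual manifests:

```yaml
# Assumed provisioner-side settings (placeholders):
#   labels: { node-type: ephemeral }
#   taints: [{ key: ephemeral, effect: NoSchedule }]
---
# Celery workers opt in to the ephemeral nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 3
  selector:
    matchLabels: { app: celery-worker }
  template:
    metadata:
      labels: { app: celery-worker }
    spec:
      nodeSelector:
        node-type: ephemeral
      tolerations:
        - key: ephemeral
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: celery-worker:latest   # placeholder image
---
# Non-Celery workloads (e.g. the API) carry no toleration and explicitly
# avoid the ephemeral nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: NotIn
                    values: ["ephemeral"]
      containers:
        - name: api
          image: api:latest             # placeholder image
```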

QA Steps

  • Run performance tests that exercise the node scaling curves to ensure the system is not negatively impacted
  • Verify that nodes scale up and down as expected
  • Verify that non-Celery pods do not get deployed to the ephemeral nodes
@jimleroyer changed the title from "Improve node auto-scaling of scaling up for Kubernetes" to "Improve node auto-scaling for Kubernetes" on Aug 10, 2023
@ben851 self-assigned this on Sep 13, 2023
@sastels

sastels commented Sep 18, 2023

Have a preliminary PR to go into staging

@ben851
Contributor

ben851 commented Sep 20, 2023

Deployed Karpenter in staging; had some issues with the initial deploy using Kustomize. Need to go back to the scratch account to improve the install experience.


@ben851
Contributor

ben851 commented Sep 20, 2023

The following command must be run in prod before merging Karpenter to production:
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

@ben851
Contributor

ben851 commented Sep 20, 2023

The above is no longer needed; I added the correct Terraform resources into common.

@sastels

sastels commented Sep 21, 2023

deployed in staging and working! will continue to test with the scaling work.

@jimleroyer
Member Author

Ben is optimizing the scale-down configuration. The Celery worker SIGTERM issue might be fixed as well; we'll confirm with later tests.
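For reference, a sketch of the v1alpha5 scale-down knobs being tuned here; the values are placeholders, not the settings that were landed on:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Actively repack pods onto fewer nodes and remove the empties...
  consolidation:
    enabled: true
  # ...or (mutually exclusive with consolidation) remove nodes that have
  # been empty for a fixed time:
  # ttlSecondsAfterEmpty: 60
  # Replace every node after a maximum lifetime, e.g. 24 hours:
  ttlSecondsUntilExpired: 86400
  providerRef:
    name: default
```

A 24-hour ttlSecondsUntilExpired would also be consistent with the daily spot-node replacement mentioned later in the thread, though that is only an inference.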

@ben851
Contributor

ben851 commented Oct 3, 2023

Karpenter is working "OK" with the non-reconciliation timeout. It is not as efficient as it could be but it's good enough for V1. Will aim for a release this week.

@ben851
Contributor

ben851 commented Oct 3, 2023

Just a thought - verify what's up with staging before prod release

@ben851
Contributor

ben851 commented Oct 4, 2023

Put in a PR to address the 502s in staging. This will not affect the Karpenter-only prod release, since it is the Kubernetes API that is emitting these warnings.

Karpenter can be released to prod today for testing and verification

@ben851
Contributor

ben851 commented Oct 4, 2023

Ben will update the ADR on how to stabilize the deployments while using ephemeral nodes.

@sastels

sastels commented Oct 11, 2023

PR to fix Celery socket errors:
cds-snc/notification-api#1996

@sastels

sastels commented Oct 12, 2023

Ready for a (hopefully!) final test today.

@jimleroyer
Member Author

We're still getting 502s in staging, so we need to look into it. The pod disruption budget configuration for the API doesn't seem to hold under Karpenter.
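A minimal sketch of the kind of PodDisruptionBudget involved here (the name and labels are placeholders); a PDB only constrains voluntary evictions such as node drains, which is why its interaction with Karpenter's deprovisioning matters:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # keep at least two API pods up during node drains
  selector:
    matchLabels:
      app: api           # placeholder label for the API deployment
```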

@ben851
Contributor

ben851 commented Oct 16, 2023

Going to remove the API from spot instances.
Going to attempt a test in staging where we disable CloudWatch and restart Celery to verify whether or not the Celery CWAgent init script works.

@ben851
Contributor

ben851 commented Oct 16, 2023

Staging test was successful: the Celery pods did not spin up until CWAgent was ready. Created a PR to re-enable Celery on Karpenter in production.
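An illustrative version of the "wait for CWAgent" pattern described above, as a pod-spec fragment; the image, port, and command are assumptions, not Notify's actual init script:

```yaml
# Assumed to sit in the Celery worker pod spec, alongside containers:.
initContainers:
  - name: wait-for-cwagent
    image: busybox:1.36
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP   # node IP where the cwagent DaemonSet listens
    command:
      - sh
      - -c
      - |
        # Block the main Celery container from starting until the
        # node-local CloudWatch agent endpoint answers (port is assumed).
        until nc -z -w 2 "$HOST_IP" 25888; do
          echo "waiting for cwagent on $HOST_IP:25888"
          sleep 2
        done
```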

@ben851
Contributor

ben851 commented Oct 18, 2023

Found a scenario in prod where CWAgent was not spinning up because the node had insufficient CPU. The Celery pods were reporting that they were ready, but they weren't, because they were stuck waiting for CWAgent. Need to add a couple of patches to fix this.
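One patch implied here is presumably a scheduling priority for the agent; a hedged sketch (the class name and value are placeholders, and reusing the built-in system-node-critical class is an alternative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-critical
value: 1000000
globalDefault: false
description: "Schedule the CloudWatch agent ahead of ordinary workload pods."
# The cwagent DaemonSet pod template would then reference it:
#   spec:
#     priorityClassName: monitoring-critical
```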


@ben851
Contributor

ben851 commented Oct 19, 2023

Implemented a fix in staging to prioritize CWAgent, and also tuned the liveness and readiness probes. This is working but may be a bit too aggressive. I've opened a new PR to increase the delay times.
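For illustration, a container-spec fragment showing the probe knobs being tuned; the probe command, module name, and timings are placeholders, not the values in the PR:

```yaml
# Fragment of the Celery worker container spec (placeholders throughout).
readinessProbe:
  exec:
    # Hypothetical health check; the actual probe command may differ.
    command: ["sh", "-c", "celery -A run_celery inspect ping -d celery@$HOSTNAME --timeout 10"]
  initialDelaySeconds: 60   # the "delay times" the follow-up PR increases
  periodSeconds: 30
  timeoutSeconds: 15
  failureThreshold: 3
livenessProbe:
  exec:
    command: ["sh", "-c", "celery -A run_celery inspect ping -d celery@$HOSTNAME --timeout 10"]
  initialDelaySeconds: 120
  periodSeconds: 60
  timeoutSeconds: 15
  failureThreshold: 5
```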

@ben851
Contributor

ben851 commented Oct 23, 2023

Reverted the probe changes because they were not working. Will move this to review and monitor the system for a week.

@ben851
Contributor

ben851 commented Oct 25, 2023

Steve ran some tests with the visibility timeout in staging, reducing the timeout to 26 seconds. Will look into releasing this soon.

@sastels

sastels commented Oct 31, 2023

CWAgent seems happier now.

@sastels

sastels commented Nov 1, 2023

almost ready to go. waiting on Steve to merge his PR

@sastels

sastels commented Nov 2, 2023

CWAgent is OOMing :/ Might have a fix? Karpenter spot instances are replaced every day, so if CWAgent lasts for 24 hours we should be good.
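If the OOM kills come from the container's memory limit, the usual fix is raising the agent DaemonSet's resources; a placeholder fragment:

```yaml
# Fragment of the cwagent container spec; the numbers are illustrative only.
resources:
  requests:
    cpu: 200m
    memory: 200Mi
  limits:
    memory: 400Mi   # raise this if the agent is OOM-killed at the old limit
```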

@ben851
Contributor

ben851 commented Nov 2, 2023

PR for the alarm created.

@ben851
Contributor

ben851 commented Nov 15, 2023

I've been looking into alerts and alarms based on this, and there doesn't seem to be a great way to create an alarm based on Karpenter logs. The error about being unable to provision nodes seems to occur daily, but it resolves so quickly that it doesn't have any effect.

Looking further, we already have alarms for when Celery has unavailable replicas, which will trigger if Karpenter is having issues. This should be sufficient.

@ben851
Contributor

ben851 commented Nov 16, 2023

I'm going to create an alarm for when Karpenter itself is not running.

@ben851
Contributor

ben851 commented Nov 16, 2023

Alarm created, moving to review
