Improve node auto-scaling for Kubernetes #184

Closed
8 tasks done
jimleroyer opened this issue Aug 10, 2023 · 28 comments
@jimleroyer
Member

jimleroyer commented Aug 10, 2023

Description

As a developer/operator of GC Notify, I would like the system to be able to scale Kubernetes nodes based on load so that we are not constantly running the maximum number of nodes when they are mostly idle.

WHY are we building?

We are pushing changes to Notify that will increase our sending rate to meet OKRs. To accommodate this, we must increase the number of nodes available in Kubernetes. Since these nodes are only needed during peak periods, they sit idle most of the time, which increases costs with no additional benefit. It would be good to be able to scale these nodes on demand to maximize cost efficiency.

WHAT are we building?

There are two methods of autoscaling EKS nodes: the standard Kubernetes Cluster Autoscaler or Karpenter. Karpenter is more flexible and allows us to take advantage of Spot pricing on AWS to further maximize cost efficiency.

We will install and configure Karpenter in Notify's EKS cluster.
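For context, a minimal sketch of the kind of Karpenter Provisioner this describes, using the v1alpha5 API current at the time; the instance types, CPU limit, and TTL below are placeholders, not Notify's actual configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Allow spot capacity to take advantage of Spot pricing.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    # Restrict provisioning to an approved list of node sizes (placeholder values).
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "100"            # hard cap on total provisioned CPU
  ttlSecondsAfterEmpty: 60  # remove nodes that sit empty for a minute
  providerRef:
    name: default           # matching AWSNodeTemplate with subnet/SG selectors
```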

VALUE created by our solution

  • Reduced cost even when not using the burst features, since we will be able to reduce the minimum node count in the main cluster
  • Increased performance, since we will be able to accommodate a higher number of Celery pods during burst periods
  • Easier to scale up in the future

Acceptance Criteria

  • Karpenter is installed and working in staging and production
  • Karpenter is configured to only use the approved node sizes
  • Predictable ramp-up curves that do not impact the immediate performance of Notify
  • We have node selectors for non-Celery pods to ensure they are not deployed to ephemeral nodes (see the sketch after this list)
  • An alarm is created for when Karpenter cannot provision new nodes
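A hedged sketch of the node-selection pattern referenced above: Karpenter-provisioned (ephemeral) nodes carry a label and taint, Celery workers opt in, and everything else stays off. All names, labels, and images are placeholders, not Notify's actual manifests:

```yaml
# Assumed provisioner-side settings (placeholders):
#   labels: { node-type: ephemeral }
#   taints: [{ key: ephemeral, effect: NoSchedule }]
---
# Celery workers opt in to the ephemeral nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 3
  selector:
    matchLabels: { app: celery-worker }
  template:
    metadata:
      labels: { app: celery-worker }
    spec:
      nodeSelector:
        node-type: ephemeral
      tolerations:
        - key: ephemeral
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: celery-worker:latest   # placeholder image
---
# Non-Celery workloads (e.g. the API) carry no toleration and explicitly
# avoid the ephemeral nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: NotIn
                    values: ["ephemeral"]
      containers:
        - name: api
          image: api:latest             # placeholder image
```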

QA Steps

  • Run performance tests that exercise the node scaling curves to ensure the system is not negatively impacted
  • Verify that nodes scale up and down as expected
  • Verify that non-Celery pods do not get deployed to the ephemeral nodes
@jimleroyer changed the title from "Improve node auto-scaling of scaling up for Kubernetes" to "Improve node auto-scaling for Kubernetes" on Aug 10, 2023
@ben851 self-assigned this on Sep 13, 2023
@sastels

sastels commented Sep 18, 2023

Have a preliminary PR to go into staging

@ben851
Contributor

ben851 commented Sep 20, 2023

Deployed Karpenter in staging; had some issues with the initial deploy using Kustomize. Need to go back to the scratch account to improve the install experience.


@ben851
Contributor

ben851 commented Sep 20, 2023

The following command must be run in prod before merging Karpenter to production:
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

@ben851
Contributor

ben851 commented Sep 20, 2023

The above is no longer needed; I added the correct Terraform resources into common.

@sastels

sastels commented Sep 21, 2023

deployed in staging and working! will continue to test with the scaling work.

@jimleroyer
Member Author

Ben is optimizing the scale-down configuration. The Celery worker SIGTERM issue might be fixed as well; we'll confirm with later tests.
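For reference, a sketch of the v1alpha5 scale-down knobs being tuned here; the values are placeholders, not the settings that were landed on:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Actively repack pods onto fewer nodes and remove the empties...
  consolidation:
    enabled: true
  # ...or (mutually exclusive with consolidation) remove nodes that have
  # been empty for a fixed time:
  # ttlSecondsAfterEmpty: 60
  # Replace every node after a maximum lifetime, e.g. 24 hours:
  ttlSecondsUntilExpired: 86400
  providerRef:
    name: default
```

A 24-hour ttlSecondsUntilExpired would also be consistent with the daily spot-node replacement mentioned later in the thread, though that is only an inference.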

@ben851
Contributor

ben851 commented Oct 3, 2023

Karpenter is working "OK" with the non-reconciliation timeout. It is not as efficient as it could be but it's good enough for V1. Will aim for a release this week.

@ben851
Contributor

ben851 commented Oct 3, 2023

Just a thought - verify what's up with staging before prod release

@ben851
Contributor

ben851 commented Oct 4, 2023

Put in a PR to address the 502s in staging. This will not affect the Karpenter-only prod release, since it is the Kubernetes API that is emitting these warnings.

Karpenter can be released to prod today for testing and verification

@ben851
Contributor

ben851 commented Oct 4, 2023

Ben will update the ADR on how to stabilize the deployments while using ephemeral nodes.

@sastels

sastels commented Oct 11, 2023

PR to fix Celery socket errors:
cds-snc/notification-api#1996

@sastels

sastels commented Oct 12, 2023

Ready for a (hopefully!) final test today.

@jimleroyer
Member Author

We're still getting 502s in staging, so we need to look into it. The pod disruption budget configuration for the API doesn't seem to hold under Karpenter.
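A minimal sketch of the kind of PodDisruptionBudget involved here (the name and labels are placeholders); a PDB only constrains voluntary evictions such as node drains, which is why its interaction with Karpenter's deprovisioning matters:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # keep at least two API pods up during node drains
  selector:
    matchLabels:
      app: api           # placeholder label for the API deployment
```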

@ben851
Contributor

ben851 commented Oct 16, 2023

Going to remove the API from spot instances.
Going to attempt a test in staging where we disable CloudWatch and restart Celery to verify whether or not the Celery CWAgent init script works.

@ben851
Contributor

ben851 commented Oct 16, 2023

Staging test was successful: the Celery pods did not spin up until CWAgent was ready. Created a PR to re-enable Celery on Karpenter in production.
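An illustrative version of the "wait for CWAgent" pattern described above, as a pod-spec fragment; the image, port, and command are assumptions, not Notify's actual init script:

```yaml
# Assumed to sit in the Celery worker pod spec, alongside containers:.
initContainers:
  - name: wait-for-cwagent
    image: busybox:1.36
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP   # node IP where the cwagent DaemonSet listens
    command:
      - sh
      - -c
      - |
        # Block the main Celery container from starting until the
        # node-local CloudWatch agent endpoint answers (port is assumed).
        until nc -z -w 2 "$HOST_IP" 25888; do
          echo "waiting for cwagent on $HOST_IP:25888"
          sleep 2
        done
```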

@ben851
Contributor

ben851 commented Oct 18, 2023

Found a scenario in prod where CWAgent was not spinning up because the node had insufficient CPU. The Celery pods were reporting that they were ready, but they weren't, because they were stuck waiting for CWAgent. Need to add a couple of patches to fix this.
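One patch implied here is presumably a scheduling priority for the agent; a hedged sketch (the class name and value are placeholders, and reusing the built-in system-node-critical class is an alternative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-critical
value: 1000000
globalDefault: false
description: "Schedule the CloudWatch agent ahead of ordinary workload pods."
# The cwagent DaemonSet pod template would then reference it:
#   spec:
#     priorityClassName: monitoring-critical
```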


@ben851
Contributor

ben851 commented Oct 19, 2023

Implemented a fix in staging to prioritize CWAgent, and also tuned the liveness and readiness probes. This is working but may be a bit too aggressive. I've opened a new PR to increase the delay times.
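For illustration, a container-spec fragment showing the probe knobs being tuned; the probe command, module name, and timings are placeholders, not the values in the PR:

```yaml
# Fragment of the Celery worker container spec (placeholders throughout).
readinessProbe:
  exec:
    # Hypothetical health check; the actual probe command may differ.
    command: ["sh", "-c", "celery -A run_celery inspect ping -d celery@$HOSTNAME --timeout 10"]
  initialDelaySeconds: 60   # the "delay times" the follow-up PR increases
  periodSeconds: 30
  timeoutSeconds: 15
  failureThreshold: 3
livenessProbe:
  exec:
    command: ["sh", "-c", "celery -A run_celery inspect ping -d celery@$HOSTNAME --timeout 10"]
  initialDelaySeconds: 120
  periodSeconds: 60
  timeoutSeconds: 15
  failureThreshold: 5
```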

@ben851
Contributor

ben851 commented Oct 23, 2023

Reverted the probe changes because they were not working. Will move this to review and monitor the system for a week.

@ben851
Contributor

ben851 commented Oct 25, 2023

Steve ran some tests with the visibility timeout in staging, reducing the timeout to 26 seconds. Will look into releasing this soon.

@sastels

sastels commented Oct 31, 2023

CWAgent seems happier now.

@sastels

sastels commented Nov 1, 2023

almost ready to go. waiting on Steve to merge his PR

@sastels

sastels commented Nov 2, 2023

CWAgent is OOMing :/ Might have a fix? Karpenter spot instances are replaced every day, so if CWAgent lasts for 24 hours we should be good.
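If the OOM kills come from the container's memory limit, the usual fix is raising the agent DaemonSet's resources; a placeholder fragment:

```yaml
# Fragment of the cwagent container spec; the numbers are illustrative only.
resources:
  requests:
    cpu: 200m
    memory: 200Mi
  limits:
    memory: 400Mi   # raise this if the agent is OOM-killed at the old limit
```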

@ben851
Contributor

ben851 commented Nov 2, 2023

PR for the alarm created.

@ben851
Contributor

ben851 commented Nov 15, 2023

I've been looking into alerts and alarms based on this, and there doesn't seem to be a great way to create an alarm based on Karpenter logs. The error about being unable to provision nodes seems to occur daily, but it resolves so quickly that it doesn't have any effect.

Looking further, we already have alarms for when Celery has unavailable replicas, which will trigger if Karpenter is having issues. This should be sufficient.

@ben851
Contributor

ben851 commented Nov 16, 2023

I'm going to create an alarm for when Karpenter itself is not running.

@ben851
Contributor

ben851 commented Nov 16, 2023

Alarm created, moving to review
