Karpenter Operational Monitoring #1692
Labels: kind/feature, needs-triage
Description
What problem are you trying to solve?
We've been running Karpenter in production for the last year and while it is very stable and saving us a bunch of money, it is very hard to distinguish between expected behaviour, degradation and flat failure with the metrics that we get out of Karpenter at the moment.
Some example use cases:
- We use weighted nodepools for our gameservers because we prefer C7i to C6i: C7i instances can be run "hotter", i.e. at higher CPU utilization without framerate degradation, so we can run fewer of them and save money. Weighting is the only option we are aware of today for getting Karpenter to pick instance types that are on paper more expensive (a minimal sketch of this setup follows the list). When Karpenter tries to spin up a C7i (or any other instance type) and there is no availability, it destroys the nodeclaim, increments the `karpenter_nodeclaims_terminated` metric with the `insufficient_capacity` label, and then creates a new nodeclaim for the lower-weighted instance type; an instance spins up and all is well.
- When we can no longer spin up nodes in a region because we have hit our CPU or EBS quota, Karpenter gets an error about the quota limit when it tries to spin up an instance. It then destroys the nodeclaim, increments the `karpenter_nodeclaims_terminated` metric with the `insufficient_capacity` label, creates a new nodeclaim, hits the same issue, and repeats.
- We have seen, e.g. in apse2 az-2a, very low availability of C and M class nodes in generations 6 and 7. This manifests very similarly to the above: we see a lot of nodeclaims terminated for `insufficient_capacity`, but it can be hidden because we also see `insufficient_capacity` for the other zones in the same cluster as we may be falling back to available nodes in those zones.
- We see similar issues with liveness, as some of our clusters run in Local Zones that can take a while to spin up nodes, especially SAE1, which can take 10+ minutes to run the standard EKS bootstrap due to the RTT to us-east-1 and back. Here we frequently see nodeclaims terminated for liveness issues and then the next nodeclaim comes up just fine, if a little slow. Especially during node scale-up as we move into peak usage, this is hard to distinguish from a total failure to create EC2 instances, i.e. an AWS outage affecting the EC2 API, which has hit us in the past.
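For context, here is a minimal sketch of the weighted-nodepool setup described in the first use case. It assumes the Karpenter v1 NodePool API; the pool and EC2NodeClass names are illustrative, and the exact schema may differ between Karpenter versions.

```yaml
# Minimal sketch: the higher-weighted pool pins the preferred (on-paper more
# expensive) C7i family; the lower-weighted pool is the C6i fallback.
# Names ("gameservers-c7i", "default") are illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gameservers-c7i
spec:
  weight: 100               # tried first
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c7i"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gameservers-c6i
spec:
  weight: 50                # fallback when C7i capacity is unavailable
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i"]
```

With this setup, `insufficient_capacity` terminations on the higher-weighted pool are often expected, which is exactly why a raw count of that metric is ambiguous.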
What we would like
There may already be a way to do this with the current metrics, but we haven't found one so far. What we'd like are very clear metrics, or a very clear method using the existing metrics, that indicate when there is an actual problem creating nodes, and whether it is a degradation or a total failure.
So falling back through weighted nodepools would not be an error, but finding no appropriate nodepool left after trying all weighted/matching ones would be. A single liveness failure would not be an error, multiple in a row would. A lack of availability of a certain instance type would not be an error, hitting a quota limit would be, and so on.
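To illustrate the gap, the best we can see how to do today is something like the following Prometheus alerting rule (plain rule-file YAML). It assumes `karpenter_nodeclaims_terminated` carries a `reason` label with values such as `insufficient_capacity` and a `nodepool` label; the label names and threshold are assumptions, not confirmed Karpenter behaviour.

```yaml
# Sketch of what can be built from today's metrics. Label names (reason,
# nodepool) and the threshold are assumptions; adjust to your Karpenter version.
groups:
  - name: karpenter-capacity
    rules:
      - alert: KarpenterRepeatedInsufficientCapacity
        # Counts nodeclaim terminations attributed to insufficient_capacity per
        # nodepool over 30 minutes. It fires for an expected fallback between
        # weighted nodepools just as readily as for a quota exhaustion where
        # every retry fails, which is exactly the ambiguity described above.
        expr: |
          sum by (nodepool) (
            increase(karpenter_nodeclaims_terminated{reason="insufficient_capacity"}[30m])
          ) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Repeated insufficient_capacity nodeclaim terminations in nodepool {{ $labels.nodepool }}"
```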
For example, we could have something like a metric to indicate that Karpenter needs to add a node for provisioning (i.e. deployments are scaling up, more nodes are required), possibly with some sort of uuid, and then tie all the nodeclaims created to fulfil that provisioning need to that uuid, so we can clearly see that a certain cluster/region/zone/whatever is taking X attempts to fulfil a need to provision a new node.
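To make that concrete, here is a purely hypothetical shape for such a metric. `karpenter_provisioning_nodeclaim_attempts_total` and its `provisioning_id` label do not exist in Karpenter today; they are invented here only to illustrate how the proposal could be consumed.

```yaml
# Hypothetical: a recording rule over a metric that does NOT exist today.
# karpenter_provisioning_nodeclaim_attempts_total{provisioning_id="..."} would
# be incremented for every nodeclaim created to satisfy one provisioning need.
groups:
  - name: karpenter-provisioning-attempts
    rules:
      - record: karpenter:provisioning_attempts_per_need:max1h
        # Worst-case number of nodeclaim attempts needed to satisfy any single
        # provisioning need in the last hour. A value near the expected fallback
        # depth suggests degradation; a value that keeps climbing with no node
        # ever registering suggests a real failure (quota, EC2 API outage).
        expr: |
          max(
            sum by (provisioning_id) (
              increase(karpenter_provisioning_nodeclaim_attempts_total[1h])
            )
          )
```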
Also, some documentation improvements would be really useful: a section on how to monitor Karpenter in production would be great, as there are a number of threads in the Karpenter Slack asking this question. And it would help to have more detailed explanations of the metrics, e.g. from the current docs:
This doesn't really explain any more than the name of the metric already does, so it doesn't really help a Karpenter user understand how to use this metric for monitoring.
I'm happy to discuss this in Slack and help out with designing a potential solution/options etc.
How important is this feature to you?