Borrow Mechanism Between NodePools in Karpenter #1703
Labels: kind/feature, needs-triage
Summary:
Introduce a "borrow" mechanism between NodePools in Karpenter, inspired by Kueue's Cohort borrowing functionality. This feature would allow a NodePool to borrow CPU cores (or other resources) from another NodePool, optimizing the utilization of reserved compute instances and improving overall flexibility.
Background:
Karpenter significantly improves workload efficiency by automatically provisioning nodes that meet pod requirements and deprovisioning nodes when they are no longer needed. However, in environments with multiple NodePools (such as those containing both on-demand and reserved instances), it would be beneficial to allow NodePools to share underutilized resources. This would enable more efficient use of reserved instances, ensuring that the resources already allocated are fully leveraged before new nodes are provisioned.
Proposed Solution:
Implement a feature similar to Kueue’s "borrow" mechanism, where a NodePool can borrow unused CPU cores or other resources from other NodePools. As a rough mapping, a Kueue "ClusterQueue" corresponds to a NodePool in Karpenter.
Cohort of NodePools:
Allow grouping of NodePools into a cohort. NodePools in the same cohort should be able to borrow resources (e.g., CPU cores) from each other.
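The cohort grouping could be modeled roughly as below. This is only a sketch: the `cohort` field, the `NodePool` class shown here, and the helper functions are hypothetical names for illustration, not part of Karpenter's current API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodePool:
    """Hypothetical, simplified view of a NodePool; not Karpenter's API."""
    name: str
    cohort: Optional[str] = None  # hypothetical field; None means "no cohort"


def group_by_cohort(pools):
    """Map each cohort name to the names of the NodePools that joined it."""
    cohorts = defaultdict(list)
    for pool in pools:
        if pool.cohort is not None:
            cohorts[pool.cohort].append(pool.name)
    return dict(cohorts)


def can_borrow_from(borrower, lender):
    """Two distinct pools may borrow from each other only inside one cohort."""
    return (
        borrower.name != lender.name
        and borrower.cohort is not None
        and borrower.cohort == lender.cohort
    )


pools = [
    NodePool("reserved", cohort="batch"),
    NodePool("on-demand", cohort="batch"),
    NodePool("gpu"),  # not in any cohort, so it never lends or borrows
]
cohorts = group_by_cohort(pools)
```

Here `reserved` and `on-demand` share the `batch` cohort and may borrow from each other, while `gpu` stays isolated.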
Borrowing Semantics:
When a NodePool runs out of its allocated resources, it should be able to borrow unused resources from another NodePool in the same cohort, up to a configurable limit (borrowingLimit).
Resource Prioritization:
Workloads that fit within a NodePool's nominal quota should be served first, so that borrowing remains a secondary measure. If multiple workloads require borrowing, order them by workload priority and then by creation timestamp, similar to Kueue's approach.
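The borrowing and prioritization rules above might combine as in the following sketch. It assumes a single cohort, counts only CPU cores, and decides the ordering once from the starting state; `Pool`, `Workload`, `needs_borrowing`, and `admit` are hypothetical names, and the logic is a simplification rather than Kueue's or Karpenter's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical quota view of one NodePool in a cohort (CPU cores)."""
    name: str
    nominal: int          # cores the pool owns
    borrowing_limit: int  # max cores it may borrow beyond its nominal quota
    used: int = 0         # cores in use by this pool's workloads, borrowed ones included


@dataclass
class Workload:
    name: str
    pool: str
    cpu: int
    priority: int
    created: int  # creation timestamp; smaller means older


def needs_borrowing(pool, workload):
    """True if admitting the workload would exceed the pool's nominal quota."""
    return pool.used + workload.cpu > pool.nominal


def admit(pools, workloads):
    """Admit workloads for one cohort and return the admitted names in order."""
    by_name = {p.name: p for p in pools}
    total_nominal = sum(p.nominal for p in pools)
    # Non-borrowing workloads first, then higher priority, then older.
    ordered = sorted(
        workloads,
        key=lambda w: (needs_borrowing(by_name[w.pool], w), -w.priority, w.created),
    )
    admitted = []
    for w in ordered:
        pool = by_name[w.pool]
        new_used = pool.used + w.cpu
        borrowed = max(0, new_used - pool.nominal)
        cohort_used = sum(p.used for p in pools) + w.cpu
        # Borrowing must respect the pool's limit and the cohort's total capacity.
        if borrowed <= pool.borrowing_limit and cohort_used <= total_nominal:
            pool.used = new_used
            admitted.append(w.name)
    return admitted


reserved = Pool("reserved", nominal=16, borrowing_limit=0, used=4)
on_demand = Pool("on-demand", nominal=8, borrowing_limit=8, used=6)
workloads = [
    Workload("a", "on-demand", cpu=4, priority=10, created=1),
    Workload("b", "on-demand", cpu=2, priority=0, created=2),
    Workload("c", "on-demand", cpu=10, priority=5, created=3),
]
admitted = admit([reserved, on_demand], workloads)
```

In this example `b` is admitted first because it fits within the nominal quota, `a` then borrows 4 cores from the cohort, and `c` is rejected because it would exceed the borrowing limit of 8 cores.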
Reference:
For details on Kueue’s borrowing semantics, refer to the Kueue documentation.
Use Case:
This feature would be especially useful for environments with a mix of on-demand and reserved instances. For example, a NodePool running on reserved instances could lend unused CPU cores to an on-demand NodePool, reducing the need to provision additional on-demand nodes when reserved capacity is available. This would result in significant cost savings and better resource utilization.
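To make the saving concrete, here is a small worked example. The core counts and node size are made-up numbers, and `on_demand_nodes_needed` is a hypothetical helper, not an existing Karpenter function.

```python
import math


def on_demand_nodes_needed(pending_cores, unused_reserved_cores, cores_per_node):
    """Return (cores borrowed from reserved capacity, new on-demand nodes needed)."""
    borrowed = min(pending_cores, unused_reserved_cores)
    remaining = pending_cores - borrowed
    return borrowed, math.ceil(remaining / cores_per_node)


# Hypothetical scenario: 16 pending cores, 12 idle reserved cores, 8-core nodes.
borrowed, nodes_with_borrowing = on_demand_nodes_needed(16, 12, 8)
_, nodes_without_borrowing = on_demand_nodes_needed(16, 0, 8)
```

With these numbers, borrowing 12 idle reserved cores leaves 4 cores to cover, so one new on-demand node is needed instead of two.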