forked from timstclair/kubernetes
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Dedicated nodes, taints, and tolerations design doc.
- Loading branch information
David Oppenheimer
committed
Jan 24, 2016
1 parent
717551b
commit 14c2763
Showing
1 changed file
with
301 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,301 @@ | ||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING --> | ||
|
||
<!-- BEGIN STRIP_FOR_RELEASE --> | ||
|
||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
<img src="http://kubernetes.io/img/warning.png" alt="WARNING" | ||
width="25" height="25"> | ||
|
||
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2> | ||
|
||
If you are using a released version of Kubernetes, you should | ||
refer to the docs that go with that version. | ||
|
||
Documentation for other releases can be found at | ||
[releases.k8s.io](http://releases.k8s.io). | ||
</strong> | ||
-- | ||
|
||
<!-- END STRIP_FOR_RELEASE --> | ||
|
||
<!-- END MUNGE: UNVERSIONED_WARNING --> | ||
|
||
# Taints, Tolerations, and Dedicated Nodes | ||
|
||
## Introduction | ||
|
||
This document describes *taints* and *tolerations*, which constitute a generic mechanism for restricting | ||
the set of pods that can use a node. We also describe one concrete use case for the mechanism, | ||
namely to limit the set of users (or more generally, authorization domains) | ||
who can access a set of nodes (a feature we call | ||
*dedicated nodes*). There are many other uses--for example, a set of nodes with a particular | ||
piece of hardware could | ||
be reserved for pods that require that hardware, or a node could be marked as unschedulable | ||
when it is being drained before shutdown, or a node could trigger evictions when it experiences | ||
hardware or software problems or abnormal node configurations; see #17190 and #3885 for more discussion. | ||
|
||
## Taints, tolerations, and dedicated nodes | ||
|
||
A *taint* is a new type that is part of the `NodeSpec`; when present, it prevents pods | ||
from scheduling onto the node unless the pod *tolerates* the taint (tolerations are listed | ||
in the `PodSpec`). Note that there are actually multiple flavors of taints: taints that | ||
prevent scheduling on a node, taints that cause the scheduler to try to avoid scheduling | ||
on a node but do not prevent it, taints that prevent a pod from starting on Kubelet even | ||
if the pod's `NodeName` was written directly (i.e. pod did not go through the scheduler), | ||
and taints that evict already-running pods. | ||
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) | ||
has more background on these diffrent scenarios. We will focus on the first | ||
kind of taint in this doc, since it is the kind required for the "dedicated nodes" use case. | ||
|
||
Implementing dedicated nodes using taints and tolerations is straightforward: in essence, a node that | ||
is dedicated to group A gets taint `dedicated=A` and the pods belonging to group A get | ||
toleration `dedicated=A`. (The exact syntax and semantics of taints and tolerations are | ||
described later in this doc.) This keeps all pods except those belonging to group A off of the nodes. | ||
This approach easily generalizes to pods that are allowed to | ||
schedule into multiple dedicated node groups, and nodes that are a member of multiple | ||
dedicated node groups. | ||
|
||
Note that because tolerations are at the granularity of pods, | ||
the mechanism is very flexible -- any policy can be used to determine which tolerations | ||
should be placed on a pod. So the "group A" mentioned above could be all pods from a | ||
particular namespace or set of namespaces, or all pods with some other arbitrary characteristic | ||
in common. We expect that any real-world usage of taints and tolerations will employ an admission controller | ||
to apply the tolerations. For example, to give all pods from namespace A access to dedicated | ||
node group A, an admission controller would add the corresponding toleration to all | ||
pods from namespace A. Or to give all pods that require GPUs access to GPU nodes, an admission | ||
controller would add the toleration for GPU taints to pods that request the GPU resource. | ||
|
||
Everything that can be expressed using taints and tolerations can be expressed using | ||
[node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. in the example | ||
in the previous paragraph, you could put a label `dedicated=A` on the set of dedicated nodes and | ||
a node affinity `dedicated NotIn A` on all pods *not* belonging to group A. But it is | ||
cumbersome to express exclusion policies using node affinity because every time you add | ||
a new type of restricted node, all pods that aren't allowed to use those nodes need to start avoiding those | ||
nodes using node affinity. This means the node affinity list can get quite long in clusters with lots of different | ||
groups of special nodes (lots of dedicated node groups, lots of different kinds of special hardware, etc.). | ||
Moreover, you need to also update any Pending pods when you add new types of special nodes. | ||
In contrast, with taints and tolerations, | ||
when you add a new type of special node, "regular" pods are unaffected, and you just need to add | ||
the necessary toleration to the pods you subsequent create that need to use the new type of special nodes. | ||
To put it another way, with taints and tolerations, only pods that use a set of special nodes | ||
need to know about those special nodes; with the node affinity approach, pods that have | ||
no interest in those special nodes need to know about all of the groups of special nodes. | ||
|
||
One final comment: in practice, it is often desirable to not | ||
only keep "regular" pods off of special nodes, but also to keep "special" pods off of | ||
regular nodes. An example in the dedicated nodes case is to not only keep regular | ||
users off of dedicated nodes, but also to keep dedicated users off of non-dedicated (shared) | ||
nodes. In this case, the "non-dedicated" nodes can be modeled as their own dedicated node group | ||
(for example, tainted as `dedicated=shared`), and pods that are not given access to any | ||
dedicated nodes ("regular" pods) would be given a toleration for `dedicated=shared`. (As mentioned earlier, | ||
we expect tolerations will be added by an admission controller.) In this case taints/tolerations | ||
are still better than node affinity because with taints/tolerations each pod only needs one special "marking", | ||
versus in the node affinity case where every time you add a dedicated node group (i.e. a new | ||
`dedicated=` value), you need to add a new node affinity rule to all pods (including pending pods) | ||
except the ones allowed to use that new dedicated node group. | ||
|
||
## API | ||
|
||
```go | ||
// The node this Taint is attached to has the effect "effect" on | ||
// any pod that that does not tolerate the Taint. | ||
type Taint struct { | ||
Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` | ||
Value string `json:"value,omitempty"` | ||
Effect TaintEffect `json:"effect"` | ||
} | ||
|
||
type TaintEffect string | ||
|
||
const ( | ||
// Do not allow new pods to schedule unless they tolerate the taint, | ||
// but allow all pods submitted to Kubelet without going through the scheduler | ||
// to start, and allow all already-running pods to continue running. | ||
// Enforced by the scheduler. | ||
TaintEffectNoSchedule TaintEffect = "NoSchedule" | ||
// Like TaintEffectNoSchedule, but the scheduler tries not to schedule | ||
// new pods onto the node, rather than prohibiting new pods from scheduling | ||
// onto the node. Enforced by the scheduler. | ||
TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule" | ||
// Do not allow new pods to schedule unless they tolerate the taint, | ||
// do not allow pods to start on Kubelet unless they tolerate the taint, | ||
// but allow all already-running pods to continue running. | ||
// Enforced by the scheduler and Kubelet. | ||
TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit" | ||
// Do not allow new pods to schedule unless they tolerate the taint, | ||
// do not allow pods to start on Kubelet unless they tolerate the taint, | ||
// and try to eventually evict any already-running pods that do not tolerate the taint. | ||
// Enforced by the scheduler and Kubelet. | ||
TaintEffectNoScheduleNoAdmitNoExecute = "NoScheduleNoAdmitNoExecute" | ||
) | ||
|
||
// The pod this Toleration is attached to tolerates any taint that matches | ||
// the triple <key,value,effect> using the matching operator <operator>. | ||
type Toleration struct { | ||
Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` | ||
// operator represents a key's relationship to the value. | ||
// Valid operators are Exists and Equal. Defaults to Equal. | ||
// Exists is equivalent to wildcard for value, so that a pod can | ||
// tolerate all taints of a particular category. | ||
Operator TolerationOperator `json:"operator"` | ||
Value string `json:"value,omitempty"` | ||
Effect TaintEffect `json:"effect"` | ||
// TODO: For forgiveness (#1574), we'd eventually add at least a grace period | ||
// here, and possibly an occurrence threshold and period. | ||
} | ||
|
||
// A toleration operator is the set of operators that can be used in a toleration. | ||
type TolerationOperator string | ||
|
||
const ( | ||
TolerationOpExists TolerationOperator = "Exists" | ||
TolerationOpEqual TolerationOperator = "Equal" | ||
) | ||
|
||
``` | ||
|
||
(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) | ||
to understand the motivation for the various taint effects.) | ||
|
||
We will add | ||
|
||
```go | ||
// Multiple tolerations with the same key are allowed. | ||
Tolerations []Toleration `json:"tolerations,omitempty"` | ||
``` | ||
|
||
to `PodSpec`. A pod must tolerate all of a node's taints (except taints | ||
of type TaintEffectPreferNoSchedule) in order to be able | ||
to schedule onto that node. | ||
|
||
We will add | ||
|
||
```go | ||
// Multiple taints with the same key are not allowed. | ||
Taints []Taint `json:"taints,omitempty"` | ||
``` | ||
|
||
to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union | ||
of the taints specified by various sources. For now, the only source is | ||
the `NodeSpec` itself, but in the future one could imagine a node inheriting | ||
taints from pods (if we were to allow taints to be attached to pods), from | ||
the node's startup coniguration, etc. The scheduler should look at the `Taints` | ||
in `NodeStatus`, not in `NodeSpec`. | ||
|
||
Taints and tolerations are not scoped to namespace. | ||
|
||
## Implementation plan: taints, tolerations, and dedicated nodes | ||
|
||
Using taints and tolerations to implement dedicated nodes requires these steps: | ||
|
||
1. Add the API described above | ||
1. Add a scheduler predicate function that respects taints and tolerations (for TaintEffectNoSchedule) | ||
and a scheduler priority function that respects taints and tolerations (for TaintEffectPreferNoSchedule). | ||
1. Add to the Kubelet code to implement the "no admit" behavior of TaintEffectNoScheduleNoAdmit and | ||
TaintEffectNoScheduleNoAdmitNoExecute | ||
1. Implement code in Kubelet that evicts a pod that no longer satisfies | ||
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the controllers | ||
instead, but since taints might be used to enforce security policies, it is better | ||
to do in kubelet because kubelet can respond quickly and can guarantee the rules will | ||
be applied to all pods. | ||
Eviction may need to happen under a variety of circumstances: when a taint is added, when an existing | ||
taint is updated, when a toleration is removed from a pod, or when a toleration is modified on a pod. | ||
1. Add a new `kubectl` command that adds/removes taints to/from nodes, | ||
1. (This is the one step is that is specific to dedicated nodes) | ||
Implement an admission controller that adds tolerations to pods that are supposed | ||
to be allowed to use dedicated nodes (for example, based on pod's namespace). | ||
|
||
In the future one can imagine a generic policy configuration that configures | ||
an admission controller to apply the appropriate tolerations to the desired class of pods and | ||
taints to Nodes upon node creation. It could be used not just for policies about dedicated nodes, | ||
but also other uses of taints and tolerations, e.g. nodes that are restricted | ||
due to their hardware configuration. | ||
|
||
The `kubectl` command to add and remove taints on nodes will be modeled after `kubectl label`. | ||
Examples usages: | ||
|
||
```sh | ||
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'. | ||
# If a taint with that key already exists, its value and effect are replaced as specified. | ||
$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute | ||
|
||
# Remove from node 'foo' the taint with key 'dedicated' if one exists. | ||
$ kubectl taint nodes foo dedicated- | ||
``` | ||
|
||
## Example: implementing a dedicated nodes policy | ||
|
||
Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available | ||
only to pods in a particular namespace `banana`. First the administrator does | ||
|
||
```sh | ||
$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute | ||
$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute | ||
$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute | ||
|
||
``` | ||
|
||
(assuming they want to evict pods that are already running on those nodes if those | ||
pods don't already tolerate the new taint) | ||
|
||
Then they ensure that the `PodSpec` for all pods created in namespace `banana` specify | ||
a toleration with `key=dedicated`, `value=banana`, and `policy=NoScheduleNoAdmitNoExecute`. | ||
|
||
In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having | ||
to enumerate them by name. | ||
|
||
## Future work | ||
|
||
At present, the Kubernetes security model allows any user to add and remove any taints and tolerations. | ||
Obviously this makes it impossible to securely enforce | ||
rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints` | ||
field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`) | ||
and from mutating the `Tolerations` field of their pods. #17549 is relevant. | ||
|
||
Another security vulnterability arises if nodes are added to the cluster before receiving | ||
their taint. Thus we need to ensure that a new node does not become "Ready" until it has been | ||
configured with its taints. One way to do this is to have an admission controller that adds the taint whenever | ||
a Node object is created. | ||
|
||
A quota policy may want to treat nodes diffrently based on what taints, if any, | ||
they have. For example, if a particular namespace is only allowed to access dedicated nodes, | ||
then it may be convenient to give the namespace unlimited quota. (To use finite quota, | ||
you'd have to size the namespace's quota to the sum of the sizes of the machines in the | ||
dedicated node group, and update it when nodes are added/removed to/from the group.) | ||
|
||
It's conceivable that taints and tolerations could be unified with [pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265). | ||
We have chosen not to do this for the reasons described in the "Future work" section of that doc. | ||
|
||
## Backward compatibility | ||
|
||
Old scheduler versions will ignore taints and tolerations. New scheduler versions | ||
will respect them. | ||
|
||
Users should not start using taints and tolerations until the full implementation | ||
has been in Kubelet and the master for enough binary versions that we | ||
feel comfortable that we will not need to roll back either Kubelet or | ||
master to a version that does not support them. Longer-term we will | ||
use a progamatic approach to enforcing this (#4855). | ||
|
||
## Related issues | ||
|
||
This proposal is based on the discussion in #17190. There are a number of other | ||
related issues, all of which are linked to from #17190. | ||
|
||
The relationship between taints and node drains is discussed in #1574. | ||
|
||
The concepts of taints and tolerations were originally developed as part of the | ||
Omega project at Google. | ||
|
||
|
||
|
||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS --> | ||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]() | ||
<!-- END MUNGE: GENERATED_ANALYTICS --> |