# Taints, Tolerations, and Dedicated Nodes

## Introduction

This document describes *taints* and *tolerations*, which constitute a generic mechanism for restricting
the set of pods that can use a node. We also describe one concrete use case for the mechanism,
namely to limit the set of users (or more generally, authorization domains)
who can access a set of nodes (a feature we call
*dedicated nodes*). There are many other uses--for example, a set of nodes with a particular
piece of hardware could
be reserved for pods that require that hardware, or a node could be marked as unschedulable
when it is being drained before shutdown, or a node could trigger evictions when it experiences
hardware or software problems or abnormal node configurations; see #17190 and #3885 for more discussion.

## Taints, tolerations, and dedicated nodes

A *taint* is a new type that is part of the `NodeSpec`; when present, it prevents pods
from scheduling onto the node unless the pod *tolerates* the taint (tolerations are listed
in the `PodSpec`). Note that there are actually multiple flavors of taints: taints that
prevent scheduling on a node, taints that cause the scheduler to try to avoid scheduling
on a node but do not prevent it, taints that prevent a pod from starting on Kubelet even
if the pod's `NodeName` was written directly (i.e. pod did not go through the scheduler),
and taints that evict already-running pods.
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
has more background on these different scenarios. We will focus on the first
kind of taint in this doc, since it is the kind required for the "dedicated nodes" use case.

Implementing dedicated nodes using taints and tolerations is straightforward: in essence, a node that
is dedicated to group A gets taint `dedicated=A` and the pods belonging to group A get
toleration `dedicated=A`. (The exact syntax and semantics of taints and tolerations are
described later in this doc.) This keeps all pods except those belonging to group A off of the nodes.
This approach easily generalizes to pods that are allowed to
schedule into multiple dedicated node groups, and nodes that are a member of multiple
dedicated node groups.

Note that because tolerations are at the granularity of pods,
the mechanism is very flexible -- any policy can be used to determine which tolerations
should be placed on a pod. So the "group A" mentioned above could be all pods from a
particular namespace or set of namespaces, or all pods with some other arbitrary characteristic
in common. We expect that any real-world usage of taints and tolerations will employ an admission controller
to apply the tolerations. For example, to give all pods from namespace A access to dedicated
node group A, an admission controller would add the corresponding toleration to all
pods from namespace A. Or to give all pods that require GPUs access to GPU nodes, an admission
controller would add the toleration for GPU taints to pods that request the GPU resource.

Everything that can be expressed using taints and tolerations can be expressed using
[node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. in the example
in the previous paragraph, you could put a label `dedicated=A` on the set of dedicated nodes and
a node affinity `dedicated NotIn A` on all pods *not* belonging to group A. But it is
cumbersome to express exclusion policies using node affinity because every time you add
a new type of restricted node, all pods that aren't allowed to use those nodes need to start avoiding those
nodes using node affinity. This means the node affinity list can get quite long in clusters with lots of different
groups of special nodes (lots of dedicated node groups, lots of different kinds of special hardware, etc.).
Moreover, you also need to update any Pending pods when you add new types of special nodes.
In contrast, with taints and tolerations,
when you add a new type of special node, "regular" pods are unaffected, and you just need to add
the necessary toleration to the pods you subsequently create that need to use the new type of special node.
To put it another way, with taints and tolerations, only pods that use a set of special nodes
need to know about those special nodes; with the node affinity approach, pods that have
no interest in those special nodes need to know about all of the groups of special nodes.

One final comment: in practice, it is often desirable to not
only keep "regular" pods off of special nodes, but also to keep "special" pods off of
regular nodes. An example in the dedicated nodes case is to not only keep regular
users off of dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
nodes. In this case, the "non-dedicated" nodes can be modeled as their own dedicated node group
(for example, tainted as `dedicated=shared`), and pods that are not given access to any
dedicated nodes ("regular" pods) would be given a toleration for `dedicated=shared`. (As mentioned earlier,
we expect tolerations will be added by an admission controller.) In this case taints/tolerations
are still better than node affinity because with taints/tolerations each pod only needs one special "marking",
versus in the node affinity case where every time you add a dedicated node group (i.e. a new
`dedicated=` value), you need to add a new node affinity rule to all pods (including pending pods)
except the ones allowed to use that new dedicated node group.

## API

```go
// The node this Taint is attached to has the effect "effect" on
// any pod that does not tolerate the Taint.
type Taint struct {
	Key    string      `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	Value  string      `json:"value,omitempty"`
	Effect TaintEffect `json:"effect"`
}

type TaintEffect string

const (
	// Do not allow new pods to schedule unless they tolerate the taint,
	// but allow all pods submitted to Kubelet without going through the scheduler
	// to start, and allow all already-running pods to continue running.
	// Enforced by the scheduler.
	TaintEffectNoSchedule TaintEffect = "NoSchedule"
	// Like TaintEffectNoSchedule, but the scheduler tries not to schedule
	// new pods onto the node, rather than prohibiting new pods from scheduling
	// onto the node. Enforced by the scheduler.
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
	// Do not allow new pods to schedule unless they tolerate the taint,
	// do not allow pods to start on Kubelet unless they tolerate the taint,
	// but allow all already-running pods to continue running.
	// Enforced by the scheduler and Kubelet.
	TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
	// Do not allow new pods to schedule unless they tolerate the taint,
	// do not allow pods to start on Kubelet unless they tolerate the taint,
	// and try to eventually evict any already-running pods that do not tolerate the taint.
	// Enforced by the scheduler and Kubelet.
	TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute"
)

// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	// Operator represents a key's relationship to the value.
	// Valid operators are Exists and Equal. Defaults to Equal.
	// Exists is equivalent to wildcard for value, so that a pod can
	// tolerate all taints of a particular category.
	Operator TolerationOperator `json:"operator"`
	Value    string             `json:"value,omitempty"`
	Effect   TaintEffect        `json:"effect"`
	// TODO: For forgiveness (#1574), we'd eventually add at least a grace period
	// here, and possibly an occurrence threshold and period.
}

// TolerationOperator is the set of operators that can be used in a toleration.
type TolerationOperator string

const (
	TolerationOpExists TolerationOperator = "Exists"
	TolerationOpEqual  TolerationOperator = "Equal"
)
```

(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
to understand the motivation for the various taint effects.)

We will add

```go
// Multiple tolerations with the same key are allowed.
Tolerations []Toleration `json:"tolerations,omitempty"`
```

to `PodSpec`. A pod must tolerate all of a node's taints (except taints
of type TaintEffectPreferNoSchedule) in order to be able
to schedule onto that node.
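
To make this rule concrete, here is a rough sketch of the predicate logic using the types above (the function names are illustrative; this is not the actual scheduler code):

```go
// toleratesTaint reports whether a single toleration matches a single taint:
// the keys and effects must be equal, and the values are compared according
// to the toleration's operator. (Sketch of the proposed semantics only.)
func toleratesTaint(tol Toleration, taint Taint) bool {
	if tol.Key != taint.Key || tol.Effect != taint.Effect {
		return false
	}
	if tol.Operator == TolerationOpExists {
		return true // Exists acts as a wildcard on the value
	}
	return tol.Value == taint.Value // Equal, the default operator
}

// podToleratesNodeTaints expresses the scheduling rule above: the pod may be
// scheduled onto the node only if it tolerates every taint on the node, except
// taints with effect PreferNoSchedule, which only influence the priority function.
func podToleratesNodeTaints(tolerations []Toleration, nodeTaints []Taint) bool {
	for _, taint := range nodeTaints {
		if taint.Effect == TaintEffectPreferNoSchedule {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if toleratesTaint(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```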

We will add

```go
// Multiple taints with the same key are not allowed.
Taints []Taint `json:"taints,omitempty"`
```

to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
of the taints specified by various sources. For now, the only source is
the `NodeSpec` itself, but in the future one could imagine a node inheriting
taints from pods (if we were to allow taints to be attached to pods), from
the node's startup configuration, etc. The scheduler should look at the `Taints`
in `NodeStatus`, not in `NodeSpec`.
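
A minimal sketch of how that union might be computed (the function name is hypothetical, and we assume the only merge rule needed is "at most one taint per key"):

```go
// mergeTaints computes NodeStatus.Taints as the union of taints from all
// sources. Today the only source is NodeSpec.Taints, so the merge is trivial,
// but additional sources (e.g. node startup configuration) could be passed in
// later. At most one taint per key is kept, matching the rule that multiple
// taints with the same key are not allowed.
func mergeTaints(sources ...[]Taint) []Taint {
	seen := map[string]bool{}
	var merged []Taint
	for _, source := range sources {
		for _, t := range source {
			if !seen[t.Key] {
				seen[t.Key] = true
				merged = append(merged, t)
			}
		}
	}
	return merged
}
```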

Taints and tolerations are not scoped to namespace.

## Implementation plan: taints, tolerations, and dedicated nodes

Using taints and tolerations to implement dedicated nodes requires these steps:

1. Add the API described above
1. Add a scheduler predicate function that respects taints and tolerations (for TaintEffectNoSchedule)
and a scheduler priority function that respects taints and tolerations (for TaintEffectPreferNoSchedule).
1. Add code to Kubelet to implement the "no admit" behavior of TaintEffectNoScheduleNoAdmit and
TaintEffectNoScheduleNoAdmitNoExecute.
1. Implement code in Kubelet that evicts a pod that no longer tolerates a taint with effect
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the controllers
instead, but since taints might be used to enforce security policies, it is better
to do it in Kubelet because Kubelet can respond quickly and can guarantee the rules will
be applied to all pods.
Eviction may need to happen under a variety of circumstances: when a taint is added, when an existing
taint is updated, when a toleration is removed from a pod, or when a toleration is modified on a pod.
1. Add a new `kubectl` command that adds/removes taints to/from nodes.
1. (This is the one step that is specific to dedicated nodes.)
Implement an admission controller that adds tolerations to pods that are supposed
to be allowed to use dedicated nodes (for example, based on the pod's namespace); a sketch follows this list.
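
As a sketch of that last step, using the types from the API section (the function name and the namespace-to-group mapping are illustrative; a real implementation would plug into the API server's admission chain):

```go
// addDedicatedTolerations is a sketch of the admission controller's core
// logic: if the pod's namespace has been granted access to a dedicated node
// group, append the matching toleration to the pod's tolerations. The
// namespace-to-group mapping is hypothetical configuration.
func addDedicatedTolerations(namespace string, tolerations []Toleration, groupForNamespace map[string]string) []Toleration {
	group, ok := groupForNamespace[namespace]
	if !ok {
		return tolerations // namespace has no dedicated node group; leave the pod unchanged
	}
	return append(tolerations, Toleration{
		Key:      "dedicated",
		Operator: TolerationOpEqual,
		Value:    group,
		Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
	})
}
```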

In the future one can imagine a generic policy configuration that configures
an admission controller to apply the appropriate tolerations to the desired class of pods and
taints to Nodes upon node creation. It could be used not just for policies about dedicated nodes,
but also other uses of taints and tolerations, e.g. nodes that are restricted
due to their hardware configuration.

The `kubectl` command to add and remove taints on nodes will be modeled after `kubectl label`.
Example usages:

```sh
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
# If a taint with that key already exists, its value and effect are replaced as specified.
$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute

# Remove from node 'foo' the taint with key 'dedicated' if one exists.
$ kubectl taint nodes foo dedicated-
```
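
The mutation these commands perform on the node's `Taints` field can be sketched as follows (helper names are hypothetical); since multiple taints with the same key are not allowed, adding a taint whose key already exists replaces the existing taint:

```go
// upsertTaint adds t to the node's taints, replacing any existing taint with
// the same key (at most one taint per key is allowed).
func upsertTaint(taints []Taint, t Taint) []Taint {
	for i := range taints {
		if taints[i].Key == t.Key {
			taints[i] = t
			return taints
		}
	}
	return append(taints, t)
}

// removeTaint deletes the taint with the given key, if present
// (the `kubectl taint nodes foo dedicated-` form).
func removeTaint(taints []Taint, key string) []Taint {
	kept := taints[:0]
	for _, t := range taints {
		if t.Key != key {
			kept = append(kept, t)
		}
	}
	return kept
}
```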

## Example: implementing a dedicated nodes policy

Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
only to pods in a particular namespace `banana`. First the administrator does

```sh
$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute

```

(assuming they want to evict pods that are already running on those nodes if those
pods don't already tolerate the new taint)

Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
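
Expressed with the Go types from the API section (purely illustrative), that toleration is:

```go
// The toleration every pod in namespace "banana" needs in order to be
// admitted to and scheduled onto the dedicated nodes foo, bar, and baz.
bananaToleration := Toleration{
	Key:      "dedicated",
	Operator: TolerationOpEqual,
	Value:    "banana",
	Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
}
```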

In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
to enumerate them by name.

## Future work

At present, the Kubernetes security model allows any user to add and remove any taints and tolerations.
Obviously this makes it impossible to securely enforce
rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints`
field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`)
and from mutating the `Tolerations` field of their pods. #17549 is relevant.

Another security vulnerability arises if nodes are added to the cluster before receiving
their taint. Thus we need to ensure that a new node does not become "Ready" until it has been
configured with its taints. One way to do this is to have an admission controller that adds the taint whenever
a Node object is created.

A quota policy may want to treat nodes differently based on what taints, if any,
they have. For example, if a particular namespace is only allowed to access dedicated nodes,
then it may be convenient to give the namespace unlimited quota. (To use finite quota,
you'd have to size the namespace's quota to the sum of the sizes of the machines in the
dedicated node group, and update it when nodes are added/removed to/from the group.)

It's conceivable that taints and tolerations could be unified with [pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
We have chosen not to do this for the reasons described in the "Future work" section of that doc.

## Backward compatibility

Old scheduler versions will ignore taints and tolerations. New scheduler versions
will respect them.

Users should not start using taints and tolerations until the full implementation
has been in Kubelet and the master for enough binary versions that we
feel comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will
use a programmatic approach to enforcing this (#4855).

## Related issues

This proposal is based on the discussion in #17190. There are a number of other
related issues, all of which are linked to from #17190.

The relationship between taints and node drains is discussed in #1574.

The concepts of taints and tolerations were originally developed as part of the
Omega project at Google.


