Gather use cases for service accounts as selectors #274
# NPEP-173: Service accounts as selectors

* Issue: [#173](https://github.com/kubernetes-sigs/network-policy-api/issues/173)
* Status: Provisional

## TLDR

The ability to create policies that control network traffic based on workload identities, a.k.a.
[Kubernetes service accounts](https://kubernetes.io/docs/concepts/security/service-accounts/).
## Goals

* Use service accounts (identities) as a way of describing (admin) network policies for pods
## Non-Goals

* Other identity-management constructs, such as SPIFFE IDs, are
outside the scope of this enhancement. We can only provide a way to select
constructs known to core Kubernetes.
## Introduction

Every pod in Kubernetes has an associated service account. If the user does
not explicitly provide one, the pod is assigned the `default` service account
of its namespace. A given pod always has exactly one service account associated
with it. So instead of using pod labels as a way of selecting pods, there are
use cases where service accounts are the better way of selecting them. This
NPEP tries to capture the design details for service accounts as selectors
for policies.
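
For context, here is a minimal pod manifest showing the `serviceAccountName`
field that such selectors would key on (all names are illustrative); when the
field is omitted, the pod is assigned the namespace's `default` service account:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-frontend            # illustrative name
  namespace: payments                # illustrative namespace
spec:
  # Immutable after pod creation; defaults to "default" when omitted.
  serviceAccountName: payments-frontend
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```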

## User-Stories/Use-Cases

> **Review comment:** One of the goals/motivations is also to decrease the implementation complexity and allow larger scale. I think we want to cover this here or in the goals section.
>
> **Reply:** @LiorLieberman: check user story 2; does that cover what you had in mind? If it's not clear and requires clarity, I can try to reword it based on suggestions.

1. **As a** cluster admin **I want** to select pods using their service accounts
instead of labels **so that** I can avoid having to set up webhooks and
validating admission policies to prevent users from changing labels on their
namespaces and pods, which would make my policy start or stop matching the
traffic in undesired ways.

> **Review comment on lines +33 to +37:** I am not sure I understand this user story: the only difference between labels vs serviceAccount here is that labels can be changed after pod creation, and serviceAccount can't. But the user can still create a new serviceAccount and a new pod using this account, which will not "match" a policy.
>
> **Reply:** There is a huge difference IMO. Some more discussion in https://blog.howardjohn.info/posts/netpol-api/ and https://blog.howardjohn.info/posts/node-zerotrust/
>
> **Reply:** Yes, and that is exactly the use case I am trying to cover: the fact that labels are mutable on pods and service accounts are not. Creation and deletion of pods is, for sure, going to cause the same churn for both selectors, but updates of pods need that extra step/precaution for labels.

2. **As a** cluster admin **I want** to select pods using their service accounts
instead of labels **so that** I can avoid the scale impact caused by
mutation of pod/namespace labels, which creates churn that makes my CNI
implementation react every time a user changes a label.
3. **As a** cluster admin, my workloads have immutable identities, and **I want**
to apply policies per workload using their service accounts instead of labels,
**since** I want eventual consistency of that policy in the cluster.

> **Review comment:** Plausibly: whether this is to other k8s clusters or non-k8s, having SA enables this if you make the implementation choice that SA maps to a transport identity (i.e. in Istio it maps to the TLS SAN).
>
> **Reply:** Hmm, so this would be the non-goal I had listed out, but we can revisit this, and I can try to add this user story as well as a goal, given that maybe the entity, like you said, is on another k8s cluster. Let me learn more about the Istio use case here and see if, for each of my above use cases, I can come up with concrete examples for the user story.

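As a purely illustrative sketch of what these stories ask for, the rule below
extends the existing AdminNetworkPolicy shape with an invented `serviceAccounts`
peer; the field name and its semantics are hypothetical and not part of any
released API:

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: allow-backend-to-frontend
spec:
  priority: 10
  subject:
    serviceAccounts:             # hypothetical field, invented for illustration
      namespace: payments
      name: payments-frontend
  ingress:
  - name: allow-from-backend
    action: Allow
    from:
    - serviceAccounts:           # hypothetical field, invented for illustration
        namespace: payments
        name: payments-backend
```

Unlike a pod-label selector, nothing a workload owner can mutate after pod
creation changes whether such a rule matches, which is the property user
stories 1 and 2 are after.
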
## Unresolved Discussions

* How to provide a standard way to configure/describe the service mesh behavior
of intercepting traffic and deciding whether to allow it based on information
in the TLS packets?
* NetworkPolicies apply at L3 and L4, while meshes operate mostly at L7. So when
trying to express "denyAll connections except the allowed serviceAccount to
serviceAccount connections", how do we split the enforcement in this scenario
between the CNI plugin that implements the network policy at L3/L4 and the
service mesh implementation that implements the policy at L7?
* One way is probably to split the implementation responsibility:

> **Review comment:** An alternative option is to have a networkPolicyClass (or similar).

  1. CNI plugin can take care of implementing denyAll connections and
allowed serviceAccount to serviceAccount connections up to L3/L4

> **Review comment on lines +54 to +57:** This is assuming that it is either the CNI plugin or the service mesh, and it can perfectly well be another component implementing it; see kube-network-policies for example ... I think we need to be agnostic of the implementation and focus on the behaviors we want to implement ... "we want to express that a group of pods using a specific service account can or can not communicate with another group identified by another service account" ... at what point the policy enforcement happens should not be in scope and should be left to the implementation.
>
> **Reply:** I don't see how a non-CNI can implement it then. If both the CNI and the mesh are trying to enforce the same rule, it is either redundant (they both apply the same logic) or broken (one accepts, one denies). For instance, I may have kube-network-policies and Istio, and say "allow SA foo". I get a TLS request from out-of-cluster with a TLS SAN mapping to [...]. IMO we need something like [...]

> **Review comment:** The problem with this, as I was saying in #274 (comment), is that it seems pointless. If this is going to work correctly, without accidentally allowing any connections that aren't supposed to be allowed, then the ANP/NP implementation needs to let through exactly the set of connections that the mesh implementation is going to inspect, which implies it needs to understand "serviceAccount selectors" and how they correspond to pod IP addresses. But at that point, it can just validate the serviceAccount selectors itself based entirely on IPs, and the semantics of the service mesh guarantee that this is exactly equivalent to validating based on the TLS certificates, so the mesh doesn't actually have to do anything. So we've been thinking about the TLS-based validation as being something roughly parallel to the "NetworkPolicy" step of our policy evaluation chain: [...] but what if instead it was parallel to the BANP step? We'd say, if you want to use TLS-based policy, then that becomes the baseline (at least for pod-to-pod traffic). Then we don't have to worry about interleaving IP-based and TLS-based policy, because you'd only reach the TLS-based policy step if you had already exhausted the IP-based steps. (This doesn't mean you can't have "deny all unless accepted by serviceAccount match", it just means that the "deny all" part has to be implemented by the TLS-based policy implementation rather than by the core ANP/NP implementation in this case.) This may go with the idea I had vaguely tossed out about replacing BaselineAdminNetworkPolicy with "ClusterDefaultPolicy" as part of trying to make tenancy make sense...
>
> **Reply:** Isn't there some eventual consistency at play here? With a TLS certificate, I know the identity of the Pod with confidence. With an IP address, I'm assuming it matches the Pod that had that IP based on the most recent data I have, but if my Pod -> IP mapping is out of date, this isn't as strong a guarantee as a TLS cert is.
>
> **Reply:** +1, I think I was trying to get to a similar point in another comment. "Deny all" could also be implemented at the identity-based policy layer.
>
> **Reply:** It would definitely be worth exploring a way to express "default deny" in a sufficiently generic way that any category of implementation could support it. For example, that seemed to be focused on namespace granularity, which could be some common ground in terms of implementability.

  2. service mesh implementation can implement the allowed serviceAccount to
serviceAccount connections at L7

> **Review comment on lines +51 to +60:** This concerns me the most. Can't think of a good way to divide the responsibility between mesh and CNI in the same API without creating an extremely complex UX (and perhaps a less portable experience).

  * NOTE: There are some service mesh implementations that have a
CNI component in them that collapses the logic of enforcement, so it
might become a requirement that the implementation be able to handle
end-to-end enforcement of the full policy.
* We might also want to express additional constraints at L7 (such as only
allowing requests from a given identity on specific paths or with specific
HTTP/gRPC methods). Ideally we would want these extra fields to coexist
with the L3/L4 authorization policy. (For prior art, see the Istio sketch
after this list.)

> **Review comment:** I don't think we should put this in scope. L7 is a totally different layer with much more complexity and an infinite number of protocols with different semantics.
>
> **Reply:** +1, I think we should not care about it and leave it to Gateway/mesh implementations. Most probably it will be a different API.
>
> **Reply:** I generally agree. FWIW Istio literally splits these layers into two components (so there are really 3 layers in the system: L3/L4 CNI, L4 TLS, L7 TLS + HTTP).
>
> **Reply:** Ack; if we shouldn't worry about additional L7 requests I can remove this point. I think I was trying to capture some talking points from #274 (comment), but I can remove this given we probably don't want to club that here.

* Should this really be part of the AdminNetworkPolicy and NetworkPolicy APIs, or
should this be a new CRD?
  * Making it part of existing APIs: existing network policy APIs are pretty heavy
with other types of selectors (namespaces, pods, nodes, networks, FQDNs), and
expecting mesh implementations to implement the full CRD just to get
identity-based selection might not be practical.
  * Making it part of a new API: we fall back to the compatibility problem of the
different layers and the coexistence of this new API on the same cluster as the
existing network policy APIs.

> **Review comment:** Why compatibility issues? The different policies applied should be an AND; in which sense is this incompatible?
>
> **Reply:** I don't see compatibility issues, but I can see terrible UX if we end up with another API which sits on top of netpol. This means that as a user, to allow point A to talk with point B on a cluster with default deny, I need to: [...]
>
> **Reply:** I think there is a non-trivial set of implementations that would like to support an identity-based version of network policy, but would not support the additional set of APIs based on label selectors. To me, that suggests that it would be cleaner to have a new API focused on identity-based APIs. I agree that @LiorLieberman's example is not a great UX, but I'd hope that most users would be fully served by only one of these APIs and would not need to layer like that. Could we express default deny within this new policy to avoid layering?
>
> **Reply:** FWIW in Istio we would ideally have a policy that consists only of SA/identity and not pod label selectors.
>
> **Reply:** Probably compatibility is not the right word here; I meant the layering problem. But if we agree that an "AND" of the APIs is what we are after, wherein the L3/L4 netpol API is denying a connection but the new identity API wants to allow it (for pods' SAs), then we deny, because we need it to be allowed by all APIs, and there is one API denying the connection even if the other API explicitly allows it?

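For prior art on combining service-account identity with L7 constraints,
Istio's existing AuthorizationPolicy API (shown here with illustrative names)
already expresses "allow this identity, but only on these paths/methods";
Istio derives the principal string from the workload's service account:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend-reads
  namespace: payments
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-style principal derived from the service account
        principals: ["cluster.local/ns/payments/sa/payments-frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
```
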
## API

(... details, can point to PR with changes)
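
Since the API shape is still undecided, here is one hedged sketch of the
"new CRD" option from the discussion above: a policy built only from
identity-based rules that can express default deny itself, avoiding layering
on top of netpol. The group, kind, and every field name below are invented
for discussion only:

```yaml
apiVersion: policy.networking.k8s.io/v1alpha1  # placeholder group/version
kind: ClusterIdentityPolicy                    # hypothetical kind
metadata:
  name: payments-backend-ingress
spec:
  subject:
    serviceAccount:              # hypothetical: select workloads by identity only
      namespace: payments
      name: payments-backend
  ingress:
  - action: Allow
    from:
    - serviceAccount:
        namespace: payments
        name: payments-frontend
  - action: Deny                 # "default deny" expressed inside the same
    from:                        # identity-based policy, so no layering with
    - anyIdentity: {}            # netpol is needed (hypothetical catch-all)
```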

## Conformance Details

(This section describes the names to be used for the feature or
features in conformance tests and profiles.

These should be `CamelCase` names that specify the feature as
precisely as possible, and are particularly important for
Extended features, since they may be surfaced to users.)

## Alternatives

(List other design alternatives and why we did not go in that
direction)

## References

(Add any additional document links. Again, we should try to avoid
too much content not in version control to avoid broken links)

> **Review comment:** I think that the main goal is to provide network policies based on identities, and service accounts are a good primitive in Kubernetes to do that. I also think that we want this new model to be more scalable, to avoid implementations having to track all the existing Pod IPs. @thockin your thoughts?
>
> **Reply:** I don't understand the interest in describing this as being "based on identities", particularly if you say that arbitrary CIDRs can be "identities". How is calling this "based on identities" better than saying "you can select CIDRs or service accounts or pod labels"?
>
> **Reply:** I don't think arbitrary CIDRs should be considered as identities. There's a bit of nuance in that some implementations of identity-based authorization policies may ultimately resolve those identities to IPs for enforcement, but an end user would not be interleaving IP-based policy rules with identity-based policy rules even if the dataplane may collapse them together. https://www.solo.io/blog/could-network-cache-based-identity-be-mistaken and https://cilium.io/blog/2024/03/20/improving-mutual-auth-security/ have more specific details on the potential lossiness of this approach and possible mitigations.
>
> **Reply:** Yeah, I have added the scale and consistency user story to the next section under user stories; maybe I can change the goals to reflect "identity" or refactor the way I am writing "service account as a way to select pods". Depending on whether we feel this is a new API CRD or part of ANP (in which case we also need to decide the points Dan is raising: https://github.com/kubernetes-sigs/network-policy-api/pull/274/files#r1850516386), we might need to tweak the TLDR/Goals sections here.
>
> **Reply:** I have tried to change the TLDR and Goals to reflect the high-level abstract ones instead of the specifics I had in mind of putting them into ANP/BANP, based on the discussion we had in the sig-network-policy-api meeting on Tuesday (19th), as it's not really clear if this is going to be part of the netpol API or a new API.
>
> **Reply:** I second Mike here. The main benefits I can see are: we don't have to be opinionated on how the API would be implemented, and "service account as a way to select pods" can just be an implementation detail.
>
> **Reply:** My comments were based on Cilium's definitions, where (AIUI) everything you can apply a network policy toward is considered an "identity" (e.g., this blog post talks about how it turns podSelectors into "identities"). I had initially assumed people were talking about the same thing here. In particular, I had assumed that people were still talking about having a mix of "traditional" NetworkPolicy rules and new service-account-based NetworkPolicy rules, but calling the whole thing "identity-based", which seemed pointless. But it sounds like what most of the people here actually want is a separate parallel network policy stack, which only uses "identity-based" rules based on service-account/mTLS/SPIFFE/whatever.
>
> **Reply:** Cilium identities are clearly a non-goal, and now I realize that makes the terminology confusing. If we consider identity as "the set of characteristics that uniquely identify an application (running in a Pod)": IPs can uniquely identify a Pod, but if a deployment is rolled out the IPs will change (pods aren't restarted, they are recreated), so rules applied directly to IPs will not be identity-based. We can use labels to identify the pods of deployments/daemonsets, but this identity system seems mostly based on the implementation details of Kubernetes, and labels can be added arbitrarily. Service accounts, by the Kubernetes docs' definition, are "a type of non-human account that, in Kubernetes, provides a distinct identity in a Kubernetes cluster. Application Pods, system components, and entities inside and outside the cluster can use a specific ServiceAccount's credentials to identify as that ServiceAccount." I think that ServiceAccounts can work independently of how we translate the API to the implementation (map ServiceAccounts to IPs, use ServiceAccount tokens or certificates, ...) and also allow us to define cross-cluster or cross-service policies.
>
> **Reply:** Hmm, my understanding was the same as Dan's, which is why I called out the non-goal here, but it seems like that is one of the main goals, looks like :)
>
> **Reply:** We discussed this during yesterday's netpol api meeting and agreed that SPIFFE ID can be a non-goal for V0 maybe, but it may be good to track the user stories for all of this first.