Kubernetes external network orchestration #118
Replies: 10 comments 24 replies
-
I think there are several different problems that should be addressed in Kubernetes network orchestration.
Should we create a separate use case for these?
-
Tagging from the meeting.
-
This problem is only applicable to Multus and/or DANM, given that NSM provides on-demand networks.
-
Is requiring L2 networks an example, or is it an axiom? An axiom, here, would be the indisputable one way to do this (as in OpenStack, for instance, where any external attachment will first cross an OpenStack network which is, give or take, defined to be an L2 domain). The original proposal doesn't really make it clear. I think it's an example. I don't require L2 networks - to be specific, I don't require bridge domains - for everything I do.
I think it might also be worth approaching this from the sorts of packets I might send. For instance, if I'm sending a lot of IP, many techniques work. However, .1q, .1ac and QinQ would work in some cases and not others; MPLS will not go over a VLAN carrier, an L2 bridge domain or a routed network; and so on. The protocols I listed imply that I need something that gets me to the external network in more raw ways than previously, but that could equally be (and, as a fallback for some protocols, would have to be) an L1 point-to-point connection over raw copper or fibre - since all of these are based on Ethernet (which is the default for all network interfaces and would make an axiom) and modern networks always have wires from one device to the other (I don't think we have to consider CSMA/CD and multiple endpoints on a single wire here; another axiom). I can implement bridge domains on top of these raw components, but I don't have to implement them if that's not what I want.
We also don't have much sense of where I'm sending this traffic to, and how I identify this. That problem is only half a cloud problem, since we don't control the external network. I agree with the point that we might want to create new networks, as described; but adding a new network may also involve changing the external network. I can't just create a new network for VLAN 7 on a link and assume it will do what I want. So I think how co-ordination works is an important part of the use case as well.
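To make that co-ordination gap concrete, here is a minimal sketch of the cluster-side half only, assuming Multus is installed and that eth1 is the worker's uplink (both assumptions). It declares a VLAN 7 sub-interface for pods, but nothing in it touches the switch at the other end of the link:

```yaml
# Hypothetical example: defines VLAN 7 on the worker side of the link only.
# The top-of-rack switch still has to be configured to carry VLAN 7 on this
# port; that external half is exactly the co-ordination problem described above.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vlan7-net
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "vlan",
    "master": "eth1",
    "vlanId": 7,
    "ipam": { "type": "static" }
  }'
```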
-
Btw, I think design-wise we have two things here:
-
Let me ask a provocative question: why multiple interfaces per pod in the first place?
-
Good topic, and other people made good points; here are a few more. First, a quick link to a presentation on the topic.
I want to underscore your point about "pre-provisioning", because it's so crucial. One major problem with many deployments is that they can't just be placed in any vanilla cluster: you indeed need the cluster to have been installed and configured with certain abilities on the hosts, e.g. have a CNI plugin configured in a very particular way. Not only that, but it's a multi-host issue: you might very well have several pods with the same shared requirement, and they might be distributed on multiple hosts in the cluster (which might be a requirement for high availability). So, what we're seeing a lot these days is that a certain product requires you to 1) install stuff on the baremetal nodes, 2) then install K8s on them, and 3) deploy K8s workloads. This defeats many of the benefits of cloud technologies.
So, what can we do? One challenge is that installing a K8s cluster is out of scope for K8s, or at least widely diverse (see the Cluster API), so how would you be able to list these requirements in a way that is even remotely portable? But the more practical challenge is that, well, you can't just easily reinstall the entire cluster when you are deploying a new network service. (You could potentially have standby baremetal nodes that you can on-board into your cluster with the new requirements, but again this is extremely implementation-specific, if the feature exists at all.)
What we need is technologies that allow existing, running hosts to be reconfigured. This is a very difficult issue, as you generally don't want the containers to touch the host. So I think we would need some kind of core K8s component (just like the kubelet) that exposes certain specific and controlled capabilities regarding host networking directly to K8s workloads. This can't be just a CNI plugin, because again the point might be that the CNI plugin wasn't "pre-provisioned". Also note that even if the plugin was installed, it might not be configured according to the workload's requirements. An example of how difficult this could be: you might need to reconfigure something in the current host's BIOS, restart the host (after moving workloads to other hosts), and then rejoin the cluster with the new abilities. And, again, you might need to do the same with other hosts in the cluster, too, and this all needs to happen with minimal disruption. To put it another way, we need to turn the "pre-provisioning" paradox into a "re-provisioning" solution. :)
Final point: we need a way to orchestrate this, but I don't know if "API" is the right answer to that requirement. Cloud-native orchestration solutions are declarative and intent-oriented, not API-driven. This is not a small point: imagine if several different users are calling the same API and asking for networky things that cannot be orchestrated together due to a lack of resources (availability of SR-IOV slots, number of VLAN IDs pre-configured in the switches, etc.). The declarative approach could allow an operator, which can look at the complete picture, to "reconcile" these different requirements in sensible ways, e.g. to prioritize certain users over others.
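As a purely illustrative sketch of that declarative alternative (the kind, API group and field names below are invented for this example, not an existing API), a user could declare intent and leave arbitration to an operator that sees the whole cluster and fabric:

```yaml
# Hypothetical intent object; every name and field here is an assumption
# made for illustration, not part of any existing CRD.
apiVersion: example.org/v1alpha1
kind: ExternalNetworkIntent
metadata:
  name: signalling-plane
spec:
  encapsulation: vlan          # the user asks for a tagged attachment, not a specific VLAN ID
  hostCapabilities:
    - sriov                    # may require (re-)provisioning of the host
  bandwidthMbps: 2000
  priority: high               # lets the reconciling operator arbitrate scarce resources
```

An operator reconciling such objects could, for example, queue or reject low-priority intents when VLAN IDs or SR-IOV virtual functions run out, instead of failing individual imperative API calls.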
-
Thanks for all the great comments. I guess I agree with more or less all of them. Also, as we mentioned in the #tug-networking-orchestration channel:
ENO aims at providing automation APIs for networking solutions underneath the K8s cloud platform, an L2 service across a DC fabric to the GW being one that we have addressed so far. There may be others to be added. Don't necessarily read VLAN in this context as an L2 bridge domain; it really only means a VLAN on the access link between the K8s worker/server and the fabric. A DC fabric can and often will be EVPN-based, so that the L2 service is actually routed. Furthermore, the L3 DCGw function is often hosted on the fabric switches, and in small Edge deployments the "fabric" actually degenerates to a directly connected GW. The ENO API data model should be abstract and generic enough to cover all these scenarios; if not, let's improve it. It should be a matter of the south-bound fabric/GW plugin to map the abstract model to the suitable fabric configuration.
We would like to share the ENO design document that will cover the object model and how E2E network orchestration can be realised. Should I create a separate discussion thread for the ENO design document, or do we have a dedicated space under the cnf-wg repo for design documents?
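For discussion purposes only, here is a rough sketch of what such an abstract L2 service object might look like; the kind, API group and field names are assumptions made for this thread, not the actual ENO object model. The south-bound plugin would map it to EVPN, plain VLAN trunking, or a directly connected GW as appropriate:

```yaml
# Illustrative sketch only; not the ENO data model.
apiVersion: example.org/v1alpha1
kind: L2Service
metadata:
  name: oam-access
spec:
  vlanId: 100                  # VLAN on the access link between worker and fabric
  attachmentPoints:
    - node: worker-1
      interface: eth1
    - node: worker-2
      interface: eth1
  fabricPlugin: evpn           # south-bound plugin chooses the concrete fabric config
  gateway:
    ipv4: 192.0.2.1/24         # optional L3 DCGw function hosted on the fabric
```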
-
19APR2021 CNF-WG Call
Tasks:
Discuss the role of K8s itself vs. the CNF itself.
-
@jeffsaelens what was the outcome of this CNF-WG call?
-
The primary/standard K8s networking model relies on a single NAT'ed interface and a stateful load balancer when interworking with external networks. As a result, it does not allow proper network separation, and its implementation through Linux kernel IP stack mechanisms does not fulfill the performance requirements of many TelCo VNFs.
Secondary/special network attachments were introduced to overcome these TelCo-specific limitations:
Network interfaces to pods are provided by Container Network Interface (CNI) plugins. Multus, as a meta-CNI, is able to handle a pod requesting more than the mandatory primary interface and delegates the plumbing and configuration of those interfaces to the actual CNI plugins responsible for each pod interface.
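As a rough illustration (assuming Multus and the macvlan CNI plugin are installed; the interface name, IPAM choice, subnet and image are placeholder values), a pre-provisioned secondary network and its consumption typically look like this:

```yaml
# NetworkAttachmentDefinition created at (or after) cluster deployment;
# interface name, IPAM type and subnet are assumptions for illustration.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ext-net-1
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth1",
    "mode": "bridge",
    "ipam": { "type": "host-local", "subnet": "10.10.0.0/24" }
  }'
---
# A CNF pod requests the extra interface through the Multus annotation.
apiVersion: v1
kind: Pod
metadata:
  name: cnf-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ext-net-1
spec:
  containers:
    - name: app
      image: registry.example.com/cnf:latest   # placeholder image
```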
However, all of this is achieved through static external networks pre-provisioned during the initial cluster deployment, which cannot be updated on demand in an automated manner. As a result, K8s pods (CNFs) can only consume those preconfigured external networks. This limited support for external network orchestration makes the overall networking solution static and binds CNFs to a specific deployed K8s cluster.
It is expected that external networks can be added/configured on demand during the lifetime of a K8s cluster, so that whenever new K8s application pods (CNFs) have to be instantiated, a cloud admin can orchestrate the external networks and the cloud user can attach their application pods (CNFs) to the provisioned external networks.
There should be a network orchestration API to automate the necessary configuration inside the Kubernetes cluster and on the DC fabric.