Networking #20

Continuing the conversation from #19, specifically about networking.

cc @davidchisnall, @gizahNL
Because Linux containers are a bit more Lego block-like, a really common pattern is to have containers share namespaces (and in particular, share the network namespace). This allows those containers to have a common view of ports, interfaces, routes, and IP space. Orchestrators (well, Kubernetes specifically) may have baked-in assumptions that a set of containers treated as a single unit (a Kubernetes pod, an Amazon ECS task, etc.) has a single exposed IP address. I don't know whether nested jails are necessary for that, but that was the approach I saw @gizahNL use. If you're interested in doing networking, how about starting with something basic first? My initial approach was to look at how networking is modeled in the existing ecosystem.
The OCI runtime spec doesn't say much about networking for Linux containers. Typically, the bundle config describes either that a new network namespace should be created (a namespace entry of type "network" in linux.namespaces with no path) or that the container should join an existing network namespace (the same entry with a path to it).

Looking at other operating systems: Windows containers appear to have a somewhat different modeling of networking, with a dedicated windows.network section in the config that references pre-created network endpoints.

From my (limited) knowledge of FreeBSD jails, it looks like there is a bit more structure around how a jail's network is configured. Specifically, I see jail parameters that describe pieces of the network configuration directly (IP address assignment, VNET, and so on).

I'm not sure what the right path is for FreeBSD jails. The nested approach that @gizahNL suggested sounds to me like the closest to the existing Linux and Windows patterns used in the ecosystem and would likely play nicely with that style of separation of concerns, where a separate component could configure the parent jail's network without input from runj. On the other hand, since jails do have more structure, that could also be beneficial to expose via the bundle config (and add to the upstream specification). I'd love to have input here.
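For reference, the Linux case is just an entry in the config.json's linux.namespaces array; a minimal sketch of the two variants (the path below is a made-up example, not from a real bundle):

Creating a new, empty network namespace for the container:
{ "type": "network" }

Joining an existing namespace at a given path:
{ "type": "network", "path": "/var/run/netns/example" }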
In addition to fitting best with the existing approaches for other OSes, using a nested approach works around the lack of FreeBSD network tooling in Linux container images.
I'm not sure how much of a risk that would be on FreeBSD, but it's something I'd generally avoid doing on Linux as it could increase risks related to container breakout or data exfiltration. I wonder if it would be better to create a rootfs with just the set of tools that you need and then create a base jail from that rootfs.
Yes, that would also work. I'd argue the risk is minimal since no code would be running in the base jail, except the networking configuration commands, which are fired off by a tool that I assume already has full root access.
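As a sketch of that rootfs idea (paths are placeholders, and the exact set of directories needed would have to be verified):

# Build a minimal root containing just the userland needed for network
# configuration, extracted from base.txz off the install media.
mkdir -p /jails/netbase
tar -xf /path/to/base.txz -C /jails/netbase ./libexec ./lib ./usr/lib ./sbin
# /jails/netbase can then serve as the filesystem for a base jail whose only
# job is to run ifconfig/route/pfctl for network setup.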
Thanks for the excellent write-up. I am really nervous about anything involving nested jails, because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access, and you have to make sure that the outer jail will never do any of the things that would allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Can you clarify a bit what you mean by the expectation that jails share an IP address? Does this assume that the code inside the jail sees the public IP address (i.e. no port forwarding / NAT)? Or just that any jail can establish connections with another on the same machine unless explicitly prevented by a firewall?

Between NetGraph and VNET, there's a huge amount of flexibility in what can be expressed. I believe the most common idiom is for each jail to have a private IP address that is locally bridged, so any jail (and the host) can connect to any other jail's IP address, but public services must have explicit port forwarding. Outbound connections are NAT'd for IPv4; for IPv6 it's a bit simpler, as the jail's IP address can be made public and inbound ports can be either explicitly opened or blocked. If that abstraction works for K8s, then that's great, but otherwise I'd like to understand a bit more about what it wants.

As a high-level point: the FreeBSD Foundation has now committed to investing in container support for FreeBSD, with the remainder of this year being spent on building a concrete plan. We shouldn't try to work around missing features in FreeBSD; we should document what is missing. If FreeBSD's Linux ABI layer needs the netlink interfaces that Linux networking tools rely on, that is exactly the kind of gap we should be documenting.
Oof, if that is still true and nested jails are still very risky compared to normal jails, that would make that strategy a no-go.

The FreeBSD Foundation wanting to work on container support is great news. Top of my wishlist would indeed be to decouple vnet from jails (I looked at the kernel code; it's relatively doable, but also a bit much for a "my first kernel code" project, so I didn't go there ;) ). Second would be the ability to configure a vnet instance from the host OS without depending on anything inside the vnet jail: ifconfig, route, pfctl, and co. taking a jail parameter would likely be enough to start, though a nicer thing would be a (relatively) simple API to do most networking tasks (I couldn't get myself to grok the ioctl-style configuration for interfaces and pf yet; it all seemed quite dense, and from what I read, at least with respect to ifconfig, those ioctls are not really meant to be programmed against directly).

For me personally, getting the Linux ip command to work is of lesser importance (I think it no longer uses ioctls but a new socket type invented for it). I don't think there are many containers that depend on doing their own network configuration.
Thanks for the heads up. I'd love to read more about this if you have any resources handy.
Yes, absolutely. I'm going to respond to your statements out-of-order since I think that'll make the answers more clear.
This is the default networking mode in Docker (also called the "bridge" mode). On Linux, Docker creates a bridge (named docker0 by default) and gives each container a private IP address on that bridge's subnet; outbound traffic is NAT'd and explicitly published ports are forwarded in from the host. Orchestrators like Amazon ECS use this mode by default and have placement logic to handle conflicts on exposed ports. Kubernetes, on the other hand, has explicitly chosen to avoid this and try to present a simpler model to applications running within the cluster (at the cost of additional complexity for the person operating the cluster).
Not precisely. Let's talk about Kubernetes specifically for a moment. The Kubernetes project has documentation on the networking model, but I'll attempt to summarize as well. In Kubernetes, there are two assumptions that are core to the networking model: (1) processes in the same pod (regardless of which container they're in) have a view of the network as if they were just processes running on the same machine, i.e., they can communicate with each other over localhost; and (2) every pod gets its own IP address, and pods can communicate with each other using those addresses without NAT.

For (1), this is accomplished by sharing a network namespace. Each container in the pod sees the exact same set of network interfaces; there is no isolation between them.

For (2), a CNI plugin (or set of chained CNI plugins) is responsible for adding an interface to the pod's network namespace that the pod (i.e., all the containers in the pod) can use for its outbound connections (and exposed services). A CNI plugin (the same or another) is responsible for IPAM within the cluster; pods do not typically have public (Internet-routable) IPv4 addresses and instead typically have a private-range address. There are a variety of mechanisms to do this; various types of overlay networks or vlan setups are common, or cloud providers like AWS may integrate with that cloud's network primitive (VPC) and attach an interface to the host (i.e., an ENI).

Stepping away from Kubernetes, orchestrators like Amazon ECS can do this too, though it's less core to their networking models. Either way, the underlying primitive being used is the ability to share a network namespace among a set of containers rather than giving each container its own, isolated view of the network.
Networking-wise: if FreeBSD does not already have a mechanism for a set of jails to share interfaces/view of the network (like shared network namespaces in Linux), I think that would be a very useful thing to add. I don't know enough about FreeBSD networking yet to know if that is the case or to know if @gizahNL's suggestion for sharing vnet instances is the right approach (though from my limited reading that does sound correct).
This also sounds useful, but could be worked around. On Linux, namespaces are garbage-collected by the kernel unless there is either an active process or mount holding the namespace open. In order to have a network namespace with a lifetime decoupled from the containers that make up a pod (in Kubernetes) or a task (in Amazon ECS), a common technique is to create a "pause container" that exists just to hold the namespace open and give an opportunity for that namespace to be fully configured (e.g., for the CNI plugins to run) ahead of the workload starting. A similar technique could be used here (if vnet sharing is the right approach) where a jail is created with the necessary tools for the express purpose of configuring the vnet. I'm not sure what else would be useful to add to FreeBSD yet; I'm sure we'll all learn more as we continue to talk and experiment.
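A rough sketch of that "pause"-style idea with jails, assuming the shared-vnet approach pans out (names, paths, and addresses are illustrative, and the host-side bridging/NAT of epair0a is not shown):

# Long-lived VNET jail that exists only to own the pod's network; its root
# needs enough of a userland for the commands below to exist inside it.
jail -c name=pod-net path=/jails/netbase vnet=new persist
# Hand the jail one half of an epair, then configure that interface via jexec.
ifconfig epair create            # creates epair0a (host side) and epair0b
ifconfig epair0b vnet pod-net
jexec pod-net ifconfig epair0b inet 172.17.0.2/16 up
jexec pod-net route add default 172.17.0.1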
Thinking about this a bit more, it feels like Docker is a much better fit for the non-VNET model. The jails I used to manage had a very simple networking setup: I created a new loopback adaptor (lo1), gave each jail an address on it, and used the firewall for NAT and port forwarding.

VNET is newer but it isn't necessarily better. It allows more things (for example, raw sockets, which would allow jailed processes to forge the IP header if they weren't hidden behind a firewall that blocked faked source IPs) and it comes with different scalability issues. With VNET, each jail gets a separate instance of the network stack. This consumes more kernel memory but avoids lock contention. Generally, it's a good choice if you have a lot of RAM, a lot of cores, and a lot of jails, but for deployments with a handful of jails it will add overhead that you don't need. For a client device running only a few jails, the simpler non-VNET model is probably sufficient.

For K8s, it's probably worth exposing some of the NetGraph bits to allow more arbitrary network topologies for a particular deployment.
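A sketch of that classic non-VNET pattern (the interface, addresses, jail parameters, and pf rules here are illustrative assumptions, not the exact setup described above):

# Cloned loopback carrying the jails' private addresses.
ifconfig lo1 create
ifconfig lo1 inet 192.168.100.10/32 alias
# Non-VNET jail bound to that address; 127.0.0.1 inside the jail is
# transparently remapped to it.
jail -c name=web path=/jails/web host.hostname=web \
    ip4.addr="lo1|192.168.100.10/32" persist
# pf.conf then provides NAT for outbound traffic and explicit port forwarding:
#   nat on em0 from 192.168.100.0/24 to any -> (em0)
#   rdr on em0 proto tcp from any to any port 8080 -> 192.168.100.10 port 80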
That won't work afaik, because Docker containers assume localhost to be 127.0.0.1, and assume it to be non-shared.
Related Moby issue: moby/moby#33088
I don't believe that this is true. If you try to bind to 127.0.0.1 in a non-VNET jail, you will instead bind to the first IP provided to the jail. If you create a dedicated loopback address for each jail, each one effectively gets its own private localhost.
The Kubernetes model groups containers into 'pods', which share a network namespace, and there is an explicit expectation that containers in the pod can communicate via localhost (https://kubernetes.io/docs/concepts/workloads/pods/#pod-networking). In this model, nothing runs at the pod level, so there should be no issues with a two-level jail structure where the pod's jail owns the vnet and there is a child jail for each container.
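To make that two-level structure concrete, a sketch (assuming a VNET "pod" jail already exists, was created with children.max greater than zero, and has a populated root filesystem; names and paths are placeholders):

# Container jails are created inside the pod jail; with no vnet of their own
# and ip4/ip6 set to inherit, they share the pod's network stack and can
# reach each other over localhost.
jexec pod-net jail -c name=ctr1 path=/containers/ctr1 ip4=inherit ip6=inherit persist
jexec pod-net jail -c name=ctr2 path=/containers/ctr2 ip4=inherit ip6=inherit persist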
I'm highly interested in getting basic CNI support in runj. According to the CNI spec, the runtime needs to execute the CNI plugin. I know there are efforts to port the CNI-supported plugins to FreeBSD, but I'm working on a pretty minimal and only-partially-compliant placeholder CNI plugin for use with the new containerd support for FreeBSD. As CNI support in the runtime would be critical to Kubernetes node-level support, has any work been done on adding that to runj?
I have some mostly working CNI plugins for FreeBSD here: https://github.com/dfr/plugins/tree/freebsd. These assume the 'netns' parameter for the plugin is the name of a VNET jail. The container jail is nested in the VNET jail, which lets all the containers in a pod communicate via localhost.
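For anyone unfamiliar with the mechanics: CNI plugins are plain executables that take their parameters from environment variables and a JSON network config on stdin, so a manual invocation of a bridge-style plugin with a VNET jail name in place of a netns path might look roughly like this (paths, names, and the config values are assumptions):

# CNI_NETNS carries the VNET jail's name instead of a Linux netns path.
env CNI_COMMAND=ADD \
    CNI_CONTAINERID=example-pod \
    CNI_NETNS=example-pod \
    CNI_IFNAME=eth0 \
    CNI_PATH=/usr/local/libexec/cni \
    /usr/local/libexec/cni/bridge <<'EOF'
{
  "cniVersion": "0.4.0",
  "name": "podnet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipam": { "type": "host-local", "subnet": "10.88.0.0/16" }
}
EOF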
Also, as far as I can tell from working with the github.com/containers stack, common practice is for CNI plugins to be executed by the container engine (e.g. podman, buildah, cri-o, containerd), initialising a network namespace (or a jail for FreeBSD) which is passed to the runtime via the runtime spec.
I'm more interested in the Linux container side. I have no idea what's actually involved there as far as shoehorning that support into runj.
@dfr is correct; CNI support should be in the caller of runj rather than runj itself. containerd supports CNI plugins in its CRI implementation today. runj needs to support the networking primitives that the CNI plugins would then configure (the equivalent of a network namespace on Linux). I'm also interested in supporting networking outside CNI, in the context of what can be modeled directly in the runtime spec.
👍
I think you can just change the parameter name from 'netns' to something jail-specific (and perhaps rename the plugin as well).
I like the idea of changing the parameter name - I'll look into that. I'm mostly against changing the plugin name - I like it being called 'bridge' for consistency with Linux - this means that things like 'podman network create' just work on FreeBSD.
In #32 I've added a mechanism for runj to model FreeBSD extensions to the runtime spec and added a couple of networking-related settings using that mechanism. The end result is that runj can now configure jails to have access to the host's IPv4 network stack (similar to host networking for Linux containers). I'd be happy to take more contributions using this mechanism that model additional network settings (including those that might be needed by CNI plugins, like interfaces and VNET settings) as well as modeling parent-child jail relationships.
I had been thinking of just exposing an interface which allows the container engine to explicitly set jail parameters for the container. Not sure which is best, but this approach puts the policy choices for the container in the engine, which makes sense to me. A possible approach might look like dfr/runtime-spec@2caca12.
@dfr Thanks, that's an interesting approach. I think it's reasonable as a prototyping mechanism that we could add to runj, but probably not something I think would be appropriate to upstream into the spec itself. I would expect the spec to have a slightly higher level of abstraction such that the backend could be swapped out for something that isn't a jail (for example, possibly a bhyve VM) but still supports largely the same set of FreeBSD-specific features. As an example, the Linux portion of the spec models cgroups (which are used for resource limits) but it doesn't specify the exact materialization into cgroupfs.
I've started playing around with vnet and trying to set up a bridged network similar to what Docker does on Linux, but I'm having trouble figuring out what I'm missing (probably both that I'm misunderstanding exactly what Docker is doing and that I'm failing to translate that to FreeBSD). On Linux, Docker creates a bridge and then a veth pair for each container, adding one end to the bridge and moving the other end into the container. Inside the container, the veth is set up with an IP address and that IP is then used as the next hop for the default route. There is also a set of iptables rules created on the host, though I'm not sure if those are used for normal traffic forwarding or are primarily used for exposing ports. The bridge is a separate, non-overlapping CIDR from the host's network (172.17.0.0/16 by default) and something (?) is performing NAT. (The iptables configuration I captured is omitted here.)
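For comparison, here's a sketch of roughly what that looks like with Linux tooling (a typical default setup rather than a dump from a real Docker host; CTR_NETNS stands in for the container's named network namespace):

# Bridge with the default Docker subnet.
ip link add docker0 type bridge
ip addr add 172.17.0.1/16 dev docker0
ip link set docker0 up
# One veth pair per container: one end on the bridge, the other moved into
# the container's network namespace and addressed there.
ip link add veth0 type veth peer name ceth0
ip link set veth0 master docker0
ip link set veth0 up
ip link set ceth0 netns "$CTR_NETNS"
ip -n "$CTR_NETNS" addr add 172.17.0.2/16 dev ceth0
ip -n "$CTR_NETNS" link set ceth0 up
ip -n "$CTR_NETNS" route add default via 172.17.0.1
# Forwarding plus masquerade (NAT) for traffic leaving the bridge subnet.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE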
I've been able to follow this guide to bridge an epair inside a jail with the primary interface in my VM and allow the jail to initiate DHCP from the network attached to the VM (in this case, VirtualBox's built-in DHCP server). That's not quite the same thing, though. I can also omit DHCP and do static IP addressing for the bridge and for the epair interfaces (either side?), though no matter what I do I don't have a working bidirectional network. I suspect packets are being sent but nothing is being received back, and as I'm typing this out to explain what I'm seeing, I'm thinking that I'm likely missing something about configuring NAT.

Here's what I've been doing. On the host/VM: (commands omitted). In my jail configuration: (contents omitted). Inside the jail: (commands omitted). I've also tried: (details omitted).
I'm going to continue looking, but figured I'd post here in case anyone has suggestions/pointers for me to look at. Meanwhile, I'll keep experimenting and will post what I find.
You can have a look at how CBSD manages VNET jails.
I use a very similar approach to handle networking for podman and buildah. Take a look at https://github.com/dfr/plugins/tree/freebsd - the code which manages the epairs is in the bridge plugin. These plugins are in the ports tree and can be installed as a package.
@dfr thanks for that! I've tried reading through the code in the bridge plugin and I'm ending up with steps that are roughly the same as what I was doing (and I'm running into similar problems). I did find where the PF table is manipulated, but I'm guessing that I'm missing the table creation since all I see are add and delete commands.

Here's what I've been trying:

1. Create the bridge: ifconfig bridge create name bridge0
2. Create the epair: ifconfig epair create
3. Set a description (I didn't know this was a thing!): ifconfig epair0a description "host-side interface"
4. Set a MAC address on the jail-side interface: ifconfig epair0b link 00:00:00:00:00:01
5. Add the host-side interface to the bridge: ifconfig bridge0 addm epair0a
6. Bring the host-side interface up: ifconfig epair0a up
7. Add an IP and subnet mask to the bridge: ifconfig bridge0 alias 172.17.0.1/16
8. Enable IP forwarding: sysctl net.inet.ip.forwarding=1
9. Add the jail IP address to a PF table: pfctl -t jail-nat -T add 172.17.0.2/32
10. Start a jail and pass the epair0b interface into the vnet (I did this in a jail.conf file)
11. (inside the jail) Assign the IP to the interface: ifconfig epair0b inet 172.17.0.2/32
12. (inside the jail) Bring the interface up: ifconfig epair0b up
13. (inside the jail) Add a route to the bridge's subnet, using the epair0b IP as the gateway: route -4 add 172.17.0.1/16 172.17.0.2
14. (inside the jail) Add a default route using the bridge gateway: route -4 add default 172.17.0.1

However, if I try to ping an IP address (8.8.8.8, for example) I get this output: ping: sendto: Invalid argument

From outside the jail, the route table looks like this: (output omitted)

From inside the jail, it looks like this: (output omitted)

I see the following line in dmesg: (output omitted)
The table is automatically created when something is added. It looks like you are doing everything right - I believe the error is coming from the jail itself. Try adding allow.raw_sockets to the jail config.
This is the current jail.conf: (contents omitted)

(You can see I'm very creative with names like "foo" and "jail3".) The devfs ruleset is: (contents omitted)

(This was needed for dhclient to work when I was testing DHCP.) I've tried this both with an empty pf.conf and with NAT rules in place.
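For readers following along, a hypothetical jail.conf for this kind of vnet jail might look roughly like the following (the names, paths, and ruleset number are placeholders, not the actual file from this comment):

jail3 {
    path = "/jails/jail3";
    host.hostname = "jail3";
    vnet;
    vnet.interface = "epair0b";   # jail-side half of the epair
    mount.devfs;
    devfs_ruleset = 11;           # custom ruleset, e.g. to expose bpf for dhclient
    exec.start = "/bin/sh /etc/rc";
    exec.stop = "/bin/sh /etc/rc.shutdown";
    persist;
}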
I see that you responded from email; I updated that comment on GitHub with a bit more information there too. |
This is my pf.conf. The weird v4fib0egress stuff is coming from sysutils/egress-monitor - you can replace it with the outgoing interface name: (contents omitted)
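As a rough idea of the shape such a pf.conf takes (this is a hypothetical sketch, not dfr's actual file; em0 and the 172.17.0.0/16 subnet are assumptions):

# egress interface (replace em0 with the real outgoing interface)
v4egress_if = "em0"

# addresses of jails whose traffic should be NAT'd; populated at runtime
# with: pfctl -t jail-nat -T add 172.17.0.2
table <jail-nat> persist

# anchor used for port publishing; redirect rules get inserted here
rdr-anchor "jail-rdr/*"

# translate outbound jail traffic to the egress interface's address
nat on $v4egress_if inet from <jail-nat> to any -> ($v4egress_if)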
Also, the rdr-anchor bit is only needed for 'port publishing', which adds redirect rules to route traffic into the container. And it looks like I was wrong about the table being auto-created.
I rebooted to start from a fresh state and tried your pf.conf, but I'm still seeing the same ping: sendto: Invalid argument error: (output omitted)
That's odd. I think it would be helpful to pin down exactly where the outgoing packet is being rejected; the tcpdump utility is often helpful here.

Also, the output of 'jls -n' on the host might be helpful. On my dev machine, I have this in /etc/sysctl.conf: (contents omitted)

I think the allow_raw_sockets bit might be a clue?
With tcpdump -vv -e -i epair0b running in one terminal and ping 8.8.8.8 in another, I don't see any packets on epair0b. This is both before and after changing the sysctls to match yours (with the jail recreated in between).
Ok, we need to figure out what is causing sendto to return EINVAL - the packet is being rejected on the way into the kernel on the jail side and doesn't reach epair0b. I'll have time to try and reproduce this later today - I'll update this issue if I find anything.
Perhaps a stupid question @samuelkarp, but did you add a default route? IIRC, Invalid argument is returned by ping when no route exists for the destination IP.
If there isn't a default route you will get ENETUNREACH, not EINVAL: (demonstration omitted)

I haven't been able to reproduce the same error that Sam had yet. I built a fresh VM running 13.1-RELEASE and the following setup: rc.conf and pf.conf (contents omitted).

In the VM, I run this script to set up a jail and run a ping test (assume that /root/testjail is pre-populated with base.txz from the install media): runtest.sh (contents omitted), which gives the following: (output omitted)
@dfr Thanks! I tried the script and got the following: (output omitted)

I then made some updates and tried again: (details omitted)
I've figured out what I was doing wrong. In @dfr's script, the subnet mask on the interface in the jail is 255.255.0.0 (a /16); I had been setting the mask as 255.255.255.255 (a /32). In @dfr's script, there's only a single default route to the gateway on the interface (the address ending in .1); in my attempt, I had two routes: one routing the /16 via the /32 address as if it were a gateway, and a second default route that matched @dfr's. This seems to have been an invalid configuration and caused the ping: sendto: Invalid argument error. Edit: also, the NAT rules in PF are required, as something needs to perform translation.
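Concretely, the jail-side configuration implied by that fix is just the following (using the addresses and epair name from the steps earlier in this thread; the host-side bridge, forwarding sysctl, and pf NAT rules still need to be in place):

# Inside the vnet jail: address with the /16 mask, not /32.
ifconfig epair0b inet 172.17.0.2/16 up
# A single default route via the bridge address; no extra route for the /16
# is needed, since the interface's own network route covers it.
route add default 172.17.0.1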
Thanks for de-mystifying - it always bothers me to not understand why something doesn't work. The NAT is required to allow the jail's traffic to reach the public internet, but without it you can still ping the bridge.