Networking #20

Closed

samuelkarp opened this issue Oct 29, 2021 · 40 comments · Fixed by #35

Comments

@samuelkarp
Owner

Continuing the conversation from #19, specifically about networking.

cc @davidchisnall, @gizahNL

@samuelkarp
Owner Author

@davidchisnall wrote:

Thanks. I'm interested (time permitting) in working on some of the network integration (vnet + pf). Pot already seems to manage this reasonably well, so should provide a good reference. I don't have a very good understanding of how the various bits (containerd / runj / CNI) fit together (all of the docs seem to assume that you know everything already and throw terminology at you).

You shouldn't need nested jails for jail-to-jail networking, you 'just' need to set up the routing.

@gizahNL wrote:

You could take a look at my moby port. It has (barebones) working networking and barebones pf support.

The strategy I used is to create a base jail that allows a child jail to be spawned to do the vnet networking, plus a child jail that is the actual container. The rationale is that Linux containers lack the tools to configure the FreeBSD network stack, and that Kubernetes pods assume a shared network namespace.

I still have a PR open here that needs more work, but unfortunately I've been swamped with other commitments.

@samuelkarp
Owner Author

You shouldn't need nested jails for jail-to-jail networking, you 'just' need to set up the routing.

Because Linux containers are a bit more Lego block-like, a really common pattern is to have containers share namespaces (and in particular share the network namespace). This allows for those containers to have a common view of ports, interfaces, routes, and IP space. Orchestrators (well, Kubernetes specifically) may have baked-in assumptions that a set of containers treated as a single unit (Kubernetes pod, Amazon ECS task, etc) have a single exposed IP address.

I don't know if nested jails are necessary for that, but that was the approach I saw @gizahNL use.

If you're interested in doing networking, how about starting with something basic first? My initial approach was to look at vnet=inherit as a default for now just so that a jail can have some network connectivity and then leave the more complicated bits for later since anything here would involve either (a) a new component or (b) changes to the OCI runtime spec (or both!).

I don't have a very good understanding of how the various bits (containerd / runj / CNI) fit together (all of the docs seem to assume that you know everything already and throw terminology at you).

The OCI runtime spec doesn't say much about networking for Linux containers. Typically, the bundle config describes either that a new network namespace should be created (a LinuxNamespace struct with the Type set to network and the Path empty) or that an existing namespace should be joined (the Path pointing to that namespace). Then something at a different layer (above the OCI runtime) is responsible for configuring that namespace with the appropriate network interfaces, routes, etc. This can be done by a CNI plugin (as is the case in Kubernetes, some situations in Amazon ECS), directly by a higher-level invoking runtime (like Docker does), or whatever other component you want; CNI is an optional and somewhat standard way to do it, but the whole setup is outside the scope of an OCI runtime on Linux anyway.
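As a concrete illustration (a hand-written excerpt, not generated by any particular tool, and the path value is made up), the relevant fragment of a bundle's config.json can be sketched like this:

# Sketch: the OCI bundle fragment that requests network-namespace handling.
# Omitting "path" asks the runtime to create a new network namespace;
# setting "path" asks it to join an existing one. The path here is made up.
cat > namespaces-fragment.json <<'EOF'
{
  "linux": {
    "namespaces": [
      { "type": "network", "path": "/var/run/netns/pod-example" }
    ]
  }
}
EOF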

Looking at other operating systems: Windows containers appear to have a somewhat different modeling of networking with a WindowsNetwork struct in the bundle config that has some additional options around DNS and endpoints (which I assume are vNICs?). But there's also a NetworkNamespace ID specified there and my understanding is that multiple Windows containers can share the same network namespace.

From my (limited) knowledge of FreeBSD jails, it looks like there is a bit more structure around how a jail's network is configured. Specifically, I see vnet, ip4, and ip6 related options in the jail(8) manual page.

I'm not sure what the right path is for FreeBSD jails. The nested approach that @gizahNL suggested sounds to me like the closest to the existing Linux and Windows patterns that are used in the ecosystem and would likely play nicely with that style of separation of concerns where a separate component could configure the parent jail's network without input from runj. On the other hand, since jails do have more structure that could also be beneficial to expose via the bundle config (and add to the upstream specification). I'd love to have input here.

@gizahNL
Contributor

gizahNL commented Oct 29, 2021

In addition to fitting best with the existing approaches on other operating systems, a nested approach works around the lack of FreeBSD network tooling in Linux container images.
Of course that could be solved by mounting statically linked binaries for that purpose into the Linux container, but that imho has more moving parts and feels more likely to break.
The simplest solution to me was to create a base jail with its root set to /, so that all tools from the host are available and are of the correct versions.

@samuelkarp
Owner Author

create a base jail with its root set to /

I'm not sure how much of a risk that would be on FreeBSD but it's something I'd generally avoid doing on Linux as it could increase risks related to container breakout or data exfiltration.

I wonder if it would be better to create a rootfs with just the set of tools that you need and then create a base jail from that rootfs.
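As a sketch of what I mean (the tool list and paths are illustrative, and this isn't something runj does today), such a rootfs could be assembled from the host's own binaries:

#!/bin/sh
# Sketch: assemble a minimal rootfs containing only the FreeBSD networking
# tools (plus their shared libraries) needed to configure a vnet from a
# "base" jail. Paths and the tool list are illustrative.
ROOT=/var/tmp/netbase
mkdir -p "$ROOT/sbin" "$ROOT/lib" "$ROOT/libexec" "$ROOT/dev" "$ROOT/etc"
for tool in /sbin/ifconfig /sbin/route /sbin/pfctl; do
    cp "$tool" "$ROOT/sbin/"
    # copy each shared library the tool links against
    for lib in $(ldd "$tool" | awk '/=>/ { print $3 }'); do
        cp "$lib" "$ROOT/lib/"
    done
done
cp /libexec/ld-elf.so.1 "$ROOT/libexec/"   # the runtime linker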

@gizahNL
Contributor

gizahNL commented Oct 29, 2021

create a base jail with its root set to /

I'm not sure how much of a risk that would be on FreeBSD but it's something I'd generally avoid doing on Linux as it could increase risks related to container breakout or data exfiltration.

I wonder if it would be better to create a rootfs with just the set of tools that you need and then create a base jail from that rootfs.

Yes, that would also work. I'd argue the risk is minimal, since no code would be running in the base jail except the networking configuration commands, which are fired off by a tool that I assume already has full root access.
For the sake of minimising risk it makes sense, of course.

@davidchisnall

Thanks for the excellent write-up. I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Can you clarify a bit what you mean by the expectation that jails share an IP address? Does this assume that the code inside the jail sees the public IP address (i.e. no port forwarding / NAT)? Or just that any jail can establish connections with another on the same machine unless explicitly prevented by a firewall? Between NetGraph and VNET, there's a huge amount of flexibility in what can be expressed. I believe the most common idiom is for each jail to have a private IP address that is locally bridged so any jail (and the host) can connect to any other jail's IP address, but for public services they must have explicit port forwarding. Outbound connections are NAT'd for IPv4; for IPv6 it's a bit simpler, as the jail's IP address can be made public and inbound ports can be either explicitly opened or blocked.

If that abstraction works for K8s, then that's great but otherwise I'd like to understand a bit more about what it wants.

As a high-level point: The FreeBSD Foundation has now committed to investing in container support for FreeBSD, with the remainder of this year being spent on building a concrete plan. We shouldn't try to work around missing features in FreeBSD, we should document what is missing. If FreeBSD's Linux ABI layer needs the ioctls for Linux's ip command to work and they don't currently, then we should raise that. If we need to be able to assign the same VNET instance to multiple jails, that's also something that we can ask for.

@gizahNL
Contributor

gizahNL commented Oct 29, 2021

I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Oof, if that is still true and nested jails are still very risky compared to normal jails, that would make that strategy a no-go.

The FreeBSD Foundation wanting to work on container support is great news.

Top of my wishlist would indeed be to decouple vnet from jails. (I looked at the kernel code; it seems relatively doable, but also a bit much for a "my first kernel code" project, so I didn't go there ;) )

Second would be the ability to configure a vnet instance from the host OS without depending on anything inside the vnet jail. Having ifconfig, route, pfctl and co. take a jail parameter would likely be enough to start, though a nicer option would be a (relatively) simple API for most networking tasks. (I couldn't get myself to grok the ioctl-style configuration for interfaces & pf yet; it all seemed quite dense, and from what I've read, at least for ifconfig, those ioctls are not really meant to be programmed against directly.)

For me personally, getting the Linux ip command to work is of lesser importance (I think it no longer uses ioctls but rather a socket type invented for it, netlink). I don't think there are many containers that depend on doing their own network configuration.

@samuelkarp
Owner Author

I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Thanks for the heads up. I'd love to read more about this if you have any resources handy.

Can you clarify a bit what you mean by the expectation that jails share an IP address?

Yes, absolutely. I'm going to respond to your statements out-of-order since I think that'll make the answers more clear.

I believe the most common idiom is for each jail to have a private IP address that is locally bridged so any jail (and the host) can connect to any other jail's IP address but for public services they must have explicit port forwarding.

This is the default networking mode in Docker (also called the "bridge" mode). On Linux, Docker creates a (by default) docker0 bridge and uses a veth pair to connect the container to the bridge. The bridge has a defined subnet (172.17.0.0/16) and Docker handles IPAM. Outbound connections are NAT'd and port forwarding can be configured to expose services.
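Roughly, the per-container plumbing boils down to something like this on Linux (a simplified sketch with made-up names; Docker actually does this through the netlink API rather than by shelling out):

# Simplified sketch of Docker-style bridge networking on Linux.
# Interface/namespace names and addresses are placeholders.
ip link add docker0 type bridge
ip addr add 172.17.0.1/16 dev docker0
ip link set docker0 up
ip netns add ctr1
ip link add veth-host type veth peer name veth-ctr
ip link set veth-host master docker0
ip link set veth-host up
ip link set veth-ctr netns ctr1
ip netns exec ctr1 ip addr add 172.17.0.2/16 dev veth-ctr
ip netns exec ctr1 ip link set veth-ctr up
ip netns exec ctr1 ip route add default via 172.17.0.1
# outbound NAT for the bridge subnet
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE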

Orchestrators like Amazon ECS use this mode by default and have placement logic to handle conflicts on exposed ports. Kubernetes, on the other hand, has explicitly chosen to avoid this and try to present a simpler model to applications running within the cluster (at the cost of additional complexity for the person operating the cluster).

Does this assume that the code inside the jail sees the public IP address (i.e. no port forwarding / NAT)? Or just that any jail can establish connections with another on the same machine unless explicitly prevented by a firewall? Between NetGraph and VNET, there's a huge amount of flexibility in what can be expressed.
[...]
If that abstraction works for K8s, then that's great but otherwise I'd like to understand a bit more about what it wants.

Not precisely. Let's talk about Kubernetes specifically for a moment. The Kubernetes project has documentation on the networking model but I'll attempt to summarize as well. In Kubernetes, there are two assumptions that are core to the networking model: (1) processes in the same pod (regardless of which container they're in) have a view of the network as if they were just processes running on the same machine; i.e., they can communicate with each other over localhost and will conflict with each other if they attempt to expose services on the same port, and (2) all pods within the cluster can communicate with each other by using the pod's IP address (i.e., the IP address is routable within the cluster) without NAT.

For (1) this is accomplished by sharing a network namespace. Each container in the pod sees the exact same set of network interfaces; there is no isolation between them. localhost in one container is the same localhost in another container in the same pod, eth0 in one container is the same eth0 in another container in the same pod, etc.

For (2), a CNI plugin (or set of chained CNI plugins) is responsible for adding an interface to the pod's network namespace that the pod (i.e., all the containers in the pod) can use for its outbound connections (and exposed services). A CNI plugin (the same or another) is responsible for IPAM within the cluster; pods do not typically have public (Internet-routable) IPv4 addresses and instead typically have a private-range address. There are a variety of mechanisms to do this; various types of overlay networks or vlan setups are common, or cloud providers like AWS may integrate with that cloud's network primitive (VPC) and attach an interface to the host (i.e., an ENI).

Stepping away from Kubernetes, orchestrators like Amazon ECS can do this too, though it's less core to their networking models. Either way, the underlying primitive being used is the ability to share a network namespace among a set of containers rather than giving each container its own, isolated, view of the network.

As a high-level point: The FreeBSD Foundation has now committed to investing in container support for FreeBSD, with the remainder of this year being spent on building a concrete plan. We shouldn't try to work around missing features in FreeBSD, we should document what is missing.

Networking-wise: if FreeBSD does not already have a mechanism for a set of jails to share interfaces/view of the network (like shared network namespaces in Linux), I think that would be a very useful thing to add. I don't know enough about FreeBSD networking yet to know if that is the case or to know if @gizahNL's suggestion for sharing vnet instances is the right approach (though from my limited reading that does sound correct).

Second would be the ability to configure a vnet instance from the host OS without depending on anything inside the vnet jail

This also sounds useful, but could be worked around. On Linux, namespaces are garbage-collected by the kernel unless there is either an active process or mount holding the namespace open. In order to have a network namespace with a lifetime decoupled from the containers that make up a pod (in Kubernetes) or a task (in Amazon ECS), a common technique is to create a "pause container" that exists just to hold the namespace open and give an opportunity for that namespace to be fully configured (e.g., for the CNI plugins to run) ahead of the workload starting. A similar technique could be used here (if vnet sharing is the right approach) where a jail is created with the necessary tools for the express purpose of configuring the vnet.
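To sketch what that might look like with jails (purely illustrative; the names, addresses, and the use of jail(8) directly rather than runj are assumptions on my part):

# Sketch: a long-lived "pause"-style vnet jail whose only job is to own the
# network configuration; it is configured entirely from the host before any
# workload starts. Names and addresses are placeholders; host-side bridge
# and NAT setup are omitted for brevity.
netjail=$(jail -ci name=pod1-net vnet=new persist path=/)
epairA=$(ifconfig epair create)
epairB=${epairA%a}b
ifconfig "$epairB" vnet "$netjail"
jexec "$netjail" ifconfig "$epairB" inet 10.88.0.2/16 up
jexec "$netjail" route add default 10.88.0.1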

I'm not sure what else would be useful to add to FreeBSD yet; I'm sure we'll all learn more as we continue to talk and experiment.

@davidchisnall

Thinking about this a bit more, it feels like Docker is a much better fit for the non-VNET model. The jails I used to manage had a very simple networking setup. I created a new loopback adaptor (lo1) and assigned them each an IP on that. They could all communicate, because they were on the same network interface. I then used pf to NAT these IPs and forward ports.
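In sketch form (addresses, jail parameters, and the pf rules here are illustrative, not a tested config):

# Classic non-VNET setup: jails share a cloned loopback, pf does NAT and
# port forwarding. Addresses and jail paths are examples.
ifconfig lo1 create
ifconfig lo1 inet 10.0.1.1/32 alias
ifconfig lo1 inet 10.0.1.2/32 alias
jail -c name=web path=/jails/web ip4.addr=10.0.1.1 persist
jail -c name=db  path=/jails/db  ip4.addr=10.0.1.2 persist

pf.conf (fragment)
ext_if = "em0"
nat on $ext_if inet from 10.0.1.0/24 to any -> ($ext_if)
rdr on $ext_if inet proto tcp from any to any port 80 -> 10.0.1.1 port 80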

VNET is newer but it isn't necessarily better. It allows more things (for example, raw sockets, which would allow jailed processes to forge the header if they weren't hidden behind a firewall that blocked faked source IPs) and it comes with different scalability issues. With VNET, each jail gets a separate instance of the network stack. This consumes more kernel memory but avoids lock contention. Generally, it's a good choice if you have a lot of RAM, a lot of cores, and a lot of jails, but for deployments with a handful of jails it will add overhead that you don't need. For a client device doing docker build it's probably not better.

For K8s, it's probably worth exposing some of the Netgraph bits to allow more arbitrary network topologies for a particular deployment.

@gizahNL
Contributor

gizahNL commented Nov 8, 2021

Thinking about this a bit more, it feels like Docker is a much better fit for the non-VNET model. The jails I used to manage had a very simple networking setup. I created a new loopback adaptor (lo1) and assigned them each an IP on that. They could all communicate, because they were on the same network interface. I then used pf to NAT these IPs and forward ports.

VNET is newer but it isn't necessarily better. It allows more things (for example, raw sockets, which would allow jailed processes to forge the header if they weren't hidden behind a firewall that blocked faked source IPs) and it comes with different scalability issues. With VNET, each jail gets a separate instance of the network stack. This consumes more kernel memory but avoids lock contention. Generally, it's a good choice if you have a lot of RAM, a lot of cores, and a lot of jails, but for deployments with a handful of jails it will add overhead that you don't need. For a client device doing docker build it's probably not better.

For K8s, it's probably worth exposing some of the Netgraph bits to allow more arbitrary network topologies for a particular deployment.

That won't work afaik, because Docker containers assume localhost to be 127.0.0.1 and assume it to be non-shared.
Afaik vnet is needed to give a jail its own loopback networking.

@gizahNL
Contributor

gizahNL commented Nov 8, 2021

Related Moby issue: moby/moby#33088

@davidchisnall

That won't work afaik, because Docker containers assume localhost to be 127.0.0.1 and assume it to be non-shared.
Afaik vnet is needed to give a jail its own loopback networking.

I don't believe that this is true. If you try to bind to 127.0.0.1 in a non-VNET jail, you will instead bind to the first IP provided to the jail. If you create a lo1 and assign a jail the IP 127.0.0.2 there, then the jail attempting to bind to 127.0.0.1:1234 will instead bind to 127.0.0.2:1234 on lo1, and lo0 for the host will be completely unaffected.
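A quick way to see this for yourself (the jail name and port are arbitrary examples):

# In a non-VNET jail whose ip4.addr is 127.0.0.2 on lo1, bind "127.0.0.1"
# and then check which address the listener actually got.
jexec testjail nc -l 127.0.0.1 1234 &
sockstat -4 -l | grep 1234    # expect the listener to show up on 127.0.0.2:1234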

@dfr
Contributor

dfr commented Apr 26, 2022

The kubernetes model groups containers into 'pods' which share a network namespace and there is an explicit expectation that containers in the pod can communicate via localhost (https://kubernetes.io/docs/concepts/workloads/pods/#pod-networking).

In this model, nothing runs at the pod level so there should be no issues with a two-level jail structure with the pod's jail owning the vnet and child jails for each container.
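A minimal sketch of that shape (parameters and paths are illustrative; in practice the engine/runtime would do this rather than a shell script):

# Sketch: the pod-level jail owns the vnet; each container is a child jail
# nested inside it and so shares the pod's network stack (and localhost).
podjid=$(jail -ci name=pod1 vnet=new children.max=16 persist path=/)
# ...attach an epair to the pod jail and configure addresses/routes from the
# host via jexec, as in the other examples in this thread...
jexec "$podjid" jail -c name=ctr1 path=/pods/pod1/ctr1/rootfs \
    ip4=inherit ip6=inherit persist
jexec "$podjid" jail -c name=ctr2 path=/pods/pod1/ctr2/rootfs \
    ip4=inherit ip6=inherit persist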

@kbruner

kbruner commented Sep 7, 2022

I'm highly interested in getting basic CNI support into runj to provide basic networking for Linux containers.

According to the CNI spec, the runtime needs to execute the CNI plugin.

I know there are efforts to port the CNI-supported plugins to FreeBSD, but I'm working on a pretty minimal and only-partially compliant placeholder CNI plugin for use with the new containerd support for FreeBSD.

As CNI support in the runtime would be critical to Kubernetes node-level support, is there any work done on adding that into runj? I can try to work on it to some degree but my Go skills are pretty basic.

@dfr
Contributor

dfr commented Sep 7, 2022

I have some mostly working CNI plugins for FreeBSD here: https://github.com/dfr/plugins/tree/freebsd. These assume the 'netns' parameter for the plugin is the name of a VNET jail. The container jail is nested in the VNET jail which lets all the containers in a pod communicate via localhost.

@dfr
Contributor

dfr commented Sep 7, 2022

Also, as far as I can tell from working with the github.com/containers stack, common practice is for CNI plugins to be executed by the container engine (e.g. podman, buildah, cri-o, containerd), initialising a network namespace (or jail for freebsd) which is passed to the runtime via the runtime spec.

@kbruner

kbruner commented Sep 7, 2022

I'm more interested in the Linux container side. I have no idea what's actually involved there as far as shoehorning that support into containerd and/or runj, or how much that overlaps with support for jails.

@samuelkarp
Owner Author

@dfr is correct; CNI support should be in the caller of runj rather than runj itself. containerd supports CNI plugins in its CRI implementation today. runj needs to support the networking primitives that the CNI plugins would then configure (the equivalent of a network namespace on Linux). I'm also interested in supporting networking outside CNI in the context of what jail(8) already supports.

@AkihiroSuda

I have some mostly working CNI plugins for FreeBSD here: https://github.com/dfr/plugins/tree/freebsd.

👍

These assume the 'netns' parameter for the plugin is the name of a VNET jail. The container jail is nested in the VNET jail which lets all the containers in a pod communicate via localhost.

I think you can just change the parameter name from netns to something like vnet.
The plugin name could also be changed to something like freebsd-vnet or freebsd-bridge (for consistency with win-bridge).

@dfr
Contributor

dfr commented Sep 8, 2022

I think you can just change the parameter name from netns to something like vnet. The plugin name could also be changed to something like freebsd-vnet or freebsd-bridge (for consistency with win-bridge).

I like the idea of changing the parameter name - I'll look into that. I'm mostly against changing the plugin name - I like it being called 'bridge' for consistency with Linux - this means that things like 'podman network create' just work on FreeBSD.

@samuelkarp
Owner Author

In #32 I've added a mechanism for runj to model FreeBSD extensions to the runtime spec and added a couple networking-related settings using that mechanism. The end result is that runj can now configure jails to have access to the host's IPv4 network stack (similar to host networking for Linux containers). I'd be happy to take more contributions using this mechanism that model additional network settings (including those that might be needed by CNI plugins like interfaces and VNET settings) as well as modeling parent-child jail relationships.

@dfr
Contributor

dfr commented Sep 11, 2022 via email

@samuelkarp
Owner Author

samuelkarp commented Sep 13, 2022

@dfr Thanks, that's an interesting approach. I think it's reasonable as a prototyping mechanism that we could add to runj, but probably not something I think would be appropriate to upstream into the spec itself. I would expect the spec to have a slightly higher level of abstraction such that the backend could be swapped out for something that isn't a jail (for example, possibly a bhyve VM) but still supports largely the same set of FreeBSD-specific features. As an example, the Linux portion of the spec models cgroups (which are used for resource limits) but it doesn't specify the exact materialization into cgroupfs.

@samuelkarp
Owner Author

I've started playing around with vnet and trying to set up a bridged network similar to what Docker does on Linux, but I'm having trouble figuring out what I'm missing (probably both that I'm misunderstanding exactly what Docker is doing and that I'm failing to translate that to FreeBSD). On Linux, Docker creates a bridge and then a veth pair for each container, adding one end to the bridge and moving the other end into the container. Inside the container, the veth is set up with an IP address and that IP is then used as the next hop for the default route. There is also a set of iptables rules created on the host, though I'm not sure if those are used for normal traffic forwarding or are primarily used for exposing ports. The bridge is a separate non-overlapping CIDR from the host's network (172.17.0.0/16 by default) and something (?) is performing NAT.

iptables configuration
# Generated by iptables-save v1.8.7 on Tue Nov 29 19:30:42 2022
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Tue Nov 29 19:30:42 2022
# Generated by iptables-save v1.8.7 on Tue Nov 29 19:30:42 2022
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Tue Nov 29 19:30:42 2022

I've been able to follow this guide to bridge an epair inside a jail with the primary interface in my VM and allow the jail to initiate DHCP from the network attached to the VM (in this case, VirtualBox's built-in DHCP server).

That's not quite the same thing though. I can also omit DHCP and do static IP addressing for the bridge and for the epair interfaces (either side?), though no matter what I do I don't have a working bidirectional network. I suspect packets are being sent but nothing is being received back, and as I'm typing this out to explain what I'm seeing I'm thinking that I'm likely missing something about configuring NAT.

Here's what I've been doing

On the host/VM:

# ifconfig bridge0 create
# ifconfig epair0 create
# ifconfig bridge0 inet 172.17.0.0/16
# ifconfig bridge0 \
    addm em0 \
    addm epair0a \
    up

In my jail.conf:

  vnet;
  vnet.interface = "epair0b";
  allow.mount;
  allow.raw_sockets = 1;
  mount.devfs;

Inside the jail:

# ifconfig epair0b link 00:00:00:00:00:01
# ifconfig epair0b inet 172.17.0.2/32
# route add -net default 172.17.0.2

I've also tried:

  • setting the IP address on epair0a
    • the same as epair0b
    • different from epair0b
  • setting the default route to be 172.17.0.1 or 172.17.0.0
  • having only epair0a on the bridge without em0

I'm going to continue looking, but figured I'd post here in case anyone has suggestions/pointers for me to look at.

Meanwhile: I'll be adding vnet and vnet.interface parameters to runj so at least the workflow described in this guide could be adapted to work for runj.

@Peter2121

I've started playing around with vnet and trying to set up a bridged network similar to what Docker does on Linux, but I'm having trouble figuring out what I'm missing (probably both that I'm misunderstanding exactly what Docker is doing and that I'm failing to translate that to FreeBSD).

You can have a look at the VNET jail management in CBSD.
I use VNET jails managed by CBSD now. There is an epair interface for every jail and a bridge interface to go out of the jails. To communicate with the external world there are some pf (or ipfw) rules: an automatic 'hide' NAT plus incoming PAT managed by 'cbsd expose'.

@dfr
Contributor

dfr commented Nov 30, 2022

I use a very similar approach to handle networking for podman and buildah. Take a look at https://github.com/dfr/plugins/tree/freebsd - the code which manages the epairs is in pkg/ip/link_freebsd.go. Interface addresses are assigned from a private address pool using ipam and NAT is enabled by putting those addresses into a PF table used by nat rules in /etc/pf.conf.

These plugins are in the ports tree and you can install them with pkg install containernetworking-plugins. I believe that containerd supports CNI so you may be able to use this directly. The name of the vnet jail is passed in via the CNI_NETNS environment variable (typically managed by github.com/containernetworking/cni/libcni). This could be the container jail but requires compatible ifconfig and route binaries inside the container. As you know, for podman/buildah I use a separate jail for networking with containers as children of the networking jail.
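For anyone poking at this by hand: a CNI plugin is just an executable driven by CNI_* environment variables plus a JSON network config on stdin. Roughly (the plugin install path and the exact config keys accepted by the FreeBSD bridge plugin are assumptions here):

# Sketch: invoking a CNI plugin the way an engine would.
# On FreeBSD, CNI_NETNS names the vnet jail (per the plugins above).
export CNI_COMMAND=ADD
export CNI_CONTAINERID=pod1
export CNI_NETNS=pod1-net                 # name of the vnet jail
export CNI_IFNAME=eth0
export CNI_PATH=/usr/local/libexec/cni    # plugin directory (assumed)
echo '{
  "cniVersion": "0.4.0",
  "name": "podnet",
  "type": "bridge",
  "ipam": { "type": "host-local", "subnet": "10.88.0.0/16" }
}' | "$CNI_PATH/bridge"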

@samuelkarp
Owner Author

samuelkarp commented Dec 1, 2022

@dfr thanks for that! I've tried reading through the code in the bridge plugin and I'm ending up with steps that are roughly the same as what I was doing (and I'm running into similar problems). I did find where the PF table is manipulated, but I'm guessing that I'm missing the table creation since all I see are add and delete commands.

Here's what I've been trying
  1. Create the bridge: ifconfig bridge create name bridge0
  2. Create the epair: ifconfig epair create
  3. Set a description (I didn't know this was a thing!): ifconfig epair0a description "host-side interface"
  4. Set a mac address on the jail-side interface: ifconfig epair0b link 00:00:00:00:00:01
  5. Add the host-side interface to the bridge: ifconfig bridge0 addm epair0a
  6. Bring the host-side interface up: ifconfig epair0a up
  7. Add an IP and subnet mask to the bridge: ifconfig bridge0 alias 172.17.0.1/16
  8. Enable IP forwarding: sysctl net.inet.ip.forwarding=1
  9. Add the jail IP address to a PF table: pfctl -t jail-nat -T add 172.17.0.2/32
  10. Start a jail and pass the epair0b interface into the vnet (I did this in a jail.conf file)
  11. (inside the jail) Assign the IP to the interface: ifconfig epair0b inet 172.17.0.2/32
  12. (inside the jail) Bring the interface up: ifconfig epair0b up
  13. (inside the jail) Add a route to the bridge, using the epair0b IP as the gateway: route -4 add 172.17.0.1/16 172.17.0.2
  14. (inside the jail) Add a default route using the bridge gateway: route -4 add default 172.17.0.1

However if I try to ping an IP address (8.8.8.8, for example) I get this output: ping: sendto: Invalid argument

From outside the jail, the route table looks like this:

% netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.0.2.2           UGS         em0
10.0.2.0/24        link#1             U           em0
10.0.2.15          link#1             UHS         lo0
127.0.0.1          link#2             UH          lo0
172.17.0.0/16      link#3             U       bridge0
172.17.0.1         link#3             UHS         lo0

Internet6:
Destination                       Gateway                       Flags     Netif Expire
::/96                             ::1                           UGRS        lo0
::1                               link#2                        UHS         lo0
::ffff:0.0.0.0/96                 ::1                           UGRS        lo0
fe80::/10                         ::1                           UGRS        lo0
fe80::%em0/64                     link#1                        U           em0
fe80::a00:27ff:fef3:cd05%em0      link#1                        UHS         lo0
fe80::%lo0/64                     link#2                        U           lo0
fe80::1%lo0                       link#2                        UHS         lo0
ff02::/16                         ::1                           UGRS        lo0

From inside the jail, it looks like this:

# netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            172.17.0.1         UGS     epair0b
172.17.0.0/16      172.17.0.2         UGS     epair0b
172.17.0.2         link#2             UH          lo0

I see the following line in dmesg:

arpresolve: can't allocate llinfo for 172.17.0.1 on epair0b

@dfr
Contributor

dfr commented Dec 1, 2022 via email

@samuelkarp
Owner Author

This is the current jail.conf I'm using:

foo {
  host.hostname = "jail3";
  path = "/";
  persist;
  vnet;
  vnet.interface = "epair0b";
  allow.mount;
  allow.raw_sockets = 1;
  mount.devfs;
  devfs_ruleset = 110;
}

(You can see I'm very creative with names like "foo" and "jail3"). The devfs ruleset is:

[devfsrules_jail_vnet_sam=110]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_jail
add include $devfsrules_jail_vnet
add path 'bpf*' unhide

(this was needed for dhclient to work per the guide I was following before)

I've tried this both with an empty /etc/pf.conf (since I didn't have PF set up at all before) and with this content:

pass from bridge0:network to any keep state

I see that you responded from email; I updated that comment on GitHub with a bit more information there too.

@dfr
Contributor

dfr commented Dec 1, 2022

This is my pf.conf. The weird v4fib0egress stuff is coming from sysutils/egress-monitor - you can replace it with the outgoing interface name:

v4egress_if = "v4fib0egress"
v6egress_if = "v6fib0egress"
nat on $v4egress_if inet from <cni-nat> to any -> ($v4egress_if)
nat on $v6egress_if inet6 from <cni-nat> to !ff00::/8 -> ($v6egress_if)
rdr-anchor "cni-rdr/*"
table <cni-nat>

@dfr
Contributor

dfr commented Dec 1, 2022

Also, the rdr-anchor bit is only needed for 'port publishing', which adds redirect rules to route traffic into the container. And it looks like I'm wrong about the table being auto-created.
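For reference, a sketch of both pieces (the anchor and rule names are examples): declaring the table persist in pf.conf keeps it around even when empty, and a published port is just a redirect rule loaded into a per-container anchor.

pf.conf (fragment)
table <cni-nat> persist
rdr-anchor "cni-rdr/*"

# publish host port 8080 -> 10.88.0.2:80 by loading a rule into the anchor
echo 'rdr pass on em0 inet proto tcp from any to any port 8080 -> 10.88.0.2 port 80' \
    | pfctl -a 'cni-rdr/ctr1' -f -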

@samuelkarp
Owner Author

samuelkarp commented Dec 2, 2022

I rebooted to start from a fresh state, tried your /etc/pf.conf content, and I'm getting the same behavior and errors (ping output + dmesg lines).

lo0: link state changed to UP
em0: link state changed to UP
bridge0: Ethernet address: 58:9c:fc:00:12:0a
epair0a: Ethernet address: 02:7c:99:00:d2:0a
epair0b: Ethernet address: 02:7c:99:00:d2:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
bridge0: link state changed to UP
epair0a: promiscuous mode enabled
arpresolve: can't allocate llinfo for 172.17.0.1 on epair0b
arpresolve: can't allocate llinfo for 172.17.0.1 on epair0b
arpresolve: can't allocate llinfo for 172.17.0.1 on epair0b

@dfr
Contributor

dfr commented Dec 2, 2022

That's odd. I think it would be helpful to see exactly where the outgoing packet is being rejected. The tcpdump utility is often helpful here:

  • in the jail, run 'tcpdump -vv -e -i epair0b'
  • on the host do the same with epair0a, bridge0 and em0 to see how far the packet gets.

Also, the output of 'jls -n' on the host might be helpful.

On my dev machine, I have this in /etc/sysctl.conf:

security.jail.allow_raw_sockets=1
net.inet.ip.forwarding=1       # Enable IP forwarding between interfaces
net.link.bridge.pfil_onlyip=0  # Only pass IP packets when pfil is enabled
net.link.bridge.pfil_bridge=0  # Packet filter on the bridge interface
net.link.bridge.pfil_member=0  # Packet filter on the member interface

I think the allow_raw_sockets bit might be a clue?

@samuelkarp
Owner Author

in the jail, run 'tcpdump -vv -e -i epair0b'

With that running in one terminal and ping 8.8.8.8 in the other, ping reports packets sent while tcpdump sees no packets. That's interesting.

ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Invalid argument
ping: sendto: Invalid argument
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
tcpdump -vv -e -i epair0b
tcpdump: listening on epair0b, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

This is both before and after changing sysctls to match (with the sysctl(8) utility, not /etc/sysctl.conf).

Also, the output of 'jls -n' on the host might be helpful.

devfs_ruleset=110 nodying enforce_statfs=2 host=new ip4=inherit ip6=inherit jid=1 linux=new name=foo osreldate=1301000 osrelease=13.1-RELEASE parent=0 path=/ persist securelevel=-1 sysvmsg=disable sysvsem=disable sysvshm=disable vnet=new allow.nochflags allow.nomlock allow.mount allow.mount.nodevfs allow.mount.nofdescfs allow.mount.nolinprocfs allow.mount.nolinsysfs allow.mount.noprocfs allow.mount.notmpfs allow.mount.nozfs allow.noquotas allow.raw_sockets allow.noread_msgbuf allow.reserved_ports allow.set_hostname allow.nosocket_af allow.suser allow.nosysvipc allow.unprivileged_proc_debug children.cur=0 children.max=0 cpuset.id=3 host.domainname="" host.hostid=0 host.hostname=jail3 host.hostuuid=00000000-0000-0000-0000-000000000000 ip4.addr= ip4.saddrsel ip6.addr= ip6.saddrsel linux.osname=Linux linux.osrelease=3.17.0 linux.oss_version=198144

@dfr
Contributor

dfr commented Dec 2, 2022

Ok, we need to figure out what is causing sendto to return EINVAL - the packet is being rejected on the way into the kernel on the jail side and doesn't reach epair0b. I'll have time to try and reproduce this later today - I'll update this issue if I find anything.

@gizahNL
Contributor

gizahNL commented Dec 2, 2022

Perhaps a stupid question @samuelkarp, but did you add a default route? iirc Invalid argument is returned by ping when no route exists for the destination IP

@dfr
Contributor

dfr commented Dec 2, 2022

If there isn't a default route you will get ENETUNREACH, not EINVAL:

$ sudo jexec 1 ping -c1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network is unreachable
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

I haven't been able to reproduce the same error that Sam had yet. I built a fresh VM running 13.1-RELEASE with the following setup:

rc.conf
hostname="jail-net-test"
keymap="uk.kbd"
ifconfig_xn0="DHCP"
ifconfig_xn0_ipv6="inet6 accept_rtadv"
sshd_enable="YES"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
zfs_enable="YES"
pf_enable="YES"
pf.conf
egress_if = "xn0"
nat on $egress_if inet from <cni-nat> to any -> ($egress_if)
nat on $egress_if inet6 from <cni-nat> to !ff00::/8 -> ($egress_if)
rdr-anchor "cni-rdr/*"
table <cni-nat>

In the VM, I run this script to setup a jail and run a ping test (assume that /root/testjail is pre-populated with base.txz from the install media):

runtest.sh
#! /bin/sh

sysctl net.inet.ip.forwarding=1

mount -t devfs -o ruleset=4 devfs /root/testjail/dev
j=$(jail -ci name=testjail vnet=new allow.raw_sockets path=/root/testjail persist)

epairA=$(ifconfig epair create)
epairB=$(echo $epairA | sed -e 's/a$/b/')

bridge=$(ifconfig bridge create)

ifconfig $bridge inet 10.99.0.1/16
ifconfig $bridge addm $epairA

ifconfig $epairA up
ifconfig $epairB vnet $j
jexec $j ifconfig $epairB inet 10.99.0.2/16
jexec $j route add default 10.99.0.1
pfctl -t cni-nat -T add 10.99.0.2

jexec $j ping -c5 8.8.8.8

which gives the following:

$ sudo ./runtest.sh
net.inet.ip.forwarding: 0 -> 1
add net default: gateway 10.99.0.1
1/1 addresses added.
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=119 time=6.983 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=6.710 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=8.820 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=6.516 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=6.536 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 6.516/7.113/8.820/0.870 ms

@samuelkarp
Owner Author

@dfr Thanks! I tried the script and got the following:

% sudo ./runtest.sh
net.inet.ip.forwarding: 0 -> 1
add net default: gateway 10.99.0.1
1 table created.
1/1 addresses added.
PING 8.8.8.8 (8.8.8.8): 56 data bytes

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss

I then updated with freebsd-update and pkg upgrade and rebooted...and then the script worked. So I think I must have had an older version of something in the system that was not working. I should have tried that sooner and I appreciate you helping me with this.

@samuelkarp
Owner Author

samuelkarp commented Dec 3, 2022

I've figured out what I was doing wrong.

In @dfr's script, the subnet mask on the interface in the jail is 255.255.0.0 (a /16). I had been setting the mask as 255.255.255.255 (/32). In @dfr's script, there's only a single default route to the gateway on the interface (the address ending in .1). In my attempt, I had two routes: one to route the /16 to the /32 as if it were a gateway and a second default route that matched @dfr's. This seems to have been an invalid configuration and caused the EINVAL.

Edit: also, the NAT rules in PF are required as something needs to perform translation.
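Spelled out, jail-side only (the host-side bridge/epair/NAT setup is unchanged):

# Broken (what I had): a /32 address plus a manual route for the /16
ifconfig epair0b inet 172.17.0.2/32
route -4 add 172.17.0.1/16 172.17.0.2
route -4 add default 172.17.0.1

# Working (what @dfr's script does): a /16 address and a single default route
ifconfig epair0b inet 172.17.0.2/16 up
route add default 172.17.0.1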

@dfr
Contributor

dfr commented Dec 3, 2022

Thanks for de-mystifying - it always bothers me to not understand why something doesn't work. The NAT is required to allow the jail's traffic to reach the public internet, but without it you can still ping the bridge.
