
fix docker network caps #2273

Merged: 4 commits into juanfont:main on Dec 9, 2024

Conversation

@kradalby (Collaborator) commented Dec 9, 2024

It's December, and all integration tests requiring networking seem to have broken...

Docker shipped a patch release that changed the permissions required to create tun devices in containers. This caused every container in the tests to fail, which in turn failed all tests. This PR fixes it and adds some tools for debugging in the future.

Signed-off-by: Kristoffer Dalby <[email protected]>
@vdovhanych (Contributor) commented:

It could be because GitHub changed ubuntu-latest from ubuntu-22.04 to ubuntu-24.04.

@kradalby (Collaborator, Author) commented Dec 9, 2024

> It could be because GitHub changed ubuntu-latest from ubuntu-22.04 to ubuntu-24.04.

I thought so too, but I've now tested with 20, 22, and 24 with the same result...

It looks like the Docker containers don't start, which is odd since they don't crash locally. I'll try to get logs from the tailscale containers, but it's a bit hard since they don't run.
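One way to grab logs from containers that die too fast for the CLI is to go through the Docker API instead. A minimal sketch using the go-dockerclient API that dockertest wraps (the container name is just an example from this test run, not a stable identifier):

```go
package main

import (
	"log"
	"os"

	"github.com/ory/dockertest/v3"
	dc "github.com/ory/dockertest/v3/docker"
)

func main() {
	// An empty endpoint uses DOCKER_HOST or the default daemon socket.
	pool, err := dockertest.NewPool("")
	if err != nil {
		log.Fatal(err)
	}
	// Fetch stdout/stderr of a container, even one that has already exited.
	// "ts-head-uhffys" is an example name; substitute the real container.
	err = pool.Client.Logs(dc.LogsOptions{
		Container:    "ts-head-uhffys",
		OutputStream: os.Stdout,
		ErrorStream:  os.Stderr,
		Stdout:       true,
		Stderr:       true,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```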

@vdovhanych (Contributor) commented Dec 9, 2024

Ah, I see.

This might be it, though. You're launching Docker from Go and it can't find it in PATH:

general_test.go:50: failed to create headscale environment: failed to list docker containers: exec: "docker": executable file not found in $PATH

You could try checking Docker in the workflow file by running docker info, and maybe also check PATH to see what the issue is.

@kradalby (Collaborator, Author) commented Dec 9, 2024

> This might be it, though. You're launching Docker from Go and it can't find it in PATH.

That's just the last test, where I tried to call docker ps just to see what's up; the previous run fails with the "correct" error.

Take a look at the last one, and it should be the same error as we see on the other PRs.

@kradalby (Collaborator, Author) commented Dec 9, 2024

OK, given that there are no logs for the tailscale client in the log artefact uploaded after the test fails, I suspect that the container might not be starting at all 🤔

@vdovhanych (Contributor) commented:

I think it could still be related. Go launches the commands in a shell that might be stripping everything configured in PATH, so the containers won't launch because of that.

@kradalby (Collaborator, Author) commented Dec 9, 2024

> I think it could still be related. Go launches the commands in a shell that might be stripping everything configured in PATH, so the containers won't launch because of that.

The way we use Docker doesn't shell out at all; it speaks the Docker API directly to the socket.
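A minimal sketch of what that means, assuming the usual dockertest setup (illustrative, not the exact headscale code): the client dials the daemon socket itself, so the `docker` binary and shell PATH never come into play.

```go
package main

import (
	"log"

	"github.com/ory/dockertest/v3"
)

func main() {
	// An empty endpoint makes dockertest use DOCKER_HOST or fall back to
	// the default unix:///var/run/docker.sock; no `docker` binary is exec'd.
	pool, err := dockertest.NewPool("")
	if err != nil {
		log.Fatalf("could not construct pool: %v", err)
	}
	// Ping talks HTTP over the socket straight to the daemon.
	if err := pool.Client.Ping(); err != nil {
		log.Fatalf("could not reach Docker daemon: %v", err)
	}
	log.Println("connected to Docker over the socket, no shell involved")
}
```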

I think I've found the issue. I joined the job to a dev tailnet to be able to SSH in:

runner@fv-az1047-522:~$ docker logs -f ts-head-uhffys
WARNING: Skipping duplicate certificate in file ca-cert-user-4.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-5.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-3.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-1.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-2.pem
2024/12/09 11:31:40 logtail started
2024/12/09 11:31:40 Program starting: v1.79.0-dev20241206-tc2761162a, Go 1.23.4: []string{"tailscaled", "--tun=tsdev", "--verbose=10"}
2024/12/09 11:31:40 LogID: a4ab14e7bc330192c0a3616853d344f0e44b41f20e7adb0a7d76c42c66be8f0d
2024/12/09 11:31:40 logpolicy: using system state directory "/var/lib/tailscale"
logpolicy.ConfigFromFile /var/lib/tailscale/tailscaled.log.conf: open /var/lib/tailscale/tailscaled.log.conf: no such file or directory
logpolicy.Config.Validate for /var/lib/tailscale/tailscaled.log.conf: config is nil
2024/12/09 11:31:40 dns: [rc=unknown ret=direct]
2024/12/09 11:31:40 dns: using "direct" mode
2024/12/09 11:31:40 dns: using *dns.directManager
2024/12/09 11:31:40 dns: inotify addwatch: context canceled
2024/12/09 11:31:40 wgengine.NewUserspaceEngine(tun "tsdev") ...
2024/12/09 11:31:40 Linux kernel version: 6.8.0-1017-azure
2024/12/09 11:31:40 is CONFIG_TUN enabled in your kernel? `modprobe tun` failed with: modprobe: can't change directory to '/lib/modules': No such file or directory
2024/12/09 11:31:40 wgengine.NewUserspaceEngine(tun "tsdev") error: tstun.New("tsdev"): operation not permitted
2024/12/09 11:31:40 flushing log.
2024/12/09 11:31:40 logger closing down
2024/12/09 11:31:40 getLocalBackend error: createEngine: tstun.New("tsdev"): operation not permitted

So something has changed in the runners that no longer allows us to create tun devices...
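The failing call boils down to opening /dev/net/tun inside the container. A tiny standalone probe, sketched here as an aside (it is not part of this PR), reproduces the same "operation not permitted" when the device isn't allowed:

```go
// tunprobe mimics the call that fails in tailscaled: opening /dev/net/tun.
// In a container without an explicit device rule this exits with EPERM.
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		fmt.Println("tun unavailable:", err) // e.g. "operation not permitted"
		os.Exit(1)
	}
	defer f.Close()
	fmt.Println("tun device opened OK")
}
```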

@kradalby (Collaborator, Author) commented Dec 9, 2024

Compared to an old job from the archive:

WARNING: Skipping duplicate certificate in file ca-cert-user-1.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-5.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-0.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-4.pem
WARNING: Skipping duplicate certificate in file ca-cert-user-3.pem
2024/11/25 09:26:50 logtail started
2024/11/25 09:26:50 Program starting: v1.74.1-t0ca17be4a, Go 1.23.1: []string{"tailscaled", "--tun=tsdev", "--verbose=10"}
2024/11/25 09:26:50 LogID: 7ccc4c8f8f541d8e4e4fd8001f219fd48405f0047f6ca444835710394293ba0e
2024/11/25 09:26:50 logpolicy: using system state directory "/var/lib/tailscale"
logpolicy.ConfigFromFile /var/lib/tailscale/tailscaled.log.conf: open /var/lib/tailscale/tailscaled.log.conf: no such file or directory
logpolicy.Config.Validate for /var/lib/tailscale/tailscaled.log.conf: config is nil
2024/11/25 09:26:50 dns: [rc=unknown ret=direct]
2024/11/25 09:26:50 dns: using "direct" mode
2024/11/25 09:26:50 dns: using *dns.directManager

@kradalby (Collaborator, Author) commented Dec 9, 2024

OK, so far this is my guess/take:

  • Something has changed on the GitHub runners
  • modprobe, or the permissions the containers get, has become stricter
  • is this how they are mounted into the container?
  • is this how the container is running?
  • is it forever disabled or is it something we can turn on?

Any insight, help, and so on is appreciated. I am quite frustrated, so I will take a break.

@kradalby (Collaborator, Author) commented Dec 9, 2024

Ah this is relevant:
opencontainers/runc@2ce40b6
tailscale/tailscale#14256

Docker pulled the rug out from under our feet.
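For context: the linked runc change appears to drop tun/tap from runc's default device allow-list, so containers now need the device granted explicitly. A sketch of the general shape of such a fix (an assumption for illustration; see the PR diff for the actual change), granting /dev/net/tun plus NET_ADMIN through dockertest's host-config hook:

```go
package main

import (
	"log"

	"github.com/ory/dockertest/v3"
	"github.com/ory/dockertest/v3/docker"
)

func main() {
	pool, err := dockertest.NewPool("")
	if err != nil {
		log.Fatal(err)
	}
	opts := &dockertest.RunOptions{
		Repository: "tailscale/tailscale", // image/tag here are illustrative
		Tag:        "latest",
	}
	resource, err := pool.RunWithOptions(opts, func(hc *docker.HostConfig) {
		// Grant the network capability and explicitly allow the tun device,
		// since runc no longer includes it in its defaults.
		hc.CapAdd = append(hc.CapAdd, "NET_ADMIN")
		hc.Devices = append(hc.Devices, docker.Device{
			PathOnHost:        "/dev/net/tun",
			PathInContainer:   "/dev/net/tun",
			CgroupPermissions: "rwm",
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Purge(resource)
}
```

Running the containers in privileged mode (as suggested below) would also work, but granting only the device and capability keeps the containers closer to least privilege.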

@vdovhanych (Contributor) commented:
This is really interesting. I think it needs to run in privileged mode for it to work; by default it's not, I think.

@kradalby (Collaborator, Author) commented Dec 9, 2024

OK, the last commit did the trick; thanks @Erisa for pointing me in the right direction.

I'll wrap up the PR later: I want to add some of the new debug/SSH/tailscale steps to the generator so I have them at hand in the future.

@kradalby (Collaborator, Author) commented Dec 9, 2024

@juanfont, you can review/approve so I can get it in when that is done.

@kradalby force-pushed the kradalby/december-gh-action-broken branch from cd0fbae to c3a7c40 on December 9, 2024 14:45
@kradalby changed the title from "debug gh action integration tests broken" to "fix docker network caps" on Dec 9, 2024
@kradalby marked this pull request as ready for review on December 9, 2024 14:47
@kradalby enabled auto-merge (squash) on December 9, 2024 14:55
@kradalby force-pushed the kradalby/december-gh-action-broken branch from c3a7c40 to d767a09 on December 9, 2024 15:14
@kradalby force-pushed the kradalby/december-gh-action-broken branch from d767a09 to b63ec7d on December 9, 2024 15:15
@kradalby merged commit 08bd4b9 into juanfont:main on Dec 9, 2024
124 of 125 checks passed