Summary
I've observed a couple of times that when a k8s unit reboots, kube-proxy fails to start. This breaks many functions of the node with somewhat perplexing symptoms.
The core of the problem appears to be:
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: + exec /snap/k8s/313/bin/kube-proxy --cluster-cidr=10.1.0.0/16 --healthz-bind-address=127.0.0.1 --hostname-override=juju-bd78f7-stg-netbox-30 --kubeconfig=/etc/kubernetes/proxy.conf --profiling=false
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.621814 611 server_linux.go:69] "Using iptables proxy"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.651253 611 server.go:1062] "Successfully retrieved node IP(s)" IPs=["10.142.102.91"]
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.652936 611 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=131072
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: E0824 00:33:53.653060 611 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: E0824 00:33:53.653169 611 run.go:74] "command failed" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 systemd[1]: snap.k8s.kube-proxy.service: Main process exited, code=exited, status=1/FAILURE
That is, kube-proxy tries to configure conntrack before the kernel module has loaded.
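A quick way to confirm this state on an affected unit (a sketch only; the unit name is taken from the journal lines above):
# Is the conntrack module loaded, and does the sysctl kube-proxy needs exist?
lsmod | grep -w nf_conntrack || echo "nf_conntrack not loaded"
ls /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo "nf_conntrack_max sysctl missing"
# Confirm the service failed during this boot.
journalctl -b -u snap.k8s.kube-proxy.service --no-pager | tail -n 20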
Here's a gist with the full output from boot (line 319 is where I started it manually): https://gist.github.com/vmpjdc/06913c8125814eb98f8ebda3fd356ab2
What Should Happen Instead?
The kube-proxy service should start reliably on boot.
Reproduction Steps
Deploy k8s using Juju:
juju deploy -n3 --channel 1.30/beta --constraints 'mem=8G root-disk=50G cores=2' k8s
Optionally, deploy some services into the cluster with Juju.
Reboot a k8s unit.
Observe that kube-proxy did not start (Current = inactive). (I'm not sure whether this happens every single time.)
Run kubectl get pods -A and observe that some pods (probably a cilium pod, maybe others) are in non-Running states, e.g. Unknown or Terminating.
Start kube-proxy:
juju exec -u k8s/0 -- snap start k8s.kube-proxy
Observe that the cluster recovers. (If it does not, delete the affected pods and they should respawn quickly, as in the example below.)
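For example, to clear a stuck pod (the pod name here is illustrative; substitute whatever kubectl get pods -A reports as non-Running):
kubectl -n kube-system delete pod cilium-abc12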
System information
Script does not exist. Here's what the charm installed:
installed: v1.30.0 (313) 109MB classic,held
Can you suggest a fix?
If it's possible to customize the systemd units that snapd creates, allowing more retries would probably work around the problem. In the meantime I can work around it locally by installing a suitable override myself; and, for that matter, so could the charm.
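As a sketch of such an override (untested; the unit name comes from the journal output above, and the retry settings are arbitrary placeholders):
# Drop-in for the snapd-generated kube-proxy unit: keep retrying on failure
# instead of giving up while the conntrack module is still missing.
sudo mkdir -p /etc/systemd/system/snap.k8s.kube-proxy.service.d
sudo tee /etc/systemd/system/snap.k8s.kube-proxy.service.d/override.conf <<'EOF'
[Unit]
StartLimitIntervalSec=0
[Service]
Restart=on-failure
RestartSec=5
EOF
sudo systemctl daemon-reload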
Are you interested in contributing with a fix?
No response
Glad to announce that we merged a patch yesterday which should alleviate this issue: #743
NOTE: you will still need to ensure the nf_conntrack module is installed on your host (Deb package name is linux-modules-$(uname -r) if you're running on Ubuntu/Debian), as the patch will only try to load it on kube-proxy startup.
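For example, one way to ensure the module is available and loaded at boot (a sketch for Ubuntu/Debian, using the package name mentioned above):
sudo apt-get install -y "linux-modules-$(uname -r)"
# Load the module now and on every subsequent boot.
sudo modprobe nf_conntrack
echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf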
You should already see it in the edge channel if you'd like to test it, and it will be included in the stable tracks of the snap when those are published.
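For example (the channel shown is an assumption; use whichever track your deployment follows):
sudo snap refresh k8s --channel=1.30/edge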