What you expected to happen?
I expect socket writes on the weave control plane to time out and fail, by setting a write deadline (https://golang.org/pkg/net/#TCPConn.SetWriteDeadline), instead of blocking forever when the send buffer is congested.
What happened?
Weave's internal design revolves heavily around single-threaded event loops (called “actor loops”). Some of these actor loops are responsible for gossip; one thing they gossip about is IPAM allocations.
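For illustration, here is a minimal, hypothetical sketch of the actor-loop pattern (not Weave's actual code): a single goroutine drains a channel of queued actions, so if any one action blocks on a network write, everything queued behind it, including status queries, backs up.

```go
package main

import (
	"fmt"
	"net"
)

// actor is a minimal, hypothetical sketch of a single-threaded
// "actor loop"; it is not Weave's actual implementation.
type actor struct {
	actions chan func()
}

func newActor() *actor {
	a := &actor{actions: make(chan func())}
	go a.loop()
	return a
}

// loop processes one queued action at a time. If any action blocks (for
// example, on a TCP write whose send buffer is full), no further actions run.
func (a *actor) loop() {
	for act := range a.actions {
		act()
	}
}

// sendGossip queues a network write onto the actor loop. Without a write
// deadline, conn.Write can block forever once the peer stops reading.
func (a *actor) sendGossip(conn net.Conn, msg []byte) {
	a.actions <- func() {
		if _, err := conn.Write(msg); err != nil {
			fmt.Println("gossip write failed:", err)
		}
	}
}

// status is what an HTTP status handler would call. It blocks on the
// channel send until the loop is free, analogous to the "chan send"
// state of the stuck status-check goroutine shown below.
func (a *actor) status() string {
	reply := make(chan string)
	a.actions <- func() { reply <- "ok" }
	return <-reply
}

func main() {
	a := newActor()
	fmt.Println(a.status()) // prints "ok" while the loop is healthy
}
```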
In our setup at the Testground project, we have ~120 Kubernetes nodes, each running weave as a DaemonSet, and gossiping with others about IPAM allocations.
We deploy test workloads of as many as 1,500 pods at once. As they come up and weave allocates IPs for them, IPAM gossip spikes.
In some cases, a connection's TCP send window gets congested, and the Allocator actor loop blocks on that write forever. That is what happened to us here:
Goroutine stuck sending a ring update
```
goroutine 146 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0x7fd626e7eb68, 0x77, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc0003da618, 0x77, 0x14f00, 0x14fea, 0xffffffffffffffff)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc0003da600, 0xc003190000, 0x21369, 0x21369, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:276 +0x2a9
net.(*netFD).Write(0xc0003da600, 0xc003190000, 0x21369, 0x21369, 0xc000aa5aa8, 0x447b7c, 0x22000)
    /usr/local/go/src/net/fd_unix.go:220 +0x4f
net.(*conn).Write(0xc001796360, 0xc003190000, 0x21369, 0x21369, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:196 +0x68
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*lengthPrefixTCPSender).Send(0xc001c692d0, 0xc00314c000, 0x21365, 0x22000, 0x21365, 0xc00314c000)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/protocol_crypto.go:110 +0xe1
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*LocalConnection).sendProtocolMsg(...)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection.go:410
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*LocalConnection).SendProtocolMsg(0xc0017ec6e0, 0x1ca5005, 0xc0030fc000, 0x21364, 0x213c0, 0x0, 0x3)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/connection.go:127 +0xbe
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*gossipChannel).relayUnicast(0xc000272700, 0xa0897fc131a, 0xc0030fc000, 0x21364, 0x213c0, 0x213c0, 0x447dae)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/gossip_channel.go:108 +0xce
github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*gossipChannel).GossipUnicast(0xc000272700, 0xa0897fc131a, 0xc003026000, 0x21336, 0x22000, 0xc003026000, 0x1)
    /go/src/github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh/gossip_channel.go:77 +0x18f
github.com/weaveworks/weave/ipam.(*Allocator).sendRingUpdate(0xc00038a420, 0xa0897fc131a)
    /go/src/github.com/weaveworks/weave/ipam/allocator.go:814 +0xe8
github.com/weaveworks/weave/ipam.(*Allocator).donateSpace(0xc00038a420, 0x1ce900001ce80000, 0xa0897fc131a)
    /go/src/github.com/weaveworks/weave/ipam/allocator.go:918 +0x37d
github.com/weaveworks/weave/ipam.(*Allocator).OnGossipUnicast.func1()
    /go/src/github.com/weaveworks/weave/ipam/allocator.go:588 +0x306
github.com/weaveworks/weave/ipam.(*Allocator).actorLoop(0xc00038a420, 0xc000438120, 0xc0002f2060)
    /go/src/github.com/weaveworks/weave/ipam/allocator.go:698 +0x203
created by github.com/weaveworks/weave/ipam.(*Allocator).Start
    /go/src/github.com/weaveworks/weave/ipam/allocator.go:183 +0x18f
```
That stuck goroutine seizes up the entire actor loop, which has various nasty downstream effects:
1. Status check goroutine stuck:
```
goroutine 5008 [chan send, 5 minutes]:
github.com/weaveworks/weave/ipam.NewStatus(0xc00038a420, 0xc010000000, 0x4, 0xc0036c2848)
    /go/src/github.com/weaveworks/weave/ipam/status.go:51 +0xe7
main.HandleHTTP.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/weaveworks/weave/prog/weaver/http.go:314 +0x10f
main.HandleHTTP.func4.1(0x1c95360, 0xc00063e8c0, 0xc00bc42300)
    /go/src/github.com/weaveworks/weave/prog/weaver/http.go:359 +0x44
net/http.HandlerFunc.ServeHTTP(0xc000607bc0, 0x1c95360, 0xc00063e8c0, 0xc00bc42300)
    /usr/local/go/src/net/http/server.go:2007 +0x44
github.com/weaveworks/weave/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000225340, 0x1c95360, 0xc00063e8c0, 0xc00bc42300)
    /go/src/github.com/weaveworks/weave/vendor/github.com/gorilla/mux/mux.go:159 +0x104
github.com/weaveworks/weave/common.(*loggingHandler).ServeHTTP(0xc001a661f0, 0x1c95360, 0xc00063e8c0, 0xc000024900)
    /go/src/github.com/weaveworks/weave/common/http.go:13 +0x139
net/http.(*ServeMux).ServeHTTP(0x2a5ede0, 0x1c95360, 0xc00063e8c0, 0xc000024900)
    /usr/local/go/src/net/http/server.go:2387 +0x1bd
net/http.serverHandler.ServeHTTP(0xc001aaa000, 0x1c95360, 0xc00063e8c0, 0xc000024900)
    /usr/local/go/src/net/http/server.go:2802 +0xa4
net/http.(*conn).serve(0xc0026ac6e0, 0x1c9be60, 0xc006aeaf00)
    /usr/local/go/src/net/http/server.go:1890 +0x875
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2928 +0x384
```
2. Incoming messages get queued up in the TCP receive buffers, as the actor loop is not consuming them off the wire.
3. Stemming from (2), peers sending us gossip will eventually block on writing to the locked-up peer, and the same lock-up repeats in their own Weave peers.
4. In a short amount of time, this triggers a domino effect across the entire weave network, with many peers blocked on the IPAM subsystem.
We have verified with ss on the host that the TCP send buffers were indeed exhausted.
Suggested fix
Allow network writes to time out by setting write deadlines on the connection.
TCP congestion control is one reason a write can stall forever unless a timeout is set, but there are others. Right now the system is not only extremely vulnerable to these failures; they can also trigger a network-wide arrest.
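A minimal sketch of the proposed mitigation, assuming a plain net.Conn; the wrapper name and the 30-second timeout are illustrative choices of ours, not mesh's actual API. The idea is to bound every write with a deadline so a congested send buffer surfaces as a timeout error instead of an indefinite hang.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// deadlineWriter is a hypothetical wrapper (not mesh's actual sender)
// that bounds how long any single write may block.
type deadlineWriter struct {
	conn    net.Conn
	timeout time.Duration // e.g. 30 * time.Second; tuning is an open question
}

// Write sets a fresh write deadline before each call, so a full TCP send
// buffer produces a timeout error rather than a permanent hang.
func (w *deadlineWriter) Write(p []byte) (int, error) {
	if err := w.conn.SetWriteDeadline(time.Now().Add(w.timeout)); err != nil {
		return 0, err
	}
	n, err := w.conn.Write(p)

	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		// The peer is not draining its receive buffer; drop the
		// connection so the caller's actor loop can keep making progress.
		w.conn.Close()
		return n, fmt.Errorf("write timed out after %v: %w", w.timeout, err)
	}
	return n, err
}

func main() {
	// Illustrative only: dial a local listener and wrap the connection.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	w := &deadlineWriter{conn: conn, timeout: 30 * time.Second}
	if _, err := w.Write([]byte("gossip payload")); err != nil {
		fmt.Println("send failed:", err)
	}
}
```

In mesh itself, the natural place for such a deadline would presumably be around the sender visible in the stack trace above (lengthPrefixTCPSender.Send in protocol_crypto.go), so that a single congested peer cannot wedge the Allocator actor loop.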
Goroutine dump
See the attached weave.goroutines.log. You will also notice other goroutines waiting on the actor loop. The IPAM subsystem is essentially frozen.
@bboreham Looks like it! But this issue shows the nasty downstream effects that not setting write deadlines can have on the entire system :-( Bumping up the priority on this would be greatly appreciated; it would make weave more resilient to intermittent network issues.
A few minutes after system boot, the node blocks all TCP connections (and even UDP DNS queries) between pods in the cluster.
For example, the cluster IP of the CoreDNS service is 10.0.0.10. Usually, I can query an in-cluster DNS name by executing dig @10.0.0.10 some-service-name.default.svc.cluster.local on the host, and it returns valid IPs.
But if I enable BBR congestion control with the fq qdisc, the DNS query blocks and times out. TCP connections are blocked too, failing with "No route to host". However, ICMP (pinging a cluster IP) still returns valid responses.
If I reset the sysctl params to the following values and perform a system reboot, the issue disappears.