Rejected Unix domain socket connections under load #3136
@nicktrav first let me say that UnixSocket support in Jetty is at best experimental, at worst it is neglected and close to being ejected. We stopped working on the unix socket connector when JNR was not being maintained and our PRs to them were not being accepted. So it is neglected and we've never seen good results with it. However, having said that, we see that JNR is now getting some activity again, and if there is a user that wishes to use UnixDomain, then we will see what we can do.

Currently we hard-wire the UnixDomain socket to only do async accepts - hence it will be using poll. Our other connectors have the option of using one or more threads doing blocking accepts, which may offer a better solution for you, but I'm not sure why we don't offer that in the connector constructor. We also have some strangeness in setting a low thread priority for accept - again, no idea why we do that. So let me make a branch with some of these things fixed, and if you could test against it, that would be helpful (we will still have minimal time on this for a while).

Also, have you been able to measure performance other than the connection accept rate? Does it give you any significant improvement over the loopback connector?
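For comparison, this is roughly how acceptor threads are chosen on Jetty's standard TCP `ServerConnector` - the "other connectors" mentioned above. This is only a sketch against the Jetty 9.4 API; the acceptor/selector defaults and the exact behaviour of `acceptors == 0` should be checked against the version in use:

```java
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class LoopbackExample
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        // ServerConnector lets you ask for dedicated blocking-accept threads:
        // acceptors > 0 means that many threads sit in accept(), while (as I
        // understand it) acceptors == 0 falls back to selector-driven accepts
        // and -1 lets Jetty pick a default based on core count.
        int acceptors = Runtime.getRuntime().availableProcessors();
        int selectors = -1;

        ServerConnector connector =
            new ServerConnector(server, acceptors, selectors, new HttpConnectionFactory());
        connector.setHost("127.0.0.1");
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
        server.join();
    }
}
```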
So in branch https://github.com/eclipse/jetty.project/tree/jetty-9.4.x-3136-unixsocket-accept I have added a constructor to allow the number of acceptor threads to be passed in. This indicates what the problem is with that... the JNR blocking accept is not interruptible, so you can't do a gentle shutdown of the server (and hence the test hangs forever if acceptors is set to >0). I think this is jnr/jnr-unixsocket#52... oh actually this is jnr/jnr-unixsocket#21, raised by me with PR jnr/jnr-unixsocket#25 to fix it... but as the comments say, this was left to rot for too long. I will freshen that up.

Anyway, perhaps you can test with my branch and allocate at least as many acceptor threads as you have cores to see if that helps the bottleneck. You will have to Ctrl-C your server when you are done. Going epoll would probably be better, but as you say that is a JNR feature request.
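If the branch behaves as described, testing it might look something like the sketch below. The constructor taking an acceptor count is specific to that branch and its exact signature is my assumption; `setUnixSocket` and the general shape of `UnixSocketConnector` come from the existing jetty-unixsocket module:

```java
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.unixsocket.UnixSocketConnector;

public class UnixSocketAcceptorsTest
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        // Hypothetical constructor from the jetty-9.4.x-3136-unixsocket-accept
        // branch: one blocking-accept thread per core, as suggested above.
        int acceptors = Runtime.getRuntime().availableProcessors();
        UnixSocketConnector connector =
            new UnixSocketConnector(server, acceptors, new HttpConnectionFactory());

        connector.setUnixSocket("/tmp/jetty-test.sock"); // illustrative path
        server.addConnector(connector);

        server.start();
        // With blocking JNR accepts the server may not stop cleanly;
        // as noted above, Ctrl-C may be required.
        server.join();
    }
}
```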
Thanks for the follow-up, @gregw. Apologies for the delay. I intend to try out your patch when I get a chance, sometime later this week. I'll keep you posted.

Good to know that the domain socket stuff isn't well supported. We were definitely planning on making good use of it, so we'd be very interested in seeing it stick around in Jetty. I think our preference would be to use the domain socket connector, as our security model now favors using the filesystem for ACLs on the socket, rather than something like TLS wrapping a TCP socket on the loopback device, which is the world we are migrating from (and which works perfectly, fwiw). Maybe I'm misunderstanding, though...

I'd be interested in hearing your thoughts on what it would take to get epoll support into the domain socket acceptor. Is that something that you'd be able to advocate for to JNR, as a user of their library? I'm not aware of any alternatives.
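For concreteness, the filesystem-ACL idea amounts to something like the sketch below: tightening the mode of the socket file so only intended peers can `connect()` (which requires write permission on the socket). The path is made up for illustration, and this assumes tightening permissions after the server has created the socket is acceptable for the deployment:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class SocketPermissions
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical socket path; in practice this is whatever
        // UnixSocketConnector.setUnixSocket(...) was given.
        Path socket = Paths.get("/run/myservice/jetty.sock");

        // Restrict the socket so only the owner and group (e.g. the Envoy
        // sidecar's group) can connect.
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rw-rw----");
        Files.setPosixFilePermissions(socket, perms);
    }
}
```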
@nicktrav I think to get an epoll implementation into https://github.com/jnr/jnr-unixsocket is probably going to need either somebody contributing one or a commercial sponsor to pay for it. JNR do accept PRs, but historically have been a bit slow. Jetty has no commercial sponsors that are needing UDS, but a few have experimented.

Interesting that your security model favours UDS over loopback? I would have thought that anything that can compromise loopback would pretty much own your server anyway... but then I'm not up to date with all such unix attack vectors. If it truly is a more secure way, then perhaps we will get more interest and thus it will climb up our priorities.

Tell us how your testing goes, as feedback from an active user is also a good way to elevate priority!
I tried the patch and unfortunately it didn't make any difference. I think this is to be expected: no matter how many threads I threw at the problem, with an accept queue of zero I saw failures under bursty load.

We did some more digging and noticed that Jetty makes the `listen(2)` syscall with the configured accept queue length as the backlog. I realize now that my initial comment about the 1024 default was incorrect. We're not overriding the default, so the syscall is made with 0. And as a side note, if you override the accept queue length with something higher than `net.core.somaxconn`, the kernel caps it.

We might see if we can get epoll support into the JNR library, or at least request it as a feature. There's a Kqueue implementation there already, so there's prior art for non-`poll` selectors.

I'm not sure if there's much more worth doing here, so feel free to close out this issue. That said, the request for ongoing Unix domain socket support in Jetty still stands. Thanks!
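A small sanity check along these lines, sketched under the assumption that the host is Linux and that the configured accept queue value is known (the `8192` here is just an example): compare the connector's accept queue with `net.core.somaxconn`, since the backlog passed to `listen(2)` is silently capped at that value.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class BacklogCheck
{
    public static void main(String[] args) throws Exception
    {
        int configuredAcceptQueue = 8192; // whatever the connector is configured with

        // On Linux the backlog passed to listen(2) is silently capped at
        // net.core.somaxconn, so a large connector value may not take effect.
        int somaxconn = Integer.parseInt(
            Files.readAllLines(Paths.get("/proc/sys/net/core/somaxconn")).get(0).trim());

        if (configuredAcceptQueue > somaxconn)
            System.out.printf("accept queue %d exceeds somaxconn %d; the kernel will cap it%n",
                configuredAcceptQueue, somaxconn);
    }
}
```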
This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.
We recently started migrating our Jetty services over to expose only Unix domain sockets, and we're fronting these servers with Envoy, which is terminating TLS and proxying requests locally over the domain socket.
We're running into some pretty severe performance regressions under moderate amounts of load (a few hundred QPS) in scenarios of bursty load. The connections that Envoy is establishing to the domain socket appear to not be accepted off the domain socket fast enough by Jetty. The socket accept queue overflows, and Envoy sees `EAGAIN` when calling `connect()`. These failed connects show up as 500s to clients.

As a workaround, we're setting the accept queue length much higher than the default of 1024, and this immediately fixes our issues. This seems to have been a fix mentioned in #1281.
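The workaround looks roughly like the sketch below, assuming `UnixSocketConnector` exposes the same `setAcceptQueueSize` knob (the listen backlog) as `ServerConnector`; the socket path and the value are illustrative:

```java
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.unixsocket.UnixSocketConnector;

public class AcceptQueueWorkaround
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        UnixSocketConnector connector =
            new UnixSocketConnector(server, new HttpConnectionFactory());
        connector.setUnixSocket("/tmp/jetty-test.sock"); // illustrative path

        // Workaround: raise the accept queue (listen backlog) well above the
        // burst size so connections queue instead of being rejected.
        connector.setAcceptQueueSize(8192);

        server.addConnector(connector);
        server.start();
        server.join();
    }
}
```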
I mention this workaround as we think that ideally Jetty would be able to handle this kind of load. (As a comparison, we have some services that use the Go HTTP stack, and these have no issues accepting the load being fired at them. Under the hood, the Go stack seems to make use of epoll.)
Jetty appears to be making use of Java NIO to poll the socket to determine when to call `accept()`. Specifically, `ManagedSelector#onSelected` calls `SelectorManager#doAccept`, which then calls (in the case of a domain socket) `UnixServerSocketChannel#accept` to make the syscall.

My immediate question would be: accepting new connections off of the domain socket seems to have a bottleneck ... somewhere. Is this something that could be improved? Maybe there's a better ExecutionStrategy that could be employed? (It looks like the current implementation uses Eat What You Kill.) Maybe epoll could be used to do the selection, rather than `jnr.enxio.channels.PollSelector` (which just uses `poll(2)`), though this looks like a JNR feature request rather than a Jetty one.

The workaround is OK, but the fact that this is fine in other frameworks/languages with the same accept queue length (of 1024) makes me think this is an issue with Jetty.
I've put together a reproducer, with instructions here, which demonstrates the issues we're seeing. I tested this on a Debian Jessie VM (GCP), but we're also seeing this on our production environment hosts, which run a mixture of CentOS 6/7.
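For anyone who wants the gist without pulling the full setup, the kind of client-side burst that triggers the failures looks roughly like the sketch below, using jnr-unixsocket directly. The path and counts are illustrative, and the real reproducer drives the traffic through Envoy rather than connecting to the socket itself:

```java
import jnr.unixsocket.UnixSocketAddress;
import jnr.unixsocket.UnixSocketChannel;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ConnectBurst
{
    public static void main(String[] args) throws Exception
    {
        UnixSocketAddress address = new UnixSocketAddress(new File("/tmp/jetty-test.sock"));

        List<UnixSocketChannel> open = new ArrayList<>();
        int failures = 0;

        // Open many connections as fast as possible without closing them,
        // so pending connects pile up on the server's accept queue.
        for (int i = 0; i < 2000; i++)
        {
            try
            {
                open.add(UnixSocketChannel.open(address));
            }
            catch (Exception e)
            {
                failures++; // connect rejected, e.g. once the backlog is full
            }
        }

        System.out.println("failed connects: " + failures);

        for (UnixSocketChannel channel : open)
            channel.close();
    }
}
```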
Jetty version:
OS Version:
Thanks!