socket: support network IO deadline (#339) #348
base: master
Conversation
There are a few issues with this at this point.
Now having said that, we have another problem, which is that the handshake can be borked by a bad actor, and the library doesn't properly handle handshakes without blocking other working pipes. That is a separate bug, and needs to be addressed correctly -- this change is the wrong way to fix that problem. I've planned to fix that issue "correctly", I just haven't had cycles to get to it. (Other paying work and personal priorities have taken precedence.) I still plan to fix it properly.
Hi, thank you for responding. We are interested in moving forward with v2, but have faced several problems with TLS connections while still using v1, encountering several variations of bad TCP/TLS connections. We use a long-running req/rep server with lots of short-lived connections, and are concerned about stale ports accumulating.
Now that your TCP keepalive fixup is merged, this should address the biggest portion of the concern about stale ports. We do need to do something about the problem of blocking in the handshake, but this PR is not the way to do it.

The main design behind SP assumes mostly long-lived connections. I realize that is not what you're doing, but if you're using very short-lived TLS connections you're probably suffering elsewhere anyway (TLS is not cheap).

Having said all that, if an otherwise well behaved peer is misbehaving by going silent while doing the TCP keepalive handling (i.e. is never sending data, but just holding the port open), that's not good. But in order for the peer to do so, they have to be willing to also burn a TCP port. One for one burn. I don't think this should normally be a concern, therefore. (There are better and cheaper ways to DoS you, especially as this requires getting through the handshake.)

The handshake phase should have a timeout. I believe I have other code that does this -- I don't recall whether it is here, in the websocket code, in mangos v2, or in NNG, but I know that I "recently" (as in the past year or so) wrote code to time out stalled handshakes, to protect against a peer just opening a port and then walking away.

As you said, fixing this properly may imply more risk than we want to introduce into v1. At the same time, if you're on v1, it probably isn't too hard to vendor and keep your local changes until we can get the fix into v2 and until you're more comfortable with v2.
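For illustration, a minimal sketch of the kind of handshake timeout described above, assuming a plain Go `net.Conn`; `handshakeWithTimeout` and the `doHandshake` placeholder are invented names, not the library's actual API:

```go
package sketch // illustrative only, not the library's API

import (
	"net"
	"time"
)

// handshakeWithTimeout bounds the whole handshake with a connection
// deadline, so a peer that opens a port and then walks away cannot
// stall us forever. doHandshake stands in for the real protocol
// header exchange.
func handshakeWithTimeout(conn net.Conn, timeout time.Duration,
	doHandshake func(net.Conn) error) error {

	// Arm a deadline covering every handshake read and write.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	if err := doHandshake(conn); err != nil {
		conn.Close() // a bad or stalled handshake costs only this port
		return err
	}
	// Clear the deadline so normal pipe traffic can block as usual.
	return conn.SetDeadline(time.Time{})
}
```

Note the deadline is cleared once the handshake succeeds, so established pipes keep the library's normal blocking semantics.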
Yes, we do use short-lived req/rep transactions, which was the reason for our former interest in UDP. TCP keep-alive (by the way, I had forgotten to turn it on by default, as it is in v2) does help with the clean-up. As per your suggestion, we have vendored the above changes; this has been through testing and is confirmed to address the broken req/rep TLS connections we were seeing.
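As a reference point, enabling TCP keep-alive on a connection in Go looks roughly like the sketch below; the 30-second period is an illustrative value, not the library's default:

```go
package sketch // illustrative only

import (
	"net"
	"time"
)

// enableKeepAlive turns on OS-level TCP keep-alive probes so dead
// peers are detected even when no application data flows. For TLS,
// this must be applied to the underlying TCP conn before wrapping.
func enableKeepAlive(conn net.Conn) error {
	tc, ok := conn.(*net.TCPConn)
	if !ok {
		return nil // not a TCP connection; nothing to do
	}
	if err := tc.SetKeepAlive(true); err != nil {
		return err
	}
	return tc.SetKeepAlivePeriod(30 * time.Second)
}
```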
Thinking about this, a few points:

- I would like to clarify: this is just a wrapper around the underlying connection's deadline mechanism. It ensures that (read/write) IO operations fail with a TimeoutError instead of blocking indefinitely.
- Since the stateful SP sockets do IO during the initial handshake, do you mean putting an IO timeout on the handshake (as in this code), or additional code that achieves the same effect?
- To avoid confusion, I would not call them 'idle' timeouts. We just had a problem with gRPC, which has its own application-layer keep-alive, since (Azure) NAT black-holes TCP packets after a period of inactivity.
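A minimal sketch of such a wrapper, assuming a Go `net.Conn`; the `deadlineConn` name is invented here and is not the PR's actual type:

```go
package sketch // illustrative only

import (
	"net"
	"time"
)

// deadlineConn re-arms a deadline before every Read/Write so a blocked
// operation fails with a timeout error instead of hanging forever.
// A zero timeout disables the behavior.
type deadlineConn struct {
	net.Conn
	timeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if c.timeout > 0 {
		if err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(p)
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if c.timeout > 0 {
		if err := c.Conn.SetWriteDeadline(time.Now().Add(c.timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Write(p)
}
```

When the deadline expires, the blocked call returns an error satisfying `net.Error` with `Timeout() == true`, so callers can tell a stalled peer apart from a hard failure.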
This resolves #339 by supporting an I/O timeout on read/write for connection-oriented protocols. Setting a limit on the time allowed for network I/O lets these protocols recover from bad network conditions. As a special case, it allows clients and servers to recover from a bad connection during the initial TCP+TLS handshake.
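For context, setting such a timeout from application code might look like the sketch below. The option name "NET-IO-TIMEOUT" is hypothetical, invented for illustration, since the PR's actual option name isn't shown here; the imports follow the mangos v1 layout:

```go
package main

import (
	"time"

	"github.com/go-mangos/mangos/protocol/rep"
	"github.com/go-mangos/mangos/transport/tlstcp"
)

func main() {
	sock, err := rep.NewSocket()
	if err != nil {
		panic(err)
	}
	defer sock.Close()
	sock.AddTransport(tlstcp.NewTransport())

	// "NET-IO-TIMEOUT" is an invented option name standing in for
	// whatever this PR actually adds; it bounds each network
	// read/write, including the TCP+TLS handshake.
	if err := sock.SetOption("NET-IO-TIMEOUT", 30*time.Second); err != nil {
		panic(err)
	}
}
```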