socket: support network IO deadline (#339) #348
base: master
Conversation
There are a few issues with this at this point.
Now having said that, we have another problem, which is that the handshake can be borked by a bad actor, and the library doesn't properly handle handshakes without blocking other working pipes. That is a separate bug, and needs to be addressed correctly -- this change is the wrong way to fix that problem. I've planned to fix that issue "correctly", I just haven't had cycles to get to it. (Other paying work and personal priorities have taken precedence.) I still plan to fix it properly.
Hi, thank you for responding. We are interested in moving forward with v2, but have faced several problems with TLS connections while still using v1, encountering several variations of bad TCP/TLS connections. We use a long-running req/rep server with lots of short-lived connections, and are concerned about stale ports accumulating.
Now that your TCP keepalive fixup is merged, this should address the biggest portion of the concern about stale ports. We do need to do something about the problem of blocking in the handshake, but this PR is not the way to do it.

The main design behind SP assumes mostly long-lived connections. I realize that is not what you're doing, but if you're using very short-lived TLS connections you're probably suffering elsewhere anyway (TLS is not cheap).

Having said all that, if an otherwise well behaved peer is misbehaving by going silent while doing the TCP keepalive handling (i.e. is never sending data, but just holding the port open), that's not good. But in order for the peer to do so, they have to be willing to also burn a TCP port. One for one burn. I don't think this should normally be a concern, therefore. (There are better and cheaper ways to DoS you, especially as this requires getting through the handshake.)

The handshake phase should have a timeout. I believe I have other code that does this -- I don't recall whether it is here, in the websocket code, in mangos v2, or in NNG, but I know that I "recently" (as in the past year or so) wrote code to time out stalled handshakes, to protect against a peer just opening a port and then walking away.

As you said, fixing this properly may imply more risk than we want to introduce into v1. At the same time, if you're on v1, it probably isn't too hard to vendor and keep your local changes until we can get the fix into v2 and until you're more comfortable with v2.
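For illustration, a minimal sketch of the kind of handshake timeout described above, assuming a plain Go `net.Conn`; `handshakeWithTimeout` and the `doHandshake` placeholder are invented names, not the library's actual API:

```go
package sketch // illustrative only, not the library's API

import (
	"net"
	"time"
)

// handshakeWithTimeout bounds the whole handshake with a connection
// deadline, so a peer that opens a port and then walks away cannot
// stall us forever. doHandshake stands in for the real protocol
// header exchange.
func handshakeWithTimeout(conn net.Conn, timeout time.Duration,
	doHandshake func(net.Conn) error) error {

	// Arm a deadline covering every handshake read and write.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	if err := doHandshake(conn); err != nil {
		conn.Close() // a bad or stalled handshake costs only this port
		return err
	}
	// Clear the deadline so normal pipe traffic can block as usual.
	return conn.SetDeadline(time.Time{})
}
```

Note the deadline is cleared once the handshake succeeds, so established pipes keep the library's normal blocking semantics.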
Yes, we do use short-lived req/rep transactions, which was the reason for our former interest in UDP. TCP keep-alive (by the way, I had forgotten to turn it on by default, as it is in v2) does help with the clean-up. As per your suggestion, we have vendored the above changes; this has been through testing and is confirmed to address the broken req/rep TLS connections we were seeing.
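As a reference point, enabling TCP keep-alive on a connection in Go looks roughly like the sketch below; the 30-second period is an illustrative value, not the library's default:

```go
package sketch // illustrative only

import (
	"net"
	"time"
)

// enableKeepAlive turns on OS-level TCP keep-alive probes so dead
// peers are detected even when no application data flows. For TLS,
// this must be applied to the underlying TCP conn before wrapping.
func enableKeepAlive(conn net.Conn) error {
	tc, ok := conn.(*net.TCPConn)
	if !ok {
		return nil // not a TCP connection; nothing to do
	}
	if err := tc.SetKeepAlive(true); err != nil {
		return err
	}
	return tc.SetKeepAlivePeriod(30 * time.Second)
}
```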
Thinking about this, a few points:

- I would like to clarify: this is just a wrapper around the underlying connection's deadline mechanism. It ensures that (read/write) IO operations fail with a TimeoutError instead of blocking indefinitely.
- Since the stateful SP sockets do IO during the initial handshake, do you mean putting an IO timeout on the handshake (as in this code), or additional code that achieves the same effect?
- To avoid confusion, I would not call them 'idle' timeouts. We just had a problem with gRPC, which has its own application-layer keep-alive, since (Azure) NAT black-holes TCP packets after a period of inactivity.
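A minimal sketch of such a wrapper, assuming a Go `net.Conn`; the `deadlineConn` name is invented here and is not the PR's actual type:

```go
package sketch // illustrative only

import (
	"net"
	"time"
)

// deadlineConn re-arms a deadline before every Read/Write so a blocked
// operation fails with a timeout error instead of hanging forever.
// A zero timeout disables the behavior.
type deadlineConn struct {
	net.Conn
	timeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if c.timeout > 0 {
		if err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(p)
}

func (c *deadlineConn) Write(p []byte) (int, error) {
	if c.timeout > 0 {
		if err := c.Conn.SetWriteDeadline(time.Now().Add(c.timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Write(p)
}
```

When the deadline expires, the blocked call returns an error satisfying `net.Error` with `Timeout() == true`, so callers can tell a stalled peer apart from a hard failure.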
This resolves #339 by supporting an I/O timeout on read/write for connection-oriented protocols. Setting a limit on the time allowed for network I/O lets these protocols recover from bad network conditions. As a special case, it allows clients and servers to recover from a bad connection during the initial TCP+TLS handshake.
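For context, setting such a timeout from application code might look like the sketch below. The option name "NET-IO-TIMEOUT" is hypothetical, invented for illustration, since the PR's actual option name isn't shown here; the imports follow the mangos v1 layout:

```go
package main

import (
	"time"

	"github.com/go-mangos/mangos/protocol/rep"
	"github.com/go-mangos/mangos/transport/tlstcp"
)

func main() {
	sock, err := rep.NewSocket()
	if err != nil {
		panic(err)
	}
	defer sock.Close()
	sock.AddTransport(tlstcp.NewTransport())

	// "NET-IO-TIMEOUT" is an invented option name standing in for
	// whatever this PR actually adds; it bounds each network
	// read/write, including the TCP+TLS handshake.
	if err := sock.SetOption("NET-IO-TIMEOUT", 30*time.Second); err != nil {
		panic(err)
	}
}
```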