HPCC-32184 socket nonblocking improvements #18841

jakesmith · 2024-06-28T23:12:42Z

Tighten up semantics of jsocket and securesocket and their use
cases.
Specifically, this fixes issues with SSL traffic being handled
by select handlers, where previously it would tend to deadlock
reading because the underlying buffering nature of SSL meant that
either not all the data was available, or more than the amount read
was available, and no future notify event triggered more reading.

jsocket

ensure reads all available, return if >= min
clearup nonblocking toggle, only change if necessary
clarify write() semantics (always writes all or throw an exception)
add option to throw graceful close after reading min_size

jutil

add incidental hold() utility method

mp

remove slowclient handling, add trackPortProbe - track and trace port probes
use a dedicated selecthandler for connections, and refactor connect logic, to avoid blocking
maintain list of connecting clients, delete any that have stalled if timedout via a maintenance thread.
refactor main MP packet reader handling to be non-blocking

securesocket

ensure readtms semantics match jsocket
correct handling of ssl error states (SSL_ERROR_WANT_WRITE, SSL_ERROR_WANT_READ, SSL_ERROR_SYSCALL, SSL_ERROR_ZERO_RETURN), triggering wait_read or wait_write

dafilesrv

rework dafsserver notifySelected loop slightly, to correctly expect/deal with 0 data.
Default all accepted dafilesrv sockets to non-blocking

general

readtms - ensure consistenly waits for min_size.
tidy up looping code in various places (outside of jsocket) that relied on readtms 0 reading >=1 and looped instead of supplying min=max=len.
change calls that relied on read/readtms 0 reading >=1
add readtmsAllowClose for common case where code treats graceful close as 0 bytes.

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

github-actions · 2024-06-28T23:13:00Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32184

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR
Assigning user: [email protected]

jakesmith · 2024-06-28T23:13:30Z

@mckellyln - please take a look. Thanks.

mckellyln · 2024-07-08T15:38:22Z

I am seeing this often in the dafilesrv log from the hpcc4j client test (non-secure) - altho things seem to work ok -
00000217 USR 2024-07-08 11:20:42.882 3454725 3460504 "jsocket(7,2121) write err = 11 : A!:7601 -> 10.0.0.114:56762 (15)"
00000218 USR 2024-07-08 11:20:42.895 3454725 3460503 "jsocket(7,2121) write err = 11 : A!:7601 -> 10.0.0.114:56772 (16)"
00000219 USR 2024-07-08 11:20:42.911 3454725 3460505 "jsocket(7,2121) write err = 11 : A!:7601 -> 10.0.0.114:56786 (17)"
0000021A USR 2024-07-08 11:20:42.916 3454725 3460497 "ERROR: dafsserver.cpp(3043) : notifySelected(3) : socket not opened"
0000021B USR 2024-07-08 11:20:42.933 3454725 3460497 "ERROR: dafsserver.cpp(3043) : notifySelected(3) : socket not opened"

mckellyln · 2024-07-08T17:08:04Z

And 10000s of these in roxie log -
0001632D PRG 2024-07-08 13:06:25.586 3457437 3458184 "ERROR: assert(res == sizeof(UdpRequestToSendMsg)) failed - file: udptrr.cpp, line 1369"
0001632E PRG 2024-07-08 13:06:25.586 3457437 3458184 "UdpReceiver: receive_data::run read failed port=9001 - Exp: assert(res == sizeof(UdpRequestToSendMsg)) failed - file: udptrr.cpp, line 1369"
0001632F PRG 2024-07-08 13:06:26.587 3457437 3458184 "Backtrace:"
00016330 PRG 2024-07-08 13:06:26.589 3457437 3458184 " /opt/HPCCSystems/lib/libjlib.so(_Z16printStackReportx+0x5c) [0x7fcda301288c]"
00016331 PRG 2024-07-08 13:06:26.589 3457437 3458184 " /opt/HPCCSystems/lib/libjlib.so(_Z20raiseAssertExceptionPKcS0_j+0x9a) [0x7fcda301440a]"
00016332 PRG 2024-07-08 13:06:26.589 3457437 3458184 " /opt/HPCCSystems/lib/libudplib.so(_ZN15CReceiveManager12receive_data3runEv+0x488) [0x7fcda3c8c698]"
00016333 PRG 2024-07-08 13:06:26.589 3457437 3458184 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x35) [0x7fcda3147205]"
00016334 PRG 2024-07-08 13:06:26.589 3457437 3458184 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x28) [0x7fcda3146188]"

mckellyln · 2024-07-11T16:49:35Z

system/jlib/jsocket.cpp

+
+////
+void ScopedNonBlockingMode::init(ISocket *_socket)
+{


I think we should add here -

if (_socket->connectionless()) return;

I don't think it's incorrect at the moment.
It could make sense for UDP readtms/writetms to call and return blocked before it could read anything,
and not that we use it anywhere, if it was UDP+SSL, I suspect it would be necessary too.

Saying that, in our current use patterns, I don't think it's useful (we didn't need it before, so we don't need it now?), so I agree, to minimize the impact, we'll switch off the non-blocking mode for UDP for now.

I just realized - if we turn off non-block for UDP then we must put the wait_read() back into the beginning of readtms() for UDP - before the recvfrom.

Perhaps we could just change recv() and send() to use MSG_DONTWAIT flag ?
So we dont have issue with one thread in reading and one thread writing and getting non-blocking flag mixed up ?
(but still need to decide about UDP - if it gets wait_read() added back in or we also use MSG_DONTWAIT))

@jakesmith I think what remains for this PR is:
1). UDP - if we decide to leave as blocking - put wait_read() into jsocket readtms() / wait_write() into writetms().
2). set all TCP sockets as non-blocking after connect()/accept() or leave in readtms()/writetms(), but don't restore to blocking, just leave socket in non-blocking mode always.

mckellyln · 2024-07-11T17:46:49Z

system/jlib/jsocket.cpp

@@ -2094,8 +2138,10 @@ size32_t CSocket::writetms(void const* buf, size32_t minSize, size32_t size, uns
                    errclose();
                    err = JSOCKERR_broken_pipe;
                }
-                if ((err == JSE_WOULDBLOCK) && nonblocking)
+                if (err == JSE_WOULDBLOCK)


Do we also check if err == EAGAIN here ?

yes, I think should. I'll change it.

mckellyln · 2024-07-11T17:48:36Z

system/jlib/jsocket.cpp

@@ -1969,11 +1982,13 @@ void CSocket::readtms(void* buf, size32_t min_size, size32_t max_size, size32_t
                }
                else
                {
-                    if (nonblocking && (err == JSE_WOULDBLOCK || err == EAGAIN)) // if EGAIN or EWOULDBLOCK - no more data to read
+                    if (err == JSE_WOULDBLOCK || err == EAGAIN) // if EGAIN or EWOULDBLOCK - no more data to read


spelling in comment - EGAIN
Should we add a JSE_EAGAIN ?

yeah, I'll add JSE_EGAIN for completeness.

mckellyln · 2024-07-15T13:05:49Z

If a legacy client tries to connect to Dali (ie sending only 12 bytes) or perhaps some rogue client - how long would Dali/MP keep the connection open waiting for the remaining data ?
In my tests the connection seems to remain open for a very long time.

jakesmith · 2024-07-16T11:37:50Z

If a legacy client tries to connect to Dali (ie sending only 12 bytes) or perhaps some rogue client - how long would Dali/MP keep the connection open waiting for the remaining data ? In my tests the connection seems to remain open for a very long time.

It should be clearing up stalled connected sockets after ~90 seconds (maintenance thread runs every 10s)..
I'll check.

jakesmith · 2024-07-16T11:44:44Z

system/mp/mpcomm.cpp

+            }
+            bool closeIfTimedout(cycle_t now)
+            {
+                return false;


If a legacy client tries to connect to Dali (ie sending only 12 bytes) or perhaps some rogue client - how long would Dali/MP keep the connection open waiting for the remaining data ? In my tests the connection seems to remain open for a very long time.

@mckellyln - that would explain why it isn't closing them! :) [ I must have left in whilst testing at some point ]
Will remove!

mckellyln · 2024-07-16T14:01:48Z

system/mp/mpcomm.cpp

+
+                if (remaining)
+                {
+                    sock->readtms(activeptr, 0, remaining, szRead, WAIT_FOREVER);


WAIT_FOREVER here ?

it was made that, in "sympathy" to the way it was previously coded:

sock->read(&hdr,sizeof(hdr));

but we could rationalize at the same time as this PR too..
To be conservative, could make it a reasonably long time JIC e.g. 15 minutes - of course it should never take that long, but... don't want to change this too much at same time

I missed the main point here... in above reply.
It is readtms with min_size = 0, the timeout is basically irrelevant - it can't hit the timeout, because if blocked it will bail out.
There's a comment in CSocket::readtms:

* NB: timeout is meaningless if min_size is 0

mckellyln · 2024-07-19T11:50:30Z

system/mp/mpcomm.cpp

+    CMPServer *parent;
+    int mpSoMaxConn;
+    unsigned acceptThreadPoolSize = 0;
+    unsigned maxListenHandlerSockets = 60000; // what is a sensible default limit?


Just a comment, if you expect a large number of fds then when creating the selectHandler you can specify an argument to it for how many fds it handles before creating a new thread for more:
createSocketSelectHandler(traceName, fdsPerThread)

I think we should leave that for a separate discussion/change. It (MP) has always used the default (UINT_MAX) and hence 1 CSocketEpollThread. Actually looks like everywhere uses the default. May be worth revisiting to see if it helps when a lot, but separate issue.

This limit (maxListenHandlerSockets), is limiting the number of clients that are actively connecting at once - once they're actually connected they are processed by handleAcceptedSocket and added to the standard MP select handler.
Not sure what a sensible # is here, perhaps much lower, but it's here to catch a crazy burst of connections coming in (like a DoS).

mckellyln · 2024-07-22T14:35:36Z

fs/dafsserver/dafsserver.cpp

+                            msgWritePtr = (byte *)msg.reserveTruncate(sizeof(size32_t));
+                        size32_t szRead;
+                        if (readtmsCheckGC(sock, msgWritePtr, 0, sizeof(size32_t)-left, szRead, WAIT_FOREVER))
+                            THROWJSOCKEXCEPTION(JSOCKERR_graceful_close);


Is this different than a sock->read(msgWritePtr, 0, sizeof(size32_t)-left, szRead, WAIT_FOREVER, false); ?

no, it's the same. I'll change it.
i.e. if hit GC before then hit max then we haven't receive the packet size and we want to throw GC exception.

rpastrana · 2024-07-22T15:29:03Z

@jakesmith this is currently targeting master, but can/should it target an existing minor?
Also, it's my understanding this fixes a regression, is there a plan to revert the culprit (possibly HPCC-31474 introduced in 9.6.4) from 9.6.x as well?

jakesmith · 2024-07-22T17:07:49Z

@jakesmith this is currently targeting master, but can/should it target an existing minor? Also, it's my understanding this fixes a regression, is there a plan to revert the culprit (possibly HPCC-31474 introduced in 9.6.4) from 9.6.x as well?

unfortunately it's not a regression. It's a fundamental flaw in the way jsocket/securesocket and their clients (dafilesrv/MP) handle traffic, but are only really noticeable when SSL traffic is involved.

The intent was to target 9.8, this PR was created before 9.8 was branched I think, I will change the base.
On paper, this change is too big to go into 9.6

However, this issue fundamentally affects SSL handling, non-SSL handling although potentially problematic (blocking traffic from all clients in select handler) was not susceptible to the same SSL stall.
But, a separate regression fix was already fixed in 9.6 which cause problems for non-SSL: https://hpccsystems.atlassian.net/issues/HPCC-32174

@jpmcmu - has the non-ssl (BM) issues been tested in 9.6 since then?

jpmcmu · 2024-07-22T17:23:27Z

@jakesmith It looks like our automated testing on bare-metal isn't running into the timeout issue after 9.6.26. I will run some further testing but it looks like HPCC-32174 likely fixed the issue on bare-metal

jakesmith · 2024-07-22T17:34:46Z

@jakesmith It looks like our automated testing on bare-metal isn't running into the timeout issue after 9.6.26. I will run some further testing but it looks like HPCC-32174 likely fixed the issue on bare-metal

I've run:

java -cp dfsclient-9.6.23-0-SNAPSHOT-jar-with-dependencies-timeout.jar org.hpccsystems.dfs.client.ReadIssue -url http://localhost:8010 -cluster mythor -read_timeout 10000

FWIW I've run this against the current candidate-9.6.x on a BM system (with no SSL) and it passed.

jakesmith · 2024-07-23T11:11:20Z

I think this is close to being ready to merge. @ghalliday - please also review.

ghalliday

@jakesmith I only spotted a few minor typos/optimisations. The new code seems more logical.

ghalliday · 2024-07-23T15:12:42Z

system/jlib/jsocket.cpp

+// that could do with clearing up. Some of it is very similar to wait_write
+
+// As they current stand:
+// connect_timeout if successful, will always leave the socket in a non-bocking state


typo: blocking

ghalliday · 2024-07-23T15:18:01Z

system/jlib/jsocket.cpp

+
+            // loop around to read more, or detect blocked.
+        }
+        else // NB rc < 0, (if rc == 0 handeld already above)


typo: handled

ghalliday · 2024-07-23T15:29:14Z

system/mp/mpcomm.cpp

+        portProbeLastLog = get_cycles_now();
+    else
+    {
+        cycle_t nowCycles = get_cycles_now();


minor: Could move this outside the critical section, and use in line 461 as well.

will change.

ghalliday · 2024-07-23T15:30:58Z

system/mp/mpcomm.cpp

+    {
+        cycle_t nowCycles = get_cycles_now();
+        cycle_t cyclesSinceLastLog = nowCycles - portProbeLastLog;
+        if ((cyclesSinceLastLog >= (queryOneSecCycles()*60))) // log max every minute


cache (queryOneSecCycles()*60) into a variable if portProbeLastLog is 0 (since this is in a crtisec)

wouldn't have thought significant enough to worry about, but will change.

ghalliday · 2024-07-23T15:31:57Z

system/mp/mpcomm.cpp

+        {
+            portProbeLastLog = nowCycles;
+            unsigned __int64 totalProbes = portProbeCloseCounts[CloseType_graceful] + portProbeCloseCounts[CloseType_error] + portProbeCloseCounts[CloseType_timeout];
+            DBGLOG("Port probes: %" I64F "u [graceful=%" I64F "u (%" I64F "u ms),error=%" I64F "u (%" I64F "u ms),timedout=%" I64F "u (%" I64F "u ms). Last: %s, type=%s, time: %" I64F "u",


move this logging outside the critical section.

ghalliday · 2024-07-23T15:37:06Z

system/mp/mpcomm.cpp

+            }
+            const char *queryPeerHostText() const
+            {
+                if (0 == peerHostText.length())


Any potential problems with thread safety? Also function below?

I think very unlikely, but theoretically, and in practice, every [normal] connection calls queryPeerHostText(), so I am going to do in ctor instead.

ghalliday · 2024-07-23T15:39:59Z

system/mp/mpcomm.cpp

+                    if (it == handlers.end())
+                        break;
+                    CSocketHandler *socketHandler = *it;
+                    if (socketHandler->closeIfTimedout(get_cycles_now()))


call get_cycles_now() once before the critical section?

ghalliday · 2024-07-23T15:48:47Z

system/mp/mpcomm.cpp

+                handlers.emplace_back(socketHandler);
+                numHandlers = handlers.size();
+            }
+            if (0 == (numHandlers % 100)) // for info. log at each 100 boundary


Could this cause excessive logging if the number of handlers oscillates around a multiple of 100?

I suspect only if of major concern if under attack.. in which case the tracing may be useful sign it is(?)

ghalliday · 2024-07-23T15:51:09Z

system/mp/mpcomm.cpp

+            switch (exceptionCode)
+            {
+                case JSOCKERR_timeout_expired:
+                    trackPortProbe(get_cycles_now() - socketHandler.queryCreateTime(), socketHandler.queryPeerEndpointText(), CloseType_timeout);


trivial: If you passed in socketHandler.queryCreateTime() instead you could avoid a call to get_cycles_now.

will change

ghalliday · 2024-07-23T16:17:05Z

system/mp/mpcomm.cpp

+            sock->getPeerEndpoint(ep1);
+            PROGLOG("MP: Connect Thread: socket accepted from %s",ep1.getEndpointHostText(s).str());
+#endif
+            sock->set_keep_alive(true);


a new call. Worth a comment why?

it's not new really. It's moved and earlier.
It's now upfront/early, here after the accept, before adding to the new connection select handler.
It used to be in handleAcceptedSocket(), which was also logically after the accept() (though delayed until threadpool thread started)

jakesmith · 2024-07-23T17:45:10Z

@ghalliday - please see review changes in new commit.

ghalliday

Changes look good to me

mckellyln · 2024-07-24T15:18:45Z

running -
soapplus -url https://.:19876/roxie -stress 20 4 -delay 4 6 -d 1 -i echo6.xml -vx
gives me many of these -
00000000 USR 2024-07-24 11:16:41.164 600456 600460 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error
SSL_read error 5 - error:00000005:lib(0)::reason(5)"
00000001 USR 2024-07-24 11:16:41.164 600456 600459 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error
SSL_read error 5 - error:00000005:lib(0)::reason(5)"
00000002 USR 2024-07-24 11:16:41.164 600456 600458 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error
SSL_read error 5 - error:00000005:lib(0)::reason(5)"
00000003 USR 2024-07-24 11:16:41.165 600456 600465 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error
SSL_read error 5 - error:00000005:lib(0)::reason(5)"
00000004 USR 2024-07-24 11:16:41.165 600456 600457 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error
SSL_read error 5 - error:00000005:lib(0)::reason(5)"
I will look into the cause

mckellyln · 2024-07-24T15:27:56Z

A -
nc -v dali-IP 7070
seems to hang on forever ? Or at least longer than 90 seconds.

mckellyln

Just want to look into my my two issues posted above.

jakesmith · 2024-07-25T10:19:04Z

A - nc -v dali-IP 7070 seems to hang on forever ? Or at least longer than 90 seconds.

I've just tested this. Daserver does close the client socket. I stepped over the point (after 90 seconds) where it does this (in CConnectSelectHandler::close - mpcomm.cpp, line 747).
After it closes the socket, the socket states look like this:

$ netstat -anp | fgrep 7070
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:7070            0.0.0.0:*               LISTEN      42587/daserver
tcp        0      0 172.26.29.236:47940     172.26.29.236:7070      CLOSE_WAIT  44473/nc
tcp        0      0 172.26.29.236:7070      172.26.29.236:47940     FIN_WAIT2   -

looks like nc remains running despite the fact the socket was closed?

jakesmith · 2024-07-25T10:24:20Z

running - soapplus -url https://.:19876/roxie -stress 20 4 -delay 4 6 -d 1 -i echo6.xml -vx gives me many of these - 00000000 USR 2024-07-24 11:16:41.164 600456 600460 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000001 USR 2024-07-24 11:16:41.164 600456 600459 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000002 USR 2024-07-24 11:16:41.164 600456 600458 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000003 USR 2024-07-24 11:16:41.165 600456 600465 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000004 USR 2024-07-24 11:16:41.165 600456 600457 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" I will look into the cause

@mckellyln - can you attach echo6.xml ?

jakesmith · 2024-07-25T10:29:40Z

running - soapplus -url https://.:19876/roxie -stress 20 4 -delay 4 6 -d 1 -i echo6.xml -vx gives me many of these - 00000000 USR 2024-07-24 11:16:41.164 600456 600460 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000001 USR 2024-07-24 11:16:41.164 600456 600459 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000002 USR 2024-07-24 11:16:41.164 600456 600458 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000003 USR 2024-07-24 11:16:41.165 600456 600465 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" 00000004 USR 2024-07-24 11:16:41.165 600456 600457 "ERROR: 5: jthread.cpp(342) : unknown : Input/output error SSL_read error 5 - error:00000005:lib(0)::reason(5)" I will look into the cause

@mckellyln - can you attach echo6.xml ?

I guess this is a custom configuration, enabling port 19876, or one I'm not familiar with at least.
Can you share your environment.xml ?

mckellyln · 2024-07-25T12:32:06Z

I'll send the steps and info to you and also debug myself.

jakesmith · 2024-07-25T14:53:51Z

soapplus -ssl crashes in 9.6.x and 9.8.x (in CSecureSocket::wait_read).
It looks like it hasn't worked for a while in ssl mode. In the use context it's being used (in soapplus) it's creating a CSecureSocket based on a handle (not an ISocket) - it looks like various calls in CSecureSocket rely on there being a ISocket under it.. So I think it is generally flawed at the moment if creating a CSecureSocket from a handle.

I'll open a separate JIRA - but not directly related to this PR.

jakesmith · 2024-07-25T14:56:42Z

soapplus -ssl crashes in 9.6.x and 9.8.x (in CSecureSocket::wait_read). It looks like it hasn't worked for a while in ssl mode. In the use context it's being used (in soapplus) it's creating a CSecureSocket based on a handle (not an ISocket) - it looks like various calls in CSecureSocket rely on there being a ISocket under it.. So I think it is generally flawed at the moment if creating a CSecureSocket from a handle.

I'll open a separate JIRA - but not directly related to this PR.

NB: soapplus (http.cpp) is the only place in the system that calls createSecureSocket with a handle vs an ISocket

jakesmith · 2024-07-25T15:02:35Z

have checked 9.4, it doesn't hit the same issue - but more by luck than design (because happens to avoid calling wait_read)

mckellyln

I think it looks good, approved.

jakesmith · 2024-07-26T08:20:19Z

soapplus -ssl crashes in 9.6.x and 9.8.x (in CSecureSocket::wait_read). It looks like it hasn't worked for a while in ssl mode. In the use context it's being used (in soapplus) it's creating a CSecureSocket based on a handle (not an ISocket) - it looks like various calls in CSecureSocket rely on there being a ISocket under it.. So I think it is generally flawed at the moment if creating a CSecureSocket from a handle.
I'll open a separate JIRA - but not directly related to this PR.

NB: soapplus (http.cpp) is the only place in the system that calls createSecureSocket with a handle vs an ISocket

I have added 1 more more commit, that fixes securesocket when created with a socket handle only (vs ISocket).
There were various methods that relied on having an ISocket and would crash if called. Now it implicitly creates an ISocket in the createSecureSocket method that is passed a handle. NB: this only affects soapplus, which is also broken in other ways, which I have some pending fixes for (separate PR), but included this fix here, because it's a base library fix and seemed more appropriate here.

@ghalliday @mckellyln

Tighten up semantics of jsocket and securesocket and their use cases. Specifically, this fixes issues with SSL traffic being handled by select handlers, where previously it would tend to deadlock reading because the underlying buffering nature of SSL meant that either not all the data was available, or more than the amount read was available, and no future notify event triggered more reading. jsocket - ensure reads all available, return if >= min - clearup nonblocking toggle, only change if necessary - clarify write() semantics (always writes all or throw an exception) - add option to throw graceful close after reading min_size jutil - add incidental hold() utility method mp - remove slowclient handling, add trackPortProbe - track and trace port probes - use a dedicated selecthandler for connections, and refactor connect logic, to avoid blocking - maintain list of connecting clients, delete any that have stalled if timedout via a maintenance thread. - refactor main MP packet reader handling to be non-blocking securesocket - ensure readtms semantics match jsocket - correct handling of ssl error states (SSL_ERROR_WANT_WRITE, SSL_ERROR_WANT_READ, SSL_ERROR_SYSCALL, SSL_ERROR_ZERO_RETURN), triggering wait_read or wait_write - Fix securesocket with fd only issues. dafilesrv - rework dafsserver notifySelected loop slightly, to correctly expect/deal with 0 data. - Default all accepted dafilesrv sockets to non-blocking general - readtms - ensure consistenly waits for min_size. - tidy up looping code in various places (outside of jsocket) that relied on readtms 0 reading >=1 and looped instead of supplying min=max=len. - change calls that relied on read/readtms 0 reading >=1 - add readtmsAllowClose for common case where code treats graceful close as 0 bytes. Signed-off-by: Jake Smith <[email protected]>

rpastrana · 2024-07-26T13:34:32Z

@jpmcmu @drealeed FYI

jakesmith requested a review from mckellyln June 28, 2024 23:12

jakesmith force-pushed the HPCC-32184-nonblocking branch from 6e8a360 to dd72b2e Compare July 9, 2024 12:37

mckellyln reviewed Jul 11, 2024

View reviewed changes

jakesmith marked this pull request as ready for review July 12, 2024 10:34

jakesmith force-pushed the HPCC-32184-nonblocking branch from fe4d4f0 to 8b8566c Compare July 12, 2024 10:40

jakesmith commented Jul 16, 2024

View reviewed changes

mckellyln reviewed Jul 16, 2024

View reviewed changes

jakesmith force-pushed the HPCC-32184-nonblocking branch from 763ddb6 to 636e712 Compare July 19, 2024 09:17

mckellyln reviewed Jul 19, 2024

View reviewed changes

jakesmith requested a review from mckellyln July 19, 2024 14:31

mckellyln reviewed Jul 22, 2024

View reviewed changes

jakesmith changed the base branch from master to candidate-9.8.x July 23, 2024 09:18

jakesmith force-pushed the HPCC-32184-nonblocking branch from 8393010 to f96e1d9 Compare July 23, 2024 11:02

jakesmith requested a review from mckellyln July 23, 2024 11:02

jakesmith force-pushed the HPCC-32184-nonblocking branch from f96e1d9 to 6989e42 Compare July 23, 2024 11:09

jakesmith requested a review from ghalliday July 23, 2024 11:10

ghalliday reviewed Jul 23, 2024

View reviewed changes

jakesmith requested a review from ghalliday July 23, 2024 17:44

ghalliday approved these changes Jul 24, 2024

View reviewed changes

mckellyln reviewed Jul 25, 2024

View reviewed changes

jakesmith requested a review from mckellyln July 25, 2024 15:02

mckellyln approved these changes Jul 25, 2024

View reviewed changes

jakesmith force-pushed the HPCC-32184-nonblocking branch from 4f17ad5 to b406298 Compare July 26, 2024 08:16

jakesmith mentioned this pull request Jul 26, 2024

HPCC-32305 Soapplus fixes #18919

Merged

39 tasks

jakesmith force-pushed the HPCC-32184-nonblocking branch from b406298 to 6ca6595 Compare July 26, 2024 11:35

ghalliday merged commit 6cae8ca into hpcc-systems:candidate-9.8.x Jul 26, 2024
44 of 45 checks passed

HPCC-32184 socket nonblocking improvements #18841

HPCC-32184 socket nonblocking improvements #18841

Conversation

jakesmith commented Jun 28, 2024 • edited Loading

Type of change:

Checklist:

Smoketest:

Testing:

github-actions bot commented Jun 28, 2024

jakesmith commented Jun 28, 2024

mckellyln commented Jul 8, 2024

mckellyln commented Jul 8, 2024

mckellyln Jul 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mckellyln commented Jul 15, 2024 • edited Loading

jakesmith commented Jul 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mckellyln Jul 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rpastrana commented Jul 22, 2024

jakesmith commented Jul 22, 2024 • edited Loading

jpmcmu commented Jul 22, 2024

jakesmith commented Jul 22, 2024

jakesmith commented Jul 23, 2024

ghalliday left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jakesmith commented Jul 23, 2024

ghalliday left a comment

Choose a reason for hiding this comment

mckellyln commented Jul 24, 2024

mckellyln commented Jul 24, 2024 • edited Loading

mckellyln left a comment

Choose a reason for hiding this comment

jakesmith commented Jul 25, 2024

jakesmith commented Jul 25, 2024

jakesmith commented Jul 25, 2024

mckellyln commented Jul 25, 2024

jakesmith commented Jul 25, 2024

jakesmith commented Jul 25, 2024

jakesmith commented Jul 25, 2024

mckellyln left a comment

Choose a reason for hiding this comment

jakesmith commented Jul 26, 2024

rpastrana commented Jul 26, 2024

jakesmith commented Jun 28, 2024 •

edited

Loading

mckellyln Jul 11, 2024 •

edited

Loading

mckellyln commented Jul 15, 2024 •

edited

Loading

mckellyln Jul 19, 2024 •

edited

Loading

jakesmith commented Jul 22, 2024 •

edited

Loading

mckellyln commented Jul 24, 2024 •

edited

Loading