Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-31130 Handle MP accept connection negotiations asynchronously #18273

Conversation

jakesmith
Copy link
Member

@jakesmith jakesmith commented Feb 2, 2024

And handoff slow/blocking connection negotiations to separate dedicated thread.

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jakesmith jakesmith requested a review from mckellyln February 2, 2024 15:04
@jakesmith
Copy link
Member Author

@mckellyln - please review.
It is worth noting, that the code in handleAcceptedSocket is an almost wholesale move of the code that used to be in CMPConnectThread::run()

@jakesmith jakesmith force-pushed the HPCC-31130-mp-accept-threadpool branch from 19304a9 to 06dfbfb Compare February 2, 2024 15:29
Copy link

github-actions bot commented Feb 2, 2024

https://track.hpccsystems.com/browse/HPCC-31130
This pull request is already registered

std::vector<Owned<ISocket>> slowClientsSocks;
CriticalSection crit;
Semaphore sem;
bool stopped = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should stopped be std::atomic ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, I'll change it.

{
constexpr unsigned timeoutMs = 5000;
Owned<ISocket> handledSock = sock.getClear();
if (owner.handleAcceptedSocket(handledSock.getLink(), timeoutMs, false))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why try for 5000 ms before handing off to async thread(s) ? Why not just always use async thread(s) ?
Maybe almost same thing if we change 5000 ms here to 100 ms or something very short ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does always use an async thread - this code is in in CMPConnectionThread::threadmain (a thread in the pool).

What it's doing here (within a thread of the pool), is handing this connection off, to outside the thread pool, to a separate 'slow' handler, so pool doesn't clog up.

@jakesmith jakesmith requested a review from mckellyln February 7, 2024 13:25
@jakesmith
Copy link
Member Author

@mckellyln - made 'stopped' an atomic, and see reply re. other question.

@mckellyln
Copy link
Contributor

I've done some testing and this PR looks good, but I can still get some bad behavior when a rogue client connects to Thor and sends in various amounts of garbage data. I have created https://track.hpccsystems.com/browse/HPCC-31251 for additional work on this.

@jakesmith jakesmith force-pushed the HPCC-31130-mp-accept-threadpool branch from d920488 to 873ebe1 Compare February 8, 2024 16:43
Copy link
Contributor

@mckellyln mckellyln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve this PR.

@ghalliday
Copy link
Member

@jakesmith please can you add a summary of the solution to the jira.

@jakesmith
Copy link
Member Author

@jakesmith please can you add a summary of the solution to the jira.

Added. Also noticed a problem with the way the optional config. option was being picked up in k8s. Addressed in new commit.
@ghalliday

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith A couple of problems and a few minor other comments.

system/mp/mpcomm.cpp Show resolved Hide resolved
continue;
}

Linked<ISocket> sock = slowClientsSocks.back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be protected within the critical section - otherwise it could get an invalid item. Theoretically the empty() call needs to be as well - on a chip with a very weak memory model..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will put both in crit block.

{
CriticalBlock b(crit);
for (auto &sock : slowClientsSocks)
sock->close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to call close() on a socket that is currently being listened to on another thread? (If stop is called while waiting for a slow response). From memory it can be complicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will potentially currently being read from (CSocket::readtms) at this point, and on a ::poll call.

I think socket::close() will cause the poll to quit (I did test this scenario).
I thought it was a used technique elsewhere.

@mckellyln - what are your thoughts on this/interrupting the read?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading up, I think it is unsafe, also, the handling of getting slowClientsSocks.back() in crit, then later calling pop_back() is unsafe is new items are being added.

I am going to remove the socket from the slowClientsSocks before calling handleAcceptedSocket.
It will mean it can't be interrupted and will timeout, but that's not really a problem, just a theoretical slightly longer delay during shutdown if happens to be on a stalling socket.

void add(ISocket *sock) // NB: takes ownership
{
if (stopped)
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will leak sock.
I'm not sure what is the cleanest pattern to implement this. Possibly pass an Owned or link the socket....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to assume it will get to the emplace_back once past this point (and thus own).
So a specific sock->Release here if stopped is a reasonable resolution.

running = false;
listensock->cancel_accept();
if (listen)
{
if (!threadPool->joinAll(true, 1000*60*5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth commenting that stop the listening threads first in case any get added to the stop processor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move the join for this (CMPConnectThread) accept thread ahead of this, and add a comment.

@@ -2178,194 +2265,272 @@ void CMPConnectThread::startPort(unsigned short port)
if (!listen)
return;
running = true;
if (acceptThreadPoolSize)
{
class CFactory : public CInterfaceOf<IThreadFactory>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CFactory: The name could be more descriptive - what is it a factory for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rename to CMPConnectThreadFactory.
I do tend to use generic names if inline classes are used and therefore implicitly relevant to the surrounding code.

system/mp/mpcomm.cpp Show resolved Hide resolved
running = false;
listensock->cancel_accept();
if (listen)
{
if (!threadPool->joinAll(true, 1000*60*5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

threadPool could be null if the pool size is set to 0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will add a check.

Owned<ISocket> handledSock = sock.getClear();
if (owner.handleAcceptedSocket(handledSock.getLink(), timeoutMs, false))
{
// handoff to slowClientProcessor, which will retry for standard CONFIRM_TIMEOUT period owner.slowClientProcessor.add(sock.getClear());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious (almost) copy of the next line on the end of the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, couldn't even spot it in vscode at first!
Will remove.

{
constexpr unsigned timeoutMs = 5000;
Owned<ISocket> handledSock = sock.getClear();
if (owner.handleAcceptedSocket(handledSock.getLink(), timeoutMs, false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not worth changing, but it is counter-intuitive that a function called handleAcceptedSocket() returns true if it fails to handle it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, and pondered this, but really true|false is insufficient. As it stands returning false means it did not timeout, but it doesn't mean it succeeded. Various failure conditions also return false.
So really it should return a state, but I only care about 1 of them (timeout).

@jakesmith jakesmith requested a review from ghalliday February 9, 2024 13:51
@jakesmith
Copy link
Member Author

@ghalliday - please see new commit.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith please squash

And handoff slow/blocking connection negotiations to separate
dedicated thread.

Signed-off-by: Jake Smith <[email protected]>
@jakesmith jakesmith force-pushed the HPCC-31130-mp-accept-threadpool branch from 19ae797 to 6c13c81 Compare February 9, 2024 15:46
@jakesmith
Copy link
Member Author

@jakesmith please squash

@ghalliday - done

@ghalliday ghalliday merged commit abda0ef into hpcc-systems:candidate-9.4.x Feb 9, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants