Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-32960 Fix epoll memory corruption issues #19283

Merged

Conversation

jakesmith
Copy link
Member

@jakesmith jakesmith commented Nov 8, 2024

Although the SelectItem 'items' array is protected by a CS, epoll_wait can receive pending events for SelectItems that have been deleted, which were then link counted and caused corruption.

Like the select handler version, avoid deleting the SelectItem's when items are removed, instead mark them deleted and let the triggerselect() approach remove them on the main thread.

Also fix a theoretical issue when adding a existing socket to CSocketEpollThread, it did not check other CSocketEpollThread's to see if it needed to be removed. (given default hdlPerThrd is UINT_MAX, it will not have happened).

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copy link

github-actions bot commented Nov 8, 2024

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32960

Jirabot Action Result:
Assigning user: [email protected]
Workflow Transition To: Merge Pending
Updated PR

@jakesmith
Copy link
Member Author

@ghalliday @mckellyln - please review.

NB: I've opened a related JIRA for refactoring beyond 9.8 : https://hpccsystems.atlassian.net/browse/HPCC-32970

My guess is, that we may have hit this issue pre 9.8 but more rarely, and that the 9.8 non-blocking changes has meant that the timing is different and it has exposed this race condition.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One initial comment. Will re-review now.

EXCLOG(e,"CSocketSelectThread::updateItems");
e->Release();
}
items.remove(n);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth keeping the logic from delSelItemPos - swap this with the last, and then remove the last item. May even be a useful method in the array class - since it saves moving all the following elements in memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is now, it's walking the array backwards, and isn't having to shift anything

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.
other minor comments:
selectvarschange should really be an atomic, since it is tested outside a critical section

// new handler thread
if (n >= hdlPerThrd)
return (0|rm);
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more: The previous logic could mean that you could avoid walking any further handlers once the socket had been added and removed. I don't know how commit it is to replace rather than add fresh.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, except, due to a bug where called, it would exit if only removed (in practice as noted in the PR comment, this would never actually happen since capacity is UINT_MAX).

But, yes, because of the bug, I missed the purpose of the logic, I'll restore and fix bug, i.e.:

                if ((addrm & (SOCK_ADDED | SOCK_REMOVED)) == (SOCK_ADDED | SOCK_REMOVED))

Copy link
Contributor

@mckellyln mckellyln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice find and fix.
Approved.
We should create a many threaded unit test that adds and removes and sends etc. rapidly.

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith looks good. A couple of very minor comments. Please squash.

addrm |= threads.item(i).add(sock,mode,nfy);
// if both added and removed, we're done
if (((addrm & (SOCK_ADDED | SOCK_REMOVED)) == (SOCK_ADDED | SOCK_REMOVED)) ||
(addrm == SOCK_FAILED))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly cleaner to say (addrm & SOCK_FAILED), but tracing through the code this will never currently be set with any other flags.
Maybe REMOVED could be set, then FAILED in weird situations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change to &

{
si.del = true;
rm = SOCK_REMOVED;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a break to avoiding looking at the later sockets?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, same is/was true of old select add. I will add there too, so consistent.

Although the SelectItem 'items' array is protected by a CS,
epoll_wait can receive pending events for SelectItems that have
been deleted, which were then link counted and caused corruption.

Like the select handler version, avoid deleting the SelectItem's
when items are removed, instead mark them deleted and let the
triggerselect() approach remove them on the main thread.

Also fix a theoretical issue when adding a existing socket to
CSocketEpollThread, it did not check other CSocketEpollThread's
to see if it needed to be removed. (given default hdlPerThrd is
UINT_MAX, it will not have happened).

Signed-off-by: Jake Smith <[email protected]>
@jakesmith jakesmith force-pushed the HPCC-32960-epoll-corruption branch from 3f76392 to 0ecb3e4 Compare November 11, 2024 15:51
@ghalliday ghalliday merged commit b16bd33 into hpcc-systems:candidate-9.8.x Nov 11, 2024
49 checks passed
Copy link

Jirabot Action Result:
Added fix version: 9.8.38
Workflow Transition: 'Resolve issue'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants