-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-32960 Fix epoll memory corruption issues #19283
HPCC-32960 Fix epoll memory corruption issues #19283
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32960 Jirabot Action Result: |
@ghalliday @mckellyln - please review. NB: I've opened a related JIRA for refactoring beyond 9.8 : https://hpccsystems.atlassian.net/browse/HPCC-32970 My guess is, that we may have hit this issue pre 9.8 but more rarely, and that the 9.8 non-blocking changes has meant that the timing is different and it has exposed this race condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One initial comment. Will re-review now.
EXCLOG(e,"CSocketSelectThread::updateItems"); | ||
e->Release(); | ||
} | ||
items.remove(n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth keeping the logic from delSelItemPos - swap this with the last, and then remove the last item. May even be a useful method in the array class - since it saves moving all the following elements in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is now, it's walking the array backwards, and isn't having to shift anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
other minor comments:
selectvarschange should really be an atomic, since it is tested outside a critical section
system/jlib/jsocket.cpp
Outdated
// new handler thread | ||
if (n >= hdlPerThrd) | ||
return (0|rm); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more: The previous logic could mean that you could avoid walking any further handlers once the socket had been added and removed. I don't know how commit it is to replace rather than add fresh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, except, due to a bug where called, it would exit if only removed (in practice as noted in the PR comment, this would never actually happen since capacity is UINT_MAX).
But, yes, because of the bug, I missed the purpose of the logic, I'll restore and fix bug, i.e.:
if ((addrm & (SOCK_ADDED | SOCK_REMOVED)) == (SOCK_ADDED | SOCK_REMOVED))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice find and fix.
Approved.
We should create a many threaded unit test that adds and removes and sends etc. rapidly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith looks good. A couple of very minor comments. Please squash.
system/jlib/jsocket.cpp
Outdated
addrm |= threads.item(i).add(sock,mode,nfy); | ||
// if both added and removed, we're done | ||
if (((addrm & (SOCK_ADDED | SOCK_REMOVED)) == (SOCK_ADDED | SOCK_REMOVED)) || | ||
(addrm == SOCK_FAILED)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly cleaner to say (addrm & SOCK_FAILED), but tracing through the code this will never currently be set with any other flags.
Maybe REMOVED could be set, then FAILED in weird situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change to &
{ | ||
si.del = true; | ||
rm = SOCK_REMOVED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a break to avoiding looking at the later sockets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, same is/was true of old select add. I will add there too, so consistent.
Although the SelectItem 'items' array is protected by a CS, epoll_wait can receive pending events for SelectItems that have been deleted, which were then link counted and caused corruption. Like the select handler version, avoid deleting the SelectItem's when items are removed, instead mark them deleted and let the triggerselect() approach remove them on the main thread. Also fix a theoretical issue when adding a existing socket to CSocketEpollThread, it did not check other CSocketEpollThread's to see if it needed to be removed. (given default hdlPerThrd is UINT_MAX, it will not have happened). Signed-off-by: Jake Smith <[email protected]>
3f76392
to
0ecb3e4
Compare
Jirabot Action Result: |
Although the SelectItem 'items' array is protected by a CS, epoll_wait can receive pending events for SelectItems that have been deleted, which were then link counted and caused corruption.
Like the select handler version, avoid deleting the SelectItem's when items are removed, instead mark them deleted and let the triggerselect() approach remove them on the main thread.
Also fix a theoretical issue when adding a existing socket to CSocketEpollThread, it did not check other CSocketEpollThread's to see if it needed to be removed. (given default hdlPerThrd is UINT_MAX, it will not have happened).
Type of change:
Checklist:
Smoketest:
Testing: