Add a new nni_list_node named a_finish_node in struct nng_aio #1695
Conversation
Assume that sub0_ctx_cancel and sub0_recv_cb execute concurrently. In the window after sub0_recv_cb drops the lock, but before the aio has been removed from the finish queue, nni_list_active in sub0_ctx_cancel may mistakenly conclude that the aio is still on recv_queue, triggering a second nni_list_remove. That second removal dereferences a null pointer. The fix is to drop the lock only after the aio has been removed from the finish list.
Fixes #1682
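For illustration, the pitfall reduces to a toy model like the one below (hypothetical names, not NNG's actual code): an "active" test that inspects only a node's link pointers cannot tell which list the item is on, so an aio parked on a local finish list still looks active to a check against recv_queue.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy doubly linked list, mirroring the shape of nni_list/nni_list_node. */
typedef struct node {
	struct node *next;
	struct node *prev;
} node;

typedef struct {
	node head;
} list;

static void
list_init(list *l)
{
	l->head.next = l->head.prev = &l->head;
}

static void
list_append(list *l, node *n)
{
	n->prev       = l->head.prev;
	n->next       = &l->head;
	n->prev->next = n;
	n->next->prev = n;
}

static void
list_remove(node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Like nni_list_active: it only sees the link pointers, so it cannot
 * tell WHICH list the node is on. */
static int
node_active(node *n)
{
	return (n->next != NULL);
}

int
main(void)
{
	list recv_queue, finish_queue;
	node aio = { NULL, NULL };

	list_init(&recv_queue);
	list_init(&finish_queue);
	list_append(&recv_queue, &aio);

	/* sub0_recv_cb moves the aio to a local finish list ... */
	list_remove(&aio);
	list_append(&finish_queue, &aio);

	/* ... and a concurrent canceler asking "still on recv_queue?"
	 * gets a false positive, leading to a second list_remove. */
	printf("looks active: %d\n", node_active(&aio)); /* prints 1 */
	return (0);
}
```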
I don't think this will work in all circumstances. In particular, I'm worried about a deadlock when the callback on the aio tries to resubmit the aio for receive. This is actually a fairly common idiom.
I think in that case, you'd wind up with an incorrect attempt to reenter the socket lock, resulting in a deadlock.
The fact that you've located the root cause (nni_list_active misreporting) is super useful. I'm going to think about this, because clearly that test is busted in this case -- your excellent sleuthing suggests several other solutions might be possible.
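The resubmit idiom in question looks roughly like this on the application side (a sketch against the public nng API; my_state and my_cb are illustrative names). If the library ran this callback while still holding its internal lock, the nested receive would try to retake that lock:

```c
#include <nng/nng.h>

struct my_state {
	nng_ctx  ctx;
	nng_aio *aio;
};

/* Completion callback that immediately re-arms the same aio. */
static void
my_cb(void *arg)
{
	struct my_state *s = arg;

	if (nng_aio_result(s->aio) == 0) {
		nng_msg *m = nng_aio_get_msg(s->aio);
		/* ... consume the message ... */
		nng_msg_free(m);
	}
	nng_ctx_recv(s->ctx, s->aio); /* reenters protocol code */
}
```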
Accidentally hit approve, changing to request changes.
I agree with you: the unlock needs to happen before nni_aio_finish_sync.
1. Add a new finish list node to struct nng_aio. 2. Use a_finish_node in sub0_recv_cb to hold finished aios, keeping them separate from the sub0_ctx recv_queue.
I have updated my PR.
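The shape of that change is roughly the following (a sketch only; the real struct nng_aio in src/core/aio.c has many more fields, and everything except a_finish_node follows the existing source as best I can tell):

```c
typedef struct nni_list_node {
	struct nni_list_node *ln_next;
	struct nni_list_node *ln_prev;
} nni_list_node;

struct nng_aio {
	/* ... many fields elided ... */
	nni_list_node a_prov_node;   /* provider queue links, e.g. the
	                                context's recv_queue */
	nni_list_node a_finish_node; /* proposed: separate links for the
	                                finish list built in sub0_recv_cb,
	                                so the nni_list_active check in
	                                sub0_ctx_cancel stays accurate */
};
```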
I need time to assess this properly -- sorry, I've been busy. My two areas of particular concern, which I need to satisfy myself about, are the deadlock question discussed above and the cost of growing the aio.
In particular, increasing the size of the aio will have a deleterious effect for some uses. There are situations where people have huge numbers of aios and have complained about how big they already are. This change increases the size by up to 16 bytes. It might be possible to make a much smaller increase, or even no increase at all, by using a flag. Still, I think your work here is good, and I might wind up merging it all, but I want time to consider these issues first.
Anticipating a prompt release.
Sorry, I've been a bit busy. I've been thinking about this problem though -- and one idea I have is to use the existing node as a "singly linked" node, in a way that nni_list_node_active won't think is active. We can do this because we have no requirement for random access (or even random removal) on the completion queue, so we only need one of the two list pointers.
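In toy form, the idea looks like this (hypothetical names, not the merged code): thread completed aios through a single pointer, so the doubly linked node that an nni_list_active-style check inspects stays NULL.

```c
#include <stddef.h>

/* Toy model of the trick (hypothetical field names). */
typedef struct aio {
	struct aio *dq_next; /* doubly linked queue links; both NULL  */
	struct aio *dq_prev; /* while off-queue, so "active" is false */
	struct aio *sq_next; /* reused single link for completions    */
	int         result;
} aio;

/* The completion "queue" is a LIFO stack threaded through one pointer.
 * No prev pointer is ever set, so a parked aio still reports inactive
 * to an nni_list_active-style check. */
static void
completions_add(aio **head, aio *a, int result)
{
	a->result  = result;
	a->sq_next = *head;
	*head      = a;
}

static void
completions_run(aio **head, void (*finish)(aio *))
{
	aio *a;

	while ((a = *head) != NULL) {
		*head      = a->sq_next;
		a->sq_next = NULL;
		finish(a); /* called with no locks held; resubmitting
		              from the callback is safe */
	}
}
```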
Credit goes to Wu Xuan (@willwu1217) for diagnosing the problem and proposing a fix as part of #1695. This takes a revised approach that avoids adding extra memory; by reusing the reap node it is also slightly faster, since we do not need to update both pointers in the linked list. As part of this, a new internal API, nni_aio_completions, is introduced. In all likelihood we will be able to use it to solve similar crashes in other areas of the code.
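The resulting usage pattern in sub0_recv_cb looks roughly like this (paraphrased from memory, with field names approximate and error handling omitted; see src/core/aio.h for the authoritative declarations):

```c
/* Sketch of the receive-callback pattern the new API enables. */
static void
sub0_recv_cb_sketch(sub0_sock *s, nni_aio *aio, nni_msg *msg)
{
	nni_aio_completions finishq;

	nni_aio_completions_init(&finishq);

	nni_mtx_lock(&s->lk);
	/* Dequeue each satisfied aio from its ctx recv_queue, then park
	 * it on the completion list instead of finishing it under lock. */
	nni_aio_completions_add(&finishq, aio, 0, nni_msg_len(msg));
	nni_mtx_unlock(&s->lk);

	/* Run the callbacks only after dropping the lock, so a callback
	 * that resubmits the aio can safely retake it. */
	nni_aio_completions_run(&finishq);
}
```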
I'm closing this in favor of #1523 -- I think what I've done there is superior, but it was inspired by the excellent work you did here, and I've given what credit I can.
I get your idea. Using an existing node as a "singly linked" node is indeed a better choice in terms of memory usage and efficiency. Since reaping only happens after finishing, we can be sure that aio completion and aio reaping will not execute concurrently. One more small suggestion: maybe we could use an anonymous union in struct nng_aio.
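Sketched, the suggestion looks like this (type shapes only; the hypothetical a_comp_next field stands in for whichever single link gets reused):

```c
typedef struct nni_reap_node {
	struct nni_reap_node *rn_next;
} nni_reap_node; /* shape only; the real type lives in core/reap.h */

struct nng_aio {
	/* ... other fields elided ... */
	union { /* anonymous union: members share storage, since
	           reaping begins only after completion has finished */
		nni_reap_node   a_reap_node; /* used while reaping */
		struct nng_aio *a_comp_next; /* hypothetical single link
		                                for the completion list */
	};
};
```

Anonymous unions were standardized in C11; earlier compilers support them only as an extension, which is the portability concern raised in the next comment.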
I am not sure that anonymous unions work in all the C versions we need to support. I don't think they add enough value to be worth the pain their use might cause if they don't.
Also btw, I went ahead and merged my PR. I think there are other places in the code that suffer from the same defect, and I will try to find and fix them all.