fix(worker/broker): events thread was killed prematurely #45
Conversation
if exiting() then
  return
end
Should we remove this? The pop() above may take time.
It may take a while (1 second, I think), but I wanted consistent behavior. If an event handler is long running, it should yield, like reconfigure does. I want to leave the process in a consistent state here, e.g. no event is skipped from processing and no event is skipped from being published (if the write thread fails we may want to add another guard to put the event back into the write queue, but I am not sure you can put things at the front of the queue, so that would need new APIs), and I left that logic to:
if exiting() then
  kill(read_thread)
  kill(write_thread)
  kill(events_thread)
  return
end
Perhaps later on we want to change that to drain events when exiting. If that is ever needed, it is easier to do it there, at the end.
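
To make the intended lifecycle concrete, here is a minimal sketch of the loop shape being described. It is not the actual lua-resty-events code; `communicate`, `read_loop`, `write_loop` and `events_loop` are illustrative names, but the single teardown point matches the snippet above.

```lua
-- Minimal sketch (illustrative names, not the real implementation): the
-- exiting() check at the end of the iteration is the single teardown point,
-- so an in-flight event handler is allowed to finish before the helper
-- threads are killed.
local exiting = ngx.worker.exiting
local spawn, wait, kill = ngx.thread.spawn, ngx.thread.wait, ngx.thread.kill

local function communicate(self)
  while true do
    local read_thread = spawn(self.read_loop, self)
    local write_thread = spawn(self.write_loop, self)
    local events_thread = spawn(self.events_loop, self)

    -- returns as soon as any of the three threads finishes or errors,
    -- e.g. when the connection to the events broker is lost
    local ok, err = wait(read_thread, write_thread, events_thread)
    if not ok then
      ngx.log(ngx.ERR, "event worker error: ", err)
    end

    -- the single teardown point discussed above
    if exiting() then
      kill(read_thread)
      kill(write_thread)
      kill(events_thread)
      return
    end

    -- not exiting: fall through, reconnect and spawn fresh threads
  end
end
```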
if exiting() then
  return
end
Same here.
See above.
end

-- timeout
goto continue
Just wondering if we can add a log saying the worker timed out connecting to the broker.
No, we don't want to flood the logs with something that happens all the time. E.g. when there are no events to be read, recv_frame just times out every second.
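
For context, a minimal sketch of the read loop shape under discussion, assuming a 1 second read timeout; recv_frame is the call referenced above, while the surrounding names (`self.connection`, `self:handle_frame`) are illustrative rather than the library's actual API.

```lua
-- Sketch of the idle/timeout path (assumed loop shape, not the actual
-- implementation): with a 1 second read timeout, recv_frame returns a
-- "timeout" error every idle second, so logging that branch would flood
-- the error log with noise.
local exiting = ngx.worker.exiting

local function read_loop(self)
  while not exiting() do
    local data, err = self.connection:recv_frame()  -- illustrative field name
    if err then
      if string.find(err, "timeout", 1, true) then
        -- idle connection: nothing was published in the last second
        goto continue
      end
      -- a real error (e.g. broker went away): bail out so the caller reconnects
      return nil, err
    end

    self:handle_frame(data)  -- hypothetical dispatch helper

    ::continue::
  end

  return true
end
```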
### Summary

This is discussed in FTI-5930 and a workaround for this issue was proposed in: https://github.com/Kong/kong/pull/13004

Before (dbless node):
1. Worker 0 gets "reconfigure" event and acquires the coroutine mutex
2. Worker 1 gets "reconfigure" event and acquires the coroutine mutex
3. Worker 0 crashes and with it the events broker crashes too
4. Worker 1 restarts its events processing because of a connection issue to the events broker and prematurely kills the executing "reconfigure" event handler, which then never releases its mutex
5. Worker 0 restarts with a pristine state (only the possible, time-outing node-level locks are retained)
6. Worker 0 gets "reconfigure" event and acquires the coroutine mutex
7. Worker 1 gets "reconfigure" event and is now in a deadlock situation

After (dbless node):
1. Worker 0 gets "reconfigure" event and acquires the coroutine mutex
2. Worker 1 gets "reconfigure" event and acquires the coroutine mutex
3. Worker 0 crashes and with it the events broker crashes too
4. Worker 1 restarts its events processing because of a connection issue to the events broker, but it finishes the execution of any running event handlers, allowing their locks to be released
5. Worker 0 restarts with a pristine state (only the possible, time-outing node-level locks are retained)
6. Worker 0 gets "reconfigure" event and acquires the coroutine mutex
7. Worker 1 gets "reconfigure" event and acquires the coroutine mutex
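
To illustrate the failure mode in the "Before" sequence, here is a hypothetical sketch of a handler guarded by a worker-local lock. It is deliberately simplified (a plain flag instead of the real coroutine mutex, and `apply_configuration` is a made-up name), but it shows why killing the handler mid-execution leaves the lock held forever.

```lua
-- Hypothetical sketch of the "Before" deadlock: if the thread running this
-- handler is killed between taking and releasing the lock, every later
-- "reconfigure" event on this worker waits forever.
local locked = false

local function reconfigure_handler(config)
  while locked do
    ngx.sleep(0.1)               -- wait for the previous reconfigure to finish
  end
  locked = true                  -- acquire

  apply_configuration(config)    -- yields; killing the thread here leaks the lock

  locked = false                 -- release; never reached if killed above
end
```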
Preparing more PRs to fix it all, PR by PR. Thus turning this to
-- pairs is "random" enough for unique
for _, client_queue in pairs(self._clients) do
My understanding is that the order of iteration of pairs depends on the hash of each key in the table, so while this is in fact "random" or unpredictable when keys are created/rehashed, doesn't it mean the iteration order will stay the same for the self._clients table until then? I.e. would this implementation risk running "unique" events on the same worker repeatedly?
@samugi it can mean a bit of that. The order may or may not change; it is undefined. We almost never use unique events in Kong, and what this does is run them on the same worker, which is not a huge deal. The bigger benefit is that it actually retries the event on another worker if it fails on the previous one. So it is more robust, but it does not balance the load the same way as before, though I have no idea why we would need that. On success there is a break. Unique basically means "run on any one worker".
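
For reference, a minimal sketch of the dispatch pattern being discussed; `publish_unique` and `client_queue:push` are hypothetical names, not the broker's actual API.

```lua
-- Sketch of the "unique" dispatch behavior: a unique event only has to run
-- on any one worker, so the broker walks pairs(self._clients) and stops at
-- the first queue that accepts the payload, retrying the next client on
-- failure.
local function publish_unique(self, payload)
  -- pairs() order is undefined, which is "random enough" here; the point is
  -- delivery to exactly one worker, not load balancing
  for _, client_queue in pairs(self._clients) do
    local ok, err = client_queue:push(payload)
    if ok then
      return true  -- delivered to one worker; break out as described above
    end
    ngx.log(ngx.WARN, "unique event not accepted, trying next worker: ", err)
  end

  return nil, "no worker accepted the unique event"
end
```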