
fix(worker/broker): events thread was killed prematurely #45

Closed
wants to merge 2 commits

Conversation

bungle
Member

@bungle bungle commented May 14, 2024

Summary

This is discussed in FTI-5930 and a workaround for this issue was proposed in: https://github.com/Kong/kong/pull/13004

Before (dbless node):

  1. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  2. Worker 1 gets "reconfigure" event and acquires coroutine mutex
  3. Worker 0 crashes, and with it the events broker crashes too
  4. Worker 1 restarts its events processing because of a connection issue to the events broker, and it prematurely kills the executing "reconfigure" event handler, which then never releases its mutex
  5. Worker 0 restarts with a pristine state (only node-level locks, which eventually time out, may be retained)
  6. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  7. Worker 1 gets "reconfigure" event and is now deadlocked

After (dbless node):

  1. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  2. Worker 1 gets "reconfigure" event and acquires coroutine mutex
  3. Worker 0 crashes, and with it the events broker crashes too
  4. Worker 1 restarts its events processing because of a connection issue to the events broker, but it lets the executing event handlers finish, allowing them to release their locks
  5. Worker 0 restarts with a pristine state (only node-level locks, which eventually time out, may be retained)
  6. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  7. Worker 1 gets "reconfigure" event and acquires coroutine mutex
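
The "after" behavior above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual patch: the names (`communicate`, `read_thread_co`, `write_thread_co`, `self._events_thread`, and the `spawn`/`wait`/`kill`/`exiting` locals, presumably bound to `ngx.thread` and `ngx.worker` APIs) are assumptions based on the snippets quoted in this PR. The key point is that a broker connection failure restarts only the read/write threads, while the events thread keeps running so in-flight handlers can complete and release their locks.

```lua
-- Hypothetical sketch of the worker's connection loop after this fix.
local function communicate(premature, self)
  if premature then
    return
  end

  while not exiting() do
    local broker_connection = assert(self._connect())

    local read_thread  = spawn(read_thread_co,  self, broker_connection)
    local write_thread = spawn(write_thread_co, self, broker_connection)

    -- wait returns when either thread exits (e.g. the broker died)
    local ok, err = wait(read_thread, write_thread)

    if exiting() then
      -- only on worker shutdown is it safe to kill the events thread
      kill(read_thread)
      kill(write_thread)
      kill(self._events_thread)
      return
    end

    -- before this fix the events thread was killed here too, which
    -- aborted in-flight handlers and leaked their mutexes; now only the
    -- connection threads are restarted
    kill(read_thread)
    kill(write_thread)

    if not ok then
      log(ERR, "event worker error: ", err)
    end
  end
end
```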

@bungle
Member Author

bungle commented May 14, 2024

A smaller (perhaps bigger) version of #44.

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch from 40700f8 to 1c9171e on May 14, 2024 12:39
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 4 times, most recently from bea3f42 to 349d0ae on May 14, 2024 17:10
@chronolaw chronolaw changed the title fix(*): events thread was killed prematurely fix(lib): events thread was killed prematurely May 15, 2024
Comment on lines -201 to -203
if exiting() then
return
end
Collaborator

Should we remove this? The pop() above may take time.

Member Author

@bungle bungle May 15, 2024

It may take time (1 second, I think), but I wanted consistent behavior. If an event handler is long-running, it should yield, like reconfigure does. I want to leave the process in a consistent state here: no events are skipped, and no event is skipped from being published (if the write thread fails, we may want to add another guard to put the event back into the write queue, but I am not sure you can push to the front of the queue, so that needs new APIs). I left that logic to:

    if exiting() then
        kill(read_thread)
        kill(write_thread)
        kill(events_thread)
        return
    end

Perhaps later on we want to change that to drain events when exiting. If that is needed, it is easier to do there at the end.
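
The "if it is long-running, it should yield" point can be illustrated with a hypothetical handler. This is not code from the PR: `reconfigure_handler` and `apply` are made-up names, and the only assumption from the real API is that `ngx.sleep(0)` yields the current light thread back to the scheduler, which lets the events loop observe shutdown and keeps other handlers responsive.

```lua
-- Hypothetical example: a long-running event handler that cooperates
-- with the scheduler instead of monopolizing it.
local function reconfigure_handler(data)
  for i, item in ipairs(data) do
    apply(item)         -- assumed application-specific work per item
    if i % 100 == 0 then
      ngx.sleep(0)      -- yield so other threads (and shutdown) can run
    end
  end
end
```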

Comment on lines -235 to -237
if exiting() then
return
end
Collaborator

Same here.

Member Author

See above.

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 4 times, most recently from ca491f0 to a31f8b5 on May 15, 2024 07:19
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 3 times, most recently from a5e3071 to 370a117 on May 16, 2024 12:54
end

-- timeout
goto continue
Collaborator

Just wondering if we could add a log saying the worker timed out connecting to the broker.

Member Author

@bungle bungle May 17, 2024

No, we don't want to flood the logs with something that happens all the time. E.g. when there are no events to be read, recv_frame just times out every second.
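
The shape of the loop being discussed could look roughly like the sketch below. This is an assumed reconstruction around the `goto continue` shown in the diff context (the `recv_frame`, `process`, and `connection` names are placeholders): a read timeout is the normal idle case, so it retries silently, while any other error is returned so the caller can reconnect.

```lua
-- Sketch: timeouts are the idle steady state, so they are not logged.
while not exiting() do
  local data, err = recv_frame(connection)
  if not data then
    if err and string.find(err, "timeout", 1, true) then
      -- no events arrived within the ~1s read timeout: completely
      -- normal, so logging here would flood the error log
      goto continue
    end
    -- a real connection error: bail out so the caller can reconnect
    return nil, err
  end

  process(data)   -- assumed handler dispatch

  ::continue::
end
```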

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 11 times, most recently from eac45a2 to 9c4813f on May 17, 2024 08:33
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 3 times, most recently from 688f3fc to c58af36 on May 17, 2024 08:42
@bungle bungle changed the title fix(lib): events thread was killed prematurely fix(worker/broker): events thread was killed prematurely May 17, 2024
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 2 times, most recently from d44a080 to 2234698 on May 17, 2024 14:19
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch from 2234698 to 0e1153e on May 17, 2024 14:19
@bungle bungle marked this pull request as draft May 20, 2024 09:04
@bungle
Member Author

bungle commented May 20, 2024

Preparing more PRs to fix it all, PR by PR. Thus turning this into a draft.

Comment on lines +60 to +61
-- pairs is "random" enough for unique
for _, client_queue in pairs(self._clients) do
Member

My understanding is that the order of iteration of pairs depends on the hash of each key in the table, so while this is in fact "random" or unpredictable when keys are created/rehashed, doesn't it mean the iteration order will stay the same for the self._clients table until then? I.e. would this implementation risk running "unique" events on the same worker repeatedly?

Member Author

@bungle bungle May 20, 2024

@samugi it will mean a bit of that. The order may change or it may not; it is undefined. We almost never use unique events in Kong, and what this does is run them on the same worker. Not a huge deal. There is a bigger benefit: it actually retries on another worker if the event fails on the previous one. So it is more robust, but it does not balance the load the same way as before, though I have no idea why we would need that.

On success there is a break. Unique basically means "run on any one worker".
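
Put together, the dispatch being described could look like the sketch below. This is a hypothetical reconstruction around the two quoted diff lines (the `client_queue:push`, `payload`, and `log` names are assumptions): the broker tries clients in `pairs()` order and stops at the first worker that accepts the event, so a failed delivery falls through to the next worker instead of losing the event.

```lua
-- Sketch of unique-event dispatch: deliver to exactly one worker,
-- falling back to the next client on failure.
-- pairs is "random" enough for unique
for _, client_queue in pairs(self._clients) do
  local ok, err = client_queue:push(payload)
  if ok then
    break  -- unique: one worker got it, stop here
  end
  log(WARN, "unable to publish unique event to a worker: ", err)
end
```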

@bungle bungle closed this May 21, 2024
4 participants