
fix(worker/broker): events thread was killed prematurely #45

Closed
wants to merge 2 commits

Conversation

bungle
Member

@bungle bungle commented May 14, 2024

Summary

This is discussed in FTI-5930 and a workaround for this issue was proposed in: https://github.com/Kong/kong/pull/13004

Before (dbless node):

  1. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  2. Worker 1 gets "reconfigure" event and acquires coroutine mutex
  3. Worker 0 crashes, and with it the events broker crashes too
  4. Worker 1 restarts its events processing because of a connection issue to the events broker, and it prematurely kills the executing "reconfigure" event handler, which then never releases its mutex
  5. Worker 0 restarts with a pristine state (only node-level locks, which eventually time out, may be retained)
  6. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  7. Worker 1 gets "reconfigure" event and is now deadlocked

After (dbless node):

  1. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  2. Worker 1 gets "reconfigure" event and acquires coroutine mutex
  3. Worker 0 crashes, and with it the events broker crashes too
  4. Worker 1 restarts its events processing because of a connection issue to the events broker, but it lets the executing event handlers finish, allowing them to release their locks
  5. Worker 0 restarts with a pristine state (only node-level locks, which eventually time out, may be retained)
  6. Worker 0 gets "reconfigure" event and acquires coroutine mutex
  7. Worker 1 gets "reconfigure" event and acquires coroutine mutex
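
The "after" behavior above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual patch: the names (`communicate`, `read_thread_co`, `write_thread_co`, `self._events_thread`, and the `spawn`/`wait`/`kill`/`exiting` locals, presumably bound to `ngx.thread` and `ngx.worker` APIs) are assumptions based on the snippets quoted in this PR. The key point is that a broker connection failure restarts only the read/write threads, while the events thread keeps running so in-flight handlers can complete and release their locks.

```lua
-- Hypothetical sketch of the worker's connection loop after this fix.
local function communicate(premature, self)
  if premature then
    return
  end

  while not exiting() do
    local broker_connection = assert(self._connect())

    local read_thread  = spawn(read_thread_co,  self, broker_connection)
    local write_thread = spawn(write_thread_co, self, broker_connection)

    -- wait returns when either thread exits (e.g. the broker died)
    local ok, err = wait(read_thread, write_thread)

    if exiting() then
      -- only on worker shutdown is it safe to kill the events thread
      kill(read_thread)
      kill(write_thread)
      kill(self._events_thread)
      return
    end

    -- before this fix the events thread was killed here too, which
    -- aborted in-flight handlers and leaked their mutexes; now only the
    -- connection threads are restarted
    kill(read_thread)
    kill(write_thread)

    if not ok then
      log(ERR, "event worker error: ", err)
    end
  end
end
```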

@bungle
Member Author

bungle commented May 14, 2024

A smaller (perhaps bigger) version of #44.

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch from 40700f8 to 1c9171e on May 14, 2024 12:39
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 4 times, most recently from bea3f42 to 349d0ae on May 14, 2024 17:10
@chronolaw chronolaw changed the title fix(*): events thread was killed prematurely fix(lib): events thread was killed prematurely May 15, 2024
Comment on lines -201 to -203
if exiting() then
return
end
Collaborator

Should we remove this? The pop() above may take time.

Member Author

@bungle bungle May 15, 2024

It may take time (1 second, I think), but I wanted consistent behavior. If an event handler is long-running, it should yield, like reconfigure does. I want to leave the process in a consistent state here: no events are skipped, and no event is skipped from being published (if the write thread fails, we may want to add another guard to put the event back into the write queue, but I am not sure you can push to the front of the queue, so that needs new APIs). I left that logic to:

    if exiting() then
        kill(read_thread)
        kill(write_thread)
        kill(events_thread)
        return
    end

Perhaps later on we want to change that to drain events when exiting. If that is needed, it is easier to do there at the end.
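
The "if it is long-running, it should yield" point can be illustrated with a hypothetical handler. This is not code from the PR: `reconfigure_handler` and `apply` are made-up names, and the only assumption from the real API is that `ngx.sleep(0)` yields the current light thread back to the scheduler, which lets the events loop observe shutdown and keeps other handlers responsive.

```lua
-- Hypothetical example: a long-running event handler that cooperates
-- with the scheduler instead of monopolizing it.
local function reconfigure_handler(data)
  for i, item in ipairs(data) do
    apply(item)         -- assumed application-specific work per item
    if i % 100 == 0 then
      ngx.sleep(0)      -- yield so other threads (and shutdown) can run
    end
  end
end
```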

Comment on lines -235 to -237
if exiting() then
return
end
Collaborator

Same here.

Member Author

See above.

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 4 times, most recently from ca491f0 to a31f8b5 on May 15, 2024 07:19
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 3 times, most recently from a5e3071 to 370a117 on May 16, 2024 12:54
end

-- timeout
goto continue
Collaborator

Just wondering if we could add a log saying the worker timed out connecting to the broker.

Member Author

@bungle bungle May 17, 2024

No, we don't want to flood the logs with something that happens all the time. E.g. when there are no events to be read, recv_frame just times out every second.
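
The shape of the loop being discussed could look roughly like the sketch below. This is an assumed reconstruction around the `goto continue` shown in the diff context (the `recv_frame`, `process`, and `connection` names are placeholders): a read timeout is the normal idle case, so it retries silently, while any other error is returned so the caller can reconnect.

```lua
-- Sketch: timeouts are the idle steady state, so they are not logged.
while not exiting() do
  local data, err = recv_frame(connection)
  if not data then
    if err and string.find(err, "timeout", 1, true) then
      -- no events arrived within the ~1s read timeout: completely
      -- normal, so logging here would flood the error log
      goto continue
    end
    -- a real connection error: bail out so the caller can reconnect
    return nil, err
  end

  process(data)   -- assumed handler dispatch

  ::continue::
end
```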

@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 11 times, most recently from eac45a2 to 9c4813f on May 17, 2024 08:33
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 3 times, most recently from 688f3fc to c58af36 on May 17, 2024 08:42
@bungle bungle changed the title fix(lib): events thread was killed prematurely fix(worker/broker): events thread was killed prematurely May 17, 2024
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch 2 times, most recently from d44a080 to 2234698 on May 17, 2024 14:19
@bungle bungle force-pushed the fix/events-thread-prematurely-killed branch from 2234698 to 0e1153e on May 17, 2024 14:19
@bungle bungle marked this pull request as draft May 20, 2024 09:04
@bungle
Member Author

bungle commented May 20, 2024

Preparing more PRs to fix it all, PR by PR. Thus turning this into a draft.

Comment on lines +60 to +61
-- pairs is "random" enough for unique
for _, client_queue in pairs(self._clients) do
Member

My understanding is that the order of iteration of pairs depends on the hash of each key in the table, so while this is in fact "random" or unpredictable when keys are created/rehashed, doesn't it mean the iteration order will stay the same for the self._clients table until then? I.e. would this implementation risk running "unique" events on the same worker repeatedly?

Member Author

@bungle bungle May 20, 2024

@samugi it will mean a bit of that. The order may change or it may not; it is undefined. We almost never use unique events in Kong, and what this does is run them on the same worker. Not a huge deal. There is a bigger benefit: it actually retries on another worker if the event fails on the previous one. So it is more robust, but it does not balance the load the same way as before, though I have no idea why we would need that.

On success there is a break. Unique basically means "run on any one worker".
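
Put together, the dispatch being described could look like the sketch below. This is a hypothetical reconstruction around the two quoted diff lines (the `client_queue:push`, `payload`, and `log` names are assumptions): the broker tries clients in `pairs()` order and stops at the first worker that accepts the event, so a failed delivery falls through to the next worker instead of losing the event.

```lua
-- Sketch of unique-event dispatch: deliver to exactly one worker,
-- falling back to the next client on failure.
-- pairs is "random" enough for unique
for _, client_queue in pairs(self._clients) do
  local ok, err = client_queue:push(payload)
  if ok then
    break  -- unique: one worker got it, stop here
  end
  log(WARN, "unable to publish unique event to a worker: ", err)
end
```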

@bungle bungle closed this May 21, 2024
4 participants