This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Bot becomes unresponsive if cache ever goes offline #113

Open
Cohedrin opened this issue Sep 29, 2020 · 0 comments

Cohedrin commented Sep 29, 2020

Issue

If the Alchemy.Cache.Guilds guild child process is ever killed for any reason (e.g. a timeout), the bot becomes unresponsive and no commands or messages can be processed for that server.

Analysis

  • The cacher is the second GenStage stage, which means that if it ever stops processing events, so does everything downstream (such as the command handlers).
  • The cacher makes a synchronous call to the Alchemy.Cache.Guilds GenServer.
    • The default GenServer call timeout is 5000 ms, which means that if the bot is backlogged processing other commands, or is just not getting much CPU time, the call can time out and crash the GenStage stage.
  • After a crash, the guild's state is stuck at "unavailable" => true, and nothing seems to get it out of that state.
  • At this point, all further messages are discarded, and it is not possible (at least to my knowledge) to perform any actions through the bot.
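
The timeout behaviour described above can be sketched in isolation. This is a minimal illustration, not Alchemy's code: SlowCache and update/2 are hypothetical names, and a 50 ms timeout stands in for the 5000 ms default so the example finishes quickly. It shows that when the server takes longer than the timeout to reply, it is the calling process that exits, which is presumably how a slow cache ends up taking down the stage that calls it.

```elixir
defmodule SlowCache do
  # Hypothetical stand-in for a cache GenServer; not Alchemy's actual module.
  use GenServer

  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  # Synchronous call with an explicit timeout.
  # GenServer.call/3 defaults to 5000 ms when no timeout is given.
  def update(pid, timeout), do: GenServer.call(pid, :update, timeout)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call(:update, _from, delay_ms) do
    # Simulate a backlogged server that is slow to reply.
    Process.sleep(delay_ms)
    {:reply, :ok, delay_ms}
  end
end

{:ok, pid} = SlowCache.start_link(200)

# The server needs 200 ms but the caller only waits 50 ms, so the
# caller exits with {:timeout, _}. It is caught here for demonstration;
# a caller that does not catch it terminates.
result =
  try do
    SlowCache.update(pid, 50)
  catch
    :exit, {:timeout, _} -> :caller_timed_out
  end

IO.inspect(result)
```

Note that the server itself survives the timeout; only the caller is affected unless it catches the exit.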

Reproduction steps

  • The simplest way I've found is to add a :timer.sleep(5001), just over the default call timeout, at the top of this handler
  • Then run a command like
# guild_id and your_user_id must already be bound in the iex session
msg = %{"activities" => [], "client_status" => %{}, "game" => nil, "guild_id" => guild_id, "roles" => [], "status" => "offline", "user" => %{"id" => your_user_id}}
# Fire 8 concurrent presence updates so the calls pile up behind the sleeping handler
1..8
|> Task.async_stream(fn _ -> Alchemy.Cache.Guilds.update_presence(msg) end, max_concurrency: 10, timeout: 30_000)
|> Enum.to_list()

from iex.

  • A bunch of errors will be printed, then the bot will enter the unrecoverable state.

For checking the state

children = Supervisor.which_children(Alchemy.Cache.Guilds.GuildSupervisor)
# Each entry is {id, pid, type, modules}; keep only live pids (a restarting child reports :restarting)
pids = for {_id, pid, _type, _modules} <- children, is_pid(pid), do: pid
has_been_restarted = Enum.any?(pids, fn pid ->
  state = :sys.get_state(pid)
  state["unavailable"] == true && state["id"] == guild_id
end)

If has_been_restarted is true, things are broken. :sys.get_state returns more useful information (the full state of the process), but for the purposes of determining whether the bot is still working, that's all that's relevant.

Notes

I was attempting to submit a PR to fix this issue, but was having trouble determining what the proper fix would be.

It seems like we just need to refresh the "seed" state of the cache when this happens, but it wasn't clear to me where that should happen (or where it currently happens from). I was also unsure whether there is a hidden reason that we could not do this on GenStage death.
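
To illustrate what "seed" state means here, the following is a generic OTP sketch, not Alchemy's implementation. It assumes the guild processes are started under a supervisor with an initial map such as %{"id" => ..., "unavailable" => true}, which I have not confirmed. SeededCache and the polling helper are hypothetical. When a supervised child is killed, the supervisor restarts it with its original start arguments, so any updates applied after startup are lost.

```elixir
defmodule SeededCache do
  # Hypothetical cache process; its initial state is whatever it was started with.
  use GenServer

  def start_link(seed), do: GenServer.start_link(__MODULE__, seed)
  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})

  @impl true
  def init(seed), do: {:ok, seed}

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# Assumed seed shape, mirroring the "unavailable" => true state from the report.
seed = %{"id" => "1", "unavailable" => true}
{:ok, pid} = DynamicSupervisor.start_child(sup, {SeededCache, seed})

# Simulate the guild becoming available after startup.
:ok = SeededCache.put(pid, "unavailable", false)

# Kill the child, as a crash would.
Process.exit(pid, :kill)

# Poll until the supervisor reports a new pid for the restarted child.
wait_for_restart = fn wait, old_pid ->
  case DynamicSupervisor.which_children(sup) do
    [{_, new_pid, _, _}] when is_pid(new_pid) and new_pid != old_pid ->
      new_pid

    _ ->
      Process.sleep(10)
      wait.(wait, old_pid)
  end
end

new_pid = wait_for_restart.(wait_for_restart, pid)

# The restarted process is back at the seed state; the update is gone.
restarted_state = :sys.get_state(new_pid)
IO.inspect(restarted_state)
```

If the real guild processes behave like this, a fix would need something to push fresh guild data into the restarted process, since the supervisor alone can only restore the seed.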

Issues aside, I wanted to say thanks for the awesome library! I was only able to debug this in a couple of hours because of the great work you've put into making this work so well.
