Last night, various things on the control node crashed. I haven't been able to establish the exact chain of events, but I believe that first either Redis or the trimmer had some sort of minor issue, which caused the trimmer to die with an error. This meant that the job logs accumulated, and eventually Redis was slaughtered by the OOM killer. Unfortunately, the RDB file already had these accumulated logs, which meant that restarting Redis would simply lead to another OOM kill immediately.
As I understand it, this could also happen if the trimmer was still working fine but the analyzer or the log-firehose crashed, because that would also break trimming (via no longer updating `last_analyzed_log_entry` and `last_broadcasted_log_entry`, respectively).
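To make that dependency concrete, here's a rough sketch of the trimming logic as I understand it; the key names, field names, and log layout below are guesses for illustration only, not the actual schema:

```sh
# Hypothetical layout: a per-job hash holding the consumers' cursors, and a per-job log list.
ident=0123456789abcdefghijklmn    # hypothetical job ID
analyzed=$(redis-cli hget "$ident" last_analyzed_log_entry)
broadcasted=$(redis-cli hget "$ident" last_broadcasted_log_entry)
# The trimmer can only drop entries that *both* consumers have already processed,
# so it trims up to whichever cursor is further behind. If the analyzer or the
# firehose stops advancing its cursor, this minimum stops moving and the log
# list grows without bound.
trim_to=$(( analyzed < broadcasted ? analyzed : broadcasted ))
redis-cli ltrim "${ident}_log" "$trim_to" -1    # keep entries from trim_to onwards
```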
I'm not sure what the solution for this is – other than redesigning the entire log system so that messages don't go into Redis in the first place (which is planned).
I'll also use this to document the fix:
1. Stop the pipelines' SSH connections (or at least most of them) to prevent them from immediately spamming Redis with further log lines.
2. Stop anything else memory-intensive that's still running and not really needed (dashboard, websocket, cogs).
3. Restart Redis and hope that it doesn't OOM. (If it does, free more RAM or temporarily increase swap, I guess? There's a rough swap sketch after this list.)
4. Run the analyzer and the firehose manually for all jobs.

    This is to update the two fields mentioned above so that the trimmer can do its job. The analyzer will do its usual thing, and the firehose will send the log messages into the void (since the dashboard WebSocket server isn't running), but that's fine.

    The normal way to run these is with `updates-listener`, but that wouldn't work here because the pipelines are disconnected, so no job IDs are being pushed to the updates channel. Instead:

    ```
    redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/analyze-logs
    redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 FIREHOSE_SOCKET_URL=tcp://127.0.0.1:12345 plumbing/log-firehose
    redis-cli keys '*' | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/trim-logs >/dev/null
    ```

    This grepping for job IDs is obviously not perfect, but it should at least trim things down enough to get out of the OOM zone, since most job IDs are 24 or 25 characters long.
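One more note on the commands above: `keys '*'` makes Redis walk the entire keyspace in a single blocking call, which isn't ideal on an instance that's already on the edge. `redis-cli --scan` iterates incrementally and should be a drop-in replacement in these pipelines, e.g.:

```sh
# Same job ID filtering, but with incremental SCAN instead of a blocking KEYS:
redis-cli --scan | grep -P '^[0-9a-z]{24,25}$' | REDIS_URL=redis://127.0.0.1:6379/0 plumbing/trim-logs >/dev/null
```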
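And for step 3: if Redis keeps getting OOM-killed while loading the RDB, temporarily adding swap is one way to bridge the gap. A rough sketch (size and path are arbitrary):

```sh
# Create and enable a temporary swap file (use dd instead of fallocate if the
# filesystem doesn't support fallocate-backed swap files):
fallocate -l 8G /swapfile.tmp
chmod 600 /swapfile.tmp
mkswap /swapfile.tmp
swapon /swapfile.tmp
# ... restart Redis and run the recovery commands above, then clean up:
swapoff /swapfile.tmp
rm /swapfile.tmp
```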