Currently, we can run MAX_WORKERS jobs, each with up to 128GB of memory (a global config). At time of writing MAX_WORKERS is 20, so that's a potential of 2560GB. We have ~610GB on TPP, so we are ~4x overcommitted at peak. Normally this is fine, as most jobs use a lot less memory than this. Occasionally, this is not true, and while each job is below 128GB, 20 of them together exceed 610GB, and we get the bad kind of OOM behaviour (the kernel has to choose which docker process to kill, which is the OS equivalent of Undefined Behaviour).
The proper solution for this is probably to enable per-job limits. Most jobs can run under a limit of 64GB or 32GB; only some need 128GB. This would also enable size-based scheduling, which is well understood (think cloud VM scheduling).
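A minimal sketch of what per-job limits could look like, assuming jobs are launched via `docker run` and that a job can carry an optional memory limit. The function and field names here are illustrative, not the actual job-runner API:

```python
# Hypothetical default: most jobs fit comfortably in this limit.
DEFAULT_LIMIT_GB = 32


def docker_memory_args(job_limit_gb=None):
    """Build docker CLI flags enforcing a hard per-job memory cap.

    Setting --memory-swap equal to --memory prevents the container
    from spilling past its limit into swap, so the container's own
    OOM killer fires instead of the host's.
    """
    limit = job_limit_gb or DEFAULT_LIMIT_GB
    return ["--memory", f"{limit}g", "--memory-swap", f"{limit}g"]


# A heavyweight job would opt in to the full allowance:
args = docker_memory_args(128)  # ["--memory", "128g", "--memory-swap", "128g"]
```

With per-job limits in place, the scheduler can sum the declared limits of running jobs and only admit a new one if the total stays under physical memory, which is the same bin-packing model cloud providers use.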
A dumber and simpler option might be to just check for a minimum amount of free memory before executing a job. This should be fairly simple to implement, and would apply dynamic backpressure without complicated scheduling algos. We could do the same with disk too, perhaps.
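That check could be as simple as reading `MemAvailable` from `/proc/meminfo` before dispatching each job. A sketch, with the threshold value (128GB, matching the global per-job cap) chosen purely for illustration:

```python
import re

# Hypothetical threshold: don't admit a job unless at least one
# worst-case job's worth of memory is free.
MIN_FREE_GB = 128


def available_gb(meminfo_text):
    """Parse the MemAvailable line (reported in kB) from /proc/meminfo."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) / (1024 * 1024)


def can_start_job(meminfo_path="/proc/meminfo"):
    """Backpressure check: only start a new job if enough memory is free."""
    with open(meminfo_path) as f:
        return available_gb(f.read()) >= MIN_FREE_GB
```

Jobs that fail the check would simply stay queued until memory frees up, with no changes to the scheduling logic itself.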