In one sense this isn't a problem, as the jobs are cancelled as soon as we come out of maintenance mode. But as well as being a source of user confusion, this potentially makes our stats/dashboard a bit misleading, as we continue to count the jobs as active until the end of maintenance mode.
We could address this with an extra case in this branch (if a job isn't running but is cancelled, then move it immediately to failed):
log.warning(f"DB maintenance mode active, killing db job {job.id}")
# we ignore the JobStatus returned from these API calls, as this is not a hard error
api.terminate(job_definition)
api.cleanup(job_definition)
# reset state to pending and exit
set_code(
    job,
    StatusCode.WAITING_DB_MAINTENANCE,
    "Waiting for database to finish maintenance",
)
return
But I'm wary of adding even more complexity to the state manipulation code here. Maybe there's a more principled way of refactoring things here to get the behaviour we want?
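For discussion, here is a rough, self-contained sketch of what the extra case might look like. It is not the real `run.py` code: `Job`, `State`, `handle_db_maintenance` and the status messages are hypothetical stand-ins, and the real branch goes through `set_code` and the executor `api` rather than mutating the job directly.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical, simplified job states for illustration only.
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class Job:
    # Hypothetical stand-in for the real job model.
    id: str
    state: State
    cancelled: bool = False
    status_message: str = ""


def handle_db_maintenance(job: Job, is_running: bool) -> None:
    """Simplified model of the maintenance-mode branch for a db job."""
    if job.cancelled and not is_running:
        # Proposed extra case: there is nothing to terminate, so honour
        # the cancellation immediately instead of parking the job.
        job.state = State.FAILED
        job.status_message = "Cancelled by user"
        return

    # Existing behaviour (the kill step for running jobs is omitted here):
    # reset the job to pending until maintenance mode ends.
    job.state = State.PENDING
    job.status_message = "Waiting for database to finish maintenance"
```

With this shape, a cancelled job that never started leaves the active set straight away, while running jobs still follow the existing kill-and-wait path.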
It turns out this is more of a problem than just a confusing UX. Job Runner will refuse to schedule a new job running action X while there is an existing job for action X pending. If you realise there's a problem with one of your pending jobs then the natural thing to do is to cancel it and schedule a new version with the fixed code. But if we're in database maintenance mode then you can't do this because you can't actually cancel the job to schedule a new one.
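To make the blocking behaviour concrete, here is a toy model of the rule described above. It is only a sketch under assumptions: `PendingJob`, `ACTIVE_STATES`, `can_schedule` and the action name are invented for illustration and are not Job Runner's actual scheduling code.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    # Hypothetical stand-in for a job record.
    action: str
    state: str  # e.g. "pending", "running", "failed"
    cancelled: bool = False


# Assumption: jobs in these states count as "active" for scheduling.
ACTIVE_STATES = {"pending", "running"}


def can_schedule(action: str, jobs: list[PendingJob]) -> bool:
    # A new job is refused while any job for the same action is still active.
    return not any(j.action == action and j.state in ACTIVE_STATES for j in jobs)


# During maintenance mode the cancelled job is still held as pending,
# so a replacement job for the same action is refused.
jobs = [PendingJob(action="generate_dataset", state="pending", cancelled=True)]
blocked = not can_schedule("generate_dataset", jobs)

# Once the cancellation is actually applied, the action is free again.
jobs[0].state = "failed"
unblocked = can_schedule("generate_dataset", jobs)
```

The cancellation flag alone does not free up the action in this model; only the state change does, which is why deferring that change until the end of maintenance mode blocks the resubmission.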
Code referenced above: job-runner/jobrunner/run.py, lines 228 to 241 in 4fba743
Slack thread:
https://bennettoxford.slack.com/archives/C069YDR4NCA/p1733311815069519