Cancelled database jobs aren't cancelled while in db-maintenance mode #766

Open
evansd opened this issue Dec 4, 2024 · 2 comments
Comments

@evansd
Contributor

evansd commented Dec 4, 2024

In one sense this isn't a problem, as the jobs are cancelled as soon as we come out of maintenance mode. But as well as being a source of user confusion, it potentially makes our stats/dashboard a bit misleading, because we continue to count the jobs as active until the end of maintenance mode.

We could address this with an extra case in this branch (if a job isn't running but is cancelled then move it immediately to failed):

job-runner/jobrunner/run.py

Lines 228 to 241 in 4fba743

if mode == "db-maintenance" and job_definition.allow_database_access:
    if job.state == State.RUNNING:
        log.warning(f"DB maintenance mode active, killing db job {job.id}")
        # we ignore the JobStatus returned from these API calls, as this is not a hard error
        api.terminate(job_definition)
        api.cleanup(job_definition)

    # reset state to pending and exit
    set_code(
        job,
        StatusCode.WAITING_DB_MAINTENANCE,
        "Waiting for database to finish maintenance",
    )
    return

But I'm wary of adding even more complexity to the state manipulation code here. Maybe there's a more principled way of refactoring things here to get the behaviour we want?
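
For concreteness, here is a rough, self-contained sketch of what the "extra case" could look like. It is not the actual job-runner code: `Job`, `State`, `StatusCode`, the `cancelled` flag, the `CANCELLED_BY_USER` code and the `handle_db_maintenance` helper are all simplified stand-ins assumed for illustration, and the real branch calls `api.terminate`, `api.cleanup` and `set_code` instead of mutating the job directly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"


class StatusCode(Enum):
    WAITING_DB_MAINTENANCE = "waiting_db_maintenance"
    CANCELLED_BY_USER = "cancelled_by_user"  # assumed name, for illustration


@dataclass
class Job:
    id: str
    state: State
    cancelled: bool = False  # assumed flag set when a user cancels the job
    allow_database_access: bool = True
    status_code: Optional[StatusCode] = None


def handle_db_maintenance(
    job: Job, mode: Optional[str], terminate: Callable[[Job], None] = lambda job: None
) -> bool:
    """Return True if the maintenance branch dealt with the job."""
    if mode != "db-maintenance" or not job.allow_database_access:
        return False

    if job.state == State.RUNNING:
        # Existing behaviour: kill the running db job.
        terminate(job)
    elif job.cancelled:
        # Proposed extra case: a cancelled job that isn't running is
        # failed straight away instead of waiting for maintenance to end.
        job.state = State.FAILED
        job.status_code = StatusCode.CANCELLED_BY_USER
        return True

    # Existing behaviour: reset state to pending and exit.
    job.state = State.PENDING
    job.status_code = StatusCode.WAITING_DB_MAINTENANCE
    return True
```

In this sketch a job that is both running and cancelled still takes the existing path (killed, then reset to pending), which matches the proposal as worded but may not be the behaviour we ultimately want.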

Slack thread:
https://bennettoxford.slack.com/archives/C069YDR4NCA/p1733311815069519

@evansd
Contributor Author

evansd commented Dec 11, 2024

It turns out this is more of a problem than just a confusing UX. Job Runner will refuse to schedule a new job running action X while there is an existing job for action X pending. If you realise there's a problem with one of your pending jobs then the natural thing to do is to cancel it and schedule a new version with the fixed code. But if we're in database maintenance mode then you can't do this because you can't actually cancel the job to schedule a new one.
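
A toy model of the interaction, to make the failure mode explicit. Nothing here is job-runner's real API: `PendingJob`, `can_schedule` and `process_cancellations` are hypothetical names, `None` is used as a stand-in for "no special mode", and the model assumes every job needs database access.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingJob:
    action: str
    cancelled: bool = False  # the user has asked for cancellation
    active: bool = True      # the scheduler still counts the job as active


def can_schedule(action: str, jobs: List[PendingJob]) -> bool:
    # A new job for an action is refused while an active job for it exists.
    return not any(j.active and j.action == action for j in jobs)


def process_cancellations(jobs: List[PendingJob], mode: Optional[str]) -> None:
    # Cancellations of db jobs are only acted on outside maintenance mode.
    for j in jobs:
        if j.cancelled and mode != "db-maintenance":
            j.active = False


jobs = [PendingJob(action="X", cancelled=True)]

process_cancellations(jobs, mode="db-maintenance")
blocked = not can_schedule("X", jobs)   # still blocked during maintenance

process_cancellations(jobs, mode=None)
unblocked = can_schedule("X", jobs)     # free again once maintenance ends
```

The cancelled job keeps occupying the slot for action X for as long as maintenance mode lasts, so the fixed version can't be scheduled until then.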

Slack thread:
https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1733933228089519

@bloodearnest
Member

Oh yikes. Yeah, we need to fix this.
