TPP Job raised INTERNAL_ERROR #709

Open
iaindillingham opened this issue Feb 27, 2024 · 6 comments

@iaindillingham
Member

Honeycomb reports that a TPP Job raised an INTERNAL_ERROR [1]. The affected jobs in Job Request 22288 are:

I've scanned the logs on L3 (both are 53,466 lines long 🤔), but can't determine why the jobs failed.

Footnotes

  [1] https://bennettoxford.slack.com/archives/C0270Q313H7/p1708990931394309

@evansd
Contributor

evansd commented Feb 27, 2024

Interesting, looks like the state machine hit something unexpected:

Traceback (most recent call last):
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 153, in handle_single_job
    synchronous_transition = trace_handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 190, in trace_handle_job
    synchronous_transition = handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 232, in handle_job
    api.terminate(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/logging.py", line 41, in wrapper
    status = method(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/local.py", line 208, in terminate
    assert current_status.state in [
AssertionError

From Honeycomb logs:

I note that we entered maintenance mode immediately afterwards, so my assumption would be that it's somehow related. We've had race-condition-like issues around this before.

@bloodearnest
Member

I dug into it in the Slack thread, but here's my conclusion:

So, db maintenance mode killed it. It looks like the job was in the EXECUTING state, but apparently wasn't.
Hypothesis: we know that it takes a while for job-runner to detect db maintenance mode. It's possible both of those jobs actually errored due to missing db tables and were no longer running, but the loop hadn't yet got round to detecting that and moving their state to EXECUTED, so they tripped over this assert.
This also explains why neither container is still running: they'd already finished.
I think there are a couple of actions here:

  1. Remove or rework this assertion. Maybe we log unexpected states? Either way, I don't think it should cause an INTERNAL_ERROR.
  2. Follow up with TPP on adding a delay before removing tables: https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729
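
To make the hypothesis above concrete, here is a minimal, self-contained sketch of the suspected race. All of the names below (State, ExecutorState, Job, FakeExecutorAPI, "example-job") are stand-ins invented for illustration, not job-runner's real types; only the overall shape is taken from the traceback and discussion above: a guard keyed off the database's view of the job calls terminate() after the executor's view of the container has already moved on.

from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    # Stand-in for the database-level job state that run.py checks.
    RUNNING = auto()
    FAILED = auto()

class ExecutorState(Enum):
    # Stand-in for the executor-level state that local.py checks.
    EXECUTING = auto()
    EXECUTED = auto()  # the container has already finished

@dataclass
class Job:
    id: str
    state: State  # what the database last recorded

class FakeExecutorAPI:
    def __init__(self, actual_state):
        self.actual_state = actual_state

    def get_status(self, job):
        return self.actual_state

    def terminate(self, job):
        # Mirrors the strict guard seen in the traceback: only jobs the
        # executor still believes are live may be terminated.
        assert self.get_status(job) in [ExecutorState.EXECUTING]

# The database still says RUNNING, but the container has already exited
# (e.g. it errored when the db tables went away), so the executor reports
# EXECUTED and the guard fires.
job = Job(id="example-job", state=State.RUNNING)
api = FakeExecutorAPI(ExecutorState.EXECUTED)

try:
    if job.state == State.RUNNING:  # maintenance-mode style check (database view)
        api.terminate(job)
except AssertionError:
    print("tripped the assert: executor reported", api.get_status(job))

Running this prints the "tripped the assert" line, which is the same shape of mismatch as the AssertionError in the traceback above.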

@madwort
Contributor

madwort commented Feb 28, 2024

Just having a read around this: terminate() was called from the db maintenance mode block, which checks job.state == State.RUNNING. That state, I think, comes from the database rather than from the actual state of the world, whereas most of the rest of handle_job() bases its decisions on api.get_status().

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running, rather than only jobs that are definitely running, so maybe we should just update terminate() to handle this gracefully. When I did some work on job-runner before, I looked at cancellation but not at db maintenance mode, so it's not on my state transition diagram and I didn't consider this possibility when I added that assert.
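
For concreteness, here's a rough sketch of what "handle this gracefully" could look like: log the unexpected state and return, rather than asserting. This is not the real local.py code or signature (the real terminate() takes only the job_definition and looks the status up itself); ExecutorState, JobStatus and LIVE_STATES are stand-ins, and the actual list of acceptable states is the one in the truncated assertion above, not reproduced here.

import logging
from enum import Enum, auto
from typing import NamedTuple

log = logging.getLogger(__name__)

class ExecutorState(Enum):
    # Stand-in, not job-runner's real enum.
    EXECUTING = auto()
    EXECUTED = auto()

class JobStatus(NamedTuple):
    # Stand-in for the status object returned by get_status().
    state: ExecutorState

# Assumed set of states in which there is actually a container to stop.
LIVE_STATES = {ExecutorState.EXECUTING}

def terminate(job_definition, current_status):
    # Tolerant variant: log unexpected states and return, instead of asserting.
    if current_status.state not in LIVE_STATES:
        log.warning(
            "terminate() called for %s in unexpected state %s; nothing to do",
            job_definition,
            current_status.state,
        )
        return current_status
    # ... normal path would stop the container here ...
    return JobStatus(ExecutorState.EXECUTED)

logging.basicConfig(level=logging.WARNING)
terminate("job-123", JobStatus(ExecutorState.EXECUTED))  # logs a warning, no INTERNAL_ERROR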

@madwort
Contributor

madwort commented Feb 28, 2024

Here's a demo of the behaviour: #710

I think this has fallen through the cracks because we have unit tests of LocalDockerAPI in test_local_executor.py and unit tests of run.py::manage_jobs in test_run.py, but I don't think we have integration tests that run things through manage_jobs with the local executor. In that PR I've simulated the behaviour of run.py in test_local_executor, but obviously it would be better to exercise the real code here. Maybe it's safest to fix this issue in the executor anyway, and then we can use this failing test as-is. (Or we fix it in run.py, delete this test, and the path remains untested.)

@evansd
Contributor

evansd commented Feb 28, 2024

Ah, great, thanks for digging into this properly.

Maybe safest to fix this issue in the executor anyway & then we can use this failing test as-is

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than all jobs that are running, so maybe we should just update terminate() to handle this gracefully

Pragmatically, both of these feel like the right immediate next steps to me.

@bloodearnest
Member

Note: removing the tech-support label, as this is scheduled to be fixed during deck-scrubbing.
