TPP Job raised INTERNAL_ERROR #709

Open
iaindillingham opened this issue Feb 27, 2024 · 6 comments

@iaindillingham
Member

Honeycomb reports that a TPP Job raised an INTERNAL_ERROR [1]. The affected jobs in Job Request 22288 are:

I've scanned the logs on L3 (both are 53,466 lines long 🤔), but can't determine why the jobs failed.

Footnotes

  [1] https://bennettoxford.slack.com/archives/C0270Q313H7/p1708990931394309

@evansd
Contributor

evansd commented Feb 27, 2024

Interesting, looks like the state machine hit something unexpected:

Traceback (most recent call last):
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 153, in handle_single_job
    synchronous_transition = trace_handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 190, in trace_handle_job
    synchronous_transition = handle_job(job, api, mode, paused)
  File "/home/opensafely/jobrunner/code/jobrunner/run.py", line 232, in handle_job
    api.terminate(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/logging.py", line 41, in wrapper
    status = method(job_definition)
  File "/home/opensafely/jobrunner/code/jobrunner/executors/local.py", line 208, in terminate
    assert current_status.state in [
AssertionError

From Honeycomb logs:

I note that we entered maintenance mode immediately afterwards, so my assumption would be that it's somehow related. We've had race-condition-like issues around this before.

@bloodearnest
Member

I dug into it in the Slack thread, but here's my conclusion:

So, db maintenance mode killed it. It looks like the job was in the EXECUTING state, but apparently wasn't.
Hypothesis: we know that it takes a while for job-runner to detect db maintenance mode. It's possible both of those jobs actually errored due to missing db tables and were no longer running, but the loop hadn't yet got round to detecting that and moving their state to EXECUTED, so they tripped over this assert.
This also explains why neither container is still running: they'd already finished.
I think there are a couple of actions here:

  1. Remove or rework this assertion. Maybe we log unexpected states? Either way, I don't think it should cause an INTERNAL_ERROR.
  2. Follow up with TPP on adding a delay before removing tables: https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729
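
To make the hypothesis above concrete, here is a minimal, self-contained sketch of the suspected race. All of the names below (State, ExecutorState, Job, FakeExecutorAPI, "example-job") are stand-ins invented for illustration, not job-runner's real types; only the overall shape is taken from the traceback and discussion above: a guard keyed off the database's view of the job calls terminate() after the executor's view of the container has already moved on.

from dataclasses import dataclass
from enum import Enum, auto

class State(Enum):
    # Stand-in for the database-level job state that run.py checks.
    RUNNING = auto()
    FAILED = auto()

class ExecutorState(Enum):
    # Stand-in for the executor-level state that local.py checks.
    EXECUTING = auto()
    EXECUTED = auto()  # the container has already finished

@dataclass
class Job:
    id: str
    state: State  # what the database last recorded

class FakeExecutorAPI:
    def __init__(self, actual_state):
        self.actual_state = actual_state

    def get_status(self, job):
        return self.actual_state

    def terminate(self, job):
        # Mirrors the strict guard seen in the traceback: only jobs the
        # executor still believes are live may be terminated.
        assert self.get_status(job) in [ExecutorState.EXECUTING]

# The database still says RUNNING, but the container has already exited
# (e.g. it errored when the db tables went away), so the executor reports
# EXECUTED and the guard fires.
job = Job(id="example-job", state=State.RUNNING)
api = FakeExecutorAPI(ExecutorState.EXECUTED)

try:
    if job.state == State.RUNNING:  # maintenance-mode style check (database view)
        api.terminate(job)
except AssertionError:
    print("tripped the assert: executor reported", api.get_status(job))

Running this prints the "tripped the assert" line, which is the same shape of mismatch as the AssertionError in the traceback above.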

@madwort
Contributor

madwort commented Feb 28, 2024

Just having a read around this: terminate() was called from the db maintenance mode block, which checks job.state == State.RUNNING. That state, I think, comes from the database rather than from the actual state of the world, whereas most of the rest of handle_job() bases its decisions on api.get_status().

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running, rather than only jobs that are definitely running, so maybe we should just update terminate() to handle this gracefully. When I did some work on job-runner before, I looked at cancellation but not at db maintenance mode, so it's not on my state transition diagram and I didn't consider this possibility when I added that assert.
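
For concreteness, here's a rough sketch of what "handle this gracefully" could look like: log the unexpected state and return, rather than asserting. This is not the real local.py code or signature (the real terminate() takes only the job_definition and looks the status up itself); ExecutorState, JobStatus and LIVE_STATES are stand-ins, and the actual list of acceptable states is the one in the truncated assertion above, not reproduced here.

import logging
from enum import Enum, auto
from typing import NamedTuple

log = logging.getLogger(__name__)

class ExecutorState(Enum):
    # Stand-in, not job-runner's real enum.
    EXECUTING = auto()
    EXECUTED = auto()

class JobStatus(NamedTuple):
    # Stand-in for the status object returned by get_status().
    state: ExecutorState

# Assumed set of states in which there is actually a container to stop.
LIVE_STATES = {ExecutorState.EXECUTING}

def terminate(job_definition, current_status):
    # Tolerant variant: log unexpected states and return, instead of asserting.
    if current_status.state not in LIVE_STATES:
        log.warning(
            "terminate() called for %s in unexpected state %s; nothing to do",
            job_definition,
            current_status.state,
        )
        return current_status
    # ... normal path would stop the container here ...
    return JobStatus(ExecutorState.EXECUTED)

logging.basicConfig(level=logging.WARNING)
terminate("job-123", JobStatus(ExecutorState.EXECUTED))  # logs a warning, no INTERNAL_ERROR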

@madwort
Contributor

madwort commented Feb 28, 2024

Here's a demo of the behaviour: #710

I think this has fallen through the cracks because we have unit tests of LocalDockerAPI in test_local_executor.py and unit tests of run.py::manage_jobs in test_run.py, but I don't think we have integration tests that run things through manage_jobs with the local executor. In that PR I've simulated the behaviour of run.py in test_local_executor, but obviously it would be better to exercise the real code here. Maybe it's safest to fix this issue in the executor anyway, and then we can use this failing test as-is. (Or we fix it in run.py, delete this test, and the path remains untested.)

@evansd
Contributor

evansd commented Feb 28, 2024

Ah, great, thanks for digging into this properly.

Maybe safest to fix this issue in the executor anyway & then we can use this failing test as-is

I think it's a reasonable assumption for maintenance mode to attempt to stop all jobs that could possibly be running rather than all jobs that are running, so maybe we should just update terminate() to handle this gracefully

Pragmatically, both of these feel like the right immediate next steps to me.

@bloodearnest
Member

Note: removing the tech-support label, as this is scheduled to be fixed during deck-scrubbing.
