Understand why pipeline dependencies are confusing for end user #770

Open
sebbacon opened this issue Mar 31, 2022 · 3 comments

@sebbacon
Contributor

This job failed with a transient DB error, so we asked the user to re-run it. It's the bottom job on this page (link):

image

The user reports

the failed job was completed successfully...however, its status remains failed ... and that leads to the failure of the notebook action with the following error message: "generate_study_population_hospitalisation_4 failed on a previous run and must be re-run". Just in case, I tried the run_all but it failed as well (see here). My understanding is that a job needs to be run with the failed cohort extractor action and notebook action together at the same time. Am I right?

  • Why does the user believe the job completed successfully? There's no evidence of that. Perhaps they misinterpreted our instruction that it was safe to re-run?
  • Or perhaps they did re-run it, and something else has gone wrong?

When we've understood this, consider whether there are any small UI tweaks or terminology changes we could make to improve intelligibility.

@Jongmassey
Contributor

The "completed successfully" job was a reference to this job: I believe @evansd fettled its status after an unspecified internal error meant the job was reported as failed when it hadn't actually failed.

@sebbacon
Contributor Author

My hypothesis is that the user didn't realise they needed to re-run generate_study_population_hospitalisation_4, presumably because the last time they experienced a failure and we fixed it, the fix meant the action that had appeared to fail turned out not to have failed. This time, by contrast, they need to start a new job.

@sebbacon
Contributor Author

Ah, in fact I now realise we actually advised them incorrectly:

Following this, one step of the job failed with a database connection error after some time. Several other jobs from other projects failed at the same time, and both TPP and the OpenSAFELY tech team are investigating the root cause and implementing mitigations.

The failed job was reset by us to run again and completed successfully
