Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write metadata for all output files to manifest.json #724

Merged
merged 2 commits into from
Apr 25, 2024
Merged

Conversation

bloodearnest
Copy link
Member

  • Write output metadata to manifest.json
  • Add a cli command to backfill manifests for all workspaces

The main motivation for this was allowing airlock to access commit
data for level 4.

However, it was simple to extend this idea to include a) all outputs and
b) much more metadata, like job, action, size, timestamp, content hash,
and whether a file has been excluded from level 4 and why. This data is
useful for airlock, but also potentially for other systems and general
reporting.
@bloodearnest bloodearnest changed the title l4 manifest Write metadata for all output files to manifest.json Apr 24, 2024
I have manually tested this using the db from the test backend. I have
not included automated tests, as a) this is a one off migration command,
b) they would be quite hard to write and c) this is an offline process
that is not mission critical
Copy link
Contributor

@evansd evansd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

One more general question I have is under what circumstances old outputs get deleted? It used to be that if you ran action_x and it produced a different set of outputs to last time then any previous outputs would get deleted (and removed from the manifest). But I think that general logic has changed at some point? In any case, as long as the manifest reflects what's actually on disk then I think that's OK.

jobrunner/cli/manifests.py Show resolved Hide resolved
jobrunner/cli/manifests.py Show resolved Hide resolved
jobrunner/cli/manifests.py Show resolved Hide resolved
@bloodearnest
Copy link
Member Author

Looks good.

One more general question I have is under what circumstances old outputs get deleted? It used to be that if you ran action_x and it produced a different set of outputs to last time then any previous outputs would get deleted (and removed from the manifest). But I think that general logic has changed at some point? In any case, as long as the manifest reflects what's actually on disk then I think that's OK.

I think that is still done? https://github.com/opensafely-core/job-runner/blob/main/jobrunner/run.py#L384

It's just done in the scheduler, not the executor (which has no knowledge of any outputs except the current job's outputs)

@bloodearnest bloodearnest merged commit ba22eb5 into main Apr 25, 2024
14 checks passed
@bloodearnest bloodearnest deleted the l4-manifest branch April 25, 2024 13:07
@evansd
Copy link
Contributor

evansd commented Apr 25, 2024

It's just done in the scheduler, not the executor (which has no knowledge of any outputs except the current job's outputs)

Ah yeah, I see. Doesn't that mean though that these don't get removed from the manifest and so that will end up with stale outputs in it which don't actually exist on disk?

@bloodearnest
Copy link
Member Author

bloodearnest commented Apr 25, 2024

It's just done in the scheduler, not the executor (which has no knowledge of any outputs except the current job's outputs)

Ah yeah, I see. Doesn't that mean though that these don't get removed from the manifest and so that will end up with stale outputs in it which don't actually exist on disk?

Oh rats, yes it will.

I think we'll need to make the executor APIs delete_files() update the manifest too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants