Exactly-once/at-least-once task execution for a topped-up pipeline #14347
Unanswered
nerdstrike
asked this question in
Q&A
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
We have pipelines that are very expensive and should not be repeated for the same inputs if possible. We also have an unreliable upstream source, so we look for new inputs hourly rather than be notified when to run. Sometimes those "new" inputs can reside for several months! When everything falls over, we might retry all the inputs and trigger more instances of work that are already completed. We want to reject any that the workflow system has seen/executed/has in-progress.
So far, I have only really found exactly-once discussed in the context of heavyweight systems like Kafka, but I see that Prefect implements a unique per-task signature. The goal would be to have a system that I throw inputs at and only the new ones stick and get executed. I wonder whether we could achieve what we're hoping to with Prefect if we turn off/elongate the task table's clean-up cadence (of a week or whatever it is?). Obviously that leads to an ever-growing task table, but I think that's more tractable than implementing reliable checkpoint in each pipeline to self-terminate, or to skip submission of a previously done piece of work. To be clear: We need to not run the same inputs concurrently, and we want to avoid unintentionally running a previously completed input weeks later.
My questions are:
As always the prototype becomes production, so the tech choice is important to get right.
Caveats: I've never done anything with Dask, I've only toyed with Prefect enough to be impressed.
Thank you for any help you can provide!
Beta Was this translation helpful? Give feedback.
All reactions