Recover from pipeline errors #1559
I think this is an excellent start. Restart on failure with backoff covers the majority of cases where Conduit can take action. Additionally, if a connector returns a fatal error and stops processing, how do we know whether the connector really meant to stop or simply encountered a recoverable error? There is no distinction. Furthermore, the connector itself has more context. An open question: should connectors be allowed to return non-fatal (retryable) errors to instruct Conduit to cycle through?
Can we let the processor decide when it is game over and when it isn't? For example, a fatal fault should stop the pipeline.
Does this imply that the pipeline is placed in a 'recovering' state while the failed connector is being cycled?
Conduit has a very strict engine: whenever a pipeline experiences an error, the pipeline is stopped and put into a degraded state. A user has to step in, investigate the error, and restart the pipeline when it's safe to do so. Depending on the connector implementation, even transient errors (e.g. connectivity issues) can cause a pipeline to stop, which quickly gets annoying in a production environment.
Rather than relying on each connector to recover from transient errors, Conduit should include a way to automatically restart failing pipelines, as a best effort to keep them running. Given that Conduit already tracks acknowledged record positions, restarting a pipeline is a safe operation that shouldn't result in lost data. In the worst-case scenario, under the wrong circumstances, some records might get delivered twice (which idempotent destination connectors should handle gracefully).
The proposal is to introduce a robust error-handling mechanism within Conduit that distinguishes transient errors from fatal ones and handles each accordingly, ensuring the reliability and resilience of pipelines.
Introducing Fatal Errors
Fatal errors are errors that are not recoverable and cause the pipeline to stop permanently. When a fatal error occurs, the pipeline is stopped and no further attempts are made to restart it. This ensures that critical issues are promptly addressed by the user and prevents Conduit from entering an endless loop of restarting failed pipelines.
By default, all errors should be considered non-fatal and recoverable. If an error is fatal, it needs to be marked as such.
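One way to picture "marked as fatal" is a wrapper error type that Conduit could detect anywhere in an error chain. The sketch below is only an illustration under that assumption: `FatalError`, `NewFatalError`, `IsFatal`, and the two sample errors are hypothetical names, not part of Conduit or its connector SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// FatalError is a hypothetical wrapper that marks an error as non-recoverable.
// Any error that is not wrapped this way is treated as recoverable by default.
type FatalError struct {
	Err error
}

func (e *FatalError) Error() string { return "fatal error: " + e.Err.Error() }

// Unwrap exposes the wrapped error to errors.Is and errors.As.
func (e *FatalError) Unwrap() error { return e.Err }

// NewFatalError marks err as fatal; a nil error stays nil.
func NewFatalError(err error) error {
	if err == nil {
		return nil
	}
	return &FatalError{Err: err}
}

// IsFatal reports whether any error in err's chain was marked as fatal.
func IsFatal(err error) bool {
	var fatal *FatalError
	return errors.As(err, &fatal)
}

// Sample errors, made up for this illustration.
var (
	errConnReset     = errors.New("connection reset")
	errInvalidConfig = errors.New("invalid configuration")
)

func main() {
	transient := fmt.Errorf("read failed: %w", errConnReset)
	fatal := fmt.Errorf("open failed: %w", NewFatalError(errInvalidConfig))

	fmt.Println(IsFatal(transient)) // false: unmarked errors are recoverable
	fmt.Println(IsFatal(fatal))     // true: the mark survives further wrapping
}
```

Because the check walks the whole error chain, code between the connector and the engine can keep adding context with `%w` without losing the fatal mark.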
Recovering with a Backoff Delay
To mitigate transient errors and prevent constant restarts of pipelines, Conduit will implement a backoff delay strategy for restarting pipelines. The backoff delay introduces a waiting period before attempting to restart the pipeline, allowing time for potential issues to resolve themselves.
To prevent continuous attempts at restarting a pipeline that keeps failing, Conduit will make it possible to set a limit on the number of consecutive failures before the pipeline is stopped permanently. As soon as the pipeline successfully processes at least one record, or runs without encountering an error for a configurable time frame, the count of consecutive failures is reset.
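The counting rule above could be sketched as follows. This is a minimal illustration, not Conduit's implementation: the `failureTracker` type, its field names, and the choice to stop once the count reaches the limit are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// failureTracker counts consecutive failed runs of one pipeline.
// All names and thresholds here are hypothetical.
type failureTracker struct {
	maxFailures  int           // consecutive failures allowed before stopping for good
	healthyAfter time.Duration // error-free run time that resets the count
	failures     int
}

// recordFailedRun registers a run that ended in a non-fatal error and reports
// whether the pipeline may be restarted. A run that processed at least one
// record, or that ran for at least healthyAfter, resets the count first.
func (t *failureTracker) recordFailedRun(recordsProcessed int, ranFor time.Duration) bool {
	if recordsProcessed > 0 || ranFor >= t.healthyAfter {
		t.failures = 0
	}
	t.failures++
	// Assumption: the pipeline stops once the count reaches the limit.
	return t.failures < t.maxFailures
}

func main() {
	tr := &failureTracker{maxFailures: 3, healthyAfter: 5 * time.Minute}

	fmt.Println(tr.recordFailedRun(0, time.Second))  // true  (1 consecutive failure)
	fmt.Println(tr.recordFailedRun(0, time.Second))  // true  (2 consecutive failures)
	fmt.Println(tr.recordFailedRun(10, time.Second)) // true  (count reset, now 1)
	fmt.Println(tr.recordFailedRun(0, time.Second))  // true  (2 consecutive failures)
	fmt.Println(tr.recordFailedRun(0, time.Second))  // false (limit of 3 reached)
}
```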
The backoff delay parameters will be configurable to allow users to adjust the delay based on their requirements. Conduit should provide sane defaults for the backoff delay functionality:
This results in a default delay progression of 1s, 2s, 4s, 8s, 16s, 32s, 1m, ..., ensuring a balance between allowing time for recovery and minimizing downtime.
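The progression above is consistent with an exponential backoff that starts at 1s and doubles on every attempt. The sketch below computes such a delay under those assumptions. The function name, its parameters, and in particular the 10-minute cap are hypothetical and chosen only for illustration; the actual defaults are not taken from this text.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before restart attempt number `attempt`
// (0-based): minDelay multiplied by factor once per previous attempt,
// capped at maxDelay.
func backoffDelay(attempt int, minDelay, maxDelay time.Duration, factor int) time.Duration {
	delay := minDelay
	for i := 0; i < attempt; i++ {
		delay *= time.Duration(factor)
		if delay >= maxDelay {
			// Returning early also keeps the multiplication from overflowing.
			return maxDelay
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	// Assumed parameters: start at 1s, double each time, cap at 10m.
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 10*time.Minute, 2))
	}
	// Delays printed: 1s, 2s, 4s, 8s, 16s, 32s, 1m4s, 2m8s
}
```

Note that strict doubling yields 1m4s at the seventh attempt, where the progression in the text lists a rounded 1m.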
Open Questions