Recover from pipeline errors #1559

lovromazgon · Maintainer · 2024-05-13T18:18:27Z

Conduit has a very strict engine: whenever a pipeline experiences an error, the pipeline is stopped and put in a degraded state. A user has to step in, investigate the error, and restart the pipeline when it's safe to do so. Depending on the connector implementation, even transient errors (e.g. connectivity issues) can cause a pipeline to stop, which quickly gets annoying in a production environment. Rather than relying on every connector being able to recover from transient errors, Conduit should include a way to automatically restart failing pipelines, as a best effort to keep them running. Given that Conduit already tracks acknowledged record positions, restarting a pipeline is a safe operation that shouldn't result in lost data. In the worst-case scenario, under the ~~right~~ wrong circumstances, some records might get delivered twice (which idempotent destination connectors should handle gracefully).

The proposal is to introduce a robust error-handling mechanism within Conduit that distinguishes between transient and fatal errors and handles each appropriately, ensuring the reliability and resilience of pipelines.

Introducing Fatal Errors

Fatal errors are defined as errors that are not recoverable, causing the pipeline to stop permanently. When a fatal error occurs, the pipeline will be stopped, and no further attempts will be made to restart it. This ensures that critical issues are promptly addressed by the user and prevents Conduit from entering an endless loop of restarting failed pipelines.

By default, all errors should be considered non-fatal and recoverable. If an error is fatal, it needs to be marked as such.
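One way to mark an error as fatal in Go is to wrap it in a dedicated type and check for that type with `errors.As`. The sketch below is only an illustration under that assumption: the names `FatalError`, `NewFatalError` and `IsFatal` are hypothetical and do not come from Conduit or its connector SDK, and the example errors are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// FatalError is a hypothetical wrapper type (not part of Conduit's API).
// Wrapping an error in it marks the error as non-recoverable.
type FatalError struct {
	Err error
}

// Error implements the error interface.
func (e *FatalError) Error() string { return "fatal error: " + e.Err.Error() }

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *FatalError) Unwrap() error { return e.Err }

// NewFatalError marks err as fatal. A nil error stays nil.
func NewFatalError(err error) error {
	if err == nil {
		return nil
	}
	return &FatalError{Err: err}
}

// IsFatal reports whether err, or any error it wraps, was marked as fatal.
// Unmarked errors are treated as recoverable, matching the proposed default.
func IsFatal(err error) bool {
	var fatal *FatalError
	return errors.As(err, &fatal)
}

// Example errors a connector might return (made up for illustration).
var (
	errConnReset = errors.New("connection reset by peer")
	errBadCreds  = NewFatalError(errors.New("invalid credentials"))
	// The fatal mark survives further wrapping with %w.
	errWrappedFatal = fmt.Errorf("source failed: %w", errBadCreds)
)
```

With this shape, `IsFatal` returns false for any unmarked error, so existing connectors would stay recoverable without changes, while a connector that wants to break the restart loop wraps its error once at the point where it knows recovery is impossible.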

Recovering with a Backoff Delay

To mitigate transient errors and prevent constant restarts of pipelines, Conduit will implement a backoff delay strategy for restarting pipelines. The backoff delay introduces a waiting period before attempting to restart the pipeline, allowing time for potential issues to resolve themselves.

To prevent continuous attempts at restarting a pipeline that constantly fails, Conduit will make it possible to limit the number of consecutive failures before the pipeline is stopped permanently. As soon as the pipeline successfully processes at least one record, or runs without encountering an error for a configurable time frame, the count of consecutive failures is reset.
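The reset rules for the failure count can be sketched as a small counter. This is a rough illustration, not Conduit code: `failureTracker` and its methods are hypothetical names, and a negative `maxRetries` is assumed here to mean "retry forever".

```go
package main

import "time"

// failureTracker counts consecutive pipeline failures. All names are
// hypothetical and only illustrate the reset rules described above.
type failureTracker struct {
	maxRetries  int           // assumption: a negative value means "retry forever"
	resetAfter  time.Duration // healthy run time after which the count is reset
	consecutive int           // current number of consecutive failures
}

// newFailureTracker uses the proposed default of 1 minute for resetAfter.
func newFailureTracker(maxRetries int) *failureTracker {
	return &failureTracker{maxRetries: maxRetries, resetAfter: time.Minute}
}

// onRecordProcessed resets the count, because the pipeline made progress.
func (t *failureTracker) onRecordProcessed() { t.consecutive = 0 }

// onFailure records the failure of a run that lasted ranFor and reports
// whether the pipeline may be restarted (true) or must stop for good (false).
func (t *failureTracker) onFailure(ranFor time.Duration) bool {
	if ranFor >= t.resetAfter {
		// The pipeline ran long enough without errors, so start counting anew.
		t.consecutive = 0
	}
	t.consecutive++
	return t.maxRetries < 0 || t.consecutive <= t.maxRetries
}
```

For example, with `newFailureTracker(2)` the first two immediate failures allow a restart and the third does not, unless a record was processed in between.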

The backoff delay parameters will be configurable to allow users to adjust the delay based on their requirements. Conduit should provide sane defaults for the backoff delay functionality:

- Minimum delay before restart (default: 1 second)
- Maximum delay before restart (default: 1 minute)
- Backoff factor (default: 2)
- Maximum number of retries (default: infinite)
- Delay after which the fail count is reset (default: 1 minute)

This results in a default delay progression of 1s, 2s, 4s, 8s, 16s, 32s, 1m, ..., ensuring a balance between allowing time for recovery and minimizing downtime.
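The progression above follows from multiplying the minimum delay by the backoff factor on each attempt and capping the result at the maximum delay. A minimal sketch of that calculation, assuming hypothetical names (`BackoffConfig`, `DefaultBackoffConfig`, `Delay`) that are not taken from Conduit:

```go
package main

import "time"

// BackoffConfig mirrors the delay parameters listed above.
// The type and field names are hypothetical.
type BackoffConfig struct {
	MinDelay time.Duration
	MaxDelay time.Duration
	Factor   float64
}

// DefaultBackoffConfig returns the defaults proposed in this discussion.
func DefaultBackoffConfig() BackoffConfig {
	return BackoffConfig{MinDelay: time.Second, MaxDelay: time.Minute, Factor: 2}
}

// Delay returns the wait time before restart number attempt (0-based).
// The delay starts at MinDelay, grows by Factor and is capped at MaxDelay.
func (c BackoffConfig) Delay(attempt int) time.Duration {
	delay := float64(c.MinDelay)
	for i := 0; i < attempt; i++ {
		delay *= c.Factor
		if delay >= float64(c.MaxDelay) {
			// Stop early so very large attempt numbers cannot overflow.
			return c.MaxDelay
		}
	}
	if delay > float64(c.MaxDelay) {
		// Guards against a misconfigured MinDelay larger than MaxDelay.
		return c.MaxDelay
	}
	return time.Duration(delay)
}
```

With the defaults, `Delay(0)` through `Delay(5)` give 1s, 2s, 4s, 8s, 16s and 32s. The next step would be 64s, so from `Delay(6)` onwards the result is capped at 1m.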

Open Questions

- Should failed processors trigger a retry, or should the pipeline be considered failed immediately?
- How does this functionality interact with a Dead Letter Queue (DLQ)?
  - Restarting a pipeline with a nack threshold DLQ might result in records being continuously sent to the DLQ up to the threshold and then restarting.
- In the event of a failure in the source connector, should the destination connector be restarted as well, or should it continue operating independently until the source connector is restored?
  - The answer to this question affects where we catch errors and implement restarts.

lyuboxa · Maintainer · 2024-05-13T18:37:26Z

I think this is an excellent start. Restart on failure with backoff covers the majority of the cases where Conduit can take action. Since each connector may represent a different kind of data source/destination with its own traits, it is not up to Conduit to decide how to handle these failures.

Additionally, if a connector returns a fatal error and stops processing, how do we know whether the connector really meant to stop or simply encountered a recoverable error? There is no distinction. Furthermore, the connector itself has more context about the data source; Conduit does not.

An open question: should connectors be allowed to return non-fatal (retryable) errors to instruct Conduit to cycle through? An intermittent network connection failure is one such example, where the connector may lean on Conduit to restart the pipeline rather than implement its own error handling with backoff and restart/reconnect.

> Should failed processors trigger a retry, or should the pipeline be considered failed immediately?

Can we let the processor decide when it is game over and when it isn't? A fatal fault should stop the pipeline, but an intermittent one may request a state reset (a restart of the pipeline).

> In the event of a failure in the source connector, should the destination connector be restarted as well, or should it continue operating independently until the source connector is restored?

Does this imply that the pipeline is placed in a 'recovering' state while the failed connector is being cycled?

2 replies

lovromazgon May 14, 2024
Maintainer Author

> if a connector returns a fatal error and stops processing, how do we know if the connector really meant to stop or it simply encountered a recoverable error
> ...
> should the connectors be allowed to return non-fatal (retry-able) errors to instruct Conduit to cycle through?

In my proposal I'm arguing that Conduit should regard all errors as recoverable by default. This means that existing connectors will get the recovery functionality without needing any changes. However, I'm also proposing to introduce so-called "fatal errors". A connector developer needs to explicitly mark an error as fatal to make the Conduit pipeline stop permanently and break the restart loop.

The reasoning behind this is simply to minimize the need to update existing connectors.

> Can we let the processor decide when it is game over and when it isn't?

Yes, the paragraph above answers this question.

> Does this imply that pipeline is placed in a 'recovering' state while the failed connector is being cycled?

Good point, we will need to introduce another pipeline state. We can generally improve pipeline state handling while we're at it, as the state is currently not guarded by a lock and it's hard to wait for a state change.
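As a rough illustration of what a lock-guarded pipeline state with a way to wait for changes could look like, here is a minimal sketch. It is not how Conduit stores pipeline state: the `Status` values other than the existing degraded state, the `statusTracker` type and its methods are all hypothetical.

```go
package main

import "sync"

// Status is a hypothetical pipeline status type.
type Status string

const (
	StatusRunning    Status = "running"
	StatusRecovering Status = "recovering" // the new state discussed above
	StatusDegraded   Status = "degraded"
)

// statusTracker guards the status with a mutex. The changed channel is closed
// and replaced on every transition, which lets callers wait for a change.
type statusTracker struct {
	mu      sync.Mutex
	status  Status
	changed chan struct{}
}

func newStatusTracker(initial Status) *statusTracker {
	return &statusTracker{status: initial, changed: make(chan struct{})}
}

// Get returns the current status.
func (s *statusTracker) Get() Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status
}

// Set updates the status and wakes up all waiters if it actually changed.
func (s *statusTracker) Set(next Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status == next {
		return
	}
	s.status = next
	close(s.changed)
	s.changed = make(chan struct{})
}

// WaitFor blocks until the status equals want (returns true) or until done is
// closed (returns false). A context's Done() channel can be passed as done.
func (s *statusTracker) WaitFor(done <-chan struct{}, want Status) bool {
	for {
		s.mu.Lock()
		current, changed := s.status, s.changed
		s.mu.Unlock()
		if current == want {
			return true
		}
		select {
		case <-changed: // status changed, check again
		case <-done:
			return false
		}
	}
}
```

A restart loop could then call `Set(StatusRecovering)` when a run fails and `Set(StatusRunning)` once the pipeline is back up, while API handlers or tests use `WaitFor` instead of polling.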

lyuboxa May 14, 2024
Maintainer

[..]

> The reasoning behind this is simply to minimize the need to update existing connectors.

Got it, makes sense.

> > Does this imply that pipeline is placed in a 'recovering' state while the failed connector is being cycled?
>
> Good point, we will need to introduce another pipeline state. We can generally improve pipeline state handling while we're at it, as the state is currently not guarded by a lock and it's hard to wait for a state change.

In general, I think the connectors/processors that generate data ought to be given more thought. The destination will not take action until the former provide data to use. But I think a recovering state will signal that the pipeline has experienced a fault and is in the process of recovering, essentially the loop of running -> fault -> recovery -> running.
