test: recovery test for sources #16356

tabVersion · 2024-04-17T07:51:59Z

Background

In our previous tests, we did not thoroughly validate whether each source connector could resume consumption from any specific record position while ensuring exactly-once processing. The primary challenges in achieving this were the need to restart clusters within the CI environment to perform recovery operations and the inability to control the consumption rate to manage progress.

Motivation

The absence of specific tests led to unintentional breaches of exactly-once semantics when building new sources. This has been a critical issue, especially since we found (thanks to @stdrc) that the fs source connector could read the current message twice during recovery. By implementing these tests, we aim to strengthen the exactly-once guarantees across all connectors.

Prerequisites

We now support triggering recovery via SQL commands (#16259), controlling read speeds with rate limits (#15948), and truncating chunks at any position. Our new support for bash commands in the slt (#12451 (comment)) makes it easier to control external components. The 'key as' syntax lets us clearly mark each message's offset so we can check for overlaps (#13707).

Procedure

To test recovery, we can use a small dataset, for example 20 messages in Kafka, with stream parallelism set to 1 and streaming_rate_limit set to 1, so that ingestion is spread over roughly 20 seconds. We then trigger recovery at an arbitrary point within that 0-20 second window and verify that all 20 messages are read and that no offset appears more than once. The same test applies to the fs source: if a data line is read twice, the same offset will be recorded twice.

This test will help ensure that all our connectors meet the exactly-once requirements, safeguarding the integrity of our data processing systems.
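The check described in the procedure can be sketched as follows. This is only an illustration, not the project's test code: the function name `check_exactly_once` is hypothetical, and it assumes the recorded offsets are contiguous integers starting at 0 (as for a single Kafka partition), which may not hold for every connector.

```python
from collections import Counter


def check_exactly_once(offsets, expected_count):
    """Return (duplicates, missing) for the recorded offsets.

    Assumption: offsets are the integers 0..expected_count-1,
    each of which should appear exactly once after recovery.
    """
    counts = Counter(offsets)
    # An offset seen more than once means a message was read twice.
    duplicates = sorted(o for o, n in counts.items() if n > 1)
    # An offset never seen means a message was skipped.
    missing = sorted(set(range(expected_count)) - set(counts))
    return duplicates, missing


if __name__ == "__main__":
    # Hypothetical result: offset 7 was re-read after recovery.
    recorded = list(range(20)) + [7]
    print(check_exactly_once(recorded, 20))
```

An empty `duplicates` list together with an empty `missing` list would correspond to the exactly-once outcome the test is looking for.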

kwannoel · 2024-04-17T08:29:44Z

Agree! Besides e2e unit tests, I would also like to have fuzzing, similar to the e2e deterministic recovery test.
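One way such fuzzing could pick its recovery points is with a seeded random schedule, so that a failing run can be replayed. This is a rough sketch under assumptions: `recovery_schedule` and its parameters are hypothetical names, and the 20-second window simply mirrors the procedure above.

```python
import random


def recovery_schedule(seed, n_triggers=3, window_s=20.0):
    """Return sorted trigger times (in seconds) within the test window.

    The same seed always yields the same schedule, so a failure
    found by the fuzzer can be reproduced.
    """
    rng = random.Random(seed)
    return sorted(rng.uniform(0.0, window_s) for _ in range(n_triggers))


if __name__ == "__main__":
    for t in recovery_schedule(seed=42):
        print(f"trigger recovery at {t:.2f}s")
```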

kwannoel · 2024-05-14T09:53:56Z

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.
After running the recovery part, sleep for some duration, then check that the records have still been ingested.
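Instead of a fixed sleep, the check could poll until the expected records show up or a timeout expires. The helper below is a minimal sketch; `wait_until` is a hypothetical name, and the predicate passed to it (for example, a function that queries the row count) is left to the caller.

```python
import time


def wait_until(predicate, timeout_s=30.0, interval_s=0.5):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in predicate: succeeds on the third poll.
    polls = []
    ok = wait_until(lambda: polls.append(1) or len(polls) >= 3,
                    timeout_s=5.0, interval_s=0.1)
    print(ok, len(polls))
```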

Test Variables:

DDL: create sink, create source + mv, create table with source.
recovery: crash loop scenario (keep triggering restarts), long recovery (20mins).
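The variables above combine into a small test matrix. As a sketch (the labels are copied from the list, while `test_matrix` and the dictionary layout are hypothetical), the combinations could be enumerated like this:

```python
from itertools import product

# Labels taken from the "Test Variables" list above.
DDL_KINDS = ["create sink", "create source + mv", "create table with source"]
RECOVERY_SCENARIOS = ["crash loop", "long recovery (20 min)"]


def test_matrix():
    """Return every (DDL, recovery scenario) combination as a dict."""
    return [
        {"ddl": ddl, "recovery": scenario}
        for ddl, scenario in product(DDL_KINDS, RECOVERY_SCENARIOS)
    ]


if __name__ == "__main__":
    for case in test_matrix():
        print(case)
```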

xxchan · 2024-05-14T15:41:58Z

I've managed to use the RECOVER command to write recovery tests: https://github.com/risingwavelabs/risingwave/pull/16733/files#diff-b83fa16ce469bbf54f92cf95f9d778e4cd0acf6a20489589c8bb636b6162da02

xxchan · 2024-05-14T15:43:53Z

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.

To do this, sqllogictest needs to be enhanced, because the connection to fe will be disconnected (#12451 (comment)).

tabVersion · 2024-05-14T19:48:53Z

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.

To do this, sqllogictest needs to be enhanced, because the connection to fe will be disconnected (#12451 (comment)).

IIRC, the recovery process happens entirely inside meta, so connections to the fe are not affected.

st1page · 2024-06-07T09:29:59Z

Motivation
The absence of specific tests led to unintentional breaches of exactly-once semantics when building new sources. This has been a critical issue, especially since we found (thanks to @stdrc) that the fs source connector could read the current message twice during recovery. By implementing these tests, we aim to strengthen the exactly-once guarantees across all connectors.

In addition to the exactly-once guarantee, it is also worthwhile to simply test whether re-creating the subscription to the upstream system after recovery causes problems, such as the issue in #17112.

tabVersion added the type/feature label Apr 17, 2024

github-actions bot added this to the release-1.9 milestone Apr 17, 2024

xxchan changed the title ~~test coverage: recover test connectors~~ test: recovery test for sources May 12, 2024

xxchan mentioned this issue May 14, 2024

feat: rework pubsub source to support parallel read and at-least-once #16733

Merged

13 tasks

fuyufjh closed this as completed May 14, 2024

kwannoel reopened this May 14, 2024

kwannoel modified the milestones: release-1.9, release-1.10 May 14, 2024

kwannoel self-assigned this May 14, 2024

xxchan mentioned this issue Aug 6, 2024

feat(cdc): support sql server cdc #17429

Merged

9 tasks
