test: recovery test for sources #16356

Open
tabVersion opened this issue Apr 17, 2024 · 6 comments

@tabVersion
Contributor

  • Background

In our previous tests, we did not thoroughly validate whether each source connector could resume consumption from any specific record position while ensuring exactly-once processing. The primary challenges in achieving this were the need to restart clusters within the CI environment to perform recovery operations and the inability to control the consumption rate to manage progress.

  • Motivation

The absence of specific tests led to unintentional breaches of exactly-once semantics when building new sources. This has been a critical issue, especially since we found that the fs source connector could read the current message twice during recovery (thanks to @stdrc for identifying this). By implementing these tests, we aim to strengthen the exactly-once guarantees across all connectors.
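
To make this failure mode concrete, here is a toy model of one way a duplicate read can arise on recovery: resuming at the last recorded offset instead of just after it. This is only an illustration. The function name, the `resume_inclusive` flag, and the assumption that the last read offset is persisted exactly are all hypothetical, and the sketch does not claim to describe the actual cause of the fs source bug.

```python
import random


def run_with_recovery(num_messages, crash_at, resume_inclusive):
    """Toy model: read offsets 0..num_messages-1, crash once, then resume.

    `crash_at` is the number of messages read before the crash.
    `resume_inclusive=True` models a source that restarts AT the last
    recorded offset instead of just after it (a hypothetical bug).
    """
    seen = list(range(crash_at))  # offsets read before the crash
    if crash_at == 0:
        resume_from = 0  # nothing was recorded yet, start from the beginning
    else:
        last_recorded = crash_at - 1
        resume_from = last_recorded if resume_inclusive else last_recorded + 1
    seen.extend(range(resume_from, num_messages))
    return seen


# Crash at a random point, as the proposed test would.
crash_at = random.randint(0, 20)
correct = run_with_recovery(20, crash_at, resume_inclusive=False)
buggy = run_with_recovery(20, crash_at, resume_inclusive=True)
print(f"crash_at={crash_at} correct={len(correct)} buggy={len(buggy)}")
```

In this model the correct variant always yields each offset once, while the inclusive variant repeats one offset whenever at least one message was read before the crash.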

  • Prerequisites

We now support triggering recovery via SQL commands (#16259), controlling read speed with rate limits (#15948), and truncating chunks at any position. Our new support for bash commands in the slt (#12451 (comment)) makes it easier to control external components. The 'key as' syntax lets us clearly mark each message's offset so we can check for overlaps (#13707).

  • Procedure

To test recovery, we can use a small dataset, for example 20 messages in Kafka, with stream parallelism set to 1 and streaming_rate_limit set to 1. We then trigger recovery at an arbitrary point within the 0-20 second window, and verify that all 20 messages are read and that no offset appears more than once. The same test applies to the fs source: if a data line is read twice, the same offset will be recorded twice.
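
The final check in this procedure can be sketched as follows. This is a minimal illustration, not part of the existing test suite: `check_exactly_once` is a hypothetical helper, and it assumes a single split whose offsets are the integers 0 to 19, which real connectors need not satisfy.

```python
from collections import Counter


def check_exactly_once(offsets, expected_count):
    """Return (duplicates, missing) for the offsets read from one split.

    Assumes the expected offsets are 0..expected_count-1. That is an
    assumption of this sketch, not a guarantee made by any connector.
    """
    counts = Counter(offsets)
    duplicates = sorted(o for o, n in counts.items() if n > 1)
    missing = sorted(set(range(expected_count)) - set(counts))
    return duplicates, missing


# A clean run: all 20 offsets appear exactly once.
clean = list(range(20))
# A run where offset 7 was re-read after recovery.
with_dup = list(range(8)) + list(range(7, 20))

print(check_exactly_once(clean, 20))
print(check_exactly_once(with_dup, 20))
```

A non-empty `duplicates` list means at-least-once rather than exactly-once delivery; a non-empty `missing` list means data was lost across the recovery.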

This test will help ensure that all our connectors meet the exactly-once requirements, safeguarding the integrity of our data processing systems.

@github-actions github-actions bot added this to the release-1.9 milestone Apr 17, 2024
@kwannoel
Contributor

Agree! Besides e2e unit tests, I would also like to have fuzzing similar to that of the e2e deterministic recovery test.

@xxchan xxchan changed the title from "test coverage: recover test connectors" to "test: recovery test for sources" May 12, 2024
@fuyufjh fuyufjh closed this as completed May 14, 2024
@kwannoel kwannoel reopened this May 14, 2024
@kwannoel kwannoel modified the milestones: release-1.9, release-1.10 May 14, 2024
@kwannoel kwannoel self-assigned this May 14, 2024
@kwannoel
Contributor

kwannoel commented May 14, 2024

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.
After running the recovery part, sleep for some duration, then check that the records have still been ingested.

Test Variables:

  1. DDL: create sink, create source + mv, create table with source.
  2. recovery: crash loop scenario (keep triggering restarts), long recovery (20 mins).
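
One way to read the two variables above is as a test matrix in which every DDL variant is combined with every recovery scenario. The sketch below only enumerates the combinations; the constant and function names are illustrative, and nothing here drives a real cluster.

```python
from itertools import product

# Labels taken from the list above; the identifiers are illustrative only.
DDL_VARIANTS = ["create sink", "create source + mv", "create table with source"]
RECOVERY_VARIANTS = ["crash loop", "long recovery (20 mins)"]


def test_matrix():
    """Enumerate every (DDL, recovery scenario) pair as one test case."""
    return [
        {"ddl": ddl, "recovery": recovery}
        for ddl, recovery in product(DDL_VARIANTS, RECOVERY_VARIANTS)
    ]


for case in test_matrix():
    print(f"{case['ddl']} x {case['recovery']}")
```

With three DDL variants and two recovery scenarios this yields six cases per connector.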

@xxchan
Member

xxchan commented May 14, 2024

I've managed to use the RECOVER command to write recovery tests: https://github.com/risingwavelabs/risingwave/pull/16733/files#diff-b83fa16ce469bbf54f92cf95f9d778e4cd0acf6a20489589c8bb636b6162da02

@xxchan
Member

xxchan commented May 14, 2024

> We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.

To do this, sqllogictest needs to be enhanced, because the connection to the fe will be dropped when the cluster restarts (#12451 (comment)).

@tabVersion
Contributor Author

> > We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.
>
> To do this, sqllogictest needs to be enhanced, because the connection to fe will be disconnected (#12451 (comment))

IIRC, the recovery process happens entirely inside the meta node, so connections to the fe should not be affected.

@st1page
Contributor

st1page commented Jun 7, 2024

> Motivation
> The absence of specific tests led to unintentional breaches of the exactly-once semantics when building new sources. This has been a critical issue, especially since we identified that the fs source connector could duplicate reading the current message during recovery, thanks to @stdrc. By implementing these tests, we aim to strengthen the guarantees around exactly-once semantics across all connectors.

In addition to the exactly-once guarantee, it is also worthwhile to simply test whether re-creating the subscription to the upstream system after recovery causes problems, such as the issue in #17112.


5 participants