More robust way to wait for the cluster to be healthy after a node is killed in simulation tests #17995

Open
BugenZhao opened this issue Aug 12, 2024 · 1 comment
Labels
component/meta Meta related issue. component/test Test related issue. no-issue-activity

Comments

@BugenZhao
Member

Currently, when killing a node in the cluster, we expect recovery to complete within a fixed interval of time (15 seconds, as shown below).

Some(tokio::spawn(async move {
    // Kill the node after a random delay of up to 1 second.
    let t = thread_rng().gen_range(Duration::default()..Duration::from_secs(1));
    tokio::time::sleep(t).await;
    cluster.kill_node(&opts).await;
    // Assume recovery completes within a fixed 15 seconds.
    tokio::time::sleep(Duration::from_secs(15)).await;
}))

This is fragile, of course. As an additional layer of defense, when running the next statement/query we check the error message and tolerate errors that are likely caused by a recovery still in progress.

if let Err(err) = tester
    .run_async(record.clone())
    .timed(|_res, elapsed| {
        tracing::debug!("Record {:?} finished in {:?}", record, elapsed)
    })
    .await
{
    let err_string = err.to_string();
    // The cluster may still be recovering if a node was killed earlier; retry if
    // the error matches `no reader for dml in table with id {}`.
    let should_retry = (err_string.contains("no reader for dml in table")
        || err_string
            .contains("error reading a body from connection: broken pipe"))
        || err_string.contains("failed to inject barrier") && i < MAX_RETRY;
    if !should_retry {
        panic!("{}", err);
    }
    tracing::error!("failed to run test: {err}\nretry after {delay:?}");
} else {
    break;
}

The problem is that this list of error patterns is hard to keep correctly maintained, given that we're actively reworking the barrier manager for partial checkpoint and its corresponding error-reporting mechanism. As a result, the test may turn out flaky.


To make this more robust, we can naturally leverage the system functions for checking the cluster status introduced in #17641. However, as suggested by @kwannoel, there is a possible race: the test may observe a "healthy" status before the recovery has actually been triggered. A possible solution is to attach a sequence number to the cluster's lifespan, incremented every time a recovery is triggered, so that we can tell the pre-kill and post-recovery states apart; see the sketch below.
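Below is a minimal sketch of what the wait could look like on the test side, assuming a hypothetical async status callback that reports the current recovery generation and a health flag (e.g. backed by the system functions from #17641; the callback name, signature, and polling interval are assumptions, not an existing API):

use std::time::Duration;

/// Wait until the cluster reports "healthy" under a recovery generation that is
/// newer than the one observed before the kill, so that a stale "healthy" reading
/// taken before recovery is triggered cannot end the wait prematurely.
/// `status` is a hypothetical callback returning (recovery_generation, is_healthy).
async fn wait_for_recovery<F, Fut>(
    mut status: F,
    generation_before_kill: u64,
    timeout: Duration,
) -> bool
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = (u64, bool)>,
{
    let deadline = tokio::time::Instant::now() + timeout;
    while tokio::time::Instant::now() < deadline {
        let (generation, healthy) = status().await;
        if generation > generation_before_kill && healthy {
            return true;
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    false
}

The simulation test would then record the generation right before kill_node and call this helper afterwards, instead of sleeping for a fixed 15 seconds or pattern-matching error messages.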

@BugenZhao BugenZhao added component/test Test related issue. component/meta Meta related issue. labels Aug 12, 2024
@github-actions github-actions bot added this to the release-2.0 milestone Aug 12, 2024
@BugenZhao BugenZhao removed this from the release-2.0 milestone Aug 19, 2024
@github-actions
Contributor

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
