Currently, when killing a node in the cluster, we expect recovery to complete within a fixed interval of time (15 s, as in the following).

risingwave/src/tests/simulation/src/slt.rs (lines 329 to 334 in e4b87be)

This is fragile, of course. As an additional layer of defense, when running the next statement/query, we check the error messages and tolerate the ones that are likely caused by a recovery in progress, as in this excerpt from the retry loop:

risingwave/src/tests/simulation/src/slt.rs (lines 298 to 318 in e4b87be)
```rust
        tracing::debug!("Record {:?} finished in {:?}", record, elapsed)
    })
    .await
{
    let err_string = err.to_string();
    // The cluster could still be recovering if a node was killed earlier;
    // retry if we see `no reader for dml in table with id {}`.
    let should_retry = (err_string.contains("no reader for dml in table")
        || err_string.contains("error reading a body from connection: broken pipe"))
        || err_string.contains("failed to inject barrier") && i < MAX_RETRY;
    if !should_retry {
        panic!("{}", err);
    }
    tracing::error!("failed to run test: {err}\nretry after {delay:?}");
} else {
    break;
}
```
The problem is that this list of error patterns is hard to keep correctly maintained, given that we're actively working on the barrier manager for partial checkpoint and the corresponding error-reporting mechanism. As a result, we may find the test flaky.
To make this more robust, we can (naturally) leverage the system functions for checking the cluster status introduced in #17641. However, as suggested by @kwannoel, there could be races where we observe a "healthy" status before the recovery has actually been triggered. A possible solution is to attach a sequence number to the cluster's lifespan that increments every time a recovery is triggered, so that we can tell the difference between the states before and after the kill.
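A minimal sketch of how the simulation test could wait on such a sequence number instead of sleeping for a fixed 15 s or matching error strings. Both `rw_recovery_status()` and `rw_recovery_seq()` here are hypothetical placeholders: the actual function added in #17641 and the proposed sequence number may be named and typed differently.

```rust
use std::time::Duration;

use tokio_postgres::Client;

/// Wait until a recovery *newer* than `prev_seq` has been triggered and has
/// finished. `rw_recovery_status()` and `rw_recovery_seq()` are hypothetical:
/// the former is assumed to return 'RUNNING' once the cluster is healthy, the
/// latter a number that increments every time a recovery is triggered.
async fn wait_for_recovery(client: &Client, prev_seq: i64) -> anyhow::Result<()> {
    loop {
        // The query itself may fail while the cluster is still recovering;
        // treat that the same as "not recovered yet" and keep polling.
        if let Ok(row) = client
            .query_one("SELECT rw_recovery_status(), rw_recovery_seq()", &[])
            .await
        {
            let status: String = row.get(0);
            let seq: i64 = row.get(1);
            if seq > prev_seq && status == "RUNNING" {
                return Ok(());
            }
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}
```

The harness would read the sequence number once before killing the node and pass that value as `prev_seq`, so a stale "healthy" reading taken before the recovery is actually triggered cannot end the wait early.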
This issue has been open for 60 days with no activity.
If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.
You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄