Chaos Mesh compute-meta-network-partition batch query fails occasionally #14217

lmatz · 2023-12-26T11:40:01Z

Describe the bug

https://buildkite.com/risingwave-test/longevity-chaos-mesh/builds/376#018ca450-a95e-4c3b-953d-68d154850aef

This experiment is implemented by @xuefengze.

In this experiment, we applied a network partition from the compute node to the meta node. (I say "from ... to" rather than "between" because we can specify the direction, although the direction seems to have little to do with the problem below.)

The duration is 10 minutes.

We triggered the fault at around 12:17:02, so the partition lasted until 12:27:02.

While the partition existed, the select query executed without a problem.

But when we tried to create table t1, the statement was stuck for a long time and returned the error message at 12:27:03, which is exactly when the partition experiment finished.

Although it is expected that the SQL fails, is it reasonable for create table t1 to be stuck for such a long time?
For such a statement, I think we always expect it to finish with single-digit latency. Shall we add a timeout here?
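
To make the timeout idea concrete, here is a minimal client-side sketch in Python. It is not how RisingWave handles DDL internally: `run_with_timeout` and `fake_create_table` are hypothetical names, and the stalled statement is simulated with a sleep rather than a real connection. It only illustrates failing fast instead of waiting for the partition to heal.

```python
import asyncio


async def run_with_timeout(coro, timeout_s):
    """Await `coro`, but give up after `timeout_s` seconds (hypothetical helper)."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface a clear error quickly instead of hanging for the whole partition.
        raise TimeoutError(f"statement did not finish within {timeout_s}s") from None


async def fake_create_table(stall_s):
    """Stand-in for `create table t1`; sleeps to simulate a stalled DDL (assumption)."""
    await asyncio.sleep(stall_s)
    return "CREATE_TABLE"


def demo():
    # Healthy case: the statement finishes well inside the timeout.
    ok = asyncio.run(run_with_timeout(fake_create_table(0.01), 1.0))
    # Partitioned case: the statement stalls, so the timeout fires first.
    try:
        asyncio.run(run_with_timeout(fake_create_table(5.0), 0.05))
        stalled = "finished"
    except TimeoutError:
        stalled = "timed out"
    return ok, stalled


if __name__ == "__main__":
    print(demo())
```

With a bound like this, the failing create table would return after the configured limit rather than at 12:27:03 when the partition ends. Where such a timeout should live (client, frontend, or meta RPC) is a design question this sketch does not answer.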

After the partition experiment finished, we retried creating table t1 and it succeeded.

However, after that, with the network working normally again, executing a select query fails.

Is this due to an issue similar to the one found in #14030?

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response


github-actions · 2024-08-01T02:09:14Z

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

lmatz added the type/bug and found-by-chaos-mesh labels Dec 26, 2023

github-actions bot added this to the release-1.6 milestone Dec 26, 2023

This was referenced Dec 26, 2023

Tracking: issues found by chaos mesh #14213

Open

Chaos Mesh compute-node-limit-bandwidth create table statement fails too slow? #14226

Open

lmatz modified the milestones: release-1.6, release-1.7 Jan 9, 2024

fuyufjh assigned chenzl25 Mar 6, 2024

lmatz modified the milestones: release-1.7, future-release-1.9 Mar 19, 2024

chenzl25 mentioned this issue Mar 29, 2024

feat(frontend): add heartbeat between frontend and compute #16014

Closed

9 tasks

chenzl25 removed this from the release-1.9 milestone May 14, 2024

github-actions bot added the no-issue-activity label Aug 1, 2024
