Daily Chaos Mesh with longevity pipeline failed #15677

Closed
xuefengze opened this issue Mar 14, 2024 · 9 comments
================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly nexmark     
Namespace            longcmkf-20240313-153221
TestBed              medium-arm-3cn-all-affinity
RW Version           nightly-20240313    
Test Start time      2024-03-13 15:36:18 
Test End time        2024-03-13 17:50:36 
Test Queries         q0,q1,q2,q3,q4,q5,q7,q8,q9,q10,q14,q15,q16,q17,q18,q20,q21,q22,q101,q102,q103,q104,q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240313-153221/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/681

The faults ended (fault duration: 17:36:01 - 17:46:01) before the SQL command was run.

CREATE MV nexmark_q10_1_chaos_mesh failed because compute-1 restarted:
https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000&var-pod=benchmark-risingwave-compute-c-1&var-search=

@github-actions github-actions bot added this to the release-1.8 milestone Mar 14, 2024
@lmatz lmatz added found-by-chaos-mesh type/bug Something isn't working labels Mar 14, 2024
lmatz (Contributor) commented Mar 14, 2024

It seems the panic originates from storage: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1710344178000&orgId=1&to=1710352236000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20240313-153221&var-pod=benchmark-risingwave-compute-c-1&var-search=

 5:     0xaaaae8e0e934 - core::fmt::write::h0c4a627b12e3d78d
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/fmt/mod.rs:1120:17
 6:     0xaaaaf1d4ebdc - std::io::Write::write_fmt::h318ea8fa9c57551f
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/io/mod.rs:1810:15
 7:     0xaaaaf1d55ae4 - std::sys_common::backtrace::_print::ha8e58b5a72e68af5
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:47:5
 8:     0xaaaaf1d55ae4 - std::sys_common::backtrace::print::hbce77ec9c0ee5615
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:34:9
 9:     0xaaaaf1d5715c - std::panicking::default_hook::{{closure}}::h62de2f44289d99c3
10:     0xaaaaf1d56e90 - std::panicking::default_hook::h6808acaf9c18352a
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:292:9
11:     0xaaaaeff4f5f4 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h2700c70562ffa71b
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/alloc/src/boxed.rs:2029:9
12:     0xaaaaeff4f5f4 - risingwave_rt::panic_hook::set_panic_hook::{{closure}}::h5cfd70b8a9038c58
             at /risingwave/src/utils/runtime/src/panic_hook.rs:25:9
13:     0xaaaaeff4f5f4 - std::panicking::update_hook::{{closure}}::h9caa177955fdf881
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:234:47
14:     0xaaaaf1d577a8 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h861e29f50d7dc506
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/alloc/src/boxed.rs:2029:9
15:     0xaaaaf1d577a8 - std::panicking::rust_panic_with_hook::h7d9418a0e45d61fb
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:783:13
16:     0xaaaaf1d57550 - std::panicking::begin_panic_handler::{{closure}}::he5c28c60ba487bed
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:657:13
17:     0xaaaaf1d5618c - std::sys_common::backtrace::__rust_end_short_backtrace::hcdcd48a39dadcd83
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:171:18
18:     0xaaaaf1d572f8 - rust_begin_unwind
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:645:5
19:     0xaaaae8e0b5c8 - core::panicking::panic_fmt::had704c91ae6f0365
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/panicking.rs:72:14
20:     0xaaaaf03ef8f4 - risingwave_storage::hummock::store::hummock_storage::HummockStorage::build_read_version_tuple::hde1b5dfd63f7cbef
             at /risingwave/src/storage/src/hummock/store/hummock_storage.rs:379:21
21:     0xaaaaf118b06c - risingwave_storage::hummock::store::hummock_storage::HummockStorage::iter_inner::{{closure}}::h49e69459138cb7ed
             at /risingwave/src/storage/src/hummock/store/hummock_storage.rs:268:13
22:     0xaaaaf118b06c - risingwave_storage::monitor::monitored_store::MonitoredStateStore<S>::monitored_iter::{{closure}}::h60977c2e723c1172
             at /risingwave/src/storage/src/monitor/monitored_store.rs:96:14
23:     0xaaaaf118b06c - <F as futures_core::future::TryFuture>::try_poll::h8870776b32a6c271
             at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.30/src/future.rs:82:9
24:     0xaaaaf118b06c - <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll::h4f17fb0995334160
             at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/try_future/into_future.rs:34:9

@hzxa21 hzxa21 self-assigned this Mar 19, 2024
hzxa21 (Collaborator) commented Mar 20, 2024

May be related to the race fixed in #15738. Let's see whether the same panic still occurs in recent runs.

hzxa21 (Collaborator) commented Mar 20, 2024

Also, I would like to understand more about the test case. When the issue happened, we were running

[ name: networkchaos-compute-meta-longcmkf-20240313-153221, mode: one, action: partition, direction: to, duration: 600s ]

I think this means we randomly create a network partition between one CN and meta. When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure? Or did the MV creation happen after the network partition was resolved?

lmatz (Contributor) commented Mar 20, 2024

did the MV creation happen after the network partition was resolved?

Yes, I think MV creation happened after the network partition was resolved.
@xuefengze please confirm, thanks!
BTW, during the network partition period, Chaos Mesh will not randomly remove the partition from time to time, right? That is, it keeps the partition in place until the end of the period.

When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure?

If DDL happens during a network partition, I believe DDL should not succeed, as no communication can happen between the meta node and the CN.
What we expect is a clear timeout error message returned to users, without triggering a panic in RW.
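As a rough illustration of that expectation (a minimal sketch only; the function and error strings below are hypothetical and not RisingWave's actual meta/CN RPC code): a DDL call issued while the partition is active should be bounded by a timeout and surfaced as an error, rather than unwrapped into a panic.

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for a meta <-> CN RPC that may hang while the
// network partition is active.
async fn create_mv_rpc() -> Result<(), String> {
    // ... issue the real gRPC call here ...
    Ok(())
}

// Expected behavior: bound the DDL call with a timeout and surface a clear
// error to the user instead of letting the node panic.
async fn create_mv_with_timeout() -> Result<(), String> {
    match timeout(Duration::from_secs(30), create_mv_rpc()).await {
        Ok(inner) => inner,
        Err(_elapsed) => Err(
            "CREATE MATERIALIZED VIEW timed out: meta <-> compute communication unavailable"
                .to_string(),
        ),
    }
}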

xuefengze (Contributor, Author) commented:
did the MV creation happen after the network partition was resolved?

Yes, the partition began at 17:36:03 UTC and the duration was 600s, so it had already ended by 17:46:03 UTC.

hzxa21 (Collaborator) commented Mar 20, 2024

did the MV creation happen after the network partition was resolved?

Yes, I think MV creation happened after the network partition was resolved. @xuefengze please confirm, thanks! BTW, during the network partition period, Chaos Mesh will not randomly remove the partition from time to time, right? That is, it keeps the partition in place until the end of the period.

When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure?

If DDL happens during a network partition, I believe DDL should not succeed, as no communication can happen between the meta node and the CN. What we expect is a clear timeout error message returned to users, without triggering a panic in RW.

Thanks for the explanation. That makes sense to me. The storage panic happens when there is more than one storage table instance relevant to the same vnode id, which indicates there is a race somewhere. Let's see whether #15738 fixes it.
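For illustration only (a simplified sketch, not the actual build_read_version_tuple implementation; all types and names below are hypothetical): the panic corresponds to an invariant check of roughly this shape, where a given vnode is expected to map to exactly one read version.

use std::collections::HashMap;

// Hypothetical, simplified stand-in for Hummock's per-table read-version state.
type VnodeId = u32;
struct ReadVersion;

// Sketch of the invariant that panicked: for a given vnode there must be
// exactly one matching read version. Zero or several matches means read
// versions were registered/unregistered concurrently, i.e. a race.
fn build_read_version_tuple(
    read_versions: &HashMap<VnodeId, Vec<ReadVersion>>,
    vnode: VnodeId,
) -> &ReadVersion {
    let matched: &[ReadVersion] = read_versions
        .get(&vnode)
        .map(Vec::as_slice)
        .unwrap_or(&[]);
    match matched {
        [single] => single,
        other => panic!(
            "expected exactly one read version for vnode {vnode}, found {}",
            other.len()
        ),
    }
}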

Just FYI, I noticed that while the network partition was ongoing, the partitioned CN exited by itself (and was restarted by k8s, I guess) because meta had expired the worker node. Relevant code:

std::process::exit(1);

This is not relevant to the reported issue, because the storage panic happened at around 17:50, by which time the network partition had already been resolved.
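A rough sketch of that self-exit behavior, see below (hypothetical names; the actual code lives in the compute node's heartbeat path): once the CN learns that meta has expired it as a worker, it terminates itself and relies on Kubernetes to restart the pod.

use std::process;
use std::time::Duration;

// Hypothetical outcome of a heartbeat RPC to the meta node.
#[allow(dead_code)]
enum HeartbeatOutcome {
    Ok,
    Unreachable,   // e.g. the network partition is still active
    WorkerExpired, // meta no longer recognizes this worker node
}

fn send_heartbeat() -> HeartbeatOutcome {
    // ... real code would issue an RPC to meta here ...
    HeartbeatOutcome::Ok
}

// Sketch of the heartbeat loop: when meta has expired the worker, the CN
// exits with a non-zero status and lets k8s restart the pod.
fn heartbeat_loop() {
    loop {
        match send_heartbeat() {
            HeartbeatOutcome::Ok | HeartbeatOutcome::Unreachable => {
                std::thread::sleep(Duration::from_secs(1));
            }
            HeartbeatOutcome::WorkerExpired => {
                eprintln!("worker expired by meta; exiting so k8s can restart the pod");
                process::exit(1);
            }
        }
    }
}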

hzxa21 (Collaborator) commented Apr 11, 2024

@xuefengze Is the issue resolved in recent runs?

xuefengze (Contributor, Author) commented:
Is the issue resolved in recent runs?

Yes, all recent runs were successful.

hzxa21 (Collaborator) commented Apr 12, 2024

Thanks for the confirmation. I will close the issue.

@hzxa21 hzxa21 closed this as completed Apr 12, 2024