Daily Chaos Mesh with longevity pipeline failed #15677

Closed
xuefengze opened this issue Mar 14, 2024 · 9 comments
================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly nexmark     
Namespace            longcmkf-20240313-153221
TestBed              medium-arm-3cn-all-affinity
RW Version           nightly-20240313    
Test Start time      2024-03-13 15:36:18 
Test End time        2024-03-13 17:50:36 
Test Queries         q0,q1,q2,q3,q4,q5,q7,q8,q9,q10,q14,q15,q16,q17,q18,q20,q21,q22,q101,q102,q103,q104,q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240313-153221/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/681

The faults ended (fault duration: 17:36:01 - 17:46:01) before the SQL command was run.

CREATE MV nexmark_q10_1_chaos_mesh failed because compute-1 restarted:
https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20240313-153221&from=1710344178000&to=1710352236000&var-pod=benchmark-risingwave-compute-c-1&var-search=

@github-actions github-actions bot added this to the release-1.8 milestone Mar 14, 2024
@lmatz lmatz added found-by-chaos-mesh type/bug Something isn't working labels Mar 14, 2024
lmatz (Contributor) commented Mar 14, 2024

It seems the panic originates from storage: https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?from=1710344178000&orgId=1&to=1710352236000&var-data_source=PE59595AED52CF917&var-namespace=longcmkf-20240313-153221&var-pod=benchmark-risingwave-compute-c-1&var-search=

 5:     0xaaaae8e0e934 - core::fmt::write::h0c4a627b12e3d78d
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/fmt/mod.rs:1120:17
 6:     0xaaaaf1d4ebdc - std::io::Write::write_fmt::h318ea8fa9c57551f
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/io/mod.rs:1810:15
 7:     0xaaaaf1d55ae4 - std::sys_common::backtrace::_print::ha8e58b5a72e68af5
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:47:5
 8:     0xaaaaf1d55ae4 - std::sys_common::backtrace::print::hbce77ec9c0ee5615
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:34:9
 9:     0xaaaaf1d5715c - std::panicking::default_hook::{{closure}}::h62de2f44289d99c3
10:     0xaaaaf1d56e90 - std::panicking::default_hook::h6808acaf9c18352a
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:292:9
11:     0xaaaaeff4f5f4 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h2700c70562ffa71b
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/alloc/src/boxed.rs:2029:9
12:     0xaaaaeff4f5f4 - risingwave_rt::panic_hook::set_panic_hook::{{closure}}::h5cfd70b8a9038c58
             at /risingwave/src/utils/runtime/src/panic_hook.rs:25:9
13:     0xaaaaeff4f5f4 - std::panicking::update_hook::{{closure}}::h9caa177955fdf881
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:234:47
14:     0xaaaaf1d577a8 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h861e29f50d7dc506
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/alloc/src/boxed.rs:2029:9
15:     0xaaaaf1d577a8 - std::panicking::rust_panic_with_hook::h7d9418a0e45d61fb
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:783:13
16:     0xaaaaf1d57550 - std::panicking::begin_panic_handler::{{closure}}::he5c28c60ba487bed
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:657:13
17:     0xaaaaf1d5618c - std::sys_common::backtrace::__rust_end_short_backtrace::hcdcd48a39dadcd83
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/sys_common/backtrace.rs:171:18
18:     0xaaaaf1d572f8 - rust_begin_unwind
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:645:5
19:     0xaaaae8e0b5c8 - core::panicking::panic_fmt::had704c91ae6f0365
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/panicking.rs:72:14
20:     0xaaaaf03ef8f4 - risingwave_storage::hummock::store::hummock_storage::HummockStorage::build_read_version_tuple::hde1b5dfd63f7cbef
             at /risingwave/src/storage/src/hummock/store/hummock_storage.rs:379:21
21:     0xaaaaf118b06c - risingwave_storage::hummock::store::hummock_storage::HummockStorage::iter_inner::{{closure}}::h49e69459138cb7ed
             at /risingwave/src/storage/src/hummock/store/hummock_storage.rs:268:13
22:     0xaaaaf118b06c - risingwave_storage::monitor::monitored_store::MonitoredStateStore<S>::monitored_iter::{{closure}}::h60977c2e723c1172
             at /risingwave/src/storage/src/monitor/monitored_store.rs:96:14
23:     0xaaaaf118b06c - <F as futures_core::future::TryFuture>::try_poll::h8870776b32a6c271
             at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.30/src/future.rs:82:9
24:     0xaaaaf118b06c - <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll::h4f17fb0995334160
             at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/try_future/into_future.rs:34:9

@hzxa21 hzxa21 self-assigned this Mar 19, 2024
hzxa21 (Collaborator) commented Mar 20, 2024

May be related to the race fixed in #15738. Let's see whether the same panic still occurs in recent runs.

hzxa21 (Collaborator) commented Mar 20, 2024

Also, I would like to understand more about the test case. When the issue happened, we were running

[ name: networkchaos-compute-meta-longcmkf-20240313-153221, mode: one, action: partition, direction: to, duration: 600s ]

I think this means we randomly create a network partition between one CN and meta. When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure? Or did the MV creation happen after the network partition was resolved?

lmatz (Contributor) commented Mar 20, 2024

did the MV creation happen after the network partition was resolved?

Yes, I think MV creation happened after the network partition was resolved.
@xuefengze please confirm, thanks!
BTW, during the network partition period, Chaos Mesh will not randomly remove the partition from time to time, right? That is, it keeps the partition in place until the end of the period.

When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure?

If DDL happens during a network partition, I believe DDL should not succeed, as no communication can happen between the meta node and the CN.
What we expect is a clear timeout error message returned to users, without triggering a panic in RW.
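As a rough illustration of that expectation (a minimal sketch only; the function and error strings below are hypothetical and not RisingWave's actual meta/CN RPC code): a DDL call issued while the partition is active should be bounded by a timeout and surfaced as an error, rather than unwrapped into a panic.

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for a meta <-> CN RPC that may hang while the
// network partition is active.
async fn create_mv_rpc() -> Result<(), String> {
    // ... issue the real gRPC call here ...
    Ok(())
}

// Expected behavior: bound the DDL call with a timeout and surface a clear
// error to the user instead of letting the node panic.
async fn create_mv_with_timeout() -> Result<(), String> {
    match timeout(Duration::from_secs(30), create_mv_rpc()).await {
        Ok(inner) => inner,
        Err(_elapsed) => Err(
            "CREATE MATERIALIZED VIEW timed out: meta <-> compute communication unavailable"
                .to_string(),
        ),
    }
}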

xuefengze (Contributor, Author) commented:
did the MV creation happen after the network partition was resolved?

Yes, the partition began at 17:36:03 UTC and the duration was 600s, so it had already ended by 17:46:03 UTC.

hzxa21 (Collaborator) commented Mar 20, 2024

did the MV creation happen after the network partition was resolved?

Yes, I think MV creation happened after the network partition was resolved. @xuefengze please confirm, thanks! BTW, during the network partition period, Chaos Mesh will not randomly remove the partition from time to time, right? That is, it keeps the partition in place until the end of the period.

When a network partition happens, do we expect DDL like CREATE MV to still succeed without failure?

If DDL happens during a network partition, I believe DDL should not succeed, as no communication can happen between the meta node and the CN. What we expect is a clear timeout error message returned to users, without triggering a panic in RW.

Thanks for the explanation. That makes sense to me. The storage panic happens when there is more than one storage table instance relevant to the same vnode id, which indicates there is a race somewhere. Let's see whether #15738 fixes it.
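For illustration only (a simplified sketch, not the actual build_read_version_tuple implementation; all types and names below are hypothetical): the panic corresponds to an invariant check of roughly this shape, where a given vnode is expected to map to exactly one read version.

use std::collections::HashMap;

// Hypothetical, simplified stand-in for Hummock's per-table read-version state.
type VnodeId = u32;
struct ReadVersion;

// Sketch of the invariant that panicked: for a given vnode there must be
// exactly one matching read version. Zero or several matches means read
// versions were registered/unregistered concurrently, i.e. a race.
fn build_read_version_tuple(
    read_versions: &HashMap<VnodeId, Vec<ReadVersion>>,
    vnode: VnodeId,
) -> &ReadVersion {
    let matched: &[ReadVersion] = read_versions
        .get(&vnode)
        .map(Vec::as_slice)
        .unwrap_or(&[]);
    match matched {
        [single] => single,
        other => panic!(
            "expected exactly one read version for vnode {vnode}, found {}",
            other.len()
        ),
    }
}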

Just FYI, I noticed that while the network partition was ongoing, the partitioned CN exited by itself (and was restarted by k8s, I guess) because meta had expired the worker node. Relevant code:

std::process::exit(1);

This is not relevant to the reported issue, because the storage panic happened at around 17:50, by which time the network partition had already been resolved.
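A rough sketch of that self-exit behavior, see below (hypothetical names; the actual code lives in the compute node's heartbeat path): once the CN learns that meta has expired it as a worker, it terminates itself and relies on Kubernetes to restart the pod.

use std::process;
use std::time::Duration;

// Hypothetical outcome of a heartbeat RPC to the meta node.
#[allow(dead_code)]
enum HeartbeatOutcome {
    Ok,
    Unreachable,   // e.g. the network partition is still active
    WorkerExpired, // meta no longer recognizes this worker node
}

fn send_heartbeat() -> HeartbeatOutcome {
    // ... real code would issue an RPC to meta here ...
    HeartbeatOutcome::Ok
}

// Sketch of the heartbeat loop: when meta has expired the worker, the CN
// exits with a non-zero status and lets k8s restart the pod.
fn heartbeat_loop() {
    loop {
        match send_heartbeat() {
            HeartbeatOutcome::Ok | HeartbeatOutcome::Unreachable => {
                std::thread::sleep(Duration::from_secs(1));
            }
            HeartbeatOutcome::WorkerExpired => {
                eprintln!("worker expired by meta; exiting so k8s can restart the pod");
                process::exit(1);
            }
        }
    }
}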

hzxa21 (Collaborator) commented Apr 11, 2024

@xuefengze Is the issue resolved in recent runs?

xuefengze (Contributor, Author) commented:
Is the issue resolved in recent runs?

Yes, all recent runs were successful.

hzxa21 (Collaborator) commented Apr 12, 2024

Thanks for the confirmation. I will close the issue.

@hzxa21 hzxa21 closed this as completed Apr 12, 2024