
Failed to start risedev after #8770 #9038

Closed

gengteng opened this issue Apr 7, 2023 · 6 comments

Labels: type/bug (Something isn't working)

@gengteng (Contributor) commented Apr 7, 2023

Describe the bug

This problem only occurs after commit #8770 (b67e00f).

When running the ./risedev d command, the following error message appears:

dev cluster: starting 3 services for default...  
 ✅ tmux: session risedev
⠁✅ prepare: all previous services have been stopped
⠁  meta-node-5690: waiting for online...

ERROR - Failed to start: meta-node-5690 exited while waiting for connection: status 139

Caused by:
        meta-node-5690 exited while waiting for connection: status 139
* Use `./risedev configure` to enable new compoenents, if they are missing.
* Use `./risedev l` to view logs, or visit `/Users/gengteng/CLionProjects/risingwave/.risingwave/log`
* Run `./risedev k` to clean up cluster.
* Run `./risedev clean-data` to clean data, which might potentially fix the issue.
---


Error: meta-node-5690 exited while waiting for connection: status 139


Stack backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/backtrace.rs:332:13
   3: anyhow::error::<impl anyhow::Error>::msg
             at /Users/gengteng/.cargo/registry/src/rsproxy.cn-8f6827c7555bfaf8/anyhow-1.0.70/src/error.rs:83:36
   4: risedev::wait::wait
             at ./src/risedevtool/src/wait.rs:59:24
   5: risedev::task::ExecuteContext<W>::wait_tcp
             at ./src/risedevtool/src/task.rs:177:9
   6: <risedev::task::task_configure_grpc_node::ConfigureGrpcNodeTask as risedev::task::Task>::execute
             at ./src/risedevtool/src/task/task_configure_grpc_node.rs:46:13
   7: risedev_dev::task_main
             at ./src/risedevtool/src/bin/risedev-dev.rs:191:17
   8: risedev_dev::main
             at ./src/risedevtool/src/bin/risedev-dev.rs:403:23
   9: core::ops::function::FnOnce::call_once
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/core/src/ops/function.rs:250:5
  10: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/sys_common/backtrace.rs:121:18
  11: std::rt::lang_start::{{closure}}
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/rt.rs:166:18
  12: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/core/src/ops/function.rs:287:13
  13: std::panicking::try::do_call
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panicking.rs:487:40
  14: std::panicking::try
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panicking.rs:451:19
  15: std::panic::catch_unwind
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panic.rs:140:14
  16: std::rt::lang_start_internal::{{closure}}
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/rt.rs:148:48
  17: std::panicking::try::do_call
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panicking.rs:487:40
  18: std::panicking::try
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panicking.rs:451:19
  19: std::panic::catch_unwind
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/panic.rs:140:14
  20: std::rt::lang_start_internal
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/rt.rs:148:20
  21: std::rt::lang_start
             at /rustc/31f858d9a511f24fedb8ed997b28304fec809630/library/std/src/rt.rs:165:17
  22: _main
[cargo-make] ERROR - Error while executing command, exit code: 1
[cargo-make] WARN - Build Failed.
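
For context, frames 3 through 5 of the backtrace above show where this error is produced: risedev's TCP wait loop gives up and constructs an anyhow::Error, which captures a backtrace on construction. Below is a minimal sketch of such a wait loop; the names and signature are hypothetical rather than the actual src/risedevtool/src/wait.rs code, and it assumes the anyhow crate (which the backtrace shows risedev uses):

use std::net::TcpStream;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical sketch: poll a TCP address until it accepts a connection or a
// timeout elapses. The real risedev wait also notices when the service
// process itself exits, which is how "exited while waiting for connection:
// status 139" gets reported above.
fn wait_tcp(addr: &str, timeout: Duration) -> anyhow::Result<()> {
    let start = Instant::now();
    loop {
        if TcpStream::connect(addr).is_ok() {
            return Ok(());
        }
        if start.elapsed() > timeout {
            // anyhow::Error::msg (frame 3 above) captures a backtrace here
            // when backtraces are enabled.
            return Err(anyhow::Error::msg(format!("failed to connect to {addr}")));
        }
        sleep(Duration::from_millis(50));
    }
}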

This problem only occurs after commit #8770. Strangely, deleting the Query::as_simple_values method and expanding its body at the call site fixes the problem. Additionally, after rolling back to #9003, implementing a public method on Query triggers the same error:

impl Query {
    #[allow(dead_code)]
    pub fn test(&self) {}
}

However, removing the pub from the function or removing the &self parameter lets it start normally, as sketched below.
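
For comparison, based on the description above, both of the following variants start normally (the second method name is illustrative only):

impl Query {
    // Variant 1: the same method without `pub`.
    #[allow(dead_code)]
    fn test(&self) {}

    // Variant 2: `pub`, but without the `&self` parameter.
    #[allow(dead_code)]
    pub fn test_associated() {}
}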

To Reproduce

  1. Run the ./risedev d command on a Mac with an M1 chip.
  2. The error message Error: meta-node-5690 exited while waiting for connection: status 139 appears.

Expected behavior

The ./risedev d command starts normally without any error messages.

Additional context

  • This problem only occurs after commit fix(binder): Incorrect cast when specifying columns (#8770).
  • The fact that expanding the Query::as_simple_values method at the call site fixes the problem is strange.
  • When implementing a public method on Query after rolling back to fix(udf): fix wrong number of rows (#9003), the same error occurs. However, removing the pub or removing the &self parameter lets it start normally.
  • ./risedev check and ./risedev test both pass without any issues.
gengteng added the type/bug (Something isn't working) label on Apr 7, 2023
The github-actions bot added this to the release-0.19 milestone on Apr 7, 2023
@xiangjinwu (Contributor) commented Apr 7, 2023

Seems another occurrence of #6205. Does the issue disappear magically on some commits after the problematic one, or does it consistently happen on all commits after that one?

@gengteng (Contributor, Author) commented Apr 7, 2023

> Seems another occurrence of #6205. Does the issue disappear magically on some commits after the problematic one, or does it consistently happen on all commits after that one?

Thank you for your response. The issue consistently happens on all commits after the problematic one. Additionally, I was able to reproduce the issue even before the problematic commit by implementing a non-static public method for Query.

Please let me know if you need any further information from me.

@lmatz (Contributor) commented Apr 7, 2023

I encountered the error yesterday too, and it magically went away after I rebased onto the latest main and ran cargo clean once. I don't know which of the two actually did the trick. 😄

@xiangjinwu (Contributor) commented Apr 7, 2023

I was able to reproduce it on b67e00f and also saw it disappear magically on 8e0bf7a (maybe earlier). Here is the relevant backtrace of the segfault (exit status 139 is 128 + 11, i.e. the meta node was killed by SIGSEGV), obtained with lldb -w -n meta-node:

(lldb) bt
* thread #7, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001956dd770 libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x00000001956dd624 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x00000001956dd2fc libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x00000001956dd220 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1228
    frame #4: 0x00000001956df6b0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x00000001956e20f0 libunwind.dylib`_Unwind_Backtrace + 348
    frame #6: 0x00000001061ce1c0 meta-node`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h3c90f7da2ae71937 at libunwind.rs:93:5 [opt]
    frame #7: 0x00000001061ce1b0 meta-node`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::hb2b06b1b0aadcef3 at mod.rs:66:5 [opt]
    frame #8: 0x00000001061ce1a4 meta-node`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 at backtrace.rs:332:13 [opt]
    frame #9: 0x0000000105960ffc meta-node`anyhow::error::_$LT$impl$u20$anyhow..Error$GT$::msg::h88bf86dcaa2f30c2(message=<unavailable>) at error.rs:83:36
    frame #10: 0x0000000101935624 meta-node`risingwave_meta::telemetry::TrackingId::from_meta_store::_$u7b$$u7b$closure$u7d$$u7d$::h07034bb74a05d6b3((null)=0x000000016fbe57b0) at telemetry.rs:48:27
    frame #11: 0x0000000101936ae0 meta-node`risingwave_meta::telemetry::TrackingId::get_or_create_meta_store::_$u7b$$u7b$closure$u7d$$u7d$::h88fe799f50cfa62d((null)=0x000000016fbe57b0) at telemetry.rs:61:48
    frame #12: 0x0000000101a082e4 meta-node`risingwave_meta::rpc::server::start_service_as_election_leader::_$u7b$$u7b$closure$u7d$$u7d$::hbeba35617386b4a4((null)=0x000000016fbe57b0) at server.rs:580:13
    frame #13: 0x00000001019f4f8c meta-node`risingwave_meta::rpc::server::rpc_serve_with_store::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::ha08aee51df67d9e0((null)=0x000000016fbe57b0) at server.rs:269:9
... (more omitted)

This confirms it is the same backtrace-capturing issue (this time triggered inside anyhow::Error), and I opened #9042 as another mitigation.
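
For reference, frames 5 through 8 above show the generic capture path: std::backtrace::Backtrace walks the stack via libunwind's _Unwind_Backtrace, and bad unwind info makes that walk crash. Here is a minimal sketch of the same path using only the standard library; whether it actually crashes depends on the unwind tables of the binary it runs in:

use std::backtrace::Backtrace;

fn main() {
    // Backtrace capture (frame 8 above) walks the stack via
    // _Unwind_Backtrace in libunwind.dylib (frame 5).
    let bt = Backtrace::force_capture();
    println!("{bt}");
}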

@gengteng Does checking out the latest main resolve this issue on your side? If so, we can close this issue. And thank you for reporting.

@gengteng (Contributor, Author) commented Apr 7, 2023

@xiangjinwu Thank you for your help in resolving my issue. I can confirm that the latest main branch has resolved the problem.

@xiangjinwu (Contributor) commented Apr 7, 2023

Tracked by #6205. Closing as duplicate.

xiangjinwu closed this as not planned (duplicate) on Apr 7, 2023