Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ux: malfunctioning "graceful shutdown": pressing ctrl-c in playground or single-node terminates the process with panicking #17351

Open
BugenZhao opened this issue Jun 19, 2024 · 7 comments
Assignees
Labels
component/dev Dev related issues, such as rise dev, ci. type/enhancement Improvements to existing implementation.

Comments

@BugenZhao
Copy link
Member

^C
2024-06-19T17:59:58.02596+08:00  INFO        rw-main risingwave_meta_node: receive ctrl+c
2024-06-19T17:59:58.025957+08:00  INFO           main risingwave_cmd_all::standalone: Ctrl+C received, now exiting
2024-06-19T17:59:58.026082+08:00  INFO        rw-main risingwave_meta_node::server: Shutting down services
2024-06-19T17:59:58.026072+08:00  INFO        rw-main risingwave_rpc_client::meta_client: Heartbeat loop is stopped

2024-06-19T17:59:58.029081+08:00  INFO        rw-main risingwave_rpc_client::meta_client: Heartbeat loop is stopped
2024-06-19T17:59:58.02926+08:00  INFO        rw-main risingwave_meta::manager::system_param: System params notifier is stopped
2024-06-19T17:59:58.029655+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker start cleaning up
2024-06-19T17:59:58.029702+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker finished cleaning up
2024-06-19T17:59:58.029733+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker exited
2024-06-19T17:59:58.029791+08:00  INFO        rw-main risingwave_meta::hummock: Hummock version checkpoint is stopped
2024-06-19T17:59:58.029846+08:00  INFO        rw-main risingwave_meta::hummock: Vacuum metadata loop is stopped
2024-06-19T17:59:58.029915+08:00  INFO        rw-main risingwave_meta::hummock: Vacuum object loop is stopped
2024-06-19T17:59:58.030093+08:00  INFO        rw-main risingwave_meta::rpc::metrics: Worker number monitor is stopped
2024-06-19T17:59:58.030216+08:00  INFO        rw-main risingwave_meta::rpc::metrics: Fragment info monitor is stopped
2024-06-19T17:59:58.031265+08:00  INFO        rw-main risingwave_meta::hummock::manager::timer_task: Hummock timer loop is stopped

2024-06-19T17:59:58.033336+08:00 DEBUG ThreadId(57) risingwave_rpc_client::meta_client: matching tonic code Unknown error
2024-06-19T17:59:58.033468+08:00 DEBUG ThreadId(57) risingwave_rpc_client::meta_client: skipping the current refresh, somewhere else is already doing it
2024-06-19T17:59:58.033732+08:00 DEBUG ThreadId(56) risingwave_rpc_client::meta_client: matching tonic code Unknown error
2024-06-19T17:59:58.033922+08:00 DEBUG ThreadId(56) risingwave_rpc_client::meta_client: skipping the current refresh, somewhere else is already doing it


thread 'rw-main' panicked at src/compute/src/lib.rs:227:31:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(178))
thread 'rw-main' panicked at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/shutdown.rs:51:21:
Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.
stack backtrace:
2024-06-19T17:59:58.036347+08:00  WARN        rw-main risingwave_storage::hummock::local_version::pinned_version: failed to send req unpin version id: 908
2024-06-19T17:59:58.03839+08:00 ERROR        rw-main risingwave_storage::hummock::store::hummock_storage: unable to send shutdown event=HummockEvent { debug_string: "Shutdown" }

   0: rust_begin_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/result.rs:1654:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/result.rs:1077:23
   4: risingwave_compute::start::{{closure}}
             at ./src/compute/src/lib.rs:227:13
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/future/future.rs:123:9
   6: risingwave_cmd_all::standalone::standalone::{{closure}}::{{closure}}
             at ./src/cmd_all/src/standalone.rs:218:89
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
   8: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
   9: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
  10: tokio::runtime::task::core::Core<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:13
  11: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panic/unwind_safe.rs:272:9
  13: std::panicking::try::do_call
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:552:40
  14: ___rust_try
  15: std::panicking::try
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:516:19
  16: std::panic::catch_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panic.rs:146:14
  17: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
  18: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
  19: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
  20: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:271:5
  21: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
  22: tokio::runtime::task::LocalNotified<S>::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:427:9
  23: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:639:17
  24: tokio::runtime::coop::with_budget
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:107:5
  25: tokio::runtime::coop::budget
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:73:5
  26: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:575:9
  27: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:526:24
  28: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:491:21
  29: tokio::runtime::context::scoped::Scoped<T>::set
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/scoped.rs:40:9
  30: tokio::runtime::context::set_scheduler::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context.rs:176:26
  31: std::thread::local::LocalKey<T>::try_with
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/thread/local.rs:284:16
  32: std::thread::local::LocalKey<T>::with
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/thread/local.rs:260:9
  33: tokio::runtime::context::set_scheduler
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context.rs:176:9
  34: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:486:9
  35: tokio::runtime::context::runtime::enter_runtime
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/runtime.rs:65:16
  36: tokio::runtime::scheduler::multi_thread::worker::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:478:5
  37: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:447:45
  38: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/task.rs:42:21
  39: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  40: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
  41: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
  42: tokio::runtime::task::core::Core<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:13
  43: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
  44: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panic/unwind_safe.rs:272:9
  45: std::panicking::try::do_call
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:552:40
  46: ___rust_try
  47: std::panicking::try
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:516:19
  48: std::panic::catch_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panic.rs:146:14
  49: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
  50: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
  51: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
  52: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:271:5
  53: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
  54: tokio::runtime::task::UnownedTask<S>::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:464:9
  55: tokio::runtime::blocking::pool::Task::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:159:9
  56: tokio::runtime::blocking::pool::Inner::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:513:17
  57: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:471:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Two panics will occur when pressing Ctrl-C, resulting in an ungraceful exit process. We should consider improve this user experience.

@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. component/dev Dev related issues, such as rise dev, ci. labels Jun 19, 2024
@github-actions github-actions bot added this to the release-1.10 milestone Jun 19, 2024
@BugenZhao
Copy link
Member Author

The cause is that we override the system behavior of killing the process immediately on SIGINT by handling them with tokio::signal::ctrl_c in order to implement graceful shutdown, but not actually comprehensive or correct.

I'm wondering that if some sort of graceful shutdown is really necessary in our system. What about simply letting the components be killed by the system? As...

  • frontend nodes and compactor nodes are stateless,
  • compute nodes and meta nodes are also designed to survive unexpected exit (like OOM kill)

cc @zwang28 @yezizp2012

@yezizp2012
Copy link
Member

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

@BugenZhao
Copy link
Member Author

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

I guess madsim kill is still like a soft killing that triggers ctrl-c signal and runs destructors? @wangrunji0408 👀

@yezizp2012
Copy link
Member

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

I guess madsim kill is still like a soft killing that triggers ctrl-c signal and runs destructors? @wangrunji0408 👀

IIUC, madsim only issue a soft killing to shutdown the cluster after all tests are completed. During recovery it's hard killing.

@BugenZhao
Copy link
Member Author

BugenZhao commented Jul 1, 2024

I think graceful shutdown still has some benefits. For example, a compute node can actively tell the meta node to remove it when being killed, which is more responsive compared to the current approach that meta node passively finds that based on the timeout of heartbeat messages. (manual operation is required now: https://docs.risingwave.com/docs/current/k8s-cluster-scaling/#scale-in)

However, what we're doing now for graceful shutdown seems not to help at all, that is, shutting down each manager/worker one by one without doing any extra clean-ups. This can be replaced by exiting the process directly without any loss but make the code cleaner.

@BugenZhao BugenZhao changed the title ux: pressing ctrl-c in playground or single-node terminates the process with panicking ux: malfunctioning "graceful shutdown": pressing ctrl-c in playground or single-node terminates the process with panicking Jul 2, 2024
@BugenZhao
Copy link
Member Author

Also link to #7675 which added the tests for graceful shutdown under simulation. cc @wangrunji0408 Could you share some updates on this (if any)? For example, is it still used or enforced currently?

@BugenZhao
Copy link
Member Author

Only clean-up & refactor tasks remaining.

@BugenZhao BugenZhao removed this from the release-2.0 milestone Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/dev Dev related issues, such as rise dev, ci. type/enhancement Improvements to existing implementation.
Projects
None yet
Development

No branches or pull requests

2 participants