Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The sink actor occasionally encounters errors during the process of scale-in #13787

Closed
shanicky opened this issue Dec 4, 2023 · 5 comments
Closed
Assignees
Labels
type/bug Something isn't working
Milestone

Comments

@shanicky
Copy link
Contributor

shanicky commented Dec 4, 2023

Describe the bug

No response

Error message/log

gRPC request to stream service failed: Internal error: Actor 10548 exit unexpectedly: Executor error: Sink error: Remote sink error: INVALID_ARGUMENT: invalid epoch: new epoch ID should be larger than current epoch
  backtrace of inner error:
   0: anyhow::kind::Adhoc::new
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.75/src/kind.rs:71:36
   1: risingwave_jni_core::Java_com_risingwave_java_binding_Binding_sendSinkWriterErrorToChannel::{{closure}}
             at ./risingwave/src/jni_core/src/lib.rs:954:50
   2: risingwave_jni_core::execute_and_catch::{{closure}}
             at ./risingwave/src/jni_core/src/lib.rs:206:59
   3: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
   4: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
   5: std::panicking::try
 at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
   6: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
   7: risingwave_common::util::panic::rw_catch_unwind::{{closure}}
             at ./risingwave/src/common/src/util/panic.rs:44:9
   8: tokio::task::task_local::LocalKey<T>::scope_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:217:19
   9: tokio::task::task_local::LocalKey<T>::sync_scope
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:173:15
  10: risingwave_common::util::panic::rw_catch_unwind
             at ./risingwave/src/common/src/util/panic.rs:42:5
  11: risingwave_jni_core::execute_and_catch
             at ./risingwave/src/jni_core/src/lib.rs:206:11
  12: Java_com_risingwave_java_binding_Binding_sendSinkWriterErrorToChannel
             at ./risingwave/src/jni_core/src/lib.rs:948:5
  13: <unknown>

  backtrace of `StreamError`:
   0: <risingwave_stream::error::Inner as core::convert::From<risingwave_stream::error::ErrorKind>>::from
             at ./risingwave/src/stream/src/error.rs:37:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/convert/mod.rs:757:9
   2: <risingwave_stream::error::StreamError as core::convert::From<risingwave_stream::error::ErrorKind>>::from
             at ./risingwave/src/stream/src/error.rs:110:34
   3: <T as core::convert::Into<U>>::into
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/convert/mod.rs:757:9
   4: <risingwave_stream::error::StreamError as core::convert::From<risingwave_stream::executor::error::StreamExecutorError>>::from
             at ./risingwave/src/stream/src/error.rs:139:36
   5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/result.rs:1963:27
   6: <risingwave_stream::executor::dispatch::DispatchExecutor as risingwave_stream::executor::StreamConsumer>::execute::{{closure}}
             at ./risingwave/src/stream/src/executor/dispatch.rs:318:36
   7: <futures_async_stream::try_stream::GenTryStream<G> as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-async-stream-0.2.9/src/lib.rs:506:33
   8: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.28/src/stream.rs:120:9
   9: <&mut S as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.28/src/stream.rs:104:9
  10: <tokio_stream::stream_ext::next::Next<St> as core::future::future::Future>::poll
             at ./root/.cargo/git/checkouts/tokio-968c02b7a1a41bea/fe39bb8/tokio-stream/src/stream_ext/next.rs:42:9
  11: <tokio_stream::stream_ext::try_next::TryNext<St> as core::future::future::Future>::poll
             at ./root/.cargo/git/checkouts/tokio-968c02b7a1a41bea/fe39bb8/tokio-stream/src/stream_ext/try_next.rs:43:9
  12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  13: <await_tree::future::Instrumented<F,_> as core::future::future::Future>::poll
    at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/await-tree-0.1.1/src/future.rs:124:23
  14: risingwave_stream::executor::actor::Actor<C>::run_consumer::{{closure}}
             at ./risingwave/src/stream/src/executor/actor.rs:196:18
  15: <tokio::future::maybe_done::MaybeDone<Fut> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/future/maybe_done.rs:68:48
  16: risingwave_stream::executor::actor::Actor<C>::run::{{closure}}::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/macros/join.rs:126:24
  17: <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/future/poll_fn.rs:58:9
  18: risingwave_stream::executor::actor::Actor<C>::run::{{closure}}
             at ./risingwave/src/stream/src/executor/actor.rs:155:9
  19: risingwave_stream::task::stream_manager::LocalStreamManagerCore::build_actors::{{closure}}::{{closure}}
             at ./risingwave/src/stream/src/task/stream_manager.rs:698:51
  20: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:347:31
  21: tokio::task::task_local::LocalKey<T>::scope_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:217:19
  22: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:343:19
  23: await_tree::registry::TreeRoot::instrument::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/await-tree-0.1.1/src/registry.rs:51:45
  24: <futures_util::future::either::Either<A,B> as core::future::future::Future>::poll
     at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/either.rs:109:32
  25: core::ops::function::FnOnce::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/ops/function.rs:250:5
  26: tokio_metrics::task::instrument_poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-metrics-0.3.0/src/task.rs:2530:15
  27: <tokio_metrics::task::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-metrics-0.3.0/src/task.rs:2430:9
  28: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  29: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:334:17
  30: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/loom/std/unsafe_cell.rs:16:9
  31: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:323:13
  32: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:485:19
  33: <core::panic::unwind_safe::AssertUnwindSafe<F> ascore::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  34: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  35: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  36: std::panic::catch_unwind
     at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  37: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:473:18
  38: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:208:27
  39: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:153:15
  40: tokio::runtime::task::raw::RawTask::poll
   at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/raw.rs:200:18
  41: tokio::runtime::task::LocalNotified<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/mod.rs:400:9
  42: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:576:18
  43: tokio::runtime::coop::with_budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
  44: tokio::runtime::coop::budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
  45: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:575:9
  46: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:526:24
  47: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:491:21
  48: tokio::runtime::context::scoped::Scoped<T>::set
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/scoped.rs:40:9
  49: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:486:9
  50: tokio::runtime::context::runtime::enter_runtime
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
  51: tokio::runtime::scheduler::multi_thread::worker::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:478:5
  52: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:447:45
  53: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/task.rs:42:21
  54: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  55: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:334:17
  56: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/loom/std/unsafe_cell.rs:16:9
  57: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:323:13
  58: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:485:19
  59: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  60: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  61: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  62: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  63: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:473:18
  64: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:208:27
  65: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:153:15
  66: tokio::runtime::task::raw::RawTask::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/raw.rs:200:18
  67: tokio::runtime::task::UnownedTask<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/mod.rs:437:9
  68: tokio::runtime::blocking::pool::Task::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:159:9
  69: tokio::runtime::blocking::pool::Inner::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:513:17
  70: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
            at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:471:13
  71: std::sys_common::backtrace::__rust_begin_short_backtrace
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/sys_common/backtrace.rs:154:18
  72: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/thread/mod.rs:529:17
  73: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  74: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  75: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  76: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  77: std::thread::Builder::spawn_unchecked_::{{closure}}
  at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/thread/mod.rs:528:30
  78: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/ops/function.rs:250:5
  79: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/alloc/src/boxed.rs:2007:9
  80: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/alloc/src/boxed.rs:2007:9
  81: std::sys::unix::thread::Thread::new::thread_start
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/sys/unix/thread.rs:108:17
  82: start_thread
             at ./nptl/pthread_create.c:442:8
  83: clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81


Full log

t.txt

@shanicky shanicky added the type/bug Something isn't working label Dec 4, 2023
@github-actions github-actions bot added this to the release-1.6 milestone Dec 4, 2023
@hzxa21
Copy link
Collaborator

hzxa21 commented Dec 4, 2023

@wenym1 PTAL

@wenym1
Copy link
Contributor

wenym1 commented Dec 4, 2023

@shanicky Any context for how the bug comes from? Is there any way to reproduce the bug?

@fuyufjh
Copy link
Member

fuyufjh commented Dec 5, 2023

@shanicky Any context for how the bug comes from? Is there any way to reproduce the bug?

Please talk to @shanicky. This happened in a customer's environment and blocked us to scaling in a cluster.

@wenym1
Copy link
Contributor

wenym1 commented Dec 5, 2023

This code is a sanity check on whether the epoch is increasing monotonically. I just check the logic of starting epoch and sending barrier to the remote sink, and the logic LGTM.

Is it possible that the barrier of an epoch is sent twice during scale-in? @shanicky

@wenym1
Copy link
Contributor

wenym1 commented Dec 5, 2023

More information can be provided in the log with #13806

@wenym1 wenym1 modified the milestones: release-1.6, release-1.7 Jan 9, 2024
@wenym1 wenym1 modified the milestones: release-1.7, release-1.8 Mar 6, 2024
@wenym1 wenym1 modified the milestones: release-1.8, release-1.9 Apr 8, 2024
@shanicky shanicky modified the milestones: release-1.9, release-1.10 May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants