
feat(meta): try to report the root cause of recovery #13441

Merged: 25 commits into main on Feb 27, 2024

Conversation

@zwang28 (Contributor) commented Nov 15, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR attempts to provide a more concise error about the root cause of recovery, taking the following considerations into account. The idea is:

  • Currently each compute node only reports the first actor error to the meta node. This may miss the root cause, because the first reported actor error is not guaranteed to be the problematic one. This PR chooses the most likely error from all locally received actor errors when responding to the inject barrier or collect barrier RPC. It waits for as many actors' errors as possible, identifies the one that is most likely to be the root cause, and reports it when the next recovery starts, which is the point at which all actors have exited.
  • The trivial function try_find_root_cause is rule-based. It sorts all errors and takes only the highest-ranked one. It assigns low rankings to specific error types (see the sketch after the note below):
    • StreamError::Internal, which is used here.
    • StreamError::Executor. Among all StreamError::Executor variants, ChannelClosed and Internal are assigned low rankings.
  • The meta node no longer short-circuits on a compute node error when injecting or collecting a barrier. It waits until all compute nodes respond and combines their errors into one, so that the potential root cause won't be missed. The meta node simply logs the errors received from each compute node; this can be further refined to identify and log only the wanted one.

Note that this PR still cannot log failures caused by a compute node crashing.
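
Below is a minimal sketch of the rule-based ranking described above. The enums are illustrative stand-ins for the real StreamError / StreamExecutorError types, and the exact scores are assumptions for illustration only.

// Illustrative stand-ins for the real StreamError / StreamExecutorError enums.
enum ExecutorError {
    ChannelClosed(String),
    Internal(String),
    Storage(String),
}

enum StreamError {
    Internal(String),
    Executor(ExecutorError),
    BarrierSend(String),
}

// Lower score = less likely to be the root cause.
fn score(err: &StreamError) -> u8 {
    match err {
        StreamError::Executor(ExecutorError::ChannelClosed(_)) => 0,
        StreamError::Internal(_) => 0,
        StreamError::Executor(ExecutorError::Internal(_)) => 1,
        StreamError::Executor(_) => 2,
        StreamError::BarrierSend(_) => 2,
    }
}

// Sort all locally collected actor errors and take only the highest-ranked one.
fn try_find_root_cause(errors: Vec<StreamError>) -> Option<StreamError> {
    errors.into_iter().max_by_key(score)
}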

#12827

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@yezizp2012 (Member) commented

Reporting the root cause error in ForceStopActorsResponse is a little bit weird, can we just report the error in barrier related RPCs? So that we can also return the root error to frontend in this PR: #13108 . 🤔 Cc @BugenZhao @kwannoel WDYT?

@zwang28 (Contributor Author) commented Nov 15, 2023

Reporting the root cause error in ForceStopActorsResponse is a little bit weird, can we just report the error in barrier related RPCs? So that we can also return the root error to frontend in this PR: #13108 . 🤔 Cc @BugenZhao @kwannoel WDYT?

I expect the root cause error to be an event log, if introducing a persistent event log is agreed upon. Would it help if we could read and write the event log at any time from any worker node? #13392 (comment)

codecov bot commented Nov 15, 2023

Codecov Report

Attention: 91 lines in your changes are missing coverage. Please review.

Comparison is base (e09e453) 68.07% compared to head (1629bcd) 68.05%.
Report is 3 commits behind head on main.

Files Patch % Lines
src/stream/src/task/stream_manager.rs 0.00% 30 Missing ⚠️
src/meta/src/barrier/mod.rs 50.90% 27 Missing ⚠️
src/compute/src/rpc/service/stream_service.rs 0.00% 13 Missing ⚠️
src/stream/src/task/barrier_manager.rs 0.00% 8 Missing ⚠️
src/meta/src/rpc/ddl_controller.rs 0.00% 4 Missing ⚠️
src/stream/src/error.rs 0.00% 3 Missing ⚠️
src/stream/src/executor/error.rs 0.00% 3 Missing ⚠️
...c/stream/src/task/barrier_manager/managed_state.rs 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13441      +/-   ##
==========================================
- Coverage   68.07%   68.05%   -0.03%     
==========================================
  Files        1516     1516              
  Lines      261578   261683     +105     
==========================================
+ Hits       178078   178095      +17     
- Misses      83500    83588      +88     
Flag Coverage Δ
rust 68.05% <23.52%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


@fuyufjh (Member) left a comment

LGTM for the rest

@@ -124,11 +124,12 @@ impl StreamService for StreamServiceImpl {
         request: Request<ForceStopActorsRequest>,
     ) -> std::result::Result<Response<ForceStopActorsResponse>, Status> {
         let req = request.into_inner();
-        self.mgr.stop_all_actors().await?;
+        let previous_actor_failure_cause = self.mgr.stop_all_actors().await;
Member:

Meta node simply logs these errors received from each compute node. This can be further refined to identify and only log the wanted one.

Agree with this. So why not sort the errors on the meta node? That way, the compute nodes simply report all errors to meta, and the meta node can pick the "root" one.

Contributor Author:

Doable. But if there are lots of actors and we want to keep backtraces, the total data size sent to the meta node can be large.

Contributor:

Agreed, we should do a local sort on the CN first, then return only a single error to meta.

Member:

Do we really need the sorting? 😕 Note that local sorting means buffering. We're not sure how many errors we should process or how long we should wait, which inevitably leads to returning the error in ForceStopActorsResponse.

A simple proposal is to use gRPC streaming to return errors, filtering out some of them to ensure that the ranking of these errors strictly increases. When the meta node finds an error interesting or when the timeout has been reached, we can simply cut off the CollectBarrier RPC connection and we're done.
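
A sketch of the filtering idea, under the assumption that each error has already been assigned a numeric rank (the type below is illustrative, not an existing API):

// Forward an error only when its rank is strictly higher than anything seen
// so far, so the stream of reported errors strictly increases in rank.
struct MonotoneRankFilter {
    best_rank: Option<u8>,
}

impl MonotoneRankFilter {
    fn new() -> Self {
        Self { best_rank: None }
    }

    // Returns Some(err) if this error should be forwarded to the meta node.
    fn filter<E>(&mut self, err: E, rank: u8) -> Option<E> {
        if self.best_rank.map_or(true, |best| rank > best) {
            self.best_rank = Some(rank);
            Some(err)
        } else {
            None
        }
    }
}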

@yezizp2012 (Member) commented Nov 16, 2023

I still recommend maintaining and distinguishing actor errors on the compute nodes, and returning them through the CollectBarrier, SendBarrier RPCs. On the CN, we can return errors at the following time points: 1. After receiving all reported errors from actors; 2. After receiving the first error and timing out. This way, the corresponding streaming DDL in Foreground mode can also display errors directly on the client side.
To avoid buffering the errors, we can simply count the ChannelClosed errors instead. Or we can even group and count by error type and buffer only one copy of each, if our streaming errors can be classified well.
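
A rough sketch of that grouping idea, assuming actor errors can be classified into a small set of kinds (the kind names below are illustrative):

use std::collections::HashMap;

// Illustrative classification of actor errors.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
enum ErrorKind {
    ChannelClosed,
    Internal,
    Storage,
    Other,
}

#[derive(Default)]
struct ErrorSummary {
    // One representative message per kind plus a count, instead of
    // buffering every actor error.
    per_kind: HashMap<ErrorKind, (usize, String)>,
}

impl ErrorSummary {
    fn record(&mut self, kind: ErrorKind, message: String) {
        // Keep the first message seen for this kind and bump the count.
        let entry = self.per_kind.entry(kind).or_insert((0, message));
        entry.0 += 1;
    }
}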

Contributor Author:

Regarding handling errors from multiple compute nodes in the meta node: since ServerError has erased the inner error type, it's harder to implement a try_find_root_cause in the meta node than in the compute node. Although the service_name field currently has a 1-1 mapping to each concrete error type, I suppose we shouldn't rely on it.

That's why I just keep all compute nodes' errors instead. I think it's still OK to check several compute nodes' errors, as long as each one is concise.

@zwang28 (Contributor Author) commented Nov 23, 2023

I still recommend maintaining and distinguishing actor errors on the compute nodes, and returning them through the CollectBarrier, SendBarrier RPCs. On the CN, we can return errors at the following time points: 1. After receiving all reported errors from actors; 2. After receiving the first error and timing out.

This also LGTM and would be much easier to implement. Just add another timeout to the current implementation.

To avoid buffering the errors

Regarding "buffering the errors in the compute node": we have actually been doing it this way all along, so it's not a big deal?

@BugenZhao (Member) commented

Note that this PR still cannot log failures caused by a compute node crashing.

Do you mean panicking? Suddenly I realize that there's an opportunity to report the panic message in the panic hook, and the error reported here must be the root cause. But of course, this is another code path.
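
For reference, such a hook could look roughly like the sketch below; report_to_meta is a hypothetical reporting function, not an existing API.

fn install_panic_report_hook() {
    let default_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(move |info| {
        // Extract the panic message; this describes the actual crash site.
        let message = info
            .payload()
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| info.payload().downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        report_to_meta(&message);
        default_hook(info);
    }));
}

// Hypothetical: in practice this would write an event log or notify the meta
// node through an existing channel before the process exits.
fn report_to_meta(message: &str) {
    eprintln!("panic root cause: {message}");
}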

@kwannoel (Contributor) commented

Reporting the root cause error in ForceStopActorsResponse is a little bit weird, can we just report the error in barrier related RPCs? So that we can also return the root error to frontend in this PR: #13108 . 🤔 Cc @BugenZhao @kwannoel WDYT?

I expect the root cause error to be an event log, if introducing a persistent event log is agreed upon. Would it help if we could read and write the event log at any time from any worker node? #13392 (comment)

Hmm, I'm a little bit confused. It seems like this PR accomplishes the exact same thing as the event log PR #13392. Could you share how they are complementary?

I also agree with @yezizp2012's comment. We should report the error in the barrier-related RPCs that collect barriers.

@zwang28 (Contributor Author) commented Nov 16, 2023

Could you share how they are complementary

#13392 records failed streaming job creation, which does not necessarily indicate a recovery.
This PR records the root cause of an actor failure, which leads to recovery.

@kwannoel (Contributor) commented Nov 16, 2023

Reporting the root cause error in ForceStopActorsResponse is a little bit weird, can we just report the error in barrier related RPCs? So that we can also return the root error to frontend in this PR: #13108 . 🤔 Cc @BugenZhao @kwannoel WDYT?

I expect the root cause error to be an event log, if introducing a persistent event log is agreed upon. Would it help if we could read and write the event log at any time from any worker node? #13392 (comment)

I mean, I don't quite get the implication. Seems like root cause error being an event log has no relation to whether we report the error in barrier related RPCs.

Even if we report the error in barrier related RPCs it can still be an event log and vice versa.

By reporting it in the barrier-related RPCs, it's linked to the error source, e.g. barrier collection failed. So it's more intuitive when we are figuring out the origin of the error.

Linking it to ForceStopActorsResponse seems unintuitive, because we get an error that's unrelated to the action of ForceStopActors.

@BugenZhao (Member) commented

Internal errors, which should be referred to as "Uncategorized" errors, can correspond to various types of errors. So it's inaccurate to always assign low rankings to them. For example, all bail! macro calls will generate an Internal error in executors, and they're likely to be a notable error.

@zwang28 (Contributor Author) commented Nov 16, 2023

Seems like root cause error being an event log has no relation to whether we report the error in barrier related RPCs.

I mean, if the event log can be introduced, we don't need to piggyback the error onto any RPC response like this PR does; we can just initiate a new event log from the worker node at the right time instead. @kwannoel

@kwannoel (Contributor) commented

Seems like root cause error being an event log has no relation to whether we report the error in barrier related RPCs.

I mean, if the event log can be introduced, we don't need to piggyback the error onto any RPC response like this PR does; we can just initiate a new event log from the worker node at the right time instead. @kwannoel

Seems more ambiguous to find the root cause in that case.
Because the event log is not limited to recovery, there could be other use cases for it.
There's no hard rule that prevents unrelated errors from being included in the event log, so we would have to do the extra job of finding which one is the cause.

Another scenario is an error loop in the CN that triggers recovery again and again. Multiple errors are produced, and it also seems harder to find the root cause.

Additionally, if we trace the error at its origin, I think it is then possible to diagnose whether it's because the CN crashed, which is not uncommon.

@zwang28 (Contributor Author) commented Nov 16, 2023

Internal errors, which should be referred to as "Uncategorized" errors, can correspond to various types of errors. So it's inaccurate to always assign low rankings to them. For example, all bail! macro calls will generate an Internal error in executors, and they're likely to be a notable error.

  • For the outermost StreamError, none of its Internal usages can be root cause based on my review. If I'm wrong, maybe we can specialize them into concrete errors case by case.

For example, all bail! macro calls will generate an Internal error in executors, and they're likely to be a notable error.

  • For the inner StreamExecutorError, its Internal does have a low ranking (1), but it's still higher than ChannelClosed (0). So with regards to the concern above, StreamExecutorError::Internal actually has a high enough ranking, I think.

@BugenZhao (Member) commented

  • For the outermost StreamError, none of its Internal usages can be root cause based on my review. If I'm wrong, maybe we can specialize them into concrete errors case by case.

What about this?

if remaining_actors.contains(actor_id) {
    bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
}

  • For the inner StreamExecutorError, its Internal does have a low ranking (1), but it's still higher than ChannelClosed (0). So with regards to the concern above, StreamExecutorError::Internal actually has a high enough ranking, I think.

Makes sense to me.

@yezizp2012 (Member) commented

  • For the outermost StreamError, none of its Internal usages can be root cause based on my review. If I'm wrong, maybe we can specialize them into concrete errors case by case.

What about this?

if remaining_actors.contains(actor_id) {
    bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
}

That's exactly the error return path of the barrier-processing-related RPCs. We'd better replace it with the try-find-root-cause logic.

@zwang28 (Contributor Author) commented Nov 16, 2023

  • For the outermost StreamError, none of its Internal usages can be root cause based on my review. If I'm wrong, maybe we can specialize them into concrete errors case by case.

What about this?

if remaining_actors.contains(actor_id) {
    bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
}

That's exactly the error return path of the barrier-processing-related RPCs. We'd better replace it with the try-find-root-cause logic.

The info from this bail! is redundant for finding the root cause, because it is derived from failure_actors, which is where we will look for the root cause. So assigning this bail! a low ranking seems reasonable to me.

The early return here that triggers recovery doesn't prevent us from calling try_find_root_cause later when recovery does start.

@zwang28 (Contributor Author) commented Nov 21, 2023

So try_find_root_cause for response of inject_barrier and barrier_complete RPC. No modifying ForceStopActorsResponse. Does this look good to you @kwannoel @yezizp2012

One problem is that the meta node only shows the first RPC error when it injects or collects a barrier, which may not be from the problematic compute node. We may need to replace try_join_all.

In contrast, ForceStopActors will succeed most of the time, so that we can get error messages from all compute nodes through its response.

@yezizp2012 (Member) commented

So try_find_root_cause for response of inject_barrier and barrier_complete RPC. No modifying ForceStopActorsResponse. Does this look good to you @kwannoel @yezizp2012

👍 Yes, that would be perfect, and the root cause error message for a foreground streaming job can be displayed directly in the user session.

@kwannoel (Contributor) left a comment

One problem is that the meta node only shows the first RPC error when it injects or collects a barrier, which may not be from the problematic compute node. We may need to replace try_join_all.

Ok let's track that.

Rest LGTM

@BugenZhao (Member) commented

So try_find_root_cause for response of inject_barrier and barrier_complete RPC. No modifying ForceStopActorsResponse.

LGTM. Ping me when the PR is ready for review again. :)

@zwang28 (Contributor Author) commented Nov 21, 2023

I've updated the PR description. @yezizp2012 @kwannoel @BugenZhao
It's ready for review.

Currently each compute node only reports the first actor error to meta node. This may miss the root cause because the first actor error is not guaranteed to be the problematic one.

TBH, I think running try_find_root_cause when responding to the inject_barrier and barrier_complete RPCs doesn't make much difference for the issue above, because the root cause may not yet have been added to ManagedBarrierState::failure_actors.
But the issue is actually very rare, so try_find_root_cause in the compute node may be unnecessary. I'm OK with removing it.

The meta node no longer short-circuits on a compute node error when injecting or collecting a barrier. It waits until all compute nodes respond and combines their errors into one, so that the potential root cause won't be missed.

This approach just merges the errors into one in the meta node, instead of choosing the most likely one.
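
A rough sketch of what such a merge could look like, assuming anyhow is available; the real merge_node_rpc_errors helper may differ in signature and formatting.

use std::fmt::Write;

// Illustrative: combine per-worker errors into one error so that no compute
// node's report is lost.
fn merge_node_rpc_errors<E: std::fmt::Display>(
    message: &str,
    errors: impl IntoIterator<Item = (u32, E)>, // (worker_id, error)
) -> anyhow::Error {
    let mut buf = message.to_string();
    for (worker_id, err) in errors {
        write!(buf, "\n  worker {worker_id}: {err}").unwrap();
    }
    anyhow::anyhow!(buf)
}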

Currently I still find the barrier failure error in the meta node hard to read, mainly because:

  1. The compute node's backtrace is always printed, even with "{}". I think it's because the backtrace has been converted into part of the error message string when crossing the boundary between the compute node and the meta node?
  2. This span always includes the err. With a backtrace, each log inside it will be verbose.

if results_err.is_empty() {
    return Ok(results_ok);
}
let merged_error = merge_node_rpc_errors("merged RPC Error", results_err);
Member:

The message does not look that informative.

@BugenZhao (Member) commented

IIUC, now we take errors reported by all compute nodes into consideration, but only the error from a single actor will be reported on each compute node, which means that there could still be race cases and the single error is not the root cause of recovery?

@zwang28 (Contributor Author) commented Feb 19, 2024

IIUC, now we take errors reported by all compute nodes into consideration, but only the error from a single actor will be reported on each compute node, which means that there could still be race cases and the single error is not the root cause of recovery?

Yes.
I'm still hesitating about whether to bring back the try_find_root_cause in the compute node, because I think the upcoming partial checkpoint may change the way notify_failure works.
I'm adding try_find_root_cause back.

@wenym1 (Contributor) left a comment

I think we can collect all actor errors in the force_stop_actors RPC.

src/meta/src/barrier/rpc.rs (resolved)
src/meta/src/barrier/rpc.rs (outdated, resolved)
@wenym1 (Contributor) left a comment

LGTM

@zwang28 (Contributor Author) commented Feb 19, 2024

IIUC, now we take errors reported by all compute nodes into consideration, but only the error from a single actor will be reported on each compute node, which means that there could still be race cases and the single error is not the root cause of recovery?

Yes. I'm still hesitating about whether to bring back the try_find_root_cause in the compute node, because I think the upcoming partial checkpoint may change the way notify_failure works. I'm adding try_find_root_cause back.

Brought back try_find_root_cause in the compute node. PTAL @BugenZhao
The compute node now always waits 3s before reporting an inject/collect barrier error to the meta node. Hopefully the root actor failure has been collected locally within that period.

@BugenZhao (Member) left a comment

Rest LGTM

@@ -36,6 +38,14 @@ impl StreamServiceImpl {
     pub fn new(mgr: LocalStreamManager, env: StreamEnvironment) -> Self {
         StreamServiceImpl { mgr, env }
     }
 
+    async fn try_get_root_actor_failure(&self) -> Option<StreamError> {
+        tokio::time::sleep(Duration::from_secs(3)).await;
Member:

If a previous epoch has failed and the root failure was located, do we still need to delay by 3 seconds for each subsequent epoch? Can we improve this to let the following epochs fail immediately, by moving the sleep into the event loop?

Contributor Author:

Good idea to cache the root cause.
But blocking the event loop with a sleep will also prevent new actor failures from being added to LocalBarrierWorker::failure_actors, which makes the sleep useless.
Thinking about another way 🤔.

@zwang28 (Contributor Author) commented Feb 20, 2024

Just realized that there is no guarantee that the LocalBarrierWorker instance being queried is the one that was active when the actor failed 🥵. For example:

  1. An actor fails.
  2. try_get_root_actor_failure starts to sleep.
  3. A recovery succeeds and resets the LocalBarrierWorker, along with the failure cause.
  4. An actor fails again.
  5. After 3s, try_get_root_actor_failure queries the LocalBarrierWorker, but it gets the wrong failure cause, from step 4 instead of step 1.

Note that the issue still holds even without the sleep, because sending LocalBarrierEvent::AwaitEpochCompleted and then LocalBarrierEvent::TryGetRootActorFailure is not atomic. I thought about doing the sleep and then try_get_root_actor_failure here, but failed to come up with a good implementation.

To fix it, each LocalBarrierWorker instance may need a unique identifier. But I doubt it's worth doing at the moment.
@wenym1

Contributor:

Good idea to cache the root cause. But blocking the event loop with a sleep will also prevent new actor failures from being added to LocalBarrierWorker::failure_actors, which makes the sleep useless. Thinking about another way 🤔.

Another solution could be: when there is any actor failure, the worker loop keeps polling the channel to receive actor failures until a 3-second timeout is reached. A concern is that the current barrier manager has only a single channel, and the actor failures will be mixed with other channel events, which may make this logic complicated.
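
A sketch of that polling loop, assuming a dedicated mpsc channel for actor failures (the channel and item types are illustrative):

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

// After the first actor failure arrives, keep draining the failure channel
// until a 3-second deadline, then hand everything to root-cause selection.
async fn collect_failures_with_timeout(
    rx: &mut mpsc::UnboundedReceiver<(u32, String)>, // (actor_id, error message)
    first: (u32, String),
) -> Vec<(u32, String)> {
    let mut failures = vec![first];
    let deadline = Instant::now() + Duration::from_secs(3);
    loop {
        tokio::select! {
            _ = sleep_until(deadline) => break,
            maybe = rx.recv() => match maybe {
                Some(failure) => failures.push(failure),
                None => break, // channel closed, nothing more to wait for
            },
        }
    }
    failures
}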

Contributor:

We can change to reporting actor errors in a separate channel, if it's not complicated. Let me have a try.

Member:

Fortunately, the sleeping for different epochs is concurrent. We may leave it as it is if we find it's hard to improve that. 🤣

@wenym1 (Contributor) commented Feb 21, 2024

With #15157 we can keep fetching actor errors within a timeout in the worker loop when getting any actor error. That PR changes to reporting actor errors in a separate channel.

@zwang28 (Contributor Author) commented Feb 27, 2024

With #15157 we can keep fetching actor errors within a timeout in the worker loop when getting any actor error. That PR changes to reporting actor errors in a separate channel.

Implemented. PTAL @wenym1

@wenym1 (Contributor) left a comment

Rest LGTM!

src/stream/src/task/barrier_manager.rs (outdated, resolved)
src/stream/src/task/barrier_manager.rs (outdated, resolved)
@zwang28 enabled auto-merge on February 27, 2024 08:52
@zwang28 added this pull request to the merge queue on Feb 27, 2024
Merged via the queue into main with commit 17212b3 on Feb 27, 2024
25 of 27 checks passed
@zwang28 deleted the wangzheng/recovery_cause branch on February 27, 2024 09:40