feat(meta): try to report the root cause of recovery #13441
Conversation
Reporting the root cause error in
I expect the root cause error to be an event log, if introducing a persistent event log is agreed upon. Will it help if we can read and write the event log at any time from any worker node? #13392 (comment)
Codecov Report

Attention: additional details and impacted files

```
@@            Coverage Diff             @@
##              main   #13441      +/-   ##
===========================================
- Coverage    68.07%   68.05%    -0.03%
===========================================
  Files         1516     1516
  Lines       261578   261683      +105
===========================================
+ Hits        178078   178095       +17
- Misses       83500    83588       +88
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
LGTM for the rest
```diff
@@ -124,11 +124,12 @@ impl StreamService for StreamServiceImpl {
         request: Request<ForceStopActorsRequest>,
     ) -> std::result::Result<Response<ForceStopActorsResponse>, Status> {
         let req = request.into_inner();
-        self.mgr.stop_all_actors().await?;
+        let previous_actor_failure_cause = self.mgr.stop_all_actors().await;
```
> Meta node simply logs these errors received from each compute node. This can be further refined to identify and only log the wanted one.

Agree with this. So why not sort the errors on the meta node? That way, the compute nodes simply report all errors to meta, and the meta node can pick the "root" one.
Doable. But if there are lots of actors and we want to keep backtraces, the total data size sent to the meta node can be large.
Agree, we should do some local sorting on the CN first, then only return a single error to meta.
Do we really need the sorting? 😕 Note that local sorting means buffering. We're not sure how many errors we should process or how long we should wait, which inevitably leads to returning the error in `ForceStopActorsResponse`.

A simple proposal is to use gRPC streaming to return errors, filtering out some of them to ensure that the ranking of these errors strictly increases. When the meta node finds an error interesting, or when the timeout has been reached, we can simply cut off the `CollectBarrier` RPC connection and we're done.
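For illustration, the "strictly increasing ranking" filter could look roughly like the sketch below; `ActorError`, `rank`, and `filter_strictly_increasing` are hypothetical names for this discussion, not types from this PR.

```rust
// Hypothetical sketch: only forward an error if it ranks strictly higher than
// everything forwarded before it, so the receiver (e.g. the meta node) always
// sees a monotonically improving root-cause candidate.
struct ActorError {
    rank: u32, // higher = more likely to be the root cause
    message: String,
}

fn filter_strictly_increasing(errors: impl IntoIterator<Item = ActorError>) -> Vec<ActorError> {
    let mut best_rank: Option<u32> = None;
    errors
        .into_iter()
        .filter(|e| {
            let better = best_rank.map_or(true, |r| e.rank > r);
            if better {
                best_rank = Some(e.rank);
            }
            better
        })
        .collect()
}
```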
I still recommend maintaining and distinguishing actor errors on the compute nodes, and returning them through the `CollectBarrier` and `SendBarrier` RPCs. On the CN, we can return errors at the following time points: 1. after receiving all reported errors from actors; 2. after receiving the first error and timing out. This way, the corresponding streaming DDL in `Foreground` mode can also directly display errors on the client side.

To avoid buffering the errors, we can simply count the `Channel closed` errors instead. Or we can even group and count errors by type, and only buffer one copy of each error, if our streaming errors can be well classified.
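A minimal sketch of the "group and count, buffer one copy" idea, assuming a hypothetical coarse classification of errors (RisingWave's real error enum is richer than this):

```rust
use std::collections::HashMap;

// Hypothetical coarse classification of streaming errors.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
enum ErrorKind {
    ChannelClosed,
    Storage,
    Other,
}

// Keep one representative message per kind plus an occurrence count,
// instead of buffering every actor error.
#[derive(Default)]
struct ErrorSummary {
    per_kind: HashMap<ErrorKind, (String, usize)>,
}

impl ErrorSummary {
    fn record(&mut self, kind: ErrorKind, message: String) {
        self.per_kind
            .entry(kind)
            .and_modify(|(_, count)| *count += 1)
            .or_insert((message, 1));
    }
}
```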
With regard to handling errors from multiple compute nodes on the meta node: since `ServerError` has erased the inner error type, it's harder to implement a `try_find_root_cause` on the meta node than on a compute node. Although the `service_name` field currently has a 1-1 mapping to each concrete error type, I suppose we shouldn't rely on it.

That's why I just keep all compute nodes' errors instead. I think it's still OK to check several compute nodes' errors, as long as each one is concise.
> I still recommend maintaining and distinguishing actor errors on the compute nodes, and returning them through the CollectBarrier, SendBarrier RPCs. On the CN, we can return errors at the following time points: 1. After receiving all reported errors from actors; 2. After receiving the first error and timing out.

This also LGTM and would be much easier to implement: just add another timeout to the current implementation.

> To avoid buffering the errors

Regarding "buffering the errors in the compute node": we have actually been doing it this way all along, so it's not a big deal?

Do you mean panicking? Suddenly I realize that there's an opportunity to report the panic message in the panic hook, and the error reported here must be the root cause. But of course, this is another code path.
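As an aside, capturing the panic message in a hook needs only the standard library; a minimal sketch follows, where forwarding the captured message (log, RPC, event log) is deliberately left open:

```rust
use std::panic;

// Register a panic hook that extracts the panic message before delegating to
// the previously installed hook.
fn install_panic_capture() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Panic payloads are usually &str or String.
        let msg = info
            .payload()
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| info.payload().downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "<non-string panic payload>".to_string());
        eprintln!("captured panic as root-cause candidate: {msg}");
        default_hook(info);
    }));
}
```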
Hmm, I'm a little bit confused. It seems like this PR accomplishes the exact same thing as the event log PR #13392. Could you share how they are complementary? I also agree with @yezizp2012's comment: we should report the error in the barrier-related RPCs which collect barriers.
#13392 records failed streaming job creation, which does not necessarily indicate a recovery.
I mean, I don't quite get the implication. It seems that the root cause error being an event log has no relation to whether we report the error in barrier-related RPCs. Even if we report the error in barrier-related RPCs it can still be an event log, and vice versa. By reporting it in the barrier-related RPCs, it's linked to the error source, e.g. barrier collection failed, so it's more intuitive when we are figuring out the origin of the error. Linking it to
I mean, if the event log can be introduced, we don't need to piggyback the error through any RPC response like this PR does; we can just write a new event log entry from the worker node at the right time instead.
It seems more ambiguous to find the root cause in that case. Another scenario is that we have some error loop in the CN, triggering recovery again and again; then multiple errors are produced, and it also seems harder to find the root cause. Additionally, if we trace the error at its origin, I think it becomes possible to diagnose whether it's because the CN crashed, which is not uncommon.
What about this? risingwave/src/stream/src/task/barrier_manager/managed_state.rs, lines 269 to 271 at 7810f9c
Makes sense to me.
That's exactly the error return path of the barrier-processing-related RPCs. We'd better replace it with the try-find-root-cause logic.
The info from this

The early return here that triggers recovery doesn't prevent us from calling try_find_root_cause later when recovery does start.
So, one problem is that the meta node just shows the first RPC error when it injects or collects a barrier, which may not be from the problematic compute node. We may need to replace try_join_all. In contrast,
👍 Yes, that would be perfect, and the root cause error message for
> One problem is the meta node just shows the first RPC error when it injects or collects a barrier, which may not be from the problematic compute node. May need to replace try_join_all.

Ok, let's track that.
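A sketch of the direction discussed here, using plain `futures` utilities rather than the actual meta-node code: unlike `try_join_all`, which resolves as soon as any future fails, `join_all` waits for every per-node RPC, so all errors can be inspected and the most relevant one picked afterwards.

```rust
use futures::future::join_all;
use std::future::Future;

// Wait for every per-node RPC and split the results into successes and errors.
async fn collect_all_node_results<T, E, F>(futs: Vec<F>) -> (Vec<T>, Vec<E>)
where
    F: Future<Output = Result<T, E>>,
{
    let mut oks = Vec::new();
    let mut errs = Vec::new();
    for res in join_all(futs).await {
        match res {
            Ok(v) => oks.push(v),
            Err(e) => errs.push(e),
        }
    }
    (oks, errs)
}
```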
Rest LGTM
LGTM. Ping me when the PR is ready for review again. :)
I've updated the PR description. @yezizp2012 @kwannoel @BugenZhao
TBH, I think

This method just merges errors into one on the meta node, instead of choosing the most likely one. Currently I still find the barrier failure error on the meta node hard to read, mainly because:
src/meta/src/barrier/rpc.rs (Outdated)
```rust
if results_err.is_empty() {
    return Ok(results_ok);
}
let merged_error = merge_node_rpc_errors("merged RPC Error", results_err);
```
The message does not look that informative.
IIUC, we now take errors reported by all compute nodes into consideration, but only the error from a single actor will be reported by each compute node, which means there could still be race cases where the single error is not the root cause of the recovery?
Yes.
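For context, a helper that merges per-node errors into one message might have roughly the following shape. This is only a hedged sketch of the "concatenate everything" approach being discussed; the real `merge_node_rpc_errors` in the repository may format things differently.

```rust
use std::fmt::Write;

// Illustrative only: fold each worker's error into a single numbered message.
fn merge_node_rpc_errors_sketch<E: std::fmt::Display>(
    message: &str,
    errors: impl IntoIterator<Item = (u32, E)>, // (worker_id, error)
) -> String {
    let mut merged = message.to_string();
    for (worker_id, err) in errors {
        let _ = write!(merged, "\n  worker {worker_id}: {err}");
    }
    merged
}
```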
I think we can collect all actor errors in the force_stop_actors RPC.
# Conflicts: # src/meta/src/barrier/rpc.rs
LGTM
Brought back
Rest LGTM
```diff
@@ -36,6 +38,14 @@ impl StreamServiceImpl {
     pub fn new(mgr: LocalStreamManager, env: StreamEnvironment) -> Self {
         StreamServiceImpl { mgr, env }
     }
 
+    async fn try_get_root_actor_failure(&self) -> Option<StreamError> {
+        tokio::time::sleep(Duration::from_secs(3)).await;
```
If a previous epoch has failed and the root failure was located, do we still need to delay by 3 seconds for each subsequent epoch? Can we improve this to let the following epochs fail immediately, by moving the `sleep` into the event loop?
Good idea to cache the root cause.

But blocking the event loop with `sleep` will also prevent new actor failures from being added to `LocalBarrierWorker::failure_actors`, which makes the `sleep` useless.

Thinking about another way 🤔.
Just realized that there is no guarantee that the `LocalBarrierWorker` instance being queried is the one that was active when the actor failed 🥵. For example:

1. An actor fails.
2. `try_get_root_actor_failure` starts to sleep.
3. A recovery succeeds and resets the `LocalBarrierWorker`, along with the failure cause.
4. An actor fails again.
5. After 3s, `try_get_root_actor_failure` queries `LocalBarrierWorker`, but it gets the wrong failure cause from step 4, not step 1.

Note that the issue still holds even without the `sleep`, because sending `LocalBarrierEvent::AwaitEpochCompleted` and then `LocalBarrierEvent::TryGetRootActorFailure` is not atomic. I thought about doing the `sleep` and then `try_get_root_actor_failure` here, but failed to come up with a good implementation.

To fix it, each `LocalBarrierWorker` instance may need a unique identifier (a rough sketch of that idea follows below), but I doubt it's worth doing at the moment.
@wenym1
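For illustration only, the "unique identifier" idea could be shaped roughly like this generation-counter sketch; all names here are hypothetical and nothing like this exists in the codebase or this PR:

```rust
// Each reset bumps a generation, and a root-cause query is only answered if
// it targets the generation that was alive when the failure was observed.
struct BarrierWorkerState {
    generation: u64,
    failure_cause: Option<String>,
}

impl BarrierWorkerState {
    fn reset(&mut self) {
        self.generation += 1;
        self.failure_cause = None;
    }

    fn root_cause_for(&self, expected_generation: u64) -> Option<&str> {
        (self.generation == expected_generation)
            .then(|| self.failure_cause.as_deref())
            .flatten()
    }
}
```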
> Good idea to cache the root cause. But blocking the event loop with `sleep` will also prevent new actor failures from being added to `LocalBarrierWorker::failure_actors`, which makes the `sleep` useless. Thinking about another way 🤔.

Another solution could be: when there is any actor failure, the worker loop keeps polling the channel to receive further actor failures until a 3-second timeout is reached. A concern is that the current barrier manager only has a single channel, so actor failures will be mixed with other channel events, which may make this logic complicated.
We can change to reporting actor errors in a separate channel, if it's not complicated. Let me have a try.
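A minimal sketch of the separate-channel idea combined with the timeout-bounded draining described above, assuming a dedicated unbounded tokio mpsc channel for actor errors and a fixed grace period; names and types are placeholders, not the PR's actual code:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

// After the first actor error arrives on a dedicated channel, keep draining it
// for a grace period so later (possibly more relevant) errors are also
// collected, then hand the whole batch to root-cause selection.
async fn collect_actor_errors(
    rx: &mut mpsc::UnboundedReceiver<String>,
    grace: Duration,
) -> Vec<String> {
    let mut errors = Vec::new();
    let Some(first) = rx.recv().await else {
        return errors; // channel closed without any error
    };
    errors.push(first);
    let deadline = Instant::now() + grace;
    loop {
        tokio::select! {
            _ = sleep_until(deadline) => break,
            maybe_err = rx.recv() => match maybe_err {
                Some(err) => errors.push(err),
                None => break,
            },
        }
    }
    errors
}
```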
Fortunately, the sleeps for different epochs run concurrently. We may leave it as it is if we find it's hard to improve. 🤣
With #15157 we can keep fetching actor errors within a timeout in the worker loop once any actor error is received. This PR changes to reporting actor errors in a separate channel.
Rest LGTM!
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR attempts to provide a more concise error about the root cause of recovery, taking these considerations into account. The idea is:

- This PR waits for as many actors' errors as possible, identifies the one that is most likely the root cause, and reports it when the next recovery starts. That's the time point when all actors have exited.
- try_find_root_cause is rule-based. It sorts all errors and only takes the highest-ranked one. It assigns low rankings to specific error types (a rough sketch of such a ranking is shown below):
- The meta node simply logs these errors received from each compute node. This can be further refined to identify and only log the wanted one.
- Note that this PR still cannot log failures caused by a compute node crashing.
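For illustration, a rule-based ranking in this spirit might look like the following minimal sketch; the scoring rules, string matching, and function names are assumptions for this example, not the PR's actual implementation of try_find_root_cause.

```rust
// Illustrative scoring rules: errors that are usually secondary symptoms
// (e.g. a closed exchange channel) get low scores so a more specific error wins.
fn score(error_message: &str) -> u32 {
    if error_message.contains("channel closed") {
        0
    } else if error_message.contains("barrier") {
        1
    } else {
        10
    }
}

// Pick the highest-ranked error as the reported root cause, if any.
fn pick_root_cause(errors: &[String]) -> Option<&String> {
    errors.iter().max_by_key(|e| score(e.as_str()))
}
```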
#12827
Checklist
./risedev check (or alias, ./risedev c)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.