pass streaming error score to meta to identify the global-root error #17368

Closed
BugenZhao opened this issue Jun 20, 2024 · 0 comments · Fixed by #17685
Labels
component/meta Meta related issue. component/streaming Stream processing related issue. type/enhancement Improvements to existing implementation.
We have a scoring mechanism to identify the root error that fails the streaming pipeline locally on each compute node.

/// Tries to find the root cause of actor failures, based on hard-coded rules.
pub fn try_find_root_actor_failure<'a>(
    actor_errors: impl IntoIterator<Item = &'a StreamError>,
) -> Option<StreamError> {
    // Explicitly list all error kinds here to notice developers to update this function when
    // there are changes in error kinds.
    fn stream_executor_error_score(e: &StreamExecutorError) -> i32 {

It would be better to pass this score to the meta service, so that meta can find the global root cause of the failure across all compute nodes. Currently it only concatenates the error messages from every node.

if let Some(command) = node.inflight_barriers.pop_front() {
    let errors = self.collect_errors(node.worker.id, err).await;
    let err = merge_node_rpc_errors("get error from control stream", errors);
    self.context.report_collect_failure(&command, &err);
    break Err(err);
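
As a rough illustration of what meta could do once each node reports a score, here is a minimal sketch. Everything in it is hypothetical: `ScoredNodeError` and `pick_global_root` are names made up for this example, and it assumes that a higher score means "more likely to be the root cause". It is not the existing `merge_node_rpc_errors` logic.

```rust
/// Hypothetical: an error reported by one compute node, together with the
/// score assigned by that node's local root-cause analysis.
/// Assumption: a higher score means the error is more likely the root cause.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoredNodeError {
    pub worker_id: u32,
    pub message: String,
    /// `None` if the node did not attach a score to its error.
    pub score: Option<i32>,
}

/// Picks the error with the highest score among all nodes.
///
/// Errors without a score are ignored. Returns `None` if no error carries a
/// score, so the caller can fall back to merging all messages as before.
/// On ties, the last error with the maximum score is returned.
pub fn pick_global_root(errors: &[ScoredNodeError]) -> Option<&ScoredNodeError> {
    errors
        .iter()
        .filter(|e| e.score.is_some())
        .max_by_key(|e| e.score)
}
```

Returning `None` when no node provides a score keeps the current concatenation behavior available as a fallback, for example during a rolling upgrade where some nodes do not yet send scores.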

To embed the score in the gRPC error, one could make it a field of ServerError.

/// The error produced by the gRPC server and sent to the client on the wire.
#[derive(Debug, Serialize, Deserialize)]
struct ServerError {
    error: serde_error::Error,
    service_name: Option<ServiceName>,
}
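
A possible shape for the extended struct is sketched below. To stay dependency-free it drops the serde derives and uses `String` as a stand-in for both `serde_error::Error` and `ServiceName`. The struct name `ServerErrorSketch`, the `score` field, and the `with_score` helper are assumptions for illustration only, not the actual change.

```rust
/// Hypothetical, simplified stand-in for `ServerError` with an extra score.
#[derive(Debug)]
pub struct ServerErrorSketch {
    /// Stand-in for `serde_error::Error`.
    pub error: String,
    /// Stand-in for `Option<ServiceName>`.
    pub service_name: Option<String>,
    /// Hypothetical new field: the local root-cause score, if the failing
    /// service computed one. Optional so that errors from services that do
    /// not score their errors are still representable.
    pub score: Option<i32>,
}

impl ServerErrorSketch {
    /// Creates an error without a service name or a score.
    pub fn new(error: impl Into<String>) -> Self {
        Self {
            error: error.into(),
            service_name: None,
            score: None,
        }
    }

    /// Attaches a score, builder-style.
    pub fn with_score(mut self, score: i32) -> Self {
        self.score = Some(score);
        self
    }
}
```

Keeping the score optional means existing call sites that construct the error without a score need no change, and the receiving side can treat a missing score as "unknown".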

@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. component/streaming Stream processing related issue. component/meta Meta related issue. labels Jun 20, 2024
@github-actions github-actions bot added this to the release-1.10 milestone Jun 20, 2024
@BugenZhao BugenZhao changed the title pass streaming error score to meta to only print the root error pass streaming error score to meta to identify the global-root error Jun 20, 2024
@BugenZhao BugenZhao modified the milestones: release-1.10, release-1.11 Jul 10, 2024