Discussion: expose sink/recovery error to user #12996
We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta node's recovery span (see risingwave/src/meta/src/barrier/mod.rs, lines 947 to 951 at f6e3444).
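As a rough illustration of that reporting path, here is a minimal sketch (not RisingWave's actual code; the function and span names are assumptions) of how the reported error could be attached to the recovery span so a single root-cause message is searchable in the meta node's logs:

```rust
// Minimal sketch, assuming the `tracing` and `anyhow` crates.
// `recover_after_failure` and the span name "recovery" are hypothetical.
use tracing::Instrument;

async fn recover_after_failure(err: anyhow::Error) {
    // Record the reported error as a field of the recovery span.
    let span = tracing::info_span!("recovery", cause = %err);
    async move {
        tracing::error!(error = %err, "barrier collection failed, starting recovery");
        // ... reset the barrier manager and rebuild the streaming graph ...
    }
    .instrument(span)
    .await;
}
```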
However, I said "going well" because there is indeed a race between different compute nodes reporting the error to the meta node. For example, if an actual error propagates and leads to unexpected actor failures upstream or downstream, it's possible that the meta node picks those as the cause of the failure. A mechanism for distinguishing them might be required. cc @yezizp2012
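One possible mechanism, sketched below with hypothetical names, is to collect all errors reported for a failed epoch and prefer anything that is not an obviously propagated failure:

```rust
// Hedged sketch: rank reported errors so that propagated failures such as
// "actor exited" or "exchange channel closed" lose to a real root cause.
// The function signature and message matching are illustrative only.
fn pick_root_cause(reported: Vec<(u32 /* actor id */, String)>) -> Option<String> {
    reported
        .into_iter()
        .max_by_key(|(_, msg)| {
            // Propagated errors score lower than anything else.
            if msg.contains("exchange channel closed") || msg.contains("actor exited") {
                0
            } else {
                1
            }
        })
        .map(|(actor, msg)| format!("actor {actor} failed: {msg}"))
}
```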
Strong +1 for action 1 & 2. Personally, I would like to do 1 & 2 at the same time.
Besides, instead of an error table, I'd like to apply the idea of #12366 to sink/recovery errors, i.e. print all the logs as structured logs with a specific tag, so they can be queried from any log collection system like Loki.
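A minimal sketch of what that could look like with the `tracing` crate; the target and field names below are assumptions rather than an existing convention:

```rust
// Emit sink errors as structured logs under a dedicated target so a log
// collection system such as Loki can filter them by tag.
fn report_sink_error(sink_id: u64, err: &anyhow::Error) {
    tracing::error!(
        target: "user_facing_error", // hypothetical tag to filter on
        kind = "sink",
        sink_id,
        error = %err,
        "sink encountered an error"
    );
}
```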
@BugenZhao Will the error message be displayed as a SQL error when the user tries to run a query? I am thinking of attaching the reason for the recovery to the SQL result instead of just showing "The cluster is starting or recovering".
Currently nope, but this is a good point. I guess we can leverage the notification/subscription mechanism from meta to frontend to broadcast some cluster information to users. I came up with this idea when implementing #11936. For example, we may want to …
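For instance, a hedged sketch of the frontend side, assuming meta pushes the latest recovery reason over the existing notification channel and the frontend caches it (all names below are hypothetical):

```rust
// A tokio watch channel is used here purely for illustration of the cache.
use tokio::sync::watch;

struct ClusterStatus {
    recovery_reason: watch::Receiver<Option<String>>,
}

impl ClusterStatus {
    // Called before running a query; returns a user-facing error that carries
    // the recovery cause instead of a bare "cluster is recovering".
    fn check_available(&self) -> Result<(), String> {
        match &*self.recovery_reason.borrow() {
            None => Ok(()),
            Some(reason) => Err(format!(
                "The cluster is starting or recovering, caused by: {reason}"
            )),
        }
    }
}
```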
Questions:
Just thinking of more ideas:
Finally we get a real use case for matching the streaming error variant! 😄 However, it might be worth noting that there may still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.
Yeah, in those cases we can just return the actor exited / exchange channel closed error.
Yes, I tried to use the error kind to distinguish the streaming errors in this way, ignoring channel-close errors; however:
// The failure actors could exit before the barrier is issued, while their
// up-downstream actors could be stuck somehow. Return error directly to trigger the
// recovery.
for (actor_id, err) in &self.failure_actors {
    if remaining_actors.contains(actor_id) {
        bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
    }
}
This definitely needs a refactor, as it's not considered to be "uncategorized" now. However, if you need a quick fix, simply matching the …
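For illustration, a hedged sketch of such matching with made-up variant names (not RisingWave's actual error types), skipping propagated failures when deciding which error to surface:

```rust
// Illustrative error kinds only; the real streaming error enum differs.
enum StreamErrorKind {
    ExchangeChannelClosed,
    ActorExited,
    Executor(String),
    Uncategorized(String),
}

fn is_root_cause(kind: &StreamErrorKind) -> bool {
    // "Channel closed" / "actor exited" are usually symptoms, not causes.
    !matches!(
        kind,
        StreamErrorKind::ExchangeChannelClosed | StreamErrorKind::ActorExited
    )
}
```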
I think we can handle it incrementally; the approach you took LGTM. We can do further refactoring later to handle the cases wrapped by … We need to refactor these into error types / meta error variants then, so we can match on them. Do we have tests which reproduce these cases?
Sorry, I didn't quite draw the link between:
and
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
Updates:
Currently, sink errors and recovery errors only exist in logs. We have seen several times that errors thrown in an external sink (unrelated to RW) have caused the RW cluster to enter a recovery loop. The current debugging workflow is:
For example, we once encountered a case where a user's JDBC sink was sinking a row that violated a foreign key constraint. This error cannot be caught by sink validation, and RW itself cannot recover from it since it is data-related. Therefore, it is better to let the user be aware of this error directly instead of requiring us to dig through the logs.
Some thoughts: