In short, for a remote channel, end of sender doesn't result in end of receiver.
During the deterministic recovery test in this PR, I've encountered a counterintuitive issue:
Let's say actor 1 and actor 2 are on different compute nodes, with actor 1 upstream of actor 2, connected by a remote channel.
Actor 1 exits. I expected actor 2 to exit too, because the sender of the remote channel should be dropped along with actor 1, and the receiver in actor 2 should then end as well. In fact, actor 2 doesn't exit, and its receiver keeps waiting for new messages.
Root cause
It turns out that the exit of actor 1 doesn't always end actor 2's receiver. In the code below, the exit of actor 1 does end the Either::Right stream, but the Either::Left stream is still available. The combined select_stream won't end until both streams are exhausted, so actor 2's receiver is unaware that the channel sender has ended.
    let select_stream = futures::stream::select(
        add_permits_stream.map_ok(Either::Left),
        #[try_stream]
        async move {
            while let Some(m) = receiver.recv_raw().await {
                yield Either::Right(m);
            }
        },
    );
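To make the behavior concrete, here is a minimal std-only model of a two-way merge. It is a sketch under stated assumptions, not the `futures` crate's implementation and not RisingWave code: `Source`, `Finite`, `Idle`, `Merge`, `end_with_right`, and `drive` are all hypothetical names. `Finite` stands in for the message side that ends once the sender is dropped, and `Idle` stands in for a permits side that stays open without producing anything.

```rust
use std::collections::VecDeque;
use std::task::Poll;

/// Toy stand-in for a stream (hypothetical, not the `futures` trait):
/// a poll yields an item, `Pending` (nothing yet), or `Ready(None)` (finished).
trait Source {
    type Item;
    fn poll(&mut self) -> Poll<Option<Self::Item>>;
}

/// Yields its queued items, then finishes. Models the message side
/// after the upstream sender has been dropped.
struct Finite(VecDeque<&'static str>);

impl Source for Finite {
    type Item = &'static str;
    fn poll(&mut self) -> Poll<Option<&'static str>> {
        Poll::Ready(self.0.pop_front())
    }
}

/// Never yields and never finishes. Models a side that stays open but idle.
struct Idle;

impl Source for Idle {
    type Item = &'static str;
    fn poll(&mut self) -> Poll<Option<&'static str>> {
        Poll::Pending
    }
}

/// Two-way merge. With `end_with_right == false` it finishes only when
/// BOTH sides have finished; with `true` it finishes as soon as the
/// right side does.
struct Merge<L, R> {
    left: L,
    right: R,
    left_done: bool,
    right_done: bool,
    end_with_right: bool,
}

impl<L: Source, R: Source<Item = L::Item>> Source for Merge<L, R> {
    type Item = L::Item;
    fn poll(&mut self) -> Poll<Option<L::Item>> {
        if !self.right_done {
            match self.right.poll() {
                Poll::Ready(Some(x)) => return Poll::Ready(Some(x)),
                Poll::Ready(None) => self.right_done = true,
                Poll::Pending => {}
            }
        }
        if self.right_done && self.end_with_right {
            return Poll::Ready(None);
        }
        if !self.left_done {
            match self.left.poll() {
                Poll::Ready(Some(x)) => return Poll::Ready(Some(x)),
                Poll::Ready(None) => self.left_done = true,
                Poll::Pending => {}
            }
        }
        if self.left_done && self.right_done {
            Poll::Ready(None)
        } else {
            Poll::Pending
        }
    }
}

/// Polls the merge until it finishes or has nothing more to offer.
/// Returns the items seen and whether the merged source ended.
fn drive(end_with_right: bool) -> (Vec<&'static str>, bool) {
    let mut merged = Merge {
        left: Idle,
        right: Finite(VecDeque::from(vec!["chunk", "barrier"])),
        left_done: false,
        right_done: false,
        end_with_right,
    };
    let mut seen = Vec::new();
    loop {
        match merged.poll() {
            Poll::Ready(Some(m)) => seen.push(m),
            Poll::Ready(None) => return (seen, true), // merged source ended
            Poll::Pending => return (seen, false),    // still open: consumer keeps waiting
        }
    }
}
```

In this model, `drive(false)` delivers both messages and then reports the merged source as still open, which mirrors the stuck receiver described above. `drive(true)` delivers the same messages and then ends. Whether ending the combined stream when the message side ends is the right behavior for the real remote channel is exactly the open question of this issue.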
Impact
The failed deterministic recovery test in #13441 has 3 compute nodes, 1 of which is killed, and the current barrier collection gets stuck unexpectedly. This is because that PR requires the meta node to wait for all compute nodes to respond when collecting a barrier, rather than for any one of them as it does currently. The aforementioned issue causes a compute node to get stuck during barrier collection: the error from actor 1 never propagates to actor 2, and neither does any data.
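The difference between the two collection policies can be sketched as follows. This is a simplified model, not the meta node's actual code: `Outcome`, `collect_fail_fast`, `collect_wait_all`, and `three_node_scenario` are hypothetical names, and it assumes that "any one of them" means a single error response is enough to end collection, and that the meta node observes an error for the killed node.

```rust
/// Result of one barrier collection attempt in this toy model.
#[derive(Debug, PartialEq)]
enum Outcome {
    Collected,
    Failed(String),
    StillWaiting,
}

/// One entry per compute node: `None` means the node has not responded yet.
type Responses = Vec<Option<Result<(), String>>>;

/// Assumed reading of the current behavior: the first error seen from
/// any node ends collection, even if other nodes have not responded.
fn collect_fail_fast(responses: &[Option<Result<(), String>>]) -> Outcome {
    for r in responses {
        if let Some(Err(e)) = r {
            return Outcome::Failed(e.clone());
        }
    }
    if responses.iter().all(|r| r.is_some()) {
        Outcome::Collected
    } else {
        Outcome::StillWaiting
    }
}

/// Assumed reading of the behavior in the PR: every node must respond
/// before collection can succeed or fail.
fn collect_wait_all(responses: &[Option<Result<(), String>>]) -> Outcome {
    if responses.iter().any(|r| r.is_none()) {
        return Outcome::StillWaiting;
    }
    match responses.iter().flatten().find_map(|r| r.as_ref().err()) {
        Some(e) => Outcome::Failed(e.clone()),
        None => Outcome::Collected,
    }
}

/// Three compute nodes: one killed (error), one hosting the stuck
/// downstream actor (no response), one healthy.
fn three_node_scenario() -> Responses {
    vec![Some(Err("actor 1 exited".to_string())), None, Some(Ok(()))]
}
```

With `three_node_scenario()`, `collect_fail_fast` returns `Failed` immediately, while `collect_wait_all` stays at `StillWaiting` for as long as the node hosting actor 2 never responds. That is the stuck collection described above.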
I'm not sure if this issue should be treated as a bug. What do you think? @BugenZhao
> Because that PR requires meta node to wait all compute nodes to respond when collecting barrier
This reminds me of #10848 (comment). Even at that time, we felt that the assumption that "the error of actor exiting should be naturally propagated to its downstream and upstream" sounded fragile. 🫨