
bug: remote channel doesn't stop as expected #14626

Closed
zwang28 opened this issue Jan 17, 2024 · 1 comment
zwang28 (Contributor) commented Jan 17, 2024

Describe the bug

In short, for a remote channel, the end of the sender does not result in the end of the receiver.

While working on the deterministic recovery test in #13441, I encountered a counterintuitive issue:

  1. Suppose there are actor 1 and actor 2 on different compute nodes, with actor 1 upstream of actor 2, connected by a remote channel.
  2. Actor 1 exits. I expected actor 2 to exit too, because the sender of the remote channel should be dropped along with actor 1, and the receiver in actor 2 should then end as well (see the local-channel sketch after this list). But in fact actor 2 does not exit, and its receiver keeps waiting for new messages.
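
For comparison, here is a minimal sketch of the behavior I expected. This is not RisingWave code; it just uses a plain local `tokio::sync::mpsc` channel as an illustration: once the only sender is dropped, the receiver drains any buffered messages and then observes the end of the channel.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<i32>(4);
    tx.send(1).await.unwrap();

    // Dropping the only sender closes the channel...
    drop(tx);

    // ...so the receiver drains the buffered message and then sees `None`,
    // i.e. it learns that the sender side is gone.
    assert_eq!(rx.recv().await, Some(1));
    assert_eq!(rx.recv().await, None);
}
```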

Root cause

It turns out that actor 1 exiting does not always end actor 2's receiver. In the code below, actor 1's exit does end the Either::Right stream, but the Either::Left stream stays alive. The combined select_stream will not end until both input streams are exhausted, so actor 2's receiver never learns that the channel sender is gone.

```rust
// The `Right` stream ends once actor 1 exits (its channel sender is dropped),
// but the `Left` stream (`add_permits_stream`) stays alive, so the combined
// `select_stream` only terminates when BOTH inputs are exhausted.
let select_stream = futures::stream::select(
    add_permits_stream.map_ok(Either::Left),
    #[try_stream]
    async move {
        while let Some(m) = receiver.recv_raw().await {
            yield Either::Right(m);
        }
    },
);
```
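
Below is a minimal, self-contained sketch (an assumed reproduction, not RisingWave code) of the `futures::stream::select` semantics at play: one input stream never ends, the other ends immediately, and the combined stream yields the finite items but never terminates on its own.

```rust
use std::time::Duration;

use futures::stream::{self, StreamExt};
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // Never-ending stream, standing in for the stream that stays alive.
    let left = stream::pending::<i32>();
    // Finite stream, standing in for the message stream that ends once the
    // channel sender (actor 1) is dropped.
    let right = stream::iter([1, 2, 3]);

    let mut combined = stream::select(left, right);

    // Items from the finished stream are still yielded...
    while let Ok(Some(v)) = timeout(Duration::from_millis(100), combined.next()).await {
        println!("got {v}");
    }
    // ...but `next()` never resolves to `None`: `select` only terminates once
    // BOTH inputs are exhausted, so we only get here via the timeout.
    println!("combined stream never ended on its own");
}
```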

Impact

The failed deterministic recovery test in #13441 has 3 compute nodes, one of which is killed, and the in-flight barrier collection gets stuck unexpectedly. That PR requires the meta node to wait for all compute nodes to respond when collecting a barrier, rather than any one of them as it does currently. The issue described above causes a compute node to get stuck during barrier collection: the error from actor 1 never propagates to actor 2, and neither does any further data.

I'm not sure if this issue should be treated as a bug. What do you think? @BugenZhao

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

zwang28 added the type/bug label on Jan 17, 2024
github-actions bot added this to the release-1.7 milestone on Jan 17, 2024
BugenZhao (Member) commented

Great investigation!

> Because that PR requires the meta node to wait for all compute nodes to respond when collecting a barrier

This reminds me of #10848 (comment). We already felt at that time that the assumption "the error of an exiting actor is naturally propagated to its downstream and upstream" sounded fragile. 🫨

zwang28 closed this as not planned on Feb 19, 2024