
bug: remote channel doesn't stop as expected #14626

Closed
zwang28 opened this issue Jan 17, 2024 · 1 comment
zwang28 (Contributor) commented Jan 17, 2024

Describe the bug

In short, for a remote channel, the end of the sender does not result in the end of the receiver.

While working on the deterministic recovery test in #13441, I encountered a counterintuitive issue:

  1. Suppose there are actor 1 and actor 2 on different compute nodes, with actor 1 upstream of actor 2, connected by a remote channel.
  2. Actor 1 exits. I expected actor 2 to exit too, because the sender of the remote channel should be dropped along with actor 1, and the receiver in actor 2 should then end as well (see the local-channel sketch after this list). But in fact actor 2 does not exit, and its receiver keeps waiting for new messages.
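
For comparison, here is a minimal sketch of the behavior I expected. This is not RisingWave code; it just uses a plain local `tokio::sync::mpsc` channel as an illustration: once the only sender is dropped, the receiver drains any buffered messages and then observes the end of the channel.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<i32>(4);
    tx.send(1).await.unwrap();

    // Dropping the only sender closes the channel...
    drop(tx);

    // ...so the receiver drains the buffered message and then sees `None`,
    // i.e. it learns that the sender side is gone.
    assert_eq!(rx.recv().await, Some(1));
    assert_eq!(rx.recv().await, None);
}
```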

Root cause

It turns out that actor 1 exiting does not always end actor 2's receiver. In the code below, actor 1's exit does end the Either::Right stream, but the Either::Left stream stays alive. The combined select_stream will not end until both input streams are exhausted, so actor 2's receiver never learns that the channel sender is gone.

```rust
// The `Right` stream ends once actor 1 exits (its channel sender is dropped),
// but the `Left` stream (`add_permits_stream`) stays alive, so the combined
// `select_stream` only terminates when BOTH inputs are exhausted.
let select_stream = futures::stream::select(
    add_permits_stream.map_ok(Either::Left),
    #[try_stream]
    async move {
        while let Some(m) = receiver.recv_raw().await {
            yield Either::Right(m);
        }
    },
);
```
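
Below is a minimal, self-contained sketch (an assumed reproduction, not RisingWave code) of the `futures::stream::select` semantics at play: one input stream never ends, the other ends immediately, and the combined stream yields the finite items but never terminates on its own.

```rust
use std::time::Duration;

use futures::stream::{self, StreamExt};
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // Never-ending stream, standing in for the stream that stays alive.
    let left = stream::pending::<i32>();
    // Finite stream, standing in for the message stream that ends once the
    // channel sender (actor 1) is dropped.
    let right = stream::iter([1, 2, 3]);

    let mut combined = stream::select(left, right);

    // Items from the finished stream are still yielded...
    while let Ok(Some(v)) = timeout(Duration::from_millis(100), combined.next()).await {
        println!("got {v}");
    }
    // ...but `next()` never resolves to `None`: `select` only terminates once
    // BOTH inputs are exhausted, so we only get here via the timeout.
    println!("combined stream never ended on its own");
}
```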

Impact

The failed deterministic recovery test in #13441 has 3 compute nodes, one of which is killed, and the in-flight barrier collection gets stuck unexpectedly. That PR requires the meta node to wait for all compute nodes to respond when collecting a barrier, rather than any one of them as it does currently. The issue described above causes a compute node to get stuck during barrier collection: the error from actor 1 never propagates to actor 2, and neither does any further data.

I'm not sure if this issue should be treated as a bug. What do you think? @BugenZhao

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

zwang28 added the type/bug label on Jan 17, 2024
github-actions bot added this to the release-1.7 milestone on Jan 17, 2024
BugenZhao (Member) commented

Great investigation!

> Because that PR requires the meta node to wait for all compute nodes to respond when collecting a barrier

This reminds me of #10848 (comment). We already felt at that time that the assumption "the error of an exiting actor is naturally propagated to its downstream and upstream" sounded fragile. 🫨

zwang28 closed this as not planned on Feb 19, 2024