
feat: re-merge #14887 (inject and collect barrier in bidi stream) #15653

Merged 3 commits into main on Mar 13, 2024

Conversation

@wenym1 (Contributor) commented on Mar 13, 2024

Reverts #15652 and re-merges #14887.

The sysbench run that previously failed succeeds with the fix in this PR: https://buildkite.com/risingwave-test/sysbench/builds/707

The previous failure occurred when the meta node tried to connect to a newly registered CN worker. It seems there is latency in DNS propagation within the k8s cluster, so the meta node fails to resolve the DNS information of a newly registered CN worker that has just booted, and the initial command then fails.

The fix in this PR is to retry when creating a new control stream with the newly registered CN worker fails.

Most LOC in this PR comes from the previously merged PR. The fix can be reviewed in commit 8e52b83.
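
A minimal sketch of the retry idea, assuming tokio and a hypothetical `connect_with_retry` helper; the names, delay, and retry count are illustrative, not the actual code in this PR:

```rust
use std::future::Future;
use std::time::Duration;

// Sketch only: retry creating the control stream a bounded number of times,
// sleeping between attempts so k8s DNS propagation has time to catch up.
async fn connect_with_retry<F, Fut, T, E>(mut connect: F, max_retries: usize) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match connect().await {
            Ok(stream) => return Ok(stream),
            // DNS info of the newly registered worker may not have propagated
            // yet; wait a bit and try again.
            Err(_) if attempt < max_retries => {
                attempt += 1;
                tokio::time::sleep(Duration::from_millis(500)).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```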

@wenym1 marked this pull request as ready for review on March 13, 2024, 08:59
Comment on lines +131 to +132
// It may happen that the dns information of newly registered worker node
// has not been propagated to the meta node and cause error. Wait for a while and retry
Collaborator:

Prior to the bidi stream refactoring, we didn't see this issue in our tests. Is that because the creation of the RPC connection was previously delayed until the first barrier injection (since it was a unary RPC), whereas after the refactoring the RPC connection is created earlier, when the worker node is added?

Contributor (author):

Yes, I think so. With this PR, the meta node connects to the CN earlier than before.

}
}
error!(?node_host, "fail to create worker node after retry");
Collaborator:

If we cannot create the streaming RPC, should we retry indefinitely, mark the node as inactive, or panic, instead of just printing an error log?

With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery. The recovery process will continue until the InjectBarrier RPC succeeds, which is equivalent to retrying indefinitely.

Contributor (author):

If the stream cannot be created, it will not exist in the node map, and injecting a barrier will then return an error because the worker node to be injected into cannot be found. This is actually the cause of the previous failure to create the table.

Not adding the worker node to the node map is effectively like marking the node as inactive.
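
A rough sketch of the node-map lookup being described; the types and names here are hypothetical stand-ins, not the actual RisingWave structures:

```rust
use std::collections::HashMap;

type WorkerId = u32;

// Stand-in for the per-worker bidi control stream handle.
struct ControlStreamHandle;

// If the initial control stream could not be created, the worker never enters
// the map, so barrier injection fails here and the error propagates upward,
// eventually triggering recovery.
fn inject_barrier(
    streams: &HashMap<WorkerId, ControlStreamHandle>,
    worker_id: WorkerId,
) -> Result<&ControlStreamHandle, String> {
    streams
        .get(&worker_id)
        .ok_or_else(|| format!("worker node {worker_id} not found in node map"))
}
```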

Collaborator:

> With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery.

Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created on, or actors are rebalanced to, the unreachable worker node. Still, with this PR, once the initial add_worker attempt fails, there won't be any future retry even when new actors are assigned to the worker node.

Contributor (author):

> With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery.
>
> Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created on, or actors are rebalanced to, the unreachable worker node. Still, with this PR, once the initial add_worker attempt fails, there won't be any future retry even when new actors are assigned to the worker node.

If a CN does not have any actors on it, failure to connect to it should not trigger recovery. When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error because the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.

Since this is a blocking async method, retrying endlessly inside it may block the usual handling of events.

Collaborator:

> When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error because the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.

IIUC, if the retry attempts are exhausted in the initial add_worker call, the node map won't have an entry for the worker node. In this case, when new actors are assigned to the worker node and the barrier injection fails, will recovery trigger add_worker again and try to re-establish the streaming RPC?

Collaborator:

Oh, my bad. Recovery will trigger reset, which will try to re-establish the streaming RPC. Then why did the tests fail prior to this PR? Does it mean that the failing tests expect no recovery to be triggered at all, even though things can be resolved after several rounds of recovery?

Contributor (author):

> Then why did the tests fail prior to this PR?

In the longevity test, there is an initial SQL statement to create a table. The create-table SQL fails without retry, and then all subsequent create-MV SQL statements fail as well.

@wenym1 added this pull request to the merge queue Mar 13, 2024
Merged via the queue into main with commit e46a307 Mar 13, 2024
38 checks passed
@wenym1 deleted the revert-15652-revert-14887 branch March 13, 2024 12:54