
feat: re-merge #14887 (inject and collect barrier in bidi stream) #15653

Merged 3 commits into main on Mar 13, 2024

Conversation

@wenym1 (Contributor) commented on Mar 13, 2024

Reverts #15652 and re-merges #14887.

The sysbench run that previously failed succeeds with the fix in this PR: https://buildkite.com/risingwave-test/sysbench/builds/707

The previous failure occurred when the meta node tried to connect to a newly registered CN worker. It seems there is latency in DNS propagation within the k8s cluster, so the meta node fails to resolve the DNS information of a newly registered CN worker that has just booted, and the initial command then fails.

The fix in this PR is to retry when creating a new control stream with the newly registered CN worker fails.

Most LOC in this PR comes from the previously merged PR. The fix can be reviewed in commit 8e52b83.
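
A minimal sketch of the retry idea, assuming tokio and a hypothetical `connect_with_retry` helper; the names, delay, and retry count are illustrative, not the actual code in this PR:

```rust
use std::future::Future;
use std::time::Duration;

// Sketch only: retry creating the control stream a bounded number of times,
// sleeping between attempts so k8s DNS propagation has time to catch up.
async fn connect_with_retry<F, Fut, T, E>(mut connect: F, max_retries: usize) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match connect().await {
            Ok(stream) => return Ok(stream),
            // DNS info of the newly registered worker may not have propagated
            // yet; wait a bit and try again.
            Err(_) if attempt < max_retries => {
                attempt += 1;
                tokio::time::sleep(Duration::from_millis(500)).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```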

@wenym1 marked this pull request as ready for review on March 13, 2024, 08:59
Comment on lines +131 to +132
// It may happen that the dns information of newly registered worker node
// has not been propagated to the meta node and cause error. Wait for a while and retry
Collaborator:

Prior to the bidi stream refactoring, we didn't see this issue in our tests. Is that because the creation of the RPC connection was previously delayed until the first barrier injection (since it was a unary RPC), whereas after the refactoring the RPC connection is created earlier, when the worker node is added?

Contributor (author):

Yes, I think so. With this PR, the meta node connects to the CN earlier than before.

}
}
error!(?node_host, "fail to create worker node after retry");
Collaborator:

If we cannot create the streaming RPC, should we retry indefinitely, mark the node as inactive, or panic, instead of just printing an error log?

With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery. The recovery process will continue until the InjectBarrier RPC succeeds, which is equivalent to retrying indefinitely.

Contributor (author):

If the stream cannot be created, it will not exist in the node map, and injecting a barrier will then return an error because the worker node to be injected into cannot be found. This is actually the cause of the previous failure to create the table.

Not adding the worker node to the node map is effectively like marking the node as inactive.
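
A rough sketch of the node-map lookup being described; the types and names here are hypothetical stand-ins, not the actual RisingWave structures:

```rust
use std::collections::HashMap;

type WorkerId = u32;

// Stand-in for the per-worker bidi control stream handle.
struct ControlStreamHandle;

// If the initial control stream could not be created, the worker never enters
// the map, so barrier injection fails here and the error propagates upward,
// eventually triggering recovery.
fn inject_barrier(
    streams: &HashMap<WorkerId, ControlStreamHandle>,
    worker_id: WorkerId,
) -> Result<&ControlStreamHandle, String> {
    streams
        .get(&worker_id)
        .ok_or_else(|| format!("worker node {worker_id} not found in node map"))
}
```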

Collaborator:

> With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery.

Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created on, or actors are rebalanced to, the unreachable worker node. Still, with this PR, once the initial add_worker attempt fails, there won't be any future retry even when new actors are assigned to the worker node.

Contributor (author):

> With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery.
>
> Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created on, or actors are rebalanced to, the unreachable worker node. Still, with this PR, once the initial add_worker attempt fails, there won't be any future retry even when new actors are assigned to the worker node.

If a CN does not have any actors on it, failure to connect to it should not trigger recovery. When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error because the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.

Since this is a blocking async method, retrying endlessly inside it may block the usual handling of events.

Collaborator:

> When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error because the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.

IIUC, if the retry attempts are exhausted in the initial add_worker call, the node map won't have an entry for the worker node. In this case, when new actors are assigned to the worker node and the barrier injection fails, will recovery trigger add_worker again and try to re-establish the streaming RPC?

Collaborator:

Oh, my bad. Recovery will trigger reset, which will try to re-establish the streaming RPC. Then why did the tests fail prior to this PR? Does it mean that the failing tests expect no recovery to be triggered at all, even though things can be resolved after several rounds of recovery?

Contributor (author):

> Then why did the tests fail prior to this PR?

In the longevity test, there is an initial SQL statement to create a table. The create-table SQL fails without retry, and then all subsequent create-MV SQL statements fail as well.

@wenym1 added this pull request to the merge queue Mar 13, 2024
Merged via the queue into main with commit e46a307 Mar 13, 2024
38 checks passed
@wenym1 deleted the revert-15652-revert-14887 branch March 13, 2024 12:54