feat: re-merge #14887 (inject and collect barrier in bidi stream) #15653
Conversation
// It may happen that the DNS information of a newly registered worker node
// has not been propagated to the meta node, causing an error. Wait for a while and retry.
Prior to the bidi stream refactoring, we didn't see this issue in our tests. Is it because, previously, the creation of the RPC connection was delayed until the first barrier injection (since it was a unary RPC), whereas after the refactoring the RPC connection is created earlier, when the worker node is added?
Yes, I think so. With this PR, the meta node connects to the CN earlier than before.
    }
}
error!(?node_host, "fail to create worker node after retry");
Should we retry indefinitely, mark the node as inactive, or panic, instead of just printing an error log, if we cannot create the streaming RPC?
With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery. The recovery process will continue until the InjectBarrier RPC can succeed, which is equivalent to retrying indefinitely.
If the stream cannot be created, it will not exist in the node map, and injecting a barrier will then return an error because the worker node to be injected cannot be found. This is in fact the cause of the previous failure to create a table.
Not adding the worker node to the node map is effectively marking the node as inactive.
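A hypothetical sketch of the behavior described here (the type and method names are illustrative, not RisingWave's actual API): a worker whose control stream was never established is simply absent from the node map, so barrier injection fails with a lookup error.

```rust
use std::collections::HashMap;

// Placeholder for the real bidi control stream handle.
struct ControlStreamHandle;

// Illustrative manager: a worker is only present in `nodes` if its
// control stream was successfully created.
struct ControlStreamManager {
    nodes: HashMap<u32, ControlStreamHandle>,
}

impl ControlStreamManager {
    fn inject_barrier(&mut self, worker_id: u32) -> Result<(), String> {
        match self.nodes.get_mut(&worker_id) {
            // Stream exists: send the InjectBarrier request over it.
            Some(_handle) => Ok(()),
            // Stream creation failed earlier, so the worker was never added;
            // the node is effectively "inactive", injection errors out, and
            // the error in turn triggers recovery.
            None => Err(format!("worker {worker_id} not found in node map")),
        }
    }
}
```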
> With the previous behavior using unary RPC, if meta cannot connect to a worker node, I think the InjectBarrier RPC will fail, leading to recovery.

Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created or an actor rebalance happens on the unreachable worker node. But still, with this PR, if the initial `add_worker` attempt fails, there won't be a future retry even when new actors are assigned to the worker node.
> Correction: the InjectBarrier RPC will not be sent to a worker node without any actors, so recovery and retry will only happen when new actors are created or an actor rebalance happens on the unreachable worker node. But still, with this PR, if the initial `add_worker` attempt fails, there won't be a future retry even when new actors are assigned to the worker node.

If a CN does not have any actors on it, failure to connect to it should not trigger recovery. When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error when the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.
Since this is a blocking async method, endless retrying in the method may block the usual handling of events.
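A minimal, hypothetical sketch of the split described here, with illustrative names rather than RisingWave's actual types: the retry lives in the recovery loop, not in the blocking event-handling method, so normal event handling stays responsive.

```rust
// Placeholder for the component that owns the control streams.
struct BarrierManager;

impl BarrierManager {
    // Re-create the control streams to all registered workers.
    async fn reset(&mut self) -> Result<(), String> {
        // ... re-establish the streaming RPCs here ...
        Ok(())
    }

    // Recovery keeps retrying `reset` until the streams are back,
    // instead of retrying endlessly inside the event handler.
    async fn recovery(&mut self) {
        loop {
            match self.reset().await {
                Ok(()) => break, // streams re-established; injection resumes
                Err(e) => eprintln!("recovery failed, retrying: {e}"),
            }
        }
    }
}
```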
> When new actors are assigned to the worker node, there will be an inject-barrier request to the CN; it will return an error when the CN cannot be found in the node map, which triggers recovery, and the retry is then handled in the recovery loop.

IIUC, if the retry attempts are exhausted in the initial `add_worker` call, the node map won't have an entry for the worker node. In this case, when new actors are assigned to the worker node and the barrier injection fails, will recovery trigger `add_worker` again and try to re-establish the streaming RPC?
Oh, my bad: recovery will trigger `reset`, which will try to re-establish the streaming RPC. Then why did the tests fail prior to this PR? Does it mean that the failing tests expect no recovery to be triggered at all, even though things can be resolved after several rounds of recovery?
> Then why did the tests fail prior to this PR?

In the longevity test, there is an initial SQL statement to create a table. The CREATE TABLE statement fails without retry, and then all subsequent create-MV statements fail.
Reverts #15652 and re-merges #14887.
The sysbench run that failed previously succeeds with the fix in this PR: https://buildkite.com/risingwave-test/sysbench/builds/707
The previous failure happened when the meta node tried to connect to a newly registered CN worker. It seems that there is latency in DNS information propagation in the k8s cluster. As a result, the meta node fails to resolve the DNS information of the newly registered CN worker, which may have only just booted, and the initial command then fails.
The fix in this PR is to retry when the creation of a new control stream with the newly registered CN worker fails.
Most of the LOC in this PR come from the previously merged PR; the fix can be reviewed in commit 8e52b83.
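A minimal sketch of the retry approach described above, assuming a generic async connection attempt; the helper name, signature, and backoff policy are illustrative, not the PR's actual code.

```rust
use std::future::Future;
use std::time::Duration;

// Retry an async connection attempt a bounded number of times, sleeping
// between attempts so that slow DNS propagation has time to catch up.
async fn connect_with_retry<F, Fut, T, E>(
    mut attempt: F,
    max_retries: usize,
    backoff: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_retries {
        match attempt().await {
            Ok(stream) => return Ok(stream),
            Err(e) => {
                // The DNS info of a freshly registered worker may not have
                // propagated to the meta node yet; wait a while and retry.
                last_err = Some(e);
                tokio::time::sleep(backoff).await;
            }
        }
    }
    // All attempts failed; surface the last error to the caller, which can
    // then log it (as in the `error!` above) or mark the node as inactive.
    Err(last_err.expect("max_retries must be greater than zero"))
}
```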