-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: "group 0 concurrent modification" #1126
Comments
The third option could also use #1031 once it's implemented. |
I'm currently working on this issue. The way I see the LBP is this: // This LBP produces a predictable query plan - it order the nodes
// by position in the ring.
// This is to make sure that all DDL queries land on the same node,
// to prevent errors from concurrent DDL queries executed on different nodes.
#[derive(Debug)]
struct SchemaQueriesLBP {}
impl LoadBalancingPolicy for SchemaQueriesLBP {
fn pick<'a>(
&'a self,
_query: &'a scylla::load_balancing::RoutingInfo,
cluster: &'a scylla::transport::ClusterData,
) -> Option<(
scylla::transport::NodeRef<'a>,
Option<scylla::routing::Shard>,
)> {
// I'm not sure if Scylla can handle concurrent DDL queries to different shard,
// in other words if its local lock is per-node or per shard.
// Just to be safe, let's use explicit shard.
cluster.get_nodes_info().first().map(|node| (node, Some(0)))
}
fn fallback<'a>(
&'a self,
_query: &'a scylla::load_balancing::RoutingInfo,
cluster: &'a scylla::transport::ClusterData,
) -> scylla::load_balancing::FallbackPlan<'a> {
Box::new(cluster.get_nodes_info().iter().map(|node| (node, Some(0))))
}
fn name(&self) -> String {
todo!()
}
} WDYT @muzarski ? |
So from my understanding, if the query fails, the retry might be sent to the next node in the ring (based on the retry decision), correct? Is it ok this way? Maybe we need additional custom retry policy that always retries on the same node. I'm not really sure, though - it might be the case that the first node is overloaded, then it would make sense to try on another one. |
If we don't want to use next node, then it's enough to change fallback to be the same as pick. If it returns only one node then no retry policy will be able to use another one.
In both cases the version I posted above works correctly. |
Then I'm fine with the LBP you provided. It should significantly reduce the probability of this error's occurrence. If the error ever occures in the future, we can adjust the LBP to return the same node in the fallback as you mentioned (but for now, let's stick to the current version). |
This commit changes all DDL queries to use helper functions introduced in previous commit. Should fix scylladb#1126
This commit changes all DDL queries to use helper functions introduced in previous commit. Should fix scylladb#1126
This commit changes all DDL queries to use helper functions introduced in previous commit. Should fix scylladb#1126
New versions of Scylla use Raft for schema / topology operations.
Raft serializes those operations internally. Concurrent ddl operations are done similarly to optimistic locking:
they basically work like "if raft state is X, change schema that way (to state Y)".
If 2 such operations execute at the same time, then one of them will fail (because the state is no longer X).
Scylla server will restart such operations in that case, but the amount of restarts is limited to 10.
If there are many concurrent DDL queries, then its possible to hit this limit, and it happens very often in our tests,
which makes CI unstable.
There are several solutions:
Third solution should be best performance-wise because it doesn't introduce any additional locking and just lets Scylla handle this case.
The way to do this is to introduce a helper function in test utils that sends a DDL query.
This function would either:
The text was updated successfully, but these errors were encountered: