-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce support for Tablets #937
Conversation
See the following report for details: cargo semver-checks output
|
[Based on discussion from Slack] We concluded that "tablet_allocator_shuffle" is great for manual testing, but for automatic testing there's not enough control to wait for it to take effect. Also, in practice it takes a long time (10s+) for tablet shuffle to actually happen. |
I think it's enough. |
I'd prefer the changes introduced by Wojciech's commits to be logically organized into commits. If the current organization of the code in the commits is bad (e.g. they weren't cleaned up fully and don't compile / pass tests / are illogical) then please put them into shape. If you want to preserve the credit then personally I think it is fine to add the original author as a co-author in the description.
Do you know why this happens? Won't such a query be sent to a random node/shard every time until it gets information about the partition's tablet?
Yes, the rust driver repo doesn't have a good way to do tests with schema changes. I agree it would be nice to have it, but it is out of scope here and not a trivial effort.
Sending to the wrong node/shard is not an error. Warning sounds OK to me.
I think it is possible, for example if a node is added and some tablets are quickly moved to the new node before the driver refreshes information about topology. In such case it doesn't sound bad to ignore the replica (or maybe even send to a wrong node if we choose the unknown replica) until we refresh the metadata. |
scylla/src/transport/iterator.rs
Outdated
|
||
let replicas: Option<smallvec::SmallVec<[_; 8]>> = | ||
if let (Some(keyspace), Some(token)) = | ||
(statement_info.keyspace.as_ref(), statement_info.token) | ||
{ | ||
Some( | ||
config | ||
.cluster_data | ||
.get_token_endpoints_iter(keyspace, token) | ||
.map(|(node, shard)| (node.clone(), shard)) | ||
.collect(), | ||
) | ||
} else { | ||
None | ||
}; | ||
let replicas: Option<smallvec::SmallVec<[_; 8]>> = if let (Some(kstable), Some(token)) = | ||
(statement_info.table.as_ref(), statement_info.token) | ||
{ | ||
Some( | ||
config | ||
.cluster_data | ||
.get_token_endpoints_iter(kstable, token) | ||
.map(|(node, shard)| (node.clone(), shard)) | ||
.collect(), | ||
) | ||
} else { | ||
None | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, that shows a drawback of using an automatic code formatter - it decided to format the code differently and it's hard to see what exactly changed (keyspace
-> table
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting in a separate commit avoids this, but then we can't say that each commit passes static checks :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One solution for that could be adding a #[rustfmt(skip)]
line temporarily, and dropping it in the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for this one:
...not super happy, but I see why you decided to do it. I think that the execute and execute_iter deserve some refactoring and simplification. For now, I'm good with putting the sender into Conection. |
Reworked commit structure of Wojciechs commits and posted them as separate PR: #944 |
cargo semver-checks output
|
cargo semver-checks output
|
861724c
to
753d498
Compare
cargo semver-checks output
|
cargo semver-checks output
|
cargo semver-checks output
|
cargo semver-checks output
|
cargo semver-checks output
|
cargo semver-checks output
|
cargo semver-checks output
|
62fac8e
to
c5c45f6
Compare
SCT run didn't pass, but failure is most likely unrelated to this PR (according to @fruch it's caused by scylladb/scylla-cluster-tests#7320 ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive work! Well done!
just a general comment, I would recommend putting links to jenkins or Argus when you are referring specific SCT runs |
Sure, here's a run I'm talking about: https://jenkins.scylladb.com/job/scylla-staging/job/karol_baryla/job/longevity-100gb-4h-cql-stress-test-tablets/6/ |
This module contains structs that hold tablet info. Main operations include: - searching for a tablet for given table + token - adding new tablets - parsing tablet from custom payload feedback
tokio version is bumped to 1.34 because this code needs `recv_many` method on channel, which was introduced in 1.34.
This enables shard awareness for tablet tables
Some tests break when running on Scylla with tablet support, mostly for 2 reasons: - CDC is not yet supported for tablets - Token awareness works differently For affected tests disable tablet awareness for the time being.
One last small change: added a debug log in |
Fixed flaky integration test. Now the PR is finally ready to merge! |
This variable was unused, probably some leftover. It's interesting that we didn't get any warnings about it not being used (no reads, it was only written).
There are some situations where tablet information can become stale. This can have various consequences: - it would prevent old `Node` objects from being destroyed and make them continue to being used in LBP. - some of the replicas may be ignored indefinitely by the driver, increasing load on the rest. To prevent this we need to update tablet info during topology change with newly fetched data about cluster.
In the previous version of tablet PR tablets were put behind "unstable-tablets" feature. In order to test them the feature had to be enabled, but CI / Makefile didn't do this for cargo check or clippy. While the tablets are no longer hidden behind feature flag, I think checking with all features is still a valid addition, so I decided to leave this commit as-is.
The code is now used, so it's no longer necessary to prevent warnings.
Rewrote the whole comment mentioned in the review from scratch - so I marked the review comments regarding it as solved, as they are no longer relevant. |
This PR introduces support for tablets. The design as a bit different than in the other drivers.
For information about what Tablets are see:
https://www.scylladb.com/2023/07/10/why-scylladb-is-moving-to-a-new-replication-algorithm-tablets/
Driver context
To give a bit of context of how things are working in Rust Driver now:
Session
has acluster: Cluster
field which in turn hasdata: Arc<ArcSwap<ClusterData>>
.ClusterData
contains all the schema and topology information. It also containsReplicaLocator
, which uses token ring to return a list of replicas for a given token and replication strategy.Driver can access
ClusterData
cheaply, without the need for any locks. It is used for performing queries and is also exposed to the user: https://docs.rs/scylla/latest/scylla/transport/session/struct.Session.html#method.get_cluster_dataThis data is updated by
ClusterWorker
, running in a separate Tokio task. This worker handles server events,use_keyspace
requests, manages control connection. When necessary (and also periodically) it re-fetches topology and schemainformation, creates new
ClusterData
object and then swaps it atomically (for more information see https://docs.rs/arc-swap/latest/arc_swap/ ) with the old one - this way new queries will use newClusterData
without disrupting old ones.How does the driver interact with Tablets?
Information about Tablets are stored in
system.tablets
. Driver could fetch this information to have it always complete and up to date, but that could be overwhelming for the cluster.This is because there is just one token ring - token ownership is tied to a node, but in tablets this information is per-table (my intuition about tablets is to see them as per-table token ring that can by dynamically changed). Also there may be more tablets than token ring elements.
Now imagine a scenario where a lot of clients try to reconnect to the cluster, which has a lot of tables. The amount of transferred data would be too big.
For this reason driver fetches tablets information lazily. When the driver send a statement to a non-replica node, it will receive tablet information in the custom payload of the response.
This information contains data about the tablet that the statement belongs to: first and last tokens of the tablet + list of tablet replicas (in the form
list<(uuid, shard)>
).Driver can then update it's internal tablet information using this feedback, so that the following statements will be sent to a correct replica.
Implementation
Information about tablets is stored in
ReplicaLocator
(which is insideClusterData
), similarly to token ring data.When a connection receives tablet feedback it sends this feedback trough a channel. Receiving end of this channel
is owned by
ClusterWorker
. When it receives something on this channel it clonesClusterData
(because there is no other way to update it without introducing locks),updates tablet info inside and stores it in
ArcSwap
- just like when topology is updated. As an optimisation it tries to retrieve several tablets from the channel at once,so that when there is a lot of updates we don't perform clone + swap for each of them.
This way queries can be executed as before, without any locks and other costly operations, just accessing
ClusterData
which is mostly a simple data structure.Maintanance
Implementation chosen by Scylla makes things a bit more complicated for the driver. We only update tablet info when we send a query to a wrong node.
This means that the information can sometimes become stale
Node
object is recreated (this happens when node changes Ip / DC / Rack) tablet info would still contain old objects.To handle those scenarios driver fixes tablet information after fetching new topology. As far as I'm aware other driver don't handle those scenarios, we may need to fix it.
One scenario which I'm not sure how to handle (and if it's even problematic): what if user changes RF of a keyspace? How does it work with tablets?
If it just adds a replica to replica list, then we will not use this replica - we'll keep sending requests to other replicas and never receive new tablet info.
I'm not sure how we could detect it and if it's even a problem.
Alternative solution
One alternative solution would be to use
RwLock
- take a read lock inReplicaLocator
to get list of replicas, take a write lock when handling custom payload tablet feedback to update info.I decided against it because:
ClusterData
(pretty much all of it exceptNode
objects) is a simple data, without any locks, which makes it simpler to reason about it. I didn't want to change this.Downsides of implemented solution
ClusterData
to update tablet info. This won't increase max memory usage - because we already copy it when updating topology - but now we'll copy it more often. Implementing Move non-critical metadata out of Session #595 will largely mitigate this issue because we won't need to copy schema information, just topology.ClusterData
during which we'll keep sending data to wrong nodes. I don't think it's much of a problem - the same thing would happen with any implementation if you send a few queries to the same parttition at once. The worst thing that happens is that some more queries will be sent to wrong nodes and we'll receive few more feedbacks.LWT
One note about LWT: with token ring the LWT optimisation (sending LWT prepared statements to the same node from any clients, to avoid conflicts) was working always. Now it requires having tablet info for a given partition - so a first LWT to a given partition will not use the optimisation.
TODO:
Benchmark possible tablets implementations: vector vs mapToken
struct.Past questions / considerations
During implementation I had some dillemas questions, mostly solved / answered now, which are outlined below.
Tests with removed / decommisioned nodes
I'd like to test such scenarios, as suggested by @avelanarius , but I think we don't have a good way to do this now.
There is proxy, but afaik it can only simulate network errors.
It would be nice to use ccm (or something like this) in tests and have Rust API for it to manipulate cluster more easily.
Need to figure out a way to test it now.
Status: I'll run some SCT tests using cql-stress built using this branch of driver.
Unknown replica
Is it possible to get a replica uuid in custom payload that we don't know about? What should we do in this case?
Right now I'm just ignoring this replica, but I don't like this. This will cause load to be distributed less evenly
because we skip one (or more) of replicas.
Status: Implemented tablet maintanance to guard against this and some other problems.
What about
ReplicasOrdered
?It returns replicas in ring order, which makes little sense in Tablets tables.
I'm thinking of just editing a comment to say that the order is just consistent, but doesn't have to be ring order.
ReplicasOrdered
seems to only be used for LWT, so this should be fine.To do this, I need to make sure that the order is actually consistent - because it's not obvious.
Also, I'd like to have some test that uses the path where it's used - probably neeed to use LWT with Tablet-enabled table?
Status: Assumed that replicas are returned in consistent order. Added integration tests performing LWT queries.
Integration tests
I wrote an integration test for tablets (last commit) but it doesn't always work, and I'm not sure how to write a proper test.
The test first inserts some row. Each query is executed multiple times and the test checks that at least one execution
was sent to wrong node and recevied tablet feedback. This is not always the case and I'm not sure how to force it.
@avelanarius suggested that there is
"tablet_allocator_shuffle"
HTTP api, but the comment in Scylla tests says# Increases the chance of tablet migration concurrent with schema change
which doesn't sound helpful here.In Python driver implementation the test looks into tracing to verify that the query was sent to a correct node.
I'm not a big fan of this approach, as it doesn't verify that we can send things to the wrong nodes when we don't have the data.
Status: Managed to write correct integration tests using one-shot LBPs to make sure I send queries to all nodes/shards.
Wojciech's commits
What should I do with Wojciech's commits?
I'm considering squashing them (and introducing some changes to them so that resulting commit passes CI).
The changes are mostly mechanical and it would allow us to have atomic commits in this PR.
Status: Opened #944 with his changes cleaned up
Tablet feedback handled in Connection, not in Session
Tablet info is extracted from frame in Connection instead of Session + iterators.
I did look into moving that into session, but decided that it's too complicated and error-prone.
In the current implementation I can have certainty that we handle this for all requests.
Status: it will stay that way for now
FromCqlVal::from_cql
is tablets.rsIs it possible for this to fail? I don't see how, but I'm not very familiar with this are of code.
If it can't fail, I could avoid introducing new error type.
Status: I added
.unwrap()
, it seems that this call can't fail.Parsing fail
What to do when parsing Tablet data from custom payload fails?
Erroring whole request seems excessive. Right now I just print a warning. Is it enough?
Status: Printing warning is enough, so I'll stick to that.
Fixes: #867
Pre-review checklist
./docs/source/
.Fixes:
annotations to PR description.