Introduce support for Tablets #937

Lorak-mmk · 2024-02-22T15:17:20Z

This PR introduces support for tablets. The design as a bit different than in the other drivers.

For information about what Tablets are see:

https://www.scylladb.com/2023/07/10/why-scylladb-is-moving-to-a-new-replication-algorithm-tablets/

Driver context

To give a bit of context of how things are working in Rust Driver now:
Session has a cluster: Cluster field which in turn has data: Arc<ArcSwap<ClusterData>>.
ClusterData contains all the schema and topology information. It also contains ReplicaLocator, which uses token ring to return a list of replicas for a given token and replication strategy.
Driver can access ClusterData cheaply, without the need for any locks. It is used for performing queries and is also exposed to the user: https://docs.rs/scylla/latest/scylla/transport/session/struct.Session.html#method.get_cluster_data
This data is updated by ClusterWorker, running in a separate Tokio task. This worker handles server events, use_keyspace requests, manages control connection. When necessary (and also periodically) it re-fetches topology and schema
information, creates new ClusterData object and then swaps it atomically (for more information see https://docs.rs/arc-swap/latest/arc_swap/ ) with the old one - this way new queries will use new ClusterData without disrupting old ones.

How does the driver interact with Tablets?

Information about Tablets are stored in system.tablets. Driver could fetch this information to have it always complete and up to date, but that could be overwhelming for the cluster.
This is because there is just one token ring - token ownership is tied to a node, but in tablets this information is per-table (my intuition about tablets is to see them as per-table token ring that can by dynamically changed). Also there may be more tablets than token ring elements.
Now imagine a scenario where a lot of clients try to reconnect to the cluster, which has a lot of tables. The amount of transferred data would be too big.
For this reason driver fetches tablets information lazily. When the driver send a statement to a non-replica node, it will receive tablet information in the custom payload of the response.
This information contains data about the tablet that the statement belongs to: first and last tokens of the tablet + list of tablet replicas (in the form list<(uuid, shard)> ).
Driver can then update it's internal tablet information using this feedback, so that the following statements will be sent to a correct replica.

Implementation

Information about tablets is stored in ReplicaLocator (which is inside ClusterData), similarly to token ring data.
When a connection receives tablet feedback it sends this feedback trough a channel. Receiving end of this channel
is owned by ClusterWorker. When it receives something on this channel it clones ClusterData (because there is no other way to update it without introducing locks),
updates tablet info inside and stores it in ArcSwap - just like when topology is updated. As an optimisation it tries to retrieve several tablets from the channel at once,
so that when there is a lot of updates we don't perform clone + swap for each of them.

This way queries can be executed as before, without any locks and other costly operations, just accessing ClusterData which is mostly a simple data structure.

Maintanance

Implementation chosen by Scylla makes things a bit more complicated for the driver. We only update tablet info when we send a query to a wrong node.
This means that the information can sometimes become stale

When a node is removed we would just keep sending queries to other replicas, never receiving new tablet.
When a keyspace / table is dropped we would unnecessarily keep info about it.
(may be Rust Driver only) When driver is not aware of one of the replicas yet (e.g. because the node is new and driver didn't yet fetch topology) and we skip this replica we'll increase load on other replicas indefinitely.
(may be Rust Driver only) When a Node object is recreated (this happens when node changes Ip / DC / Rack) tablet info would still contain old objects.

To handle those scenarios driver fixes tablet information after fetching new topology. As far as I'm aware other driver don't handle those scenarios, we may need to fix it.
One scenario which I'm not sure how to handle (and if it's even problematic): what if user changes RF of a keyspace? How does it work with tablets?
If it just adds a replica to replica list, then we will not use this replica - we'll keep sending requests to other replicas and never receive new tablet info.
I'm not sure how we could detect it and if it's even a problem.

Alternative solution

One alternative solution would be to use RwLock - take a read lock in ReplicaLocator to get list of replicas, take a write lock when handling custom payload tablet feedback to update info.
I decided against it because:

It's possible we would hold a write lock often, slowing down queries. I also don't like the idea of taking a lock in a hot path (executing a query) in general.
Most of ClusterData (pretty much all of it except Node objects) is a simple data, without any locks, which makes it simpler to reason about it. I didn't want to change this.
It makes implementing tablet info maintenance much harder. Doing maintenance would require taking a write lock for a lot of time, hindering performance.

Downsides of implemented solution

We need to copy ClusterData to update tablet info. This won't increase max memory usage - because we already copy it when updating topology - but now we'll copy it more often. Implementing Move non-critical metadata out of Session #595 will largely mitigate this issue because we won't need to copy schema information, just topology.
Received tablet info is not immediately used. There may be a brief period after we receive tablet information, but before it is present in ClusterData during which we'll keep sending data to wrong nodes. I don't think it's much of a problem - the same thing would happen with any implementation if you send a few queries to the same parttition at once. The worst thing that happens is that some more queries will be sent to wrong nodes and we'll receive few more feedbacks.

LWT

One note about LWT: with token ring the LWT optimisation (sending LWT prepared statements to the same node from any clients, to avoid conflicts) was working always. Now it requires having tablet info for a given partition - so a first LWT to a given partition will not use the optimisation.

TODO:

Past questions / considerations

During implementation I had some dillemas questions, mostly solved / answered now, which are outlined below.

Tests with removed / decommisioned nodes

I'd like to test such scenarios, as suggested by @avelanarius , but I think we don't have a good way to do this now.
There is proxy, but afaik it can only simulate network errors.
It would be nice to use ccm (or something like this) in tests and have Rust API for it to manipulate cluster more easily.
Need to figure out a way to test it now.

Status: I'll run some SCT tests using cql-stress built using this branch of driver.

Unknown replica

Is it possible to get a replica uuid in custom payload that we don't know about? What should we do in this case?
Right now I'm just ignoring this replica, but I don't like this. This will cause load to be distributed less evenly
because we skip one (or more) of replicas.

Status: Implemented tablet maintanance to guard against this and some other problems.

What about `ReplicasOrdered`?

It returns replicas in ring order, which makes little sense in Tablets tables.
I'm thinking of just editing a comment to say that the order is just consistent, but doesn't have to be ring order.
ReplicasOrdered seems to only be used for LWT, so this should be fine.
To do this, I need to make sure that the order is actually consistent - because it's not obvious.

Also, I'd like to have some test that uses the path where it's used - probably neeed to use LWT with Tablet-enabled table?

Status: Assumed that replicas are returned in consistent order. Added integration tests performing LWT queries.

Integration tests

I wrote an integration test for tablets (last commit) but it doesn't always work, and I'm not sure how to write a proper test.
The test first inserts some row. Each query is executed multiple times and the test checks that at least one execution
was sent to wrong node and recevied tablet feedback. This is not always the case and I'm not sure how to force it.

@avelanarius suggested that there is "tablet_allocator_shuffle" HTTP api, but the comment in Scylla tests says
# Increases the chance of tablet migration concurrent with schema change which doesn't sound helpful here.

In Python driver implementation the test looks into tracing to verify that the query was sent to a correct node.
I'm not a big fan of this approach, as it doesn't verify that we can send things to the wrong nodes when we don't have the data.

Status: Managed to write correct integration tests using one-shot LBPs to make sure I send queries to all nodes/shards.

Wojciech's commits

What should I do with Wojciech's commits?
I'm considering squashing them (and introducing some changes to them so that resulting commit passes CI).
The changes are mostly mechanical and it would allow us to have atomic commits in this PR.

Status: Opened #944 with his changes cleaned up

Tablet feedback handled in Connection, not in Session

Tablet info is extracted from frame in Connection instead of Session + iterators.
I did look into moving that into session, but decided that it's too complicated and error-prone.
In the current implementation I can have certainty that we handle this for all requests.

Status: it will stay that way for now

`FromCqlVal::from_cql` is tablets.rs

Is it possible for this to fail? I don't see how, but I'm not very familiar with this are of code.
If it can't fail, I could avoid introducing new error type.

Status: I added .unwrap(), it seems that this call can't fail.

Parsing fail

What to do when parsing Tablet data from custom payload fails?
Erroring whole request seems excessive. Right now I just print a warning. Is it enough?

Status: Printing warning is enough, so I'll stick to that.

Fixes: #867

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

github-actions · 2024-02-22T15:22:08Z

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: 426db65

See the following report for details:

cargo semver-checks output

./scripts/semver-checks.sh --baseline-rev 729fa507705dcbfb86ecb388685dcd5795bd3b26
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev 729fa507705dcbfb86ecb388685dcd5795bd3b26
     Cloning 729fa507705dcbfb86ecb388685dcd5795bd3b26
     Parsing scylla v0.12.0 (current)
      Parsed [  18.401s] (current)
     Parsing scylla v0.12.0 (baseline)
      Parsed [  17.095s] (baseline)
    Checking scylla v0.12.0 -> v0.12.0 (no change)
     Checked [   0.492s] 72 checks; 69 passed, 3 failed, 0 unnecessary

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field RoutingInfo.table in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/transport/load_balancing/mod.rs:32
  field RoutingInfo.table in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/transport/load_balancing/mod.rs:32

--- failure method_parameter_count_changed: pub method parameter count changed ---

Description:
A publicly-visible method now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/method_parameter_count_changed.ron

Failed in:
  scylla::transport::ClusterData::get_token_endpoints now takes 4 parameters instead of 3, in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/transport/cluster.rs:461
  scylla::transport::locator::ReplicaLocator::replicas_for_token now takes 5 parameters instead of 4, in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/transport/locator/mod.rs:89

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field keyspace of struct RoutingInfo, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-729fa507705dcbfb86ecb388685dcd5795bd3b26/4fd6b841b2e221bb326561261b99f0c87a83c718/scylla/src/transport/load_balancing/mod.rs:27
  field keyspace of struct RoutingInfo, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-729fa507705dcbfb86ecb388685dcd5795bd3b26/4fd6b841b2e221bb326561261b99f0c87a83c718/scylla/src/transport/load_balancing/mod.rs:27
     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  36.032s] scylla
     Parsing scylla-cql v0.1.0 (current)
      Parsed [   9.218s] (current)
     Parsing scylla-cql v0.1.0 (baseline)
      Parsed [   9.261s] (baseline)
    Checking scylla-cql v0.1.0 -> v0.1.0 (no change)
     Checked [   0.236s] 72 checks; 69 passed, 3 failed, 0 unnecessary

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ResponseBodyWithExtensions.custom_payload in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla-cql/src/frame/mod.rs:172

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field ks_name of struct TableSpec, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-729fa507705dcbfb86ecb388685dcd5795bd3b26/4fd6b841b2e221bb326561261b99f0c87a83c718/scylla-cql/src/frame/response/result.rs:40
  field table_name of struct TableSpec, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-729fa507705dcbfb86ecb388685dcd5795bd3b26/4fd6b841b2e221bb326561261b99f0c87a83c718/scylla-cql/src/frame/response/result.rs:41

--- failure struct_pub_field_now_doc_hidden: pub struct field is now #[doc(hidden)] ---

Description:
A pub field of a pub struct is now marked #[doc(hidden)] and is no longer part of the public API.
        ref: https://doc.rust-lang.org/rustdoc/write-documentation/the-doc-attribute.html#hidden
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.31.0/src/lints/struct_pub_field_now_doc_hidden.ron

Failed in:
  field TableSpec.ks_name in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla-cql/src/frame/response/result.rs:40
  field TableSpec.table_name in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla-cql/src/frame/response/result.rs:40
     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  18.752s] scylla-cql
make: *** [Makefile:61: semver-rev] Error 1

avelanarius · 2024-02-22T17:00:28Z

@avelanarius suggested that there is "tablet_allocator_shuffle" HTTP api

[Based on discussion from Slack] We concluded that "tablet_allocator_shuffle" is great for manual testing, but for automatic testing there's not enough control to wait for it to take effect. Also, in practice it takes a long time (10s+) for tablet shuffle to actually happen.

avelanarius · 2024-02-22T17:03:12Z

Parsing fail
What to do when parsing Tablet data from custom payload fails?
Erroring whole request seems excessive. Right now I just print a warning. Is it enough?

I think it's enough.

piodul · 2024-02-23T07:59:50Z

Wojciech's commits

What should I do with Wojciech's commits? I'm considering squashing them (and introducing some changes to them so that resulting commit passes CI). The changes are mostly mechanical and it would allow us to have atomic commits in this PR.

I'd prefer the changes introduced by Wojciech's commits to be logically organized into commits. If the current organization of the code in the commits is bad (e.g. they weren't cleaned up fully and don't compile / pass tests / are illogical) then please put them into shape. If you want to preserve the credit then personally I think it is fine to add the original author as a co-author in the description.

Integration tests

I wrote an integration test for tablets (last commit) but it doesn't always work, and I'm not sure how to write a proper test. The test first inserts some row. Each query is executed multiple times and the test checks that at least one execution was sent to wrong node and recevied tablet feedback. This is not always the case and I'm not sure how to force it.

Do you know why this happens? Won't such a query be sent to a random node/shard every time until it gets information about the partition's tablet?

Tests with removed / decommisioned nodes

I'd like to test such scenarios, as suggested by @avelanarius , but I think we don't have a good way to do this now. There is proxy, but afaik it can only simulate network errors. It would be nice to use ccm (or something like this) in tests and have Rust API for it to manipulate cluster more easily. Need to figure out a way to test it now.

Yes, the rust driver repo doesn't have a good way to do tests with schema changes. I agree it would be nice to have it, but it is out of scope here and not a trivial effort.

Parsing fail

What to do when parsing Tablet data from custom payload fails? Erroring whole request seems excessive. Right now I just print a warning. Is it enough?

Sending to the wrong node/shard is not an error. Warning sounds OK to me.

Unknown replica

Is it possible to get a replica uuid in custom payload that we don't know about? What should we do in this case? Right now I'm just ignoring this replica, but I don't like this. This will cause load to be distributed less evenly because we skip one (or more) of replicas.

I think it is possible, for example if a node is added and some tablets are quickly moved to the new node before the driver refreshes information about topology. In such case it doesn't sound bad to ignore the replica (or maybe even send to a wrong node if we choose the unknown replica) until we refresh the metadata.

scylla/src/transport/locator/tablets.rs

piodul · 2024-02-23T07:57:36Z

scylla/src/transport/iterator.rs


-            let replicas: Option<smallvec::SmallVec<[_; 8]>> =
-                if let (Some(keyspace), Some(token)) =
-                    (statement_info.keyspace.as_ref(), statement_info.token)
-                {
-                    Some(
-                        config
-                            .cluster_data
-                            .get_token_endpoints_iter(keyspace, token)
-                            .map(|(node, shard)| (node.clone(), shard))
-                            .collect(),
-                    )
-                } else {
-                    None
-                };
+            let replicas: Option<smallvec::SmallVec<[_; 8]>> = if let (Some(kstable), Some(token)) =
+                (statement_info.table.as_ref(), statement_info.token)
+            {
+                Some(
+                    config
+                        .cluster_data
+                        .get_token_endpoints_iter(kstable, token)
+                        .map(|(node, shard)| (node.clone(), shard))
+                        .collect(),
+                )
+            } else {
+                None
+            };


Eh, that shows a drawback of using an automatic code formatter - it decided to format the code differently and it's hard to see what exactly changed (keyspace -> table).

Formatting in a separate commit avoids this, but then we can't say that each commit passes static checks :/

One solution for that could be adding a #[rustfmt(skip)] line temporarily, and dropping it in the next commit.

Imo this is not an issue with this commit - this is an issue with tooling.
Diff in VSCode:

Diff in Sublime Merge:

I'd advise to look at diffs in tools other than GitHub - at least in cases like this

piodul · 2024-02-23T08:10:01Z

As for this one:

Tablet feedback handled in Connection, not in Session

Tablet info is extracted from frame in Connection instead of Session + iterators. I did look into moving that into session, but decided that it's too complicated and error-prone. In the current implementation I can have certainty that we handle this for all requests.

...not super happy, but I see why you decided to do it. I think that the execute and execute_iter deserve some refactoring and simplification. For now, I'm good with putting the sender into Conection.

Lorak-mmk · 2024-03-02T02:27:59Z

Reworked commit structure of Wojciechs commits and posted them as separate PR: #944

github-actions · 2024-03-02T02:31:37Z