Deserialization API rework #665
Conversation
There is a typo in
force-pushed from 00e71d8 to b3d824f
v0.1: rebased and fixed clippy's complaints
force-pushed from 05a4ba8 to 86ec047
v1:

There is a small number of TODOs that should be fixed in the near future, but I think that even without them the PR is good to go now (feel free to disagree). Moving out of draft.
So far I have perused only 4 commits, but thanks to their volume that's enough for a firm burst of comments. I'll continue the review over the next days.
impl<'frame> Iterator for UdtIterator<'frame> {
    type Item = Result<
        (
            &'frame (String, ColumnType),
            Option<Option<FrameSlice<'frame>>>,
        ),
        ParseError,
    >;

    fn next(&mut self) -> Option<Self::Item> {
        // TODO: Should we fail when there are too many fields?
Let's discuss this. Should we?
If we got more fields than we expected for a UDT, why should we let it silently pass? IMO it deserves at least a warning. Failure is appropriate as well, as I can't find any counterexample.
There's a reason to allow fewer fields in the serialized form than there are in the type definition. UDTs support the `ALTER TYPE` operation, which allows adding a new field to the type. From that point on, all values of this UDT will have a value for the new field, but AFAIK we don't update the old values that were already stored in the table - that would require a costly full table scan. Instead, when reading a value with fewer fields than defined in the type, we just assume that the value for those fields is `null`.
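To make that rule concrete, here is a self-contained sketch using a plain `Option<Option<T>>` in place of the `Option<Option<FrameSlice>>` item type of the iterator quoted above (illustrative only, not the driver's code):

```rust
// The outer Option marks whether the field was present in the serialized UDT
// value at all; the inner Option is the usual null/non-null distinction.
// A field missing from an old row (written before `ALTER TYPE ... ADD`) is
// treated exactly like an explicit null.
fn missing_means_null<T>(raw_field: Option<Option<T>>) -> Option<T> {
    match raw_field {
        Some(value_or_null) => value_or_null, // field present, possibly null
        None => None,                         // field absent in the old row -> null
    }
}

fn main() {
    assert_eq!(missing_means_null(Some(Some(42))), Some(42));
    assert_eq!(missing_means_null::<i32>(Some(None)), None);
    assert_eq!(missing_means_null::<i32>(None), None);
}
```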
The protocol spec only allows fewer fields to be sent, and the description implies that it's incorrect to send more:
A UDT value is composed of successive [bytes] values, one for each field of the UDT
value (in the order defined by the type). A UDT value will generally have one value
for each field of the type it represents, but it is allowed to have less values than
the type has fields.
The TODO is mostly about considering how to handle a case where the database sends a response that does not conform to the spec. If we don't trust the DB then this method will have to be written a bit differently.
@cvybhu @wprzytula review ping
// TODO: We could provide ResultMetadata in a similar, lazily
// deserialized form - now it can be a source of allocations
When do we plan to take care of this TODO?
I'd rather not do it in this PR. It's quite complicated already. I'll create an issue after this is merged.
force-pushed from 86ec047 to c25d5ad
v1.1: rebased on newest main (no other changes)
Looks very nice, I got through some of it, will continue later.
            Ok(MaybeEmpty::Value(v))
        }
    }
}
Some types can't be empty, but `MaybeEmpty` doesn't forbid creating a `MaybeEmpty` with them. For example, it doesn't make sense to create a `MaybeEmpty<String>`. In the case of an empty string the parsed value should be `MaybeEmpty::Value("".to_string())`, but instead it will be `MaybeEmpty::Empty`.

Perhaps we could introduce a trait, `EmptyableCqlType`, and have `MaybeEmpty<T: EmptyableCqlType>`? This could protect against such mistakes at compile time.
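A minimal sketch of that idea, using the names proposed above (not the final API):

```rust
// Marker trait for CQL types that may legally be serialized as the special
// zero-length "empty" value.
trait EmptyableCqlType {}

// Only types for which "empty" is meaningful opt in:
impl EmptyableCqlType for i32 {}
impl EmptyableCqlType for i64 {}

#[derive(Debug, PartialEq)]
enum MaybeEmpty<T: EmptyableCqlType> {
    Empty,
    Value(T),
}

fn main() {
    let x: MaybeEmpty<i32> = MaybeEmpty::Value(7);
    let e: MaybeEmpty<i64> = MaybeEmpty::Empty;
    assert_eq!(x, MaybeEmpty::Value(7));
    assert_eq!(e, MaybeEmpty::Empty);
    // let bad: MaybeEmpty<String> = MaybeEmpty::Empty; // does not compile:
    // String doesn't implement EmptyableCqlType, which is exactly the point.
}
```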
Hmm, I remember that I considered it when implementing `MaybeEmpty`. I think the reason why I dropped it was that it's easy to forget to add `EmptyableCqlType` when adding a new `impl DeserializeCql`. Besides, while `MaybeEmpty<String>` doesn't make much sense from the CQL protocol point of view, it does work and it should be pretty clear to the user what it does.

However, now that I think of it, it isn't a huge problem, and adding a missing `impl EmptyableCqlType` is easy and doesn't break the API. Moreover, unless users are well-versed in the CQL protocol spec (I wouldn't expect or require them to be), they won't know which types can be empty, so this might serve an informative purpose. In the future, if we rework the serialization API, then we'd rather use `MaybeEmpty` there, too - and then `EmptyableCqlType` will be even more important for the purpose of serialization.

To sum up - I'm convinced, I'll add the trait.
"Value is out of representable range for NaiveDate".to_string(), | ||
) | ||
}) | ||
} |
The previous code for `NaiveDate` contained comments as to what's happening and why the `- (1i64 << 31)` doesn't panic. I think it would be good to keep them.
Isn't it obvious in the new code? The old code operated on `i64`s implicitly, i.e. it subtracted two `chrono::Duration` values. In the new code, we are operating directly on `i64`s.
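For reference, a tiny standalone sketch of the arithmetic in question (the wire-format fact - an unsigned 32-bit day count with the epoch at 2^31 - is from the CQL protocol; the rest is illustrative):

```rust
fn main() {
    // CQL `date` on the wire: a u32 number of days, with the Unix epoch at 2^31.
    let raw: u32 = (1u32 << 31) + 1; // one day after the epoch
    // Widening to i64 before subtracting means `- (1i64 << 31)` can neither
    // overflow nor panic: the result always fits in [-(2^31), 2^31 - 1].
    let days_since_epoch = raw as i64 - (1i64 << 31);
    assert_eq!(days_since_epoch, 1);
}
```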
impl QueryResult {
    /// Returns the number of received rows, or `None` if the response wasn't of Rows type.
    pub fn rows_num(&self) -> Option<usize> {
        Some(self.raw_rows.as_ref()?.rows_count())
I disagree, for me the current version is much easier to read. The more functional style requires more brainpower to parse, taking something and mapping it using a member function is a bit unnatural. OTOH having Some() with some code inside is a standard thing that's easy to parse.
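For context, the two equivalent shapes being compared - a self-contained sketch where `RawRows` and `rows_count` merely stand in for the driver's types:

```rust
struct RawRows {
    count: usize,
}

impl RawRows {
    fn rows_count(&self) -> usize {
        self.count
    }
}

struct QueryResult {
    raw_rows: Option<RawRows>,
}

impl QueryResult {
    // Style kept in the PR: `?` on the Option, wrapped in `Some(...)`.
    fn rows_num(&self) -> Option<usize> {
        Some(self.raw_rows.as_ref()?.rows_count())
    }

    // A functional equivalent of the kind discussed above.
    fn rows_num_alt(&self) -> Option<usize> {
        self.raw_rows.as_ref().map(RawRows::rows_count)
    }
}

fn main() {
    let qr = QueryResult { raw_rows: Some(RawRows { count: 3 }) };
    assert_eq!(qr.rows_num(), qr.rows_num_alt());
}
```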
        match &self.raw_rows {
            Some(rows) => Ok(Some(rows.rows_iter()?)),
            None => Ok(None),
        }
I think `match` is more readable than chaining functional combinators. Everyone knows what match does, and the current version does everything neatly step-by-step.

Functional combinators require the reader to have them freshly in their head. With imperative code you can understand what's going on just by looking at the sequence of instructions, but with functional stuff the reader might have to google how `and_then` differs from `map`, etc... I think the code should be as accessible as possible, even for people who don't write Rust all the time.
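Again for context, the two forms side by side in a self-contained sketch (`Rows`, `RowsIter` and `ParseError` are stand-ins for the driver's types):

```rust
struct ParseError;
struct Rows;
struct RowsIter;

impl Rows {
    fn rows_iter(&self) -> Result<RowsIter, ParseError> {
        Ok(RowsIter)
    }
}

struct QueryResult {
    raw_rows: Option<Rows>,
}

impl QueryResult {
    // Explicit match, step by step - the style defended above.
    fn rows_iter_match(&self) -> Result<Option<RowsIter>, ParseError> {
        match &self.raw_rows {
            Some(rows) => Ok(Some(rows.rows_iter()?)),
            None => Ok(None),
        }
    }

    // Combinator equivalent: map over the Option, then flip the resulting
    // Option<Result<_, _>> into a Result<Option<_>, _> with `transpose`.
    fn rows_iter_combinators(&self) -> Result<Option<RowsIter>, ParseError> {
        self.raw_rows.as_ref().map(Rows::rows_iter).transpose()
    }
}
```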
    /// If the UDT definition does not contain this field, it will be initialized
    /// with `Default::default()`. __This attribute has no effect in `enforce_order`
    /// mode.__
Why not make it work with `enforce_order`? I imagine that people who use `enforce_order` for the extra performance boost might need to use it too.

AFAIU the main use of `default_when_missing` is gracefully handling the `ADD FIELD` operation on UDTs. When someone wants to add the field while the app is running, they should:

- Add a new field to the Rust struct and mark it as `default_when_missing`
- Roll out the new, updated version of the db client
- Do the `ADD FIELD` operation in the database

In such a situation it's crucial to be able to use `default_when_missing`. We shouldn't penalize people who use strict order enforcing by making it impossible for them to handle such changes gracefully.
Also, if this combination isn't supported, I think trying to do both of them at once should generate a compilation error instead of silently ignoring `default_when_missing`. Ignoring it might lead to some nasty surprises for the users.
Ok, we can support `enforce_order` + `default_when_missing`. If a struct field doesn't match the column name, we will initialize that field to `Default::default()` and will try to initialize the next field with that column.
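A rough, self-contained sketch of that matching rule (the function below is illustrative only - it is not how the derive macro is actually implemented):

```rust
// Walk struct fields and serialized columns in lockstep. When the current
// struct field's name doesn't match the current column, default the field
// and retry the same column against the next field.
fn match_fields(struct_fields: &[&str], columns: &[&str]) -> Vec<(String, bool)> {
    // (field name, true if read from a column / false if defaulted)
    let mut out = Vec::new();
    let mut cols = columns.iter().copied().peekable();
    for &field in struct_fields {
        match cols.peek().copied() {
            Some(col) if col == field => {
                cols.next();
                out.push((field.to_string(), true));
            }
            // Mismatch or no columns left: the field was added on the Rust
            // side ahead of the DB schema, so it gets Default::default().
            _ => out.push((field.to_string(), false)),
        }
    }
    out
}

fn main() {
    // Column `c` has not been added in the database yet: field `c` is
    // defaulted and the remaining fields still line up in order.
    let res = match_fields(&["a", "b", "c", "d"], &["a", "b", "d"]);
    assert_eq!(
        res,
        vec![
            ("a".to_string(), true),
            ("b".to_string(), true),
            ("c".to_string(), false),
            ("d".to_string(), true),
        ]
    );
}
```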
        v: Option<FrameSlice<'frame>>,
    ) -> Result<Self, ParseError>;
}
I'm a bit worried that the user has to remember to call `type_check` before doing `deserialize`. As the code gets more complex, with all the iterators and whatnot, it's hard to prove to myself that `type_check` was indeed called.

I wonder if it would be possible to use the type system to make such mistakes impossible. Something like `TypeChecked<T>()` or something.
I don't think it is a problem for the users at all. We, as the developers, need to provide abstractions that always call `type_check` before `deserialize` - and this PR provides them - so the users will be fine. Apart from implementing `DeserializeCql`/`DeserializeRow` themselves, I don't think there will ever be a need for users to call `type_check` or `deserialize` directly.
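To illustrate the point (and the `TypeChecked<T>`-style idea from the comment above), here is a self-contained sketch of an abstraction that makes the ordering impossible to get wrong; all names are illustrative, not the driver's API:

```rust
use std::marker::PhantomData;

struct TypeCheckError;
struct ParseError;

// Stand-in for the real deserialization trait: one up-front type check,
// then per-value deserialization that may assume the check already passed.
trait DeserializeValue: Sized {
    fn type_check(type_name: &str) -> Result<(), TypeCheckError>;
    fn deserialize(raw: &[u8]) -> Result<Self, ParseError>;
}

// The only way to construct the iterator is through `new`, which runs
// `type_check` - so every later `deserialize` call is guaranteed to be
// preceded by a successful type check.
struct TypedRows<'a, T> {
    raw_rows: std::slice::Iter<'a, Vec<u8>>,
    _marker: PhantomData<T>,
}

impl<'a, T: DeserializeValue> TypedRows<'a, T> {
    fn new(type_name: &str, raw_rows: &'a [Vec<u8>]) -> Result<Self, TypeCheckError> {
        T::type_check(type_name)?; // once per page
        Ok(TypedRows {
            raw_rows: raw_rows.iter(),
            _marker: PhantomData,
        })
    }
}

impl<'a, T: DeserializeValue> Iterator for TypedRows<'a, T> {
    type Item = Result<T, ParseError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.raw_rows.next().map(|raw| T::deserialize(raw)) // once per row
    }
}
```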
force-pushed from c25d5ad to b6a2495
v1.2: rebased, fixed one test that used to fail in serverless mode
force-pushed from 46e35a3 to 52941ee
Both Session and Legacy08Session will support methods that allow sending queries/prepared statements/batches and will share most of the implementation - it's just that the return types will be slightly different. This commit moves the core of those methods to private methods `do_xyz` for every `xyz` method from the API. This will allow implementing the public methods for both API kinds with minimal boilerplate.
Adds Session as an alias over GenericSession<CurrentDeserializationApi>. No methods (apart from the common ones) are added to it yet.
This commit renames the SessionBuilder::build method to build_legacy, and then reintroduces the build method so that it returns the new Session (not Legacy08Session). All the examples, tests, documentation will gradually be migrated to use SessionBuilder::build again in following commits.
The query/execute/batch statements are generic over the type of the values that are meant to be serialized and sent to the cluster. They started by converting the values to SerializedValues and then continued without the need for generics. However, those functions used to be non-trivial and would have to be monomorphised for every type of the arguments passed to the method, increasing compilation time more than necessary. Now that most of the implementation was moved to do_query etc. methods, we can restrict the generic part to the public query/execute/batch methods which convert the input values to SerializedValues and then call the non-generic do_query etc. methods. This commit does just that - de-genericises do_query and friends, while leaving query and friends generic as they used to be.
QueryResult can be converted to Legacy08QueryResult, but not the other way around. In order to support both APIs, internal methods (do_query, do_execute, etc.) need to be changed so that they return the new QueryResult.
Implements methods related to sending queries for the new Session.
Adjusts the methods of Connection, apart from query_iter, to use the new deserialization API. Connection is meant to be an internal API, so we don't introduce a Legacy08Connection for this.
In a similar fashion to Session, CachingSession was also made generic over the session kind.
Adjusts the CachingSession tests to use the new deserialization interface.
The Connection::query_iter method is changed to use the new deserialization framework. All the internal uses of it in topology.rs are adjusted.
Adjusts the Session::try_getting_tracing_info method to use the new deserialization framework.
This is a large commit which goes over all existing tests that haven't been migrated in previous commits and adjusts them to use the new deserialization framework. There were lots of changes to be made, but they are mostly independent from each other and very simple.
This commit goes over all unadjusted examples and changes them to use the new deserialization framework. Again, it contains a lot of changes, but they are quite simple.
The Cluster struct by itself only serves as a facade for the ClusterWorker, i.e. it has channels that allow sending requests to the worker, receives the ClusterData via the `data` field etc. Apart from the `_worker_handle` field, all other fields are cloneable. Two tasks working on two copies of the same Cluster object should behave the same as if they shared and operated on a single Cluster object (e.g. via Arc<Cluster>). This commit makes the Cluster object cloneable - the `_worker_handle` is shared via an Arc. This will be very useful in the next commit - we will do a similar thing for the GenericSession object.
In order to make migration from the old API easier and allow doing it gradually, some components of the client programs would probably like to use the old API while the new components use the new API. However, in the current design, Session and Legacy08Session are two different types and it's not possible to "cast" one to the other - even though they have nearly the same fields and implementations. The previous commit made Cluster cloneable, based on the observation that it's perfectly fine to clone Cluster's fields, construct a new one and treat it as a shared facade, a handle to the same "semantic" cluster. The same applies to Session, actually - cloning a session would have a similar effect (though we encourage users to keep Session in an Arc so that cloning is cheaper). Instead of making GenericSession cloneable, we introduce methods which, in reality, perform a clone but change the kind of the session's API. This allows having two session objects which share the same resources but have different APIs. This should be very useful when migrating large projects to the new API - components that need to use the new API can just "convert" the session to the new interface and use that.
Examples in the documentation are now adjusted to the new deserialization API, and the examples there finally compile.
Adds a document that should help adjust users to the new deserialization API.
force-pushed from b6e9507 to 46b6e4d
v2.1: rebased on main
Apart from the commit introducing new procedural macros, I've completed the review.
// UDF fields in correct same order
let mut udt_contents = BytesMut::new();
append_cell(&mut udt_contents, "The quick brown fox".as_bytes());
append_cell(&mut udt_contents, &42i32.to_be_bytes());
Did you mean: UDT fields?
While it's possible to implement those manually, the driver provides procedural macros for automatic derivation in some cases:

- `FromRow` - implements `FromRow` for a struct.
- `FromUserType` - generated an implementation of `FromCqlVal` for the struct, trying to parse the CQL value as a UDT.
s/generated/generates
Note: the macros above have a default behavior that is different than what `FromRow` and `FromUserTypes` do.
This is unclear.
What is their default behaviour and what are the differences? Also, as the names here are similar both for macros and traits, it is crucial to distinguish them with some strong markers.
The new API introduce two analogous traits that, instead of consuming pre-parsed `Vec<Option<CqlValue>>`, are given raw, serialized data with full information about its type. This leads to better performance and allows for better type safety.
What kind of information about the serialized data is given to the new API? Provide examples.
}
```

The above traits have been implemented for the same set of types as `FromRow` and `FromCqlVal`, respectively. Notably, `DeserializeRow` is implemented for `Row`, and `DeserializeCql` is implemented for `CqlValue`.
It can be helpful to remind the reader that `Row` and `CqlValue` are representatives of the legacy API.
The new derive macros' code looks promising. I've also examined the generated code using `cargo expand` and it looks valid, too. Only minor nits there.
// TODO: Allow collecting unrecognized fields into some special field
What kind of TODO is this one? An ad-kalendas-Graecas one, or something to be done in a follow-up?
I'll split this PR into multiple smaller ones. It's quite hard to keep going back to a single huge PR, rebasing it and addressing lots of comments at once, so splitting should make the process smoother. I'll address the existing comments in the upcoming, smaller PRs.
This PR reworks the deserialization API. The new API solves some performance problems that the old one had and makes it possible to improve type safety. As this change will probably affect lots of existing code, the old API is still available but marked as "Legacy" - it's opt-in but only some small changes will be required to use it.
What are the problems with the current API?
Currently, when the driver receives a response from the database, it immediately deserializes it to `Vec<Row>`, where `Row = Vec<CqlValue>` and `CqlValue` is an enum that is able to represent any CQL type. Then, the user can choose to iterate over this representation, converting rows with the `FromRow` trait. When a row is converted, it usually uses the `FromCqlVal` trait to convert individual column values. This design has the following problems:

- It is wasteful in terms of allocations: each page of the response needs at least `#rows + 1` allocations just for the `Vec`s alone. Moreover, some column types will allocate too - e.g. `text` (represented by a `String`) and `blob` (represented with a `Vec<u8>`), as well as collections and UDTs. Many of those allocations could be perfectly avoided by deserializing data on demand: non-allocating alternatives such as `&[u8]` can be used to represent some types, collections can be consumed via iterators, UDTs can be deserialized directly to structs with the help of derive macros, and the need to allocate and deserialize everything up front can be avoided by iterating over the rows and deserializing them lazily.
- `FromCqlVal` and `FromRow` receive incomplete information about the rows and columns. For example, the `CqlValue::List` variant keeps its values in a `Vec`, so it can't validate the type if the vec is empty. A more serious issue with `FromRow` is that it doesn't know the names of the columns, so the `FromRow` derive macro can't validate that the column types match, nor can it support deserializing structs whose fields are in a different order than the columns.

Those issues couldn't really be fixed without an API redesign, which is why this PR has been created.
The new API
The new API is based on two new traits, `DeserializeCql` and `DeserializeRow`.
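Roughly, they have the following shape (a sketch reconstructed from the snippets quoted in the review comments above - the exact signatures in the PR may differ, and `ParseError`, `ColumnType`, `ColumnSpec`, `FrameSlice` and `ColumnIterator` are the driver's types, stubbed here only to keep the sketch self-contained):

```rust
// Stubs standing in for the driver's real types.
pub struct ParseError;
pub enum ColumnType { /* Int, Text, Udt { .. }, ... */ }
pub struct ColumnSpec { /* column name + ColumnType */ }
pub struct FrameSlice<'frame> {
    pub bytes: &'frame [u8],
}
pub struct ColumnIterator<'frame> {
    pub raw: FrameSlice<'frame>,
}

pub trait DeserializeCql<'frame>: Sized {
    // Checked up front against the type declared in the response metadata.
    fn type_check(typ: &ColumnType) -> Result<(), ParseError>;

    // May assume `type_check` succeeded; borrows from the serialized frame.
    fn deserialize(
        typ: &'frame ColumnType,
        v: Option<FrameSlice<'frame>>,
    ) -> Result<Self, ParseError>;
}

pub trait DeserializeRow<'frame>: Sized {
    // Unlike `FromRow`, receives the full column specs, including names.
    fn type_check(specs: &[ColumnSpec]) -> Result<(), ParseError>;

    fn deserialize(row: ColumnIterator<'frame>) -> Result<Self, ParseError>;
}
```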
Instead of receiving pre-parsed data, both traits receive objects that borrow data from the serialized frame - hence the `'frame` lifetime parameter. This allows the types being deserialized to allocate only when they need to (e.g. in the case of `String`). However, the new API also allows types that borrow data from the serialized frame - for example `&[u8]` or `&str`. There is also a third option - the type being deserialized can share ownership of the serialized frame: for example, you can deserialize a `blob` to a `bytes::Bytes`, which will keep the original, serialized frame alive for as long as it is alive. This is easier to use than slices or `str`s, but one needs to be careful, as even a tiny blob will keep the memory for the whole serialized frame alive, which can lead to space leaks.

Another notable change is the `type_check` method. In the case of `DeserializeRow` it receives more comprehensive information than `FromRow`, including the column names, so matching column names to struct fields is now possible - the derive macro for `DeserializeRow` has that functionality. Moreover, because type checking is separate from `deserialize`, it can be done only once when a response is received - if the page with the response has 1000 rows, we only need to call `type_check` once per page but `deserialize` once per row. `deserialize` can assume that `type_check` has been called and can avoid some potentially expensive type checks.

Migration from the legacy API
Apart from the use of different traits, the new API is not that different from the existing one. Some method names in `QueryResult` and `RowIterator` (now `RawIterator`) have been changed and some fields have been hidden, but otherwise the changes that need to be made are frequently mechanical - the commits that adjust the examples and tests are an example of that. Nonetheless, the old API has existed for a long time, so requiring folks to switch everything at once wouldn't be right in my opinion. This PR also makes sure to preserve the old API - while it's now opt-in, it's very easy to adjust existing code to use it: just do `use scylla::LegacySession as Session`, and similarly for the other types that have been marked as "legacy". There are also some facilities that make it possible to migrate the code piecewise - for example, you can create a new-API Session from a LegacySession; although it has a different API, it shares the same resources. There will also be a migration guide included (although it's still a TODO at this point, which is one of the reasons this is a draft PR).

TODOs
Fixes: #462
Pre-review checklist
- I added appropriate `Fixes:` annotations to PR description.