Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce CqlTimeuuid type #894

Merged
merged 10 commits into from
Jan 26, 2024
Merged

Introduce CqlTimeuuid type #894

merged 10 commits into from
Jan 26, 2024

Conversation

muzarski
Copy link
Contributor

Fixes: #749

Changes

  • Introduced CqlTimeuuid type as a wrapper for version 1 uuids
  • Implemented uuid::Uuid delegate methods for CqlTimeuuid
  • Implemented strong ordering on CqlTimeuuid type so it has the same semantics as Scylla/Cassandra
  • CqlValue::Timeuuid now wraps CqlTimeuuid
  • Implemented serialization and deserialization logic for CqlTimeuuid
  • Got rid of timeuuid <-> uuid::Uuid mapping

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

scylla/src/transport/cql_types_test.rs Outdated Show resolved Hide resolved
docs/source/data-types/data-types.md Show resolved Hide resolved
Comment on lines 772 to 780
let timeuuid = match CqlTimeuuid::from_slice(buf) {
Ok(timeuuid) => timeuuid,
Err(e) => {
return Err(ParseError::BadIncomingData(format!(
"Timeuuid parsing failed: {e}"
)))
}
};
CqlValue::Timeuuid(timeuuid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column. If they are then perhaps we can do this validation here as well, otherwise I would abstain from it so that users don't run into a situation where they cannot read data they inserted earlier or via another driver.

Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column.

A simple test case shows that Scylla validates whether uuid inserted to timeuuid column is a valid timeuuid.

#[tokio::test]
async fn test_invalid_timeuuid() {
    let session: Session = create_new_session_builder().build().await.unwrap();
    let ks = unique_keyspace_name();

    session
        .query(
            format!(
                "CREATE KEYSPACE IF NOT EXISTS {} WITH REPLICATION = \
            {{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1}}",
                ks
            ),
            &[],
        )
        .await
        .unwrap();
    session.use_keyspace(ks, false).await.unwrap();

    session
        .query(
            "CREATE TABLE tab (p int, t timeuuid, PRIMARY KEY (p, t))",
            (),
        )
        .await
        .unwrap();

    // Invalid timeuuid (version 2 uuid)
    let invalid_timeuuid = uuid::uuid!("00000000-0000-2000-8080-808080808080");

    session
        .query("INSERT INTO tab (p, t) VALUES (0, ?)", (invalid_timeuuid,))
        .await
        .unwrap();
}

The result:

thread 'transport::cql_types_test::test_invalid_timeuuid' panicked at scylla/src/transport/cql_types_test.rs:1731:10:
called `Result::unwrap()` on an `Err` value: DbError(Invalid, "Exception while binding column t: marshaling error: Unsupported UUID version (2)")

Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.

But I agree, I'll get rid of the validation part in CqlTimeuuid constructors i.e. replace

impl TryFrom<Uuid> for CqlTimeuuid {
    type Error = InvalidTimeuuidVersion;

    fn try_from(value: Uuid) -> Result<Self, Self::Error> {
        let version = value.get_version_num();
        if version != 1 {
            return Err(InvalidTimeuuidVersion(version));
        }

        Ok(Self(value))
    }
}

with

impl From<Uuid> for CqlTimeuuid { ... }

@muzarski muzarski force-pushed the timeuuid branch 3 times, most recently from b930435 to 5af99f6 Compare January 10, 2024 15:39
@muzarski muzarski requested a review from piodul January 10, 2024 15:49
@muzarski
Copy link
Contributor Author

v2:

  • rebased on top of master
  • fixed review comments

@avelanarius avelanarius added this to the 0.12.0 milestone Jan 15, 2024
scylla-cql/src/frame/response/result.rs Outdated Show resolved Hide resolved
scylla-cql/src/frame/value_tests.rs Outdated Show resolved Hide resolved
Comment on lines +1179 to +1190
let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![
CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(),
CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(),
CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(),
CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(),
CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(),
];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, if we test sorting, it would be good to have unsorted data. You can use https://docs.rs/rand/latest/rand/seq/trait.SliceRandom.html#tymethod.shuffle to avoid having 2 literals in the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.

I also think that we should test multiple orderings in one run with sort() - let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.

Copy link
Collaborator

@piodul piodul Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also use a random generator with known algorithm - std::thread_rng uses an unspecified algorithm which can probably change in the future. We already have tests with fixed pseudorandomness - for example, see partitioners_output_same_result_no_matter_how_input_is_partitioned.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, 362880 is not a huge number. Perhaps checking all the permutations would not take much time (let's aim under 100ms)? We can iterate over permutations using Itertools::permutations().

scylla-cql/src/frame/value.rs Outdated Show resolved Hide resolved
@muzarski muzarski requested a review from Lorak-mmk January 19, 2024 05:59
@Lorak-mmk Lorak-mmk added the API-breaking This might introduce incompatible API changes label Jan 22, 2024
scylla-cql/src/frame/value.rs Outdated Show resolved Hide resolved
scylla-cql/src/frame/value.rs Show resolved Hide resolved
scylla-cql/src/frame/value.rs Show resolved Hide resolved
scylla-cql/src/frame/value.rs Show resolved Hide resolved
Comment on lines 123 to 127
// Compare two values of timeuuid type.
// Cassandra legacy requires:
// - using signed compare for least significant bits.
// - masking off UUID version during compare, to
// treat possible non-version-1 UUID the same way as UUID.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to have this information in the docstring for CqlTimeuuid. The custom behaviors of CqlTimeuuid make it something more than a mere wrapper over Uuid, so I think it needs to be properly explained in the documentation.

Comment on lines +1179 to +1190
let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![
CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(),
CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(),
CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(),
CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(),
CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(),
CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(),
];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.

I also think that we should test multiple orderings in one run with sort() - let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.

Introduced a `CqlTimeuuid` type which is a wrapper for v1 uuids.
By default, uuids are compared in lexicographical order.
However, scylla (and cassandra) use different comparison semantics for
timeuuids (v1 uuids).

Timeuuids contain an encoded timestamp within 8 most-significant bytes.
Scylla compares timeuuids in a following manner:
- compare the timestamps retrieved from 8 msb
- if timestamps are equal, perform a signed comparison of 8 lsb
Before this commit, users were able to map native timeuuid type to
rust `uuid::Uuid` type and vice-versa.

After this change, users are only allowed to use `value::CqlTimeuuid` type
to work with scylla's native timeuuid type.

Changes:
- timeuuid test needed to be adjusted, so it maps columns to `CqlTimeuuid`
  instead of `uuid::Uuid`
- TracingEvent struct needs to be adjusted as `event_id` column of
  `system_traces.events` table is of `timeuuid` type
@muzarski
Copy link
Contributor Author

v3:

  • rebased on top of master
  • addressed review comments

@muzarski muzarski requested a review from piodul January 25, 2024 14:26
Copy link
Collaborator

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nitpicks, but they aren't blocking the merge - can be fixed later. LGTM.

Comment on lines +159 to +160
self.lsb_signed().hash(state);
self.msb().hash(state);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the lsb_signed() and msb() do some unnecessary work - the former performs a XOR, and the latter shuffles bytes around. It would be sufficient to take the bytes as they are, AND one of the bytes (like msb() does now) and compare. Same goes for the PartialEq implementation.

These are micro-optimizations with dubious performance impact, so it's fine to implement them in a followup (or never).

];

// Generate all permutations.
let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Itertools is an extention trait for iterators. While what you wrote works, you are supposed to use it on the iterators directly:

let perms = sorted_timeuuid_vals
    .iter()
    .permutations(sorted_timeuuid_vals.len())
    .collect::<Vec<_>>();

Comment on lines +1192 to +1196
// Generate all permutations.
let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len())
.collect::<Vec<_>>();
// Ensure that all of the permutations were generated.
assert_eq!(362880, perms.len());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the perms are going to take a few tens of megabytes of memory: an allocation of 362880 * 3 * 8 = ~8.3MB of contiguous memory for the direct storage of perms alone (but may be up to twice as much due to vec exponential reallocation) and 362880 * 9 * 16 = ~50MB for the timeuuids alone (spread over 362880 individual allocations).

While this is not very much for today's standards, you need to take into account that many test cases can run in parallel, so this memory usage adds up - and it's a good practice in general to be mindful of those allocations.

In this case, it could have been easily and completely avoided by not collecting the permutations iterator but rather consuming it in a loop.

Again, not a huge problem and not blocking the merge in my opinion. Can be fixed in a followup.

@piodul piodul merged commit d01b473 into scylladb:main Jan 26, 2024
9 checks passed
@Lorak-mmk Lorak-mmk mentioned this pull request Feb 1, 2024
@muzarski muzarski deleted the timeuuid branch October 29, 2024 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API-breaking This might introduce incompatible API changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a TimeUuid struct
4 participants