Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speculative_execution test flakiness #967

Closed
Lorak-mmk opened this issue Mar 20, 2024 · 2 comments · Fixed by #969
Closed

speculative_execution test flakiness #967

Lorak-mmk opened this issue Mar 20, 2024 · 2 comments · Fixed by #969
Assignees

Comments

@Lorak-mmk
Copy link
Collaborator

speculative_execution_is_fired often fails in CI. We debugged it with @wprzytula . The problem is test assumes that each execution will be sent to different node which is no longer true in all cases after #944

Speculative execution uses load balancing Plan, which wraps usages of LBP::pick and LBP::fallback. If some target in fallback is the same as the one from pick, it is filtered out by Plan.
Before #944 Plan filtered out nodes, not targets (nodes + shards), which with guarantee that fallback doesn't return the same node twice guaranteed that nodes in Plan are unique.
Now those guarantees are no longer provided:

  • For unprepared queries shards returned from LBP are random, so it's possible for pick to return node X with shard 0 and for node X to appear in fallback with node 1. This won't be filtered out by Plan as those are different targets.
  • For prepared statements it's possible for the same node to appear in fallback multiple times. One case is with tablets - and that's intended. But it's possible even without tablets - first elements of fallback are replicas, and then other nodes in some order, with random shards. Those other nodes won't be filtered out because they are different targets (because of different nodes).

Solution after a long discussion with @wprzytula:

  • For unprepared queries ignore the problem - it will be possible for the first node in Plan to appear later with a different shard number. This is just one repetition, no more, so it shouldn't be a problem, as those are unprepared queries (so the user probably doesn't care that much about performance of such query).
  • Alternative solution for unprepared queries that we didn't discuss is to always return None from pick. Then fallback can take care of uniqness of nodes. WDYT @wprzytula ?
  • For prepared statements: don't eliminate repetitions in the first part of plan (that contains replicas). For the random part of the plan skip the nodes that appeared in the first part of the plan. This way there will be no repeating nodes in Plan except if 2 replicas happen to be on the same node.
@Lorak-mmk
Copy link
Collaborator Author

* Alternative solution for unprepared queries that we didn't discuss is to always return `None` from `pick`. Then `fallback` can take care of uniqness of nodes. WDYT @wprzytula ?

Now that I think of it I'm really starting to like the idea. It will require some allocations in the unprepared case, but will provide consistent behavior in exchange.

@wprzytula
Copy link
Collaborator

I'm not in favour of additional allocations on the happy path of unprepared queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants