
Replace Connection::query_all with Connection::query_iter #645

Merged: 13 commits into scylladb:main on Mar 2, 2023

Conversation

@piodul (Collaborator) commented Feb 17, 2023

For internal queries that can return a large number of rows (topology info, schema), we currently use Connection::query_all to perform them. That method fetches the results of the query in multiple pages and then returns them as a single combined QueryResult. This PR replaces it with a new Connection::query_iter method whose semantics are more similar to the iterator-based query methods from the public interface (Session::query_iter, Session::execute_iter).

The motivations for this change are as follows:

  • Although the new method will use slightly more memory if the query returns only one page (because of the need to spawn a new task, channels, etc.), it should consume less memory if multiple pages are fetched, because rows are processed online as new pages appear and only 1-2 pages need to exist at a given time.
  • The logic that combines multiple pages into a single QueryResult will be hard to translate to use the upcoming iterator-based deserialization interface (Switch deserialization to an iterator-based interface to avoid allocations #462). Although the old interface will still be available as a deprecated fallback, I would like to adapt those methods as a form of dogfooding test for the new interface.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I added appropriate Fixes: annotations to PR description.

The common part of new_for_query and new_for_prepared_statement, which
handles spawning a task and returning a non-empty RowIterator, is
extracted into a shared function. This will also help implement the
query_iter method for Connection.
We have two places in the iterator worker code that attempt to send an
empty page, and we will have one more in the upcoming commits, so let's
abstract this pattern away into a method.
@piodul (Collaborator, Author) commented Feb 17, 2023

https://github.com/scylladb/scylla-rust-driver/actions/runs/4203056593/jobs/7291978661

running 2 tests
test retries_occur ... ok
test speculative_execution_is_fired ... FAILED

failures:

---- speculative_execution_is_fired stdout ----
Unique name: test_rust_1676631776_1
thread '<unnamed>' panicked at 'Worker failed: Node 172.42.0.2:9042 disconnected', scylla/tests/retries.rs:95:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'speculative_execution_is_fired' panicked at 'timeout: the function call took 3772 ms. Max time 30000 ms', scylla/tests/retries.rs:20:1


failures:
    speculative_execution_is_fired

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 3.77s

Strange; it sounds a bit as if a node crashed. It's hard to tell for sure because we don't print logs from the scylla docker containers at all.

I'll try to improve the CI so that it does print the logs from the cluster. In the meantime, I'll just re-run and hope that the error was an unimportant fluke.

@wprzytula (Collaborator) left a comment


Very good overall, but some old naming is still left.

Review threads:
  • scylla/src/transport/iterator.rs (×2)
  • scylla/src/transport/topology.rs (outdated)
  • scylla/src/transport/connection.rs (×2, outdated)
@piodul (Collaborator, Author) commented Feb 20, 2023

v2:

  • Changed the buffer_unordered parameter from 64 -> 256 to allow more concurrency when translating addresses,
  • Updated some leftover comments/function names/string literals not to mention query_all but rather query_iter.

@piodul piodul requested a review from wprzytula February 20, 2023 15:38
@piodul (Collaborator, Author) commented Feb 20, 2023

@cvybhu ping

@cvybhu (Contributor) left a comment


Code looks ok, left a few nits.

Given that, I'm not a fan of this change. The streams are very fancy, but IMO they are much harder to read and reason about. Previously there were a few clearly defined steps where one thing happened at a time and I could easily follow it step after step. With streams it feels different - it's like first we prepare what to do and then do everything at once.

With normal code there are usually a few intermediate variables with clearly defined types, but here all the chained stream combinators make it impossible to be sure about the type of things at each step. The error handling is also much more complicated: all the errors must be propagated in the stream, and then all the try_something functions hopefully handle them, but I find it very hard to reason about. Then there are also the complications with .buffer_unordered(256).

IMO it would be better to keep using the query_all function. It could be implemented using query_iter, although I guess it doesn't make much sense.
I'm a big fan of KISS.

I might be complaining too much, maybe streams are just a Rust thing that I'm not used to yet, but my initial feeling is that they introduce new complexity that could easily be avoided.

In the PR description you mentioned that it will be hard to keep query_all after the deserialization refactor, but why is that? Will it be impossible to simply collect the rows to a Vec?

Review threads (outdated):
  • scylla/src/transport/iterator.rs (×2)
Comment on lines +718 to +720
/// A massively simplified version of the RowIteratorWorker. It does not have
/// any complicated logic related to retries, it just fetches pages from
/// a single connection.
@cvybhu (Contributor) commented:

It would be nice to use a RetryPolicy for retries. I think it wouldn't be too hard to incorporate it.

@piodul (Collaborator, Author) replied:

I'm not sure whether it makes sense for queries that are limited to a single connection and only read data from a single node. For example: what should happen on RetryNextNode?

@piodul (Collaborator, Author) commented Feb 27, 2023

> Code looks ok, left a few nits.
>
> Given that, I'm not a fan of this change. The streams are very fancy, but IMO they are much harder to read and reason about. Previously there were a few clearly defined steps where one thing happened at a time and I could easily follow it step after step. With streams it feels different - it's like first we prepare what to do and then do everything at once.

I assume that your comment is about the changes in query_peers from topology.rs.

I'm not sure if I agree. The biggest advantage of streams here is that they reduce the peak memory use. Instead of loading all results from a query into memory, we only need to keep 1-2 pages in memory at a given time. I'll admit that the benefit is not huge because we are constructing the metadata anyway, but it's not only about making the code look more "fancy".

> With normal code there are usually a few intermediate variables with clearly defined types, but here all the chained stream combinators make it impossible to be sure about the type of things at each step. The error handling is also much more complicated, all the errors must be propagated in the stream, and then all the try_something functions hopefully handle them, but I find it very hard to reason about. Then there are also the complications with .buffer_unordered(256)

We are already using iterators extensively in the codebase, so I thought that streams would be familiar. They have a slightly different API due to their asynchronicity, but the concept is the same.

I'm not sure what you mean by the complications with .buffer_unordered(). Is it about the need to explicitly limit parallelism? Limiting parallelism is not a bad thing.

> IMO it would be better to keep using the query_all function. It could be implemented using query_iter, although I guess it doesn't make much sense. I'm a big fan of KISS.

With query_iter, you can choose either to collect everything to a Vec and then process, or do the processing on the fly, as pages are fetched.

> I might be complaining too much, maybe streams are just a Rust thing that I'm not used to yet, but my initial feeling is that they introduce new complexity that could easily be avoided.
>
> In the PR description you mentioned that it will be hard to keep query_all after the deserialization refactor, but why is that? Will it be impossible to simply collect the rows to a Vec?

I guess it will be possible to change query_all to just return a Vec of parsed rows, not a full QueryResponse. I just felt that solving the problem that way would prevent us from utilizing the new interface to its full potential and would deny us the chance to test its ergonomics on our internal code.

In general, the new deserialization interface will not return pre-parsed rows in the style of Vec<Vec<Option<CqlValue>>>, but will return an iterator that lazily deserializes the contents of a response. If we wanted query_all to keep returning QueryResponse, we would have to concatenate the contents of the pages into a large blob, which is even worse with regards to memory use.

Adds the RowIterator::new_for_connection_query_iter method which will
be the main workhorse of the Connection::query_iter method. It uses
a new iterator worker which is a much simpler version of the proper,
multi-connection iterator worker.
The new Connection::query_iter method will serve as a replacement for
the query_all method. Contrary to query_all, query_iter will not
materialize all query results as a single QueryResult but will allow
processing the results row by row.
In order to use Connection::query_iter it is necessary to pass self
wrapped in an Arc - the iterator worker is spawned as a separate tokio
task, so it needs to co-own the Connection.
Thanks to scylladb#628, it is now possible to provide an explicit path to a
crate which provides necessary items for the FromRow macro. It is now
possible to use this macro from the `scylla` crate itself, as we just
need to point it to the `scylla-cql` crate instead.

The NodeInfoRow struct is introduced to represent rows fetched from
system.local and system.peers. It looks nicer than using a 5-element
tuple and will reduce some noise when performing refactors in the
commits that follow.
An enum expresses intent a little more clearly than a bool.
Conversion from NodeInfoRow to Peer is done across two lambdas applied
to the untranslated_rows iterator. Let's simplify it a bit and extract
this logic to a new separate function create_peer_from_row.

It is recommended to hide whitespace changes when reviewing the diff of
this commit.
Previously, query_peers used query_all to fetch topology information and
then processed it using iterators. Now, this method is changed to use
query_iter and rows are processed on the fly using asynchronous streams.

In order to preserve the behavior of system.peers and system.local being
queried in parallel, stream::select is used which merges results of both
streams as they become available. Previously, we just put results of one
query after results of the other one. Because of this, the order of
Peers in the final result might be slightly different on each fetch -
however, this order was never specified and could actually change e.g.
when the control connection changes, so I think this should be OK.
There are other functions in topology.rs that use query_all, but they
are less complicated and changing them to use query_iter is relatively
straightforward. Therefore, we do it in a single commit.
This method is a sibling of query_all that was never actually used. In
the future we will consider switching to prepared statements for
internal queries, but 1) this will be done via not-yet-existing
execute_iter and 2) we might do it by converting query_iter to use
prepared statements. Therefore, there is no need to keep the dead code.

This commit gets rid of execute_all and all its occurrences in the
connection_query_all_execute_all_test (now renamed to:
connection_query_all_test).
The query_iter method is meant to serve a similar purpose to what
query_all does currently - fetching data on the control connection for
internal queries - so it makes sense to adapt the (only) existing test
for the new method.

The last assertion (4) had to be removed because the new method does not
check that the page size is set by the caller.
Now that all uses of query_all in the code are gone, we can finally
remove it, along with all the other code that depended on it - most
notably QueryResult::merge_with_next_page_res, which won't be possible
to translate to the upcoming iterator-based deserialization interface.
@piodul (Collaborator, Author) commented Feb 27, 2023

v3:

  • renamed new_for_query_iter -> new_for_connection_query_iter
  • got rid of an unnecessary .clone() in a lambda defined in RowIterator::new_for_connection_query_iter

@piodul piodul requested a review from cvybhu February 28, 2023 10:51
@piodul (Collaborator, Author) commented Mar 2, 2023

@cvybhu review ping

@cvybhu cvybhu merged commit 739c4f8 into scylladb:main Mar 2, 2023