From 349bc79686af8e6bc6b268733ffa448a2f2066ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Wojciech=20Przytu=C5=82a?=
Date: Thu, 29 Aug 2024 13:07:04 +0200
Subject: [PATCH] docs: exhaustive overview of statements & best practices

In order to avoid API misuse, much knowledge is now shared in a structured way of tables, and best practices are described to aid users.
---
 docs/source/queries/paged.md   |  53 +++++++++++++----
 docs/source/queries/queries.md | 100 +++++++++++++++++++++++++--------
 2 files changed, 120 insertions(+), 33 deletions(-)

diff --git a/docs/source/queries/paged.md b/docs/source/queries/paged.md
index c38bcb4dfe..1c41f35ff4 100644
--- a/docs/source/queries/paged.md
+++ b/docs/source/queries/paged.md
@@ -2,9 +2,31 @@
 Sometimes query results might be so big that one prefers not to fetch them all at once,
 e.g. to reduce latency and/or memory footprint.
 Paged queries allow to receive the whole result page by page, with a configurable page size.
+In fact, most SELECT queries should be paged, to avoid heavy load on the cluster and a large memory footprint.
 
-`Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
-or a [prepared query](prepared.md) and return an `async` iterator over result `Rows`.
+> ***Warning***\
+> Issuing unpaged SELECTs (`Session::query_unpaged` or `Session::execute_unpaged`)
+> may have dramatic performance consequences! **BEWARE!**\
+> If the result set is big (or, e.g., there are a lot of tombstones), the following can happen:
+> - the cluster may experience high load,
+> - queries may time out,
+> - the driver may devour a lot of RAM,
+> - latency will likely spike.
+>
+> Stay safe. Page your SELECTs.
+
+## `RowIterator`
+
+The automated way to achieve that is `RowIterator`. It always fetches and enables access to one page,
+while prefetching the next one. This limits latency and is a convenient abstraction.
+
+> ***Note***\
+> `RowIterator` is quite heavy machinery, introducing considerable overhead. Therefore,
+> don't use it for statements that do not benefit from paging. In particular, avoid using it
+> for non-SELECTs.
+
+At the API level, `Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
+or a [prepared query](prepared.md), respectively, and return an `async` iterator over result `Rows`.
 
 > ***Warning***\
 > In case of unprepared variant (`Session::query_iter`) if the values are not empty
@@ -22,7 +44,6 @@ Use `query_iter` to perform a [simple query](simple.md) with paging:
 # use scylla::Session;
 # use std::error::Error;
 # async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
-use scylla::IntoTypedRows;
 use futures::stream::StreamExt;
 
 let mut rows_stream = session
@@ -45,7 +66,6 @@ Use `execute_iter` to perform a [prepared query](prepared.md) with paging:
 # use scylla::Session;
 # use std::error::Error;
 # async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
-use scylla::IntoTypedRows;
 use scylla::prepared_statement::PreparedStatement;
 use futures::stream::StreamExt;
 
@@ -106,10 +126,10 @@ let _ = session.execute_iter(prepared, &[]).await?;
 // ...
 # }
 ```
-### Passing the paging state manually
-It's possible to fetch a single page from the table, extract the paging state
-from the result and manually pass it to the next query. That way, the next
-query will start fetching the results from where the previous one left off.
+## Manual paging
+It's possible to fetch a single page from the table and manually pass the paging state
+to the next query. That way, the next query will start fetching the results
+from where the previous one left off.
 
 On a `Query`:
 ```rust
@@ -197,5 +217,18 @@ loop {
 ```
 
 ### Performance
-Performance is the same as in non-paged variants.\
-For the best performance use [prepared queries](prepared.md).
\ No newline at end of file
+For the best performance use [prepared queries](prepared.md).
+See [query types overview](queries.md).
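Editorial sketch (not part of the patch): the manual paging loop described in this hunk, i.e. fetch one page, carry the paging state into the next request, and stop when no further state is returned, can be illustrated without the driver. Everything below is hypothetical: `MockTable`, `PagingState` and `fetch_single_page` are stand-ins invented for illustration and are not scylla APIs. The real calls are `Session::{query,execute}_single_page`, whose exact signatures are not reproduced here.

```rust
/// Hypothetical stand-in for the opaque paging state returned by the server.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PagingState {
    Start,
    Offset(usize),
}

/// Hypothetical in-memory "table" playing the role of the cluster.
/// Assumption: `page_size` is greater than zero.
struct MockTable {
    rows: Vec<i32>,
    page_size: usize,
}

impl MockTable {
    /// Returns one page of rows plus the state to resume from,
    /// or `None` when this was the last page.
    fn fetch_single_page(&self, state: PagingState) -> (Vec<i32>, Option<PagingState>) {
        let start = match state {
            PagingState::Start => 0,
            PagingState::Offset(n) => n,
        };
        let end = (start + self.page_size).min(self.rows.len());
        let page = self.rows[start..end].to_vec();
        let next = if end < self.rows.len() {
            Some(PagingState::Offset(end))
        } else {
            None
        };
        (page, next)
    }
}

/// Manual paging loop: only one page is held in memory at a time.
/// Returns (sum of all rows, number of pages fetched).
fn sum_all_rows(table: &MockTable) -> (i64, usize) {
    let mut state = PagingState::Start;
    let mut sum = 0i64;
    let mut pages = 0usize;
    loop {
        let (page, next) = table.fetch_single_page(state);
        pages += 1;
        sum += page.iter().map(|&v| i64::from(v)).sum::<i64>();
        match next {
            // More data is available: resume from the returned state.
            Some(s) => state = s,
            // No paging state returned: the result set is exhausted.
            None => break,
        }
    }
    (sum, pages)
}

fn main() {
    let table = MockTable { rows: (1..=10).collect(), page_size: 4 };
    let (sum, pages) = sum_all_rows(&table);
    println!("sum = {sum}, pages = {pages}");
}
```

The loop keeps a single page alive per iteration, which is the memory property the docs attribute to manual paging. The cost is that each page boundary waits for a fresh fetch, since nothing is prefetched.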
+
+## Best practices
+
+| Query result fetching | Unpaged | Paged manually | Paged automatically |
+|-------------------------|-------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| Exposed Session API | `{query,execute}_unpaged` | `{query,execute}_single_page` | `{query,execute}_iter` |
+| Working | get all results in a single CQL frame, into a single Rust struct | get one page of results in a single CQL frame, into a single Rust struct | upon high-level iteration, fetch consecutive CQL frames and transparently iterate over their rows |
+| Cluster load | potentially **HIGH** for large results, beware! | normal | normal |
+| Driver overhead | low - simple frame fetch | low - simple frame fetch | considerable - `RowIteratorWorker` is a separate tokio task |
+| Feature limitations | none | none | speculative execution not supported |
+| Driver memory footprint | potentially **BIG** - all results have to be stored at once! | small - only one page stored at a time | small - at most a constant number of pages stored at a time |
+| Latency | potentially **BIG** - all results have to be generated at once! | considerable on page boundary - new page needs to be fetched | small - next page is always pre-fetched in background |
+| Suitable operations | - in general: operations with empty result set (non-SELECTs)<br/> - as a possible optimisation: SELECTs with LIMIT clause | - for advanced users who prefer more control over paging, with less overhead of `RowIteratorWorker` | - in general: all SELECTs |
\ No newline at end of file
diff --git a/docs/source/queries/queries.md b/docs/source/queries/queries.md
index 2bf2436cb6..e444abe085 100644
--- a/docs/source/queries/queries.md
+++ b/docs/source/queries/queries.md
@@ -1,26 +1,80 @@
-# Making queries
-
-This driver supports all query types available in Scylla:
-* [Simple queries](simple.md)
-    * Easy to use
-    * Poor performance
-    * Primitive load balancing
-* [Prepared queries](prepared.md)
-    * Need to be prepared before use
-    * Fast
-    * Properly load balanced
-* [Batch statements](batch.md)
-    * Run multiple queries at once
-    * Can be prepared for better performance and load balancing
-* [Paged queries](paged.md)
-    * Allows to read result in multiple pages when it might be so big that one
-      prefers not to fetch it all at once
-    * Can be prepared for better performance and load balancing
-
-Additionally there is special functionality to enable `USE KEYSPACE` queries:
-[USE keyspace](usekeyspace.md)
-
-Queries are fully asynchronous - you can run as many of them in parallel as you wish.
+# Making queries - best practices
+
+The driver supports all kinds of statements supported by ScyllaDB. The following tables aim to bridge DB concepts and the driver's API.
+They include recommendations on which API to use in which cases.
+
+## Kinds of CQL statements (from the CQL protocol point of view):
+
+| Kind of CQL statement | Single | Batch |
+|-----------------------|---------------------|------------------------------------------|
+| Prepared | `PreparedStatement` | `Batch` filled with `PreparedStatement`s |
+| Unprepared | `Query` | `Batch` filled with `Query`s |
+
+This is **NOT** strictly related to the content of the CQL query string.
+
+> ***Interesting note***\
+> In fact, any kind of CQL statement could contain any CQL query string.
+> Yet, some such combinations don't make sense and will be rejected by the DB.
+> For example, SELECTs in a Batch are nonsense.
+
+### [Unprepared](simple.md) vs [Prepared](prepared.md)
+
+> ***GOOD TO KNOW***\
+> Each time a statement is executed by sending a query string to the DB, it needs to be parsed. The driver does not parse CQL, so it treats query strings as opaque.\
+> There is an option to *prepare* a statement, i.e. have the DB parse it once and associate it with an ID. After preparation, the driver only needs to send the ID
+> and the DB already knows what operation to perform - no more expensive parsing necessary! Moreover, upon preparation the driver receives valuable data for load balancing,
+> enabling advanced load balancing (so better performance!) of all further executions of that prepared statement.\
+> ***Key takeaway:*** always prepare statements that you are going to execute multiple times.
+
+| Statement comparison | Unprepared | Prepared |
+|----------------------|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
+| Exposed Session API | `query_*` | `execute_*` |
+| Usability | execute CQL statement string directly | needs to be prepared separately before use; re-prepared in the background if the statement falls out of the server cache |
+| Performance | poor (statement parsed each time) | good (statement parsed only upon preparation) |
+| Load balancing | primitive (random choice of a node/shard) | advanced (proper node/shard, optimisations for LWT statements) |
+| Suitable operations | one-shot operations | repeated operations |
+
+### Single vs [Batch](batch.md)
+
+| Statement comparison | Single | Batch |
+|----------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Exposed Session API | `query_*`, `execute_*` | `batch` |
+| Usability | simple setup | need to aggregate statements; binding values to each of them is more cumbersome |
+| Performance | good (DB is optimised for handling single statements) | good for small batches, may be worse for larger ones (also: higher risk of request timeout due to the big portion of work) |
+| Load balancing | advanced if prepared, else primitive | advanced if prepared **and ALL** statements in the batch target the same partition, else primitive |
+| Suitable operations | most operations | - a list of operations that needs to be executed atomically (batch LightWeight Transaction)<br/> - a batch of operations targeting the same partition (as an advanced optimisation) |
+
+## CQL statements - operations (based on what the CQL string contains):
+
+| CQL data manipulation statement | Recommended statement kind | Recommended Session operation |
+|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
+| SELECT | `PreparedStatement` if repeated, `Query` if once | `{query,execute}_iter` (or `{query,execute}_single_page` in a manual loop for performance / more control) |
+| INSERT, UPDATE | `PreparedStatement` if repeated, `Query` if once, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction) | `{query,execute}_unpaged` (paging is irrelevant, because the result set of such operation is empty) |
+| CREATE/DROP {KEYSPACE, TABLE, TYPE, INDEX,...} | `Query`, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction) | `query_unpaged` (paging is irrelevant, because the result set of such operation is empty) |
+
+### [Paged](paged.md) vs Unpaged query
+
+> ***GOOD TO KNOW***\
+> SELECT statements return a [result set](result.md), possibly a large one. Therefore, paging is available to fetch it in chunks, relieving load on the cluster and lowering latency.\
+> ***Key takeaways:***\
+> For SELECTs, you should **avoid unpaged queries**.\
+> For non-SELECTs, the unpaged API is preferred.
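Editorial sketch (not part of the patch): the recommendations in the operations table and the note above can be condensed into a small lookup. The `Operation` enum and the `recommended_call` function are hypothetical helpers, not part of the driver. The returned strings only name the `Session` method families mentioned in this document, and batches are left out for brevity.

```rust
/// Hypothetical classification of a CQL statement by what its string contains.
#[derive(Clone, Copy, Debug)]
enum Operation {
    Select,
    InsertOrUpdate,
    /// CREATE/DROP {KEYSPACE, TABLE, TYPE, INDEX, ...}
    SchemaChange,
}

/// Returns the name of the `Session` method the tables recommend.
/// `repeated` means the statement will be executed more than once,
/// in which case preparing it (the `execute_*` family) pays off.
fn recommended_call(op: Operation, repeated: bool) -> &'static str {
    match (op, repeated) {
        // SELECTs should be paged; the iterator API is the general choice.
        (Operation::Select, true) => "execute_iter",
        (Operation::Select, false) => "query_iter",
        // Non-SELECTs return an empty result set, so paging is irrelevant.
        (Operation::InsertOrUpdate, true) => "execute_unpaged",
        (Operation::InsertOrUpdate, false) => "query_unpaged",
        // The table lists only the unprepared variant for schema changes.
        (Operation::SchemaChange, _) => "query_unpaged",
    }
}

fn main() {
    for (op, repeated) in [
        (Operation::Select, true),
        (Operation::InsertOrUpdate, false),
        (Operation::SchemaChange, false),
    ] {
        println!(
            "{op:?} (repeated: {repeated}) -> Session::{}",
            recommended_call(op, repeated)
        );
    }
}
```

The match is exhaustive over both inputs, so adding a new `Operation` variant forces a decision about its recommended call at compile time.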
+
+| Query result fetching | Unpaged | Paged |
+|-----------------------|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Exposed Session API | `{query,execute}_unpaged` | `{query,execute}_single_page`, `{query,execute}_iter` |
+| Usability | get all results in a single CQL frame, so into a [single Rust struct](result.md) | need to fetch multiple CQL frames and iterate over them - using driver's abstractions (`{query,execute}_iter`) or manually (`{query,execute}_single_page` in a loop) |
+| Performance | - for large results, puts **high load on the cluster**<br/> - for small results, the same as paged | - for large results, relieves the cluster<br/> - for small results, the same as unpaged |
+| Memory footprint | potentially big - all results have to be stored at once | small - at most a constant number of pages is stored by the driver at the same time |
+| Latency | potentially big - all results have to be generated at once | small - at most one chunk of data must be generated at once, so latency of each chunk is small |
+| Suitable operations | - in general: operations with empty result set (non-SELECTs)<br/> - as a possible optimisation: SELECTs with LIMIT clause | - in general: all SELECTs |
+
+For a more detailed comparison and more best practices, see [doc page about paging](paged.md).
+
+### Queries are fully asynchronous - you can run as many of them in parallel as you wish.
+
+## `USE KEYSPACE`:
+There is special functionality to enable [USE keyspace](usekeyspace.md).
 
 ```{eval-rst}
 .. toctree::