feat: concurrent batch scan #12825

Merged: 7 commits from sts/concurrent_scan into main on Oct 20, 2023

Conversation

@st1page (Contributor) commented Oct 12, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#12783

It will still not be concurrent when:

  1. the pk prefix determines a specific distribution key and the scan range falls within a single vnode, or
  2. the optimizer generates a plan that depends on the scan's order property.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features: Sqlsmith SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@st1page requested a review from BugenZhao on October 13, 2023 02:37
(start, end)
}))
} else {
// let raw_key_ranges = if !ordered
Member:

👀 Please either remove this code or add an option to enable/disable it.

@fuyufjh (Member) left a comment:

LGTM, but it needs to be proven to work with a benchmark.

codecov bot commented Oct 13, 2023

Codecov Report

Merging #12825 (38f0304) into main (527a276) will increase coverage by 0.01%.
The diff coverage is 40.00%.

@@            Coverage Diff             @@
##             main   #12825      +/-   ##
==========================================
+ Coverage   69.17%   69.19%   +0.01%     
==========================================
  Files        1489     1489              
  Lines      246069   246061       -8     
==========================================
+ Hits       170230   170257      +27     
+ Misses      75839    75804      -35     
Flag Coverage Δ
rust 69.19% <40.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
src/storage/src/table/batch_table/storage_table.rs 85.87% <40.00%> (-0.72%) ⬇️

... and 11 files with indirect coverage changes

@liurenjie1024 (Contributor) left a comment:

Generally LGTM; I think we should make the limit larger.

_ if !ordered => futures::stream::iter(iterators).flatten(),
_ if !ordered => {
futures::stream::iter(iterators.into_iter().map(Box::pin).collect_vec())
.flatten_unordered(10)
Contributor:

I think 10 may be too small; maybe 1024 would be better?
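
For context, a minimal self-contained sketch (not the PR's actual code; the toy sub-streams below stand in for the per-vnode row iterators) of how flatten and flatten_unordered(limit) from the futures crate differ: the former drains the sub-streams one after another, while the latter keeps up to limit sub-streams in flight and yields rows as soon as any of them is ready.

use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    block_on(async {
        // Hypothetical stand-ins for per-vnode row streams: three sub-streams,
        // each yielding (vnode, row_index) pairs.
        let make_substreams =
            || (0..3u32).map(|vnode| stream::iter((0..4u32).map(move |i| (vnode, i))));

        // `flatten` polls the sub-streams sequentially, preserving their order.
        let ordered: Vec<_> = stream::iter(make_substreams()).flatten().collect().await;

        // `flatten_unordered(limit)` polls up to `limit` sub-streams concurrently
        // and yields items as soon as any sub-stream is ready, so the relative
        // order across sub-streams is not preserved. The limit bounds how many
        // underlying scans are in flight at once (the 10 vs. 1024 discussed here).
        let unordered: Vec<_> =
            stream::iter(make_substreams().map(Box::pin).collect::<Vec<_>>())
                .flatten_unordered(10)
                .collect()
                .await;

        assert_eq!(ordered.len(), unordered.len());
    });
}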

@BugenZhao (Member) left a comment:

I'm not sure whether it is good practice for the upper-level application to be responsible for improving iteration performance by manually scheduling it concurrently. Does any other system do this?

If so, we may have to handle the ordered case as well; otherwise it's hard to tell whether sorting with a separate executor can be faster.

@@ -493,7 +473,10 @@ impl<S: StateStore, SD: ValueRowSerde> StorageTableInner<S, SD> {
0 => unreachable!(),
1 => iterators.into_iter().next().unwrap(),
// Concat all iterators if not to preserve order.
_ if !ordered => futures::stream::iter(iterators).flatten(),
_ if !ordered => {
futures::stream::iter(iterators.into_iter().map(Box::pin).collect_vec())
Member:

I'm wondering if we can pin it after buffering.
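
As background for the pinning question, a hedged sketch (a simplified, hypothetical helper, not RisingWave code) of why the sub-streams are boxed and pinned before flatten_unordered: the combinator requires its inner streams to be Unpin, and Pin<Box<_>> always is.

use futures::stream::{self, Stream, StreamExt};

// Hypothetical helper: merge a vector of streams, polling up to 10 of them
// concurrently.
fn merge_unordered<S: Stream>(substreams: Vec<S>) -> impl Stream<Item = S::Item> {
    stream::iter(
        substreams
            .into_iter()
            // `Box::pin` turns each `S` into `Pin<Box<S>>`, which is always
            // `Unpin`, satisfying `flatten_unordered`'s `Stream + Unpin` bound
            // even when `S` itself (e.g. an async-generator stream) is not.
            .map(Box::pin)
            .collect::<Vec<_>>(),
    )
    .flatten_unordered(10)
}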

Comment on lines -421 to -426
let raw_key_ranges = if !ordered
&& matches!(encoded_key_range.start_bound(), Unbounded)
&& matches!(encoded_key_range.end_bound(), Unbounded)
{
// If the range is unbounded and order is not required, we can create a single iterator
// for each continuous vnode range.
Member:

I guess removing this can significantly hurt the performance for a full scan on a small table (like Top-10 or simple aggregation).
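
For readers unfamiliar with the removed fast path, a small self-contained sketch (a hypothetical helper, not the actual RisingWave implementation) of the idea: when the key range is unbounded and order does not matter, the vnodes owned by a node can be merged into a few continuous ranges, so a full scan of a small table needs far fewer iterators than one per vnode.

// Merge a list of owned vnodes into continuous (start, end) ranges.
fn continuous_vnode_ranges(mut vnodes: Vec<u32>) -> Vec<(u32, u32)> {
    vnodes.sort_unstable();
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for v in vnodes {
        match ranges.last_mut() {
            // Extend the current range if this vnode directly follows its end.
            Some((_, end)) if *end + 1 == v => *end = v,
            // Otherwise start a new continuous range.
            _ => ranges.push((v, v)),
        }
    }
    ranges
}

fn main() {
    // Vnodes 0..=3 and 7..=8 collapse into two continuous ranges, i.e. two
    // iterators instead of six for an unbounded, unordered scan.
    assert_eq!(
        continuous_vnode_ranges(vec![0, 1, 2, 3, 7, 8]),
        vec![(0, 3), (7, 8)]
    );
}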

Contributor Author:

🤔 I will add a session variable to control it, but I am not sure what the default value should be.

Contributor Author:

Or can we get the table's cardinality statistics here or in the optimizer? cc @hzxa21 @wcy-fdu

Collaborator:

We maintain the physical key count in meta. Currently the CN doesn't have a way to get the key count stats, but we can add one if needed.

@st1page (Contributor Author) commented Oct 16, 2023

> I'm not sure whether it is good practice for the upper-level application to be responsible for improving iteration performance by manually scheduling it concurrently. Does any other system do this?

After an offline discussion with @liurenjie1024: in OLAP systems, they do concurrent scans more aggressively, using the file as the unit of division.

@lmatz (Contributor) commented Oct 17, 2023

The image we use is: concurrent-batch-scan

CREATE TABLE s (
  dummy int, -- to control the number of rows generated in total, 1M rows in this case
  i int,
  c varchar,
  t1 timestamp,
  t2 timestamp
) with (
  connector = 'datagen',
  fields.dummy.kind = 'sequence',
  fields.dummy.start = '1',
  fields.dummy.end = '1000000',
  fields.i.kind = 'random',
  fields.i.min = 1,
  fields.i.max = 10,
  fields.i.seed = 1,
  fields.c.kind = 'random',
  fields.c.length = 1,
  fields.c.seed = 1,
  datagen.rows.per.second = '100000',
  datagen.split.num = '4'
);

-- When data is generated in memory, everything in cache

dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 949.184 ms
dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 941.612 ms
dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 1080.106 ms (00:01.080)

-- We restarted the cluster on purpose to clear the cache

dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 2964.502 ms (00:02.965)

If I do the same with image v1.2.0, i.e. the one your cluster was running when we were on the call:

-- When the cache is empty.

dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 6516.271 ms (00:06.516)

-- When everything is in cache:
dev=> select i, c, max(t1), max(t2) from s group by i, c;
Time: 961.800 ms

The time is measured by turning on \timing in psql. It includes the time spent on network communication between my laptop and the RW cluster at US-east-1.
We measure the network latency by

dev=> select 1;
 ?column? 
----------
        1
(1 row)

Time: 255.276 ms

So the optimization improves the latency from 6.5s - 0.25s ≈ 6.2s to 2.9s - 0.25s ≈ 2.7s, a reduction of about 56%.

@lmatz (Contributor) commented Oct 17, 2023

Regarding the concern about a performance regression on small tables, we use the image with the concurrency changed to 1024:

dev=> select i, c, max(t1), max(t2) from s group by i, c;                                                                  
Time: 1824.229 ms (00:01.824)

dev=> select * from s2;
(20 rows)

Time: 244.970 ms

s2 is a table of 20 rows.

@BugenZhao (Member):

> performance regression on small table

I guess QPS might be a better indicator, as the network latency is wavy, and we're actually utilizing more CPU but no more I/O, considering the single-flight optimization in the block cache.

@BugenZhao (Member):

> In OLAP systems, they do concurrent scans more aggressively, using the file as the unit of division.

Then does more aggressive block pre-fetch help in this case?

@lmatz (Contributor) commented Oct 17, 2023

> I guess QPS might be a better indicator, as the network latency is wavy

Sure, how was it tested last time, when it was changed to the current behavior? Can I reuse the testbed to compare?

Or we can wait for the sysbench results.

@st1page (Contributor Author) commented Oct 17, 2023

> > In OLAP systems, they do concurrent scans more aggressively, using the file as the unit of division.
>
> Then does more aggressive block pre-fetch help in this case?

Yes, we can pre-fetch the blocks concurrently, but it will wait to return until all pre-fetch IO is finished.
I think the question here is how deep we can push down the information that "order does not matter". Only when the state store provides an unordered range scan interface can it read multiple blocks concurrently and return as soon as any of them is finished.
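
A purely hypothetical sketch of what such an "order does not matter" scan interface on the state store could look like (none of these names exist in RisingWave; the error type is a placeholder):

use futures::stream::BoxStream;
use std::ops::RangeBounds;

type Key = Vec<u8>;
type Value = Vec<u8>;
type StorageResult<T> = Result<T, std::io::Error>; // placeholder error type

trait UnorderedRangeScan {
    /// Yields key-value pairs within `range` in no particular order, so the
    /// implementation is free to fetch multiple blocks concurrently and emit
    /// whichever block's rows become available first.
    fn iter_unordered<R: RangeBounds<Key> + Send + 'static>(
        &self,
        range: R,
    ) -> BoxStream<'_, StorageResult<(Key, Value)>>;
}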

@st1page (Contributor Author) commented Oct 17, 2023

> I guess removing this can significantly hurt the performance for a full scan on a small table (like Top-10 or simple aggregation).

That looks like a very special case; the query must:

  1. be a serving query requiring high QPS,
  2. scan the whole table without any range predicate on the pk column, and
  3. require no order on the scan.

@BugenZhao (Member):

> But it will wait to return until all pre-fetch IO is finished.

I'm unsure about the implementation. However, in my mind, isn't pre-fetching a background task? Why do we have to make it synchronous? 👀

@BugenZhao (Member):

> Sure, how was it tested last time, when it was changed to the current behavior?

I apologize, but if I remember correctly, I only conducted basic tests on my local machine, because those changes had to be an improvement given that there were no concurrent iterations at the time.

@lmatz (Contributor) commented Oct 17, 2023

It's fine; then we'll use sysbench or manually construct a test case, and I will add this setting as a weekly test.

@BugenZhao (Member):

> That looks like a very special case; the query must ...

You're probably correct. Also the table must be a distributed one.

@st1page (Contributor Author) commented Oct 17, 2023

> > But it will wait to return until all pre-fetch IO is finished.
>
> I'm unsure about the implementation. However, in my mind, isn't pre-fetching a background task? Why do we have to make it synchronous? 👀

So we have two methods here:

  1. divide the scan range into pieces and make the Scan concurrent in the executor, or
  2. make the Scan's performance depend entirely on the block cache, and optimize the cache's prefetch logic with concurrent IO.

I am not sure which one is better... @hzxa21, any idea?

@hzxa21 (Collaborator) commented Oct 17, 2023

> So we have two methods here:
>
>   1. divide the scan range into pieces and make the Scan concurrent in the executor, or
>   2. make the Scan's performance depend entirely on the block cache, and optimize the cache's prefetch logic with concurrent IO.
>
> I am not sure which one is better... @hzxa21, any idea?

Correct me if I am wrong, but I think these are two non-conflicting optimizations. Although prefetch is asynchronous, it is better driven by the caller's poll. For example, it is unrealistic to prefetch all blocks required by a per-vnode iter if iter.next is rarely called. Also, prefetch for a full table scan query can pollute the block cache, so the amount of data to be prefetched should be bounded as well. That being said, I agree we can provide more information to storage to do more aggressive prefetch instead of prefetching only one block ahead, but I think 1 is still needed.
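
To illustrate the shape of "prefetch driven by the caller's poll, with a bound", a minimal sketch; fetch_block and the surrounding types are hypothetical placeholders, not RisingWave's block cache API:

use futures::executor::block_on;
use futures::stream::{self, Stream, StreamExt};

// Placeholder for an actual block-cache / object-store read.
async fn fetch_block(block_id: u64) -> Vec<u8> {
    vec![block_id as u8]
}

// Turn a sequence of block ids into a stream of block contents. `buffered(n)`
// keeps at most `n` block reads in flight, results come back in order, and no
// fetch is started unless the consumer keeps polling the stream.
fn prefetching_blocks(block_ids: Vec<u64>, prefetch_depth: usize) -> impl Stream<Item = Vec<u8>> {
    stream::iter(block_ids).map(fetch_block).buffered(prefetch_depth)
}

fn main() {
    let blocks: Vec<Vec<u8>> = block_on(prefetching_blocks(vec![1, 2, 3], 2).collect());
    assert_eq!(blocks.len(), 3);
}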

@liurenjie1024 (Contributor):

> So we have two methods here:
>
>   1. divide the scan range into pieces and make the Scan concurrent in the executor, or
>   2. make the Scan's performance depend entirely on the block cache, and optimize the cache's prefetch logic with concurrent IO.
>
> I am not sure which one is better... @hzxa21, any idea?

I don't think the scan executor should think about the block cache or IO. It should delegate the request to the storage layer, which is supposed to have an IO manager to schedule IO requests.

@hzxa21 (Collaborator) commented Oct 17, 2023

> I don't think the scan executor should think about the block cache or IO. It should delegate the request to the storage layer, which is supposed to have an IO manager to schedule IO requests.

The scan executor shouldn't worry about that. IMO, under the current implementation, the storage table should be responsible for concurrently polling the iterators.

@lmatz (Contributor) left a comment:

More sysbench results have been posted in the Slack channel; the numbers LGTM.

@st1page added this pull request to the merge queue Oct 20, 2023
Merged via the queue into main with commit 37d0a79, Oct 20, 2023
@st1page deleted the sts/concurrent_scan branch on October 20, 2023 at 11:45