feat(storage): enable preload for log store and dynamic filter #13558

Merged
31 commits merged into main from wallace/more-preload on Dec 11, 2023

Conversation

@Little-Wallace (Contributor) commented Nov 21, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Enable preload for these executors:

  • cmd scan
  • log store reader
  • streaming dynamic filter
  • streaming source reader
  • batch query

This PR refactors prefetch and changes all read operations used for streaming executor prefetch to plain read. In the S3 SDK, streaming_read shares the same interface with read, so a remote object storage server only sees a read request that stays open for a long time without finishing. We cannot guarantee that the object storage server will not kill such long-lived read requests, so it is dangerous to hold a long read open in cases where it cannot finish quickly.

We also remove streaming read in this PR, because we found it can block the query future in some cases. We may re-enable it once we investigate the root cause. This means batch queries will use the same prefetch method as streaming executors.
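
For illustration only, below is a minimal sketch of the read-based prefetch direction described above. The names (ObjectReader, PrefetchedBlocks) and the fixed block size are made up for this example and are not RisingWave's actual API; the point is that prefetch issues one bounded read for a range of blocks and then serves them from memory, instead of keeping a streaming read open against the object store.

use std::collections::VecDeque;

/// Hypothetical object-store handle; `read` fetches one byte range per request.
trait ObjectReader {
    fn read(&self, start: usize, len: usize) -> Vec<u8>;
}

/// Blocks prefetched with a single bounded read, then served from memory.
struct PrefetchedBlocks {
    blocks: VecDeque<Vec<u8>>,
}

impl PrefetchedBlocks {
    /// Fetch `count` fixed-size blocks starting at `first_block` in one read request.
    fn fetch(reader: &dyn ObjectReader, first_block: usize, count: usize, block_size: usize) -> Self {
        let data = reader.read(first_block * block_size, count * block_size);
        let blocks = data.chunks(block_size).map(|c| c.to_vec()).collect();
        Self { blocks }
    }

    /// Pop the next prefetched block, if any.
    fn next_block(&mut self) -> Option<Vec<u8>> {
        self.blocks.pop_front()
    }
}

fn main() {
    // In-memory stand-in for an object store holding a 1 MiB "SST".
    struct MemStore(Vec<u8>);
    impl ObjectReader for MemStore {
        fn read(&self, start: usize, len: usize) -> Vec<u8> {
            self.0[start..start + len].to_vec()
        }
    }

    let store = MemStore(vec![0u8; 1 << 20]);
    // One bounded read prefetches 4 blocks of 64 KiB each.
    let mut stream = PrefetchedBlocks::fetch(&store, 0, 4, 64 * 1024);
    while let Some(block) = stream.next_block() {
        println!("served block of {} bytes from the prefetch buffer", block.len());
    }
}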

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features. See Sqlsmith: SQL feature generation, #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 77 lines in your changes are missing coverage. Please review.

Comparison is base (98b0ebd) 68.07% compared to head (400e7db) 68.12%.
Report is 1 commit behind head on main.

Files Patch % Lines
src/storage/src/hummock/sstable_store.rs 67.42% 43 Missing ⚠️
...ge/src/hummock/sstable/forward_sstable_iterator.rs 83.05% 10 Missing ⚠️
src/storage/src/hummock/block_stream.rs 91.66% 8 Missing ⚠️
src/storage/src/hummock/utils.rs 81.25% 6 Missing ⚠️
src/storage/src/store.rs 73.68% 5 Missing ⚠️
src/stream/src/executor/source/fetch_executor.rs 0.00% 2 Missing ⚠️
src/storage/hummock_test/src/bin/replay/main.rs 0.00% 1 Missing ⚠️
src/storage/src/hummock/validator.rs 0.00% 1 Missing ⚠️
src/storage/src/store_impl.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13558      +/-   ##
==========================================
+ Coverage   68.07%   68.12%   +0.04%     
==========================================
  Files        1532     1533       +1     
  Lines      264568   264675     +107     
==========================================
+ Hits       180114   180310     +196     
+ Misses      84454    84365      -89     
Flag Coverage Δ
rust 68.12% <80.94%> (+0.04%) ⬆️


@chenzl25 (Contributor):

+1 for the idea that backfilling uses read rather than streaming_read to prefetch, because the backfilling workload differs from a normal batch query: it is cancelled on each barrier. Using streaming_read for backfilling could cause a high instantaneous connection count.

Comment on lines +276 to +277
pub prefetch: bool,
pub for_large_query: bool,
Collaborator:

Need some documentation on what the behavior is after setting these flags

Contributor Author:

done

Collaborator:

Streaming read is removed in this PR. Can we either remove this option until streaming read is added back, or at least modify the doc to mention that this option is not used?

src/storage/hummock_trace/src/opts.rs (outdated review thread, resolved)
sst: &Sstable,
block_index: usize,
max_prefetch_block_count: usize,
policy: CachePolicy,
Collaborator:

How the provided cache policy is used is confusing. The behavior is:

  1. If no prefetch is triggered, the requested block is inserted into block cache with the provided policy
  2. If prefetch is triggered, the requested block as well as the prefetched blocks are inserted into block cache with CachePolicy::Fill(CachePriority::Low)

How about making it more consistent?

@hzxa21 (Collaborator) commented Nov 24, 2023:

More thoughts:
We have two mechanisms to constrain prefetch memory usage now:

  • For prefetch with streaming read, pending_streaming_loading is used to track the usage and it is limited by large_query_memory_usage.
  • For prefetch with normal read, the block cache is used to track the usage and it is limited by block_cache_capacity (and the low priority ratio).

The motivation for introducing the low priority block cache is to control the behavior of cache refill, not block prefetch, and we are mixing them up here. IMO, logically it is better to have a clear separation:

  • Low priority block cache: enables block reuse across queries without polluting the whole block cache. Blocks in here can be evicted.
  • Prefetch buffer: buffers prefetched blocks within a query. Blocks in here cannot be evicted.

Ideas:

  1. Rename pending_streaming_loading to prefetch_buffer_usage and large_query_memory_usage to prefetch_buffer_capacity.
  2. Prefetched blocks will be put into the low priority block cache and the block holder will be dropped immediately. This is completely independent of the prefetch policy.
  3. Prefetched blocks will be put into a buf (holding Bytes, not the BlockHolder) and accounted for in prefetch_buffer_usage. For normal read prefetch, the buf is a Vec<Block>. For streaming read prefetch, the buf is the implicit recv buf.
  4. When there is no quota left in prefetch_buffer_capacity, prefetch will be disabled and fall back to per-block normal read. This is completely independent of the block cache policy.

What do you think?
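
A rough, self-contained sketch of what ideas 1, 3, and 4 above could look like (the names PrefetchBuffer, PrefetchGuard, and read_blocks are illustrative, not the actual RisingWave code): the usage counter plays the role of prefetch_buffer_usage, the capacity plays the role of prefetch_buffer_capacity, and when the quota is exhausted the caller falls back to per-block normal reads.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks bytes currently held by prefetched blocks.
struct PrefetchBuffer {
    usage: AtomicUsize, // plays the role of `prefetch_buffer_usage`
    capacity: usize,    // plays the role of `prefetch_buffer_capacity`
}

/// Releases the reservation when the prefetched blocks are dropped.
struct PrefetchGuard {
    buffer: Arc<PrefetchBuffer>,
    bytes: usize,
}

impl PrefetchBuffer {
    /// Reserve `bytes` for a prefetch, or return `None` if the quota is exhausted.
    fn try_reserve(buffer: &Arc<PrefetchBuffer>, bytes: usize) -> Option<PrefetchGuard> {
        let prev = buffer.usage.fetch_add(bytes, Ordering::Relaxed);
        if prev + bytes > buffer.capacity {
            buffer.usage.fetch_sub(bytes, Ordering::Relaxed); // roll back the reservation
            return None;
        }
        Some(PrefetchGuard { buffer: buffer.clone(), bytes })
    }
}

impl Drop for PrefetchGuard {
    fn drop(&mut self) {
        self.buffer.usage.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}

/// Decide between prefetching a batch of blocks and reading block by block.
fn read_blocks(buffer: &Arc<PrefetchBuffer>, block_size: usize, prefetch_count: usize) {
    match PrefetchBuffer::try_reserve(buffer, block_size * prefetch_count) {
        Some(_guard) => {
            // Quota available: issue one bounded read for `prefetch_count` blocks.
            // The reservation is released when `_guard` (and the blocks) are dropped.
            println!("prefetching {} blocks", prefetch_count);
        }
        None => {
            // Quota exhausted: fall back to per-block normal reads.
            println!("prefetch buffer full, reading block by block");
        }
    }
}

fn main() {
    let buffer = Arc::new(PrefetchBuffer { usage: AtomicUsize::new(0), capacity: 200 << 20 });
    read_blocks(&buffer, 64 * 1024, 16);
}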

commit a3ea7c0
Author: Eric Fu <[email protected]>
Date:   Sat Nov 25 11:00:48 2023 +0800

    refactor: memory management (#13636)

commit 2348a2b
Author: Bugen Zhao <[email protected]>
Date:   Fri Nov 24 18:26:48 2023 +0800

    fix(streaming): use correct label for `stream_fragment_exchange_bytes` metrics (#13644)

    Signed-off-by: Bugen Zhao <[email protected]>

commit 3ccb249
Author: Runji Wang <[email protected]>
Date:   Fri Nov 24 17:39:12 2023 +0800

    fix: estimate jsonb's value encoding size (#13643)

    Signed-off-by: Runji Wang <[email protected]>

commit 7b21e04
Author: Dylan <[email protected]>
Date:   Fri Nov 24 16:54:38 2023 +0800

    feat(optimizer): improve inline session timezone in exprs (#13640)

@Little-Wallace Little-Wallace force-pushed the wallace/more-preload branch 2 times, most recently from 03ea2ad to 149f67b Compare December 4, 2023 11:25
@@ -160,7 +160,7 @@ recent_filter_rotate_interval_ms = 10000
object_store_streaming_read_timeout_ms = 600000
object_store_streaming_upload_timeout_ms = 600000
object_store_upload_timeout_ms = 3600000
-object_store_read_timeout_ms = 3600000
+object_store_read_timeout_ms = 300000

Collaborator:

Can we avoid changing this number in this PR? We can have a separate PR to change the default timeout.

@@ -1518,7 +1518,7 @@ pub fn extract_storage_memory_config(s: &RwConfig) -> StorageMemoryConfig {
.storage
.high_priority_ratio_in_percent
.unwrap_or(default::storage::high_priority_ratio_in_percent());
-let large_query_memory_usage_mb = s
+let prefetch_buffer_capacity_mb = s
.storage
.shared_buffer_capacity_mb
.unwrap_or((100 - high_priority_ratio_in_percent) * block_cache_capacity_mb / 100);
Collaborator:

Hmm.... Does this indicate that we are still relying on cache to do prefetch?

Contributor Author:

No. We just set a default value similar to the block cache.
According to the performance benchmark, it cannot be too small because there may be several iterators using prefetch at the same time.
For example: if there are 256 vnodes reading Hummock and each vnode creates 8 iterators using prefetch, they will cost 256 * 8 * 16 * 64KB = 2GB of memory.
So it cannot be a small number.
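
For reference, a tiny sketch reproducing that estimate (the 16 prefetched blocks per iterator and the 64 KB block size are taken from the example above; the numbers are illustrative, not configuration defaults):

fn main() {
    let vnodes: u64 = 256;
    let iterators_per_vnode: u64 = 8;
    let blocks_prefetched_per_iterator: u64 = 16;
    let block_size_bytes: u64 = 64 * 1024; // 64 KB blocks

    // 256 * 8 * 16 * 64 KB = 2 GB of prefetched data in the worst case.
    let total = vnodes * iterators_per_vnode * blocks_prefetched_per_iterator * block_size_bytes;
    println!("worst-case prefetch memory: {} MB", total / (1024 * 1024));
}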


let prefetch_buffer_capacity_mb = storage_config
.prefetch_buffer_capacity_mb
.unwrap_or(meta_cache_capacity_mb + block_cache_capacity_mb);
Collaborator:

ditto

src/object_store/src/object/mod.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
src/storage/src/hummock/block_stream.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
Comment on lines 415 to 430
let holder = if let CachePolicy::Fill(priority) = policy {
let cache_priority = if idx == block_index {
priority
} else {
CachePriority::Low
};
self.block_cache
.insert(object_id, idx as u64, Box::new(block), cache_priority)
} else {
BlockHolder::from_owned_block(Box::new(block))
};

blocks.push_back(holder);
offset = end;
}
Ok(Box::new(PrefetchBlockStream::new(blocks, block_index)))
Collaborator:

IIUC, the approach we are using is:

  • If policy is CachePolicy::Fill, we rely on the block cache to hold the prefetch buffer, meaning that we are pinning the prefetched blocks in the block cache until the read finishes.
  • Otherwise, we use BlockHolder::from_owned_block, meaning that the memory usage of prefetched blocks is untracked.

As mentioned in #13558 (comment), this can cause unexpected memory usage and cache eviction, which is sub-optimal. How about:

  • Store Block instead of BlockHolder in PrefetchBlockStream.
  • Increment prefetch_buffer_usage when creating PrefetchBlockStream as well.
  • When prefetch_buffer_usage > prefetch_buffer_capacity, disable prefetch and use normal get instead of PrefetchBlockStream and LongConnectionBlockStream.

Collaborator:

Btw, the default prefetch_buffer_capacity doesn't need to be very large with this approach, because IMO a small buffer can already achieve good enough performance when the prefetch concurrency is not high, which is mostly the case for a streaming CN. My opinion:

  • For a streaming CN with large memory, I think a 200MB prefetch buffer is an okay default number.
  • For a streaming CN with small memory, prefetch brings little benefit, so we can use an even smaller number.
  • For a serving CN, prefetch concurrency can be much higher and prefetch can also be more aggressive, so the prefetch buffer can be the remaining memory minus the block cache capacity. This also makes sense and gives users a good reason to deploy a serving CN if they care about batch query performance.

More brainstorming for future work (just immature ideas):

  • We can do adaptive prefetch instead of prefetching a fixed number of blocks each time.
  • Backfill behaves more like a batch query than a streaming query, so it may need a larger prefetch buffer. We could probably use an ad-hoc instance (like a serving CN with large memory) to do the scan and achieve better prefetch.

Contributor Author:

It would be strange if we did not pin the blocks read by prefetch.
When we read only one block, we pin it, but when we read several blocks, we would not pin them anymore.
If we do not want to pin these blocks, we should not pin any block in the iterator.

src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
@hzxa21 (Collaborator) left a review comment:

Rest LGTM

@@ -273,7 +273,8 @@ fn try_update_failure_metric<T>(
result: &ObjectResult<T>,
operation_type: &'static str,
) {
-if result.is_err() {
+if let Err(e) = &result {
+    tracing::error!("read failed because of: {:?}", e);
Collaborator:

Not just read. Should be tracing::error!("{} failed because of: {:?}", operation_type, e);

src/storage/src/hummock/block_stream.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
.prefetch_blocks(self.sst.value(), idx, self.preload_end_block_idx,
self.options.cache_policy,
&mut self.stats,
).await
Collaborator:

Not related to this PR, but I just realized that the error returned by prefetch_blocks is ignored. We should check and log it.

Contributor Author:

Fixed

@hzxa21 (Collaborator) left a review comment:

Rest LGTM

@hzxa21 (Collaborator) left a review comment:

Streaming read is no longer used for streaming/batch query. Please modify the PR description accordingly.

@Little-Wallace Little-Wallace added this pull request to the merge queue Dec 11, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Dec 11, 2023
@Little-Wallace Little-Wallace added this pull request to the merge queue Dec 11, 2023
Merged via the queue into main with commit 2fffc13 Dec 11, 2023
6 of 7 checks passed
@Little-Wallace Little-Wallace deleted the wallace/more-preload branch December 11, 2023 12:11
@WanYixian (Contributor):

@Little-Wallace @chenzl25 @hzxa21
Hello everyone, thanks for your work on this PR. I'll be updating the related documentation, and could you please let me know the parts that require updates on the doc side? Thanks in advance!

@KeXiangWang (Contributor):

@Little-Wallace @huangjw806 Can we cherry-pick this PR into 1.5.2?
