feat(storage): enable preload for log store and dynamic filter #13558

Merged
31 commits merged into main from wallace/more-preload on Dec 11, 2023

Conversation

@Little-Wallace (Contributor) commented Nov 21, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Enable preload for these executors:

  • cmd scan
  • log store reader
  • streaming dynamic filter
  • streaming source reader
  • batch query

This PR refactors prefetch and changes all read operations used for streaming executor prefetch to plain read. In the S3 SDK, streaming_read shares the same interface with read, so a remote object storage server only sees a read request that stays open for a long time without finishing. We cannot guarantee that the object storage server will not kill such long-lived read requests, so it is dangerous to hold a long read open in cases where it cannot finish quickly.

We also remove streaming read in this PR, because we found it can block the query future in some cases. We may re-enable it once we investigate the root cause. This means batch queries will use the same prefetch method as streaming executors.
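
For illustration only, below is a minimal sketch of the read-based prefetch direction described above. The names (ObjectReader, PrefetchedBlocks) and the fixed block size are made up for this example and are not RisingWave's actual API; the point is that prefetch issues one bounded read for a range of blocks and then serves them from memory, instead of keeping a streaming read open against the object store.

use std::collections::VecDeque;

/// Hypothetical object-store handle; `read` fetches one byte range per request.
trait ObjectReader {
    fn read(&self, start: usize, len: usize) -> Vec<u8>;
}

/// Blocks prefetched with a single bounded read, then served from memory.
struct PrefetchedBlocks {
    blocks: VecDeque<Vec<u8>>,
}

impl PrefetchedBlocks {
    /// Fetch `count` fixed-size blocks starting at `first_block` in one read request.
    fn fetch(reader: &dyn ObjectReader, first_block: usize, count: usize, block_size: usize) -> Self {
        let data = reader.read(first_block * block_size, count * block_size);
        let blocks = data.chunks(block_size).map(|c| c.to_vec()).collect();
        Self { blocks }
    }

    /// Pop the next prefetched block, if any.
    fn next_block(&mut self) -> Option<Vec<u8>> {
        self.blocks.pop_front()
    }
}

fn main() {
    // In-memory stand-in for an object store holding a 1 MiB "SST".
    struct MemStore(Vec<u8>);
    impl ObjectReader for MemStore {
        fn read(&self, start: usize, len: usize) -> Vec<u8> {
            self.0[start..start + len].to_vec()
        }
    }

    let store = MemStore(vec![0u8; 1 << 20]);
    // One bounded read prefetches 4 blocks of 64 KiB each.
    let mut stream = PrefetchedBlocks::fetch(&store, 0, 4, 64 * 1024);
    while let Some(block) = stream.next_block() {
        println!("served block of {} bytes from the prefetch buffer", block.len());
    }
}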

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features. See Sqlsmith: SQL feature generation, #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 77 lines in your changes are missing coverage. Please review.

Comparison is base (98b0ebd) 68.07% compared to head (400e7db) 68.12%.
Report is 1 commit behind head on main.

Files Patch % Lines
src/storage/src/hummock/sstable_store.rs 67.42% 43 Missing ⚠️
...ge/src/hummock/sstable/forward_sstable_iterator.rs 83.05% 10 Missing ⚠️
src/storage/src/hummock/block_stream.rs 91.66% 8 Missing ⚠️
src/storage/src/hummock/utils.rs 81.25% 6 Missing ⚠️
src/storage/src/store.rs 73.68% 5 Missing ⚠️
src/stream/src/executor/source/fetch_executor.rs 0.00% 2 Missing ⚠️
src/storage/hummock_test/src/bin/replay/main.rs 0.00% 1 Missing ⚠️
src/storage/src/hummock/validator.rs 0.00% 1 Missing ⚠️
src/storage/src/store_impl.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13558      +/-   ##
==========================================
+ Coverage   68.07%   68.12%   +0.04%     
==========================================
  Files        1532     1533       +1     
  Lines      264568   264675     +107     
==========================================
+ Hits       180114   180310     +196     
+ Misses      84454    84365      -89     
Flag Coverage Δ
rust 68.12% <80.94%> (+0.04%) ⬆️


@chenzl25 (Contributor):

+1 for the idea that backfilling uses read rather than streaming_read to prefetch, because the backfilling workload differs from a normal batch query: it is cancelled on each barrier. Using streaming_read for backfilling could cause a high instantaneous connection count.

Comment on lines +276 to +277
pub prefetch: bool,
pub for_large_query: bool,
Collaborator:

Need some documentation on what the behavior is after setting these flags

Contributor Author:

done

Collaborator:

Streaming read is removed in this PR. Can we either remove this option until streaming read is added back, or at least modify the doc to mention that this option is not used?

src/storage/hummock_trace/src/opts.rs (outdated review thread, resolved)
sst: &Sstable,
block_index: usize,
max_prefetch_block_count: usize,
policy: CachePolicy,
Collaborator:

How the provided cache policy is used is confusing. The behavior is:

  1. If no prefetch is triggered, the requested block is inserted into block cache with the provided policy
  2. If prefetch is triggered, the requested block as well as the prefetched blocks are inserted into block cache with CachePolicy::Fill(CachePriority::Low)

How about making it more consistent?

@hzxa21 (Collaborator) commented Nov 24, 2023:

More thoughts:
We have two mechanisms to constrain prefetch memory usage now:

  • For prefetch with streaming read, pending_streaming_loading is used to track the usage and it is limited by large_query_memory_usage.
  • For prefetch with normal read, the block cache is used to track the usage and it is limited by block_cache_capacity (and the low priority ratio).

The motivation for introducing the low priority block cache is to control the behavior of cache refill, not block prefetch, and we are mixing them up here. IMO, logically it is better to have a clear separation:

  • Low priority block cache: enables block reuse across queries without polluting the whole block cache. Blocks in here can be evicted.
  • Prefetch buffer: buffers prefetched blocks within a query. Blocks in here cannot be evicted.

Ideas:

  1. Rename pending_streaming_loading to prefetch_buffer_usage and large_query_memory_usage to prefetch_buffer_capacity.
  2. Prefetched blocks will be put into the low priority block cache and the block holder will be dropped immediately. This is completely independent of the prefetch policy.
  3. Prefetched blocks will be put into a buf (holding Bytes, not the BlockHolder) and accounted for in prefetch_buffer_usage. For normal read prefetch, the buf is a Vec<Block>. For streaming read prefetch, the buf is the implicit recv buf.
  4. When there is no quota left in prefetch_buffer_capacity, prefetch will be disabled and fall back to per-block normal read. This is completely independent of the block cache policy.

What do you think?
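
A rough, self-contained sketch of what ideas 1, 3, and 4 above could look like (the names PrefetchBuffer, PrefetchGuard, and read_blocks are illustrative, not the actual RisingWave code): the usage counter plays the role of prefetch_buffer_usage, the capacity plays the role of prefetch_buffer_capacity, and when the quota is exhausted the caller falls back to per-block normal reads.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks bytes currently held by prefetched blocks.
struct PrefetchBuffer {
    usage: AtomicUsize, // plays the role of `prefetch_buffer_usage`
    capacity: usize,    // plays the role of `prefetch_buffer_capacity`
}

/// Releases the reservation when the prefetched blocks are dropped.
struct PrefetchGuard {
    buffer: Arc<PrefetchBuffer>,
    bytes: usize,
}

impl PrefetchBuffer {
    /// Reserve `bytes` for a prefetch, or return `None` if the quota is exhausted.
    fn try_reserve(buffer: &Arc<PrefetchBuffer>, bytes: usize) -> Option<PrefetchGuard> {
        let prev = buffer.usage.fetch_add(bytes, Ordering::Relaxed);
        if prev + bytes > buffer.capacity {
            buffer.usage.fetch_sub(bytes, Ordering::Relaxed); // roll back the reservation
            return None;
        }
        Some(PrefetchGuard { buffer: buffer.clone(), bytes })
    }
}

impl Drop for PrefetchGuard {
    fn drop(&mut self) {
        self.buffer.usage.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}

/// Decide between prefetching a batch of blocks and reading block by block.
fn read_blocks(buffer: &Arc<PrefetchBuffer>, block_size: usize, prefetch_count: usize) {
    match PrefetchBuffer::try_reserve(buffer, block_size * prefetch_count) {
        Some(_guard) => {
            // Quota available: issue one bounded read for `prefetch_count` blocks.
            // The reservation is released when `_guard` (and the blocks) are dropped.
            println!("prefetching {} blocks", prefetch_count);
        }
        None => {
            // Quota exhausted: fall back to per-block normal reads.
            println!("prefetch buffer full, reading block by block");
        }
    }
}

fn main() {
    let buffer = Arc::new(PrefetchBuffer { usage: AtomicUsize::new(0), capacity: 200 << 20 });
    read_blocks(&buffer, 64 * 1024, 16);
}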

commit a3ea7c0
Author: Eric Fu <[email protected]>
Date:   Sat Nov 25 11:00:48 2023 +0800

    refactor: memory management (#13636)

commit 2348a2b
Author: Bugen Zhao <[email protected]>
Date:   Fri Nov 24 18:26:48 2023 +0800

    fix(streaming): use correct label for `stream_fragment_exchange_bytes` metrics (#13644)

    Signed-off-by: Bugen Zhao <[email protected]>

commit 3ccb249
Author: Runji Wang <[email protected]>
Date:   Fri Nov 24 17:39:12 2023 +0800

    fix: estimate jsonb's value encoding size (#13643)

    Signed-off-by: Runji Wang <[email protected]>

commit 7b21e04
Author: Dylan <[email protected]>
Date:   Fri Nov 24 16:54:38 2023 +0800

    feat(optimizer): improve inline session timezone in exprs (#13640)

@Little-Wallace Little-Wallace force-pushed the wallace/more-preload branch 2 times, most recently from 03ea2ad to 149f67b Compare December 4, 2023 11:25
@@ -160,7 +160,7 @@ recent_filter_rotate_interval_ms = 10000
object_store_streaming_read_timeout_ms = 600000
object_store_streaming_upload_timeout_ms = 600000
object_store_upload_timeout_ms = 3600000
-object_store_read_timeout_ms = 3600000
+object_store_read_timeout_ms = 300000

Collaborator:

Can we avoid changing this number in this PR? We can have a separate PR to change the default timeout.

@@ -1518,7 +1518,7 @@ pub fn extract_storage_memory_config(s: &RwConfig) -> StorageMemoryConfig {
.storage
.high_priority_ratio_in_percent
.unwrap_or(default::storage::high_priority_ratio_in_percent());
-let large_query_memory_usage_mb = s
+let prefetch_buffer_capacity_mb = s
.storage
.shared_buffer_capacity_mb
.unwrap_or((100 - high_priority_ratio_in_percent) * block_cache_capacity_mb / 100);
Collaborator:

Hmm.... Does this indicate that we are still relying on cache to do prefetch?

Contributor Author:

No. We just set a default value similar to the block cache.
According to the performance benchmark, it cannot be too small because there may be several iterators using prefetch at the same time.
For example: if there are 256 vnodes reading Hummock and each vnode creates 8 iterators using prefetch, they will cost 256 * 8 * 16 * 64KB = 2GB of memory.
So it cannot be a small number.
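
For reference, a tiny sketch reproducing that estimate (the 16 prefetched blocks per iterator and the 64 KB block size are taken from the example above; the numbers are illustrative, not configuration defaults):

fn main() {
    let vnodes: u64 = 256;
    let iterators_per_vnode: u64 = 8;
    let blocks_prefetched_per_iterator: u64 = 16;
    let block_size_bytes: u64 = 64 * 1024; // 64 KB blocks

    // 256 * 8 * 16 * 64 KB = 2 GB of prefetched data in the worst case.
    let total = vnodes * iterators_per_vnode * blocks_prefetched_per_iterator * block_size_bytes;
    println!("worst-case prefetch memory: {} MB", total / (1024 * 1024));
}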


let prefetch_buffer_capacity_mb = storage_config
.prefetch_buffer_capacity_mb
.unwrap_or(meta_cache_capacity_mb + block_cache_capacity_mb);
Collaborator:

ditto

src/object_store/src/object/mod.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
src/storage/src/hummock/block_stream.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
Comment on lines 415 to 430
let holder = if let CachePolicy::Fill(priority) = policy {
let cache_priority = if idx == block_index {
priority
} else {
CachePriority::Low
};
self.block_cache
.insert(object_id, idx as u64, Box::new(block), cache_priority)
} else {
BlockHolder::from_owned_block(Box::new(block))
};

blocks.push_back(holder);
offset = end;
}
Ok(Box::new(PrefetchBlockStream::new(blocks, block_index)))
Collaborator:

IIUC, the approach we are using is:

  • If policy is CachePolicy::Fill, we rely on the block cache to hold the prefetch buffer, meaning that we are pinning the prefetched blocks in the block cache until the read finishes.
  • Otherwise, we use BlockHolder::from_owned_block, meaning that the memory usage of prefetched blocks is untracked.

As mentioned in #13558 (comment), this can cause unexpected memory usage and cache eviction, which is sub-optimal. How about:

  • Store Block instead of BlockHolder in PrefetchBlockStream.
  • Increment prefetch_buffer_usage when creating PrefetchBlockStream as well.
  • When prefetch_buffer_usage > prefetch_buffer_capacity, disable prefetch and use normal get instead of PrefetchBlockStream and LongConnectionBlockStream.

Collaborator:

Btw, the default prefetch_buffer_capacity doesn't need to be very large with this approach, because IMO a small buffer can already achieve good enough performance when the prefetch concurrency is not high, which is mostly the case for a streaming CN. My opinion:

  • For a streaming CN with large memory, I think a 200MB prefetch buffer is an okay default number.
  • For a streaming CN with small memory, prefetch brings little benefit, so we can use an even smaller number.
  • For a serving CN, prefetch concurrency can be much higher and prefetch can also be more aggressive, so the prefetch buffer can be the remaining memory minus the block cache capacity. This also makes sense and gives users a good reason to deploy a serving CN if they care about batch query performance.

More brainstorming for future work (just immature ideas):

  • We can do adaptive prefetch instead of prefetching a fixed number of blocks each time.
  • Backfill behaves more like a batch query than a streaming query, so it may need a larger prefetch buffer. We could probably use an ad-hoc instance (like a serving CN with large memory) to do the scan and achieve better prefetch.

Contributor Author:

It would be strange if we did not pin the blocks read by prefetch.
When we read only one block, we pin it, but when we read several blocks, we would not pin them anymore.
If we do not want to pin these blocks, we should not pin any block in the iterator.

src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
@hzxa21 (Collaborator) left a review comment:

Rest LGTM

@@ -273,7 +273,8 @@ fn try_update_failure_metric<T>(
result: &ObjectResult<T>,
operation_type: &'static str,
) {
-if result.is_err() {
+if let Err(e) = &result {
+    tracing::error!("read failed because of: {:?}", e);
Collaborator:

Not just read. Should be tracing::error!("{} failed because of: {:?}", operation_type, e);

src/storage/src/hummock/block_stream.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable_store.rs (outdated review thread, resolved)
.prefetch_blocks(self.sst.value(), idx, self.preload_end_block_idx,
self.options.cache_policy,
&mut self.stats,
).await
Collaborator:

Not related to this PR, but I just realized that the error returned by prefetch_blocks is ignored. We should check and log it.

Contributor Author:

Fixed

@hzxa21 (Collaborator) left a review comment:

Rest LGTM

@hzxa21 (Collaborator) left a review comment:

Streaming read is no longer used for streaming/batch query. Please modify the PR description accordingly.

@Little-Wallace Little-Wallace added this pull request to the merge queue Dec 11, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Dec 11, 2023
@Little-Wallace Little-Wallace added this pull request to the merge queue Dec 11, 2023
Merged via the queue into main with commit 2fffc13 Dec 11, 2023
6 of 7 checks passed
@Little-Wallace Little-Wallace deleted the wallace/more-preload branch December 11, 2023 12:11
@WanYixian (Contributor):

@Little-Wallace @chenzl25 @hzxa21
Hello everyone, thanks for your work on this PR. I'll be updating the related documentation, and could you please let me know the parts that require updates on the doc side? Thanks in advance!

@KeXiangWang (Contributor):

@Little-Wallace @huangjw806 Can we cherry-pick this PR into 1.5.2?
