feat(storage): enable preload for log store and dynamic filter #13558
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##             main   #13558      +/-  ##
==========================================
+ Coverage   68.07%   68.12%    +0.04%
==========================================
  Files        1532     1533        +1
  Lines      264568   264675      +107
==========================================
+ Hits       180114   180310      +196
+ Misses      84454    84365       -89
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
+1 for the idea that backfilling uses
```rust
pub prefetch: bool,
pub for_large_query: bool,
```
Need some documentation on what the behavior is after setting these flags
done
streaming read is removed from this PR. Can we either remove this option until streaming read is added back or at least modify the doc to mention that this option is not used.
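For example, doc comments along these lines could address both points (a sketch only; the struct name `ReadOptions` and the exact wording are mine, paraphrased from this thread):

```rust
/// Read-ahead options for an SSTable iterator (illustrative sketch).
pub struct ReadOptions {
    /// If true, the iterator prefetches subsequent blocks of the SSTable
    /// instead of issuing one object-store read per block.
    pub prefetch: bool,
    /// Intended to route large queries to streaming read. Currently unused,
    /// since streaming read was removed from this PR.
    pub for_large_query: bool,
}
```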
```rust
sst: &Sstable,
block_index: usize,
max_prefetch_block_count: usize,
policy: CachePolicy,
```
How the provided cache policy is used is confusing. The behavior is:
- If no prefetch is triggered, the requested block is inserted into the block cache with the provided policy.
- If prefetch is triggered, the requested block as well as the prefetched blocks are inserted into the block cache with `CachePolicy::Fill(CachePriority::Low)`.

How about making it more consistent?
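A minimal sketch restating that behavior, with the cache and its types reduced to stand-ins (only `CachePolicy::Fill` and `CachePriority::Low` come from the comment above; everything else is hypothetical):

```rust
#[derive(Clone, Copy, Debug)]
enum CachePriority {
    High,
    Low,
}

#[derive(Clone, Copy, Debug)]
enum CachePolicy {
    Fill(CachePriority),
    NotFill,
}

// Stand-in for the real block-cache insert; we only log what would happen.
fn insert(block_idx: usize, priority: CachePriority) {
    println!("insert block {block_idx} with priority {priority:?}");
}

// The behavior described above: a plain read honors the caller's policy,
// while a prefetch demotes every inserted block to low priority.
fn read_blocks(requested: usize, prefetched: &[usize], policy: CachePolicy) {
    if prefetched.is_empty() {
        if let CachePolicy::Fill(priority) = policy {
            insert(requested, priority);
        }
    } else {
        insert(requested, CachePriority::Low);
        for &idx in prefetched {
            insert(idx, CachePriority::Low);
        }
    }
}

fn main() {
    // Honors the provided High priority.
    read_blocks(3, &[], CachePolicy::Fill(CachePriority::High));
    // Same policy, but everything lands in the cache as Low.
    read_blocks(3, &[4, 5], CachePolicy::Fill(CachePriority::High));
}
```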
More thoughts:
We have two mechanisms to constrain prefetch memory usage now:
- For prefetch with streaming read, `pending_streaming_loading` is used to track the usage, and it is limited by `large_query_memory_usage`.
- For prefetch with normal read, the block cache is used to track the usage, and it is limited by `block_cache_capacity` (and the low-priority ratio).

The motivation for introducing the low-priority block cache is to control the behavior of cache refill, not block prefetch, and we are mixing them up here. IMO, logically it is better to have a clear separation:
- Low-priority block cache: enables block reuse across queries without polluting the whole block cache. Blocks in here can be evicted.
- Prefetch buffer: buffers prefetched blocks within a query. Blocks in here cannot be evicted.

Ideas:
- Rename `pending_streaming_loading` to `prefetch_buffer_usage` and `large_query_memory_usage` to `prefetch_buffer_capacity`.
- Prefetched blocks will be put into the low-priority block cache, and the block holder will be dropped immediately. This is completely independent of the prefetch policy.
- Prefetched blocks will be put into a buffer (holding `Bytes`, not the `BlockHolder`) and accounted for in `prefetch_buffer_usage`. For normal read prefetch, the buffer is a `Vec<Block>`. For streaming read prefetch, the buffer is the implicit recv buffer.
- When there is no quota left in `prefetch_buffer_capacity`, prefetch will be disabled and fall back to per-block normal read. This is completely independent of the block cache policy.

What do you think?
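A minimal sketch of the proposed quota check, assuming `prefetch_buffer_usage` is a shared atomic counter (the guard type and helper names are hypothetical):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Reserved prefetch-buffer bytes, released on drop.
struct PrefetchQuota {
    usage: Arc<AtomicUsize>,
    bytes: usize,
}

impl Drop for PrefetchQuota {
    fn drop(&mut self) {
        self.usage.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}

/// Try to reserve `bytes` from the prefetch buffer. `None` means the quota
/// is exhausted and the caller should fall back to per-block normal read.
fn try_reserve(usage: &Arc<AtomicUsize>, capacity: usize, bytes: usize) -> Option<PrefetchQuota> {
    if usage.fetch_add(bytes, Ordering::Relaxed) + bytes > capacity {
        // Over quota: undo the reservation and disable prefetch for this read.
        usage.fetch_sub(bytes, Ordering::Relaxed);
        return None;
    }
    Some(PrefetchQuota { usage: usage.clone(), bytes })
}

fn main() {
    let usage = Arc::new(AtomicUsize::new(0));
    let capacity = 64 << 20; // a hypothetical 64 MiB prefetch_buffer_capacity
    match try_reserve(&usage, capacity, 16 * (64 << 10)) {
        Some(_quota) => println!("prefetch enabled for this iterator"),
        None => println!("fall back to per-block normal read"),
    } // _quota dropped here, releasing the reservation
}
```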
src/config/example.toml (Outdated)

```diff
@@ -160,7 +160,7 @@ recent_filter_rotate_interval_ms = 10000
 object_store_streaming_read_timeout_ms = 600000
 object_store_streaming_upload_timeout_ms = 600000
 object_store_upload_timeout_ms = 3600000
-object_store_read_timeout_ms = 3600000
+object_store_read_timeout_ms = 300000
```
Can we avoid changing this number in this PR? We can have a separate PR to change the default timeout.
```diff
@@ -1518,7 +1518,7 @@ pub fn extract_storage_memory_config(s: &RwConfig) -> StorageMemoryConfig {
         .storage
         .high_priority_ratio_in_percent
         .unwrap_or(default::storage::high_priority_ratio_in_percent());
-    let large_query_memory_usage_mb = s
+    let prefetch_buffer_capacity_mb = s
         .storage
         .shared_buffer_capacity_mb
         .unwrap_or((100 - high_priority_ratio_in_percent) * block_cache_capacity_mb / 100);
```
Hmm.... Does this indicate that we are still relying on cache to do prefetch?
No. We just set a default value, similar to the block cache.
Based on the performance benchmark, it cannot be too small, because there may be several iterators using prefetch at once.
For example, if there are 256 vnodes reading Hummock and each vnode creates 8 iterators using prefetch, they will cost 256 * 8 * 16 * 64KB = 2GB of memory.
So it cannot be a small number.
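Spelling the arithmetic out (using the numbers from the comment above):

```rust
fn main() {
    // 256 vnodes * 8 iterators each * 16 prefetched blocks * 64 KiB per block
    let total: u64 = 256 * 8 * 16 * (64 << 10);
    assert_eq!(total, 2 << 30); // = 2 GiB
    println!("worst-case prefetch memory: {} GiB", total >> 30);
}
```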
src/compute/src/memory/config.rs (Outdated)

```rust
let prefetch_buffer_capacity_mb = storage_config
    .prefetch_buffer_capacity_mb
    .unwrap_or(meta_cache_capacity_mb + block_cache_capacity_mb);
```
ditto
```rust
let holder = if let CachePolicy::Fill(priority) = policy {
    // The requested block keeps the caller's priority; all other
    // prefetched blocks are demoted to low priority.
    let cache_priority = if idx == block_index {
        priority
    } else {
        CachePriority::Low
    };
    self.block_cache
        .insert(object_id, idx as u64, Box::new(block), cache_priority)
} else {
    // Not filling the cache: hold the block directly, untracked.
    BlockHolder::from_owned_block(Box::new(block))
};
// ...
blocks.push_back(holder);
offset = end;
}
Ok(Box::new(PrefetchBlockStream::new(blocks, block_index)))
```
IIUC, the approach we are using is:
- If policy is `CachePolicy::Fill`, we rely on the block cache to hold the prefetch buffer, meaning that we are pinning the prefetched blocks in the block cache until the read finishes.
- Otherwise, we use `BlockHolder::from_owned_block`, meaning that the memory usage of the prefetched blocks is untracked.

As mentioned in #13558 (comment), this can cause unexpected memory usage and cache eviction, which is sub-optimal. How about:
- Store `Block` instead of `BlockHolder` in `PrefetchBlockStream`.
- Increment `prefetch_buffer_usage` when creating `PrefetchBlockStream` as well.
- When `prefetch_buffer_usage > prefetch_buffer_capacity`, disable prefetch and use normal get instead of `PrefetchBlockStream` and `LongConnectionBlockStream`.
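A minimal sketch of that proposal, with `Block` reduced to its owned bytes and `prefetch_buffer_usage` as a shared atomic counter (everything beyond those two names from the comment is hypothetical):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for the real `Block`: just its owned bytes.
struct Block(Vec<u8>);

/// Buffers prefetched blocks outside the block cache, charging their size to
/// `prefetch_buffer_usage` and releasing it when the stream is dropped.
struct PrefetchBlockStream {
    blocks: VecDeque<Block>,
    usage: Arc<AtomicUsize>,
    bytes: usize,
}

impl PrefetchBlockStream {
    fn new(blocks: VecDeque<Block>, usage: Arc<AtomicUsize>) -> Self {
        let bytes: usize = blocks.iter().map(|b| b.0.len()).sum();
        usage.fetch_add(bytes, Ordering::Relaxed);
        Self { blocks, usage, bytes }
    }

    fn next_block(&mut self) -> Option<Block> {
        self.blocks.pop_front()
    }
}

impl Drop for PrefetchBlockStream {
    fn drop(&mut self) {
        self.usage.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}

fn main() {
    let usage = Arc::new(AtomicUsize::new(0));
    let blocks: VecDeque<Block> = (0..16).map(|_| Block(vec![0u8; 64 << 10])).collect();
    let mut stream = PrefetchBlockStream::new(blocks, usage.clone());
    assert_eq!(usage.load(Ordering::Relaxed), 16 * (64 << 10));
    let _first = stream.next_block();
    drop(stream); // the whole reservation is released here
    assert_eq!(usage.load(Ordering::Relaxed), 0);
}
```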
Btw, the default `prefetch_buffer_capacity` doesn't need to be very large with this approach, because IMO a small buffer can already achieve good enough performance when the prefetch concurrency is not high, which is mostly the case for a streaming CN. My opinion:
- For a streaming CN with large memory, I think a 200MB prefetch buffer is an okay default.
- For a streaming CN with small memory, prefetch can provide little benefit, so we can use an even smaller number.
- For a serving CN, prefetch concurrency can be much higher and prefetch can also be more aggressive, so the prefetch buffer can be the remaining memory minus the block cache capacity. This also makes sense and gives users a good reason to deploy a serving CN if they care about batch query performance.

More brainstorms for future work (just immature ideas):
- We can do adaptive prefetch instead of prefetching a fixed number of blocks each time.
- Backfill behaves more like a batch query than a streaming query, so it may need a larger prefetch buffer. We can probably use an ad-hoc instance (like a serving CN with large memory) to do the scan to achieve better prefetch.
It is strange if we do not pin the blocks read by prefetch:
when we read only one block we pin it, but when we read several blocks we do not pin them anymore.
If we do not want to pin these blocks, we should not pin any block in the iterator.
Rest LGTM
src/object_store/src/object/mod.rs (Outdated)

```diff
@@ -273,7 +273,8 @@ fn try_update_failure_metric<T>(
     result: &ObjectResult<T>,
     operation_type: &'static str,
 ) {
-    if result.is_err() {
+    if let Err(e) = &result {
+        tracing::error!("read failed because of: {:?}", e);
```
Not just read. Should be `tracing::error!("{} failed because of: {:?}", operation_type, e);`
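A sketch of the fixed function, with `ObjectResult` simplified to a plain `Result` and the metric update elided (assumes the `tracing` crate):

```rust
fn try_update_failure_metric<T, E: std::fmt::Debug>(
    result: &Result<T, E>,
    operation_type: &'static str,
) {
    if let Err(e) = result {
        // Name the failing operation instead of hard-coding "read".
        tracing::error!("{} failed because of: {:?}", operation_type, e);
        // ... bump the failure metric for `operation_type` here ...
    }
}

fn main() {
    let result: Result<(), &str> = Err("connection reset");
    try_update_failure_metric(&result, "upload");
}
```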
```rust
.prefetch_blocks(
    self.sst.value(),
    idx,
    self.preload_end_block_idx,
    self.options.cache_policy,
    &mut self.stats,
)
.await
```
Not related to this PR, but I just realized that the error returned by `prefetch_blocks` is ignored. We should check and log it.
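For illustration, the check-and-log pattern in a minimal, synchronous stand-in (the real `prefetch_blocks` is async and returns a block stream; the error message and fallback are assumptions):

```rust
// Minimal, synchronous stand-in for the async prefetch call.
fn prefetch_blocks(fail: bool) -> Result<&'static str, String> {
    if fail {
        Err("object store timeout".into())
    } else {
        Ok("block stream")
    }
}

fn main() {
    // Check the result instead of silently discarding it.
    match prefetch_blocks(true) {
        Ok(stream) => println!("prefetch succeeded: {stream}"),
        Err(e) => {
            // Log the error; the read can still proceed block by block.
            eprintln!("prefetch_blocks failed because of: {e:?}");
        }
    }
}
```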
Fixed
Rest LGTM
Streaming read is no longer used for streaming/batch queries. Please modify the PR description accordingly.
@Little-Wallace @chenzl25 @hzxa21
@Little-Wallace @huangjw806 Can we cherry-pick this PR into 1.5.2?
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Enable preload for these executors:

This PR refactors prefetch and changes all read operations for streaming executor prefetch to `read`, because in the S3 SDK, `streaming_read` shares the same interface with `read`. A remote object storage server only sees a read request that lives a long time and has still not ended, and we cannot be sure the server will not kill such requests. So it is dangerous to use a long-lived read in cases that cannot finish as soon as possible.

We remove streaming read in this PR because we found it can block the query future in some cases. We may enable it again once we investigate the root cause. This means batch queries will use the same prefetch method as streaming executors.
Checklist

- `./risedev check` (or alias, `./risedev c`)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.