-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(stream): concurrently fetch row from storage and refill cache #19629
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
120103d
to
b547dc8
Compare
Benchmarking this in RWC does not show decrease in mem utilization. There's probably something wrong with the implementation. Writing a memory benchmark first. |
Added benchmark: #19712. Continuing investigation. |
b547dc8
to
fb55b51
Compare
Works well. See PR description. Memory utilization plateaus after the threshold set. |
efda035
to
5d72e75
Compare
8766e99
to
3f8c1a6
Compare
3f8c1a6
to
bf1ac94
Compare
f5188c8
to
bd82fe3
Compare
We don't have any issues supporting this for Currently we:
So the upfront IO cost is: Then the total latency for cache missed will be After this PR: Currently I workaround this by reading all the degrees into memory first. The size of the degrees shouldn't be too bad, since we just need to store the degrees. Then the total latency for cache missed will be However, given that cache miss is expected to seldom occur, I think this is a reasonable trade-off to make. Will benchmark this approach against a nexmark query with Another approach is to concurrently do point get, and write to the degree state table, since the point get reference to the degree table is short-lived. However, point get does not have prefetch interface yet. The second issue is tolerating inconsistency. Previously we buffered all matched rows into memory. If there's a mismatch in the number of rows in the degree table rows and the matched table rows, we will compare their pk. I think it's possible to support this with the point get approach. Alternatively, if tolerate inconsistency is set to true, we can always follow the old approach of refilling cache first. |
Will add a micro-bench for |
af75e8e
to
08ac933
Compare
3a20550
to
0f5e5ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the Cargo.lock changes.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Currently we read all rows matching a key, encode them, and store them in an in-memory data structure. We add this to the cache. Then we iterate each row from the in-memory data structure, decode them. This means we have to incur IO latency of reading all rows first. We also have to incur the memory cost of buffering all matching records.
Instead of doing this, we can iterate the rows, concurrently refill cache for them. Then we don't have to wait for IO to finish for ALL rows. We also avoid OOM, since while refilling cache, if we notice the number of rows exceeds some threshold, we can stop refilling.
Further, if we tolerate inconsistency, we will revert to the old strategy of reading all records into cache first, so we will preserve that logic.
This is done for all join types in the hash join executor.
Handling degree table matches and updates
Before this PR we:
However, in this PR we concurrently refill cache and handle matches. This means that 1&2 happen concurrently, which means we hold an immutable and mutable reference to the underlying match side degree table. This will be rejected by the borrow checker.
To solve this, we read from the match side degree table, keeping all the degrees in-memory. Only after that is complete, we start concurrently doing cache refill of the match side, and also updating degrees of match side. In this way the lifetimes of the mutable and immutable reference to match side degree table won't overlap.
In the future we can optimize it further, holding a separate cache for degrees, from their join state, since the degree state is much smaller.
Nexmark Benchmark results
Build: https://buildkite.com/risingwave-test/nexmark-benchmark/builds/4994
Dashboards:
Micro Benchmark results
Cache Miss
With a 30,000 record limit for each key we measure the performance + memory consumption when a single record is matched against N records which are not in-cache. Here's the comparison (using in-memory statestore, inner join, sample size = 100):
You can observe how after 30,000, we don't do cache refill anymore, and so the memory usage becomes capped, and we can avoid OOM.
Do also note that because this is an in-memory statestore, we can see significant runtime improvements. I expect that with block cache enabled, we can also see similar improvements, although may not be as high.
Cache Hit
We join ingest 64 records on the update (probe) side, to match all amplified records on the build side. Sample size is N, using in-memory state store. For memory utilization, we measure mem table usage, it's unavoidable because we have to use a barrier to synchronize the sequence of ingestion, so we can make the sequence of join deterministic, rather than interleaved from lhs and rhs of join. But it's fine because we don't expect better memory utilization, since everything is in cache.
Cache hit results show some regression for runtime (~5-6%):
However:
is_append_only
code path if join is non-append-only.Further improvements
We can also check the cache size when inserting the update records into cache. Currently we only do so for the match side, since that has N matches for a single update record. We differ this optimization to later to avoid further complicating this PR.
We may also need to call try_flush more frequently. To ensure records in join state get flushed.
We also need to remove merge chunks triggered OOM.
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.