
too aggressive and early cache eviction #15305

Closed
lmatz opened this issue Feb 27, 2024 · 5 comments · Fixed by #16087

@lmatz (Contributor) commented Feb 27, 2024

The eviction at the beginning of both tests is too aggressive: it starts even while plenty of memory is still available.

First reported for TPC-H q20: #14797 (comment)

I will collect a few more examples as I encounter them.

@github-actions github-actions bot added this to the release-1.7 milestone Feb 27, 2024
@fuyufjh (Member) commented Feb 28, 2024

The metrics mentioned in #14797 (comment) are somewhat expected under our current implementation of the LRU memory manager.

The root cause is that the feedback loop between the LRU watermark and the actual memory usage is much slower than 1 second (the default running interval of the memory policy). The memory manager assumes that its feedback from the previous run has already been reflected in the current memory usage, but in fact it has not.

To get rid of this assumption, we have discussed several ideas. For example, we may let each streaming executor report its LRU cache memory usage by epoch, so that the memory manager can determine the best LRU watermark accordingly.
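
To illustrate that idea, here is a minimal Rust sketch (hypothetical names, not the actual RisingWave code): if executors report how many bytes each epoch currently holds, the memory manager could pick the watermark directly instead of waiting for the slow feedback loop to converge.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: given per-epoch cache sizes reported by the executors
/// (BTreeMap iterates from the oldest epoch to the newest) and the number of
/// bytes the memory manager wants to free, pick the smallest watermark epoch
/// whose eviction would release at least that many bytes.
fn pick_watermark(per_epoch_bytes: &BTreeMap<u64, usize>, bytes_to_free: usize) -> Option<u64> {
    let mut freed = 0usize;
    for (&epoch, &bytes) in per_epoch_bytes {
        freed += bytes;
        if freed >= bytes_to_free {
            // Evicting everything up to and including `epoch` frees enough.
            return Some(epoch);
        }
    }
    // Even evicting all reported epochs would not free enough bytes.
    per_epoch_bytes.keys().next_back().copied()
}

fn main() {
    let mut usage = BTreeMap::new();
    usage.insert(100u64, 64usize << 20); // epoch -> bytes held by entries of that epoch
    usage.insert(101, 32 << 20);
    usage.insert(102, 128 << 20);
    // To free ~80 MiB, the watermark lands on epoch 101 (64 MiB + 32 MiB >= 80 MiB).
    println!("{:?}", pick_watermark(&usage, 80 << 20));
}
```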

@lmatz (Contributor, Author) commented Feb 29, 2024

TPC-H q4: #14811 (comment) shows a very similar observation.

Edit:
q4 is not actually affected by the eviction, though, since it does not need a very big cache and its cache miss ops/rate is very low.

But q20 is affected, as shown in #14797 (comment).

@lmatz (Contributor, Author) commented Mar 18, 2024

TPC-H q17: #14799 (comment) also shows a very similar observation.

@lmatz (Contributor, Author) commented Mar 25, 2024

Reposting the observations and experiments done by @MrCroxx for better visibility:

output.xlsx

There are three observations:

1. Uneven amounts of data across epochs at the moment an epoch gets evicted.
   [screenshot: SCR-20240325-ipl]
2. A single eviction spanning multiple epochs where the amount of data evicted is huge.
   [screenshot: SCR-20240325-iq5]
3. A single eviction of a single epoch where the amount of data evicted is huge.
   [screenshot: SCR-20240325-ir3]

For phenomena 1 and 2, there was a speculation that each time an entry is accessed, its epoch is bumped, so the amount of data belonging to older epochs gradually shrinks. Evicting those older epochs therefore frees little memory at first; after evicting several epochs and finding that memory did not decrease, a more aggressive, exponential advancing of epochs begins, which ultimately evicts epochs holding a large amount of newer data.

This speculation may explain the issue when the barrier frequency is high. However, for phenomenon 3 observed in the 10-second barrier experiment mentioned above, it seems more like there are too few epochs to adequately partition the eviction scope.
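
For illustration, here is a minimal Rust sketch of the speculated behavior (hypothetical names and structure, not the actual eviction code): a policy that keeps doubling its epoch step while the observed memory usage stays above the target. Because the observed usage lags behind the evictions already issued, the step can grow until a single advance sweeps away epochs that still hold a large amount of recent data.

```rust
/// Hypothetical sketch of the speculated exponential watermark advance.
struct WatermarkPolicy {
    watermark_epoch: u64,
    step: u64,
}

impl WatermarkPolicy {
    fn new(start_epoch: u64) -> Self {
        Self { watermark_epoch: start_epoch, step: 1 }
    }

    /// Called once per tick (e.g. every second) with the currently observed
    /// memory usage. Entries older than the returned watermark are evictable.
    fn tick(&mut self, used_bytes: usize, target_bytes: usize, latest_epoch: u64) -> u64 {
        if used_bytes > target_bytes {
            // Usage still above target: advance the watermark and double the
            // step. Since real eviction lags behind the tick interval, this
            // can overshoot and evict epochs full of recent data.
            self.watermark_epoch = (self.watermark_epoch + self.step).min(latest_epoch);
            self.step = self.step.saturating_mul(2);
        } else {
            // Usage is back under target: stop advancing and reset the step.
            self.step = 1;
        }
        self.watermark_epoch
    }
}

fn main() {
    let mut policy = WatermarkPolicy::new(100);
    // Usage stays above a 512 MiB target for three ticks:
    // the watermark advances by 1, then 2, then 4 epochs.
    for used_mib in [900usize, 850, 820] {
        println!("watermark = {}", policy.tick(used_mib << 20, 512 << 20, 200));
    }
}
```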

In the future, I will try other strategies, such as combining epochs with quotas, using quotas alone without epochs, or exploring alternative approaches.

@MrCroxx MrCroxx self-assigned this Mar 25, 2024
@MrCroxx (Contributor) commented Mar 26, 2024

The evicted bytes per epoch (summed over all LRUs, 1s barrier):

https://1drv.ms/x/s!AiJJmrmsw_N2mxGoFZbGAYSRGstJ?e=B76XvA

[chart: evicted bytes per epoch]
