OOM for sysbench select random limits (Hummock read duration seems abnormal) #13506
Comments
The sysbench workload is a custom one, see https://github.com/risingwavelabs/sysbench/blob/master/src/lua/select_random_limits.lua#L26-L31, i.e. it runs the same SELECT with LIMIT 10 repeatedly.
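For reference, a minimal sketch of what such a workload boils down to, assuming a Postgres-compatible connection to the RisingWave frontend; the table and column names below are placeholders for illustration, not taken from the actual Lua script:

```rust
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // Hypothetical connection string; the real test connects to the RisingWave frontend.
    let (client, connection) =
        tokio_postgres::connect("host=localhost port=4566 user=root dbname=dev", NoTls).await?;
    // The connection object drives the socket I/O and must be polled separately.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // Repeatedly issue the same bounded query, as the Lua workload does.
    // `sbtest1` and `k` are placeholder names.
    for _ in 0..10_000 {
        let rows = client
            .query("SELECT id, k FROM sbtest1 ORDER BY k LIMIT 10", &[])
            .await?;
        assert!(rows.len() <= 10);
    }
    Ok(())
}
```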
Lots of memory is held by Hummock iteration: 1700467603-2023-11-20-08-06-42.auto.heap.collapsed.zip It seems to leak somewhere. This is critical, so let's prioritize it. Might be related to #9732
Since the error is reproducible every time, I suppose it must be caused by some wrong commit between 20231116 and 20231119. Edit: testing earlier commits. Edit:
We also notice that there are no related commits in kube-bench (https://github.com/risingwavelabs/kube-bench/commits/main) in the last two weeks, nor in risingwave-test (https://github.com/risingwavelabs/risingwave-test/commits/main).
The sysbench CN memory configuration is as follows: In principle, the CN should not OOM until 15Gi, but it exceeds 11Gi and then OOMs, which is weird. cc @huangjw806 is trying another configuration to see whether this behavior goes away. However, we note that this is a separate issue: it does not explain why the memory usage reaches 11Gi. Previously, it used only about 6GB at most:
How am I supposed to view the flamegraph? Is this an SVG?
Drop the
I agree with @lmatz. This looks more like an environment change than a code change. The same image failed on different schedules.
It seems that v1.4.0 does not have this issue.
Actually, I think both the env and the kernel are still likely to have an issue.
Since the workload has not changed, this behavior is hard to explain by the env alone; it points to the kernel. However, the
The latency of the successful runs is 150+ ms: While it speeds up to 5 ms for the failed ones: Are there any changes in storage to speed this up? cc @hzxa21
Let's wait and see this run: https://buildkite.com/risingwave-test/sysbench/builds/572#018bfbef-a682-42d5-bc57-36fad5bce88f
It seems to be related to PR #13132, which uses a streaming read to prefetch.
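To illustrate why a prefetching read path could hold much more memory than a plain block-at-a-time read, here is a conceptual sketch only; the names (`BlockSource`, `fetch_block`, `prefetch_depth`) are made up for illustration and are not the actual storage API touched by #13132:

```rust
use std::collections::VecDeque;

struct Block(Vec<u8>);

trait BlockSource {
    fn fetch_block(&mut self, idx: usize) -> Option<Block>;
}

/// On-demand reader: at most one block resident per iterator.
struct OnDemandIter<S: BlockSource> {
    source: S,
    next_idx: usize,
}

impl<S: BlockSource> OnDemandIter<S> {
    fn next_block(&mut self) -> Option<Block> {
        let block = self.source.fetch_block(self.next_idx)?;
        self.next_idx += 1;
        Some(block)
    }
}

/// Streaming prefetch: keeps up to `prefetch_depth` blocks buffered ahead of
/// the consumer; the buffered blocks occupy memory until the iterator is dropped.
struct PrefetchIter<S: BlockSource> {
    source: S,
    next_idx: usize,
    prefetch_depth: usize,
    buffer: VecDeque<Block>,
}

impl<S: BlockSource> PrefetchIter<S> {
    fn next_block(&mut self) -> Option<Block> {
        // Top up the buffer before handing out the next block.
        while self.buffer.len() < self.prefetch_depth {
            match self.source.fetch_block(self.next_idx) {
                Some(b) => {
                    self.buffer.push_back(b);
                    self.next_idx += 1;
                }
                None => break,
            }
        }
        self.buffer.pop_front()
    }
}
```

With a LIMIT 10 query, an on-demand iterator stops after a handful of blocks, while a prefetching iterator may have already buffered far more data than the query ever consumes; multiplied across many concurrent iterators, that could explain the higher memory footprint.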
Let's see if this run without the config can succeed: https://buildkite.com/risingwave-test/sysbench/builds/575
It failed.
https://buildkite.com/risingwave-test/sysbench/builds/576 In this run I've adjusted
Succeeded.
Describe the bug
https://buildkite.com/risingwave-test/sysbench/builds/554#018be963-e823-4393-aad6-38255d87bcdd/1121
Image: 20231119
Dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1700429709000&to=1700430303000&var-namespace=sysbench-daily-test-20231119
The read duration seems abnormal.
The previous tests are all good, e.g. the latest successful one:
Image: 20231116
https://buildkite.com/risingwave-test/sysbench/builds/553#018bd9f0-9c26-4472-a129-e0cb237ecf39
Dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?from=1700169909000&orgId=1&to=1700171403000&var-datasource=P2453400D1763B4D9&var-namespace=sysbench-daily-test-20231116&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All
According to https://github.com/risingwavelabs/rw-commits-history#nightly-20231119, it does not seem to be affected by a PR between 20231116 and 20231119.
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
IIUC, we have implemented a mechanism at the executor level to kill a batch query that risks going OOM.
Therefore, the OOM is unexpected, although its root cause is not necessarily in this mechanism.
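For reference, a minimal sketch of what such an executor-level guard could look like; this illustrates the general technique only, not RisingWave's actual implementation, and the names and limit value below are made up:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Tracks memory attributed to one batch query; shared by all of its executors.
struct QueryMemoryContext {
    used: AtomicUsize,
    limit: usize,
}

#[derive(Debug)]
struct MemoryLimitExceeded;

impl QueryMemoryContext {
    fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { used: AtomicUsize::new(0), limit })
    }

    /// Called by executors before allocating; the query is aborted
    /// (here: an error bubbles up) instead of letting the process OOM.
    fn try_reserve(&self, bytes: usize) -> Result<(), MemoryLimitExceeded> {
        let prev = self.used.fetch_add(bytes, Ordering::Relaxed);
        if prev + bytes > self.limit {
            self.used.fetch_sub(bytes, Ordering::Relaxed);
            return Err(MemoryLimitExceeded);
        }
        Ok(())
    }

    fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::Relaxed);
    }
}

fn main() {
    // Hypothetical 1 GiB per-query budget for illustration.
    let ctx = QueryMemoryContext::new(1 << 30);
    assert!(ctx.try_reserve(512 << 20).is_ok());
    assert!(ctx.try_reserve(768 << 20).is_err()); // would exceed the budget -> query killed
    ctx.release(512 << 20);
}
```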
I saw #13132 merged in 20231115 and wonder if it may have some impact?