perf: source throughput & barrier latency of longevity-test is worse than before #14828
Wait, this diff includes feat: embed trace collector & jaeger ui to dev dashboard #14220... The performance impact caused by #14220 was expected to be eliminated by chore: remove some periodical logs #14427.

❌ nightly-20240109 https://buildkite.com/risingwave-test/longevity-test/builds/897
It's slightly better than 20240108 but still worse than before (6.2 MB/s).

I am still suspecting distributed tracing. To verify it, I ran a longevity test with tracing disabled. Sadly, even with tracing disabled, the performance is still bad:
Summary at this moment:
When comparing the Grafana metrics between the two nightlies, the compactor CPU usage and task number were different. I didn't see compactor-related changes in a01a30d...414c6ec. I wonder whether there were changes in the test pipeline.
Just FYI, this was expected: #14424 (comment)
@hzxa21 Yes, today I just ran a longevity test. So far it has run for about 2 hours, and both the source throughput and barrier latency look good. Grafana
The previous conclusion is incorrect. The performance impact is expected for versions between
However, the true cause is shadowed, and very likely lies between
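Narrowing a regression like this down between two known-good and known-bad nightlies is essentially a binary search over builds. A minimal sketch of that process, with a hypothetical `is_good` predicate standing in for actually kicking off a longevity run and checking its throughput:

```python
def bisect_nightlies(nightlies, is_good):
    """Given a date-sorted list of nightly tags where nightlies[0] is good
    and nightlies[-1] is bad, return (last_good, first_bad).
    Assumes the good->bad transition happens exactly once."""
    lo, hi = 0, len(nightlies) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(nightlies[mid]):
            lo = mid  # regression is after mid
        else:
            hi = mid  # regression is at or before mid
    return nightlies[lo], nightlies[hi]

nightlies = ["nightly-20240107", "nightly-20240108", "nightly-20240109"]
# Hypothetical predicate: only the 0107 build reaches the ~6.2 MB/s baseline.
good_builds = {"nightly-20240107"}
print(bisect_nightlies(nightlies, lambda n: n in good_builds))
# → ('nightly-20240107', 'nightly-20240108')
```

Each probe here is a full longevity run, so the log2 reduction in runs matters far more than it would for an ordinary `git bisect` over cheap unit tests.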
Background
The recent failures of the longevity test all ran the nexmark workload (8 sets of nexmark queries with 10k throughput).
In short, in the past we could pass the test, but recently it sometimes OOMed. This indicates that RW consumes a bit more memory under the same workload.
Observations
1. Barrier latency became higher
As we know, the barrier latency is an important cause of OOM.
Before:
After:
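The link between barrier latency and OOM can be sketched with back-of-envelope arithmetic: data ingested between two checkpoint barriers is held in memory until the barrier completes, so the unflushed in-flight data grows roughly with ingest rate × barrier latency. This is a deliberate simplification (it ignores caches, spill, and operator state), not RisingWave's actual memory model:

```python
def inflight_mb(ingest_mb_per_s: float, barrier_latency_s: float) -> float:
    """Rough upper bound on data buffered between checkpoint barriers:
    ingest rate times barrier latency (hypothetical simplification)."""
    return ingest_mb_per_s * barrier_latency_s

# At ~6 MB/s source throughput, a barrier latency jump from 1 s to 60 s
# means roughly 60x more unflushed data held in memory between checkpoints.
ratio = inflight_mb(6, 60) / inflight_mb(6, 1)
print(ratio)  # → 60.0
```

This is why a sustained rise in barrier latency tends to precede the OOM rather than follow it.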
2. Source throughput became lower
The source throughput went down from 6.21 MB/s to 4.x ~ 5.x MB/s.
nightly-20231225: 6.18 MB/s (18277 rows/s) BuildKite Grafana
nightly-20240122: 5.10 MB/s (15109 rows/s) (-18% even when q6-group-top1 was EXCLUDED) BuildKite Grafana
Code History
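The quoted drop can be sanity-checked from the numbers above; both the rows/s and MB/s figures show a regression of roughly 17-18%:

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new (negative = regression)."""
    return (new - old) / old * 100

# rows/s: 18277 (nightly-20231225) -> 15109 (nightly-20240122)
print(round(pct_change(15109, 18277), 1))  # → -17.3
# MB/s: 6.18 -> 5.10
print(round(pct_change(5.10, 6.18), 1))   # → -17.5
```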
✅ nightly-20240107 https://buildkite.com/risingwave-test/longevity-test/builds/891
Source throughput: 6.21 MB/s (18400 rows/s)
Commit: a01a30d
❌ nightly-20240108 https://buildkite.com/risingwave-test/longevity-test/builds/895
Source throughput: 3.57 MB/s (10566 rows/s)
Commit: 414c6ec
Comparing: a01a30d...414c6ec (see next comment)