perf: source throughput & barrier latency of longevity-test is worse than before #14828

Closed
fuyufjh opened this issue Jan 27, 2024 · 6 comments · Fixed by #14855

fuyufjh (Member) commented Jan 27, 2024

Summarized from Slack thread: https://risingwave-labs.slack.com/archives/C034TRPKN1F/p1706079713814969

Background

The recent runs of the longevity test (all nexmark: 8 sets of nexmark queries, with 10k throughput) have been failing.

In short, we used to pass the test, but recently it sometimes OOMs. This indicates that RisingWave consumes a bit more memory under the same workload.

Observations

1. Barrier latency became higher

As we know, barrier latency is an important cause of OOM: when barriers lag behind, more in-flight data typically stays buffered in memory.

Before: (screenshot: barrier latency)

After: (screenshot: barrier latency)

2. Source throughput became lower

The source throughput went down from 6.21 MB/s to 4.x–5.x MB/s:

  • nightly-20231225 6.18 MB/s (18277 rows/s) BuildKite Grafana
  • nightly-20240122 5.10 MB/s (15109 rows/s) (-18% even when q6-group-top1 was EXCLUDED; see the calculation below) BuildKite Grafana
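
For reference, the -18% quoted above can be reproduced from the two bullets (using only the figures listed there):

```math
\frac{5.10 - 6.18}{6.18} \approx -0.175, \qquad \frac{15109 - 18277}{18277} \approx -0.173
```

i.e. a 17–18% drop whether measured in MB/s or rows/s, consistent with the -18% noted above.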

Code History

nightly-20240107 https://buildkite.com/risingwave-test/longevity-test/builds/891
Source throughput: 6.21 MB/s (18400 rows/s)
Commit: a01a30d

nightly-20240108 https://buildkite.com/risingwave-test/longevity-test/builds/895
Source throughput: 3.57 MB/s (10566 rows/s)
Commit: 414c6ec

Comparing: a01a30d...414c6ec (see next comment)

github-actions bot added this to the release-1.7 milestone on Jan 27, 2024
fuyufjh (Member, Author) commented Jan 27, 2024

Wait, this diff includes feat: embed trace collector & jaeger ui to dev dashboard #14220...

The performance impact caused by #14220 was expected to be eliminated by chore: remove some periodical logs #14427, but it turns out it was not.

nightly-20240109 https://buildkite.com/risingwave-test/longevity-test/builds/897
Source throughput: 4.66 MB/s (13791 rows/s)
Commit: d21d42f (including #14427)

It’s slightly better than 20240108, but still worse than before (6.2 MB/s).


I still suspect distributed tracing. To verify this, I ran a longevity test with nightly-20240125-disable-embedded-tracing (Commit: 95b4cd3).

Sadly, with tracing disabled, the performance is still bad:

  • source throughput: 4.75 MB/s (expected: ~6.2MB/s)
  • barrier latency was high and laggy (see screenshot below)
    BuildKite Grafana

(screenshot: barrier latency)


Summary at this moment:

The problem was introduced between a01a30d...d21d42f, and it is not caused by distributed tracing (assuming it was truly disabled in the last test). (See #14828 (comment))

hzxa21 (Collaborator) commented Jan 29, 2024

When comparing the Grafana metrics between reglngvty-20240107-150238 and reglngvty-20240108-150225, I didn't see significant differences in barrier latency. Somehow, the barrier latency in 0108 was better than in 0107:
reglngvty-20240107-150238: (screenshot: barrier latency)
reglngvty-20240108-150225: (screenshot: barrier latency)

However, the compactor CPU usage and task count were different:
reglngvty-20240107-150238: (screenshots: compactor CPU usage, compaction task count)
reglngvty-20240108-150225: (screenshots: compactor CPU usage, compaction task count)

I didn't see compactor-related changes in a01a30d...414c6ec. I wonder whether there were changes in the test pipeline.

hzxa21 (Collaborator) commented Jan 29, 2024

Maybe related: CN OOMed very quickly in 0108's test but not in 0107's test. A CN OOM can lead to fewer L0 SSTs being generated and thus reduce the compaction load.
reglngvty-20240107-150238: (screenshot)

reglngvty-20240108-150225: (screenshot)

BugenZhao (Member) commented:

> CN OOMed very quickly in 0108's test but not in 0107's test.

Just FYI, this was expected: #14424 (comment)

fuyufjh (Member, Author) commented Jan 29, 2024

> When comparing the Grafana metrics between reglngvty-20240107-150238 and reglngvty-20240108-150225, I didn't see significant differences in barrier latency. Somehow, the barrier latency in 0108 was better than in 0107.

@hzxa21 Yes, today I just ran a longevity test with nightly-20240108-revert-tracing.

So far it has been running for about 2 hours, and both the source throughput and barrier latency look good. Grafana

(screenshot)

This means that the root cause might have been introduced in nightly-20240109, i.e. 414c6ec...d21d42f. (See next comment)

fuyufjh (Member, Author) commented Jan 29, 2024

The previous conclusion is incorrect.

The performance impact is expected for versions between nightly-20240108 (inclusive) and nightly-20240115 (exclusive), because of (B) and (E) below.

nightly-20240108
(A): Accidentally added verbose logs, which could affect performance: feat(cdc): support transaction for shared cdc source 10f7b77
(B): Tracing introduced. Together with the commit above, it will OOM: feat: embed trace collector & jaeger ui to dev dashboard 4f6295a

nightly-20240109
(C): Temporarily fixed the OOM: chore: remove some periodical logs babfdc9
(D): Thoroughly fixed the OOM, with a potential perf overhead from creating more spans. It does no good as long as there is no parsing failure, since (C) already removed the verbose logs: fix(connector): avoid long-lived tracing span fd768da

nightly-20240115
(E): Disabled tracing: feat: enable or disable tracing with system params 240416f

However, the true cause is shadowed by this expected impact, and it was very likely introduced between nightly-20240108 and nightly-20240115. I am not going to do a binary search to find it out.
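
For context on (B) and (D), below is a minimal sketch, my own illustration rather than the actual RisingWave code, of why a long-lived tracing span is risky once a span-buffering collector (such as the embedded one from #14220) is attached, and what a per-chunk span in the spirit of fd768da looks like. It assumes the tracing and tracing-subscriber crates; run_source_with_long_lived_span, run_source_with_per_chunk_span and process_chunk are hypothetical names.

```rust
// Sketch only. Dependencies assumed: tracing = "0.1", tracing-subscriber = "0.3".
use tracing::{info, info_span};

// Stand-in for parsing one chunk pulled from a source.
fn process_chunk(i: usize) {
    info!(chunk = i, "parsed chunk");
}

// Anti-pattern: one span for the entire (effectively unbounded) source stream.
// A collector that buffers data per span cannot release anything for a span
// that never closes, so its footprint grows with the stream.
fn run_source_with_long_lived_span(chunks: usize) {
    let span = info_span!("source_stream");
    let _guard = span.enter();
    for i in 0..chunks {
        process_chunk(i); // every event is attached to the same open span
    }
}

// Alternative: a fresh, short-lived span per chunk. Each span closes at the end
// of its iteration, so whatever a collector holds for it can be flushed or dropped.
fn run_source_with_per_chunk_span(chunks: usize) {
    for i in 0..chunks {
        let span = info_span!("parse_chunk", chunk = i);
        let _guard = span.enter();
        process_chunk(i);
    }
}

fn main() {
    // Plain fmt subscriber here (prints events immediately); in the real issue the
    // subscriber was the embedded trace collector, which keeps span data in memory.
    tracing_subscriber::fmt::init();
    run_source_with_long_lived_span(3);
    run_source_with_per_chunk_span(3);
}
```

(E) then sidesteps the question entirely by allowing tracing to be switched off via a system parameter when it is not needed.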
