perf: source throughput & barrier latency of longevity-test is worse than before #14828

Closed
fuyufjh opened this issue Jan 27, 2024 · 6 comments · Fixed by #14855

fuyufjh (Member) commented Jan 27, 2024

Summarized from Slack thread: https://risingwave-labs.slack.com/archives/C034TRPKN1F/p1706079713814969

Background

The recent runs of the longevity test (all nexmark: 8 sets of nexmark queries, with 10k throughput) have been failing.

In short, we used to pass the test, but recently it sometimes OOMs. This indicates that RisingWave consumes a bit more memory under the same workload.

Observations

1. Barrier latency became higher

As we know, barrier latency is an important cause of OOM: when barriers lag behind, more in-flight data typically stays buffered in memory.

Before: (screenshot: barrier latency)

After: (screenshot: barrier latency)

2. Source throughput became lower

The source throughput went down from 6.21 MB/s to 4.x–5.x MB/s:

  • nightly-20231225 6.18 MB/s (18277 rows/s) BuildKite Grafana
  • nightly-20240122 5.10 MB/s (15109 rows/s) (-18% even when q6-group-top1 was EXCLUDED; see the calculation below) BuildKite Grafana
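
For reference, the -18% quoted above can be reproduced from the two bullets (using only the figures listed there):

```math
\frac{5.10 - 6.18}{6.18} \approx -0.175, \qquad \frac{15109 - 18277}{18277} \approx -0.173
```

i.e. a 17–18% drop whether measured in MB/s or rows/s, consistent with the -18% noted above.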

Code History

nightly-20240107 https://buildkite.com/risingwave-test/longevity-test/builds/891
Source throughput: 6.21 MB/s (18400 rows/s)
Commit: a01a30d

nightly-20240108 https://buildkite.com/risingwave-test/longevity-test/builds/895
Source throughput: 3.57 MB/s (10566 rows/s)
Commit: 414c6ec

Comparing: a01a30d...414c6ec (see next comment)

github-actions bot added this to the release-1.7 milestone on Jan 27, 2024
fuyufjh (Member, Author) commented Jan 27, 2024

Wait, this diff includes feat: embed trace collector & jaeger ui to dev dashboard #14220...

The performance impact caused by #14220 was expected to be eliminated by chore: remove some periodical logs #14427, but it turns out it was not.

nightly-20240109 https://buildkite.com/risingwave-test/longevity-test/builds/897
Source throughput: 4.66 MB/s (13791 rows/s)
Commit: d21d42f (including #14427)

It’s slightly better than 20240108, but still worse than before (6.2 MB/s).


I still suspect distributed tracing. To verify this, I ran a longevity test with nightly-20240125-disable-embedded-tracing (Commit: 95b4cd3).

Sadly, with tracing disabled, the performance is still bad:

  • source throughput: 4.75 MB/s (expected: ~6.2MB/s)
  • barrier latency was high and laggy (see screenshot below)
    BuildKite Grafana

(screenshot: barrier latency)


Summary at this moment:

The problem was introduced between a01a30d...d21d42f, and it is not caused by distributed tracing (assuming it was truly disabled in the last test). (See #14828 (comment))

hzxa21 (Collaborator) commented Jan 29, 2024

When comparing the Grafana metrics between reglngvty-20240107-150238 and reglngvty-20240108-150225, I didn't see significant differences in barrier latency. Somehow, the barrier latency in 0108 was better than in 0107:
reglngvty-20240107-150238: (screenshot: barrier latency)
reglngvty-20240108-150225: (screenshot: barrier latency)

However, the compactor CPU usage and task count were different:
reglngvty-20240107-150238: (screenshots: compactor CPU usage, compaction task count)
reglngvty-20240108-150225: (screenshots: compactor CPU usage, compaction task count)

I didn't see compactor-related changes in a01a30d...414c6ec. I wonder whether there were changes in the test pipeline.

hzxa21 (Collaborator) commented Jan 29, 2024

Maybe related: CN OOMed very quickly in 0108's test but not in 0107's test. A CN OOM can lead to fewer L0 SSTs being generated and thus reduce the compaction load.
reglngvty-20240107-150238: (screenshot)

reglngvty-20240108-150225: (screenshot)

BugenZhao (Member) commented:

> CN OOMed very quickly in 0108's test but not in 0107's test.

Just FYI, this was expected: #14424 (comment)

fuyufjh (Member, Author) commented Jan 29, 2024

> When comparing the Grafana metrics between reglngvty-20240107-150238 and reglngvty-20240108-150225, I didn't see significant differences in barrier latency. Somehow, the barrier latency in 0108 was better than in 0107.

@hzxa21 Yes, today I just ran a longevity test with nightly-20240108-revert-tracing.

So far it has been running for about 2 hours, and both the source throughput and barrier latency look good. Grafana

(screenshot)

This means that the root cause might have been introduced in nightly-20240109, i.e. 414c6ec...d21d42f. (See next comment)

fuyufjh (Member, Author) commented Jan 29, 2024

The previous conclusion is incorrect.

The performance impact is expected for versions between nightly-20240108 (inclusive) and nightly-20240115 (exclusive), because of (B) and (E) below.

nightly-20240108
(A): Accidentally added verbose logs, which could affect performance: feat(cdc): support transaction for shared cdc source 10f7b77
(B): Tracing introduced. Together with the commit above, it will OOM: feat: embed trace collector & jaeger ui to dev dashboard 4f6295a

nightly-20240109
(C): Temporarily fixed the OOM: chore: remove some periodical logs babfdc9
(D): Thoroughly fixed the OOM, with a potential perf overhead from creating more spans. It does no good as long as there is no parsing failure, since (C) already removed the verbose logs: fix(connector): avoid long-lived tracing span fd768da

nightly-20240115
(E): Disabled tracing: feat: enable or disable tracing with system params 240416f

However, the true cause is shadowed by this expected impact, and it was very likely introduced between nightly-20240108 and nightly-20240115. I am not going to do a binary search to find it out.
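
For context on (B) and (D), below is a minimal sketch, my own illustration rather than the actual RisingWave code, of why a long-lived tracing span is risky once a span-buffering collector (such as the embedded one from #14220) is attached, and what a per-chunk span in the spirit of fd768da looks like. It assumes the tracing and tracing-subscriber crates; run_source_with_long_lived_span, run_source_with_per_chunk_span and process_chunk are hypothetical names.

```rust
// Sketch only. Dependencies assumed: tracing = "0.1", tracing-subscriber = "0.3".
use tracing::{info, info_span};

// Stand-in for parsing one chunk pulled from a source.
fn process_chunk(i: usize) {
    info!(chunk = i, "parsed chunk");
}

// Anti-pattern: one span for the entire (effectively unbounded) source stream.
// A collector that buffers data per span cannot release anything for a span
// that never closes, so its footprint grows with the stream.
fn run_source_with_long_lived_span(chunks: usize) {
    let span = info_span!("source_stream");
    let _guard = span.enter();
    for i in 0..chunks {
        process_chunk(i); // every event is attached to the same open span
    }
}

// Alternative: a fresh, short-lived span per chunk. Each span closes at the end
// of its iteration, so whatever a collector holds for it can be flushed or dropped.
fn run_source_with_per_chunk_span(chunks: usize) {
    for i in 0..chunks {
        let span = info_span!("parse_chunk", chunk = i);
        let _guard = span.enter();
        process_chunk(i);
    }
}

fn main() {
    // Plain fmt subscriber here (prints events immediately); in the real issue the
    // subscriber was the embedded trace collector, which keeps span data in memory.
    tracing_subscriber::fmt::init();
    run_source_with_long_lived_span(3);
    run_source_with_per_chunk_span(3);
}
```

(E) then sidesteps the question entirely by allowing tracing to be switched off via a system parameter when it is not needed.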
