nightly-20241106 sysbench perf degradation #19281

cyliu0 · 2024-11-07T01:51:52Z

Describe the bug

+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+
| BENCHMARK NAME                                                | EXECUTION ID | STATUS     | KEY METRICS                                        | FLUCTUATION OF BEST | FLUCTUATION OF LAST 10 DAYS | FLUCTUATION OF LAST EXECUTION |
+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+
| sysbench-select-random-points-medium-1cn                      |        42742 | Negative   | sysbench-qps                                       | -92.41%             | -59.19%                     | -64.07%                       |
| sysbench-select-random-ranges-medium-1cn                      |        42744 | Negative   | sysbench-qps                                       | -91.41%             | -58.47%                     | -63.27%                       |
| nexmark-q105-blackhole-medium-1cn                             |        42752 | Negative   | avg-source-output-rows-per-second                  | -46.23%             | -23.84%                     | -29.06%                       |
+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+

Buildkite Job
Grafana
Metabase Sysbench

The nexmark q105 result also drops, but it has not been stable recently. Metabase Nexmark Q105

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20241106

Additional context

The only pull request for nightly-20241106 is #19080 according to https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20241106

@Li0k PTAL


Li0k · 2024-11-08T06:37:14Z

https://buildkite.com/risingwave-test/sysbench/builds/934#01930a58-97d3-44b1-888f-2d95513f874e

https://risingwave-labs.slack.com/files/U0436BKS2CF/F07V7TVD946/untitled?origin_team=T030LTU38S2&origin_channel=C034TRPKN1F

The system hangs when we set max_prefetch_block_number = 0. This reproduces 100% of the time.

Li0k · 2024-11-11T05:33:12Z

After investigating with many tests, I found that the performance regression is related to cache misses. I concluded that the performance data of this test is unstable for the following reasons.

The test machine has 11 GB of memory; 1 GB is allocated to the block cache, and the low-priority cache gets only 300 MB.
The random-points batch query uses prefetch (no end bound) by default, so each iter operation prefetches 16 blocks, even though the random point select test does not benefit from it.
After the compaction group distribution changed, compaction may run less frequently within each compaction group (cg), and the L0 file distribution differs from before.

All of the above behaviours are unstable, and therefore the performance of this test is unstable. A fairer way would be to fully compact the LSM tree before the read test and adopt a more reasonable prefetch strategy for different OLAP query modes. That said, the current test is useful for finding problems on the compaction/read path.
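To make the prefetch point concrete, here is a minimal toy simulation of uniform random point lookups against an LRU block cache. It is only a sketch under stated assumptions: `LruBlockCache`, `simulate`, the block count, and the cache capacity are all hypothetical and do not model RisingWave's actual block cache or its priority tiers. Only the prefetch window of 16 blocks comes from the discussion above.

```python
import random
from collections import OrderedDict


class LruBlockCache:
    """A tiny LRU cache keyed by block id (toy model, not the real cache)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def lookup(self, block_id):
        """Return True on a hit and mark the block as most recently used."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        return False

    def insert(self, block_id):
        """Insert a block, evicting the least recently used one if full."""
        self.blocks[block_id] = True
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)


def simulate(prefetch_blocks, num_blocks=10_000, capacity_blocks=1_000,
             lookups=20_000, seed=0):
    """Run random point lookups; return (hit_rate, blocks_fetched).

    All sizes are hypothetical and chosen only to keep the run fast.
    """
    rng = random.Random(seed)
    cache = LruBlockCache(capacity_blocks)
    hits = 0
    fetched = 0
    for _ in range(lookups):
        block = rng.randrange(num_blocks)
        if cache.lookup(block):
            hits += 1
            continue
        # Miss: read the wanted block plus the following blocks in the window.
        for b in range(block, min(block + prefetch_blocks, num_blocks)):
            if not cache.lookup(b):
                fetched += 1
                cache.insert(b)
    return hits / lookups, fetched


if __name__ == "__main__":
    for prefetch in (1, 16):
        hit_rate, fetched = simulate(prefetch)
        print(f"prefetch={prefetch:2d} hit_rate={hit_rate:.3f} "
              f"blocks_fetched={fetched}")
```

In this toy model the hit rate under uniform random access depends mainly on how many blocks fit in the cache, so a 16-block prefetch window should leave it roughly unchanged while each miss reads many more blocks. The real system has more moving parts (priority tiers, L0 layout, compaction), so treat this as an illustration of the argument only.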

@hzxa21 @cyliu0 what do you think ?

cyliu0 · 2024-11-11T06:18:13Z

I think we can do both in the future if you can provide something like a warm-up mechanism in the OLTP test. We can have two test pipelines, one with warm-up enabled and one with it disabled.

cc @lmatz

hzxa21 · 2024-11-11T07:48:10Z

> All of the above behaviours are unstable, and therefore the performance of this test is unstable. A fairer way would be to fully compact the LSM tree before the read test and adopt a more reasonable prefetch strategy for different OLAP query modes. That said, the current test is useful for finding problems on the compaction/read path.

+1. IIUC, the current test is meant to catch regressions in random lookups under the assumption that the cache is warmed up, but that assumption no longer holds after the recent compaction strategy changes. I think we can fix the current test so that the assumption always holds by introducing cache warm-up.

However, given that the current test unexpectedly helps us find out some corner cases on the new compaction strategy, I think it is still valuable. For testing the new compaction strategy, do you think we should keep this test or write a new test? cc @Li0k

cyliu0 · 2024-11-11T08:56:47Z

> For testing the new compaction strategy, do you think we should keep this test or write a new test?

I think we can enhance the current pipeline with optional warm-up to achieve both in the current test. But you need to provide the warm-up mechanism first. I remember @Li0k said he can provide a SQL syntax for this.
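A minimal sketch of what an optional warm-up switch in a read-test harness could look like. Everything here is hypothetical: `run_read_test`, `RunResult`, and the replay-the-keys warm-up are illustrative stand-ins, not the actual sysbench pipeline and not the SQL-based warm-up mentioned above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RunResult:
    warm_up: bool    # whether an untimed warm-up pass ran first
    queries: int     # number of timed queries
    seconds: float   # wall-clock time of the timed pass

    @property
    def qps(self) -> float:
        """Queries per second of the timed pass."""
        return self.queries / self.seconds if self.seconds > 0 else float("inf")


def run_read_test(query: Callable[[int], object],
                  keys: Sequence[int],
                  warm_up: bool) -> RunResult:
    """Time one pass of point lookups, optionally after an untimed pass.

    The warm-up here simply replays the same keys (an assumption made for
    illustration); a real pipeline might use a dedicated warm-up command.
    """
    if warm_up:
        for key in keys:
            query(key)  # untimed: only meant to populate caches

    start = time.perf_counter()
    for key in keys:
        query(key)
    elapsed = time.perf_counter() - start
    return RunResult(warm_up=warm_up, queries=len(keys), seconds=elapsed)


if __name__ == "__main__":
    store = {k: k * 2 for k in range(1000)}   # stand-in for a real database
    for flag in (False, True):
        result = run_read_test(store.get, list(store), warm_up=flag)
        print(f"warm_up={result.warm_up} qps={result.qps:.0f}")
```

Running the same workload with the flag on and off would give the two pipelines discussed here: the warm run tracks steady-state lookup regressions, while the cold run keeps exposing compaction/read path issues.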

cyliu0 added type/bug type/perf labels Nov 7, 2024

github-actions bot added this to the release-2.2 milestone Nov 7, 2024

cyliu0 assigned Li0k Nov 7, 2024
