nightly-20241106 sysbench perf degradation #19281

Open
cyliu0 opened this issue Nov 7, 2024 · 5 comments

@cyliu0
Collaborator

cyliu0 commented Nov 7, 2024

Describe the bug

+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+
| BENCHMARK NAME                                                | EXECUTION ID | STATUS     | KEY METRICS                                        | FLUCTUATION OF BEST | FLUCTUATION OF LAST 10 DAYS | FLUCTUATION OF LAST EXECUTION |
+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+
| sysbench-select-random-points-medium-1cn                      |        42742 | Negative   | sysbench-qps                                       | -92.41%             | -59.19%                     | -64.07%                       |
| sysbench-select-random-ranges-medium-1cn                      |        42744 | Negative   | sysbench-qps                                       | -91.41%             | -58.47%                     | -63.27%                       |
| nexmark-q105-blackhole-medium-1cn                             |        42752 | Negative   | avg-source-output-rows-per-second                  | -46.23%             | -23.84%                     | -29.06%                       |
+---------------------------------------------------------------+--------------+------------+----------------------------------------------------+---------------------+-----------------------------+-------------------------------+

Buildkite Job
Grafana
Metabase Sysbench

Nexmark q105 also drops, but it has not been stable recently. Metabase Nexmark Q105
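
To put the percentages in the table in perspective, here is a small sketch that converts a relative change into the fraction of throughput remaining and the implied slowdown factor. It assumes each fluctuation is computed as (current - reference) / reference, which the report does not state explicitly, and the helper name `slowdown_from_fluctuation` is made up for illustration.

```python
def slowdown_from_fluctuation(pct_change: float) -> tuple[float, float]:
    """Return (remaining_fraction, slowdown_factor) for a relative change in percent.

    Assumption: pct_change = (current - reference) / reference * 100.
    """
    remaining = 1.0 + pct_change / 100.0
    if remaining <= 0.0:
        raise ValueError("a drop of 100% or more leaves no throughput")
    return remaining, 1.0 / remaining


# Values copied from the "FLUCTUATION OF BEST" column of the table above.
for name, pct in [
    ("sysbench-select-random-points-medium-1cn", -92.41),
    ("sysbench-select-random-ranges-medium-1cn", -91.41),
    ("nexmark-q105-blackhole-medium-1cn", -46.23),
]:
    remaining, factor = slowdown_from_fluctuation(pct)
    print(f"{name}: {remaining:.1%} of reference, about {factor:.1f}x slower")
```

Under that assumption, -92.41% means the run reached about 7.6% of the best recorded QPS, roughly a 13x slowdown.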

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20241106

Additional context

The only pull request for nightly-20241106 is #19080 according to https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20241106

@Li0k PTAL

@cyliu0 cyliu0 added type/bug Something isn't working type/perf labels Nov 7, 2024
@github-actions github-actions bot added this to the release-2.2 milestone Nov 7, 2024
@Li0k
Contributor

Li0k commented Nov 8, 2024

@Li0k
Contributor

Li0k commented Nov 11, 2024

After running a lot of tests, I found that the performance regression is related to cache misses, and I concluded that the results of this test are unstable for the following reasons.

  1. The test machine has 11 GB of memory, the block cache is allocated 1 GB, and the low-priority cache is only 300 MB.
  2. Random point batch queries use prefetch (no end bound) by default, so each iter operation prefetches 16 blocks (but the random point select test does not benefit from it).
  3. After changing the compaction group distribution, the compaction frequency within each compaction group (cg) may decrease, and the L0 file distribution differs from the previous one.

All of the above behaviours are unstable, and therefore the performance of this test is unstable. A fairer approach would be to fully compact the LSM tree before the read test and adopt a more reasonable prefetch strategy for different OLAP query modes, but the current test is still useful for uncovering problems on the compaction/read path.
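
As a rough illustration of point 2, the following toy model counts how many blocks are read from storage when uniformly random point lookups hit a small LRU block cache, with and without a 16-block prefetch. It is only a sketch: the cache, the block counts, and the `simulate` function are assumptions made up for this example and do not reflect RisingWave's actual block cache or prefetch implementation.

```python
import random
from collections import OrderedDict


def simulate(prefetch_blocks: int, *, total_blocks: int = 100_000,
             cache_blocks: int = 1_000, lookups: int = 20_000,
             seed: int = 0) -> dict:
    """Toy LRU block cache serving uniformly random point lookups.

    On a miss, the requested block plus the following (prefetch_blocks - 1)
    blocks are read from storage and inserted into the cache.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # block id -> None, ordered from LRU to MRU
    hits = 0
    fetched = 0

    def insert(block: int) -> None:
        cache[block] = None
        cache.move_to_end(block)
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block

    for _ in range(lookups):
        block = rng.randrange(total_blocks)
        if block in cache:
            hits += 1
            cache.move_to_end(block)
            continue
        last = min(block + prefetch_blocks, total_blocks)
        for b in range(block, last):
            if b not in cache:
                fetched += 1  # this block has to be read from storage
            insert(b)
    return {"hit_rate": hits / lookups, "blocks_fetched": fetched}


if __name__ == "__main__":
    for width in (1, 16):
        print(f"prefetch={width}:", simulate(width))
```

In this model the hit rate for the requested block should stay about the same for both settings, because the lookups are uniform, while the number of blocks read from storage grows roughly with the prefetch width. That matches the observation that random point selects pay for the prefetch without benefiting from it.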

@hzxa21 @cyliu0 what do you think?

@cyliu0
Collaborator Author

cyliu0 commented Nov 11, 2024

I think we can do both in the future if you can provide something like a warm-up mechanism in the OLTP test. We can have two test pipelines, one with warm-up enabled and one with it disabled.

cc @lmatz

@hzxa21
Collaborator

hzxa21 commented Nov 11, 2024

All of the above behaviours are unstable, and therefore the performance of this test is unstable. A fairer approach would be to fully compact the LSM tree before the read test and adopt a more reasonable prefetch strategy for different OLAP query modes, but the current test is still useful for uncovering problems on the compaction/read path.

+1. IIUC, the current test is meant to catch regressions in random lookups under the assumption that the cache is warmed up, but that assumption no longer holds after the recent compaction strategy changes. I think we can fix the current test so that the assumption always holds by introducing cache warm-up.

However, given that the current test unexpectedly helps us find out some corner cases on the new compaction strategy, I think it is still valuable. For testing the new compaction strategy, do you think we should keep this test or write a new test? cc @Li0k

@cyliu0
Collaborator Author

cyliu0 commented Nov 11, 2024

For testing the new compaction strategy, do you think we should keep this test or write a new test?

I think we can enhance the current pipeline with optional warm-up to achieve both in the current test. But you need to provide the warm-up mechanism first. I remember @Li0k said he can provide a SQL syntax for this.
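
One possible shape for the optional warm-up in the pipeline, as a minimal sketch: `run_read_benchmark`, `execute_query`, and the `warm_up` flag are hypothetical names, and warm-up here simply means replaying the read workload once without timing it. It does not represent the SQL syntax mentioned above, which has not been defined in this thread.

```python
import time
from typing import Callable, Iterable


def run_read_benchmark(queries: Iterable[str],
                       execute_query: Callable[[str], object],
                       warm_up: bool) -> float:
    """Return queries per second for the measured pass.

    When warm_up is True, the whole workload is replayed once, untimed,
    before the measured pass so that caches are already populated.
    """
    workload = list(queries)
    if warm_up:
        for q in workload:
            execute_query(q)  # untimed pass, results are discarded
    start = time.perf_counter()
    for q in workload:
        execute_query(q)
    elapsed = time.perf_counter() - start
    return len(workload) / elapsed if elapsed > 0 else float("inf")
```

Running the same workload twice, once with `warm_up=True` and once with `warm_up=False`, would give the two pipelines discussed above: one that checks lookup regressions with a warm cache and one that keeps exposing cold-cache and compaction effects.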
