nexmark q19 and q20 blackhole performance degradation on affinity testbed between 1130-1203 #13773

Closed
cyliu0 opened this issue Dec 4, 2023 · 10 comments

Comments

@cyliu0
Collaborator

cyliu0 commented Dec 4, 2023

Describe the bug

The daily performance test shows we have performance degradation on nexmark q19 and q20 with affinity testbed. https://buildkite.com/risingwave-test/nexmark-benchmark/builds/2583#018c306a-745d-4244-9b94-8a9234f74a42

http://metabase.risingwave-cloud.xyz/question/7039-nexmark-avg-source-throughput?start_date=2023-11-01&workload=q20-blackhole
image

+-----------------------------------------------------------------+--------------+------------+---------------------+-----------------------------+
| BENCHMARK NAME                                                  | EXECUTION ID | STATUS     | FLUCTUATION OF BEST | FLUCTUATION OF LAST 10 DAYS |
+-----------------------------------------------------------------+--------------+------------+---------------------+-----------------------------+
| nexmark-q19-blackhole-medium-1cn-affinity                       |        15919 | Negative   | -18.25%             | -13.15%                     |
| nexmark-q20-blackhole-medium-1cn-affinity                       |        15927 | Negative   | -29.25%             | -22.28%                     |
+-----------------------------------------------------------------+--------------+------------+---------------------+-----------------------------+
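
For readers unfamiliar with the two fluctuation columns above, figures like these are presumably relative changes in throughput against a baseline. The exact formula used by the benchmark pipeline is not stated in this issue, so the following is only a minimal sketch under assumptions: the function names, the -5% threshold, and the sample numbers are all hypothetical, and the "last 10 days" baseline is assumed to be a plain average.

```python
from statistics import mean


def fluctuation(current: float, baseline: float) -> float:
    """Relative change of `current` against `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) * 100.0 / baseline


def classify(current: float, history: list[float], threshold_pct: float = -5.0):
    """Compare today's throughput with the best and the recent average.

    `history` holds earlier daily throughputs, oldest first.
    The threshold is a hypothetical cut-off, not the pipeline's real one.
    """
    vs_best = fluctuation(current, max(history))
    vs_recent = fluctuation(current, mean(history[-10:]))
    status = "Negative" if vs_recent < threshold_pct else "OK"
    return status, vs_best, vs_recent


# Hypothetical numbers: ten days at 100 units/s, then a drop to 70.
print(classify(70.0, [100.0] * 10))
```

Under this reading, a value such as -29.25% would mean the run was roughly 29% slower than the best recorded run of the same benchmark.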

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20231203

Additional context

No response

@cyliu0 cyliu0 added the type/bug Something isn't working label Dec 4, 2023
@github-actions github-actions bot added this to the release-1.5 milestone Dec 4, 2023
@cyliu0 cyliu0 changed the title from "nexmark q19 and q20 blackhole performance degradation" to "nexmark q19 and q20 blackhole performance degradation on affinity testbed" Dec 4, 2023
@xxchan xxchan changed the title from "nexmark q19 and q20 blackhole performance degradation on affinity testbed" to "nexmark q19 and q20 blackhole performance degradation on affinity testbed between 1130-1203" Dec 4, 2023
@xxchan
Member

xxchan commented Dec 4, 2023

Commit history:

## nightly-20231203
- `0b6184d8fce4275e9a81a2d66c171c20d2b7a529` [doc(docker): add comments for minio disk full issue (#13768)](https://github.com/risingwavelabs/risingwave/pull/13768)

## nightly-20231202
- `ab011eb0e58b8637e9d53f70fd7221a49fcd9e75` [feat(jni): pass stream chunk directly without serde (#13430)](https://github.com/risingwavelabs/risingwave/pull/13430)
- `b149c67b46a7b42f253f7f09c1c69d9184fa7797` [refactor(error): simplify all boxed error wrapper definition (#13725)](https://github.com/risingwavelabs/risingwave/pull/13725)
- `7677abcb6fa87ea3f667ae2e111650296fdd8e97` [fix(cdc-backfill): ensure snapshot read starts after source (#13663)](https://github.com/risingwavelabs/risingwave/pull/13663)

## nightly-20231201
- `fce66c0b06fe27c30a89d9809ebab102b2d68e0e` [test: add vector demo to integration tests (#13753)](https://github.com/risingwavelabs/risingwave/pull/13753)
- `61d364cbea2d9e3687ce358c9f630c72bedd4494` [feat: check-in `cargo dylint` with `format_error` lint (#13750)](https://github.com/risingwavelabs/risingwave/pull/13750)
- `26bd9ef630fd360717be23535b39d07949be9570` [feat(storage): optimize data alignment for default compaction group (#13075)](https://github.com/risingwavelabs/risingwave/pull/13075)
- `0bd10c4f67746075b83913011e86a1eb8fb475e7` [fix(udf): fix decimal values (#11839)](https://github.com/risingwavelabs/risingwave/pull/11839)
- `da79ff5ba64362cb2ee07f3114040fa00e8e62c4` [chore: Adjust default large query memory (#13686)](https://github.com/risingwavelabs/risingwave/pull/13686)
- `67c2d70523fd4645c158428a11825e5e4926de7d` [feat(expr): distributed make timestamptz (#13647)](https://github.com/risingwavelabs/risingwave/pull/13647)

## nightly-20231130
- ce0121fc7cf30251b5bc58065d1ed5f09f4512ab ...

@xxchan
Member

xxchan commented Dec 4, 2023

Compactor CPU usage increased, and compute CPU usage decreased.
image
image

Is it possible that #13075 caused this? cc @Li0k

@fuyufjh
Member

fuyufjh commented Dec 4, 2023

@Li0k
Contributor

Li0k commented Dec 4, 2023

Yes, I guess it has negative effects. Do I need to revert it immediately?

@huangjw806
Contributor

> Yes, I guess it has negative effects. Do I need to revert it immediately?

If you can't fix it quickly, just revert it. Fortunately, it's not in this release.

@fuyufjh fuyufjh modified the milestones: release-1.5, release-1.6 Dec 4, 2023
@Li0k
Contributor

Li0k commented Dec 4, 2023

I have a couple of preliminary conclusions:

  1. feat(storage): optimize data alignment for default compaction group #13075 uses a wrong SST switch condition, which causes SSTs to become too large.
image
  2. fix(compaction): fast compact may not cut sst if not meet point key #13690 introduces a stricter fast compact condition, which requires more CPU for compaction.
  • 1130
image image
  • 1203
image image

I'm running some tests and expect to have this fixed today.

test: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1701678218000&to=1701680021000&var-namespace=li0k-nexmark-q20-debug-fix-align

image image

@cyliu0
Collaborator Author

cyliu0 commented Dec 5, 2023

Is the PR in nightly-20231204? According to the test result, the perf is not totally recovered.
image

@Li0k
Contributor

Li0k commented Dec 5, 2023

> Is the PR in nightly-20231204? According to the test result, the perf is not totally recovered.

Unfortunately. Should I run nexmark manually?

@Li0k
Contributor

Li0k commented Dec 5, 2023

Compare affinity

CPU

before

image

after

image

throughput

before

image

after

image

@cyliu0
Collaborator Author

cyliu0 commented Dec 6, 2023

According to the test result, it's fixed by #13790. However, the daily perf test found some new perf degradations: #13821.
image
