v1.7.0-rc/nightly-20240201 source throughput down to 0 with non-shared PG CDC sources #14943
Comments
Update: reverting #14899 also reproduces the problem. Investigating other PRs in the list.
I found that the stream query in the passed job generated much less data than in the failed jobs, and there is join amplification in the failed jobs. I suspected the workload had changed, so I reran the pipeline with nightly-20240131; it also experienced barrier pile-up like the failed jobs. So I think the pipeline failure is not caused by the code change. cc @lmatz if you have other information.
Thanks for the findings. Let us check whether there are any changes on the pipeline side.
Recently, I added the ch-benchmark q3 back to the pipeline. q3 had been removed since #12777, so I think it's still a problem?
Reran the queries, except q3, with v1.7.0-rc-1 and it passed: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/199
Hit again for v1.7.0-rc-1 with the following queries: CH_BENCHMARK_QUERY="q1,q2,q4,q5,q6,q9,q10,q11,q12,q13,q14,q15,q16,q17,q18,q19,q20,q22" https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/201
Hit again with nightly-20240207 |
Why is the compaction duration so high?
The symptom is the same as the conclusion in #14943 (comment): join amplification causes barriers to pile up and backpressures the source.
Remove it from blockers for now.
Join amplification is expected, as it is determined by the nature of the query and the data.
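To make the point concrete: in an inner join, each key's contribution to the output is the product of its match counts on the two sides, so a single hot key can blow up the output cardinality far beyond the input sizes. A minimal sketch of this arithmetic (hypothetical key distributions, not the actual ch-benchmark data):

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Estimate inner-join output size: sum over keys of
    (left match count) * (right match count)."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    return sum(n * rc[k] for k, n in lc.items())

# 1,000 left rows and 1,000 right rows sharing one hot key:
# 2,000 input rows produce 1,000,000 output rows.
print(join_output_rows([42] * 1_000, [42] * 1_000))  # 1000000
```

This is why the same query can pass or fail depending on the data: the amplification factor is a property of the key distributions, not of the engine.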
Ping, any updates?
@cyliu0 could you run it one more time but with more resources? I think the point of this test here is just to make sure that
Hitting this while running with bigger memory on nightly-20240408
https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=ch-benchmark-pg-cdc-pipeline&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1712630508369&to=1712632212485
It's caused by ch-benchmark q3 in this case. @StrikeW shall we keep this issue for future enhancements, or close it now?
Optimizing the query performance should be tracked by another issue. Let's close this one.
The issue still exists with nightly-20240507. Which issue covers this right now? @StrikeW
There is no new issue for the performance problem. The source is backpressured; could you confirm that the CN is configured with 16GB of memory? It seems the bottleneck is in the state store due to the number of L0 files, which leads to higher sync duration.
I think this is a performance issue rather than a functionality bug, so I suggest creating a new issue for it and posting it to the perf working group.
Describe the bug
Run ch-benchmark with non-shared PG CDC sources with v1.7.0-rc/nightly-20240201
v1.7.0-rc: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/187
nightly-20240201: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/188
The buildkite pipeline jobs failed the data consistency check before they completed the data sync, because the data consistency check starts after the source throughput has been at 0 for 60 seconds.
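The trigger condition described above ("throughput at 0 for 60 seconds") can be sketched as a simple check over the most recent per-second throughput samples. This is an illustrative model of the pipeline's trigger, not its actual implementation; the function name and sampling interval are assumptions:

```python
def consistency_check_should_start(samples, zero_window=60):
    """samples: per-second source throughput readings, newest last.
    Returns True once the most recent `zero_window` samples are all zero,
    i.e. the source has been idle for the whole window."""
    if len(samples) < zero_window:
        return False
    return all(s == 0 for s in samples[-zero_window:])

print(consistency_check_should_start([100] * 30 + [0] * 60))  # True
print(consistency_check_should_start([100] * 30 + [0] * 59))  # False
```

Under this model, a source that is merely backpressured to zero (rather than finished) satisfies the trigger just as well, which is why the check fires before the data sync completes.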
Grafana
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
v1.7.0-rc
nightly-20240201
Additional context
nightly-20240201
752b5e0205cea397c51645cd39f12f3f24a5f06a refactor(meta): handle all stream client request in stream rpc manager (#14844)
725c6b05c62ac99001ab549c63fa8c4603752046 test(connector): add debezium-mysql compatible test (#14891)
4c70f3dfd1f69174476bd3d08d48a708dd954f13 test(connector): add debezium-pg compatible test (#14862)
9749e4c46f57e5c9738abfafe06fd6f8c0a1b8e0 fix: initialize backup system parameter for in memory mode (#14921)
2437c082cdf26eecc9bd0229afc79335e7369a28 test: add ruby client test (#14859)
ded73afefa465108d3a2baf559ec8380e4b4eda5 fix(batch): fix sequential exchange (#14924)
6b3a3193b673dc24508eccaa0f8054db6fb716ab feat(telemetry): add `streaming_job_count` to meta telemetry report (#14878)
f621acb7af95db48780f0f7e6438c1a8315cdb24 fix: using `alter source add column` on a table with conn will panic (#14922)
822190b1f3ca253b7eca31a2a9c47eda1d8f8827 chore: update icelake (#14920)
b43a7b9e0d960f6a255c6559d2b0239cf49203d1 feat: merge config for auto scale & enable auto scale (#14873)
096570d106ee275244e2de3ff7df4a8705db6832 refactor(connector): avoid anyhow in `AccessError` and avoid using `RwError` if possible (#14874)
5d43604d3e703fa0ed6594deb116ef2e6c660469 fix(storage): assert no read version in storage reset (#14876)
fff9b79ba71487423291a1d48cb45994e15e2fa0 chore(deps): bump rquickjs (#14897)
7834019899a0c4d86e0daa5101a42a91d003fe7d feat(log-store): add a vnode col in log store pk (#14599)
26297e583a44d59db8d2ed720e8420073b8dcb24 chore: fix clippy (#14916)
c71efbb9346b83f18426f4535648b8e6a9ef6b58 chore: Replace AUTO with ADAPTIVE for parallelism mode and table behavior (#14414)
309097204ccf67dbef9e2800077b463f5579a909 refactor(over window): simplify `find_affected_ranges` and correct some corner cases (#14580)
344cf990870e9c73755f12021f9baaaf734a98d3 feat: support sql func encrypt/decrypt (#14717)
d22423077539b98e557814dcf1fb4f10d3b74e3a chore(dashboard): bump versions: Node.js to v20, Next.js to 14, typescript to 5.3.3 (#14913)
20fc87999a65c831e24006fff893604589e49547 feat(grafana): improve actor info & source throughput panels (#14870)
32762399011b9ce332285e355899d0ede3b2c55f feat: support scaling in sql backend (#14757)
6acb999a32f22881bc22c74f6162256f43232eb6 chore: update cherry pick version to release-1.7 (#14912)
73907050ca15dd1c98f6a342f793f6d96ada29cc refactor: replace ctor by linkme (#14814)
82d12773b924153bfc4c747e91791af939edbf02 feat(expr): add support for make_date/time/timestamp (#14827)
9c874fe91c6f1b6c8301a94667e8c185337a2dc5 chore: enable transactional cdc for mysql and postgres by default (#14899)
b65afa37f21c39d9fff512f030d02a0dde343f9a feat(cdc): support disable cdc backfill and only consumes from latest changelog (#14718)
d812316f1982edeab2527c07af418e08b3d8e8d0 refactor: minor refactor on fragment graph building (#14560)
c2228ec862766df8364a914c796b5b660ebcff8f fix(tests): remove delete range runner test (#14889)
f21fb9b7915e7ed95b858ffd3d3d87ae20075c65 feat(stream): read exactly 1 row if no snapshot read per barrier for arrangement backfill (#14842)
8cb91531cb1e8a9a49d835aacd4fad441874ba2f feat(sql-backend): disable etcd meta store initialization when sql backend is enabled (#14902)
7d0d43f6586c036642b3e5589ce6b24a878af077 feat: support precompute partition for iceberg (#14710)