Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chaos mesh daily test: ch-benchmark-pg-cdc data verification failed #15312

Closed
xuefengze opened this issue Feb 28, 2024 · 14 comments
Closed

chaos mesh daily test: ch-benchmark-pg-cdc data verification failed #15312

xuefengze opened this issue Feb 28, 2024 · 14 comments
Assignees
Labels
found-by-chaos-mesh type/bug Something isn't working
Milestone

Comments

@xuefengze
Copy link
Contributor

xuefengze commented Feb 28, 2024

Chaos-mesh test failed(ch-benchmark-pg-cdc). The experiment made the meta unavailable for 20 seconds.

================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly ch-benchmark-pg-cdc
Namespace            longcmkf-20240227-190312
TestBed              medium-arm-all-affinity
RW Version           nightly-20240227    
Test Start time      2024-02-27 19:07:23 
Test End time        2024-02-27 19:50:13 
Test Queries         q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q17,q18,q19,q20,q21,q22
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-190312&from=1709060843000&to=1709063413000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-190312&from=1709060843000&to=1709063413000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240227-190312/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/629

Data verification fails:

revenue1 table/mv data checksum inconsistent
ch_benchmark_q4 table/mv data checksum inconsistent
order_line table/mv data checksum inconsistent
================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly ch-benchmark-pg-cdc
Namespace            longcmkf-20240227-195152
TestBed              medium-arm-all-affinity
RW Version           nightly-20240227    
Test Start time      2024-02-27 19:54:13 
Test End time        2024-02-27 21:53:46 
Test Queries         q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q17,q18,q19,q20,q21,q22
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-195152&from=1709063653000&to=1709070826000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240227-195152&from=1709063653000&to=1709070826000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240227-195152/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/629

Data verification fails:

revenue1 table/mv data checksum inconsistent
ch_benchmark_q10 table/mv data checksum inconsistent
order_line table/mv data checksum inconsistent
stock table/mv data checksum inconsistent
@github-actions github-actions bot added this to the release-1.7 milestone Feb 28, 2024
@xuefengze xuefengze added type/bug Something isn't working found-by-chaos-mesh labels Feb 28, 2024
@xuefengze
Copy link
Contributor Author

link #15245 #15190 #15279

@StrikeW
Copy link
Contributor

StrikeW commented Apr 1, 2024

There is a null ptr issue in the log, which has been fixed in release-1.8, I suggest we rerun the pipeline with new image.
image

And I notice the pipeline is using the dedicated source, it is problematic in chaos-mesh test as explained in #15141 (comment). Please refactor the pipeline to ensure the historical data is empty or use share source instead. cc @cyliu0

@xuefengze
Copy link
Contributor Author

xuefengze commented Apr 1, 2024

BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q1,q2,q3,q4"
RW_VERSION="nightly-20240331"
BENCHMARK_TYPE="tpc"
TPC_TARGET="postgres"
duration="10m"
experiments="[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"

test failed with share source.
https://buildkite.com/risingwave-test/chaos-mesh/builds/716

@StrikeW
Copy link
Contributor

StrikeW commented Apr 1, 2024

I found same error logs as in the #15141.
We are working a fix patch to try to fix the issue, tracked in #15464

@StrikeW
Copy link
Contributor

StrikeW commented Apr 7, 2024

Hi @xuefengze, please rerun the test with RW_VERSION=cdc-commit-offset, which is the image of #16058

@xuefengze
Copy link
Contributor Author

https://buildkite.com/risingwave-test/chaos-mesh/builds/730#018eb902-b9e8-44f1-83a4-058cc8232d99
compute-0 got panicked when running q3
image

@StrikeW
Copy link
Contributor

StrikeW commented Apr 10, 2024

I run the chaos test with the image built by #16058, the Cannot seek to the last known offset WARN logs are gone, which means the persistent source offset are always valid with the PR. But q4 still encounter source table inconsistent with upstream. Still investigating.

@StrikeW StrikeW changed the title ch-benchmark-pg-cdc data verification failed in daily chaos mesh test chaos mesh daily test: ch-benchmark-pg-cdc data verification failed Apr 10, 2024
@hzxa21
Copy link
Collaborator

hzxa21 commented Apr 12, 2024

This time there are more rows in the RW tables:

<html>
<body>
<!--StartFragment-->
{
--
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "history",
  | "table-checksum": 8546249372229983623,
  | "table-rows": 46500
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "history",
  | "table-checksum": 1784133604083324500,
  | "table-rows": 46541
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "new_order",
  | "table-checksum": -758467578028615202,
  | "table-rows": 17
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "new_order",
  | "table-checksum": -5196837531638814068,
  | "table-rows": 23
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "orders",
  | "table-checksum": 5892817120355204331,
  | "table-rows": 140849
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "orders",
  | "table-checksum": 9047804123778707166,
  | "table-rows": 141184
  | }
  | ]
  | },
  | {
  | "consistent": false,
  | "table-checksums": [
  | {
  | "url": "postgres://postgres:postgres@localhost:5432/postgres",
  | "table-name": "order_line",
  | "table-checksum": 7067257756083074964,
  | "table-rows": 1408500
  | },
  | {
  | "url": "postgres://root:@127.0.0.1:4567/dev",
  | "table-name": "order_line",
  | "table-checksum": 8647488982357384687,
  | "table-rows": 1412341
  | }
  | ]
  | },

<!--EndFragment-->
</body>
</html>

@StrikeW
Copy link
Contributor

StrikeW commented Apr 12, 2024

I also run against the nightly image (750), the symptom is same: RW rows is more than upstream pg.

@stdrc
Copy link
Member

stdrc commented May 7, 2024

During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.

I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏

@StrikeW
Copy link
Contributor

StrikeW commented May 7, 2024

During a previous refactor to source parser, I found a possible cause of cdc offset rewind, which according to @StrikeW is possibly a reason of this issue. The bug case is that, if there's always a transaction beginning in the middle of a source message batch, and ending in the next batch, we'll always yield stream chunks immediately when the transaction is committed. Then heartbeat messages to update offset can possibly be delayed forever. After a series of such transactions, the earliest heartbeat message may get emitted, but the offset may already be invalid then.

I created a PR #16608 to fix this bug. Let's hope it can fix this issue🙏

I am building an image for the branch, let's see whether the issue can be fixed.

@StrikeW
Copy link
Contributor

StrikeW commented May 8, 2024

I run the chaos mesh test against the PR, but it fails in q3.
https://buildkite.com/risingwave-test/chaos-mesh/builds/816#018f5610-37ba-450c-8c59-067fa74e648e

But the nightly image can pass the test.

BENCH_TESTBED=medium-arm-all-affinity
CH_BENCHMARK_QUERY=q1,q2,q3,q4
RW_VERSION=nightly-20240507
BENCHMARK_TYPE=tpc
TPC_TARGET=postgres
duration=10m
experiments=[{'kind': 'PodChaos', 'actions': ['pod-failure'], 'duration': '20s', 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]

I think we should take a look before merge the PR.

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 8, 2024

@StrikeW Is the issue fixed?

@StrikeW
Copy link
Contributor

StrikeW commented Oct 9, 2024

@StrikeW Is the issue fixed?

chaos mesh pipeline didn't fail for a long time, I think we can close this issue and open new issues if the pipeline fails.

@StrikeW StrikeW closed this as completed Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
found-by-chaos-mesh type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants