PG cdc q3 checksums inconsistent #15057

Closed
xuefengze opened this issue Feb 8, 2024 · 16 comments · Fixed by #15232
Labels: found-by-chaos-mesh, help wanted (Issues that need help from contributors), type/bug (Something isn't working)

Comments

@xuefengze
Contributor

xuefengze commented Feb 8, 2024

Describe the bug

The pg-cdc q3 checksum comparison failed:
https://buildkite.com/risingwave-test/chaos-mesh/builds/552

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240207

Additional context

No response

@xuefengze xuefengze added the type/bug Something isn't working label Feb 8, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Feb 8, 2024
@lmatz lmatz added help wanted Issues that need help from contributors block-release labels Feb 8, 2024
@xuefengze
Contributor Author

The source tables are fully synced; something is wrong with the query q3.

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t customer,new_order,orders,order_line 
{
    "consistent": true,
    "table-compare-results": [
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                }
            ]
        }
    ]
}

q3:

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t ch_benchmark_q3
{
    "consistent": false,
    "table-compare-results": [
        {
            "consistent": false,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": 7479820059950340793,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": -9219433622567761793,
                    "table-rows": 212
                }
            ]
        }
    ]
}

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

create materialized view ch_benchmark_q3
as
select   ol_o_id, ol_w_id, ol_d_id,
         sum(ol_amount) as revenue, o_entry_d
from 	 customer, new_order, orders, order_line
where 	 c_state like 'a%'
  and c_id = o_c_id
  and c_w_id = o_w_id
  and c_d_id = o_d_id
  and no_w_id = o_w_id
  and no_d_id = o_d_id
  and no_o_id = o_id
  and ol_w_id = o_w_id
  and ol_d_id = o_d_id
  and ol_o_id = o_id
  and o_entry_d > '2007-01-02 00:00:00.000000'
group by ol_o_id, ol_w_id, ol_d_id, o_entry_d;

The new_order table contains 20 rows; as a result, the final result should have 20 rows. Expected results:

dev=>  SELECT ol_o_id, ol_w_id, ol_d_id, sum(ol_amount) AS revenue, o_entry_d FROM customer, new_order, orders, order_line WHERE c_id = o_c_id AND c_w_id = o_w_id AND c_d_id = o_d_id AND no_w_id = o_w_id AND no_d_id = o_d_id AND no_o_id = o_id AND ol_w_id = o_w_id AND ol_d_id = o_d_id AND ol_o_id = o_id AND o_entry_d > '2007-01-02 00:00:00.000000' GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d order by ol_o_id;
 ol_o_id | ol_w_id | ol_d_id | revenue |      o_entry_d      
---------+---------+---------+---------+---------------------
   15435 |       1 |       6 | 4293.00 | 2024-02-08 04:41:01
   15436 |       1 |       6 |  801.59 | 2024-02-08 04:41:01
   15449 |       1 |       3 | 1756.33 | 2024-02-08 04:41:01
   15450 |       1 |       3 | 3373.57 | 2024-02-08 04:41:01
   15451 |       1 |       3 | 4297.36 | 2024-02-08 04:41:01
   15487 |       1 |       5 | 1059.55 | 2024-02-08 04:41:01
   15514 |       1 |       7 | 2996.72 | 2024-02-08 04:41:01
   15515 |       1 |       7 | 1995.63 | 2024-02-08 04:41:01
   15541 |       1 |       1 | 3272.81 | 2024-02-08 04:41:01
   15542 |       1 |       1 | 1603.53 | 2024-02-08 04:41:01
   15543 |       1 |       1 | 2534.95 | 2024-02-08 04:41:01
   15544 |       1 |       1 | 1684.74 | 2024-02-08 04:41:01
   15593 |       1 |      10 | 3780.30 | 2024-02-08 04:41:01
   15594 |       1 |      10 | 1929.09 | 2024-02-08 04:41:01
   15595 |       1 |      10 | 2076.20 | 2024-02-08 04:41:01
   15613 |       1 |       2 | 1920.34 | 2024-02-08 04:41:01
   15614 |       1 |       2 | 1527.93 | 2024-02-08 04:41:01
   15654 |       1 |       9 | 2032.44 | 2024-02-08 04:41:01
   15655 |       1 |       9 | 1679.22 | 2024-02-08 04:41:01
   15666 |       1 |       4 | 1894.98 | 2024-02-08 04:41:01
(20 rows)

However,

  • ch_benchmark_q3 has 212 rows. All redundant rows are between 04:36:02 and 04:36:04
  • __internal_ch_benchmark_q3_41_hashaggstate_1046 has 2k+ rows. All redundant rows are either at 04:35:52 or between 04:36:02 and 04:36:04

actual_results.txt

It seems that some data was leaked in these 2 epochs. 🤔

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

Questions:

  1. Did we kill any nodes during the chaos-mesh test, and if so, which kinds?
  2. Besides this and Q16 (PG CDC data checksums inconsistent #14998), did this happen in any other queries?

@xuefengze @cyliu0

@xuefengze
Contributor Author

  1. Did we kill any nodes during the chaos-mesh test, and if so, which kinds?

The chaos-mesh test didn't kill any nodes.

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

I suspect mem-table spill, because 1) the redundant data seems to come from only 2 epochs (see #15057 (comment)), and 2) the problem only happens in certain state tables rather than all tables (for example, the base tables are all good).

Mem-table spill is enabled in the test:

[screenshot: test config with mem-table spill enabled]

@xuefengze Can you please help run the test again without mem-table spill?

[storage]
mem_table_spill_threshold = 0
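For context, here is a minimal sketch of the spill semantics I am assuming (my own illustration, not the RisingWave implementation): once the in-memory mem-table grows past the configured threshold, it is flushed to shared storage mid-epoch as a spilled SST, and a threshold of 0 disables spilling entirely.

// Illustration only: assumed semantics of a spill threshold, where 0 disables it.
struct MemTable {
    size_bytes: usize, // bytes currently staged in memory for the ongoing epoch
}

fn should_spill(mem_table: &MemTable, spill_threshold: usize) -> bool {
    // A threshold of 0 means "never spill"; otherwise spill once the mem-table
    // reaches the configured size.
    spill_threshold > 0 && mem_table.size_bytes >= spill_threshold
}

fn main() {
    let mem_table = MemTable { size_bytes: 8 << 20 }; // 8 MiB buffered
    assert!(should_spill(&mem_table, 4 << 20)); // 4 MiB threshold -> spill
    assert!(!should_spill(&mem_table, 0));      // 0 -> spilling disabled
}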

@xuefengze
Contributor Author

Failed again: https://buildkite.com/risingwave-test/chaos-mesh/builds/580

rw config:

[system]
data_directory = "${BENCHMARK_RISINGWAVE_STORAGE_S3_DATA_DIRECTORY}"

[server]
telemetry_enabled = false
[streaming.developer]
stream_enable_executor_row_count = true

[storage]
enable_fast_compaction = true
imm_merge_threshold = 4
mem_table_spill_threshold = 0

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

Failed again: https://buildkite.com/risingwave-test/chaos-mesh/builds/580

rw config:

[system]
data_directory = "${BENCHMARK_RISINGWAVE_STORAGE_S3_DATA_DIRECTORY}"

[server]
telemetry_enabled = false
[streaming.developer]
stream_enable_executor_row_count = true

[storage]
enable_fast_compaction = true
imm_merge_threshold = 4
mem_table_spill_threshold = 0

Emmm, this time the base table new_order was different:

In PG: 1 row
In RW: 7776 rows

However, the materialized view ch_benchmark_q3 is consistent:

dev=>  SELECT ol_o_id, ol_w_id, ol_d_id, sum(ol_amount) AS revenue, o_entry_d FROM customer, new_order, orders, order_line WHERE c_id = o_c_id AND c_w_id = o_w_id AND c_d_id = o_d_id AND no_w_id = o_w_id AND no_d_id = o_d_id AND no_o_id = o_id AND ol_w_id = o_w_id AND ol_d_id = o_d_id AND ol_o_id = o_id AND o_entry_d > '2007-01-02 00:00:00.000000' GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d;
...
(7776 rows)

dev=> select * from ch_benchmark_q3;
...
(7776 rows)

In short, it looks like a completely different problem. Note that the CN node restarted twice during the test (but it's not OOM; still investigating the reason...).

@xuefengze
Contributor Author

Note that the CN node restarted twice during the test

The pod-failure experiment causes pods to restart.

@xuefengze
Contributor Author

https://buildkite.com/risingwave-test/chaos-mesh/builds/583
A test without mem-table spill, where chaos-mesh only injected stress faults (occupying 20% of the meta node's CPU).
The compute node was OOMKilled.

@fuyufjh
Member

fuyufjh commented Feb 20, 2024

https://buildkite.com/risingwave-test/chaos-mesh/builds/583: a test without mem-table spill, where chaos-mesh only injected stress faults (occupying 20% of the meta node's CPU). The compute node was OOMKilled.

Yeah, OOM is expected in this case, but the data inconsistency is a problem.

It seems to be a different problem. Let's discuss it there: #15141

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

I am able to successfully repro with:

BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q3"
TEST_TYPE="ch-benchmark-pg-cdc"
cooldown_after_experiment="3m"
duration="1h"
experiment_recovery_time="1m"
experiments="[{'kind': 'StressChaos', 'actions': ['cpu'], 'duration': '10m', 'cpu': {'workers': 1, 'load': 10}, 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"
RW_VERSION="nightly-20240207"
ENABLE_KEEP_CLUSTER="true"

Buildkite
Grafana

Findings:

  1. The row counts of all source tables (customer, new_order, orders, order_line) related to ch_benchmark_q3 match PG.
  2. ch_benchmark_q3 in RW has more rows than PG.
  3. To dig further into the root cause of the mismatch, I created a new MV called ch_benchmark_q3_2 with the same SQL while there was no source throughput. ch_benchmark_q3_2 has the correct row count, so I cross-checked the internal state table row counts one by one:
    • Only 1 join state table's row count mismatches between ch_benchmark_q3 and ch_benchmark_q3_2: __internal_ch_benchmark_q3_99_hashjoinleft_1121
    • __internal_ch_benchmark_q3_99_hashjoinleft_1121 is the one and only table that has triggered mem-table spill.
  4. I picked one of the rows that should not be present in __internal_ch_benchmark_q3_99_hashjoinleft_1121 and used sst-dump to grep for occurrences of its key in all relevant SSTs. I only found one occurrence, in an L6 SST. There is no other PUT/DELETE for the key in the sst-dump output.

See more details in debug_notes.txt. The cluster is not cleaned up. For those who are interested, you can psql into the cluster to get more information.

Now I am re-running the same test with the following config. Hopefully we can get the full mutation history for a row in hummock:

RW_CONFIG="{'meta':{'enable_hummock_data_archive': true, 'full_gc_interval_sec': 6048000, 'min_sst_retention_time_sec': 6048000}}"

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

Hopefully we can get the full mutation history for a row in hummock

I can successfully repro with the full row mutation history in Hummock. I picked a row that should have been deleted in the join state table and dug into its mutation history using risectl (thanks to the tooling provided by @zwang28).

 UserKey=0x00000417000300800004090080000001008000000a0080000001008000000a0080000cae008000000a00800000010080000cae008000000a0080000001
{ 1047, TableKey { 000300800004090080000001008000000a0080000001008000000a0080000cae008000000a00800000010080000cae008000000a0080000001 }
column: customer_c_id Some("1033")
column: customer_c_d_id Some("10")
column: customer_c_w_id Some("1")
column: new_order_no_o_id Some("3246")
column: new_order_no_d_id Some("10")
column: new_order_no_w_id Some("1")

TL;DR: there should be a bug in L5->L6 compaction that deletes a newer tombstone (with a larger spill offset), which makes an older PUT (with a smaller spill offset) visible.

Let's use K to denote the UserKey of this row. Here is the sequence of events that happened (a small sketch of the epoch/offset relationship follows the list):

  1. K is PUT into Hummock in SST 304 with epoch: 5993065021702144 offset = 21 (2024-02-23 09:55:15.004 UTC)
  2. K is DELETEd from Hummock (by inserting a tombstone) in SST 369 with epoch: 5993065021702144 offset = 59 (2024-02-23 09:55:15.004 UTC). Note that it is the same epoch with a larger spill offset, meaning that both SST 304 and SST 369 are spilled SSTs from the same checkpoint.
  3. A compaction task compacts SST 304 and 369, ... into SST 375, ... Both PUT and DELETE of K are present in SST 375.
  4. A compaction task compacts SST 375, ... into SST 566, .... Only the PUT is present in SST 566; the DELETE is absent. Note that this is an L5->L6 compaction.
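A tiny sketch of the ordering being described here (the real EpochWithGap encoding in RisingWave may differ; the field layout below is only an assumption to illustrate that the two versions share a pure epoch while the spill offset still orders the DELETE after the PUT):

// Illustration only: the two versions of K from the same checkpoint.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct EpochWithGap {
    pure_epoch: u64,   // checkpoint epoch, the value compared against the GC watermark
    spill_offset: u16, // distinguishes SSTs spilled within the same checkpoint
}

fn main() {
    let put_version = EpochWithGap { pure_epoch: 5993065021702144, spill_offset: 21 }; // SST 304
    let del_version = EpochWithGap { pure_epoch: 5993065021702144, spill_offset: 59 }; // SST 369
    assert!(del_version > put_version); // the tombstone is the newer version
    assert_eq!(del_version.pure_epoch, put_version.pure_epoch); // but a pure-epoch watermark check cannot tell them apart
}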

See more details in sst_dump_debug_notes.txt (You can search for "//" to better understand the notes).

I will keep the environment for a while. @Li0k @zwang28 @Little-Wallace Let's try to dig out the bug together next week.

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

Bug found:

if (epoch <= task_config.watermark && task_config.gc_delete_keys && value.is_delete())

When there are a DELETE and a PUT with the same pure epoch, the above check will discard the DELETE but keep the PUT. A simple fix is to replace epoch <= task_config.watermark with epoch < task_config.watermark.
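To make the effect concrete, here is a minimal, self-contained sketch of the per-key compaction decision. This is my own simplification, not the actual compactor code: only the quoted condition above comes from the source, while the shadowing clause, the watermark_can_see_last_key flag, and the use of the pure epoch are assumptions about the surrounding logic.

// Illustration only: compaction of the versions of ONE user key, newest first.
#[derive(Clone, Debug)]
enum Value {
    Put(&'static str),
    Delete, // tombstone
}

#[derive(Clone, Debug)]
struct Version {
    pure_epoch: u64, // spill offset stripped, so PUT (offset 21) and DELETE (offset 59) compare equal
    value: Value,
}

fn compact_one_user_key(
    versions: &[Version],
    watermark: u64,
    gc_delete_keys: bool,
    strict_gc_check: bool, // false = buggy `<=`, true = fixed `<`
) -> Vec<Version> {
    let mut kept = Vec::new();
    // Becomes true once any version of this key at or below the watermark has
    // been seen; strictly older versions below the watermark are then unreachable.
    let mut watermark_can_see_last_key = false;

    for v in versions {
        let epoch = v.pure_epoch;
        let below_watermark_for_gc = if strict_gc_check {
            epoch < watermark // fixed
        } else {
            epoch <= watermark // buggy
        };
        let gc_tombstone =
            below_watermark_for_gc && gc_delete_keys && matches!(v.value, Value::Delete);
        let shadowed = epoch < watermark && watermark_can_see_last_key;

        if epoch <= watermark {
            watermark_can_see_last_key = true;
        }
        if !(gc_tombstone || shadowed) {
            kept.push(v.clone());
        }
    }
    kept
}

fn main() {
    // Same pure epoch for the DELETE and the PUT (two spilled SSTs from one
    // checkpoint), and that epoch equals the compaction watermark.
    let versions = vec![
        Version { pure_epoch: 100, value: Value::Delete },           // newer, spill offset 59
        Version { pure_epoch: 100, value: Value::Put("stale row") }, // older, spill offset 21
    ];
    // Buggy `<=`: the tombstone is GC'ed but the PUT survives, resurrecting the row.
    println!("{:?}", compact_one_user_key(&versions, 100, true, false));
    // Fixed `<`: the tombstone is kept and keeps shadowing the PUT at read time.
    println!("{:?}", compact_one_user_key(&versions, 100, true, true));
}

With the non-strict check, the first call keeps only the PUT, which matches the resurrected row observed in the join state table; with the strict check, the tombstone survives compaction and the stale PUT stays invisible.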

@hzxa21
Collaborator

hzxa21 commented Feb 24, 2024

Triggered a run for the test with #15232. Let's wait and see whether the issue is gone.

@hzxa21
Collaborator

hzxa21 commented Feb 24, 2024

Triggered a run for the test with #15232. Let's wait and see whether the issue is gone.

The test passed.
