PG cdc q3 checksums inconsistent #15057

Closed
xuefengze opened this issue Feb 8, 2024 · 16 comments · Fixed by #15232
Labels: found-by-chaos-mesh, help wanted (Issues that need help from contributors), type/bug (Something isn't working)

Comments

@xuefengze
Contributor

xuefengze commented Feb 8, 2024

Describe the bug

The pg-cdc q3 checksum comparison failed:
https://buildkite.com/risingwave-test/chaos-mesh/builds/552

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240207

Additional context

No response

@xuefengze xuefengze added the type/bug Something isn't working label Feb 8, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Feb 8, 2024
@lmatz lmatz added help wanted Issues that need help from contributors block-release labels Feb 8, 2024
@xuefengze
Contributor Author

The source tables are fully synced; something is wrong with the query q3.

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t customer,new_order,orders,order_line 
{
    "consistent": true,
    "table-compare-results": [
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "customer",
                    "table-checksum": 3564470369424743317,
                    "table-rows": 30000
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "new_order",
                    "table-checksum": -7778568713383235460,
                    "table-rows": 20
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "orders",
                    "table-checksum": -6341453060762824403,
                    "table-rows": 155614
                }
            ]
        },
        {
            "consistent": true,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "order_line",
                    "table-checksum": 5946274962969133011,
                    "table-rows": 1557351
                }
            ]
        }
    ]
}

q3:

bin/qa consistency compare --upstream-driver postgres --upstream-port 54321 --upstream-user postgres --upstream-password postgres --upstream-database-name postgres --downstream-port 45678 --downstream-user root --downstream-password "" --downstream-database-name dev -t ch_benchmark_q3
{
    "consistent": false,
    "table-compare-results": [
        {
            "consistent": false,
            "table-checksums": [
                {
                    "url": "postgres://postgres:postgres@localhost:54321/postgres",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": 7479820059950340793,
                    "table-rows": 20
                },
                {
                    "url": "postgres://root:@localhost:45678/dev",
                    "table-name": "ch_benchmark_q3",
                    "table-checksum": -9219433622567761793,
                    "table-rows": 212
                }
            ]
        }
    ]
}

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

create materialized view ch_benchmark_q3
as
select   ol_o_id, ol_w_id, ol_d_id,
         sum(ol_amount) as revenue, o_entry_d
from 	 customer, new_order, orders, order_line
where 	 c_state like 'a%'
  and c_id = o_c_id
  and c_w_id = o_w_id
  and c_d_id = o_d_id
  and no_w_id = o_w_id
  and no_d_id = o_d_id
  and no_o_id = o_id
  and ol_w_id = o_w_id
  and ol_d_id = o_d_id
  and ol_o_id = o_id
  and o_entry_d > '2007-01-02 00:00:00.000000'
group by ol_o_id, ol_w_id, ol_d_id, o_entry_d;

The new_order table contains 20 rows; as a result, the final result should have 20 rows. Expected results:

dev=>  SELECT ol_o_id, ol_w_id, ol_d_id, sum(ol_amount) AS revenue, o_entry_d FROM customer, new_order, orders, order_line WHERE c_id = o_c_id AND c_w_id = o_w_id AND c_d_id = o_d_id AND no_w_id = o_w_id AND no_d_id = o_d_id AND no_o_id = o_id AND ol_w_id = o_w_id AND ol_d_id = o_d_id AND ol_o_id = o_id AND o_entry_d > '2007-01-02 00:00:00.000000' GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d order by ol_o_id;
 ol_o_id | ol_w_id | ol_d_id | revenue |      o_entry_d      
---------+---------+---------+---------+---------------------
   15435 |       1 |       6 | 4293.00 | 2024-02-08 04:41:01
   15436 |       1 |       6 |  801.59 | 2024-02-08 04:41:01
   15449 |       1 |       3 | 1756.33 | 2024-02-08 04:41:01
   15450 |       1 |       3 | 3373.57 | 2024-02-08 04:41:01
   15451 |       1 |       3 | 4297.36 | 2024-02-08 04:41:01
   15487 |       1 |       5 | 1059.55 | 2024-02-08 04:41:01
   15514 |       1 |       7 | 2996.72 | 2024-02-08 04:41:01
   15515 |       1 |       7 | 1995.63 | 2024-02-08 04:41:01
   15541 |       1 |       1 | 3272.81 | 2024-02-08 04:41:01
   15542 |       1 |       1 | 1603.53 | 2024-02-08 04:41:01
   15543 |       1 |       1 | 2534.95 | 2024-02-08 04:41:01
   15544 |       1 |       1 | 1684.74 | 2024-02-08 04:41:01
   15593 |       1 |      10 | 3780.30 | 2024-02-08 04:41:01
   15594 |       1 |      10 | 1929.09 | 2024-02-08 04:41:01
   15595 |       1 |      10 | 2076.20 | 2024-02-08 04:41:01
   15613 |       1 |       2 | 1920.34 | 2024-02-08 04:41:01
   15614 |       1 |       2 | 1527.93 | 2024-02-08 04:41:01
   15654 |       1 |       9 | 2032.44 | 2024-02-08 04:41:01
   15655 |       1 |       9 | 1679.22 | 2024-02-08 04:41:01
   15666 |       1 |       4 | 1894.98 | 2024-02-08 04:41:01
(20 rows)

However,

  • ch_benchmark_q3 has 212 rows. All redundant rows are between 04:36:02 and 04:36:04
  • __internal_ch_benchmark_q3_41_hashaggstate_1046 has 2k+ rows. All redundant rows are either at 04:35:52 or between 04:36:02 and 04:36:04

actual_results.txt

It seems that some data was leaked in these 2 epochs. 🤔

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

Questions:

  1. Did we kill any nodes during the chaos-mesh test, and if so, which kinds?
  2. Besides this and Q16 (PG CDC data checksums inconsistent #14998), did this happen in any other queries?

@xuefengze @cyliu0

@xuefengze
Contributor Author

  1. Did we kill any nodes during the chaos-mesh test, and if so, which kinds?

The chaos-mesh test didn't kill any nodes.

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

I suspect mem-table spill, because 1) the redundant data seems to come from only 2 epochs (see #15057 (comment)), and 2) the problem only happens in certain state tables rather than all tables (for example, the base tables are all good).

Mem-table spill is enabled in the test:

[screenshot: test config with mem-table spill enabled]

@xuefengze Can you please help run the test again without mem-table spill?

[storage]
mem_table_spill_threshold = 0
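For context, here is a minimal sketch of the spill semantics I am assuming (my own illustration, not the RisingWave implementation): once the in-memory mem-table grows past the configured threshold, it is flushed to shared storage mid-epoch as a spilled SST, and a threshold of 0 disables spilling entirely.

// Illustration only: assumed semantics of a spill threshold, where 0 disables it.
struct MemTable {
    size_bytes: usize, // bytes currently staged in memory for the ongoing epoch
}

fn should_spill(mem_table: &MemTable, spill_threshold: usize) -> bool {
    // A threshold of 0 means "never spill"; otherwise spill once the mem-table
    // reaches the configured size.
    spill_threshold > 0 && mem_table.size_bytes >= spill_threshold
}

fn main() {
    let mem_table = MemTable { size_bytes: 8 << 20 }; // 8 MiB buffered
    assert!(should_spill(&mem_table, 4 << 20)); // 4 MiB threshold -> spill
    assert!(!should_spill(&mem_table, 0));      // 0 -> spilling disabled
}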

@xuefengze
Contributor Author

Failed again: https://buildkite.com/risingwave-test/chaos-mesh/builds/580

rw config:

[system]
data_directory = "${BENCHMARK_RISINGWAVE_STORAGE_S3_DATA_DIRECTORY}"

[server]
telemetry_enabled = false
[streaming.developer]
stream_enable_executor_row_count = true

[storage]
enable_fast_compaction = true
imm_merge_threshold = 4
mem_table_spill_threshold = 0

@fuyufjh
Member

fuyufjh commented Feb 19, 2024

Failed again: https://buildkite.com/risingwave-test/chaos-mesh/builds/580

rw config:

[system]
data_directory = "${BENCHMARK_RISINGWAVE_STORAGE_S3_DATA_DIRECTORY}"

[server]
telemetry_enabled = false
[streaming.developer]
stream_enable_executor_row_count = true

[storage]
enable_fast_compaction = true
imm_merge_threshold = 4
mem_table_spill_threshold = 0

Emmm, this time the base table new_order was different:

In PG: 1 row
In RW: 7776 rows

However, the materialized view ch_benchmark_q3 is consistent:

dev=>  SELECT ol_o_id, ol_w_id, ol_d_id, sum(ol_amount) AS revenue, o_entry_d FROM customer, new_order, orders, order_line WHERE c_id = o_c_id AND c_w_id = o_w_id AND c_d_id = o_d_id AND no_w_id = o_w_id AND no_d_id = o_d_id AND no_o_id = o_id AND ol_w_id = o_w_id AND ol_d_id = o_d_id AND ol_o_id = o_id AND o_entry_d > '2007-01-02 00:00:00.000000' GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d;
...
(7776 rows)

dev=> select * from ch_benchmark_q3;
...
(7776 rows)

In short, it looks like a completely different problem. Note that the CN node restarted twice during the test (but it's not OOM; still investigating the reason...).

@xuefengze
Contributor Author

Note that the CN node restarted twice during the test

The pod-failure experiment causes pods to restart.

@xuefengze
Contributor Author

https://buildkite.com/risingwave-test/chaos-mesh/builds/583
A test without mem-table spill, where chaos-mesh only injected stress faults (occupying 20% of the meta node's CPU).
The compute node was OOMKilled.

@fuyufjh
Member

fuyufjh commented Feb 20, 2024

https://buildkite.com/risingwave-test/chaos-mesh/builds/583: a test without mem-table spill, where chaos-mesh only injected stress faults (occupying 20% of the meta node's CPU). The compute node was OOMKilled.

Yeah, OOM is expected in this case, but the data inconsistency is a problem.

It seems to be a different problem. Let's discuss it there: #15141

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

I am able to successfully repro with:

BENCH_TESTBED="medium-arm-all-affinity"
CH_BENCHMARK_QUERY="q3"
TEST_TYPE="ch-benchmark-pg-cdc"
cooldown_after_experiment="3m"
duration="1h"
experiment_recovery_time="1m"
experiments="[{'kind': 'StressChaos', 'actions': ['cpu'], 'duration': '10m', 'cpu': {'workers': 1, 'load': 10}, 'cases': [{'mode': ['one'], 'label': {'key': 'risingwave/component', 'value': 'meta'}}]}]"
RW_VERSION="nightly-20240207"
ENABLE_KEEP_CLUSTER="true"

Buildkite
Grafana

Findings:

  1. The row counts of all source tables (customer, new_order, orders, order_line) related to ch_benchmark_q3 match PG.
  2. ch_benchmark_q3 in RW has more rows than PG.
  3. To dig further into the root cause of the mismatch, I created a new MV called ch_benchmark_q3_2 with the same SQL while there was no source throughput. ch_benchmark_q3_2 has the correct row count, so I cross-checked the internal state table row counts one by one:
    • Only 1 join state table's row count mismatches between ch_benchmark_q3 and ch_benchmark_q3_2: __internal_ch_benchmark_q3_99_hashjoinleft_1121
    • __internal_ch_benchmark_q3_99_hashjoinleft_1121 is the one and only table that has triggered mem-table spill.
  4. I picked one of the rows that should not be present in __internal_ch_benchmark_q3_99_hashjoinleft_1121 and used sst-dump to grep for occurrences of its key in all relevant SSTs. I only found one occurrence, in an L6 SST. There is no other PUT/DELETE for the key in the sst-dump output.

See more details in debug_notes.txt. The cluster is not cleaned up. For those who are interested, you can psql into the cluster to get more information.

Now I am re-running the same test with the following config. Hopefully we can get the full mutation history for a row in hummock:

RW_CONFIG="{'meta':{'enable_hummock_data_archive': true, 'full_gc_interval_sec': 6048000, 'min_sst_retention_time_sec': 6048000}}"

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

Hopefully we can get the full mutation history for a row in hummock

I can successfully repro with the full row mutation history in Hummock. I picked a row that should have been deleted in the join state table and dug into its mutation history using risectl (thanks to the tooling provided by @zwang28).

 UserKey=0x00000417000300800004090080000001008000000a0080000001008000000a0080000cae008000000a00800000010080000cae008000000a0080000001
{ 1047, TableKey { 000300800004090080000001008000000a0080000001008000000a0080000cae008000000a00800000010080000cae008000000a0080000001 }
column: customer_c_id Some("1033")
column: customer_c_d_id Some("10")
column: customer_c_w_id Some("1")
column: new_order_no_o_id Some("3246")
column: new_order_no_d_id Some("10")
column: new_order_no_w_id Some("1")

TL;DR: there should be a bug in L5->L6 compaction that deletes a newer tombstone (with a larger spill offset), which makes an older PUT (with a smaller spill offset) visible.

Let's use K to denote the UserKey of this row. Here is the sequence of events that happened (a small sketch of the epoch/offset relationship follows the list):

  1. K is PUT into Hummock in SST 304 with epoch: 5993065021702144 offset = 21 (2024-02-23 09:55:15.004 UTC)
  2. K is DELETEd from Hummock (by inserting a tombstone) in SST 369 with epoch: 5993065021702144 offset = 59 (2024-02-23 09:55:15.004 UTC). Note that it is the same epoch with a larger spill offset, meaning that both SST 304 and SST 369 are spilled SSTs from the same checkpoint.
  3. A compaction task compacts SST 304 and 369, ... into SST 375, ... Both PUT and DELETE of K are present in SST 375.
  4. A compaction task compacts SST 375, ... into SST 566, .... Only the PUT is present in SST 566; the DELETE is absent. Note that this is an L5->L6 compaction.
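A tiny sketch of the ordering being described here (the real EpochWithGap encoding in RisingWave may differ; the field layout below is only an assumption to illustrate that the two versions share a pure epoch while the spill offset still orders the DELETE after the PUT):

// Illustration only: the two versions of K from the same checkpoint.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct EpochWithGap {
    pure_epoch: u64,   // checkpoint epoch, the value compared against the GC watermark
    spill_offset: u16, // distinguishes SSTs spilled within the same checkpoint
}

fn main() {
    let put_version = EpochWithGap { pure_epoch: 5993065021702144, spill_offset: 21 }; // SST 304
    let del_version = EpochWithGap { pure_epoch: 5993065021702144, spill_offset: 59 }; // SST 369
    assert!(del_version > put_version); // the tombstone is the newer version
    assert_eq!(del_version.pure_epoch, put_version.pure_epoch); // but a pure-epoch watermark check cannot tell them apart
}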

See more details in sst_dump_debug_notes.txt (You can search for "//" to better understand the notes).

I will keep the environment for a while. @Li0k @zwang28 @Little-Wallace Let's try to dig out the bug together next week.

@hzxa21
Collaborator

hzxa21 commented Feb 23, 2024

Bug found:

if (epoch <= task_config.watermark && task_config.gc_delete_keys && value.is_delete())

When there are a DELETE and a PUT with the same pure epoch, the above check will discard the DELETE but keep the PUT. A simple fix is to replace epoch <= task_config.watermark with epoch < task_config.watermark.
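To make the effect concrete, here is a minimal, self-contained sketch of the per-key compaction decision. This is my own simplification, not the actual compactor code: only the quoted condition above comes from the source, while the shadowing clause, the watermark_can_see_last_key flag, and the use of the pure epoch are assumptions about the surrounding logic.

// Illustration only: compaction of the versions of ONE user key, newest first.
#[derive(Clone, Debug)]
enum Value {
    Put(&'static str),
    Delete, // tombstone
}

#[derive(Clone, Debug)]
struct Version {
    pure_epoch: u64, // spill offset stripped, so PUT (offset 21) and DELETE (offset 59) compare equal
    value: Value,
}

fn compact_one_user_key(
    versions: &[Version],
    watermark: u64,
    gc_delete_keys: bool,
    strict_gc_check: bool, // false = buggy `<=`, true = fixed `<`
) -> Vec<Version> {
    let mut kept = Vec::new();
    // Becomes true once any version of this key at or below the watermark has
    // been seen; strictly older versions below the watermark are then unreachable.
    let mut watermark_can_see_last_key = false;

    for v in versions {
        let epoch = v.pure_epoch;
        let below_watermark_for_gc = if strict_gc_check {
            epoch < watermark // fixed
        } else {
            epoch <= watermark // buggy
        };
        let gc_tombstone =
            below_watermark_for_gc && gc_delete_keys && matches!(v.value, Value::Delete);
        let shadowed = epoch < watermark && watermark_can_see_last_key;

        if epoch <= watermark {
            watermark_can_see_last_key = true;
        }
        if !(gc_tombstone || shadowed) {
            kept.push(v.clone());
        }
    }
    kept
}

fn main() {
    // Same pure epoch for the DELETE and the PUT (two spilled SSTs from one
    // checkpoint), and that epoch equals the compaction watermark.
    let versions = vec![
        Version { pure_epoch: 100, value: Value::Delete },           // newer, spill offset 59
        Version { pure_epoch: 100, value: Value::Put("stale row") }, // older, spill offset 21
    ];
    // Buggy `<=`: the tombstone is GC'ed but the PUT survives, resurrecting the row.
    println!("{:?}", compact_one_user_key(&versions, 100, true, false));
    // Fixed `<`: the tombstone is kept and keeps shadowing the PUT at read time.
    println!("{:?}", compact_one_user_key(&versions, 100, true, true));
}

With the non-strict check, the first call keeps only the PUT, which matches the resurrected row observed in the join state table; with the strict check, the tombstone survives compaction and the stale PUT stays invisible.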

@hzxa21
Collaborator

hzxa21 commented Feb 24, 2024

Triggered a run for the test with #15232. Let's wait and see whether the issue is gone.

@hzxa21
Collaborator

hzxa21 commented Feb 24, 2024

Triggered a run for the test with #15232. Let's wait and see whether the issue is gone.

The test passed.
