PG cdc q3 checksums inconsistent #15057
Comments
The source tables have finished syncing. Something is wrong with the query:
|
create materialized view ch_benchmark_q3
as
select ol_o_id, ol_w_id, ol_d_id,
sum(ol_amount) as revenue, o_entry_d
from customer, new_order, orders, order_line
where c_state like 'a%'
and c_id = o_c_id
and c_w_id = o_w_id
and c_d_id = o_d_id
and no_w_id = o_w_id
and no_d_id = o_d_id
and no_o_id = o_id
and ol_w_id = o_w_id
and ol_d_id = o_d_id
and ol_o_id = o_id
and o_entry_d > '2007-01-02 00:00:00.000000'
group by ol_o_id, ol_w_id, ol_d_id, o_entry_d; The
However,
It seems that some data was leaked in these 2 epochs. 🤔 |
Questions:
|
The chaos-mesh test didn't kill any nodes. |
Yeah. Also confirmed from Grafana: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=longcmkf-20240208-043103&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1707366471489&to=1707368038581 No node crashed or recovered. Compaction is also normal. |
I suspect mem-table spill, because 1) the redundant data seems to come from only 2 epochs (see #15057 (comment)), and 2) the problem only happens in certain state tables rather than in all tables (for example, the base tables are all good). Mem-table spill is enabled in the test: @xuefengze Can you please help run the test again without mem-table spill?
|
Failed again: https://buildkite.com/risingwave-test/chaos-mesh/builds/580 rw config:
|
Emmm, this time the base table In PG: 1 rows However, materialized view
In short, it looks like a completely different problem. Note that the CN node restarted 2 times during the test (but it's not OOM; still investigating the reason...) |
https://buildkite.com/risingwave-test/chaos-mesh/builds/583 |
Yeah, OOM is expected in this case, but the data inconsistency is a problem. It seems to be a different problem. Let's discuss there: #15141 |
I am able to successfully repro with:
Findings:
See more details in debug_notes.txt. The cluster is not cleaned up. For those who are interested, you can psql into the cluster to get more information. Now I am re-running the same test with the following config. Hopefully we can get the full mutation history for a row in hummock:
|
I can successfully repro with the full row mutation history in hummock. I picked a row that should be deleted in the join state table and dug into its mutation history using risectl (thanks to the tooling provided by @zwang28).
TL;DR: there appears to be a bug in L5->L6 compaction that deletes a newer tombstone (with a larger spill offset), which makes an older PUT (with a smaller spill offset) visible. Let's use
See more details in sst_dump_debug_notes.txt (you can search for "//" to better understand the notes). I will keep the environment for a while. @Li0k @zwang28 @Little-Wallace Let's try to dig out the bug together next week. |
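To make the TL;DR above concrete, here is a toy model of how a spill offset could order two writes to the same key inside one epoch, and what a reader sees if the tombstone is lost. This is a sketch under assumptions, not Hummock code: the 16-bit spill offset width, the `Version` type, and the `full_epoch` and `visible_value` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption for illustration: the low 16 bits of an epoch hold the spill offset.
SPILL_BITS = 16
SPILL_MASK = (1 << SPILL_BITS) - 1


def full_epoch(pure_epoch: int, spill_offset: int) -> int:
    """Combine a pure epoch with a spill offset into one comparable integer."""
    return (pure_epoch & ~SPILL_MASK) | (spill_offset & SPILL_MASK)


@dataclass(frozen=True)
class Version:
    epoch: int            # full epoch, spill offset included
    value: Optional[str]  # None models a tombstone (DELETE)


def visible_value(versions: List[Version], read_epoch: int) -> Optional[str]:
    """Return the value of the newest version at or below read_epoch."""
    candidates = [v for v in versions if v.epoch <= read_epoch]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.epoch).value


pure = 100 << SPILL_BITS
put = Version(full_epoch(pure, 0), "row")       # older write, smaller spill offset
tombstone = Version(full_epoch(pure, 1), None)  # newer write, larger spill offset
read_epoch = full_epoch(pure, SPILL_MASK)

# With both versions present the tombstone wins and the row is deleted.
intact = visible_value([put, tombstone], read_epoch)
# If compaction loses the tombstone, the older PUT becomes visible again.
after_bad_compaction = visible_value([put], read_epoch)
```

In this model `intact` is `None` while `after_bad_compaction` is `"row"`, which matches the symptom of a row that should have been deleted reappearing in the join state table.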
Bug found:
When a DELETE and a PUT share the same pure epoch, the above check discards the DELETE but keeps the PUT. A simple fix is to replace |
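The failure mode can be reproduced in a toy compactor. This is only a sketch under stated assumptions: it does not reproduce the actual check or its fix, and `compact`, the tuple layout, the `use_full_epoch` flag, and the 16-bit spill offset width are hypothetical. The variant that compares pure epochs fails to recognise the PUT as older than the tombstone it has already garbage-collected; the variant that compares full epochs does not.

```python
SPILL_BITS = 16  # assumed width of the spill offset, for illustration only
SPILL_MASK = (1 << SPILL_BITS) - 1


def pure_epoch(epoch):
    """Strip the spill offset from a full epoch."""
    return epoch & ~SPILL_MASK


def compact(versions, use_full_epoch):
    """Toy bottom-level compaction.

    `versions` is a list of (key, epoch, value) sorted by key, then by epoch
    descending. A value of None models a DELETE. Only the newest version of
    each key should survive, and a surviving tombstone is dropped as well.
    """
    out = []
    last_key = None
    last_epoch = None
    for key, epoch, value in versions:
        if key == last_key:
            if use_full_epoch:
                shadowed = epoch < last_epoch
            else:
                # Hypothetical buggy check: the spill offset is ignored.
                shadowed = pure_epoch(epoch) < pure_epoch(last_epoch)
            if shadowed:
                continue  # an older version of a key we already handled
        last_key, last_epoch = key, epoch
        if value is None:
            continue  # tombstones are garbage-collected at the bottom level
        out.append((key, epoch, value))
    return out


pure = 100 << SPILL_BITS
same_epoch = [
    ("k", pure | 1, None),   # DELETE, larger spill offset (newer)
    ("k", pure | 0, "row"),  # PUT, smaller spill offset (older)
]

buggy = compact(same_epoch, use_full_epoch=False)
fixed = compact(same_epoch, use_full_epoch=True)
```

Here `buggy` still contains the stale PUT while `fixed` is empty. When the two writes carry different pure epochs both variants agree, which is consistent with the problem showing up only in tables whose writes were spilled within a single epoch.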
Describe the bug
pg-cdc q3 checksums failed: https://buildkite.com/risingwave-test/chaos-mesh/builds/552
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
nightly-20240207
Additional context
No response