Support a method to reload a CDC table without recreating it #12753

Closed
StrikeW opened this issue Oct 11, 2023 · 10 comments

StrikeW (Contributor) commented Oct 11, 2023

Is your feature request related to a problem? Please describe.

When we try to recover a CDC table job and the upstream binlog/WAL file is missing, the job currently cannot be recovered. Right now, the only option is to drop the table and recreate it to run a full replication. However, rebuilding a CDC table can be tedious, because many downstream streaming jobs may depend on it.

One of our customers has asked us to support a method to reload the CDC table without recreating it.

Describe the solution you'd like

Some ideas:

  • Trigger a full replication internally when we detect during recovery that the upstream binlog file is missing. How to detect that the binlog file is missing still needs some investigation.
  • Introduce a new SQL command, for example a REFRESH command, to allow users to trigger a full replication for a specific CDC table. This is mainly for operational scenarios (a syntax sketch follows the references below).

reference:
DB2: https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-refresh-table
PG: https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html
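
A rough sketch of what such a REFRESH command could look like (hypothetical syntax, not implemented; orders_cdc stands in for any existing CDC table):

-- Hypothetical syntax, for illustration only: trigger a full re-sync of an existing CDC table from its upstream
REFRESH TABLE orders_cdc;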

Describe alternatives you've considered

No response

Additional context

related: #12313, #14060

github-actions bot added this to the release-1.4 milestone on Oct 11, 2023
BugenZhao (Member) commented Oct 11, 2023

This is somewhat similar to Alter MV. IIUC, the main motivation is to preserve all downstream streaming jobs. The problem then is: how do we decide which changes should be yielded to these downstreams if there is already data loss?

  • Delete all records in the RW table and re-synchronize with the upstream. This would result in a significant amount of downtime and unnecessary changes to the sinks.
  • Directly re-synchronize with the upstream and rely on the on-conflict handling to overwrite data. Will there be dirty data visible to users?
  • Somehow find the "diff" that represents the lost data. This introduces a brand-new job pattern and may require a complete design.

liurenjie1024 (Contributor) commented:

Trigger a full replication internally when we detect during recovery that the upstream binlog file is missing. How to detect that the binlog file is missing still needs some investigation.

By full replication, do you mean doing a full scan of the source table?

hzxa21 (Collaborator) commented Oct 12, 2023

When we try to recover a CDC table job and the upstream binlog/WAL file is missing, the job currently cannot be recovered.

Why is the upstream binlog/WAL file missing in the first place? Is this an expected situation?

I guess this is due to binlog/WAL retention? Correct me if I am wrong, but the job cannot be recovered only if the missing upstream binlog/WAL files are "unconsumed" by RW. If that is the case, it means something unexpected is happening in the RW cluster that causes it to lag in consuming the binlog/WAL files. That sounds rare to me, and dropping & recreating the CDC table sounds acceptable.
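
(For reference, the binlog retention window on MySQL 8.0 is controlled by the standard binlog_expire_logs_seconds variable; a quick way to inspect and extend it, shown here only to illustrate the retention scenario:)

# Check the current binlog retention; MySQL 8.0 defaults to 2592000 seconds (30 days)
mysql> SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

# Extend the retention to e.g. 7 days to give a lagging consumer more headroom
mysql> SET GLOBAL binlog_expire_logs_seconds = 604800;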

hzxa21 (Collaborator) commented Oct 13, 2023

I tried to reproduce the binlog-missing case via the following steps:

  1. Use docker compose under the integration test to create MySQL/RW pods, and create sample tables and MVs.
  2. Run the following commands in MySQL:

mysql> show binary logs;
+---------------+-----------+-----------+
| Log_name      | File_size | Encrypted |
+---------------+-----------+-----------+
| binlog.000003 |       157 | No        |
+---------------+-----------+-----------+

mysql> purge binary logs before now();

# Reset to a number larger than the binlog index seen above
mysql> reset master to 4;

  3. Check the RW cluster and run a batch query.

Here are my findings:

  1. RW cluster is still up after the binlog is cleaned.
  2. New changes to the MySQL table are not reflected in the RW table.
  3. I saw the following errors in the compute node:
2023-10-13 06:44:56,867 ERROR [blc-mysql:3306] mysql.MySqlStreamingChangeEventSource:1094 - Error during binlog processing. Last offset stored = {transaction_id=null, ts_sec=1697179409, file=binlog.000004, pos=1882, server_id=1, event=1}, binlog reader near position = binlog.000004/1882
2023-10-13 06:44:56,871 ERROR [blc-mysql:3306] pipeline.ErrorHandler:35 - Producer failure
io.debezium.DebeziumException: could not find next log; the first event 'binlog.000004' at 1042, the last event read from './binlog.000004' at 1882, the last byte read from './binlog.000004' at 1882. Error code: 1236; SQLSTATE: HY000.
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1194) ~[debezium-connector-mysql-1.9.7.Final.jar:1.9.7.Final]
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1239) [debezium-connector-mysql-1.9.7.Final.jar:1.9.7.Final]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1079) [mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:631) [mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:932) [mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: com.github.shyiko.mysql.binlog.network.ServerException: could not find next log; the first event 'binlog.000004' at 1042, the last event read from './binlog.000004' at 1882, the last byte read from './binlog.000004' at 1882.
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1043) ~[mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        ... 3 more
2023-10-13 06:44:56,872 INFO  [blc-mysql:3306] mysql.MySqlStreamingChangeEventSource:1210 - Stopped reading binlog after 0 events, last recorded offset: {transaction_id=null, ts_sec=1697179409, file=binlog.000004, pos=1882, server_id=1, event=1}
2023-10-13 06:44:57,112 INFO  [rw-dbz-engine-runner-1008] embedded.EmbeddedEngine:847 - Stopping the task and engine
2023-10-13 06:44:57,113 INFO  [rw-dbz-engine-runner-1008] common.BaseSourceTask:243 - Stopping down connector
2023-10-13T06:44:57.161128057Z  INFO risingwave_storage::hummock::event_handler::uploader: epoch 5239229393076224 is sealed
2023-10-13T06:44:57.16114017Z  INFO risingwave_storage::hummock::event_handler::uploader: epoch 5239229393076224 to seal has no data
2023-10-13 06:44:57,216 INFO  [debezium-mysqlconnector-RW_CDC_1008-change-event-source-coordinator] pipeline.ChangeEventSourceCoordinator:175 - Finished streaming
2023-10-13 06:44:57,222 INFO  [pool-5-thread-1] jdbc.JdbcConnection:962 - Connection gracefully closed
2023-10-13 06:44:57,225 ERROR [rw-dbz-engine-runner-1008] core.DbzCdcEngineRunner:83 - engine#1008 terminated with error. message: Error while trying to run connector class 'io.debezium.connector.mysql.MySqlConnector'
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
        at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:50) ~[debezium-core-1.9.7.Final.jar:1.9.7.Final]
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1239) ~[debezium-connector-mysql-1.9.7.Final.jar:1.9.7.Final]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1079) ~[mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:631) ~[mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:932) ~[mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: io.debezium.DebeziumException: could not find next log; the first event 'binlog.000004' at 1042, the last event read from './binlog.000004' at 1882, the last byte read from './binlog.000004' at 1882. Error code: 1236; SQLSTATE: HY000.
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1194) ~[debezium-connector-mysql-1.9.7.Final.jar:1.9.7.Final]
        ... 5 more
Caused by: com.github.shyiko.mysql.binlog.network.ServerException: could not find next log; the first event 'binlog.000004' at 1042, the last event read from './binlog.000004' at 1882, the last byte read from './binlog.000004' at 1882.
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1043) ~[mysql-binlog-connector-java-0.27.2.jar:0.27.2]
        ... 3 more

I am using this git commit (git-5a7e0883e10e6cba54a70888717b3d2152bc9a61) since this is the locally cached one and my box takes a long time to pull the latest image for some reason.

StrikeW (Contributor, Author) commented Oct 16, 2023

Hi @hzxa21, thanks for the effort. However, I doubt this is the correct way to reproduce the binlog-missing case caused by retention.

The io.debezium.DebeziumException: could not find next log; error is indeed caused by a binlog file missing on the MySQL side, but here the file is missing because of the reset master command, which forcefully removes the currently active binlog file.
The active binlog file would not be removed by PURGE BINARY LOGS (nor by the retention policy, IMO).

mysql> show master status;
+---------------+----------+--------------+------------------+-------------------+
| File          | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------+----------+--------------+------------------+-------------------+
| binlog.000009 |    13947 |              |                  |                   |
+---------------+----------+--------------+------------------+-------------------+
1 row in set (0.01 sec)

mysql> PURGE BINARY LOGS BEFORE now();
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+------------------------------------------------------------------------+
| Level   | Code | Message                                                                |
+---------+------+------------------------------------------------------------------------+
| Warning | 1868 | file ./binlog.000009 was not purged because it is the active log file. |
+---------+------+------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> reset master to 10;
Query OK, 0 rows affected (0.02 sec)

The io.debezium.DebeziumException: could not find next log exception occurs in the connector after the reset master command.

In summary, if a binlog file is accidentally lost, rather than removed by the retention policy, then the following behavior is expected, and in that case we do need to drop and rebuild the CDC table:

2. New changes to the MySQL table are not reflected in the RW table.
3. I saw the following errors in the compute node:

StrikeW (Contributor, Author) commented Oct 16, 2023

This is somewhat similar to Alter MV. IIUC, the main motivation is to preserve all downstream streaming jobs. The problem then is: how do we decide which changes should be yielded to these downstreams if there is already data loss?

Right now we should focus on the case where the binlog file is deleted by the MySQL retention policy. In this case, the binlog position recorded in the persisted source offset may have been purged by MySQL, so the connector cannot resume correctly upon recovery and the CDC table no longer syncs with its upstream.

I found that Debezium can emit heartbeat events to the downstream. With the help of heartbeat events, we can keep the source offset up to date with the upstream even when the upstream table has not had any updates for a long time, so the offset can be restored successfully during recovery. I will implement support for heartbeat events first to solve the problem for our customer.

Heartbeat messages are useful for monitoring whether the connector is receiving change events from the database. Heartbeat messages might help decrease the number of change events that need to be re-sent when a connector restarts.
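
As a sketch of how this might be wired up (the debezium.-prefixed passthrough option and the exact WITH-clause parameters below are assumptions for illustration; Debezium's own property is heartbeat.interval.ms):

-- Sketch only: the debezium.heartbeat.interval.ms passthrough below is an assumption for illustration.
-- Debezium's native heartbeat.interval.ms property makes the connector emit a heartbeat event
-- periodically, so the stored source offset keeps advancing even when the captured table is idle.
CREATE TABLE orders_cdc (
    order_id INT PRIMARY KEY,
    order_date TIMESTAMP
) WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql',
    port = '3306',
    username = 'root',
    password = '123456',
    database.name = 'mydb',
    table.name = 'orders',
    debezium.heartbeat.interval.ms = '10000'
);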

BugenZhao (Member) commented:

This is somewhat similar to Alter MV. IIUC, the main motivation is to preserve all downstream streaming jobs. The problem then is: how do we decide which changes should be yielded to these downstreams if there is already data loss?

Did you accidentally reference this? 👀

StrikeW (Contributor, Author) commented Oct 18, 2023

This is somewhat similar to Alter MV. IIUC, the main motivation is to preserve all downstream streaming jobs. The problem then is: how do we decide which changes should be yielded to these downstreams if there is already data loss?

Did you accidentally reference this? 👀

No. I think the problem you mentioned is more general and complex, for example when a compute node has been down for a long time. But right now we can narrow the scope and assume that the compute node won't be down for longer than the binlog expiration time, so that the connector can restore its offset upon recovery.

StrikeW (Contributor, Author) commented Jan 4, 2024

We had another round of discussion about this issue. Let me migrate it to this thread.
The background is that when our CDC connector runs into trouble, the CDC table can miss some upstream data.

@fuyufjh: How do you plan to implement “resync”? In particular, will you delete and re-insert all rows from the streaming perspective?
  • Yes - then it is no better than drop cascade and recreating all objects.
  • No - then it will lead to inconsistent data and may cause panics.

For the case where the user confirms that their upstream table is append-only, @hzxa21 and I had a discussion. We can provide a command that allows the user to trigger a full re-sync with the upstream table, and introduce a new type of ConflictBehavior for the MV operator that compares the old row (if it exists) with the new row for a primary key. This way we can recognize no-op Upsert events and avoid emitting redundant UD and UI events to the downstream (a plain-SQL illustration of this idea follows at the end of this comment). The idea is similar to #14339.

Since this new type of conflict handling introduces overhead, it should be switched off after the re-sync is done.
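
To illustrate the "no-op upsert" idea in plain SQL (table and column names here are hypothetical): during a re-sync, only rows that are new or whose content differs from what the CDC table already stores need to yield events to the downstream; identical rows can be dropped silently.

-- Illustration only, with hypothetical names: rows from the upstream snapshot that are
-- either absent from the CDC table or different from the stored row are the ones that
-- should produce downstream events; the rest are no-op upserts.
SELECT s.*
FROM upstream_snapshot AS s
LEFT JOIN orders_cdc AS t ON s.order_id = t.order_id
WHERE t.order_id IS NULL
   OR s.order_date IS DISTINCT FROM t.order_date
   OR s.amount IS DISTINCT FROM t.amount;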


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

StrikeW closed this as not planned on Jun 12, 2024.