Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cdc): support transaction for shared cdc source #14375

Merged
merged 7 commits into from
Jan 8, 2024

Conversation

StrikeW
Copy link
Contributor

@StrikeW StrikeW commented Jan 5, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

close #14348

  • Add transaction metadata parsing logic to PlainParser for the CDC source job, which fixed to plain json format.
  • Add more solid tests for emitting transactional chunk
  • Fix transaction id parsing logic for specific database

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

User can specify transactional = true parameter to the WITH options to preserve upstream transaction semantic when create a CDC source.

@StrikeW StrikeW added the user-facing-changes Contains changes that are visible to users label Jan 5, 2024
Copy link
Contributor

@tabVersion tabVersion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add transaction metadata parsing logic to PlainParser for the CDC source job, which fixed to plain json format.

I am a little worried that we are mixing too much logic into plain parser. shall we make a new format dedicated to CDC source?

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 5, 2024

Add transaction metadata parsing logic to PlainParser for the CDC source job, which fixed to plain json format.

I am a little worried that we are mixing too much logic into plain parser. shall we make a new format dedicated to CDC source?

Hmm, the modification to plain parser is small, only add a new branch to parse transaction metadata if it needs to.
https://github.com/risingwavelabs/risingwave/pull/14375/files#diff-8c7e5305d8e1796a4ccc76a9522ccb881ea31abeeb6d8010427679c5d9fa801fR91-R95

Copy link
Contributor

Add transaction metadata parsing logic to PlainParser for the CDC source job, which fixed to plain json format.

I am a little worried that we are mixing too much logic into plain parser. shall we make a new format dedicated to CDC source?

Hmm, the modification to plain parser is small, only add a new branch to parse transaction metadata if it needs to.
https://github.com/risingwavelabs/risingwave/pull/14375/files#diff-8c7e5305d8e1796a4ccc76a9522ccb881ea31abeeb6d8010427679c5d9fa801fR91-R95

Yes, I am ok with the current impl, just wonder if you have such plan.

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 5, 2024

Add transaction metadata parsing logic to PlainParser for the CDC source job, which fixed to plain json format.

I am a little worried that we are mixing too much logic into plain parser. shall we make a new format dedicated to CDC source?

Hmm, the modification to plain parser is small, only add a new branch to parse transaction metadata if it needs to.
https://github.com/risingwavelabs/risingwave/pull/14375/files#diff-8c7e5305d8e1796a4ccc76a9522ccb881ea31abeeb6d8010427679c5d9fa801fR91-R95

Yes, I am ok with the current impl, just wonder if you have such plan.

Right now the modification is small, so I think it is ok to just extend the plain parser. If there are more features need to add to the plain parser (e.g. handle schema change events), we can define a new parser and format for it at that time.

Copy link
Contributor

@tabVersion tabVersion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StrikeW StrikeW added this pull request to the merge queue Jan 8, 2024
Merged via the queue into main with commit 10f7b77 Jan 8, 2024
29 of 30 checks passed
@StrikeW StrikeW deleted the siyuan/cdc-transaction branch January 8, 2024 03:48
Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask what's PlainParser for? Why aren't we using JsonParser here?

&'a mut self,
key: Option<Vec<u8>>,
payload: Option<Vec<u8>>,
writer: SourceStreamChunkRowWriter<'a>,
) -> Result<()> {
) -> Result<ParseResult> {
tracing::info!("parse_one_with_txn");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be too verbose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, will remove it in a following bugfix PR.

Comment on lines +54 to +58
ConnectorProperties::PostgresCdc(_) => {
let (tx_id, _) = id.split_once(':').unwrap();
return Ok(TransactionControl::Begin { id: tx_id.into() });
}
ConnectorProperties::MysqlCdc(_) => return Ok(TransactionControl::Begin { id }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we depend on the value of the ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. IIUC, the LSN for BEGIN and COMMIT differs from each other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, lsn in commit will greater than begin.

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 9, 2024

May I ask what's PlainParser for? Why aren't we using JsonParser here?

It is for parsing message for source in FORMAT PLAIN format, and actually we don't have JsonParser, json is an encoding not format in our terminology.

let payload_builder = match props.encoding_config {
EncodingProperties::Json(_)
| EncodingProperties::Protobuf(_)
| EncodingProperties::Avro(_)
| EncodingProperties::Bytes(_) => {
AccessBuilderImpl::new_default(props.encoding_config, EncodingType::Value).await?

@fuyufjh
Copy link
Member

fuyufjh commented Jan 10, 2024

User can specify transactional = true parameter to the WITH options to preserve upstream transaction semantic when create a CDC source.

Is it possible to enable this feature by default? Will this involve additional cost?

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 10, 2024

User can specify transactional = true parameter to the WITH options to preserve upstream transaction semantic when create a CDC source.

Is it possible to enable this feature by default? Will this involve additional cost?

The overhead should be small. But I think we should conduct a performance test to make this decision. @lmatz Does our ch-benchmark/sysbench use BEGIN and COMMIT on upstream database to generate workload? If it is we can create a transactional source and eval its performance.

@fuyufjh
Copy link
Member

fuyufjh commented Jan 18, 2024

User can specify transactional = true parameter to the WITH options to preserve upstream transaction semantic when create a CDC source.

Is it possible to enable this feature by default? Will this involve additional cost?

The overhead should be small. But I think we should conduct a performance test to make this decision. @lmatz Does our ch-benchmark/sysbench use BEGIN and COMMIT on upstream database to generate workload? If it is we can create a transactional source and eval its performance.

Any follow-ups? @StrikeW

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 18, 2024

User can specify transactional = true parameter to the WITH options to preserve upstream transaction semantic when create a CDC source.

Is it possible to enable this feature by default? Will this involve additional cost?

The overhead should be small. But I think we should conduct a performance test to make this decision. @lmatz Does our ch-benchmark/sysbench use BEGIN and COMMIT on upstream database to generate workload? If it is we can create a transactional source and eval its performance.

Any follow-ups? @StrikeW

@cyliu0 is working on the test, the ETA is next week. https://github.com/risingwavelabs/risingwave-test/issues/577

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 26, 2024

Update the evaluation result (nightly-20240125)

The overhead is about 5% for mysql and 4% for pg roughly and it is increased inside the debezium connector.
image

I think the overhead is not trivial, it is better to remain the transactional to false by default. @fuyufjh

@fuyufjh
Copy link
Member

fuyufjh commented Jan 26, 2024

I think the overhead is not trivial, it is better to remain the transactional to false by default. @fuyufjh

Thanks for the testing work!

I prefer to enable this by default so that we can claim the CDC connector is transactional. Otherwise, telling people "Yes-but" (Yes it supports transaction but needs to be enabled manually) sounds bad to me.

@StrikeW
Copy link
Contributor Author

StrikeW commented Jan 26, 2024

I got it. Let's do the change in v1.7.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support transaction for shared cdc source
4 participants