Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(source): fix panic for ALTER SOURCE with schema registry #17293

Merged
merged 13 commits into from
Jun 19, 2024
Merged

Conversation

xxchan
Copy link
Member

@xxchan xxchan commented Jun 18, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

fix #16486

Note that it doesn't panic only when columns are added at the end, but actually only few cases of ALTER SOURCE won't panic.

The root cause is quite clear: when the schema registry is refreshed, the newly resolved columns still use column ids from 1. So the updated SourceCatalog contains duplicated column IDs.

Why protobuf test works previously? Or when it won't panic?

Test: https://github.com/risingwavelabs/risingwave/blob/dec1c4f0d8e9400b98888e923f941e9b54d40c3e/e2e_test/schema_registry/alter_sr.slt

It just happen to work, and since the test proto file has a struct field, whose field occupies a column ID. And it works because of 2 mistakes combined together.

e.g.,

message Foo {
  int32 a = 1;
  Bar bar = 2;
}

message Bar {
  int32 baz = 1;
}

When CREATE SOURCE

  1. In bind_columns_from_source: we get [a:#1, bar:#3 {bar.baz:#2}] according to the schema

  2. Then in bind_create_source: we use col_id_gen to "compact" the ids, and also added additional cols. We get [a:#1, bar:#2 {bar.baz:#2}, _rw_kafka_timestamp:#3, _row_id:#0] (note bar.baz is unchanged, although it doesn't matter, but is strange :)

    for c in &mut columns {
    c.column_desc.column_id = col_id_gen.generate(c.name())
    }

Then when adding field b=3 to Foo, and ALTER SOURCE, we will get:

  1. In refresh_sr_and_get_columns_diff, we only call bind_columns_from_source, and get [a:#1, bar:#3 {bar.baz:#2}, b:#4].

    Note that we don't use col_id_gen here!

  2. We calculated added_columns, and got b:#4. And simply extend it to the original columns.

    Note that we don't compare hidden columns here (_rw_kafka_timestamp)!

So there are a lot of edge cases to make it fail or crash:

  1. If we INCLUDE timestamp to make _rw_kafka_timestamp not hidden, ALTER SOURCE src_user REFRESH SCHEMA will fail with: this altering statement will drop columns, which is not supported yet: (_rw_kafka_timestamp: timestamp with time zone)
  2. If in the protobuf test, we don't have struct fields, or the newly added column is at the beginning (according to the protobuf field number), it will panic, like avro.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Member Author

xxchan commented Jun 18, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @xxchan and the rest of your teammates on Graphite Graphite

@xxchan xxchan changed the title add tests fix(source): fix panic for alter source with sr Jun 18, 2024
@github-actions github-actions bot added type/fix Bug fix and removed Invalid PR Title labels Jun 18, 2024
@xxchan xxchan changed the title fix(source): fix panic for alter source with sr fix(source): fix panic for alter source with schema registry Jun 18, 2024
@xxchan xxchan marked this pull request as ready for review June 18, 2024 05:32
@xxchan
Copy link
Member Author

xxchan commented Jun 18, 2024

avoid manipulating columns manually, which bypasses ColumnIdGenerator and can be problematic if we support ALTER TABLE with connector

#9828

e2e_test/schema_registry/pb.py Show resolved Hide resolved
Comment on lines +495 to +515
if cfg!(debug_assertions) {
// validate column ids
// Note: this just documents how it works currently. It doesn't mean whether it's reasonable.
if let Some(ref columns) = columns {
let mut i = 1;
fn check_col(col: &ColumnDesc, i: &mut usize, columns: &Vec<ColumnCatalog>) {
for nested_col in &col.field_descs {
// What's the usage of struct fields' column IDs?
check_col(nested_col, i, columns);
}
assert!(
col.column_id.get_id() == *i as i32,
"unexpected column id\ncol: {col:?}\ni: {i}\ncolumns: {columns:#?}"
);
*i += 1;
}
for col in columns {
check_col(&col.column_desc, &mut i, columns);
}
}
}
Copy link
Member Author

@xxchan xxchan Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should use ColumnId::placeholder() to assign IDs, because later we will re-assign column ids with col_id_gen, or use col_id_gen here directly, but it would be very intrusive.

@@ -894,6 +895,7 @@ pub(super) async fn handle_create_table_plan(
with_version_column: Option<String>,
include_column_options: IncludeOption,
) -> Result<(PlanRef, Option<PbSource>, PbTable, TableJobType)> {
let col_id_gen = ColumnIdGenerator::new_initial();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move arg to eliminate unnecessary param

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I believe we reused this function for ALTER TABLE prior to some refactoring and that's why we took an argument of column id generator. 🤡

@xxchan xxchan force-pushed the xxchan/sr branch 2 times, most recently from 295121d to 74040b6 Compare June 19, 2024 01:14
Comment on lines -165 to +177
let added_columns = columns_minus(&columns_from_resolve_source, &original_source.columns);
let mut added_columns = columns_minus(&columns_from_resolve_source, &original_source.columns);
// The newly resolved columns' column IDs also starts from 1. They cannot be used directly.
let mut next_col_id = max_column_id(&original_source.columns).next();
for col in &mut added_columns {
col.column_desc.column_id = next_col_id;
next_col_id = next_col_id.next();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the real fix of the issue. Other changes are just refactoring/debugging

@xxchan xxchan changed the title fix(source): fix panic for alter source with schema registry fix(source): fix panic for ALTER SOURCE with schema registry Jun 19, 2024
Comment on lines 108 to +109
let mut bound_column = bind_sql_columns(&[column_def])?.remove(0);
bound_column.column_desc.column_id = columns
.iter()
.fold(ColumnId::new(i32::MIN), |a, b| a.max(b.column_id()))
.next();
bound_column.column_desc.column_id = max_column_id(columns).next();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see ALTER SOURCE ADD COLUMN use this solution. Actually bind_sql_columns previously takes col_id_gen as a param, but it's changed and use ColumnId::placeholder() after the refactor :( #10307

Copy link
Contributor

@tabVersion tabVersion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the change

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

e2e_test/schema_registry/alter_sr.slt Outdated Show resolved Hide resolved
src/frontend/src/handler/alter_source_column.rs Outdated Show resolved Hide resolved
src/frontend/src/handler/alter_source_with_sr.rs Outdated Show resolved Hide resolved
@@ -894,6 +895,7 @@ pub(super) async fn handle_create_table_plan(
with_version_column: Option<String>,
include_column_options: IncludeOption,
) -> Result<(PlanRef, Option<PbSource>, PbTable, TableJobType)> {
let col_id_gen = ColumnIdGenerator::new_initial();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I believe we reused this function for ALTER TABLE prior to some refactoring and that's why we took an argument of column id generator. 🤡

@@ -70,6 +70,9 @@ impl LogicalSource {
ctx: OptimizerContextRef,
as_of: Option<AsOf>,
) -> Result<Self> {
// XXX: should we reorder the columns?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order does not matter much. The columns field is essentially a map indexed by the column id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, it's just for what users will see in SELECT *.

But I'm not sure if we rely on the position of hidden column like _row_id somewhere.. For projected_row_id we do so...

.iter()
.fold(ColumnId::new(i32::MIN), |a, b| a.max(b.column_id()))
.next();
bound_column.column_desc.column_id = max_column_id(columns).next();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong: we actually can directly go through the path of planning a completely new source catalog without keeping the consistency for column ids between the old and the new one. The current approach is just to be more compatible with ALTER TABLE, in case of future extension.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same, but without confidence.

@tabVersion
Copy link
Contributor

#17336 change a position to unify the additional columns related logic

@xxchan xxchan added the need-cherry-pick-release-1.10 Open a cherry-pick PR to branch release-1.10 after the current PR is merged label Jun 19, 2024
@xxchan
Copy link
Member Author

xxchan commented Jun 19, 2024

I finally found the reason of CI failure, and why I can't reproduce it locally: When using rpk topic produce --schema-id=topic, it looks for schema registry of the redpanda, but the schema is in confluent schema registry... 🤡

Unfortunately rpk topic produce returns 0 when it meet error..

unable to build value serializer using TopicNameStrategy for topic "avro_alter_source_test": unable to get schema with name "avro_alter_source_test-value" using TopicName strategy: unable to GET "http://127.0.0.1:8081/subjects/avro_alter_source_test-value/versions/latest": Get "http://127.0.0.1:8081/subjects/avro_alter_source_test-value/versions/latest": dial tcp 127.0.0.1:8081: connect: connection refused

@xxchan xxchan mentioned this pull request Jun 19, 2024
9 tasks
@xxchan xxchan enabled auto-merge June 19, 2024 10:06
Signed-off-by: xxchan <[email protected]>
Signed-off-by: xxchan <[email protected]>
@xxchan xxchan added this pull request to the merge queue Jun 19, 2024
Merged via the queue into main with commit 2a413ff Jun 19, 2024
30 of 31 checks passed
@xxchan xxchan deleted the xxchan/sr branch June 19, 2024 13:02
github-actions bot pushed a commit that referenced this pull request Jun 19, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need-cherry-pick-release-1.10 Open a cherry-pick PR to branch release-1.10 after the current PR is merged type/fix Bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(source): panic on ALTER SOURCE when columns added in the middle
3 participants