feat: only ingest key-ed value in additional header column #14628

tabVersion · 2024-01-17T13:10:04Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Syntax: INCLUDE HEADER 'header_col' AS column_name. Only the header column can specify an inner field.
If no alias is specified, the header column with an inner field is named _rw_kafka_header_{inner field}.

If an inner field name is specified, the column type becomes bytea instead of Array[Struct<Varchar, Bytea>].

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Support a dedicated syntax for header columns:

create table/source s (...) include header '<header col name>' [varchar/bytea] [as <alias>]

Users can now specify a desired key in the header and turn its value into a column.

Example: given a header [(key1, value1), (key2, value2)],

if the clause is include header 'key1', the column content is value1 as bytes;
if the clause is include header 'key1' varchar, the column content is value1 as varchar.
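
The example above can be sketched in Rust roughly as follows. This is a minimal illustration, not the code from this PR: `HeaderValue`, `TypeHint`, and `extract_header` are hypothetical names, and returning `None` (SQL NULL) for a missing key or for bytes that are not valid UTF-8 is an assumption.

```rust
/// A header column value: raw bytes or UTF-8 text (hypothetical type).
#[derive(Debug, PartialEq)]
enum HeaderValue {
    Bytea(Vec<u8>),
    Varchar(String),
}

/// Type hint taken from the `include header` clause (hypothetical type).
#[derive(Clone, Copy)]
enum TypeHint {
    Bytea,
    Varchar,
}

/// Find `key` among the message headers and convert its value per `hint`.
/// Assumption: a missing key, or invalid UTF-8 under `Varchar`, yields `None`.
fn extract_header(headers: &[(&str, &[u8])], key: &str, hint: TypeHint) -> Option<HeaderValue> {
    // Take the first header whose key matches.
    let (_, raw) = headers.iter().find(|(k, _)| *k == key)?;
    match hint {
        TypeHint::Bytea => Some(HeaderValue::Bytea(raw.to_vec())),
        TypeHint::Varchar => String::from_utf8(raw.to_vec())
            .ok()
            .map(HeaderValue::Varchar),
    }
}
```

With the header [(key1, value1), (key2, value2)], calling `extract_header` for key1 with `TypeHint::Bytea` gives the bytes of value1, and with `TypeHint::Varchar` gives the string value1.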

About the default naming:

In the previous implementation, the default name is _rw_{connector name}_{additional column type}.
This PR introduces the inner field and the type hint, so the name can be _rw_{connector name}_{additional column type}_{<inner field name>}_{type hint}.
Specifically, for include header 'header1' bytea the name is _rw_kafka_header_header1_bytea, and for include header 'header2' varchar the name is _rw_kafka_header_header2_varchar.
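
The naming rule above could be expressed as a small helper. This is a sketch only: `default_column_name` is a hypothetical function, not one from the codebase, and omitting the suffix when the inner field or type hint is absent is an assumption.

```rust
/// Build the default column name from its parts (hypothetical helper).
/// Assumption: absent parts are simply left out of the name.
fn default_column_name(
    connector: &str,
    column_type: &str,
    inner_field: Option<&str>,
    type_hint: Option<&str>,
) -> String {
    // Base name used by the previous implementation.
    let mut name = format!("_rw_{}_{}", connector, column_type);
    // Append the inner field and the type hint, when present.
    for part in [inner_field, type_hint].into_iter().flatten() {
        name.push('_');
        name.push_str(part);
    }
    name
}
```

For example, `default_column_name("kafka", "header", Some("header1"), Some("bytea"))` yields `_rw_kafka_header_header1_bytea`, matching the name given in the release note.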

…ages
- Refactored the `do_action` method in `mod.rs` for improved source column description processing
- Added support for additional column types in the `wrapped_f` closure in `mod.rs`
- Updated error handling for failed access to non-primary key columns in the `wrapped_f` closure in `mod.rs`
- Added rollback functionality to the `do_action` method in `mod.rs`
- Modified the function `extract_headers_from_meta` in `util.rs` to accept an additional parameter
- Updated the implementation of `extract_headers_from_meta` to call `kafka_meta.extract_headers(inner_field)` in `util.rs`
- Added an optional `inner_field` parameter to the `extract_headers` function in `message.rs`
- Updated the implementation of `extract_headers` to handle the changes in `message.rs`

Signed-off-by: tabVersion <[email protected]>

xxchan · 2024-01-19T15:00:02Z

proto/plan_common.proto

-  AdditionalColumnType additional_column_type = 9;
+
+  // deprecated, use AdditionalColumn instead
+  // AdditionalColumnType additional_column_type = 9;
+  reserved 9;


additional_column_type is included in 1.6, but not documented.

#14215 (comment)

Why do we need to deprecate this field here?

Oh, for sources created in 1.6, it will have AdditionalColumnType::NORMAL. So we cannot change the type of field 9.

Have we told any PoC users to try include before? We might need to tell them to rebuild the sources later. Or maybe we just document this breaking change in the release note.

Yes, the code is in v1.6.0, but the feature is not considered released. It is OK to ignore the non-normal columns.
I don't want to make breaking changes to normal columns here, so I chose to use a new field.

I want to keep things flexible when handling additional columns. As this change shows, the previous enum is not sufficient to handle an extra inner field argument. I don't know what comes next, so I made all columns a message instead of an enum.

The breaking change looks acceptable to me, although it seems not hard to make it backward compatible.

xxchan · 2024-01-19T15:10:51Z

proto/plan_common.proto

+
+  // deprecated, use AdditionalColumn instead
+  // AdditionalColumnType additional_column_type = 9;
+  reserved 9;

  ColumnDescVersion version = 10;


Do we need to add a new ColumnDescVersion? TBH I'm not sure about why it's added, and it doesn't seem to be needed here. Ask just in case.

The field was introduced in #13707 to handle a future change to DEFAULT_KEY_COLUMN_NAME.
The discussion is available at #13707 (comment).

fuyufjh · 2024-01-22T03:28:23Z

Materialize parses the value as a UTF-8 string by default, unless the data type BYTES is specified.

Ref. https://materialize.com/docs/sql/create-source/kafka/#syntax

I think we should follow the design, because in most cases the value will just be used as varchar instead of bytea, and converting bytea to varchar is verbose.

My idea:

INCLUDE HEADER key                  -- decode as UTF-8 string
INCLUDE HEADER key AS name          -- decode as UTF-8 string
INCLUDE HEADER key AS name VARCHAR  -- decode as UTF-8 string
INCLUDE HEADER key AS name BYTEA    -- output raw bytes
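
The defaulting rule in this proposal (decode as a UTF-8 string unless BYTEA is requested) might be sketched as below. The names `HeaderType` and `resolve_header_type` are hypothetical, and the case-insensitive matching and the error for other type names are assumptions, not behavior confirmed by this PR.

```rust
/// Output type of a header column under the proposal (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum HeaderType {
    Varchar,
    Bytea,
}

/// Map the optional type keyword of the clause to an output type.
/// No keyword or VARCHAR decodes as UTF-8; BYTEA keeps the raw bytes.
fn resolve_header_type(hint: Option<&str>) -> Result<HeaderType, String> {
    // Assumption: keywords are matched case-insensitively.
    match hint.map(|h| h.to_ascii_lowercase()).as_deref() {
        None | Some("varchar") => Ok(HeaderType::Varchar),
        Some("bytea") => Ok(HeaderType::Bytea),
        Some(other) => Err(format!("unsupported header type: {}", other)),
    }
}
```

Under this sketch, the first three clauses in the list resolve to `HeaderType::Varchar` and only the last one resolves to `HeaderType::Bytea`.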

proto/plan_common.proto

fuyufjh · 2024-01-22T03:46:36Z

proto/plan_common.proto

-  AdditionalColumnType additional_column_type = 9;
+
+  // deprecated, use AdditionalColumn instead
+  // AdditionalColumnType additional_column_type = 9;
+  reserved 9;


The breaking change looks acceptable to me, although it seems not hard to make it backward compatible.

tabVersion · 2024-01-22T07:59:39Z

> Materialize parses the value as a UTF-8 string by default, unless the data type BYTES is specified.
> Ref. [materialize.com/docs/sql/create-source/kafka/#syntax](https://materialize.com/docs/sql/create-source/kafka/#syntax)
> I think we should follow the design, because in most cases the value will just be used as varchar instead of bytea, and converting bytea to varchar is verbose.
>
> My idea:
> INCLUDE HEADER key                  -- decode as UTF-8 string
> INCLUDE HEADER key AS name          -- decode as UTF-8 string
> INCLUDE HEADER key AS name VARCHAR  -- decode as UTF-8 string
> INCLUDE HEADER key AS name BYTEA    -- output raw bytes

The solution seems a little verbose to me. The original purpose of this new syntax is to solve the problem that users have trouble finding the key they want in Array[Struct<varchar, bytea>], and it solves that well.
Users can easily convert bytes to varchar in SQL, and the efficiency is good. I'd keep this conversion external, because the syntax serves as a shortcut in the parser and does not share the logic that fills NULL on parsing failure. We don't want to be responsible for unexpected behavior caused by users forgetting to specify varchar in the clause.

Signed-off-by: tabVersion <[email protected]>

fuyufjh · 2024-01-22T15:28:08Z

> The solution seems a little verbose to me. The original purpose of this new syntax is to solve the problem that users have trouble finding the key they want in Array[Struct<varchar, bytea>], and it solves that well. Users can easily convert bytes to varchar in SQL, and the efficiency is good. I'd keep this conversion external, because the syntax serves as a shortcut in the parser and does not share the logic that fills NULL on parsing failure. We don't want to be responsible for unexpected behavior caused by users forgetting to specify varchar in the clause.

As a connector, I would hope to complete every parsing work inside of it. I consider parsing a header value to string as part of this.

I can forecast that users will ask you 2 questions frequently without such an option.

How to convert bytea to varchar with PG functions? -- Please use encode()
How to create a table with a header value in varchar type? -- Please use generated columns.

Both are natural requirements but the solution is obscure, so you can't blame the users actually.

src/frontend/src/handler/create_source.rs

src/connector/src/source/kafka/source/reader.rs

Signed-off-by: tabVersion <[email protected]>

tabVersion · 2024-01-23T11:40:47Z

> > The solution seems a little verbose to me. The original purpose of this new syntax is to solve the problem that users have trouble finding the key they want in Array[Struct<varchar, bytea>], and it solves that well. Users can easily convert bytes to varchar in SQL, and the efficiency is good. I'd keep this conversion external, because the syntax serves as a shortcut in the parser and does not share the logic that fills NULL on parsing failure. We don't want to be responsible for unexpected behavior caused by users forgetting to specify varchar in the clause.
>
> As a connector, I would hope to complete every parsing work inside of it. I consider parsing a header value to string as part of this.
>
> I can forecast that users will ask you 2 questions frequently without such an option.
>
> How to convert bytea to varchar with PG functions? -- Please use encode()
>
> How to create a table with a header value in varchar type? -- Please use generated columns.
>
> Both are natural requirements but the solution is obscure, so you can't blame the users actually.

Already implemented. Please review.

proto/plan_common.proto

src/connector/src/parser/additional_columns.rs

src/connector/src/source/kafka/source/message.rs

proto/plan_common.proto

fuyufjh

LGTM!

frontend

4aae804

github-actions bot added the type/feature label Jan 17, 2024

tabVersion added 7 commits January 18, 2024 13:14

stash

c62e98f

fix prost

c5d4197

fix compile

7cfbb15

fix

205b89a

more test

0d07a66

Merge branch 'main' into tab/header-col

d7c94ef

tabVersion marked this pull request as ready for review January 19, 2024 14:12

tabVersion requested review from fuyufjh, xxchan and Rossil2012 January 19, 2024 14:13

tabVersion added 2 commits January 19, 2024 22:20

format

be58ada

format

735dbba

xxchan reviewed Jan 19, 2024


tabVersion added the user-facing-changes Contains changes that are visible to users label Jan 19, 2024

tabVersion requested a review from st1page January 19, 2024 15:40

Merge branch 'main' into tab/header-col

054cf0e

fuyufjh reviewed Jan 22, 2024


fix

626be23

tabVersion added 6 commits January 22, 2024 16:19

rename additional_column_type to additional_columns

650f831

fix

01f2116

fix

2c0dbd0

rerun

9540879

Signed-off-by: tabVersion <[email protected]>

separate header inner and headers

8b715b4

Merge branch 'main' into tab/header-col

4c3b96a

Rossil2012 reviewed Jan 23, 2024

src/frontend/src/handler/create_source.rs

src/connector/src/source/kafka/source/reader.rs

tabVersion added 5 commits January 23, 2024 16:11

fix comments

b7310ff

add header col type hint

8cba0b9

handle col name

f00632a

add test case in e2e

c296e8d

rerun

f9f4df5

Signed-off-by: tabVersion <[email protected]>

tabVersion added 3 commits January 23, 2024 19:41

Merge branch 'main' into tab/header-col

dc2f42e

handle non exist header key

37570d1

fix

8528acb

fuyufjh reviewed Jan 24, 2024

proto/plan_common.proto

src/connector/src/parser/additional_columns.rs

src/connector/src/source/kafka/source/message.rs

proto/plan_common.proto

tabVersion added 4 commits January 25, 2024 00:33

refactor

bea93b9

remove additional_column_normal

372988d

resolve comments

6653119

fix misc

d0053d5

fuyufjh approved these changes Jan 25, 2024


fix ut

9c1d2f8

tabVersion enabled auto-merge January 25, 2024 03:07

tabVersion added this pull request to the merge queue Jan 25, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 25, 2024

tabVersion added this pull request to the merge queue Jan 25, 2024

Merged via the queue into main with commit df87c2d Jan 25, 2024
31 of 32 checks passed

tabVersion deleted the tab/header-col branch January 25, 2024 05:04

cyliu0 mentioned this pull request Jan 31, 2024

nightly-20240126 deleting data from backfill table stuck #14886

Closed

tabVersion mentioned this pull request Feb 23, 2024

fix: handle upsert json in prev versions #15226

Merged

9 tasks
