
feat: allow configure other additional columns for connectors #14215

Merged
merged 24 commits into main from tab/addi-columns on Jan 10, 2024

Conversation

tabVersion
Contributor

@tabVersion tabVersion commented Dec 26, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Following #13707, this PR implements the final part of risingwavelabs/rfcs#79.

The syntax looks like:

create table t (..schema.. )
  include key as some_key
  include partition
  include offset
with (...) format ... encode ...

The columns accepted for each connector are listed in https://github.com/risingwavelabs/rfcs/blob/tab/include-key-as/rfcs/0079-include-key-as.md
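
For illustration, a minimal end-to-end sketch of the new clause on a Kafka table (the topic name and column aliases here are made up for the example, not taken from this PR):

create table t (a int)
  include key as some_key
  include partition
  include offset
  include timestamp as some_ts
  include header as some_headers   -- type: struct<key varchar, value bytea>[]
with (
  connector = 'kafka',
  topic = 'test_topic',                               -- hypothetical topic
  properties.bootstrap.server = 'message_queue:29092'
) format plain encode json;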

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

described above.

A special note on batch queries on sources

In the previous implementation, we always inserted a timestamptz column into the catalog for every source using Kafka.
With the new `include timestamp` clause the semantics are the same, so in this PR we no longer insert that column if `include timestamp` is already specified.

A minor change for batch queries

For sources like

create source s ( ... ) with ( ... ) format ... encode ... 
create source s ( ... ) include timestamp with ( ... ) format ... encode ... 

the query `select * from s where _rw_kafka_timestamp > '1977-01-01 00:00:00'` works.

But if an alias is specified

create source s ( ... ) include timestamp as some_ts with ( ... ) format ... encode ... 

the query should be `select * from s where some_ts > '1977-01-01 00:00:00'`.

@tabVersion tabVersion marked this pull request as ready for review January 4, 2024 08:58
@tabVersion tabVersion requested a review from a team as a code owner January 4, 2024 08:58
@tabVersion
Contributor Author

On second thought, we are going to use both the partition column and the offset column to record consumption progress, and deprecate StreamChunkWithState.
So there is no need to spend effort on the tests here; the later refactor will cover most of the logic.

@tabVersion tabVersion added ci/run-s3-source-tests ci/run-backwards-compat-tests Run backwards compatibility tests in your PR. labels Jan 4, 2024
Contributor

@st1page st1page left a comment

generally LGTM

e2e_test/source/basic/inlcude_key_as.slt
AND timestamp_col IS NOT NULL
AND header_col IS NOT NULL
----
101
Contributor

How do we know what count to expect here 🤔

Where can I find the input data?

Contributor Author

for i in {0..100}; do echo "key$i:{\"a\": $i}" | ${KCAT_BIN} -P -b message_queue:29092 -t ${ADDI_COLUMN_TOPIC} -K : -H "header1=v1" -H "header2=v2"; done

It will generate messages like:

key    | payload    | header
key1   | {"a": 1}   | [(header1, v1), (header2, v2)]

Member

We may want to mention this in a comment to avoid confusion.

name,
id,
DataType::List(get_kafka_header_item_datatype().into()),
AdditionalColumnType::Header,
Contributor

Should this be DataType::Struct instead?

Contributor Author

No, a Kafka header is a list of key-value pairs with schema (varchar, bytes). The list can be empty (i.e. no headers) or contain multiple pairs.

here is an example

for i in {0..100}; do echo "key$i:{\"a\": $i}" | ${KCAT_BIN} -P -b message_queue:29092 -t ${ADDI_COLUMN_TOPIC} -K : -H "header1=v1" -H "header2=v2"; done

kcat will generate 101 messages, each carrying a header with two key-value pairs: (header1, v1) and (header2, v2).
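
In SQL terms the included column is therefore a list of structs rather than a single struct. A rough sketch of the resulting shape (the alias header_col is assumed for illustration, not taken from this PR):

-- declared via: include header as header_col
-- header_col has type struct<key varchar, value bytea>[]
-- a message produced by the loop above carries [(header1, v1), (header2, v2)];
-- a message with no headers carries an empty list
select header_col from t;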

Member

@xxchan xxchan left a comment

Is legacy code for _rw_kafka_timestamp considered?

@tabVersion
Contributor Author

Is legacy code for _rw_kafka_timestamp considered?

Good question, I almost forgot about it.

@tabVersion
Contributor Author

tabVersion commented Jan 10, 2024

I remember it's the case, just to confirm: #13707 will be included in v1.6, but not documented, right? So this PR includes some changes to that, but it's OK. BTW, we might mention this in the Release note section for the doc team to understand it better.

Yes, #13707 is in v1.6.0 and will not be in the docs; I will notify the doc team.

- Add `_rw_kafka_timestamp` column to messages from Kafka source
- Handle addition of columns and bind primary key columns
- Set connector to backfill mode and enable CDC sharing mode
- Check and add timestamp column before generating column IDs
- Throw error if source does not support PRIMARY KEY constraint
- Bind source watermark based on columns
- Resolve privatelink connection for Kafka source
- Create PbSource object with provided properties
- Import `KAFKA_TIMESTAMP_COLUMN_NAME` and handle legacy column in `trad_source.rs`

Signed-off-by: tabVersion <[email protected]>
// }),
// ),
(
"header", // type: struct<key varchar, value bytea>[]
Member

@fuyufjh fuyufjh Jan 10, 2024

I think JSONB is better for storing Kafka headers because you can get (->) a value easily.

A related topic was discussed at #13387 (Summary at #13387 (comment))

Contributor Author

The problem is that the jsonb type does not support bytes inside.

Member

@fuyufjh fuyufjh Jan 10, 2024

Good point. 🤣 Thinking...

The downside of struct<key varchar, value bytea>[] is obvious: RW/PG doesn't provide any function to get a value by key. I don't know how users can do that...

Contributor

We can support array_filter() (https://github.com/risingwavelabs/rfcs/pull/69/files#diff-857a6f40f71644499fee9c269c260a570942420de9a0225b059508d02c1fe98bR127-R138) or array_find for it, if there are not too many fields in the array.
Or support a Map datatype...
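
Until array_filter/array_find or a Map type lands, a possible workaround sketch, assuming unnest and struct field access behave as in Postgres (header_col is the assumed alias from the sketch above, not a real column in this PR):

select (h).value
from (
  select unnest(header_col) as h
  from t
) u
where (h).key = 'header1';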

Contributor Author

maybe lambda can do the work 😈

Member

@fuyufjh fuyufjh left a comment

LGTM for the rest

@tabVersion
Contributor Author

Btw, it seems like the additional offset column can't be used for the rate limit implementation of source then?

Will the offset column only be present when the user includes it? Or will it always be parsed, but marked as hidden?

Previous thread: #13800 (comment)

Oh, I plan to do the refactor in #14384. After this, we will always derive offset and partition columns for sources and tables with connectors, regardless of whether users explicitly include them. The clause just changes the visibility of the two columns.

- Added new source `s10` with columns `v1` and `v2`
- Included a timestamp column `some_ts` in the `s10` source
- Configured `s10` source as a Kafka connector with topic, bootstrap server, and startup mode properties
- Implemented a query to filter rows from `s10` based on a specific timestamp
- Dropped tables `s8` and `s9`
- Removed source `s9`
- Removed source `s10`

Signed-off-by: tabVersion <[email protected]>
@tabVersion tabVersion enabled auto-merge January 10, 2024 13:30
Signed-off-by: tabVersion <[email protected]>
@tabVersion tabVersion added this pull request to the merge queue Jan 10, 2024
Merged via the queue into main with commit b03a641 Jan 10, 2024
27 of 28 checks passed
@tabVersion tabVersion deleted the tab/addi-columns branch January 10, 2024 15:20
@xxchan
Member

xxchan commented Jan 10, 2024

but if there is an alias specified

create source s ( ... ) include timestamp as some_ts with ( ... ) format ... encode ...
the query should be select * from s where some_ts > '1977-01-01 00:00:00'

I think expr_to_kafka_timestamp_range still uses the hard-coded KAFKA_TIMESTAMP_COLUMN_NAME, so that query doesn't work as expected...

@tabVersion
Contributor Author

but if there is an alias specified
create source s ( ... ) include timestamp as some_ts with ( ... ) format ... encode ...
the query should be select * from s where some_ts > '1977-01-01 00:00:00'

I think expr_to_kafka_timestamp_range still uses the hard-coded KAFKA_TIMESTAMP_COLUMN_NAME, so that query doesn't work as expected...

Yes, but the example above works. Let me find out why...

@xxchan
Member

xxchan commented Jan 17, 2024

I think it may be because the predicate becomes a FILTER above the SOURCE, rather than being pushed down into the source.

@tabVersion
Contributor Author

I think it may be because the predicate becomes a FILTER above the SOURCE, rather than being pushed down into the source.

Alright, can you help remove the hard-coded column to prevent a future panic?

@xxchan
Member

xxchan commented Jan 17, 2024 via email
