RFC: use `include ... as ...` to ingest more message parts #79

tabVersion · 2023-11-22T08:27:55Z

No description provided.

st1page · 2023-11-22T08:51:28Z

rfcs/0079-include-key-as.md

+* If `as` name is not specified, a connector-component naming template will be applied
+  * For connector kafka and component key, the derived message key column name is `_rw_kafka_key`.
+* The default type for message key column is `bytea`. The priority of the type definition is: 
+  `key encode` > infer from `format ... encode ...` > default type  


infer from format ... encode ...

What does it mean?

It applies when there is a schema registry and we can infer the key schema from it.
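As a rough illustration of the two rules in the quoted hunk (the default column name and the type priority), here is a minimal Python sketch. The function names and the `_rw_{connector}_{component}` template string are assumptions generalized from the single `_rw_kafka_key` example in the RFC text; they are not RisingWave APIs.

```python
from typing import Optional

# Default type for the message key column, per the RFC text.
DEFAULT_KEY_TYPE = "bytea"


def default_column_name(connector: str, component: str) -> str:
    """Hypothetical naming template, generalized from `_rw_kafka_key`."""
    return f"_rw_{connector}_{component}"


def resolve_key_type(key_encode_type: Optional[str] = None,
                     inferred_type: Optional[str] = None) -> str:
    """Pick the key column type by priority:
    `key encode` > inferred from `format ... encode ...` > default type."""
    if key_encode_type is not None:
        return key_encode_type
    if inferred_type is not None:
        # e.g. a key schema found in a schema registry
        return inferred_type
    return DEFAULT_KEY_TYPE
```

With no `key encode` and nothing to infer, `resolve_key_type()` falls back to `bytea`; an explicit `key encode` type always wins over an inferred one.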

st1page · 2023-11-22T08:54:53Z

rfcs/0079-include-key-as.md

+* For all connectors with `format upsert`, RisingWave derives the column as primary key to perform upsert semantic.
+  * An exception: for `format upsert encode json`, RisingWave allows the PK to be part of the message payload instead of
+  the message key. This behavior is more like `format plain encode json` with PK constraint.
+  * **To Be Discussed**: whether we can afford to perform a breaking change for the above exception.


Moreover, we need a way to define the PK from part of the JSON message key. I'm not sure whether that can be the default behavior when the user defines a primary key constraint.

An exception: for format upsert encode json, RisingWave allows the PK to be part of the message payload instead of the message key. This behavior is more like format plain encode json with PK constraint.

IIRC, according to our previous discussion, this will be format insert + normal PK definition i.e. primary key (foo, bar).

The exception looks very inconsistent and I'd like to get rid of it.

In some cases, users want to choose the table's primary key themselves, for example for:

* better batching/serving performance
* streaming temporal join

In those cases, the user can make sure the primary key columns are exactly the message key in Kafka.

If format upsert is specific to MQs with a delete tombstone (null value), I think the PK must be the MQ message key and cannot be a field in the value.

I think either of the following behaviors is acceptable:

1. All fields in the message key are used as the PK, and an empty message body is considered a delete tombstone.
2. All/some fields in the message key/body are used as the PK, and an empty message body is not considered a delete tombstone.

1 corresponds to upsert semantics, while 2 corresponds to plain (aka insert) semantics. Whether an empty message is considered a delete tombstone is the key difference between upsert and plain.

Example:

/* message key: {"a":1,"b":2}
   message body: {"a":1,"b":2,"c":12,"d":34} */

-- case 1 --
create table t ( a int, b int, c int, d int)
format upsert key encode json include key as pk;
/* NOTE: The output table will contain a `pk jsonb` column */

-- case 2 --
create table t ( a int, b int, c int, d int, primary key (a,b))
format plain [key encode json include key as pk] -- optional!
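To make the difference between the two behaviors concrete, here is a small Python sketch that models a table as a dict keyed by PK, using the sample message from the example. It is only a toy model of the semantics discussed in this thread: `apply_upsert`, `apply_plain`, and the in-memory tables are hypothetical names, not RisingWave internals. Skipping an empty body in the plain case and overwriting on a repeated PK are assumptions of the sketch.

```python
import json
from typing import Dict, Optional, Tuple

Row = Dict[str, int]


def apply_upsert(table: Dict[str, Row], key: bytes, body: Optional[bytes]) -> None:
    """Behavior 1: the whole message key is the PK; an empty body is a delete tombstone."""
    pk = key.decode()
    if not body:
        table.pop(pk, None)  # tombstone: delete the row if present
    else:
        table[pk] = json.loads(body)


def apply_plain(table: Dict[Tuple[int, ...], Row], body: Optional[bytes],
                pk_columns: Tuple[str, ...]) -> None:
    """Behavior 2: the PK comes from declared columns; an empty body is not a tombstone."""
    if not body:
        return  # assumption: nothing to insert, so the message is skipped
    row = json.loads(body)
    pk = tuple(row[c] for c in pk_columns)
    table[pk] = row  # assumption: a later row with the same PK overwrites


key = b'{"a":1,"b":2}'
body = b'{"a":1,"b":2,"c":12,"d":34}'

upsert_table: Dict[str, Row] = {}
apply_upsert(upsert_table, key, body)
apply_upsert(upsert_table, key, None)  # the tombstone removes the row

plain_table: Dict[Tuple[int, ...], Row] = {}
apply_plain(plain_table, body, ("a", "b"))
apply_plain(plain_table, None, ("a", "b"))  # ignored, the row stays
```

After the same two messages, the upsert table is empty while the plain table still holds the row under PK `(1, 2)`.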

fuyufjh

Generally LGTM

fuyufjh · 2023-11-23T03:46:40Z

rfcs/0079-include-key-as.md

+* **Important**: `include key` is required for `format upsert` and RisingWave will use the key column as one and 
+  only primary key to perform upsert semantic. It does not allow to specify multiple columns as primary key
+  even if they are part of the key.


So,

* if I specify `key encode avro` and `include key as x`, `x` will be a struct
* if I specify `key encode json` and `include key as x`, `x` will be JSONB

Right?

fuyufjh · 2023-11-23T03:52:04Z

rfcs/0079-include-key-as.md

+* For all connectors with `format upsert`, RisingWave derives the column as primary key to perform upsert semantic.
+  * An exception: for `format upsert encode json`, RisingWave allows the PK to be part of the message payload instead of
+  the message key. This behavior is more like `format plain encode json` with PK constraint.
+  * **To Be Discussed**: whether we can afford to perform a breaking change for the above exception.


An exception: for format upsert encode json, RisingWave allows the PK to be part of the message payload instead of the message key. This behavior is more like format plain encode json with PK constraint.

IIRC, according to our previous discussion, this will be format insert + normal PK definition i.e. primary key (foo, bar).

The exception looks very inconsistent and I'd like to get rid of it.

xiangjinwu · 2023-11-23T08:50:24Z

rfcs/0079-include-key-as.md

+
+| Allowed Components | Default Type                             | Note                                                                             |
+|--------------------|------------------------------------------|----------------------------------------------------------------------------------|
+| key                | `bytea`                                  | Allow overwritten by `encode` and `key encode`. Refer to `Record::partition_key` |


nit: partition key in kinesis is always a unicode string

Yes, I made it `bytea` here to keep it consistent with other connectors. There are planned refactors that rely on a unified type in the source state table.

xiangjinwu · 2023-11-23T09:10:29Z

rfcs/0079-include-key-as.md

+|--------------------|--------------|-------------------------------------------------------------------------------------------|
+| key                | `bytea`      | Allow overwritten by `encode` and `key encode`. Refer to `MessageMetadata::partition_key` |
+
+More components are available at [here](https://docs.rs/pulsar/latest/pulsar/message/proto/struct.MessageMetadata.html).


How do we decide which metadata fields to expose? Do we intend to expose similar (not sure if same) concepts from different connectors using the same word? For example:

* kafka key, pulsar partition_key, kinesis partition_key
  * There is also pulsar ordering_key or kinesis ExplicitHashKey
* kafka offset, pulsar sequence_id, kinesis sequence_number
  * i64 vs u64 vs string
* kafka timestamp, pulsar publish_time, kinesis approximate_arrival_timestamp
  * There is also pulsar event_time
* kafka headers, pulsar properties
  * kafka headers is struct<varchar, bytea>[] but pulsar properties is struct<varchar, varchar>[]

Do we intend to expose similar (not sure if same) concepts from different connectors using the same word?

I think it's not necessary to unify them. If there is ambiguity, I tend to let them have different names.

Update: now we do want to unify partition and offset for them. @tabVersion Can you add the rationale for that in the RFC? 😇

Yes, we are going to unify the partition and offset column types for all connectors.
The main reason is that we want to implement throttling at the source executor level, which requires that a chunk can be cut anywhere while still keeping the offset info. Besides, the source backfill feature also relies on this behavior to tell when to end the backfill stage.
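The cross-connector correspondences raised earlier in this thread can be written down as a small lookup table, which also shows why the offset type needs unifying. This is a sketch for discussion only: the dictionary layout and helper names are hypothetical, and the per-connector offset types assume that "i64 vs u64 vs string" is listed in the same order as kafka, pulsar, kinesis.

```python
from typing import Dict, Optional

# Native names of roughly corresponding metadata fields, as listed in the thread.
NATIVE_FIELD: Dict[str, Dict[str, str]] = {
    "key": {"kafka": "key", "pulsar": "partition_key", "kinesis": "partition_key"},
    "offset": {"kafka": "offset", "pulsar": "sequence_id", "kinesis": "sequence_number"},
    "timestamp": {"kafka": "timestamp", "pulsar": "publish_time",
                  "kinesis": "approximate_arrival_timestamp"},
    "headers": {"kafka": "headers", "pulsar": "properties"},
}

# Native offset representations (assumed ordering, see the note above).
NATIVE_OFFSET_TYPE: Dict[str, str] = {"kafka": "i64", "pulsar": "u64", "kinesis": "string"}


def native_field(concept: str, connector: str) -> Optional[str]:
    """Return the connector's native field for a concept, or None if it has none listed."""
    return NATIVE_FIELD.get(concept, {}).get(connector)


def offset_types_agree() -> bool:
    """True only if every connector already uses the same native offset type."""
    return len(set(NATIVE_OFFSET_TYPE.values())) == 1
```

Since `offset_types_agree()` is false for the native types, a single offset column type has to be chosen at the RisingWave level if the source executor is to handle offsets uniformly.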

xiangjinwu · 2023-11-23T09:12:59Z

rfcs/0079-include-key-as.md

+| timestamp          | `timestamp with time zone` (i64 in millis) | Refer to `CreateTime` rather than `LogAppendTime` |
+| partition          | `i64`                                      | The message is from which partition               |
+| offset             | `i64`                                      | The offset in the partition                       |
+| header             | `struct<varchar, bytea>[]`                 | KV pairs along with message                       |


plural: headers

xiangjinwu · 2023-11-27T06:09:01Z

Would also like to mention an alternative syntax:

create source/table (
    ..,
    <key-column> bytea FROM SOURCE key,
    <timestamp-column> timestamptz FROM SOURCE timestamp,
    <headers-column> struct<key varchar, value bytea>[] FROM SOURCE headers,
    primary key ( <key-column> )
)
with ( ... )

Pros and cons of this syntax, compared to include ... as ...:

* All columns are listed in one place, similar to DEFAULT and GENERATED columns.
* Users are required to spell out the data type.
* (welcome to add more)

neverchanje · 2023-11-27T07:34:22Z

Regarding the upgrade, please automatically add `include key as '_rw_key'` to old tables so that they won't need recreation.

st1page · 2023-11-27T09:48:26Z

Would also like to mention an alternative syntax:

Currently, we do not allow users to define the schema in column clauses when using a schema registry or schema file.
Another option is the "star" grammar: risingwavelabs/risingwave#12209


fuyufjh · 2023-12-04T04:17:27Z

FYI. Feel free to join the slack channel #wg-include-key-as


tabVersion added 2 commits November 22, 2023 16:26

new rfc

ee522c0

rename

669bb2a

st1page reviewed Nov 22, 2023


fuyufjh reviewed Nov 23, 2023


xiangjinwu reviewed Nov 23, 2023


tabVersion mentioned this pull request Nov 29, 2023

feat: introduce include clause to add additional connector columns risingwavelabs/risingwave#13707

Merged


BugenZhao reviewed Dec 4, 2023


tabVersion mentioned this pull request Dec 4, 2023

fix: revert #13278 & #13390 for include syntax risingwavelabs/risingwave#13785

Merged


tabVersion mentioned this pull request Dec 26, 2023

feat: allow configure other additional columns for connectors risingwavelabs/risingwave#14215

Merged


Update 0079-include-key-as.md

a4b5c65

update offset and partition column type to make it consistent with existing source impl

xxchan reviewed Dec 26, 2023


Update rfcs/0079-include-key-as.md

89ec15e

Co-authored-by: xxchan <[email protected]>

st1page approved these changes Jan 12, 2024


tabVersion merged commit caa060b into main Jan 12, 2024

tabVersion deleted the tab/include-key-as branch January 12, 2024 07:40

xiangjinwu mentioned this pull request Apr 18, 2024

CDC connector with additional columns risingwavelabs/risingwave#16359

Closed
