-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(source): support Avro Union type #17485
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
8e633f5
to
92a4d7a
Compare
e2e7e9e
to
5f3947b
Compare
7b24228
to
9f3b8fc
Compare
5f3947b
to
011b18c
Compare
This comment has been minimized.
This comment has been minimized.
9f3b8fc
to
3222903
Compare
eb2fb5a
to
66f4ae4
Compare
3222903
to
2351718
Compare
66f4ae4
to
75f46d2
Compare
2351718
to
de77653
Compare
ca811fa
to
1f0e216
Compare
// Actually this should be an invalid schema according to the spec. https://issues.apache.org/jira/browse/AVRO-2380 | ||
// But some library like Python and Rust both allow it. See `risingwave_connector_codec::decoder::avro::tests::test_avro_lib_union` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about being conservative and reject unions containing logical types in the first version, given their behavior is inconsistent across different language implementations?
(As mention in the linked issue, logical types are allowed to be ignored, and it shall not make a difference for the sake of interoperability. The support within python & rust shall be considered as an unintentional bug.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about being conservative and reject unions containing logical types in the first version
Understand and I also used to be conservative, want to delay the decision. Especially since the codebase of our source contains too many mysterious stuff, being extra cautious can prevent us from more footguns..
But @fuyufjh seems to support more the "just do it" way, like the "map to jsonb" problem #16963:
- There seem to be no obvious problems of the decision. It might not be always the "best" way in all situations, but also not too bad (at most 1 or 2 operations away?).
- These usages are common, so "I would be unwilling to say we support it (if these cases not supported)". (Well this might be arguable, since why few users have requested union/map type before? 🤣)
- Very few additional workload is required.
I think these also make sense.
Currently I tend to "just do it", if
- there's no serious drawbacks we can come up with
- the decision is reversible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the decision is reversible
No, it would become burden for backward compatibility, both for union-containing-logical-type and for map-containing-more-types. If we believe it is okay to break them in the future, meaning few people are relying on them, why implement it at the first place; if we foresee they would be used in this way, then we would need to support both flavors forever. @fuyufjh
To elaborate on the incompatibility here:
Given ["double", {"type":"int", "logicalType": "date"}]
, which is unarguably a valid schema, we have the option to transform it into struct<"double" double, "date" date>
vs struct<"double" double, "int" date>
. They are not compatible and cannot co-exist. A decision has to be made and inreversible.
The former definitely looks more intuitive and as a user I would prefer it. But it is against the design behind avro logical type - an optional annotation. Its official json encoding outputs {"int": 19912}
rather than {"date":"2024-07-08"}
. (Whereas protobuf spec does treat well-known types specially for their json encoding rather than treating them as normal nested message). Again I am also leaning towards the former option but would like to defer the decision until we have more inputs on how it is being used in related systems.
Talking about practical usage, when the upstream is Java, they would not be able to produce the invalid ["int", {"type":"int", "logicalType": "date"}]
according the linked issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we are the talking about the Avro Source, I think we can be less restrictive, because the data's producer is not under the control of us (or even users).
Saying, if the user is using a strict Java Avro writer, then ["double", {"type":"int", "logicalType": "date"}]
won't exist. No problem.
Only when the user is using a non-strict Avro writer, we will be facing this problem. In that case, I think accepting the data is a bit better than rejecting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct<"double" double, "date" date>
vsstruct<"double" double, "int" date>
. They are not compatible and cannot co-exist. A decision has to be made and irreversible.
By "reversible" I mean introducing config option for compatibility (or just breaking change). Compared with large architecture decisions, this are "relatively reversible" to me. lol
If we believe it is okay to break them in the future, meaning few people are relying on them, why implement it at the first place.
This is indeed a good question and perhaps I don't have a good answer to it. But I feel a lot of what we do is guess what users want?
Again I am also leaning towards the former option but would like to defer the decision until we have more inputs on how it is being used in related systems.
What I doubt now is that whether we can have useful inputs from users? Similar to the "map to jsonb" problem, I guess if we ask users "what expected result do you want", the answer is probably "whatever"... Perhaps users can hardly understand the subtleties, and need "it just works".
So for such kind of problems, we still finally have to choose for users. That's why I'd make the decision earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think there are 2 problems mixed together:
- The invalid schema:
["int", {"type":"int", "logicalType": "date"}]
. I guess @fuyufjh is talking about allowing it, because we want to be tolerate. This point makes some sense (like allowing out-of-range JSON numbers?), but currently Rust Avro lib will reject it. - Which field name to use for logical types (logical or actual). IMO this is kind of arbitrary decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct<"double" double, "int" date>
BTW, I just realized that "int" date
is way too strange. 🤣
In our system, the logical type is just treated completely different with the actual type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we ask users "what expected result do you want", the answer is probably "whatever"... Perhaps users can hardly understand the subtleties, and need "it just works".
This is not always the case. Such answer is still informative and we would know whether our proposed solution fits their concrete needs, rather than guess what users want. It is true in this case there is still risk that decision is biased toward single use case, but better than for speculated zero use case.
Just thought about a new issue: Logical types may be based on named types (fixed or record), for example [{"type":"fixed","name":"Decimal128","size":16,"logicalType":"decimal","precision":38,"scale":2}, {"type":"fixed","name":"Decimal256","size":32,"logicalType":"decimal","precision":50,"scale":2}]
. This is valid in Java (different name) but we may encounter a conflict due to use of decimal
as field name. Given Avro spec has been vague (or against the idea) here, making our extension shall be careful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I just realized that
"int" date
is way too strange. 🤣
Yes so I have been saying I also prefer logical type name, but need more input to justify. As my latest example shows, at least we shall do the following instead:
- When physical type is named, use physical
name
; else use logical type
^ I am hesitating because there may be more loopholes that need patches like this, and changing from "decimal" decimal
to "Decimal128" decimal
is definitely a breaking change - of course we could introduce a new option. It just feels unnecessary complex to maintain these backward compatible options.
IMO this is kind of arbitrary decision.
It is unfortunately not arbitrary. Two things are coupled with each other here: (1) unions cannot have duplicate members; (2) structs cannot have duplicate field names. Ideally we want to separate the 2 concerns but I feel avro was designed this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reconsidered why I'm thinking "make reversible decision early" or "bias for action" is good. Basically there are 2 main benefits:
- To better serve potential users. Since users have limited attention, they may not bother telling us their requirements, but will just silently go away when they meet problems. So we'd better make everything smooth if we can already foresee the requirement.
- (This is actually more important to me), unblock us to do later work.
For the specific problem of source (more specifically, FORMAT
s), it's actually the "leaf decision" (no follow-ups to be done...). Therefore, the point 2 doesn't hold.
The point 1 still holds to some extent to me. But since we are to-B software (so serious users, especially customers will not walk away silently just for this minor reason), and
in the first place, few users requested these features like Avro Union and Map, this might be less likely a concern (but an imaginary problem).
So now I would agree with you that making the decision early doesn't have outstanding benefits but potential risks, and splitting the seemingly controversial part into a separate PR is indeed more responsible. cc @neverchanje @fuyufjh
Signed-off-by: xxchan <[email protected]>
c915565
to
6a8c351
Compare
@xiangjinwu You can review the changes on graphite (v17..v19), or just on GitHub, select new commits starting from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
# # Demonstrate how to access union variants (struct fields) below: | ||
# # Note that we need to use quotes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Demonstrate with a simpler example that is working now?
Signed-off-by: xxchan <[email protected]>
Signed-off-by: xxchan <[email protected]>
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
fix #16273
Note: real LoC change of this PR is very short. Mostly tests.
TODO
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
Support Avro Union type in
SOURCE
s.There's a example in
e2e_test/source_inline/kafka/avro/union.slt