Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(source): support Avro Union type #17485

Merged
merged 12 commits into from
Jul 9, 2024
Merged

Conversation

xxchan
Copy link
Member

@xxchan xxchan commented Jun 27, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

fix #16273

Note: real LoC change of this PR is very short. Mostly tests.

TODO

  • add some more tests
    • record with/without namespace
  • add an e2e test to demonstrate the usage & for documentation

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Support Avro Union type in SOURCEs.

There's a example in e2e_test/source_inline/kafka/avro/union.slt

Copy link
Member Author

xxchan commented Jun 27, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @xxchan and the rest of your teammates on Graphite Graphite

@xxchan xxchan changed the title refine current union code feat(source): support Avro Union type Jun 27, 2024
@xxchan xxchan marked this pull request as ready for review June 27, 2024 09:59
@xxchan xxchan requested review from xiangjinwu and BugenZhao June 27, 2024 10:18
@xxchan xxchan force-pushed the 06-20-test_add_codec_integration_tests branch from 8e633f5 to 92a4d7a Compare July 2, 2024 02:34
@xxchan xxchan requested a review from a team as a code owner July 2, 2024 02:34
@xxchan xxchan force-pushed the 06-25-feat_support_avro_union_type branch 2 times, most recently from e2e7e9e to 5f3947b Compare July 2, 2024 02:43
@xxchan xxchan force-pushed the 06-20-test_add_codec_integration_tests branch from 7b24228 to 9f3b8fc Compare July 2, 2024 02:47
@xxchan xxchan force-pushed the 06-25-feat_support_avro_union_type branch from 5f3947b to 011b18c Compare July 2, 2024 02:47
@BugenZhao

This comment has been minimized.

@xxchan xxchan force-pushed the 06-20-test_add_codec_integration_tests branch from 9f3b8fc to 3222903 Compare July 2, 2024 15:24
@xxchan xxchan force-pushed the 06-25-feat_support_avro_union_type branch from eb2fb5a to 66f4ae4 Compare July 2, 2024 15:25
@xxchan xxchan force-pushed the 06-20-test_add_codec_integration_tests branch from 3222903 to 2351718 Compare July 4, 2024 07:00
@xxchan xxchan force-pushed the 06-25-feat_support_avro_union_type branch from 66f4ae4 to 75f46d2 Compare July 4, 2024 07:00
@xxchan xxchan force-pushed the 06-20-test_add_codec_integration_tests branch from 2351718 to de77653 Compare July 5, 2024 11:39
@xxchan xxchan force-pushed the 06-25-feat_support_avro_union_type branch 2 times, most recently from ca811fa to 1f0e216 Compare July 5, 2024 12:55
@xxchan xxchan changed the base branch from 06-20-test_add_codec_integration_tests to main July 5, 2024 12:55
@graphite-app graphite-app bot requested a review from a team July 5, 2024 13:13
Comment on lines 350 to 351
// Actually this should be an invalid schema according to the spec. https://issues.apache.org/jira/browse/AVRO-2380
// But some library like Python and Rust both allow it. See `risingwave_connector_codec::decoder::avro::tests::test_avro_lib_union`
Copy link
Contributor

@xiangjinwu xiangjinwu Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about being conservative and reject unions containing logical types in the first version, given their behavior is inconsistent across different language implementations?

(As mention in the linked issue, logical types are allowed to be ignored, and it shall not make a difference for the sake of interoperability. The support within python & rust shall be considered as an unintentional bug.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about being conservative and reject unions containing logical types in the first version

Understand and I also used to be conservative, want to delay the decision. Especially since the codebase of our source contains too many mysterious stuff, being extra cautious can prevent us from more footguns..

But @fuyufjh seems to support more the "just do it" way, like the "map to jsonb" problem #16963:

  • There seem to be no obvious problems of the decision. It might not be always the "best" way in all situations, but also not too bad (at most 1 or 2 operations away?).
  • These usages are common, so "I would be unwilling to say we support it (if these cases not supported)". (Well this might be arguable, since why few users have requested union/map type before? 🤣)
  • Very few additional workload is required.

I think these also make sense.

Currently I tend to "just do it", if

  • there's no serious drawbacks we can come up with
  • the decision is reversible

Copy link
Contributor

@xiangjinwu xiangjinwu Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the decision is reversible

No, it would become burden for backward compatibility, both for union-containing-logical-type and for map-containing-more-types. If we believe it is okay to break them in the future, meaning few people are relying on them, why implement it at the first place; if we foresee they would be used in this way, then we would need to support both flavors forever. @fuyufjh

To elaborate on the incompatibility here:

Given ["double", {"type":"int", "logicalType": "date"}], which is unarguably a valid schema, we have the option to transform it into struct<"double" double, "date" date> vs struct<"double" double, "int" date>. They are not compatible and cannot co-exist. A decision has to be made and inreversible.

The former definitely looks more intuitive and as a user I would prefer it. But it is against the design behind avro logical type - an optional annotation. Its official json encoding outputs {"int": 19912} rather than {"date":"2024-07-08"}. (Whereas protobuf spec does treat well-known types specially for their json encoding rather than treating them as normal nested message). Again I am also leaning towards the former option but would like to defer the decision until we have more inputs on how it is being used in related systems.

Talking about practical usage, when the upstream is Java, they would not be able to produce the invalid ["int", {"type":"int", "logicalType": "date"}] according the linked issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are the talking about the Avro Source, I think we can be less restrictive, because the data's producer is not under the control of us (or even users).

Saying, if the user is using a strict Java Avro writer, then ["double", {"type":"int", "logicalType": "date"}] won't exist. No problem.

Only when the user is using a non-strict Avro writer, we will be facing this problem. In that case, I think accepting the data is a bit better than rejecting it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct<"double" double, "date" date> vs struct<"double" double, "int" date>. They are not compatible and cannot co-exist. A decision has to be made and irreversible.

By "reversible" I mean introducing config option for compatibility (or just breaking change). Compared with large architecture decisions, this are "relatively reversible" to me. lol

If we believe it is okay to break them in the future, meaning few people are relying on them, why implement it at the first place.

This is indeed a good question and perhaps I don't have a good answer to it. But I feel a lot of what we do is guess what users want?

Again I am also leaning towards the former option but would like to defer the decision until we have more inputs on how it is being used in related systems.

What I doubt now is that whether we can have useful inputs from users? Similar to the "map to jsonb" problem, I guess if we ask users "what expected result do you want", the answer is probably "whatever"... Perhaps users can hardly understand the subtleties, and need "it just works".

So for such kind of problems, we still finally have to choose for users. That's why I'd make the decision earlier.

Copy link
Member Author

@xxchan xxchan Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think there are 2 problems mixed together:

  1. The invalid schema: ["int", {"type":"int", "logicalType": "date"}]. I guess @fuyufjh is talking about allowing it, because we want to be tolerate. This point makes some sense (like allowing out-of-range JSON numbers?), but currently Rust Avro lib will reject it.
  2. Which field name to use for logical types (logical or actual). IMO this is kind of arbitrary decision.

Copy link
Member Author

@xxchan xxchan Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct<"double" double, "int" date>

BTW, I just realized that "int" date is way too strange. 🤣

In our system, the logical type is just treated completely different with the actual type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we ask users "what expected result do you want", the answer is probably "whatever"... Perhaps users can hardly understand the subtleties, and need "it just works".

This is not always the case. Such answer is still informative and we would know whether our proposed solution fits their concrete needs, rather than guess what users want. It is true in this case there is still risk that decision is biased toward single use case, but better than for speculated zero use case.

Just thought about a new issue: Logical types may be based on named types (fixed or record), for example [{"type":"fixed","name":"Decimal128","size":16,"logicalType":"decimal","precision":38,"scale":2}, {"type":"fixed","name":"Decimal256","size":32,"logicalType":"decimal","precision":50,"scale":2}]. This is valid in Java (different name) but we may encounter a conflict due to use of decimal as field name. Given Avro spec has been vague (or against the idea) here, making our extension shall be careful.

Copy link
Contributor

@xiangjinwu xiangjinwu Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I just realized that "int" date is way too strange. 🤣

Yes so I have been saying I also prefer logical type name, but need more input to justify. As my latest example shows, at least we shall do the following instead:

  • When physical type is named, use physical name; else use logical type

^ I am hesitating because there may be more loopholes that need patches like this, and changing from "decimal" decimal to "Decimal128" decimal is definitely a breaking change - of course we could introduce a new option. It just feels unnecessary complex to maintain these backward compatible options.

IMO this is kind of arbitrary decision.

It is unfortunately not arbitrary. Two things are coupled with each other here: (1) unions cannot have duplicate members; (2) structs cannot have duplicate field names. Ideally we want to separate the 2 concerns but I feel avro was designed this way.

Copy link
Member Author

@xxchan xxchan Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reconsidered why I'm thinking "make reversible decision early" or "bias for action" is good. Basically there are 2 main benefits:

  1. To better serve potential users. Since users have limited attention, they may not bother telling us their requirements, but will just silently go away when they meet problems. So we'd better make everything smooth if we can already foresee the requirement.
  2. (This is actually more important to me), unblock us to do later work.

For the specific problem of source (more specifically, FORMATs), it's actually the "leaf decision" (no follow-ups to be done...). Therefore, the point 2 doesn't hold.

The point 1 still holds to some extent to me. But since we are to-B software (so serious users, especially customers will not walk away silently just for this minor reason), and
in the first place, few users requested these features like Avro Union and Map, this might be less likely a concern (but an imaginary problem).

So now I would agree with you that making the decision early doesn't have outstanding benefits but potential risks, and splitting the seemingly controversial part into a separate PR is indeed more responsible. cc @neverchanje @fuyufjh

src/connector/codec/src/decoder/avro/mod.rs Outdated Show resolved Hide resolved
src/connector/codec/src/decoder/avro/mod.rs Outdated Show resolved Hide resolved
src/connector/codec/src/decoder/avro/mod.rs Outdated Show resolved Hide resolved
@xxchan
Copy link
Member Author

xxchan commented Jul 9, 2024

@xiangjinwu You can review the changes on graphite (v17..v19), or just on GitHub, select new commits starting from 9d08add

@xxchan xxchan requested a review from xiangjinwu July 9, 2024 09:11
Copy link
Contributor

@xiangjinwu xiangjinwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

Comment on lines +77 to +78
# # Demonstrate how to access union variants (struct fields) below:
# # Note that we need to use quotes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demonstrate with a simpler example that is working now?

xxchan added 3 commits July 9, 2024 18:34
Signed-off-by: xxchan <[email protected]>
Signed-off-by: xxchan <[email protected]>
@xxchan xxchan enabled auto-merge July 9, 2024 11:12
@xxchan xxchan added this pull request to the merge queue Jul 9, 2024
Merged via the queue into main with commit b3e1fad Jul 9, 2024
32 of 33 checks passed
@xxchan xxchan deleted the 06-25-feat_support_avro_union_type branch July 9, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Avro] Support Union types in Avro Schemas with Schema registry
4 participants