Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sink): jsonb column to avro record/map/union type #16941

Open
xiangjinwu opened this issue May 27, 2024 · 8 comments
Open

feat(sink): jsonb column to avro record/map/union type #16941

xiangjinwu opened this issue May 27, 2024 · 8 comments
Assignees
Labels
component/connector needs-design Don't start your coding work before a detailed design proposed needs-discussion
Milestone

Comments

@xiangjinwu
Copy link
Contributor

I think I am facing a similar issue. I am trying to sink jsonb values from a materialized view into an avro schema in kafka, with map and record types. I am getting the following errors:

Encode error: encode foo error: cannot encode jsonb column as {"name":"foo","type":"record","fields":[{"name":"id","type":"string"},{"name":"name","type":"string"}]} field

Or

Encode error: encode foo error: cannot encode jsonb column as {"type":"map","values":"string"} field

Originally posted by @maingoh in #11699 (comment)

@github-actions github-actions bot added this to the release-1.10 milestone May 27, 2024
@xiangjinwu xiangjinwu added needs-discussion needs-design Don't start your coding work before a detailed design proposed labels May 27, 2024
@xiangjinwu xiangjinwu self-assigned this May 27, 2024
@xiangjinwu
Copy link
Contributor Author

@maingoh In your specific case, could you try converting the loosely-typed jsonb to a struct before sinking?

jsonb_populate_record(null::struct<id varchar, name varchar>, foo)

(Note the function jsonb_populate_record is only available since 1.9.0.)

As for a general case, note that true and 23 are also valid jsonb values but would not be able to be sinked as an avro record or map. In the workaround above we are explicitly saying null shall be used as a default in these unexpected cases.

Does this address your issue?

@maingoh
Copy link

maingoh commented May 27, 2024

Thank you, I managed to sink a simple struct as an avro record but I don't find a way to generate a type that is convertible to an avro map. Is a jsonb object directly convertible to a map ? I actually have a joined table with a name, and some other columns. I would like to build a map<name, record<column_2, column_3, ...>>. I tried building a jsonb object of struct but it does not seem to work.

I would also need a way to union different types as avro allow it. For example we have a jsonb column which can be different types. I feel it does not exist (yet ?), what would be the best way to sink such values ?

As for a general case, note that true and 23 are also valid jsonb values but would not be able to be sinked as an avro record or map. In the workaround above we are explicitly saying null shall be used as a default in these unexpected cases.

Good to know! I actually didn't need to use jsonb_populate_record yet as I was building my record manually.

@xiangjinwu
Copy link
Contributor Author

a type that is convertible to an avro map

You are right it is not available in RisingWave yet. Although jsonb feels close, there can be ambiguous cases, like {"foo": "AA=="} into avro map<bytes>. Instead, we plan to support a native map data type in RisingWave #13387 (comment)

I would also need a way to union different types as avro allow it. For example we have a jsonb column which can be different types. I feel it does not exist (yet ?)

There is no native union data type either and it is not part of the plan. The plan is to allow sinking specific types, i.e. both int columns and bool columns can be sinked as avro ["null", "int", "boolean"]. If the upstream is a jsonb column, it would need to be split into strongly typed streams beforehand:

create sink variant_int as select foo::int from mixed where jsonb_typeof(foo) = 'number' ...
create sink variant_bool as select foo::bool from mixed where jsonb_typeof(foo) = 'boolean' ...

To clarify:

  • The ability to sink specific types in avro union (as shown above) does not exist today.
  • Alternatives are open to discuss. Given jsonb to avro struct/union is a very generic topic we would like to avoid implementing it a way that only works in one specific use case. We need to clearly define the data type mapping rules. The one defined in avro spec is an option, but very likely not what people want: it requires {"int": 42} or {"boolean": true} for avro union ["null", "int", "boolean"]. Such complexity and unclarity is why we prefer not to touch jsonb <-> avro and support strongly typed columns first.

@xiangjinwu xiangjinwu changed the title feat(sink): jsonb column to avro record/map type feat(sink): jsonb column to avro record/map/union type May 28, 2024
@maingoh
Copy link

maingoh commented May 28, 2024

Although jsonb feels close, there can be ambiguous cases, like {"foo": "AA=="} into avro map

Is it really possible to store bytes into a jsonb field ? At least JSON support only strings (usually bytes are b64 encoded strings). So for me it sounds quite natural to convert it natively to map<string, string>. And if the user want to cast it to bytes, he can ask for it explicitely eventually.

If the upstream is a jsonb column, it would need to be split into strongly typed streams beforehand

I agree that the union type is not very needed as a risingwave type. However doing the conditional check on types is something that RW could support natively for most types, it is not very user friendly to do so in SQL. If the jsonb column is compatible with the avro format, why not try to serialize it without needing some strong casts ? And having a way to default to null the uncompatible ones like in the example with jsonb_populate_record(null::struct<id varchar, name varchar>, foo). In my case I am sure that the avro will match the jsonb column, but one field can be from different types.

There could be a parameter on the sink side to treat jsonb columns as avro ones while sinking. This way it would still be stored as unstructured JSONB on RW (allowing a single json field to have different type) but would be converted in the sink only using this mapping:

JSONB AVRO
array array
object record/map
integer int or long ?
boolean boolean
number float
string string
null null

All other very specific (dates, bytes, smallint, float) would need an explicit conversion in the query itself. But since most JSON column only have the types above, it would save a lot of boilerplate to not have to handle this on the user side.

In a similar way, on the source side, RW would find the best native type if possible otherwise fallback to JSONB in case it is not possible.

It could be an optional parameter allow_jsonb_fallback_conversion=True. Or maybe a more explicit way of selecting which columns can be converted or not to/from jsonb. Just an idea :)

@maingoh
Copy link

maingoh commented May 29, 2024

Here is another mapping that would fit my use case https://materialize.com/docs/sql/create-sink/kafka/#avro:

  • They have a native map type
  • Any nullable type is sinked as an avro union type of [my_type, "null"]
  • All other complicated cases (union of many types) are stored as a json string with a "connect.name": "io.debezium.data.Json" to be able to automatically decode it as json on consumer side

@xiangjinwu
Copy link
Contributor Author

  • We will be working on a native map type soon.
  • Sinking nullable as avro-union-with-null is already supported today.
  • Sinking jsonb as avro-string is trivial and the support of connect.name is also work in progress.

My major concern above is about adding a native union type, or sinking arbitrary jsonb as avro union. There was one core difference I did not mention and may have led to some confusions:

RisingWave does not automatically generate and register the avro schema to schema registry. Instead, it accepts an existing one read from schema registry, so it need to do the following validation before seeing any concrete records:

  • Can a RisingWave int column fit into avro int field? Yes.
  • Can a RisingWave int column fit into avro boolean field? No. create sink fails.
  • Can a RisingWave jsonb column fit into avro ["int", "boolean"] field? It depends...

If we were to generate the avro schema, we could generate a ["null", {"type": "string", "connect.name": "io.debezium.data.Json"}] so that all possible jsonb values can fit in it. The schema auto-register deature is also in plan.

Does this address your concern?

@maingoh
Copy link

maingoh commented Jun 4, 2024

  1. We will be working on a native map type soon.
  2. Sinking nullable as avro-union-with-null is already supported today.
  3. Sinking jsonb as avro-string is trivial and the support of connect.name is also work in progress.

If 1. and 3. land quickly I feel handling union is less of a priority. In any case the native union type feels too much and can be handled using jsonb. I have not seen any database supporting it (I didn't spend much time though).
However I feel being able to source and sink array/maps/union as jsonb might be faster to implement than having a fully integrated native RW map, and will handle the union case.

Can a RisingWave jsonb column fit into avro ["int", "boolean"] field? It depends...

In this case I would say:

  • RW does not allow it by default, sink fail
  • Having some extra config in the sink that:
    • Allow the creation of the source/sink if it cannot know in advance
    • Could say which jsonb field can be converted to avro. Example: allow_jsonb_conversion = {'foo': {'default': null}, 'bar': {'raise': true}} (could be a more SQL like syntax).
    • Try to convert the jsonb value (foo and bar) to the corresponding avro schema. If fails raise an error for bar or default to null for foo.

@maingoh
Copy link

maingoh commented Jun 20, 2024

@xiangjinwu any approximate idea when 1. and 3. would be available ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/connector needs-design Don't start your coding work before a detailed design proposed needs-discussion
Projects
None yet
Development

No branches or pull requests

2 participants