diff --git a/docs/design-documents/20240430-schema-support.md b/docs/design-documents/20240430-schema-support.md index 825d6be5f..483ae0483 100644 --- a/docs/design-documents/20240430-schema-support.md +++ b/docs/design-documents/20240430-schema-support.md @@ -1,4 +1,19 @@ -# Schema support +

Schema support

+ + + * [The problem](#the-problem) + * [Requirements](#requirements) + * [Schema format](#schema-format) + * [Support for schemas originating from other streaming tools](#support-for-schemas-originating-from-other-streaming-tools) + * [Schema evolution](#schema-evolution) + * [Implementation](#implementation) + * [Schema operations](#schema-operations) + * [Option 1: Stream of commands and responses](#option-1-stream-of-commands-and-responses) + * [Option 2: Exposing Conduit as a service](#option-2-exposing-conduit-as-a-service) + * [Chosen option](#chosen-option) + * [Questions](#questions) + * [Other considerations](#other-considerations) + ## The problem @@ -19,8 +34,8 @@ the above. 1. Records **should not** carry the full schema. - Reason: If a record would carry the whole schema, that would increase the - record size a lot. + Reason: If a record would carry the whole schema, that might increase a + record's size significantly. 2. Sources and destinations need to be able to work with multiple schemas. Reason: Multiple collections support. @@ -40,6 +55,8 @@ the above. cost of repeatedly fetching the same schema many times (especially over gRPC), schemas should be cached by the SDK. +## Non-goals + ## Schema format A destination connector should work with one schema format only, regardless of @@ -56,13 +73,14 @@ A schema consists of following: * default value The following types are supported: -* Primitive: +* basic: * boolean * integers: 8, 16, 32, 64-bit * float: single precision (32-bit) and double precision (64-bit) IEEE 754 floating-point number * bytes * string -* Complex: + * timestamp +* complex: * array * map * struct @@ -70,6 +88,22 @@ The following types are supported: Every field in a schema can be marked as optional (nullable). +## Support for schemas originating from other streaming tools + +Conduit is sometimes used alongside other streaming tools. For example, Kafka +Connect may be used to read data from a source and write it into a topic. +Conduit then reads messages from that topic and writes it into a destination. We +also have the Kafka Connect wrapper which makes it possible to use Kafka Connect +connectors with Conduit. Here we have two possibilities: + +1. The schema is part of the record (e.g. Debezium records) +2. The schema can be found in a schema registry (e.g. an Avro schema registry) + +## Schema evolution + +TBD (this section is about checking if a new version of a schema is compatible +with the previous one) + ## Implementation Schema support is part of the OpenCDC standard. A schema is represented by the @@ -131,9 +165,92 @@ message Field { } ``` -A reference to a schema is saved in a new metadata field, `opencdc.schemaID`. +### Schema storage + +The schemas are stored in Conduit's database. Currently, there's no need to use +an external service. + +### Schema operations + +Source connectors need a way to register a schema, and destination connectors +need a way to fetch a schema. Regardless of which schema registry is used (an +internal one in Conduit or an external service), the access should be abstracted +away and should go through Conduit. With that, we have the following +implementation options: + +#### Option 1: Stream of commands and responses + +This pattern is used in WASM processors. A server (in this case: Conduit) +listens to commands (in this case: via a bidirectional gRPC stream). A client ( +in this case: a connector) sends a command to either register a schema or fetch +a schema. Conduit receives the command and replies. An example can be seen below: + +```protobuf +rpc CommandStream(stream Command) returns (stream Response); +``` + +For different types of commands and response to be supported, `Command` +and `Response` need to have a `oneof` field where all the possible commands +and respective responses are listed: + +```protobuf +message Command { + oneof cmd { + SaveSchemaCommand saveSchemaCmd = 1; + // etc. + } +} + +message Response { + oneof resp { + SaveSchemaResponse saveSchemaResp = 1; + // etc. + } +} +``` +**Advantages**: + +1. No additional connection setup. When Conduit starts a connector process, it + establishes a connection. The same connection is used for all communication ( + e.g. configuring a connector, opening, reading/writing records, etc.) +2. Connector actions (which are planned for a future milestone) might use the + same command-and-reply stream. + +**Disadvantages**: + +1. Separate flow needed to establish a connector to a remote Conduit instance ( + see [requirement](#requirements) #3). +2. A single method for all the operations makes both, the server and client + implementation, more complex. In Conduit, a single gRPC method needs to check + for the command type and then reply with a response. Then the client (i.e. + the connector) needs to check the response type. In case multiple commands + are sent, we need ordering guarantees. + +#### Option 2: Exposing Conduit as a service + +Conduit exposes a service to work with schemas. Connectors access the service +and call methods on the service. + +For this work, a connector (i.e. clients of the schema service) needs Conduit's +IP address and the gRPC port. The IP address can be fetched +using [peer](https://pkg.go.dev/google.golang.org/grpc/peer#Peer). Conduit can +send its gRPC port to the connector via the `Configure` method. + +**Advantages**: + +1. Works with a remote Conduit instance (see [requirement](#requirements) #3). +2. Easy to understand: the gRPC methods, together with requests and responses, + can easily be understood from a proto file. +3. An HTTP API for the schema registry can easily be exposed (if needed). + +**Disadvatanges**: + +1. Changes needed to communicate Conduit's gRPC port to the connector. + +#### Chosen option -### Schema-related operations in connectors +**Option 2** is the chosen method since it offers more clarity and the support for +remote Conduit instances. ## Questions diff --git a/docs/design-documents/schema.proto b/docs/design-documents/schema.proto deleted file mode 100644 index 2fe505339..000000000 --- a/docs/design-documents/schema.proto +++ /dev/null @@ -1,53 +0,0 @@ -syntax = "proto3"; - -message Schema { - string id = 1; - repeated Field fields = 2; -} - -message FieldType { - oneof type { - PrimitiveFieldType primitiveType = 1; - ArrayType arrayType = 2; - MapType mapType = 3; - StructType structType = 4; - UnionType unionType = 5; - } -} - -enum PrimitiveFieldType { - BOOLEAN = 0; - INT8 = 1; - // other primitive types -} - -message ArrayType { - FieldType elementType = 1; -} - -message MapType { - FieldType keyType = 1; - FieldType valueType = 2; -} - -message StructType { - repeated Field fields = 1; -} - -message UnionType { - repeated FieldType types = 1; -} - -message Field { - string name = 1; - oneof type { - PrimitiveFieldType primitiveType = 2; - ArrayType arrayType = 3; - MapType mapType = 4; - StructType structType = 5; - UnionType unionType = 6; - } - bool optional = 7; - // todo: find appropriate type - any defaultValue = 8; -}