
AVRO Schema Peculiarities

Ryan Slominski edited this page May 26, 2021 · 43 revisions

Naming Limitations

AVRO named items (including enum symbols) cannot start with a number (specification, issue ticket). This is why some of the locations in registered-alarms-value.avsc, such as the spectrometers, currently have an S prefix (S1D - S5D). Ops doesn't like this. Workarounds include:

  1. Have ops just "get over it" and accept that names can't start with numbers
    • This is a common limitation - Java enum constants can't start with a number either, but it is easy to add an alias
  2. Not using AVRO enums when enumerating things (use a string and lose in-place enforcement of valid items)
  3. Provide a lookup-map object in the global config topic that provides an operator friendly name (String values in a map don't have naming limits).
  4. Use AVRO Enum aliases field (needs to be tested to check if same limits apply).
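The naming rule itself is easy to check: the AVRO specification constrains names and enum symbols to the regular expression [A-Za-z_][A-Za-z0-9_]*. A minimal stand-alone check (plain Python with the spec's regex, not the avro library) shows why the S prefix is needed:

```python
import re

# Name pattern from the AVRO specification: names and enum symbols
# must start with a letter or underscore, never a digit.
AVRO_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_avro_name(name: str) -> bool:
    """Return True if name is a legal AVRO name / enum symbol."""
    return AVRO_NAME_RE.fullmatch(name) is not None

# "1D" is rejected, which is why the location enum uses an S prefix.
print(is_valid_avro_name("1D"))   # False
print(is_valid_avro_name("S1D"))  # True
```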

Java Specific Logical Types

When you use the AVRO built-in "compiler" to generate Java classes from an AVRO *.avsc file, it creates classes tied to a SCHEMA$ class variable, but that embedded schema may differ from the schema supplied. In particular, Java-specific logical types may be added (in order to provide metadata to other Java ecosystem tools). This causes issues when you're attempting to establish language-agnostic shared schemas (issue ticket). One specific problem is that the schema registry doesn't treat the original and modified schemas as identical, so two schemas are ultimately registered and then schema compatibility comes into play. Example schemas:

Original:

{
     "name": "username",
     "type": "string"
}

Modified:

{
     "name": "username",
     "type": {
        "type": "string",
        "avro.java.string": "String"
      }
}

Workarounds include:

  1. Don't use the built-in compiler - hand-craft Java classes
  2. Modify the generated classes (SCHEMA$ class variables) to remove Java-specific logical types.
  3. Set the "one true schema" to have Java-specific types (even in the Python jaws-libp)
    • The AVRO specification says unknown logical types should fall back to the native base type (needs to be tested)
  4. Let schema compatibility do its thing (needs to be tested)

More information is needed about why the Java compiler has this behavior and what the ramifications of a schema without Java-specific logical types are (how critical is that metadata, and can it be provided in alternative ways?). By default AVRO treats a plain "string" type to mean AVRO's internal Utf8 CharSequence. However, AVRO always uses the Utf8 CharSequence internally (the binary wire format doesn't change), so I think we're simply talking about the Java class accessor methods and whether they provide a plain java.lang.String or the internal Utf8 CharSequence. If true, the classes could simply expose conversion methods (with optional conversion caching, as this whole debacle is about performance - of course Java developers want to simply use java.lang.String).
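One way to sanity-check workaround 2, or to compare schemas on equal footing, is to normalize a schema by stripping the Java-specific attribute before comparing or registering it. The sketch below (plain Python, stdlib only, operating on the example schemas above) recursively removes avro.java.string and collapses the then-redundant {"type": "string"} wrapper:

```python
import json

def strip_java_string(node):
    """Recursively remove the Java-specific "avro.java.string"
    attribute from a schema parsed into Python dicts/lists."""
    if isinstance(node, dict):
        node = {k: strip_java_string(v) for k, v in node.items()
                if k != "avro.java.string"}
        # Collapse {"type": "string"} back to plain "string" once the
        # extra attribute is gone (only "type" remains).
        if set(node) == {"type"} and isinstance(node["type"], str):
            return node["type"]
        return node
    if isinstance(node, list):
        return [strip_java_string(v) for v in node]
    return node

modified = json.loads("""
{
  "name": "username",
  "type": {"type": "string", "avro.java.string": "String"}
}
""")
print(strip_java_string(modified))
# {'name': 'username', 'type': 'string'}
```

Note that AVRO's Parsing Canonical Form performs a similar normalization (it keeps only the attributes that matter to the wire format), so two schemas differing only in avro.java.string should canonicalize identically.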

Schema References

It is useful to define an AVRO entity such as an enum in one place, in one file, and then reference it from many other places. For example, Registered Classes and Registered Alarms each reference Locations, Priorities, and Categories, which are enums. This avoids duplicate specification and is the alternative to nesting identical type definitions. However, schema references are ill-defined and support is incomplete across the AVRO ecosystem. There are overlapping and competing concerns with registries and with entity naming and organization. The Confluent Schema Registry wants your schemas organized at the granularity of a "subject" - either the key or the value of a topic - and the recommended subject naming is topic-{key|value}. This contrasts with the strategy of one type per file, with the file name matching the fully qualified type name.

Schema references are being used experimentally at this time. See: Confluent Python API does not support schema references.
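To illustrate what a schema reference buys: the enum is defined once and other schemas refer to it by fully qualified name only. Where tooling support is missing, a preprocessing step can inline the referenced definition before handing the schema to a client. A minimal sketch (the type names are illustrative, not the exact registered schemas):

```python
# Shared enum, defined once (hypothetical names for illustration).
location_enum = {
    "type": "enum",
    "name": "Location",
    "namespace": "org.jlab",
    "symbols": ["S1D", "S2D", "S3D", "S4D", "S5D"],
}

# A record that references the enum by fully qualified name only.
registered_alarm = {
    "type": "record",
    "name": "RegisteredAlarm",
    "namespace": "org.jlab",
    "fields": [{"name": "location", "type": "org.jlab.Location"}],
}

def inline_refs(schema, named_types):
    """Replace bare name references with the full definition."""
    if isinstance(schema, str) and schema in named_types:
        return named_types[schema]
    if isinstance(schema, dict):
        return {k: inline_refs(v, named_types) for k, v in schema.items()}
    if isinstance(schema, list):
        return [inline_refs(v, named_types) for v in schema]
    return schema

resolved = inline_refs(registered_alarm, {"org.jlab.Location": location_enum})
print(resolved["fields"][0]["type"]["name"])  # Location
```

This naive sketch would redefine the enum if it were referenced twice in the same schema (illegal in AVRO, where a name may be defined only once); proper reference support avoids exactly that duplication.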

Union Serialization

For AVRO unions we avoid placing the union at the root. Instead we use a single field msg, which is a union of records. The msg field appears unnecessary, and as an alternative the entire value could have been a union. However, a nested union is less problematic than a union at the AVRO root (confluent blog). If a union were used at the root then (1) the schema must be pre-registered with the registry instead of being created on-the-fly by clients, and (2) the AVRO serializer must have additional configuration:

auto.register.schemas=false
use.latest.version=true

This may appear especially odd with messages that have no fields. For example, the value is:

{"msg":{}}

instead of:

{}
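The nested-union layout can be sketched as a schema shaped like the following (record and field names here are illustrative, not the actual JAWS schemas; the schema is written as a Python dict for readability):

```python
# Illustrative value schema with the union nested inside a record
# (names are made up; the real registered schemas differ).
value_schema = {
    "type": "record",
    "name": "AlarmValue",
    "fields": [
        {
            "name": "msg",
            # The union lives inside the record rather than at the root,
            # so clients can auto-register the schema as usual.
            "type": [
                {"type": "record", "name": "SimpleAlarming", "fields": []},
                {"type": "record", "name": "NoteAlarming",
                 "fields": [{"name": "note", "type": "string"}]},
            ],
        }
    ],
}

# A message carrying the field-less branch yields the seemingly odd
# but valid value shape shown above:
print({"msg": {}})  # {'msg': {}}
```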

Union Deserialization

In the general case, unions must have their type specified in the record (in some cases the type can be inferred), and this still applies when AVRO records are formatted as JSON. Unfortunately, the Confluent Python API does not support union deserialization completely. We can easily export records using the Confluent kafka-avro-console-consumer (which wraps the Java API), but our Python scripts cannot easily export records in the same AVRO JSON format. The Python API does correctly serialize union types.
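Per the AVRO specification's JSON encoding, a non-null union value is wrapped in a single-key object whose key names the selected branch (the fully qualified name for named types, the type name like "string" for primitives); null is encoded as bare JSON null. A small helper that produces that form (a sketch with a made-up branch name, not part of any Confluent API):

```python
import json

def encode_union_json(branch_name, value):
    """Encode a union value per AVRO's JSON encoding: the null branch
    stays bare, every other branch is wrapped as {"branch.name": value}."""
    if branch_name == "null":
        return None
    return {branch_name: value}

# The branch must be named explicitly so a reader can pick the type.
print(json.dumps(encode_union_json("org.jlab.NoteAlarming", {"note": "hi"})))
# {"org.jlab.NoteAlarming": {"note": "hi"}}
print(json.dumps(encode_union_json("null", None)))  # null
```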
