# AVRO Schema Peculiarities
AVRO named items (including enum symbols) cannot start with a number (specification, issue ticket). This is why some of the locations in the AlarmLocation enum, like the Spectrometers, currently have an S prefix (S1D - S5D). Ops doesn't like this. Workarounds include:
- Have ops just "get over it" and accept that names can't start with numbers
  - This is a common limitation - Java enums can't start with a number either, but in Java it is easy to add an alias
- Not using AVRO enums when enumerating things (use a string and lose in-place enforcement of valid items)
- Provide a lookup-map object in the global config topic that supplies an operator-friendly name (string values in a map don't have naming limits); see the sketch after this list
- Use the AVRO enum aliases field (needs to be tested to check whether the same limits apply)
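
As a sketch of the lookup-map workaround, the global config value could carry a map field whose keys are the enum symbols and whose values are operator-friendly display names (the field name here is hypothetical):

```json
{
  "name": "location_display_names",
  "type": {
    "type": "map",
    "values": "string"
  },
  "doc": "Maps enum symbols to operator-friendly names, e.g. \"S1D\" -> \"1D\""
}
```

Map keys and values are plain strings, so "1D" is legal here even though it is not a legal AVRO name.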
When you use the AVRO built-in "compiler" to generate Java classes from an AVRO *.avsc file, it creates classes tied to a SCHEMA$ class variable, but that embedded schema may differ from the schema supplied. In particular, Java-specific logical types may be added (in order to provide metadata to other Java ecosystem tools). This causes issues when you're attempting to establish language-agnostic shared schemas (issue ticket). One specific problem is that the schema registry doesn't treat the original and modified schemas as identical, so two schemas are ultimately registered and then schema compatibility comes into play. Example schemas:
Original:

```json
{
  "name": "username",
  "type": "string"
}
```

Modified:

```json
{
  "name": "username",
  "type": {
    "type": "string",
    "avro.java.string": "String"
  }
}
```
Workarounds include:
- Don't use the built-in compiler - hand-craft Java classes
- Modify the generated classes (SCHEMA$ class variables) to remove the Java-specific logical types (a normalization sketch follows this list)
- Set the "one true schema" to have the Java-specific types (even in the Python jaws-libp)
- The AVRO specification says unknown logical types should fall back to the native base type (needs to be tested)
- Let schema compatibility do its thing (needs to be tested)
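
As a minimal sketch of the second workaround applied at the schema level (rather than editing generated Java source), the following hypothetical Python snippet strips the avro.java.string property so schemas can be compared or registered in their language-agnostic form; the file name is illustrative:

```python
import json

def strip_java_string(node):
    """Recursively drop the Java-specific avro.java.string property,
    collapsing leftover {"type": "string"} wrappers back to plain "string"."""
    if isinstance(node, dict):
        node.pop("avro.java.string", None)
        for key, value in list(node.items()):
            node[key] = strip_java_string(value)
        if set(node) == {"type"} and node["type"] == "string":
            return "string"
        return node
    if isinstance(node, list):
        return [strip_java_string(item) for item in node]
    return node

with open("AlarmRegistration.avsc") as f:  # hypothetical file name
    normalized = strip_java_string(json.load(f))
print(json.dumps(normalized, indent=2))
```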
More information is needed about why the Java compiler has this behavior and about the ramifications of a schema without the Java-specific logical types (how critical is that metadata, and can it be provided in alternative ways?). By default AVRO treats a plain "string" type to mean AVRO's internal Utf8 CharSequence. However, AVRO always uses the Utf8 CharSequence internally (the binary wire format doesn't change), so I think we're simply talking about the Java class accessor methods and whether they provide a plain java.lang.String or the internal Utf8 CharSequence. If true, the classes could simply have conversion methods exposed (with optional conversion caching, as this whole debacle is about performance - of course Java developers want to simply use java.lang.String).
AVRO does not include schemas inside messages, but they are required to be available at runtime and optionally at build time. AVRO allows dynamic schema discovery and parsing of messages with newly discovered schemas via the GenericRecord interface. Alternatively, the AVRO SpecificRecord interface can be used when schemas are known at build time. We use SpecificRecords because our applications know what to expect at build time. For runtime schema lookup we use the Confluent Schema Registry. For build-time schemas we bundle them inside the language API libraries: jaws-libj (Java) and jaws-libp (Python). This means the schemas are distributed in multiple places, which is a little awkward (should one or more of these be an automated cache?).
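
For reference, runtime lookup against the Confluent Schema Registry can be done with the confluent-kafka Python client; the subject name and URL below are hypothetical:

```python
from confluent_kafka.schema_registry import SchemaRegistryClient

# Connect to the registry (URL is illustrative)
client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Fetch the latest schema registered under a subject
registered = client.get_latest_version("alarm-registrations-value")
print(registered.schema.schema_str)
```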
It is useful to define an AVRO entity such as an enum in one place, in one file, and then reference it from many other places. For example, AlarmClass and AlarmRegistration each reference Locations, Priorities, and Categories, which are enums. This avoids duplicate specification and is the alternative to nesting identical type definitions. However, schema references are ill-defined and support is incomplete across the entire Kafka AVRO ecosystem.
Schema references are being used experimentally at this time. See: Confluent Python API does not support schema references.
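
As a sketch of the referencing pattern, once a named type such as AlarmLocation has been defined, another schema's field can refer to it by full name instead of re-nesting the definition (the namespace here is hypothetical):

```json
{
  "name": "location",
  "type": "org.jlab.jaws.AlarmLocation"
}
```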
AVRO union types allow specification of a field that can be one of many choices. This includes indicating that a field may be null (a union consisting of null and a non-null type). In our case, we have a set of different possible alarm producers tied to the registered alarm producer field as a complex type, and each producer has different fields, so they're modeled as a union. The scenario is similar for overrides - we have multiple different overrides with different fields - and again for active alarms.
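
For instance, a nullable field is a two-branch union; per the AVRO specification, null is listed first so the field can default to null (the field name is hypothetical):

```json
{
  "name": "comments",
  "type": ["null", "string"],
  "default": null
}
```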
For AVRO unions we avoid placing the union at the root. Instead we use a single field msg, which is a union of records. The msg field appears unnecessary, and as an alternative the entire value could have been a union. However, a nested union is less problematic than a union at the AVRO root (confluent blog). If a union were used at the root then (1) the schema must be pre-registered with the registry instead of being created on-the-fly by clients, and (2) the AVRO serializer must have additional configuration:
```properties
auto.register.schemas=false
use.latest.version=true
```
This may appear especially odd with messages that have no fields. For example, the value is:

```json
{"msg":{}}
```

instead of:

```json
{}
```
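
As a sketch (record names are hypothetical, with empty field lists to mirror the no-fields example above), the value schema wraps the union in the single msg field rather than making the union itself the root type:

```json
{
  "type": "record",
  "name": "AlarmOverrideValue",
  "fields": [
    {
      "name": "msg",
      "type": [
        {"type": "record", "name": "DisabledOverride", "fields": []},
        {"type": "record", "name": "ShelvedOverride", "fields": []}
      ]
    }
  ]
}
```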
When attempting to serialize a union, the Python API has two strategies:
- Use a tuple to indicate the type and value
- Just use the value and let the API try to guess which branch of the union you mean

The latter is dangerous and messy and has bitten us already - make sure to use the explicit tuple specification, as in the sketch below.
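
Here is a hedged sketch of the tuple notation using fastavro (which the Confluent Python serializer uses under the hood); the schema and names are illustrative, not the actual jaws schemas:

```python
import io
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "AlarmOverrideValue",
    "fields": [{
        "name": "msg",
        "type": [
            {"type": "record", "name": "DisabledOverride", "fields": []},
            {"type": "record", "name": "ShelvedOverride", "fields": []},
        ],
    }],
})

buf = io.BytesIO()
# The tuple ("DisabledOverride", {...}) pins the union branch explicitly.
# Passing the bare dict {} instead would force fastavro to guess the
# branch - ambiguous by construction here, since both records are empty.
fastavro.schemaless_writer(buf, schema, {"msg": ("DisabledOverride", {})})
```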
Unions are powerful, and coupled with Kafka's key=value message format you can define a composite key such that the type in the value varies based on the key. There are currently no built-in constraints enforcing that keys match union values though. We've encountered scenarios where we weren't careful and had alarm override keys indicating Disabled while the value was for something else, like Shelved - applications must be careful.
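
Until such a constraint exists, one option is a defensive check at produce time; this hypothetical sketch assumes the key carries the override type name and the value uses the tuple notation shown above:

```python
def checked_override_value(key_override_type: str, branch: str, record: dict) -> dict:
    """Build an override value, refusing key/value branch mismatches.

    key_override_type and branch are hypothetical names; the real jaws
    key and value layouts may differ.
    """
    if branch != key_override_type:
        raise ValueError(
            f"key says {key_override_type!r} but value union branch is {branch!r}")
    return {"msg": (branch, record)}

# OK: key and union branch agree
value = checked_override_value("DisabledOverride", "DisabledOverride", {})

# Raises ValueError: key and union branch disagree
# checked_override_value("DisabledOverride", "ShelvedOverride", {})
```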