Merge pull request #18 from xloem/patch-2

Draft for Including Avro Encoding in ANS-104
ArweaveTeam · Oct 1, 2022 · eb8c8bb
2 parents 9f2b1b8 + 4b5c6b0
commit eb8c8bb
Showing 1 changed file with 87 additions and 8 deletions.
diff --git a/ans/ANS-104.md b/ans/ANS-104.md
@@ -64,29 +64,108 @@ A DataItem is a binary encoded object that has similar properties to a transacti
 |anchor    |A value to prevent replay attacks               | Binary            |32 (+ presence byte)|:heavy_check_mark: |
 |number of tags      |Number of tags                         | Binary      |8|:x: |
 |number of tag bytes      |Number of bytes used for tags                         | Binary      |8|:x: |
-|tags      |An array of tag objects                         | Binary      |Variable|:x: |
+|tags      |An Avro array of tag objects                    | Binary      |Variable|:x: |
 |data      |The data contents                               | Binary            |Variable|      :x: |            
 
 All optional fields will have a leading byte which describes whether the field is present (`1` for present, `0` for *not* present). Any other value for this byte makes the DataItem invalid.
 
-A tag object is a binary object representing an object `{ name: string, value: string }`.
+A tag object is an Apache Avro encoded stream representing an object `{ name: string, value: string }`. Because the tags are prefixed with their total byte length, decoders may skip them without parsing.
 
 The `anchor` and `target` fields in DataItem are optional. The `anchor` is an arbitrary value to allow bundling gateways
 to provide protection from replay attacks against them or their users.
 
 ##### 1.3.1 Tag format
 
+Parsing the tags is optional, because the `number of tag bytes` field gives their total byte length and lets a decoder skip straight to the data.
+
+To conform with deployed bundles, the tag format is [Apache Avro](https://avro.apache.org/docs/current/spec.html) with the following schema:
+``` 
+{
+  "type": "array",
+  "items": {
+    "type": "record",
+    "name": "Tag",
+    "fields": [
+      { "name": "name", "type": "bytes" },
+      { "name": "value", "type": "bytes" }
+    ]
+  }  
+}
+```
+
+Usually the name and value fields are UTF-8 encoded strings, in which case `"string"` may be specified as the field type rather than `"bytes"` (the two types share the same wire encoding), and an Avro library will decode them automatically.
+
+To encode field and list sizes, Avro uses a `long` datatype that is first ZigZag encoded and then variable-length integer encoded, both of which are existing encoding specifications. When encoding arrays, Avro provides for a streaming approach that separates the content into blocks.
+
+##### 1.3.1.1 ZigZag coding
+
+[ZigZag](https://code.google.com/apis/protocolbuffers/docs/encoding.html#types) is an integer format where the sign bit is in the 1s place, such that small negative numbers have no high bits set. Surrounding code almost always stores integers in two's-complement form instead, which can be converted as below.
+
+Converting to ZigZag:
+```
+zigzag = twos_complement << 1;
+if (twos_complement < 0) zigzag = ~zigzag;
+```
+
+Converting from ZigZag:
+```
+if (zigzag & 1) zigzag = ~zigzag;
+twos_complement = zigzag >> 1;
+```
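Note: the snippets above leave the integer width implicit. Avro's `long` is 64 bits wide, so a minimal sketch using JavaScript `BigInt` might look like the following. The function names `zigzagEncode` and `zigzagDecode` are hypothetical and not part of the specification.

```javascript
// Sketch only: ZigZag coding for 64-bit longs, using BigInt to avoid
// the 32-bit limit of JavaScript's bitwise operators on Numbers.
// zigzagEncode / zigzagDecode are illustrative names, not spec names.

function zigzagEncode(n) {
  // Interpret the input as a signed 64-bit two's-complement value.
  const v = BigInt.asIntN(64, BigInt(n));
  // (v << 1) ^ (v >> 63): the sign moves to the lowest bit.
  // asUintN keeps the result as an unsigned 64-bit value.
  return BigInt.asUintN(64, (v << 1n) ^ (v >> 63n));
}

function zigzagDecode(z) {
  // Interpret the input as an unsigned 64-bit ZigZag value.
  const u = BigInt.asUintN(64, BigInt(z));
  // Drop the sign bit, then invert all bits if the sign bit was set.
  return (u >> 1n) ^ -(u & 1n);
}
```

With this mapping, 0, -1, 1, -2 encode to 0, 1, 2, 3 respectively, which is why small magnitudes of either sign stay short once they are variable-length encoded.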
+
+##### 1.3.1.2 Variable-length integer coding
+
+[Variable-length integer](https://lucene.apache.org/java/3_5_0/fileformats.html#VInt) coding is a little-endian format that stores 7 bits of the integer per byte; the high bit (`0x80`) of each byte indicates whether another byte (carrying the next 7 more-significant bits) follows in the stream.
+
+Converting to VInt:
+```
+// writes 'zigzag' (treated as an unsigned value) to 'vint' buffer
+offset = 0;
+do {
+  vint_byte = zigzag & 0x7f;
+  zigzag >>= 7;
+  if (zigzag)
+    vint_byte |= 0x80;
+  vint.writeUInt8(vint_byte, offset);
+  offset += 1;
+} while(zigzag);
+```
+
+Converting from VInt:
+```
+// constructs 'zigzag' from 'vint' buffer
+zigzag = 0;
+offset = 0;
+do {
+  vint_byte = vint.readUInt8(offset);
+  zigzag |= (vint_byte & 0x7f) << (offset*7);
+  vint_byte &= 0x80;
+  offset += 1;
+} while(vint_byte);
+```
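Note: as a hedged illustration of the two loops above, the sketch below works on unsigned 64-bit values with `BigInt` and adds bounds checks that the pseudocode omits. The names `vintEncode` and `vintDecode`, and the `{ value, bytesRead }` return shape, are assumptions made for this example.

```javascript
// Sketch only: variable-length integer coding of an unsigned 64-bit value.
// vintEncode / vintDecode are illustrative names, not spec names.

function vintEncode(value) {
  let v = BigInt.asUintN(64, BigInt(value));
  const bytes = [];
  do {
    let b = Number(v & 0x7fn); // low 7 bits go out first (little-endian)
    v >>= 7n;
    if (v > 0n) b |= 0x80;     // continuation bit: more bytes follow
    bytes.push(b);
  } while (v > 0n);
  return Uint8Array.from(bytes);
}

function vintDecode(buf, offset = 0) {
  let value = 0n;
  let shift = 0n;
  let pos = offset;
  for (;;) {
    if (pos >= buf.length) throw new RangeError("truncated VInt");
    if (shift > 63n) throw new RangeError("VInt longer than 10 bytes");
    const b = buf[pos++];
    value |= BigInt(b & 0x7f) << shift;
    if ((b & 0x80) === 0) break; // high bit clear: last byte
    shift += 7n;
  }
  return { value, bytesRead: pos - offset };
}
```

For example, the value 300 encodes to the two bytes `0xac 0x02`, and decoding those bytes yields 300 with two bytes consumed.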
+
+##### 1.3.1.3 Avro tag array format
+
+[Avro arrays](https://avro.apache.org/docs/current/spec.html#array_encoding) may arrive split into more than one block of items. Each block is prefixed by its item count. The count may be negative, in which case its absolute value is the item count and the byte size of the block is inserted between the count and the block content. This is used in schemas of larger data to provide for seeking. The end of the array is indicated by a block with a count of zero.
+
+The complete tags format is a single Avro array, consisting solely of blocks of the below format. The array is terminated by a block with a count of 0, which has no size or content. The size field is only present if the count is negative, in which case the absolute value of the count is the number of items.
+
 |Field     |Description               | Encoding        |Length  | Optional |
 |---       |---                       |---              |---     |---
-|name      |Name of the tag           | Binary          |Variable| :x:      |
-|value     |Value of the tag          | Binary          |Variable| :x:      |
+|count     |Number of items in block  | ZigZag VInt     |Variable| :x:      |
+|size      |Number of bytes if count<0| ZigZag VInt     |Variable| :heavy_check_mark: |
+|block     |Concatenated tag items    | Binary          |size| :x:      |
 
-The number of bytes used for tags is needed to know the start point of the data payload
+##### 1.3.1.4 Avro tag item format
 
-#### 1.4 DataItem field delimiter
+Each item of the Avro array is a pair of Avro strings or bytes objects, a name and a value, each prefixed by its byte length.
 
-The fields on each DataItem will either be fixed-sized or used run-length encoding in order to describe the fields'
-length. This allows the parser to know the bytes relevant to each field
+|Field     |Description               | Encoding        |Length  | Optional |
+|---       |---                       |---              |---     |---
+|name_size |Number of bytes in name   | ZigZag VInt     |Variable| :x:      |
+|name      |Name of the tag           | Binary          |name_size| :x:      |
+|value_size|Number of bytes in value  | ZigZag VInt     |Variable| :x:      |
+|value     |Value of the tag          | Binary          |value_size| :x:      |
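Note: putting the block and item tables together, a decoder for the tags section could be sketched as below. This is an illustration under stated assumptions, not a reference implementation: `readLong` and `decodeTags` are hypothetical names, lengths are assumed to fit in a JavaScript safe integer, and the optional block `size` is read but not used.

```javascript
// Sketch only: decode the Avro tag array described above.
// readLong / decodeTags are illustrative names, not spec names.

// Reads one ZigZag VInt long starting at pos; returns the value and new position.
function readLong(buf, pos) {
  let z = 0n;
  let shift = 0n;
  for (;;) {
    if (pos >= buf.length) throw new RangeError("truncated long");
    const b = buf[pos++];
    z |= BigInt(b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7n;
    if (shift > 63n) throw new RangeError("long too large");
  }
  // ZigZag decode; assumes the result fits in a safe integer.
  return { value: Number((z >> 1n) ^ -(z & 1n)), pos };
}

function decodeTags(buf) {
  const tags = [];
  let pos = 0;
  for (;;) {
    let r = readLong(buf, pos);
    let count = r.value;
    pos = r.pos;
    if (count === 0) break;   // terminating block
    if (count < 0) {
      count = -count;         // absolute value is the item count
      r = readLong(buf, pos); // block byte size, unused in this sketch
      pos = r.pos;
    }
    for (let i = 0; i < count; i++) {
      const fields = [];
      for (let f = 0; f < 2; f++) { // name, then value
        r = readLong(buf, pos);
        const len = r.value;
        pos = r.pos;
        if (len < 0 || pos + len > buf.length) {
          throw new RangeError("bad field length");
        }
        fields.push(buf.subarray(pos, pos + len));
        pos += len;
      }
      tags.push({ name: fields[0], value: fields[1] });
    }
  }
  return tags;
}
```

As a usage example, the bytes `02 02 61 04 62 63 00` hold one block with one tag whose name is `a` and whose value is `bc`, followed by the zero-count terminator. The same tag in a negative-count block would be `01 0a 02 61 04 62 63 00`, where `01` is a count of -1 and `0a` is a block size of 5 bytes.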
 
 ### 2. DataItem signature and id