diff --git a/blog/2024-10-08-cad3/index.md b/blog/2024-10-08-cad3/index.md new file mode 100644 index 0000000..85f764e --- /dev/null +++ b/blog/2024-10-08-cad3/index.md @@ -0,0 +1,41 @@ +--- +slug: cad3-revolution +title: The CAD3 Revolution +authors: [mikera] +tags: [convex, cad3, lattice] +--- + +A quick note on [CAD3](https://docs.convex.world/docs/cad/encoding) because I think it is important for everyone to understand why this matters - it's also probably the last significant piece we NEED to get right before Protonet goes live. + + +What is CAD3? It's the format with which we encode lattice data, e.g. the number `13` becomes the 2-byte sequence `0x110d`. If you've used Convex Desktop, you may recognise these from the message encoding utility in the "Hacker Tools". + +### Why this is critical + +The CAD3 encodings are important to everything we are doing: +- These encodings describe all the data in Convex and other lattice applications: the global state, DLFS drives, lattice structures for merging, transactions, CVM smart contract code etc. +- These sequences of bytes (encodings) are what we put through a SHA3-256 cryptographic hash to build Merkle DAGs and verify the integrity of data +- These are also the raw bytes that get transmitted between peers and binary clients +- These are also the bytes that get stored to disk in Etch +- These are also performance critical - a lot of the performance of Convex depends on how fast we can encode, transmit and store data +- These are also security critical - attackers might attempt to construct malicious encodings to circumvent security or mount a DoS attack + +Hopefully this makes it clear: these encodings are essential to Convex and lattice technology as a whole! They are also very hard to change after we go live: changing encodings would mean everyone needs to re-encode all their data in the new format! That's why we're super focused on getting this right before the Protonet launch. 
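To make the `13` → `0x110d` example concrete, here is a minimal Python sketch (hedged illustration only - `encode_long` is a hypothetical helper, assuming the CAD3 rule that a small integer is encoded as tag byte `0x10 + n` followed by `n` big-endian two's complement bytes, and that the value ID is the SHA3-256 hash of the encoding):

```python
import hashlib

def encode_long(value: int) -> bytes:
    # Tag byte 0x10 + n, followed by n big-endian two's complement bytes,
    # using the minimum number of bytes for the value
    if value == 0:
        return bytes([0x10])
    n = (value.bit_length() + 8) // 8  # minimal byte count, leaving room for the sign bit
    return bytes([0x10 + n]) + value.to_bytes(n, "big", signed=True)

encoding = encode_long(13)
print(encoding.hex())  # 110d - the 2-byte sequence mentioned above

# The value ID is the SHA3-256 hash of the encoding (32 bytes)
value_id = hashlib.sha3_256(encoding).digest()
```

Same bytes in, same hash out: this is what makes the encoding a stable, decentralised identifier.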
+ +### The Good News + +The good news: we are now very close to having a near-perfect encoding format for decentralised data. It has the potential to be a game changer much more broadly, as it solves a lot of the problems existing encoding formats have when used for decentralised data. Some juicy features: +- A **unique canonical encoding** for every value, such that it can be hashed to a stable ID +- An **efficient binary format** for both storage and transmission +- A **self describing** format - no additional schema is required to read an encoding +- Provision of **immutable persistent data structures** for the lattice data values used in Convex +- Automatic generation of a verifiable **Merkle DAG** via references to other value IDs +- Support for **rich data types** used in the CVM and lattice data (Maps, Sets, Vectors, Blobs etc.) +- Data structures of **arbitrary size** may be represented. The lattice is huge. +- Support for **partial data**: we often need to transmit deltas of large data structures, so we need a way to build these deltas and reconstruct the complete structure when they are received (assuming existing data can fill the gaps) +- Ability to encode / decode n bytes of data in O(n) time and space to ensure **DoS resistance** +- Fixed upper bound on the encoding size of any value (excluding referenced children) so that reading and writing can occur in fixed size buffers - this allows **streaming capabilities** including zero-copy operations. + +### What next? + +Full CAD3 specifications are outlined in [CAD003](/docs/cad/encoding). For anyone wanting to work on the CAD3 format or its implementation in Convex, please get involved! 
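To give a flavour of how compact these encodings are, here is a minimal Python sketch assembling the CAD3 encoding of the vector `[101 "Hello" #{}]`, the worked example used in the CAD003 spec (byte values taken from that spec; this is an illustration, not a general-purpose encoder):

```python
# Assemble the CAD3 encoding of [101 "Hello" #{}] from its component parts
parts = [
    b"\x80",                                 # tag for a Vector
    b"\x03",                                 # VLQ count: 3 elements
    b"\x11\x65",                             # embedded Integer 101 (tag 0x11 = 1-byte Long)
    b"\x30\x05" + "Hello".encode("utf-8"),   # String: tag, length 5, UTF-8 bytes
    b"\x83\x00",                             # Set: tag, zero elements (empty)
]
encoding = b"".join(parts)

print(encoding.hex())  # 80031165300548656c6c6f8300
print(len(encoding))   # 13 bytes, shorter than the 17-character printed form
```

Thirteen self-describing bytes for a vector of three typed values - no external schema needed to decode it.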
diff --git a/blog/tags.yml b/blog/tags.yml index ac02c2a..48e7e35 100644 --- a/blog/tags.yml +++ b/blog/tags.yml @@ -22,3 +22,13 @@ lisp: label: Lisp permalink: /lisp description: Convex Reader + +cad3: + label: CAD3 + permalink: /cad3 + description: CAD3 encoding and data formats + +lattice: + label: Lattice + permalink: /lattice + description: Lattice Technology diff --git a/docs/cad/003_encoding/README.md b/docs/cad/003_encoding/README.md index 45b402f..deac315 100644 --- a/docs/cad/003_encoding/README.md +++ b/docs/cad/003_encoding/README.md @@ -2,32 +2,34 @@ ## Overview -Convex uses the standard **CAD3 Encoding** format that represents any valid Convex data value as a **sequence of bytes**. The CAD3 encoding is an important capability for Convex because: +Convex uses the standard **CAD3 Encoding** format that represents data values as a **sequence of bytes**. The CAD3 encoding is an important capability for Convex because: - It allows values to be efficiently **transmitted** over the network between peers and clients - It provides a standard format for **durable data storage** of values - It defines a cryptographic **value ID** to identify any value. This is a decentralised pointer, which also serves as the root of a Merkle DAG that represents the complete encoding of a value. +- CAD3 values are fundamental for enabling **lattice technology** for internet-scale decentralised data structures -The encoding model breaks values into a Merkle DAG of one or more **Cells** that are individually encoded. Cells are immutable, and may therefore be safely shared by different values, or used multiple times in the the same DAG. This technique of "structural sharing" is extremely important for the performance and memory efficiency of Convex. +The encoding model breaks values into a Merkle DAG of one or more **cells** that can be individually encoded. Cells are immutable, and may therefore be safely shared by different DAGs, or used multiple times in the same DAG. 
This technique of "structural sharing" is extremely important for the performance and memory efficiency of Convex. ## Special Requirements Convex and related lattice infrastructure have some very specific requirements for the encoding format which necessitate the design of the encoding scheme described here: -- Every distinct value must have one and only one **unique canonical encoding**, so that it can be hashed to a stable ID -- It must be possible to read encode / decode `n` bytes of data in `O(n)` time and space to ensure **DoS resistance** -- There must be a fixed upper bound on the encoding size of any value (excluding referenced children) so that reading and writing can occur in fixed sized buffers - this allows **streaming capabilities** -- It must be an **efficient binary format** for both storage and transmission -- It must be **self describing** - no additional schema is required to read an encoding -- It must support **persistent data structures** for the immutable lattice data values used in Convex -- We must be able to use the encodings to build a verifiable **Merkle tree** via hashes that reference other values -- It supports the rich **data types** used in teh CVM (Maps, Sets, Vectors, Blobs etc.) -- Any data structure of **arbitrary size** may be represented. The lattice is huge. 
-- Support for **partial data**: we often need to transmit deltas of large data structures, so need a way to build these deltas and reconstruct the complete value when they are received (assuming existing data can fill the gaps) +- A **unique canonical encoding** for every value, such that it can be hashed to a stable **value ID** +- An **efficient binary format** for both storage and transmission +- A **self describing** format - no additional schema is required to read an encoding +- Provision of **immutable persistent data structures** for the lattice data values used in Convex +- Automatic generation of a verifiable **Merkle DAG** via references to other value IDs +- Support for rich **data types** used in the CVM and lattice data (Maps, Sets, Vectors, Blobs etc.) +- Data structures of **arbitrary size** may be represented. The lattice is huge. +- Support for **partial data**: we often need to transmit deltas of large data structures, so we need a way to build these deltas and reconstruct the complete structure when they are received (assuming existing data can fill the gaps) +- Ability to encode / decode `n` bytes of data in `O(n)` time and space to ensure **DoS resistance** +- A **maximum size limit** on encodings for any value (excluding referenced children) so that reading and writing can occur in fixed size buffers - this allows **streaming capabilities** including zero-copy operations. No existing standard was identified that meets these requirements, e.g. 
- XML and JSON are inefficient text based formats, and lack unique representations of the same data - Google's protocol buffers require external schemas (and do not usually guarantee a unique canonical encoding) +- ASN.1 is over-complex, generally relies on schemas and doesn't encode into bounded units suitable for building Merkle trees or streaming data through fixed size buffers ## Examples @@ -38,23 +40,60 @@ The Integer `19` is encoded as: - the hex value 0x13 ``` -0x1113 +0x1113 = 19 ``` -### Vector with external reference +### A 3-element Vector -A Vector (length 2) containing The Integer 19 and another non-embedded value is encoded as: - 0x80 tag for a Vector - 0x02 count of Vector elements - 0x1113 embedded encoding of the Integer 19 - 0x20 tag for a non-embedded external value reference - The value ID = hash of the referenced value's encoding +A Vector containing the Integer 101, the String "Hello" and an empty Set is encoded as: +- `0x80` tag for a Vector +- `0x03` count of Vector elements +- `0x1165` embedded encoding of the Integer 101 +- `0x30` tag for a String +- `0x05` length of String +- `0x48656c6c6f` 5 bytes UTF-8 encoding of "Hello" +- `0x83` tag for a Set +- `0x00` number of elements in the set (i.e. empty) ``` -0x800211132028daa385e6b97d3628e1deecb412c7d4e98135e204d0661c92ba885ff23d2b94 +0x80031165300548656c6c6f8300 = [101 "Hello" #{}] ``` -On its own, the encoding above is a valid encoding for a single cell, but the encoding of the referenced value would need to be obtained in order to examine the second element - which could include petabytes of data. This is an example of a "partial" value. +Note that the encoded representation (13 bytes) is shorter than the printed representation (17 bytes), even though it incorporates rich data types and self-describing structural information! 
This is a key design goal in general for CAD3 + +### A 4gb Blob + +A Blob of length 4gb, specified with: +- `0x31` tag for a Blob +- `0x9080808000` VLQ encoded length of 2^32 +- `0x20`+child value ID (repeated 16 times, each child is a 256mb Blob) + +``` +0x31908080800020af61c2faf10511466f73fe890524dccc056bddc79df37c7fbb + 1823d5c8dae191202144a7641028ccd2259792d4c9626feb7f3cfb631eb7473d + 3b95f1312fc4bf05202426c963ce5e0032fff92a028deec91a7466dd6d970cf4 + 78e510033854f6499120b2165e855dddd0daf62ba138ba0c1553a347b7f9f635 + a589a2f9ab50be67c65120c7e6b0c74f27af4771ef06304fb02988bcd3bbe8f7 + c2af84d3262d36f9ab75a620e8000a3edfa7bd1321c5d40e36a52c3c93d2be03 + d976df15fd2323796c43435f201de13753be217f7fe3b89effaf7f2f5326bff4 + 94b50c1d86d96eeeb537bfcd5e205b2e93772a254a5196662707c68e851d16e3 + a9386df7b40183daf82389d761032095ed2e62b005d363d33ccd4794ecc9f9f3 + bae35979151ee1e340555e6d265a0820cf0902e3f9ca79469ed03e25085ad14b + dea6a03fe41299ce538837e1e3666e3a20d614113ed517586ec7fe3576a9ce90 + 66f4795efbe85315fa0f6872085a408d4120ff5d93db343c185b47484aef9bd8 + e1c5d171e87762960b659344b0aeda6ba0ba20fb047cfd86c9b81883b7920a44 + f5f8909f6360a5e2f2d2d4ee4639d554ab7801202ad7a5b7bafc6d323f3d6ec1 + 4288775095775eb7d72f63cebffae6a0438ccb1120afc2b4cb2ed7c26d7026b1 + 74a22979bf4cf09468d5a31d33dca1aad04df0b1cc207881f54f571cd0416e5a + f36bc6f133660bf8a60b4ded525332f9a314bea4ddea +``` + +On its own, the encoding above is a valid encoding for a single cell, but the encoding of the referenced values would need to be obtained in order to examine the tree of child Blobs - which is four whole gigabytes of data. This is an example of a "partial" value. + +This example illustrates some key benefits of the CAD3 design: +- It is possible to examine and validate the top levels of CAD3 data structures independently, and only retrieve / decode sub-trees where necessary. +- There is very little overhead for large structures. 
On 4gb of random data, the overhead is ~1% (mainly from ~1.1 million cryptographic hashes, which are in any case necessary for validation as a Merkle tree) +- The entire tree is just 6 levels deep - so navigating down to any single byte only requires 5 lookups via value IDs. ## Basic Rules and Concepts @@ -78,11 +117,11 @@ A CAD3 encoding MUST be a sequence of bytes. Any given cell MUST map to one and only one encoding. -Any two distinct (non-identical) cells MUST map to different encoding +Any two distinct (non-identical) cells MUST map to different encodings -It MUST be possible to reconstruct the cell from its own encoding, to the extent that the cell represents the same Value (it is possible for implementations to use different internal formats if desired, providing these do not affect the CVM value semantics) +It MUST be possible to reconstruct the cell from its own encoding, to the extent that the cell represents the same value (it is possible for implementations to use different internal formats if desired, providing these do not affect value equality semantics) -The Encoding MUST have a maximum length of 8191 bytes. This ensure that a cell encoding will always fit within a reasonable fixed size buffer, and guarantees that most operations on can achieve `O(1)` complexity. +The encoding MUST have a maximum length of 16383 bytes. This ensures that a cell encoding will always fit within a reasonable fixed size buffer, and guarantees that most operations can achieve `O(1)` complexity. ### Value ID @@ -90,16 +129,16 @@ The value ID of a cell is the SHA3-256 hash of the encoding of the cell. All cells have a unique encoding, therefore they also have a unique value ID (subject to the assumption that the probability of SHA3-256 collisions is extremely low). -A value ID reference may be considered as a "decentralised pointer" to an immutable value. 
+A value ID reference may be utilised as a "decentralised pointer" to an immutable value, or used as an index for content addressable storage. -Note: since only tree roots and branches are likely to be stored in storage systems, care should be taken with value IDs that point to intermediate non-branch cells, as these may not be persisted in storage. If in doubt, navigate down from a known root or branch cell value ID. +Note: since only tree roots and branches are likely to be stored in storage systems, care should be taken with value IDs that point to intermediate non-branch cells, as these may not be persisted in storage. If in doubt, navigate down from a known root or branch value ID. ### References A cell encoding MAY contain references ("Refs") to other cells. There are two types of reference: -- Embedded, where the embedded cell's encoding is included within the parent cell encoding -- Branch, where an external reference is encoded as a byte sequence that includes the Value ID of the referenced cell (i.e. the branch) +- **Embedded**, where the embedded cell's encoding is included within the parent cell encoding +- **Branch**, where an external reference is encoded as a byte sequence that includes the Value ID of the referenced cell (i.e. the branch) From a functional perspective, the difference between an embedded cell and a branch cell is negligible, with the important exception that following a branch reference will require accessing a separate encoding (typically cached in memory, but if necessary loaded from storage). @@ -116,7 +155,7 @@ A cell may be defined as embedded in which case the cell's encoding is inserted If a cell is embedded, it MUST NOT be included in the encoding of another cell by external reference. This restriction is required to guarantee uniqueness of encoding (if not enforced, child cells might be encoded as either an embedded reference or by external reference, thus giving two or more different encodings for the parent cell). 
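The embed-or-reference decision can be sketched as follows (a simplified illustration assuming embeddedness is decided purely by the 140-byte embedded-encoding limit given in this CAD; `child_ref` is a hypothetical helper, not part of any reference implementation):

```python
import hashlib

MAX_EMBEDDED = 140  # embedded cells must encode to 140 bytes or less

def child_ref(child_encoding: bytes) -> bytes:
    # Bytes used for a child value inside a parent encoding: embedded cells
    # are included inline, branch cells become a 33-byte Ref
    # (0x20 tag + 32-byte SHA3-256 value ID of the child's encoding)
    if len(child_encoding) <= MAX_EMBEDDED:
        return child_encoding  # embedded: the child's encoding is inlined
    return b"\x20" + hashlib.sha3_256(child_encoding).digest()  # branch Ref

print(len(child_ref(b"\x11\x0d")))  # 2  - a small Integer stays embedded
print(len(child_ref(bytes(4096))))  # 33 - a large child is replaced by a Ref
```

Because the rule is deterministic, each parent cell has exactly one possible encoding, preserving uniqueness.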
-An embedded cell MUST have an encoding of 140 bytes or less. This restriction helps to ensure that cell encodings which may contain many child embedded references cannot exceed the overall 8191 byte limit. +An embedded cell MUST have an encoding of 140 bytes or less. This restriction helps to ensure that cell encodings which may contain many child embedded references cannot exceed the overall 16383 byte limit. #### External References @@ -155,13 +194,33 @@ Implementations MUST recognise an invalid encoding, and in particular: Implementations MUST be able to produce the unique valid encoding for any cell. -### Implementation interpretation +Note: Random byte strings are almost always invalid. This is a good thing, because it allows us to quickly reject corrupt or malicious data. This property is due to the multiple constraints on validity: +- Quite a few tags are illegal / reserved +- Within each value, there are significant constraints on validity (correct VLQ counts, embedded children must be 140 bytes or less etc.) +- Even for an otherwise valid encoding, the length of the encoding must be exactly correct + +### Extensibility -Implementations using CAD3 encodings MAY assign semantic meaning to values on an application-specific basis. +CAD3 is designed for applications to use, and is therefore **extensible**. + +This is an important aspect of the design: Just like HTTP does not enforce a specific meaning on content included in a PUT request, CAD3 does not specify how an application might choose to interpret specific CAD3 encoded values. + +Applications using CAD3 encodings MAY assign semantic meaning to values on an application-specific basis. + +Applications SHOULD honour the logical meaning of defined CAD3 types, e.g.: +- A CAD3 Integer should represent an integer value in the application (in some cases it might be repurposed e.g. as a code value) +- A CAD3 String should be treated as a UTF-8 string if the application supports strings / text data. 
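As a concrete sketch of wrapping foreign binary data in a CAD3 Blob: the small-Blob rule (tag `0x31`, VLQ-encoded length, then the raw bytes) combined with a VLQ Count encoder. Helper names here are illustrative, not from any reference implementation:

```python
def encode_vlq_count(n: int) -> bytes:
    # VLQ Count: base-128 groups, high bit set on every byte except the last,
    # shortest possible encoding for the value
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))

def encode_small_blob(data: bytes) -> bytes:
    # Blob of 4096 bytes or less: tag 0x31, VLQ length, then the raw bytes
    assert len(data) <= 4096  # larger Blobs become a tree of child Blobs
    return b"\x31" + encode_vlq_count(len(data)) + data

payload = bytes.fromhex("89504e470d0a1a0a")  # e.g. a PNG file signature
print(encode_small_blob(payload).hex())      # 310889504e470d0a1a0a
```

As a sanity check, `encode_vlq_count(2**32)` reproduces the `0x9080808000` length bytes seen in the 4gb Blob example earlier.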
+ +Applications SHOULD use the Blob type (`0x31`) for data which has a different binary encoding. This allows applications to encode arbitrary data in CAD3 structures with custom encodings or using other standards. + +Implementations MUST preserve CAD3 encoded values, even if they do not recognise the meaning. This ensures that implementations are compatible with all applications and systems can relay application-specific data even if they do not understand it. In practice this means: -- Applications can define what a particular cell value means in context, e.g. the vector `[1 17 :owns]` might represent ownership in a graph where entity `1` owns entity `17` -- Independent of semantic meaning, applications can read and encode arbitrary CAD3 data. +- Applications can define what a particular cell value means in context, e.g. the vector `[1 17 :owns]` might represent an edge in a graph where entity `1` "owns" entity `17` +- Certain categories of values (`0xAn`, `0xCn`, `0xDn` and `0xEn`) are explicitly intended for application usage +- Independent of semantic meaning, applications can encode and decode arbitrary CAD3 data using a suitable implementation library. + +Application developers SHOULD document their extensions so that other developers are able to interpret their extension values if required. ## Encoding Format @@ -177,34 +236,56 @@ Tags are designed with the following objectives: - Plenty of tag bytes are still available for future extension - The hex values convey at least some meaning to experienced human readers. `nil` is `00`. Numbers start with `1`. Booleans start with `b`. `ff` is fatal failure etc. -### VLC Integers - -Integers are normally encoded using a Variable Length Coding (VLC) format. This ensure that small integers have a 1-byte Encoding, and most 64-bit values encoded will have an encoded length shorter than 8 bytes, based on the expected distributions of 64-bit integers encountered in the system. 
- -Encoding rules are: -- The high bit of each byte is `1` if there are following bytes, `0` for the last bytes. -- The remaining bits from each byte are considered as a standard big-endian two's complement binary encoding. -- The highest two's complement bit (i.e. the 2nd highest bit of the first byte) is considered as a sign bit. -- The Encoding is defined to be the shortest possible such encoding for any given integer. - -It should be noted that this system can technically support arbitrary sized integers, but in most contexts in Convex it is used for up to 64-bit integer values. - -### VLC Counts - -VLC Counts are unsigned integer values, typically used where negative numbers are not meaningful, e.g. the size or length of data structures, or for balances that are defined to be non-negative natural numbers. +### Categories + +The high hex digit of each tag byte specifies the general category of the data value. These are defined as follows: + +| Pattern | Category | Purpose | + | -------- | -------------------- | -------- | +| 0x0x | Basic constants | Special values like `nil` | +| 0x1x | Numerics | Integers, Doubles | +| 0x2x | References | Addresses, References to branch values | +| 0x3x | Strings and Blobs | Raw Blob data, UTF-8 Strings etc. | +| 0x4x | Reserved | Reserved for future use, possible N-dimensional arrays | +| 0x5x | Reserved | Reserved for future use | +| 0x6x | Reserved | Reserved for future use | +| 0x7x | Reserved | Reserved for future use | +| 0x8x | Data Structures | Containers for other values: Maps, Vectors, Lists, Sets etc. | +| 0x9x | Cryptography | Digital Signatures etc. 
| | 0xAx | Sparse Records | For application usage, records that frequently omit fields | +| 0xBx | Byte Flags | One-byte flag values (0xB0 and 0xB1 defined as CVM booleans) | +| 0xCx | Coded Values | For application usage, values tagged with a code value | +| 0xDx | Data Records | For application usage, records that have densely packed fields | +| 0xEx | Extension Values | For application usage | +| 0xFx | Special Values | Mostly reserved, 0xFF is illegal | + +The categories have been designed with the following purposes in mind: +- Include all key fundamental types for decentralised lattice data structures +- Allow rapid recognition of general types based on the first hex digit +- Extensibility: several categories are designed for flexible application usage +- Some degree of human interpretability (from looking at hex values) +- Reasonable space for future extension + +### VLQ Counts + +VLQ Counts are unsigned integer values expressed in a base-128 encoding using a Variable Length Quantity (VLQ) format. They are useful where values are usually small and negative numbers are not meaningful, e.g. the size or length of data structures. Encoding rules are: - The high bit of each byte is `1` if there are following bytes, `0` for the last byte. -- The remaining bits from each byte are considered as a standard unsigned big-endian two's complement binary encoding. +- The remaining 7 bits from each byte are considered as a standard unsigned big-endian binary encoding. - The encoding is defined to be the shortest possible such encoding for any given integer. -In data structures, a VLC Count is always used to specify the number of elements in the data structure. +In data structures, a VLQ Count is frequently used to specify the number of elements in the data structure. -Note: VLC Counts are the same as VLC Integers, except that they are unsigned. 
Having this distinction is justified by frequent savings of one byte, especially when used as counts within small data structures. +It should be noted that this system can technically support arbitrary sized integers, however for use in CAD3 encoding it is limited to 63-bit integer values. It seems unlikely that anyone will ever actually need to construct or encode a single data structure with this many elements, so this may be considered sufficiently future-proof. + +Note: In the Convex reference implementation a signed variant (VLC Long) is also available. This is not currently used in CAD3. ### `0x00` Nil -The single byte `0x00` is the encoding for `nil` value. +The single byte `0x00` is the encoding for the `nil` value. + +`nil` is conventionally used to indicate the absence of a value in applications. ### `0x10` - `0x18` Integer (Long) @@ -216,38 +297,40 @@ A Long value is encoded by the Tag byte followed by `n` bytes representing the s Note: The value zero is conveniently encoded in this scheme as the single byte `0x10` -Note: This encoding is chosen in preference to a VLC encoding because: -- The length of a small integer can be included in the tag, making it more efficient than VLC which requires continuation bits +Note: This encoding is chosen in preference to a VLQ encoding because: +- The length of a small integer can be included in the tag, making it more efficient than VLQ which requires continuation bits - It is consistent with the natural encoding for two's complement integers on most systems - The numerical part is consistent with the format for BigInts ### `0x19` Integer (BigInt) ``` -0x19 +0x19 ``` -A "Big" Integer is represented by the tag byte `0x19` followed by the VLC encoded length of the Integer in bytes. The Integer MUST be represented in the minimum possible number of bytes - excess leading sign bytes are an invalid encoding. +A "Big" Integer is represented by the tag byte `0x19` followed by the VLQ encoded length of the Integer in bytes. 
-The length MUST be at least `9` (otherwise the integer MUST be encoded as a Long). +The Integer MUST be represented in the minimum possible number of bytes - excess leading sign bytes are an invalid encoding. This is necessary to ensure a unique encoding for every Integer. -With the exception of the tag byte, The encoding of a BigInt is defined to be exactly equal to a Blob with `n` bytes. +The length MUST be at least `9` (otherwise the integer MUST be encoded as the Long version of Integer). -### `0x1d` Double +With the exception of the tag byte, the encoding of a BigInt is exactly the same as a Blob with `n` bytes. + +### `0x1D` Double A Double is an IEEE754 double precision floating point value. ``` -0x1d <8 bytes IEEE 764> +0x1D <8 bytes IEEE 754> ``` A Double value is encoded as the Tag byte followed by 8 bytes standard representation of an IEEE 754 double-precision floating point value. -All IEEE754 values are supported except that the `NaN` value must be represented with the specific encoding `0x1d7ff8000000000000` in the CVM. This is to ensure a unique encoding of `NaN` values which are otherwise logically equivalent. +All IEEE754 values are supported except that the `NaN` value MUST be represented with the specific encoding `0x1d7ff8000000000000` when used within the CVM. This is to ensure a unique encoding of `NaN` values which are otherwise logically equivalent. ### `0x20` Ref -A Ref is a special encoding that points to another encoding using its value ID (cryptographic hash). +A Ref is a special encoding that points to a branch cell encoding using its value ID (cryptographic hash). ``` 0x20 <32 bytes Value ID> ``` An external reference is encoded as the tag byte followed by the 32-byte value ID (which is in turn defined as the SHA3-256 hash of the encoding of the referenced value). 
They are not themselves cell values, rather they represent a reference to another cell. -An implementation MUST NOT allow a Ref as a a valid encoding in its own right: it must be included in another encoding. +An implementation MUST NOT admit a Ref as a valid cell encoding in its own right: it can only be included to represent a child value in another encoding. -Ref encodings are used as substitutes for child values contained within other cell encodings subject to the following rules: +Ref encodings are used for child values contained within other cell encodings subject to the following rules: - They MUST be used whenever the child cannot be embedded (i.e. is a branch cell). - They MUST NOT be used when the child cell is embedded. -These rules are necessary to ensure uniqueness of the parent encoding (otherwise, there would be two versions, one with an embedded child and the other with a external ref ). +These rules are necessary to ensure uniqueness of the parent encoding (otherwise, there would be two or more encodings for many values, e.g. one with an embedded child and the other with an external branch ref). ### `0x21` Address Addresses are used to reference sequentially allocated accounts in Convex. ``` -0x21 +0x21 ``` -An Address value is encoded by the tag byte followed by a VLC Encoding of the 64-bit value of the Address. +An Address is encoded by the tag byte followed by a VLQ Encoding of the 64-bit value of the Address. The address number MUST be positive, i.e. a 63-bit positive integer. @@ -286,29 +369,30 @@ A String is a sequence of bytes with UTF-8 string encoding assumed. ``` If String is 4096 bytes or less: -0x30 +0x30 If String is more than 4096 Bytes: -0x30 (repeated 2-16 times) +0x30 (repeated 2-16 times) ``` -Every String encoding starts with the tag byte and a VLC-encoded length. +Every String encoding starts with the tag byte and a VLQ-encoded length. -Encoding then splits depending on the String length `n`. 
+Encoding then depends on the String length `n`. - If 4096 characters or less, the UTF-8 bytes of the String are encoded directly (`n` bytes total) -- If more than 4096 bytes, the String is broken up into a tree of child Strings, where each child except the last is the maximum sized child possible for a child string (1024, 16384, 262144 etc.), and the last child contains all remaining characters. Up to 16 children are allowed before the tree must grow to the next level. +- If more than 4096 bytes, the String is broken up into a tree of child Blobs, where each child except the last is the maximum sized child possible for a child string (4096, 65536, 1048576 etc.), and the last child contains all remaining characters. Up to 16 children are allowed before the tree must grow to the next level. Because child strings are likely to be non-embedded (because of encoding size) they will usually be replaced with Refs (33 bytes length). Thus a typical large String will have a top level cell encoding of a few hundred bytes, allowing for a few child Refs and a (perhaps embedded) final child. Importantly, this design allows: - Arbitrary length Strings to be encoded, while still keeping each cell encoding smaller than the fixed maximum size - Structural sharing of tree nodes, giving O(log n) update with path copying -- Low overhead because of the high branching factor: not many branch nodes are required and each leaf note will compactly store up to 4096 characters. +- Low overhead because of the high branching factor: not many branch nodes are required and each leaf node will compactly store up to 4096 characters +- Most of the implementation can be shared with Blobs -Note: UTF-8 encoding is assumed, but not enforced in encoding rules. Implementations MAY decide to allow invalid UTF-8. +Note: UTF-8 encoding is assumed, but not enforced in CAD3 encoding rules. Applications SHOULD determine their own rules for handling invalid UTF-8. 
-Note: with the exception of the tag byte, String encoding is exactly the same as a Blob +Note: with the exception of the tag byte, String encoding is exactly the same as a Blob. This includes the fact that the children of Strings are in fact Blobs. This is useful because it facilitates structural sharing between large Strings and Blobs. ### `0x31` Blob A Blob is an arbitrary length sequence of bytes (Binary Large OBject). ``` If Blob is 4096 bytes or less: -0x31 +0x31 If Blob is more than 4096 bytes: -0x31 (repeated 2-16 times) +0x31 (repeated 2-16 times) ``` -Every Blob encoding starts with the tag byte and a VLC-encoded length. +Every Blob encoding starts with the tag byte and a VLQ-encoded length. Encoding then varies depending on the Blob length `n`. -- If 4096 bytes or less, the bytes of the Blob are encoded directly (`n` bytes following the VLC Count) +- If 4096 bytes or less, the bytes of the Blob are encoded directly (`n` bytes following the VLQ Count) - If more than 4096 bytes, the Blob is broken up into a tree of child Blobs, where each child except the last is the maximum sized child possible for a child Blob (4096, 65536, 1048576 etc.), and the last child contains all remaining bytes of data. Up to 16 children are allowed before the tree must grow to the next level. Applications MAY include whatever data or encoding they wish within Blobs. Applications SHOULD use Blobs for binary data where the data is not otherwise meaningfully represented as a CAD3 type. Examples might include PNG format image data, a binary database file, or text in an encoding other than UTF-8. -Applications SHOULD use Blobs for data which is normally represented as a string of bytes, e.g. cryptographic hashes or signatures. +Applications SHOULD use Blobs for data which is naturally represented as a string of bytes, e.g. cryptographic hashes or signatures. 
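The tree-splitting rule for large Blobs can be sketched numerically (a simplified model, under the assumption that allowed child sizes are the 4096-byte leaf size multiplied by successive powers of 16; `chunk_sizes` is a hypothetical helper, not part of any reference implementation):

```python
LEAF = 4096  # maximum bytes stored directly in a single Blob leaf

def chunk_sizes(n: int) -> list:
    # Top-level child sizes for a Blob of n bytes: every child except the
    # last is the largest allowed child size (4096, 65536, 1048576, ...),
    # the last child carries the remainder; at most 16 children per level
    if n <= LEAF:
        return [n]
    child = LEAF
    while child * 16 < n:
        child *= 16  # grow the child size until 16 children are enough
    full, rest = divmod(n, child)
    return [child] * full + ([rest] if rest else [])

print(chunk_sizes(10000))       # [4096, 4096, 1808]
print(len(chunk_sizes(2**32)))  # 16
```

For `2**32` bytes this yields 16 children of 256mb each, matching the 4gb Blob example given earlier.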
Because child Blobs are likely to be non-embedded (because of encoding size) they will usually be replaced with Refs (33 bytes length). Thus a typical large Blob will have a top level cell encoding of a few hundred bytes, allowing for a few child Refs and a final child for the remaining bytes (which may be embedded). @@ -401,11 +485,11 @@ A Character encoding is invalid if: ``` If a leaf cell: -0x80 (repeated 0-16 times) +0x80 (repeated 0-16 times) If a non-leaf cell: -0x80 (repeated 2-16 times) +0x80 (repeated 2-16 times) ``` A leaf cell is a Vector with Count `n` being 0, 16, or any other positive integer which is not an exact multiple of 16. @@ -414,7 +498,7 @@ A Vector is defined as "packed" if its count is a positive multiple of 16. A lea A Vector is defined as "fully packed" if its Count is `16 ^ level`, where `level` is any positive integer. Intuitively, this represents a Vector which has the maximum number of elements before a new level in the tree must be added. -All Vector encodings start with the tag byte and a VLC Count of elements in the Vector. +All Vector encodings start with the tag byte and a VLQ Count of elements in the Vector. Subsequently: - For leaf cells, a packed prefix vector is encoded (which may be `nil`) that contains all elements up to the highest multiple of 16 less than the Count, followed by the Values @@ -437,21 +521,29 @@ A List is encoded the same as a Vector, except: A Map is a hash map from keys to values. ``` -If a leaf cell: +If a map leaf cell: -0x80 (repeated n times, in order of key hashes) +0x80 ... (key + value repeated n times, in order of key hashes) -If a non-leaf cell: +If a map tree cell: -0x80 (repeated 2-16 times) +0x80 (repeated 2-16 times) Where: -- specifies the hex position where the map branches (0 = at the fist hex digit, etc.) -- is a 16-bit bitmask indicating key hash hex valeus are included (low bit = `0` ... 
high bit = `F`) -- are Refs to Map cells which can be Leaf or non-Leaf nodes +- specifies the hex position where the map branches (0 = at the first hex digit, ... 63 = at the last digit) +- is a 16-bit bit mask indicating key hash hex values are included (low bit = `0` ... high bit = `F`) +- are Refs to Map cells which can be Leaf or non-Leaf nodes ``` +If the count `n` is 15 or less, the Map MUST be encoded as a map leaf cell, otherwise it MUST be encoded as a map tree cell. This is to ensure unique encoding. The number 15 is chosen for optimal binary search and so that all types of Map cells have 0-16 child value refs. + All entries MUST be encoded in the order of key hashes. +- In a map leaf cell, this means that the `` pairs are sorted by key hash +- In a map tree, it means that the child maps are ordered according to the hex digit at the shift position + +All entries within a map tree cell (directly or indirectly) MUST have identical key hashes up to the position of the shift byte. Since hashes are 32 bytes, this means that the maximum possible shift byte value is 63 (though this is very unlikely to occur in practice: it would imply someone found at least 9 SHA3-256 hashes differing only by the last 4 bits!) + +A map tree cell MUST have at least two children (if not, it should not exist since the branch must occur at a later hex digit). Again, this is necessary to ensure uniqueness of encoding. A Map MAY contain arbitrary keys and values. @@ -461,18 +553,18 @@ A Set is a logical set of included values.
A Set is encoded exactly the same as a Map, except: - The tag byte is `0x83` -- The Value Refs are omitted +- The Value elements in the entries are omitted ### `0x84` Index ``` -0x84 (repeated 1-16 times) +0x84 (repeated 1-16 times) Where: is either: -- 0x00 (if no entry present at this position in Index) -- 0x20 (if entry present) +- 0x00 (if no entry present at this position in Index) +- 0x20 (if entry present) is an unsigned byte indicating the hex digit at which the entry / branch occurs. If an entry is present, depth must match the hex length of the entry key @@ -486,32 +578,34 @@ Special cases: - No entry and at least 2 children which differ in the hex digit at this depth ``` -An Index serves as a specialised map with BlobLike keys (Blobs, Strings, Addresses etc.). Logically, it is a mapping from byte arrays to values. +An Index serves as a specialised map with ordered keys. Logically, it is a mapping from byte arrays to values. + +Key values MUST be Blobs, Strings, Addresses, Keywords or Symbols. These are regarded as "BlobLike" because they can be considered as a sequence of bytes like a Blob. This encoding ensures that entries are encoded in lexicographic ordering. Unlike the hash based Maps, an Index is constrained to use only BlobLike keys, and cannot store two keys which have the same Blob representation (though the keys will retain their original type). ### `0x88` Syntax -A Syntax object is a value annotated with a metadata map. +A Syntax Object is a value annotated with a Map of metadata. ``` -0x88 +0x88 -Where is either: +Where is a value which is either: - 0x00 (nil) if there is no metadata (considered as empty map) -- A Ref to a non-empty Map containing the metadata +- A non-empty Map containing the metadata -The can be any value. +The can be any value. ``` -Logically, a `Syntax` value is a wrapped value with a metadata map. The metadata can be any Map of keys to values.
+The metadata MUST be a Map of keys to values (`nil` is used as the empty Map, for efficiency) ### `0x90` Signed Represents a digitally signed data value. ``` -`0x90` +`0x90` Where: - Public Key is 32 bytes Ed25519 public key @@ -528,47 +622,47 @@ This is the same as `0x90` signed but excluding the public key ### `0xA0` - `0xAF` Sparse Records -A Sparse Record is an implementation-defined structure containing 0-63 fields. The fields are stored sparsely, with `nil` values omitted. +A Sparse Record is a structure containing 0-63 fields. The fields are stored sparsely, with `nil` values omitted. ``` -`0xAn` (repeated for each set bit in inclusion mask) +`0xAn` (repeated for each set bit in inclusion mask) Where: - `n` is an implementation-defined hex value (0-15) which MAY be used to disambiguate distinct record types. -- The inclusion mask is an unsigned integer (63 bits max). +- The inclusion mask is an unsigned integer (63 bits max) represented with a VLQ Count. ``` The inclusion mask is a non-negative value indicating which fields are included in the Record as a bit mask. -The number of Value Refs MUST be equal to the number of `1` bits in the inclusion count, with the first Value Ref corresponding to the least significant `1` bit etc. +The number of Value Refs MUST be equal to the number of `1` bits in the inclusion mask, with the first Value Ref corresponding to the least significant `1` bit etc. For maximal encoding efficiency, it is recommended that the most commonly included fields are defined in the first 7 positions, which maximises the chance that the inclusion mask will only require one byte. -Value Refs, if included, MUST NOT be `nil`. This is necessary to ensure unique encoding, since excluded fields are defined as `nil`. +Value Refs, if included, MUST NOT be `nil`. This is necessary to ensure unique encoding, since excluded fields are already defined as `nil`.
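The inclusion-mask rule above can be sketched as follows (the helper name and field numbering are illustrative only, not defined by CAD3): the encoded values are exactly the non-`nil` fields, ordered from the least significant `1` bit upward.

```python
def sparse_fields(fields: list) -> tuple:
    """Compute (inclusion_mask, encoded_values) for a sparse record.

    Fields with value None (standing in for CAD3 `nil`) are omitted;
    bit i of the mask is set exactly when field i is included.
    Illustrative sketch only.
    """
    mask = 0
    values = []
    for i, v in enumerate(fields):
        if v is not None:
            mask |= 1 << i
            values.append(v)
    return mask, values
```

For example, a record with fields `["a", nil, "c"]` yields mask `0b101` and encodes just the two values `"a"` and `"c"`; the popcount of the mask always equals the number of encoded Value Refs.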
-Implementations which require more than 64 fields MAY adopt their own scheme to further embed additional structures within the 63 fields available. Reasonable options include: +Implementations which require more than 63 fields MAY adopt their own scheme to further embed additional structures within the 63 fields available. Reasonable options include: - Group subsets of similar fields into child Records. This is especially useful if common groups of fields are frequently used in multiple places and logically grouped together. - Have the Record specify one field which contains a vector of additional fields - Use the first field (index 0) to specify the interpretation of following fields (which may contain arbitrary values as sub-structures) -### `0xb0` - `0xb1` Boolean +### `0xB0` - `0xB1` Byte Flags (Boolean) -The possible Boolean values are `true` and `false` +The possible Boolean values are `true` and `false`, which are coded as 1-byte Byte Flags. ``` Encoded as: -0xb0 <=> false -0xb1 <=> true +0xB0 <=> false +0xB1 <=> true ``` The two Boolean Values `true` or `false` have the Encodings `0xb1` and `0xb0` respectively. Note: These Tags are chosen to aid human readability, such that the first hexadecimal digit `b` suggests "binary" or "boolean", and the second hexadecimal digit represents the bit value. -### `0xb2`-`0xbF` Byte Flags +### `0xB2`-`0xBF` Byte Flags (Extensible) Byte flags are one byte encodings (similar to Booleans) available for application specific use. ``` -`0xbn` +`0xBn` Where - n = a hex value from 2-15 @@ -578,43 +672,67 @@ Applications MAY use byte flags as a efficient single byte value, i.e. the compl Values `0xb0` and `0xb1` are already reserved for the two boolean values, though an application MAY repurpose these as single byte values (along with `0x00` and `0x10`) providing these values are not needed for some other purpose. 
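For instance (a purely hypothetical application mapping — CAD3 itself assigns no meaning to `0xB2`-`0xBF`), a protocol might use three byte flags as one-byte status codes:

```python
# Hypothetical status codes drawn from the application-specific
# byte flag range 0xB2-0xBF; 0xB0/0xB1 remain false/true.
STATUS_PENDING = 0xB2
STATUS_ACCEPTED = 0xB3
STATUS_REJECTED = 0xB4

def encode_status(status: int) -> bytes:
    """A byte flag encodes as its single tag byte -- no payload follows."""
    assert 0xB2 <= status <= 0xBF, "outside application byte flag range"
    return bytes([status])
```

This gives a one-byte canonical encoding per status value, which is the smallest possible CAD3 cell.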
-### `0xc0`-`0xcf` Codes +Fun Idea: A 1-byte Lisp where `0x10` is an opening paren, `0x00` is a closing paren and `0xb0 - 0xbf` are the allowable tokens. + +### `0xC0`-`0xCF` Codes + +Codes are values tagged with another value. + +``` +`0xCz` + +Where: +- is any value indicating what code is being used +- is any value representing the coded payload +- z = a hex value from 0-15 +``` + +Codes are intended for applications to represent values requiring special interpretation depending on the code used. e.g. the code could be used to represent the MIME content type for a Blob of data. + +Applications SHOULD use a small code value (e.g. a small Long, or a Byte Flag) to specify the precise type of value being encoded, and a corresponding value that is meaningful for the given code value. -Codes are values intended to represent values requiring special coded interpretation. +Applications MAY in addition use the hex digit `z` to further disambiguate code types. In combination with the 18 valid one byte encodings, this gives a reasonably generous 288 distinct code types before another byte is required. + +### `0xD0`-`0xDF` Data Records + +Data Records are record types where every field value is encoded (densely coded). ``` -`0xcn` +`0xDz` Where: -- is any value indicating what code is being used -- is any value representing the coded payload -- n = a hex value from 0-15 +- z = a hex value from 0-15 +- n = the number of fields in the record ``` -Applications SHOULD use a small code value (e.g. a small Integer) to specify the precise type of value being encoded, and a corresponding value that is meaningful for the given code value. +Data Record encoding is exactly the same as a Vector, with the exception of the tag byte. Note that if there are more than 16 fields, this means there will be child cells which are Vectors. + +Applications MAY use the hex digit `z` and/or the field count `n` to distinguish record types.
If this is insufficient, applications MAY use the first or the last field value to indicate the type, or embed a Data Record as a coded value (`0xCz`) to tag with an arbitrary type. -Applications MAY in addition use the hex digit `n` to further disambiguate code types. In combination with the 18 valid valid one byte encodings, this gives a reasonably generous 288 distinct code types before another byte is required. +The intention of Data Records is that applications may interpret these as records in their own custom format. For example, a record might represent a listing on a decentralised marketplace with fields such as Asset ID, Price, Seller ID, Listing description, Creation Time, Time Limit etc. -### `0xd0`-`0xdf` Dense Records +### `0xE0`-`0xEF` Extension Values -Data Records are record types where every field value is encoded. +Extension values are arbitrary values allowing 16 application specific meanings. ``` -`0xdn` +`0xEz` Where: -- n = a hex value from 0-15 +- z = a hex value from 0-15 ``` -Data Record encoding is exactly the same as a Vector, with the exception of the tag byte. +Extension values are arbitrary values with a one byte tag, where the low hex digit of the tag is available for applications to define a special meaning for the value. -Applications MAY in use the hex digit `n` to disambiguate record types. If this is insufficient, implementations SHOULD use the first or the last record value to indicate the type. +Examples: +- an application might define `0xEB` as an extension where the value is a String containing JSON data with a specific schema. +- another application might define `0xE0` as an enum where the values are the possible states of a finite state machine -### `0xff` Illegal -The `0xff` tag is always illegal as a tag byte in any encoding. -Implementations MUST treat and values encoded starting with `0xff` as an invalid encoding.
+Implementations MUST treat any values encoded starting with `0xFF` as an invalid encoding. ### Reserved Tags @@ -641,14 +759,18 @@ Cells which are no longer referenced by any cells currently in use may be safely ## Cell validation -Cell validation occurs in multiple steps: +Cell validation will typically occur in multiple stages: - Encoding correctness (is this a valid encoding?) - Structural correctness (is the whole tree of cells valid?) - Semantic correctness (does the value make sense in this context?) ### Encoding correctness -A cell's encoding can be checked for correctness in its up to the point of external references (i.e. you know you have `n` external references that appear to be 32 bit hash values, but the encoding or validity of those may or may not be known) +A cell's encoding can be quickly checked for correctness up to the point of external branch references (i.e. you know you have `n` external references that appear to be 32-byte hash values, but the encoding or validity of those may or may not be known). + +Implementations SHOULD detect and reject invalid encodings as early as possible. + +An invalid encoding MAY be considered sufficient evidence to discard the entire message in which it is received, since the source is provably not behaving correctly. ### Structural correctness @@ -660,9 +782,11 @@ Checking for structural correctness requires traversing external references in o Checking for structural correctness is typically an `O(n)` operation in the number of cells checked. For performance reasons, it is usually valuable to cache the results of structural correctness checks so that they do not need to be recomputed - this is especially important for large data structures with structural sharing. +Applications SHOULD NOT trust or re-transmit CAD3 messages unless they have validated them for structural correctness.
This is for security and robustness reasons: structurally incorrect messages are likely to cause errors or unexpected behaviour which an attacker might exploit. + ### Semantic correctness -Checking semantic correctness is to validate that the cell value makes sense / has meaning in the context that it is used. A cell could be structurally correct but contain values that are illegal in some application (e.g. a Vector that should contain all Integers actually containing a String). +Checking semantic correctness is to validate that the cell value makes sense / has meaning in the context that it is used. A cell could be structurally correct but contain values that are illegal in some application (e.g. a Vector that should contain Integers but actually contains a String). Checking for semantic correctness is application defined and outside the scope of CAD003. @@ -672,8 +796,6 @@ Checking for semantic correctness is application defined and outside the scope o Applications are free to assign semantic meaning to CAD3 encoded values. -This is an important aspect of the design: Just like HTTP does not enforce a specific meaning on content included in a PUT request, CAD3 does not specify how an application might choose to interpret specific CAD3 encoded values. - Applications SHOULD ensure that they are able to read, persist and communicate arbitrary CAD3 encoded values, even if the semantic meaning may be unknown. This is important for several reasons: - The application SHOULD be robust and not fail due to unrecognised but legal encodings - The application MAY need to pass on these values to other systems that do understand them @@ -684,16 +806,47 @@ In practice the recommended approach is: - If valid, application can proceed with the semantic meaning it defines - If invalid, this is presumably an exception that needs handling (e.g. 
a malicious message from an external source that should be rejected) +### Compatible Subsets + +CAD3 encodings are designed to support various other data formats as a natural subset. Applications may find it useful to exploit these correspondences to efficiently store any data in CAD3 format. + +- **JSON** encodes naturally using Map, String, Vector, Double, Integer, Boolean and Nil +- **UTF-8** text encodes naturally as a String +- **Binary data** naturally encodes as a Blob, or as a Code with the encoding format specified +- **Encrypted data** is perfectly suited for storage in a Blob +- **S-expressions** are naturally coded using Lists, Symbols and a selection of other values (Integers, Strings etc.) +- **XML** can be encoded in multiple ways e.g.: + - As a UTF-8 String + - As a Vector where each element is either a content String or a markup value. The metadata map of a Syntax Object could be used to specify element attributes +- **Tabular data** like **CSV** or **SQL** result sets is naturally represented as a Vector of Vectors. This has the added advantage of fast indexing by row number. +- **Content addressable storage** is naturally represented with an Index, which has the advantage of fast indexed lookup by content ID +- **Abstract Syntax Trees** are naturally encoded using Syntax Objects, where arbitrary metadata can be attached to nodes. The nodes themselves might contain a Vector of child nodes. + +Developers wishing to utilise such subsets SHOULD research and collaborate with other developers to establish common standards for embedding such data in CAD3 format. This is outside the scope of CAD3, but new CAD proposals defining such standards are welcome. + +### Partial Implementations + +It is possible to write a partial implementation that understands only a subset of CAD3. This may be useful e.g. for embedded devices. + +Partial implementations MUST be able to decode any cell, and recognise it as valid / invalid from an encoding perspective. 
This is necessary for correctness and interoperability. + +Partial implementations MAY ignore CAD3 values that they cannot interpret. This means that they MUST at a minimum be able to: +- calculate the length of an embedded value so that they can skip over it. This may require a bounded amount of recursion, as embedded values may embed other values inside them (up to a small depth, since embedded encodings are limited to 140 bytes) +- store the encoding of any value(s) they have ignored if they need to re-encode the data for onward transmission + +Partial implementations MAY ignore branch references, and hence avoid the need to compute SHA3-256 hashes / look up child cells by reference. In this case, care must be taken that values are small enough that they always result in embedded encodings. + ### Convex JVM Implementation In the Convex JVM implementation, cells are represented by subclasses of the class `convex.core.data.ACell`. Having a common abstract base class allows for convenient implementation of common cell functionality, is helpful for performance, and ensures that all cell instances offer a common interface. -The implementation allows for internal usage of "non-canonical" cells. These are cell instances that may break rules, e.g. encoding more than 4096 bytes in a single flat Blob. These are only used for temporary purposes (typically for performance reasons) and are always converted back to a canonical implementation for encoding purposes. +The implementation allows for internal usage of "non-canonical" cells. These are `ACell` instances that may break normal rules, e.g. encoding more than 4096 bytes in a single flat Blob. These are used for temporary purposes (typically for performance reasons) and are always converted back to a canonical implementation for CAD3 encoding purposes. The implementation keeps singleton "interned" references for various common values. This is mainly to avoid repeated memory allocation for such values.
These currently include: -- The two boolean values `true` and `false` -- Small integers -- Empty maps, sets, strings and blobs etc. +- Byte Flags including the two boolean values `true` and `false` +- Small Integers (0-255) +- ASCII Characters +- Empty Maps, Sets, Strings and Blobs etc. - Static constants such as Strings, Keywords and Symbols used frequently in the CVM -The JVM `null` value is interpreted as the Convex `nil` value. This is an implementation decision, again chosen for efficiency and performance reasons. However there is no strict requirement that `nil` must be represented this way (for example, it could alternatively be a singleton value). +The JVM `null` value is interpreted as the Convex `nil` value. This is an implementation decision, again chosen for efficiency and performance reasons. However there is no formal requirement that `nil` must be represented this way (for example, it could be a singleton value). diff --git a/docs/cad/016_peerstake/README.md b/docs/cad/016_peerstake/README.md index 876db98..4a3bf74 100644 --- a/docs/cad/016_peerstake/README.md +++ b/docs/cad/016_peerstake/README.md @@ -2,13 +2,23 @@ ## Overview -Staking is the process by which Peers in the Network and other participants lock up economic value (Stake) to support the security of the Network and earn economic rewards from participating in the CPoS consensus. +Staking is the process by which peers in the network and other participants lock up economic value (stake) to support the security of the network and earn economic rewards from participating in the CPoS consensus. -Peers must place a Peer Stake to participate in consensus. This is at risk if the Peer provably misbehaves, and may be lost through a process of Slashing, but is safe as long as the Peer continues to operate correctly and securely. +Peers must place a peer stake to participate in consensus. 
This is at risk if the Peer provably misbehaves, and may be lost through a process of Slashing, but is safe as long as the Peer continues to operate correctly and securely. -Other participants may also place a Delegated Stake on a Peer they wish to support. It is in the interests of large coin holders to support the security of the Network by placing stake on Good Peers that they trust, as well as to earn additional rewards on their holdings. +Other participants may also place a delegated stake on a peer they wish to vouch for. It is in the interests of large coin holders to support the security of the Network by placing stake on good peer operators that they trust, as well as to earn additional rewards on their holdings. -The Total Stake of a Peer determines its voting weight in the CPoS consensus. +The total stake of a peer determines its voting weight in the CPoS consensus algorithm. + +## Meaning of Stake + +Stake involves taking a risk and performing useful work for the network to earn rewards. + +A peer operator that stakes on its own peer ("peer stake") is warranting that it has **fully secured its peer key used for operational participation in consensus**. The work they do is ensuring this peer is properly managed, secured and maintains network consensus correctly. It may lose its stake if this key is compromised (typically this would mean that the peer server is itself compromised). It may also lose its stake if the controller account is compromised. + +Delegated stakers are warranting that they **trust the peer operator to maintain consensus and earn rewards while properly protecting the peer controller account**. The work they do is in evaluating peer operators and betting their coins that the peer operator performs their role honestly and effectively. Their delegated stake is not at risk if the peer itself is compromised or crashes, but *is* at risk if the controller account is compromised.
+ +It should be observed that the most important thing from a security perspective is the private key used to control the peer controller account: all stake is at risk if this is lost. For this reason it is STRONGLY RECOMMENDED that important peer controller keys are kept secure in offline storage / air-gapped systems. This is a good incentive since the network as a whole could go offline if sufficient numbers of peers are simultaneously compromised. ## Rewards @@ -17,28 +27,78 @@ Stakers are rewarded with a share of Convex Coins earned from - Reward Pools set by the Convex Foundation Rewards are divided as follows: -- The Total reward is divided over all Peers according to Peer Stake +- The total reward is divided over all Peers according to Peer Stake - For each Peer: - 50% is allocated to the Peer itself (added to peer stake) - 50% is divided across delegated stakers on the peer (according to their relative stake) - - If there are no delegated stakers, the reward goes to the Peer + - If there are no delegated stakers, the full reward goes to the Peer + +## Stake pools + +It is possible to establish a stake pool where an actor places stake on behalf of others. + +Examples: +- A public stake pool which issues a token that entitles stake pool members to a share of returns gained from peer rewards +- A private stake pool run by a large peer operator to manage stake across its own peers +- A charitable stake pool which distributes returns to good causes + +Stake pools are made possible by peer staking and CVM actor code, but are outside the scope of CAD016. Innovation is encouraged in designing effective stake pool implementations. -## Stake decay +## Effective stake decay -Peer stakes are discounted if the peer is temporarily inactive. This enables the network to progress even in the event of major peers going offline for an amount of time. +Peer stakes are temporarily discounted if the peer is inactive. 
This enables the network to progress even in the event of major peers going offline for an amount of time. Stake decay occurs at the following rate by default: - 3 minutes grace period with no decay - A fall by a factor of `1/e` every 5 minutes thereafter -Stake decay does not effect the actual peer's stake, only the effectiveness of the stake in consensus. +Stake decay does not affect the actual peer's stake, but does affect: +- The effectiveness and voting weight of the stake in consensus +- The ability of other network participants to evict the peer ## Slashing -There will be no stake slashing in Protonet, although stake decay is active so inactive or misbehaving peers will become quickly irrelevant to consensus. +Slashing is the penalisation of peers for bad behaviour. Any slashing will result in a deduction of stake, which will be transferred to the overall peer reward pool for properly behaving peers to collect in the future. + +There will be **no stake slashing in Protonet**, although stake decay is active so inactive or misbehaving peers will become quickly irrelevant to consensus (and probably be evicted). + +Slashing conditions for main network will be evaluated during Protonet phase. Questions to be considered: +- Under what conditions might slashing occur? +- Is delegated stake subject to slashing or not? + +## Changing Peer Stake + +Peer operators may add or remove peer stake from their own peers with the following command: + +```clojure +;; note: stake is denominated in coppers +(set-peer-stake 0x42272E789B7a3D57f8267c15c2d9B8BeD9b0E2035b3a8AE9A0eb9A024B7FADe5 10000000000000) +``` + +Removing all peer stake can be done by setting stake to `0`, though typically it is better to use the `evict-peer` command to remove the peer record entirely and get a memory refund.
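The default decay schedule described above can be sketched numerically (assuming continuous exponential decay after the grace period — the protocol's exact discretisation may differ):

```python
import math

GRACE_SECONDS = 3 * 60  # no decay for the first 3 minutes of inactivity
DECAY_SECONDS = 5 * 60  # effective stake falls by a factor of 1/e per 5 minutes

def effective_stake(stake: int, inactive_seconds: float) -> float:
    """Effective (consensus-weight) stake of a peer inactive for the given time.

    The actual peer stake balance is unchanged; only voting weight decays.
    """
    if inactive_seconds <= GRACE_SECONDS:
        return float(stake)
    return stake * math.exp(-(inactive_seconds - GRACE_SECONDS) / DECAY_SECONDS)
```

Under these assumptions, a peer offline for 8 minutes (3 minutes grace plus one 5-minute decay period) retains roughly 37% (`1/e`) of its voting weight.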
+ +## Changing Delegated Stake + +Changing delegated stake on any peer can be done with the `set-stake` command: + +```clojure +;; note: stake is denominated in coppers +(set-stake 0x42272E789B7a3D57f8267c15c2d9B8BeD9b0E2035b3a8AE9A0eb9A024B7FADe5 10000000000000) +``` + +## Peer Eviction + +Peers can be evicted from the global state in two situations: +- The peer's effective stake is less than the minimum effective stake (currently 1000 Convex Gold) +- The peer controller can always evict its own peer(s) -TODO: verify slashing conditions for main network +```clojure +(evict-peer 0x42272E789B7a3D57f8267c15c2d9B8BeD9b0E2035b3a8AE9A0eb9A024B7FADe5) +``` -Question: is delegated stake subject to slashing or not? +When evicted: +- The peer's stake is returned to the controller account +- Any delegated stakes are returned to the staking accounts +- The peer record is removed from the Global State -## Adding and removing Stake +There is an incentive to evict peers because deletion of the peer record will result in a memory refund to the account that performs the `evict-peer` operation. Anyone can do this for insufficiently staked peers and it reduces the CVM state size, so it is good for the ecosystem! diff --git a/docs/cad/020_tokenomics/README.md b/docs/cad/020_tokenomics/README.md index 6283676..7a07c7d 100644 --- a/docs/cad/020_tokenomics/README.md +++ b/docs/cad/020_tokenomics/README.md @@ -14,11 +14,15 @@ This does not in any way constitute financial or legal advice. Participants in t Convex serves as a public utility network, where participants are free to transact on a decentralised basis. As such, there is a requirement for an economic protocol whereby users of the network can fairly compensate the providers of infrastructure for their services. +![Convex High Level Tokenomics](tokenomics.png) + Convex Coins are initially issued in two ways: -- 75% are available for purchase on the **release curve**.
This is is a mathematically defined mechanism that releases coins as and when demanded by economic participation in the ecosystem. Funds raised are reinvested in the ecosystem to create a virtuous cycle. +- 75% are available for purchase on the **release curve**. This is a mathematically defined mechanism that releases coins as and when demanded by economic participation in the ecosystem. Funds raised are reinvested in the ecosystem to create a virtuous cycle. - 25% are available as **awards** to contributors who add value to the ecosystem in various ways (can be software engineering, open source contributions, marketing, building great use cases etc.). Contributions must benefit the ecosystem as a whole. -Once issued, coins are fully transferable and can circulate freely according to the wishes of their holders (e.g. traded on a private basis, used in smart contracts etc.) +Once issued, coins are fully transferable and can circulate freely according to the wishes of their holders (e.g. traded on a private basis, used in smart contracts etc.). + +Coins used for transaction fees (or deliberately burned) are removed from the coin supply and placed in a special "Reward Pool" which is released back to peer operators and stakers over time as a reward for maintaining the network. This model strikes the right balance between enabling long term sustainable growth and recognising those who bring value to the Convex ecosystem (financially or otherwise). There is a maximum supply cap of 1,000,000,000 Convex coins, though it will take a long time to get there. The total Coin supply at Protonet launch is estimated to be ~1-2m Convex Coins. @@ -98,19 +102,21 @@ The following overall tokenomic flows are possible: ### Coin Supply -The issued coin supply is VARIABLE based on coin issuance via the Release Curve or contributor awards. - -The Network MUST implement a technical fixed maximum coin supply cap of 1,000,000,000 Convex Coins.
The number of issued coins at any time may be less than this amount, but can never exceed this amount. +The issued coin supply is VARIABLE based on coin issuance via the Release Curve or contributor awards. It is denominated in Convex Coins. Each Convex Coin MUST be sub-divided into 1,000,000,000 base units, referred to informally as "coppers". -The Network must treat Convex Coins and coppers identically, i.e. the implementation should consider the range of possible coin values to be a value from `0` to `10^18`. +The Network MUST treat Convex Coins and coppers identically, i.e. the implementation should consider the range of possible coin values to be a value from `0` to `10^18`, where `10^9` coppers is one Convex Coin. + +Coins that are used for transaction fees (or deliberately burned) are removed from the coin supply and placed in a special Reward Pool that is used to pay rewards to peer operators and stakers. In this way, the coin economy becomes fully circular after initial issuance. + +The Network MUST implement a technical fixed maximum coin supply cap of 1,000,000,000 Convex Coins. The number of issued coins at any time may be less than this amount, but can never exceed this amount. Note: The maximum supply cap is chosen so that all valid coin balances can be expressed within a 64-bit long value, which allows for efficient implementation on most modern CPU architectures. ### Genesis -The genesis process in Convex includes the process of creating the initial Global State and establishing the first peer on the network, to which others can then connect. This genesis state is important for tokenomics because it established the initial coin allocation and the rules by which future colin allocations may be made. +The genesis process in Convex includes the process of creating the initial Global State and establishing the first peer on the network, to which others can then connect.
This genesis state is important for tokenomics because it establishes the initial coin allocation and the rules by which future coin allocations may be made. #### Top Level Coin Allocation @@ -121,9 +127,9 @@ The Network MUST divide the total initial supply of Convex Coins into two quanti #### Reserve accounts -The genesis MUST create a set of reserve accounts (`#0` to `#7`) which represent unissued coins. Such coins MUST NOT be considered part of the current coin supply. +The genesis MUST create a set of reserve accounts (`#1` to `#7`) which represent unissued coins. Such coins MUST NOT be considered part of the current coin supply. -By reserving these accounts, we maintain the invariant that the total supply cap of 1,000,000,000 Convex Gold is constant and coins cannot be created or destroyed, but the majority of these are not yet part of the current total supply. +By reserving these amounts, we maintain the technical invariant that the total maximum supply cap of 1,000,000,000 Convex Gold is constant and coins cannot be created or destroyed - however the majority of these may not yet be part of the current coin supply. Any cryptographic keys for reserve accounts MUST be kept securely and governed according to the release tokenomics described in this CAD. The Convex Foundation will use air-gapped systems initially for this purpose. @@ -140,25 +146,36 @@ The genesis account MUST NOT have access to the majority of the reserve account #### Distribution account(s) -The genesis process SHOULD produce one or more secondary distribution accounts that will hold Convex coins temporarily before distribution to award recipients or purchasers. +The genesis process SHOULD define one or more secondary distribution accounts that will hold Convex coins temporarily before distribution to award recipients or purchasers.
The distribution accounts SHOULD NOT hold large balances of coins, and are only intended for short term holdings of coins that are already allocated to recipients (e.g. purchasers who have purchased coins, but not yet provided a public key or account into which the coins can be delivered). These balances are considered as issued (i.e. part of the current coin supply) but not yet distributed, i.e. still in the control of the governance body. The governance body MUST ensure these accounts are securely controlled by authorised individuals to ensure legitimate distributions are made. +### Memory Exchange Pool + +A certain amount of Convex Coins is placed in an AMM exchange for CVM memory allowances. Such coins are in effect locked under a smart contract, though should still be considered part of the overall coin supply as they are technically available for use (e.g. people selling back memory allowances). + +The memory allowances themselves are a secondary native token used purely for memory accounting purposes. + +See [CAD006 Memory Accounting](../006_memory/README.md) for more details. + ### Release Curve +The release curve determines the price at which new coins are issued. There is an important economic principle behind this: **more coins only get released when prices go up** (i.e. hit new highs). This gives coin purchasers the assurance that they will never be diluted by new coin releases at lower prices, while still allowing the coin supply to be increased as ecosystem demand grows. + Coin purchases MUST be priced in fiat currency or equivalent, consistent with the Release Curve defined in this section. The price of a Coin on the release curve is defined as `$100 * x / (1-x)` where `x` is the proportion of coins released out of the total allocation for coin purchasers, and `$` represents United States dollars or equivalent currency.
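To make the formula above concrete, here is a small illustrative sketch (not part of the normative CAD text; the function name `release_price_usd` is invented for illustration) of the release curve price `$100 * x / (1-x)`:

```python
# Illustrative sketch only (not normative): the Release Curve price formula
# from CAD020, price = $100 * x / (1 - x), where x is the proportion of the
# purchaser allocation already released. The function name is hypothetical.

def release_price_usd(x: float) -> float:
    """Price in USD of the next coin released, for proportion x in [0, 1)."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must be in the range [0, 1)")
    return 100.0 * x / (1.0 - x)

# The price starts at zero and rises without bound as x approaches 1,
# so new coins are only ever released at successively higher prices.
assert release_price_usd(0.0) == 0.0
assert release_price_usd(0.5) == 100.0  # half released => $100 per coin
```

This makes the "no dilution at lower prices" property visible: since the curve is strictly increasing, any later release necessarily happens at a higher price than any earlier one.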
-Note: The constant value `$100` is chosen so that once `50%` of all coins are issued, the market cap of Convex Coins would be equal to `$50bn`. At this stage, the Convex Foundation would have a significant treasury sufficient to develop and maintain the Convex ecosystem in perpetuity. +![Convex Coin Release Curve](release-curve.png) + +Note: The constant value `$100` is chosen so that once `50%` of all coins are issued, the market cap of Convex Coins would be equal to `$50bn`. At this stage, one would expect the Convex Foundation to have a significant treasury available to give strong economic support to the Convex ecosystem in perpetuity. The Release Curve formula MAY be adjusted in the event of significant economic events affecting the relative value of fiat currencies used (e.g. sustained high rates of inflation). The Foundation MUST consult with the ecosystem and provide a robust rationale for any such changes. To account for transaction costs, effective financial management or purchaser convenience, the Foundation MAY group the release of some coins into rounds, provided that such rounds MUST be broadly consistent with the overall Release Curve. - ### Coin Purchases The 75% allocation for Coin Purchasers MUST be distributed on the basis of purchases of coins in a manner consistent with the release curve. @@ -196,12 +213,13 @@ The Convex Foundation MAY require contributors to verify their legal identity (K The Convex Foundation SHOULD aim to ensure that the rate of awards remains broadly consistent with the ratio 25% : 75% relative to purchases from the release curve, with the understanding that this ratio may deviate from target in the short term. The Convex Foundation SHOULD explore options for decentralised governance of awards. In the long term, decentralised governance SHOULD apply to all awards.
### Vesting -Early coin purchases via the FCPA (up to and during Protonet phase) are subject to a vesting schedule, reflecting the desire that early purchasers should remain committed to the ecosystem for a period of time, and to mitigate the risk of large simultaneous sales of coins. +Early coin purchases via the FCPA (up to and during Protonet phase) are subject to a vesting schedule, reflecting the desire that early purchasers should remain committed to the ecosystem for a period of time, and to mitigate the potential destabilising effect of large simultaneous sales of coins. -Coin awards will not be subject to any vesting schedule as they are considered already "earned" by contributors. However, contributors are likely to wish to remain involved for other reasons e.g. building applications on top of Convex or wishing to earn future awards. +Coin awards will not be subject to any vesting schedule as they are considered already "earned" by contributors. However, contributors are likely to wish to remain involved for other reasons e.g. building applications on top of Convex or wishing to stake their coins in various ways. ### Transaction Fees @@ -209,17 +227,52 @@ Transactions executed on the Convex network are subject to fees that reflect the Transaction fees are intended to be small, to encourage adoption and use of the Convex network. Transaction fees MUST NOT be zero to mitigate against denial of service (DoS) attacks on the network. -Transaction fees MUST be collected at the point of transaction execution, and placed in a pool for subsequent distribution to peer operators. This process MUST occur automatically as part of the network protocol. +Transaction fees MUST be collected at the point of transaction execution, and placed in the Reward Pool for subsequent distribution to peer operators. This process MUST occur automatically as part of the network protocol. + +### Reward Pool + +The Peer Reward Pool is stored in the special account `#0`. 
+ +Transaction fees for the execution of transactions are deposited in the Reward Pool (this occurs at the end of each block of transactions successfully submitted by a peer and confirmed in consensus). + +Over time, this Reward Pool is used to make payments to peers that are participating actively and correctly in maintaining network consensus, thus giving a return to stakers. + +Account `#0` is also an address to which users can optionally "burn" coins. Such coins are removed from the coin supply, but will be available for future distribution as peer rewards. + +The Convex Foundation MAY, at its discretion, issue coins and immediately "burn" them in order to increase the incentives for peer operators to participate. Such coins will be considered to come out of the 25% available for contributor awards. ## Other considerations +### Secondary sale + +Once issued, Convex Coins may be traded on secondary markets, e.g. via private sale or on digital asset exchanges. Such exchanges are beyond the scope of CAD020 and do not affect the overall coin supply (since they represent transfers between users and/or actors with coins that are already part of the coin supply). + +For example: it is an entirely legitimate business model to purchase coins from the Release Curve and offer them for sale to retail users of the network. This is good for the ecosystem because it enables innovation and diversity in the ways that Convex Coins are made available to end users. + +### Wrapped / Locked Coins + +It is possible to lock Convex Coins in an actor account / smart contract. Examples: +- Holding a deposit in escrow for a smart contract +- Liquidity pools on the Torus DEX +- "Wrapping" Convex Coins as a CAD029 fungible token + +Such coins are considered part of the coin supply, even though they are not immediately available for users: the rationale for this is that they can still be redeemed or withdrawn under various circumstances.
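As an illustrative sketch (not normative; the variable and function names are invented for illustration), the circular flow between circulating coins and the Reward Pool in account `#0` can be modelled as a simple conservation of the combined total:

```python
# Illustrative sketch only: models the circular fee flow described above.
# Fees and burns move coins from circulating balances into the Reward Pool
# (account #0); reward payouts move them back to peers/stakers. The sum of
# the two never changes, so the economy is circular after initial issuance.

circulating = 1_000_000_000  # coppers held by users (illustrative figure)
reward_pool = 0              # balance of the special account #0

def pay_fee(amount: int) -> None:
    """Deduct a transaction fee (or voluntary burn) into the Reward Pool."""
    global circulating, reward_pool
    circulating -= amount
    reward_pool += amount

def distribute_rewards(amount: int) -> None:
    """Pay out accumulated rewards to peer operators and stakers."""
    global circulating, reward_pool
    reward_pool -= amount
    circulating += amount

total = circulating + reward_pool
pay_fee(5_000)             # fees leave the circulating supply...
distribute_rewards(2_000)  # ...and later flow back as peer rewards
assert circulating + reward_pool == total  # coins are conserved overall
```

The point of the sketch is the invariant in the final assertion: fees and burns never destroy coins outright, they only defer them into future peer rewards.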
+ +### Lost coins + +It is possible to put coins beyond any possible future use in certain ways, e.g.: +- Transferring them to an actor account which has no way to return them +- Permanently losing the private key to a user account + +Such coins are considered "lost". While technically still part of the coin supply, for practical purposes they are gone forever. + ### Fully Diluted Value Some tokenomic analyses depend on the concept of "Fully Diluted Value", obtained by multiplying the current price of tokens by some maximum supply cap. Such analysis is generally invalid and SHOULD NOT be applied in the case of Convex, because full dilution cannot occur under the release curve model at the current price and such a number is effectively meaningless. -Analysts SHOULD NOT consider FDV in relation to Convex, and instead use the market capitalisation of issued coins (i.e. the current supply), which more accurately represents the coin ecosystem. This is also more consistent with other tokens that do not have a fixed supply. +Analysts SHOULD use the market capitalisation of issued coins (i.e. the current coin supply multiplied by the market price) as the overall valuation for the Convex Coin, which more accurately represents the coin ecosystem. This is also consistent with valuation approaches for other tokens that do not have a fixed supply. ### Testnets @@ -233,4 +286,4 @@ Participants in the ecosystem SHOULD NOT conduct economically significant activi In the event of a high severity event that substantially affects overall tokenomics, the governance body MAY take remedial action, up to and including requiring a mandatory network update to fix the issue. Such action will only be taken as a last resort, and MUST be done in a manner that minimises the impact on legitimate coin holders.
-The governance body MUST NOT take any action which results in the modification of the coin balances of any 3rd party user or actor accounts, except where necessary to remediate the effects of a security breach as above. +The governance body MUST NOT take any action which results in the modification of the coin balances of any 3rd party user or actor accounts, except where necessary to remediate the effects of a security breach as above. \ No newline at end of file diff --git a/docs/cad/020_tokenomics/release-curve.png b/docs/cad/020_tokenomics/release-curve.png new file mode 100644 index 0000000..becb1fc Binary files /dev/null and b/docs/cad/020_tokenomics/release-curve.png differ diff --git a/docs/cad/020_tokenomics/tokenomics.png b/docs/cad/020_tokenomics/tokenomics.png new file mode 100644 index 0000000..3555390 Binary files /dev/null and b/docs/cad/020_tokenomics/tokenomics.png differ