From 1ad87aaf0f75a8e3806bc886ac0d2d15a6397700 Mon Sep 17 00:00:00 2001 From: Carsten Bormann Date: Fri, 1 Nov 2024 14:45:44 +0100 Subject: [PATCH 1/6] Merge general information about encoding indicators --- draft-ietf-cbor-edn-literals.md | 79 ++++++++++++++++----------------- 1 file changed, 39 insertions(+), 40 deletions(-) diff --git a/draft-ietf-cbor-edn-literals.md b/draft-ietf-cbor-edn-literals.md index 760fdf3..3ee4e75 100644 --- a/draft-ietf-cbor-edn-literals.md +++ b/draft-ietf-cbor-edn-literals.md @@ -4,7 +4,7 @@ v: 3 title: > CBOR Extended Diagnostic Notation (EDN) docname: draft-ietf-cbor-edn-literals-latest -# date: 2024-08-24 +# date: 2024-11-01 keyword: Internet-Draft cat: info @@ -439,35 +439,60 @@ or, combining the use of inline and end-of-line comments: ## Encoding Indicators {#encoding-indicators} -TODO: align this with {{spec}} - Sometimes it is useful to indicate in the diagnostic notation which of several alternative representations were actually used; for example, a data item written >1.5\< by a diagnostic decoder might have been encoded as a half-, single-, or double-precision float. The convention for encoding indicators is that anything starting with -an underscore and all following characters that are alphanumeric or +an underscore and all immediately following characters that are alphanumeric or underscore is an encoding indicator, and can be ignored by anyone not interested in this information. For example, `_` or `_3`. Encoding indicators are always optional. -An underscore followed by a decimal digit n indicates that the +(In the following, an abbreviation of the form `ai=`nn gives nn as +the numeric value of the field _additional information_, the low-order 5 +bits of the initial byte: see {{Section 3 of RFC8949@-cbor}}.) 
+ +An underscore followed by a decimal digit `n` indicates that the preceding item (or, for arrays and maps, the item starting with the preceding bracket or brace) was encoded with an additional information -value of 24+n. For example, `1.5_1` is a half-precision floating-point +value of `ai=`24+`n`. For example, `1.5_1` is a half-precision floating-point number, while `1.5_3` is encoded as double precision. -(Note that the encoding -indicator "_" is thus an abbreviation of the full form "_7", which is -not used.) -Encoding Indicators are discussed further below for indefinite length -strings, and for arrays and maps. +The encoding indicator `_` is an abbreviation of what would in full +form be `_7`, which is not used. +Therefore, an underscore `_` on its own stands for indefinite length +encoding (`ai=31`). +(Note that this encoding indicator is only available behind the opening +brace/bracket for `map` and `array` ({{ei-container}}): strings have a special syntax +`streamstring` for indefinite length encoding except for the special +cases ''_ and ""_ ({{ei-string}}).) + +The encoding indicators `_0` to `_3` can be used to indicate `ai=24` +to `ai=27`, respectively. + +Surprisingly, {{Section 8.1 of RFC8949@-cbor}} does not address `ai=0` to +`ai=23` — the assumption seems to have been that preferred serialization +({{Section 4.1 of RFC8949@-cbor}}) will be used when converting CBOR +diagnostic notation to an encoded CBOR data item, so leaving out the +encoding indicator for a data item with a preferred serialization +will implicitly use `ai=0` to `ai=23` if that is possible. +The present specification allows making this explicit: + +`_i` ("immediate") stands for encoding with `ai=0` to `ai=23`. + +While no pressing use for further values for encoding indicators +comes to mind, this is an extension point for EDN; {{reg-ei}} defines +a registry for additional values. 
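As a rough illustration of the mapping described above (not part of the draft text), the following Python sketch turns an encoding indicator into the `ai` values it denotes and uses that to encode a floating-point number. The names `ai_for_indicator` and `encode_float` are assumptions of this sketch, and only the indicators named in this section (`_`, `_0` to `_3`, `_i`) are handled.

```python
import struct


def ai_for_indicator(indicator: str) -> range:
    """Return the additional-information (ai) values an indicator denotes."""
    if indicator == "_":
        return range(31, 32)  # indefinite length encoding, ai=31
    if indicator == "_i":
        return range(0, 24)  # "immediate", ai=0 to ai=23
    if len(indicator) == 2 and indicator[0] == "_" and indicator[1] in "0123":
        n = int(indicator[1])
        return range(24 + n, 25 + n)  # ai=24+n, i.e. ai=24 to ai=27
    raise ValueError(f"unknown encoding indicator: {indicator!r}")


# struct format codes for the float widths selected by ai=25, 26, 27
_FLOAT_FORMATS = {25: ">e", 26: ">f", 27: ">d"}


def encode_float(value: float, indicator: str) -> bytes:
    """Encode a float (major type 7) with the width chosen by the indicator."""
    (ai,) = ai_for_indicator(indicator)  # exactly one ai value is expected
    if ai not in _FLOAT_FORMATS:
        raise ValueError(f"{indicator!r} does not select a float width")
    initial_byte = (7 << 5) | ai  # major type in the high 3 bits, ai in the low 5
    return bytes([initial_byte]) + struct.pack(_FLOAT_FORMATS[ai], value)


print(encode_float(1.5, "_1").hex())  # half precision
print(encode_float(1.5, "_3").hex())  # double precision
```

With this sketch, `1.5_1` yields a three-byte item and `1.5_3` a nine-byte item, which reflects the half- versus double-precision distinction made in the text.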
+ +Encoding Indicators are discussed in further detail in {{ei-string}} for +indefinite length strings and in {{ei-container}} for arrays and maps. ## Numbers @@ -579,7 +604,7 @@ text-based byte string literals, e.g., `\\` stands for a single backslash and `\'` stands for a single quote. (See {{concat}} for details.) -### Encoding Indicators of Strings +### Encoding Indicators of Strings {#ei-string} The detailed chunk structure of byte and text strings encoded with indefinite length can be @@ -753,7 +778,7 @@ as are In summary, comma use is now aligned between EDN and CDDL, in a fully backwards compatible way. -### Encoding Indicators of Arrays and Maps +### Encoding Indicators of Arrays and Maps {#ei-container} A single underscore can be written after the opening brace of a map or the opening bracket of an array to indicate that the data item was @@ -1221,33 +1246,7 @@ The following additional items should help in the interpretation: floating point literal. * {: #spec} `spec` stands for an encoding indicator. - - (In the following, an abbreviation of the form `ai=`nn gives nn as - the numeric value of the field _additional information_, the low-order 5 - bits of the initial byte: see {{Section 3 of RFC8949@-cbor}}.) - - As per {{Section 8.1 of RFC8949@-cbor}}: - - * an underscore `_` on its own stands - for indefinite length encoding (`ai=31`, only available behind the - opening brace/bracket for `map` and `array`: strings have a special - syntax `streamstring` for indefinite length encoding except for the - special cases ''_ and ""_), and - * `_0` to `_3` stand for `ai=24` to `ai=27`, respectively. 
- - Surprisingly, {{Section 8.1 of RFC8949@-cbor}} does not address `ai=0` to - `ai=23` — the assumption seems to be that preferred serialization - ({{Section 4.1 of RFC8949@-cbor}}) will be used when converting CBOR - diagnostic notation to an encoded CBOR data item, so leaving out the - encoding indicator for a data item with a preferred serialization - will implicitly use `ai=0` to `ai=23` if that is possible. - The present specification allows to make this explicit: - - * `_i` ("immediate") stands for encoding with `ai=0` to `ai=23`. - - While no pressing use for further values for encoding indicators - comes to mind, this is an extension point for EDN; {{reg-ei}} defines - a registry for additional values. + See {{encoding-indicators}} for details. * {: #concat} Extended diagnostic notation allows a (text or byte) string to be From 41aad1bb7d8e4b51b548860e195d2bfe1cf2790d Mon Sep 17 00:00:00 2001 From: Carsten Bormann Date: Sun, 3 Nov 2024 17:06:08 +0000 Subject: [PATCH 2/6] Update boilerplate for BCP14 --- draft-ietf-cbor-edn-literals.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/draft-ietf-cbor-edn-literals.md b/draft-ietf-cbor-edn-literals.md index 3ee4e75..bdcc08e 100644 --- a/draft-ietf-cbor-edn-literals.md +++ b/draft-ietf-cbor-edn-literals.md @@ -285,7 +285,7 @@ well as {{-cddlupd}}. Additional information about the relationship between the two languages EDN and CDDL is captured in {{edn-and-cddl}}. -{::boilerplate bcp14-tagged} +{::boilerplate bcp14-tagged-bcp14} ## (Non-)Objectives of this Document From 7367d76f37165e1ee9e3ac809472213e5b1ee5c2 Mon Sep 17 00:00:00 2001 From: Carsten Bormann Date: Sun, 3 Nov 2024 18:39:36 +0000 Subject: [PATCH 3/6] Add proper reference to STD63 (UTF-8). 
--- draft-ietf-cbor-edn-literals.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/draft-ietf-cbor-edn-literals.md b/draft-ietf-cbor-edn-literals.md index bdcc08e..9caa618 100644 --- a/draft-ietf-cbor-edn-literals.md +++ b/draft-ietf-cbor-edn-literals.md @@ -34,6 +34,7 @@ normative: STD94: cbor RFC8742: seq STD68: abnf + STD63: utf8 RFC7405: abnfcs RFC3339: datetime RFC3986: uri @@ -368,7 +369,7 @@ expressiveness of CBOR and to increase its usability. EDN borrows the JSON syntax for numbers (integer and floating-point, {{numbers}}), certain simple values ({{simple-values}}), -UTF-8 text +UTF-8 {{-utf8}} text strings, arrays, and maps (maps are called objects in JSON; the diagnostic notation extends JSON here by allowing any data item in the map key position). @@ -553,7 +554,7 @@ of numbers in CBOR can be found in the informational document ## Strings CBOR distinguishes two kinds of strings: text strings (the bytes in -the string constitute UTF-8 text, major type 3), and byte strings +the string constitute UTF-8 {{-utf8}} text, major type 3), and byte strings (CBOR does not further characterize the bytes that constitute the string, major type 2). From a52189b878bd0ec63aeff26f05f5eb5d0c964c5d Mon Sep 17 00:00:00 2001 From: Carsten Bormann Date: Sun, 3 Nov 2024 18:40:43 +0000 Subject: [PATCH 4/6] Properly explain prefixed string literals --- draft-ietf-cbor-edn-literals.md | 160 ++++++++++++++++++++++++-------- 1 file changed, 120 insertions(+), 40 deletions(-) diff --git a/draft-ietf-cbor-edn-literals.md b/draft-ietf-cbor-edn-literals.md index 9caa618..de4561c 100644 --- a/draft-ietf-cbor-edn-literals.md +++ b/draft-ietf-cbor-edn-literals.md @@ -236,7 +236,7 @@ Similarly, this notation could be extended in a separate document to provide documentation for NaN payloads, which are not covered in this document. 
--> -After introductory material, {{application-oriented-extension-literals}} +After introductory material, {{app-lit}} introduces the concept of application-oriented extension literals and defines the "dt" and "ip" extensions. {{stand-in}} defines mechanisms @@ -299,7 +299,7 @@ In particular, it states: One important application of EDN is the notation of CBOR data for humans: in specifications, on whiteboards, and for entering test data. -A number of features, such as comments in string literals, are mainly +A number of features, such as comments inside prefixed string literals, are mainly useful for people-to-people communication via EDN. Programs also often output EDN for diagnostic purposes, such as in error messages or to enable comparison (including generation of diffs @@ -384,7 +384,8 @@ with basic validity discussed in {{Section 5.3.1 of RFC8949@-cbor}}, and tag validity discussed in {{Section 5.3.2 of RFC8949@-cbor}}. Tag validity is more likely a subject for individual application-oriented extensions, while the two cases of basic validity -(for text strings and for maps) are addressed below under the heading +(for text strings and for maps) are addressed in Sections +{{false\<, >true\<, >null\<, and >undefined\< cannot be used as such +prefixes. +This means that the text string value (the "content") of the single-quoted string +literal is not used directly as a byte string, but is further +processed in a way that is defined by the meaning given to the prefix. +Depending on the prefix, the result of that processing can, but need +not be, a byte string value. + +Prefixed string literals (which are always single-quoted after the +prefix) are used both for base-encoded byte string literals (see {{encoded-byte-strings}}) and for +application-oriented extension literals (see {{app-lit}}, called app-string). 
+(Additional base-encoded string literals can be defined as +application-oriented extension literals by registering their prefixes; +there is no fundamental difference between the two predefined +base-encoded string literal prefixes (`h`, `b64`) and any such potential +future extension literal prefixes.) + ### Encoding Indicators of Strings {#ei-string} The detailed chunk structure of byte and text strings encoded with @@ -624,19 +648,20 @@ to preserve the chunk structure. Besides the unprefixed byte string literals that are analogous to JSON text string literals, EDN provides base-encoded byte string literals. -These are notated in one of the base encodings {{-base}}, without -padding, enclosed in a single-quoted string literal, prefixed by >h\< for base16 or +These are notated as prefixed string literals that carry one of the base encodings {{-base}}, without +padding, i.e., the base encoding is +enclosed in a single-quoted string literal, prefixed by >h\< for base16 or >b64\< for base64 or base64url (the actual encodings of the latter do not overlap, so the string remains unambiguous). -For example, the byte string consisting of the four bytes `12 34 56 -78` (given in hexadecimal here) could be written `h'12345678'` or `b64'EjRWeA'`. +For example, the byte string consisting of the four bytes `12 34 56 78` +(given in hexadecimal here) could be written `h'12345678'` or `b64'EjRWeA'`. (Note that {{Section 8 of RFC8949@-cbor}} also mentions >b32\< for base32 and >h32\< for base32hex. This has not been implemented widely and therefore is not directly included in this specification. These and further byte string formats now can easily be added back as -application-oriented literals.) +application-oriented extension literals.) Examples often benefit from some blank space (spaces, line breaks) in byte strings. 
 In EDN, blank space is ignored in prefixed byte strings; for @@ -679,11 +704,22 @@ instance, each pair of columns in the following are equivalent: <<>> h'' ~~~~ -### Validity of Text Strings - -TODO: Add general text about validity of text strings, also addressing -the semantics of concatenation (`+`). -This could move up some text from {{concat}} and following. +### Validity of Text Strings {#text-validity} + +To be valid CBOR, {{Section 5.3.1 of RFC8949@-cbor}} requires that text +strings be byte sequences in UTF-8 {{-utf8}} form. +EDN provides several ways to construct such byte strings (see {{concat}} +for details). +These mechanisms might operate on subsequences that do not themselves +constitute UTF-8, e.g., when larger sequences are built by +concatenating subsequences; for the validity of a text string +resulting from these mechanisms, it matters only that the +result is UTF-8. +Both double-quoted and single-quoted string literals have been defined +such that they lead to byte sequences that are UTF-8: the source +language of EDN is UTF-8, and all escaping mechanisms lead only to +adding further UTF-8 characters. +Only prefixed string literals can generate non-UTF-8 byte sequences.
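The validity rule above can be sketched in Python as follows (an illustration only, not part of the draft text). The helper name `is_valid_text_string` is an assumption of this sketch. It checks only the concatenated result, mirroring the statement that individual subsequences need not be UTF-8 themselves. The chunks stand for byte sequences that prefixed string literals such as `h'e2'` and `h'82ac'` could produce.

```python
def is_valid_text_string(chunks: list[bytes]) -> bool:
    """Return True if the concatenation of all chunks is well-formed UTF-8."""
    try:
        b"".join(chunks).decode("utf-8")  # strict decoding rejects ill-formed input
    except UnicodeDecodeError:
        return False
    return True


# The UTF-8 encoding of U+20AC (EURO SIGN), e2 82 ac, split across two chunks.
euro_chunks = [bytes.fromhex("e2"), bytes.fromhex("82ac")]

print(is_valid_text_string(euro_chunks))       # whole sequence
print(is_valid_text_string(euro_chunks[:1]))   # first chunk on its own
print(is_valid_text_string(euro_chunks[1:]))   # second chunk on its own
```

Neither chunk is UTF-8 on its own (a lead byte without its continuation bytes, and continuation bytes without a lead byte), but the concatenated result is, and that result is what determines validity.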