diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 6253ab1bd28..185230e0297 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -251,8 +251,14 @@ External specifications may also reference particular components of Unicode loca
 ### EBNF
 The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are:
-1. Bounded repetition following Perl regex syntax is allowed, such as alphanum{3,8}
-2. Constraints (well-formedness or validity) use separate notes
+1. Bounded repetition following Perl regex syntax is allowed, such as `alphanum{3,8}`.
+2. Whitespace inside bracketed enumerations and ranges is ignored.
+ * e.g., `[A-Z a-z]` is the same as `[A-Za-z]`
+3. A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character.
+ * e.g., `\x20` is the same as `#x20` and `[\&\-]` is the same as `[#x26#x2D]`
+4. Constraints (well-formedness or validity) may use separate notes, and/or the W3C notations:
+ * [ wfc: ... ]
+ * [ vc: ... ]
 In the text, this is sometimes referred to as "EBNF (Perl-based)".
@@ -325,8 +331,8 @@ A _Unicode language identifier_ has the following structure (provided in EBNF (P
 alphanum
= [0-9 A-Z a-z] ;
-> As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition:
-> The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed).
+The following is an additional well-formedness constraint:
+ 1. [ wfc: The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed). ]
 The semantics of the various subtags is explained in _[Language Identifier Field Definitions](#Field_Definitions)_ ; there are also direct links from [`unicode_language_subtag`](#unicode_language_subtag) , etc. While theoretically the [`unicode_language_subtag`](#unicode_language_subtag) may have more than 3 letters through the IANA registration process, in practice that has not occurred. The [`unicode_language_subtag`](#unicode_language_subtag) "und" may be omitted when there is a [`unicode_script_subtag`](#unicode_script_subtag) ; for that reason [`unicode_language_subtag`](#unicode_language_subtag) values with 4 letters are not permitted. However, such [`unicode_language_id`](#unicode_language_id) values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier).
@@ -336,8 +342,6 @@ For example, "en-US" (American English), "en_GB" (British English), "es-419" (La
 A _Unicode locale identifier_ is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in _[Unicode BCP 47 U Extension](#u_Extension)_ and _[Unicode BCP 47 T Extension](#BCP47_T_Extension)_.
Other extensions and private use extensions are supported for pass-through. The following table defines syntactically _well-formed_ identifiers: they are not necessarily _valid_ identifiers. For additional validity criteria, see the links on the right.
-As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition: There cannot be more than one extension with the same singleton (-a-, …, -t-, -u-, …). Note that the private use extension (-x-) must come after all other extensions.
-
 | | EBNF | Validity / Comments |
 | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------- |
 | `unicode_locale_id` | `= unicode_language_id`
  `extensions*`
  `pu_extensions? ;` |
@@ -358,8 +362,11 @@ As is often the case, the complete syntactic constraints are not easily captured
 | `tkey` | `= alpha digit ;` |
 | `tvalue` | `= (sep alphanum{3,8})+ ;` |
-> As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition:
-> The sequence of variant subtags in a tlang must not have any duplicates.
+The following are additional well-formedness constraints:
+ 1. [ wfc: There cannot be more than one extension with the same singleton. For example, en-u-ca-buddhist-u-cf-standard is ill-formed. ]
+ 2. [ wfc: There cannot be more than one occurrence of the same ukey or tkey. For example, en-u-ca-buddhist-ca-islamic is ill-formed. ]
+ 3. [ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
+ 4. [ wfc: The private use extension (-x-) must come after all other extensions. ]
 For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
@@ -2983,12 +2990,14 @@ In such a case, that specification may specify a subset or superset of the synta
 | `s` |
= \[:Pattern_White_Space:\]\*
| optional whitespace | | `sRequired` |
= \[:Pattern_White_Space:\]\+
| required whitespace |
-Some constraints on UnicodeSet syntax are not captured by this EBNF.
-Notably:
-1. Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)].
-2. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
-3. Ranges (**X**-**Y**) are only supported in the case that elements **X** and **Y** resolve to single code points. That is, **\[a-b\]** and **\[\{a\}-\{b\}\]** are supported, while **\[a-{bz}\]** and **\[\{ax\}-\{bz\}\]** are not, because single-codepoint-strings are equivalent to that code point.
-4. If **\[…\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
+The following are additional well-formedness and validity constraints:
+1. [ wfc: Ranges (**X**-**Y**) are only well-formed in the case that elements **X** and **Y** resolve to single code points. That is, **\[a-b\]** and **\[\{a\}-\{b\}\]** are well-formed because single-codepoint-strings are equivalent to that code point, while **\[a-{bz}\]** and **\[\{ax\}-\{bz\}\]** are ill-formed. ]
+2. [ vc: Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. ]
+
+Note also that:
+1. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
+2. If **\[…\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error.
+3. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
 The syntax characters are listed in the table below: