CLDR-17211 Fix well-formedness clauses (#3605)
macchiati authored Apr 3, 2024
1 parent 9de9528 commit 827d458
Showing 1 changed file with 23 additions and 14 deletions.
37 changes: 23 additions & 14 deletions docs/ldml/tr35.md
@@ -251,8 +251,14 @@ External specifications may also reference particular components of Unicode loca
### EBNF
The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are:

- 1. Bounded repetition following Perl regex syntax is allowed, such as alphanum{3,8}
- 2. Constraints (well-formedness or validity) use separate notes
+ 1. Bounded repetition following Perl regex syntax is allowed, such as `alphanum{3,8}`.
+ 2. Whitespace inside bracketed enumerations and ranges is ignored.
+    * e.g., `[A-Z a-z]` is the same as `[A-Za-z]`
+ 3. A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character.
+    * e.g., `\x20` is the same as `#x20` and `[\&\-]` is the same as `[#x26#x2D]`
+ 4. Constraints (well-formedness or validity) may use separate notes, and/or the W3C notations:
+    * [ wfc: ... ]
+    * [ vc: ... ]

In the text, this is sometimes referred to as "EBNF (Perl-based)".
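
As a rough illustration of differences 1 and 2 (not part of the commit), the following Python sketch turns an LDML bracketed enumeration into a Python regular-expression character class and then applies a bounded repetition to it. The helper name `bracket_to_regex` is hypothetical, and the sketch assumes the enumeration contains no backslash-escaped whitespace.

```python
import re


def bracket_to_regex(enumeration: str) -> str:
    """Drop the whitespace that EBNF (Perl-based) ignores inside [...].

    Assumption: the input has no backslash-escaped whitespace.
    """
    return re.sub(r"\s+", "", enumeration)


# alphanum = [0-9 A-Z a-z] ;
ALPHANUM = bracket_to_regex("[0-9 A-Z a-z]")

# alphanum{3,8}: the bounded repetition carries over to Python unchanged.
ALPHANUM_3_8 = re.compile(ALPHANUM + "{3,8}")

print(bracket_to_regex("[A-Z a-z]"))             # [A-Za-z]
print(bool(ALPHANUM_3_8.fullmatch("buddhist")))  # True
print(bool(ALPHANUM_3_8.fullmatch("ca")))        # False
```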

@@ -325,8 +331,8 @@ A _Unicode language identifier_ has the following structure (provided in EBNF (P
<tr><td><code>alphanum</code></td><td><pre>= [0-9 A-Z a-z] ;</pre></td></tr>
</tbody></table>

- > As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition:
- > The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed).
+ The following is an additional well-formedness constraint:
+ 1. [ wfc: The sequence of variant subtags must not have any duplicates (e.g., de-1996-fonipa-1996 is not syntactically well-formed). ]
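
The duplicate-variant constraint above can be checked mechanically. Below is a minimal Python sketch, not taken from the commit. The function name `duplicate_variants` is hypothetical, and the sketch assumes the `unicode_variant_subtag` production from the same specification (`alphanum{5,8} | digit alphanum{3}`). It skips the first subtag, which is the language or script, and stops at the first singleton.

```python
import re

# Assumed production: unicode_variant_subtag = (alphanum{5,8} | digit alphanum{3})
VARIANT = re.compile(r"[0-9a-z]{5,8}|[0-9][0-9a-z]{3}")


def duplicate_variants(language_id: str) -> list:
    """Return variant subtags that occur more than once (case-insensitive)."""
    subtags = re.split(r"[-_]", language_id.lower())
    seen, duplicates = set(), []
    # Skip the leading language/script subtag; alpha{5,8} languages
    # would otherwise look like variants.
    for subtag in subtags[1:]:
        if len(subtag) == 1:  # a singleton starts the extensions
            break
        if VARIANT.fullmatch(subtag):
            if subtag in seen:
                duplicates.append(subtag)
            seen.add(subtag)
    return duplicates


print(duplicate_variants("de-1996-fonipa-1996"))  # ['1996']
print(duplicate_variants("de-1996-fonipa"))       # []
```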

The semantics of the various subtags is explained in _[Language Identifier Field Definitions](#Field_Definitions)_ ; there are also direct links from [`unicode_language_subtag`](#unicode_language_subtag) , etc. While theoretically the [`unicode_language_subtag`](#unicode_language_subtag) may have more than 3 letters through the IANA registration process, in practice that has not occurred. The [`unicode_language_subtag`](#unicode_language_subtag) "und" may be omitted when there is a [`unicode_script_subtag`](#unicode_script_subtag) ; for that reason [`unicode_language_subtag`](#unicode_language_subtag) values with 4 letters are not permitted. However, such [`unicode_language_id`](#unicode_language_id) values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier).

@@ -336,8 +342,6 @@ For example, "en-US" (American English), "en_GB" (British English), "es-419" (La

A _Unicode locale identifier_ is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in _[Unicode BCP 47 U Extension](#u_Extension)_ and _[Unicode BCP 47 T Extension](#BCP47_T_Extension)_. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically _well-formed_ identifiers: they are not necessarily _valid_ identifiers. For additional validity criteria, see the links on the right.

- As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition: There cannot be more than one extension with the same singleton (-a-, …, -t-, -u-, …). Note that the private use extension (-x-) must come after all other extensions.

| | EBNF | Validity / Comments |
| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------- |
| <a name="unicode_locale_id" href="#unicode_locale_id">`unicode_locale_id`</a> | `= unicode_language_id`<br/>  `extensions*`<br/>  `pu_extensions? ;` |
@@ -358,8 +362,11 @@ As is often the case, the complete syntactic constraints are not easily captured
| `tkey` | `= alpha digit ;` |
| `tvalue` | `= (sep alphanum{3,8})+ ;` |

- > As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition:
- > The sequence of variant subtags in a tlang must not have any duplicates.
+ The following are additional well-formedness constraints:
+ 1. [ wfc: There cannot be more than one extension with the same singleton. For example, en-u-ca-buddhist-u-cf-standard is ill-formed. ]
+ 2. [ wfc: The same ukey or tkey cannot occur more than once within an extension. For example, en-u-ca-buddhist-ca-islamic is ill-formed. ]
+ 3. [ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
+ 4. [ wfc: The private use extension (-x-) must come after all other extensions. ]
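
The singleton and key constraints above lend themselves to a simple scan over the subtags. The following Python sketch is illustrative only and is not part of the commit. The function name `extension_errors` and the error strings are hypothetical, and the `ukey` pattern (`alphanum alpha`) is assumed from the same specification. The sketch stops at `-x-`, because everything after it is private use and is passed through. It does not check tlang variants.

```python
import re

UKEY = re.compile(r"[0-9a-z][a-z]")  # assumed: ukey = alphanum alpha
TKEY = re.compile(r"[a-z][0-9]")     # tkey = alpha digit


def extension_errors(locale_id: str) -> list:
    """Report duplicate singletons and duplicate ukey/tkey subtags."""
    subtags = re.split(r"[-_]", locale_id.lower())
    errors = []
    singletons = set()  # singletons seen so far
    current = None      # singleton of the extension being scanned
    keys = set()        # keys seen inside the current extension
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: the remaining subtags are opaque
            if subtag in singletons:
                errors.append(f"duplicate singleton '{subtag}'")
            singletons.add(subtag)
            current, keys = subtag, set()
        elif (current == "u" and UKEY.fullmatch(subtag)) or (
            current == "t" and TKEY.fullmatch(subtag)
        ):
            if subtag in keys:
                errors.append(f"duplicate key '{subtag}' in -{current}- extension")
            keys.add(subtag)
    return errors


print(extension_errors("en-u-ca-buddhist-u-cf-standard"))
# ["duplicate singleton 'u'"]
print(extension_errors("en-u-ca-buddhist-ca-islamic"))
# ["duplicate key 'ca' in -u- extension"]
```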

For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.

@@ -2983,12 +2990,14 @@ In such a case, that specification may specify a subset or superset of the synta
| `s` | <pre>= \[:Pattern_White_Space:\]\*</pre> | optional whitespace |
| `sRequired` | <pre>= \[:Pattern_White_Space:\]\+</pre> | required whitespace |

- Some constraints on UnicodeSet syntax are not captured by this EBNF.
- Notably:
- 1. Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)].
- 2. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
- 3. Ranges (**X**-**Y**) are only supported in the case that elements **X** and **Y** resolve to single code points. That is, **\[a-b\]** and **\[\{a\}-\{b\}\]** are supported, while **\[a-{bz}\]** and **\[\{ax\}-\{bz\}\]** are not, because single-codepoint-strings are equivalent to that code point.
- 4. If **\[\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
+ The following are additional well-formedness and validity constraints:
+ 1. [ wfc: Ranges (**X**-**Y**) are only well-formed in the case that elements **X** and **Y** resolve to single code points. That is, **\[a-b\]** and **\[\{a\}-\{b\}\]** are well-formed because single-codepoint-strings are equivalent to that code point, while **\[a-{bz}\]** and **\[\{ax\}-\{bz\}\]** are ill-formed. ]
+ 2. [ vc: Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. ]
+
+ Note also that:
+ 1. Escapes that use multiple code points are equivalent to their flattened representation, i.e., `\x{61 62}` is equivalent to `\x{61}\x{62}`. These can also occur in strings, so **\[\{\\x\{ 061 62 0063\}\}\]** is equivalent to **\[\{abc\}\]**.
+ 2. If **\[\]** starts with \[:, then it begins a prop, and must also terminate with :\]. Thus **\[:di:\]** is a valid property expression, **\[di:\]** is a 3 code-point set, and **\[:di\]** raises an error.
+ 3. Whitespace is significant when initiating/terminating a POSIX property expression, so **\[ :\]** is syntactically valid and equivalent to **\[\\:\]**.
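
The note on multi-code-point escapes can be illustrated with a short Python sketch. It is not part of the commit, and the names `flatten_escapes` and `decode_escapes` are hypothetical. It handles only the `\x{...}` form and assumes the backslash itself is not escaped by a preceding backslash.

```python
import re

# \x{...} holding one or more whitespace-separated hexadecimal code points
MULTI = re.compile(r"\\x\{([0-9A-Fa-f\s]+)\}")


def flatten_escapes(pattern: str) -> str:
    """Rewrite \\x{61 62} as \\x{61}\\x{62}."""
    def repl(match):
        return "".join(f"\\x{{{cp}}}" for cp in match.group(1).split())
    return MULTI.sub(repl, pattern)


def decode_escapes(pattern: str) -> str:
    """Replace each \\x{...} escape with the characters it denotes."""
    def repl(match):
        return "".join(chr(int(cp, 16)) for cp in match.group(1).split())
    return MULTI.sub(repl, pattern)


print(flatten_escapes(r"\x{61 62}"))             # \x{61}\x{62}
print(decode_escapes(r"[{\x{ 061 62 0063}}]"))   # [{abc}]
```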

The syntax characters are listed in the table below:
