Date: Thu, 5 Oct 2023 17:57:26 +0200
Subject: [PATCH 07/11] CLDR-16937 Minor clarification for dx (#3320)
---
docs/ldml/tr35.md | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 5b192781633..0946fa589ff 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -789,8 +789,9 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other
"dx" |
Dictionary break script exclusions |
unicode_script_subtag values |
- One or more items of type SCRIPT_CODE, which are valid unicode_script_subtag values.
- The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. |
+ One or more items of type SCRIPT_CODE (as usual, separated by hyphens), which are valid unicode_script_subtag values.
+ The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified.
+ If other codes are mistakenly included alongside Zyyy, they are ignored. |
A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji"> . The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml. |
"em" |
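
Note on the "dx" change above: a minimal sketch of how an implementation might read a "dx" value under the clarified wording. The function name `parse_dx_scripts` is hypothetical, and the sketch only checks the four-letter shape of each code; it does not check that a code is a valid unicode_script_subtag value.

```python
def parse_dx_scripts(value: str) -> list[str]:
    """Split a "dx" value into script codes and apply the Zyyy rule.

    Hypothetical helper: checks only the 4-letter shape of each code,
    not whether it is a valid unicode_script_subtag.
    """
    # SCRIPT_CODE items are separated by hyphens; normalize "hani" -> "Hani".
    codes = [part.capitalize() for part in value.split("-") if part]
    for code in codes:
        if len(code) != 4 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not a 4-letter script code: {code!r}")
    # Zyyy (Common) excludes all scripts; any other codes given with it are ignored.
    if "Zyyy" in codes:
        return ["Zyyy"]
    return codes
```

With this reading, a value such as `zyyy-hani` is treated the same as `zyyy` alone.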
From 609ed4215705670b3806ecb2a929e2d6c0fb1cf7 Mon Sep 17 00:00:00 2001
From: Mark Davis
Date: Thu, 5 Oct 2023 17:57:49 +0200
Subject: [PATCH 08/11] CLDR-16038 fix spec constraints using unit id component
(#3321)
---
docs/ldml/tr35-general.md | 49 +++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 18 deletions(-)
diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md
index 39b8e7d14fe..a0bc167cbee 100644
--- a/docs/ldml/tr35-general.md
+++ b/docs/ldml/tr35-general.md
@@ -913,16 +913,21 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
| long_unit_identifier
core_unit_identifier | := |
- product_unit ("-per-" product_unit)*
- | "per-" product_unit ("-per-" product_unit)*
+ | product_unit ("-" per "-" product_unit)*
+ | per "-" product_unit ("-" per "-" product_unit)*
- Examples:
- foot-per-second-per-second
- per-second
- Note: The normalized form will have only one "per"
- - Note: The token 'per' is the single value in <unitIdComponent type=”per”>
|
+per | := |
+ "per"
+
+ - Constraint: The token 'per' is the single value in <unitIdComponent type="per">
+ |
+
product_unit | := |
single_unit ("-" single_unit)* ("-" pu_single_unit)*
| pu_single_unit ("-" pu_single_unit)*
@@ -935,9 +940,9 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
- Examples: square-meter, or 100-square-meter
|
pu_single_unit | := |
- “xxx-” single_unit | “x-” single_unit
+ | "xxx-" single_unit | "x-" single_unit
- Example: xxx-square-knuts (a Harry Potter unit)
- - Note: “x-” is only for backwards compatibility
+ - Note: "x-" is only for backwards compatibility
- See Private-Use Units
|
@@ -954,18 +959,19 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
dimensionality_prefix | := |
"square-" | "cubic-" | "pow" ([2-9]|1[0-5]) "-"
+ - Constraint: must be value in: <unitIdComponent type="power">.
- Note: "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"
- - Note: These are values in <unitIdComponent type=”power”>
+ - Note: These are values in <unitIdComponent type="power">
|
simple_unit | := |
(prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*
| currency_unit
- | “em” | “g” | “us” | “hg” | "of"
+ | "em" | "g" | "us" | "hg" | "of"
- Examples: kilometer, meter, cup-metric, fluid-ounce, curr-chf, em
- - Note: Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, “g-force”).
- We will likely deprecate those and add conformant aliases in the future: the “hg” and “of” are already only in deprecated simple_units.
+ - Note: Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, "g-force").
+ We will likely deprecate those and add conformant aliases in the future: the "hg" and "of" are already only in deprecated simple_units.
|
prefixed_unit | |
@@ -984,16 +990,16 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
prefix_component | := |
[a-z]{3,∞}
- - Constraint: must be value in: <unitIdComponent type=”prefix_component”>.
|
+ - Constraint: must be value in: <unitIdComponent type="prefix">.
base_component | := |
[a-z]{3,∞}
- Constraint: must not be a value in any of the following:
- <unitIdComponent type=”prefix_component”>
- or <unitIdComponent type=”suffix_component”>
- or <unitIdComponent type=”power”>
- or <unitIdComponent type=”and”>
- or <unitIdComponent type=”per”>.
+ <unitIdComponent type="prefix">
+ or <unitIdComponent type="suffix">
+ or <unitIdComponent type="power">
+ or <unitIdComponent type="and">
+ or <unitIdComponent type="per">.
- Constraint: must not have a prefix as an initial segment.
- Constraint: no two different base_components will share the first 8 letters.
@@ -1004,12 +1010,19 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
suffix_component | := |
[a-z]{3,∞}
- - Constraint: must be value in: <unitIdComponent type=”suffix_component”>
|
+
+ - Constraint: must be value in: <unitIdComponent type="suffix">
+
|
mixed_unit_identifier | := |
- (single_unit | pu_single_unit) ("-and-" (single_unit | pu_single_unit ))*
+ | (single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
- Example: foot-and-inch
- - Note: The token 'and' is the single value in <unitIdComponent type=”and”>
+
|
+
+and | := |
+ "and"
+
+ - Constraint: The token 'and' is the single value in <unitIdComponent type="and">
|
long_unit_identifier | := |
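
Note on the grammar change above: because base_component must not be a value of the "per" or "and" unitIdComponent types, those tokens can act as unambiguous separators. The sketch below illustrates that idea only; the helper names are hypothetical, the token values are hard-coded rather than read from the unitIdComponents data, and no other constraint in the table is validated.

```python
PER = "per"  # assumed to be the single value in <unitIdComponent type="per">
AND = "and"  # assumed to be the single value in <unitIdComponent type="and">


def split_on_token(identifier: str, token: str) -> list[str]:
    """Split a hyphen-separated identifier wherever `token` appears as a whole segment."""
    groups: list[str] = []
    current: list[str] = []
    for part in identifier.split("-"):
        if part == token:
            groups.append("-".join(current))
            current = []
        else:
            current.append(part)
    groups.append("-".join(current))
    return groups


def split_core_unit(identifier: str) -> tuple[str, list[str]]:
    """Return (numerator product_unit, list of denominator product_units).

    The numerator is "" when the identifier starts with the per token.
    """
    numerator, *denominators = split_on_token(identifier, PER)
    return numerator, denominators


def split_mixed_unit(identifier: str) -> list[str]:
    """Return the single units of a mixed_unit_identifier."""
    return split_on_token(identifier, AND)
```

For the examples in the table, `foot-per-second-per-second` splits into the numerator `foot` and two `second` denominators, and `foot-and-inch` splits into `foot` and `inch`.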
From 2e03d3c30a3205729ced20294e3eee472b14b118 Mon Sep 17 00:00:00 2001
From: Mark Davis
Date: Thu, 5 Oct 2023 17:58:43 +0200
Subject: [PATCH 09/11] CLDR-16249 Describe EBNF syntax more clearly (#3322)
---
docs/ldml/tr35.md | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 0946fa589ff..943c86997c0 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -246,7 +246,13 @@ External specifications may also reference particular components of Unicode loca
> _Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes._
+### EBNF
+The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are:
+1. Bounded repetition following Perl regex syntax is allowed, such as alphanum{3,8}
+2. Constraints (well-formedness or validity) use separate notes
+
+In the text, this is sometimes referred to as "EBNF (Perl-based)".
## What is a Locale?
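
Note on the EBNF section above: bounded repetition such as alphanum{3,8} carries over directly to a Perl-style regular expression. A minimal sketch, assuming alphanum is the character class [0-9 A-Z a-z] given in the language identifier grammar; the name `is_alphanum_3_8` is hypothetical.

```python
import re

# alphanum = [0-9 A-Z a-z]; the spaces in the EBNF are not part of the class.
ALPHANUM = r"[0-9A-Za-z]"

# alphanum{3,8}: between 3 and 8 alphanum characters, inclusive.
ALPHANUM_3_8 = re.compile(rf"{ALPHANUM}{{3,8}}")


def is_alphanum_3_8(text: str) -> bool:
    """True if the whole string matches alphanum{3,8}."""
    return ALPHANUM_3_8.fullmatch(text) is not None
```

Constraints stated in separate notes are not expressed by the pattern and would need their own checks.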
From 873ab680b1736e80cc1b54dbf403ee70d6e7fdf1 Mon Sep 17 00:00:00 2001
From: Mark Davis
Date: Thu, 5 Oct 2023 17:59:04 +0200
Subject: [PATCH 10/11] CLDR-16251 Add constraint on duplicate variant tags
(#3323)
* CLDR-16251 Add constraint on duplicate variant tags
* CLDR-16251 Add note about tlang.
---
docs/ldml/tr35.md | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 943c86997c0..7c295ab99af 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -323,6 +323,9 @@ A _Unicode language identifier_ has the following structure (provided in EBNF (P
alphanum | = [0-9 A-Z a-z] ; |
+> As is often the case, the complete syntactic constraints are not easily captured by EBNF, so there is a further condition:
+> The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed).
+
The semantics of the various subtags is explained in _[Language Identifier Field Definitions](#Field_Definitions)_ ; there are also direct links from [`unicode_language_subtag`](#unicode_language_subtag) , etc. While theoretically the [`unicode_language_subtag`](#unicode_language_subtag) may have more than 3 letters through the IANA registration process, in practice that has not occurred. The [`unicode_language_subtag`](#unicode_language_subtag) "und" may be omitted when there is a [`unicode_script_subtag`](#unicode_script_subtag) ; for that reason [`unicode_language_subtag`](#unicode_language_subtag) values with 4 letters are not permitted. However, such [`unicode_language_id`](#unicode_language_id) values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier).
For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.
@@ -353,6 +356,9 @@ As is often the case, the complete syntactic constraints are not easily captured
| `tkey` | `= alpha digit ;` |
| `tvalue` | `= (sep alphanum{3,8})+ ;` |
+> As is often the case, the complete syntactic constraints are not easily captured by EBNF, so there is a further condition:
+> The sequence of variant subtags in a tlang must not have any duplicates.
+
For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.
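
Note on the duplicate-variant constraint above: since the grammar alone does not express it, an implementation needs a separate check after parsing. A minimal sketch, assuming the variant subtags have already been extracted into a list and that comparison is case-insensitive, as subtags are in BCP 47; the function name is hypothetical.

```python
def has_duplicate_variants(variants: list[str]) -> bool:
    """True if any variant subtag occurs more than once, ignoring case."""
    seen: set[str] = set()
    for variant in variants:
        key = variant.lower()
        if key in seen:
            return True
        seen.add(key)
    return False
```

For de-1996-fonipa-1996 the variant list is `["1996", "fonipa", "1996"]`, so the check reports a duplicate. The same check would apply to the variant subtags inside a tlang.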
From ba1c4f0cb14e669d6ffdc14bf48b6a18fabdff73 Mon Sep 17 00:00:00 2001
From: "Steven R. Loomis"
Date: Thu, 5 Oct 2023 12:50:30 -0500
Subject: [PATCH 11/11] CLDR-17145 BRS/kbd: update Modifications section about
the keyboard spec (#3325)
---
docs/ldml/tr35.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 7c295ab99af..729734cd6ec 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -4058,7 +4058,7 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
* Rewrote and clarified the material in [Unit Preferences Overrides](tr35-info.md#Unit_Preferences_Data)
* [Keyboards](tr35-keyboards.md#Contents)
- * Complete revision, description TBS
+ * Complete rewrite of the specification by the Keyboard Subcommittee. Available as a technical preview in CLDR version 44. See [Part 7: Status](tr35-keyboards.md#status).
* [Person Names](tr35-personNames.md#Contents)
* Added material in [API Implementaion](tr35-personNames.md#api-implementation) on recommended implementation API options.