From 3493f87b560c56f9c3f1b054d1bcfb5bbfb8a730 Mon Sep 17 00:00:00 2001
From: Peter Edberg <42151464+pedberg-icu@users.noreply.github.com>
Date: Wed, 4 Oct 2023 13:41:20 -0700
Subject: [PATCH 01/11] CLDR-17148 Update spec date and mods section (#3315)

---
 docs/ldml/tr35.md | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index f74cafd7d0f..7f51d252c17 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -5,7 +5,7 @@
 |Version|44 (draft)|
 |-------|----------|
 |Editors|Mark Davis (markdavis@google.com) and other CLDR committee members|
-|Date|2023-06-13|
+|Date|2023-10-03|
 |This Version|https://www.unicode.org/reports/tr35/tr35-68/tr35.html|
 |Previous Version|https://www.unicode.org/reports/tr35/tr35-67/tr35.html|
 |Latest Version|https://www.unicode.org/reports/tr35/|
@@ -4024,10 +4024,36 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
 **Differences from LDML Version 43**
+* [Core](#Contents)
+    * In [Time Zone Identifiers](#Time_Zone_Identifiers), added information on the new `iana` attribute for stability; also see information on `iana` in the section [U Extension Data Files](#Unicode_Locale_Extension_Data_Files).
+    * [Likely Subtags](#Likely_Subtags): There is a fix to how macroregions are handled by adding likely subtags, such as with `und_419`.
+    * [Unicode Sets](#Unicode_Sets): New sections on the following, with additional clarifications:
+        * [UnicodeSet syntax](#unicodeset-syntax)
+        * [Backslash Escapes](#Backslash_Escapes)
+        * [Variables in UnicodeSets](#Variables_in_UnicodeSets)
+
+* [General](tr35-general.md#Contents)
+    * Added new section [Unit Identifier Uniqueness](tr35-general.md#Unit_Identifier_Uniqueness), and added a relevant constraint on base_component in the [Syntax](tr35-general.md#syntax) section.
+
+* [Dates](tr35-dates.md#Contents)
+    * New section [First Day Overrides](tr35-dates.md#first-day-overrides): Described the various locale ID elements that affect determination of the first day of the week (for week of year calculations), and the order in which they should be considered. Also noted in [Key/Type Definitions](#Key_Type_Definitions) which keys can affect determination of first day.
+
+* [Supplemental](tr35-info.md#Contents)
+    * In [Conversion Data](tr35-info.md#conversion-data), expanded the list of values for the convertUnit systems attribute.
+    * Added new section [Derived Unit System](tr35-info.md#derived-unit-system)
+    * Rewrote and clarified the material in [Unit Preferences Overrides](tr35-info.md#Unit_Preferences_Data)
+
+* [Keyboards](tr35-keyboards.md#Contents)
+    * Complete revision, description TBS
+
 * [Person Names](tr35-personNames.md#Contents)
     * Added material in [API Implementaion](tr35-personNames.md#api-implementation) on recommended implementation API options.
     * Describe new [parameterDefault Element](tr35-personNames.md#parameterdefault-element) element that specifies default formality and length.
     * Describe new [nativeSpaceReplacement Element](tr35-personNames.md#nativespacereplacement-element) that specifies how spaces should be handled when the name language is the same as the formatting language.
     * In [Modifiers](tr35-personNames.md#modifiers) added the modifiers retain, genitive and vocative.
+    * Added sections on [Grammatical Modifiers for Names](tr35-personNames.md#grammatical-modifiers-for-names) and [Future Modifiers](tr35-personNames.md#future-modifiers).
     * Fixed a problem in [Switch the formatting locale if necessary](tr35-personNames.md#switch-the-formatting-locale-if-necessary), where the full formatting locale wasn't being set correctly when the name object has a locale whose script is incompatibility with name script.
-* [Likely Subtags](#Likely_Subtags)
-    * There is a fix to how macroregions are handled by adding likely subtags, such as with und_419
+    * Rewrote the section on [Setting the spaceReplacement](tr35-personNames.md#setting-the-spacereplacement).
 **Differences from LDML Version 42**
From 1789e1f4289e82d2ca561d256a4d5ebf3fc295e1 Mon Sep 17 00:00:00 2001
From: Mark Davis
Date: Wed, 4 Oct 2023 22:42:48 +0200
Subject: [PATCH 02/11] CLDR-16775 Fixes to transform spec (#3316)

* CLDR-16775 Fixes to transform spec
* CLDR-16775 tweaks
* CLDR-16775 formatting tweaks
* CLDR-16775 more formatting tweaks
---
 docs/ldml/tr35-general.md | 65 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 63 insertions(+), 2 deletions(-)

diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md
index 67a963659d4..39b8e7d14fe 100644
--- a/docs/ldml/tr35-general.md
+++ b/docs/ldml/tr35-general.md
@@ -1834,7 +1834,8 @@ If the direction is `forward`, then an ID is composed from `target + "-" + sourc
 The `visibility` attribute indicates whether the IDs should be externally visible, or whether they are only used internally.
-In previous versions, the rules were expressed as fine-grained XML. That was discarded in CLDR version 29, in favor of a simpler format where the separate rules are simply terminated with ";".
+Note: In CLDR v28 and before, the rules were expressed as fine-grained XML.
+That was discarded in CLDR version 29, in favor of a simpler format where the separate rules are simply terminated with ";".
 The transform rules are similar to regular-expression substitutions, but adapted to the specific domain of text transformations. The rules and comments in this discussion will be intermixed, with # marking the comments.
 The simplest rule is a conversion rule, which replaces one string of characters with another. The conversion rule takes the following form:
@@ -1859,6 +1860,8 @@ All of the ASCII characters except numbers and letters are reserved for use in t
 '←' → arrow' 'sign ;
 ```
+Note: The characters `→`, `←`, `↔` are preferred, but can be represented by the ASCII characters `>`, `<`, and `<>`, respectively.
+
 Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around it to separate it from the surrounding text. The following is an example:
 ```
@@ -1940,7 +1943,9 @@ It will thus convert “-B A-B a-b” to “B AB a-b”.
 #### Revisiting
-If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text. Thus the | marks a "cursor" position. For example, if we have the following, then the string "xa" will convert to "w".
+If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text.
+Thus the | marks a "cursor" position.
+For example, if we have the following, then the string "xa" will convert to "yw".
 ```
 x → y | z ;
@@ -2108,6 +2113,12 @@ Conversion rules can be forward, backward, or double. The complete conversion ru
 > b | c ← e { f g } h ;
 > ```
+The `completed_result` | `result_to_revisit` is also known as the `resulting_text`. Either or both of the values can be empty. For example, the following removes any a, b, or c.
+
+```
+[a-c] → ;
+```
+
 #### Intermixing Transform Rules and Conversion Rules
 Transform rules and conversion rules may be freely intermixed. Inserting a transform rule into the middle of a set of conversion rules has an important side effect.
@@ -2230,6 +2241,56 @@ m → r ;
 Note how the irrelevant rules (the inverse filter rule and the rules containing ←) are omitted (ignored, actually) in the forward direction, and notice how things are reversed: the transform rules are inverted and happen in the opposite order, and the groups of conversion rules are also executed in the opposite relative order (although the rules within each group are executed in the same order).
+Because the order of rules matters, the following will not work as expected:
+```
+c → s;
+ch → kh;
+```
+The second rule can never execute, because it is "masked" by the first.
+To help prevent errors, implementations should try to alert readers when this occurs, eg:
+```
+Rule {c > s;} masks {ch > kh;}
+```
+
+### Transform Syntax Characters
+
+The following summarizes the syntax characters used in transforms.
+
+| Character(s) | Description | Example |
+| - | - | - |
+| ; | End of a conversion rule, variable definition, or transform rule invocation | a → b ; |
+| \:\: | Invoke a transform | :: Null ; |
+| (, ) | In a transform rule invocation, marks the backwards transform | :: Null (NFD); |
+| $ | Mark the start of a variable, when followed by an ASCII letter | $abc |
+| = | Used to define variables | $a = abc ; |
+| →, \> | Transform from left to right (only for forward conversion rules) | a → b ; |
+| ←, \< | Transform from right to left (only for backward conversion rules) | a ← b ; |
+| ↔, \<\> | Transform from left to right (for forward) and right to left (for backward) | a ↔ b ; |
+| { | Mark the boundary between before_context and the text_to_replace | a {b} c → B ; |
+| } | Mark the boundary between the text_to_replace and after_context | a {b} c → B ; |
+| ' | Escape one or more characters, until the next ' | '\<\>' → x ; |
+| " | Escape one or more characters, until the next " | "\<\>" → x ; |
+| \\ | Escape the next character | \\\<\\\> → x ; |
+| # | Comment (until the end of a line) | a → ; # remove a |
+| \| | In the resulting_text, moves the cursor | a → A \| b; |
+| @ | In the resulting_text, filler character used to move the cursor before the start or after the end of the result | a → Ab@\|; |
+| (, ) | In text_to_replace, a capturing group | ([a-b]) > &hex($1); |
+| $ | In replacement_text, when followed by 1..9, is replaced by the contents of a capture group | ([a-b]) > &hex($1); |
+| ^ | In a before_context, by itself, equivalent to [$] **(deprecated)** | ... |
+| ? | In a before_context, after_context, or text_to_replace, a possessive quantifier for zero or one | a?b → c ; |
+| + | In a before_context, after_context, or text_to_replace, a possessive quantifier for one or more | a+b → c ; |
+| * | In a before_context, after_context, or text_to_replace, a possessive quantifier for zero or more | a*b → c ; |
+| & | Invoke a function in the replacement_text | ([a-b]) > &hex($1); |
+| !, %, _, ~, -, ., / | Reserved for future syntax | ... |
+| SPACE | Ignored except when quoted | a b # same as ab |
+| \uXXXX | Hex notation: 4 Xs | \u0061 |
+| \x{XX...} | Hex notation: 1-6 Xs | \x{61} |
+| [, ] | Marks a UnicodeSet | [a-z] |
+| \p{...} | Marks a UnicodeSet formed from a property | \p{di} |
+| \P{...} | Marks a negative UnicodeSet formed from a property | \P{DI} |
+| $ | Within a UnicodeSet (not before ASCII letter), matches the start or end of the source text (but is not replaced) | [$] b → c |
+| Other | Many of these characters have special meanings inside a UnicodeSet | ... |
+
 ## List Patterns
 ```xml
From 076d4f53e88f74555411089e0953266111ce2fdc Mon Sep 17 00:00:00 2001
From: Manish Goregaokar
Date: Wed, 4 Oct 2023 14:38:30 -0700
Subject: [PATCH 03/11] CLDR-17137 Add --license-file to Ldml2json for overriding the license included in the binary (#3300)

---
 .../unicode/cldr/json/Ldml2JsonConverter.java | 29 ++++++++++++++++---
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java b/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java
index f630256a43d..3a3af824b93 100644
--- a/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java
+++ b/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java
@@ -15,6 +15,7 @@
 import com.ibm.icu.text.MessageFormat;
 import com.ibm.icu.util.NoUnit;
 import com.ibm.icu.util.ULocale;
+import java.io.BufferedReader;
 import java.io.File;
 import java.io.IOException;
 import java.io.PrintWriter;
@@ -28,6 +29,7 @@
 import java.util.Locale;
 import java.util.Map;
 import java.util.Map.Entry;
+import java.util.Optional;
 import java.util.Set;
 import java.util.TreeMap;
 import java.util.TreeSet;
@@ -237,7 +239,15 @@ private class AvailableLocales {
                             'M',
                             "(true|false)",
                             "true",
-                            "Whether to include the -modern tier");
+                            "Whether to include the -modern tier")
+                    // Primarily useful for non-Maven build systems where CldrUtility.LICENSE may
+                    // not be available as it is put in place by pom.xml
+                    .add(
+                            "license-file",
+                            null,
+                            ".*",
+                            "",
+                            "Override the license file included in the bundle");
     public static void main(String[] args) throws Exception {
         System.out.println(GEAR_ICON + " " + Ldml2JsonConverter.class.getName() + " options:");
@@ -288,7 +298,9 @@ static void processType(final String runType) throws Exception {
                         Boolean.parseBoolean(options.get("bcp47").getValue()),
                         Boolean.parseBoolean(options.get("bcp47-no-subtags").getValue()),
                        Boolean.parseBoolean(options.get("Modern").getValue()),
-                        Boolean.parseBoolean(options.get("Redundant").getValue()));
+                        Boolean.parseBoolean(options.get("Redundant").getValue()),
+                        Optional.ofNullable(options.get("license-file").getValue())
+                                .filter(s -> !s.isEmpty()));
         DraftStatus status = DraftStatus.valueOf(options.get("draftstatus").getValue());
         l2jc.processDirectory(runType, status);
@@ -331,6 +343,7 @@ public int compareTo(JSONSection other) {
     private final String pkgVersion;
     private final boolean strictBcp47;
     private final boolean writeModernPackage;
+    private final Optional<String> licenseFile;
     private final boolean skipBcp47LocalesWithSubtags;
     private LdmlConfigFileReader configFileReader;
@@ -348,7 +361,8 @@ public Ldml2JsonConverter(
             boolean strictBcp47,
             boolean skipBcp47LocalesWithSubtags,
             boolean writeModernPackage,
-            boolean includeRedundant) {
+            boolean includeRedundant,
+            Optional<String> licenseFile) {
         this.writeModernPackage = writeModernPackage;
         this.strictBcp47 = strictBcp47;
         this.skipBcp47LocalesWithSubtags = strictBcp47 && skipBcp47LocalesWithSubtags;
@@ -376,6 +390,7 @@ public Ldml2JsonConverter(
         this.sections = configFileReader.getSections();
         this.packages = new TreeSet<>();
         this.includeRedundant = includeRedundant;
+        this.licenseFile = licenseFile;
     }
     /**
@@ -1232,7 +1247,13 @@ public void writeReadme(String outputDir, String packageName) throws IOException
         }
         try (PrintWriter outf =
                 FileUtilities.openUTF8Writer(outputDir + "/" + packageName, "LICENSE"); ) {
-            FileCopier.copy(CldrUtility.getUTF8Data(CldrUtility.LICENSE), outf);
+            if (licenseFile.isPresent()) {
+                try (BufferedReader br = FileUtilities.openUTF8Reader("", licenseFile.get()); ) {
+                    FileCopier.copy(br, outf);
+                }
+            } else {
+                FileCopier.copy(CldrUtility.getUTF8Data("unicode-license.txt"), outf);
+            }
         }
     }
From c14c0712b4599b713aad456d855aec9d8429ec28 Mon Sep 17 00:00:00 2001
From: Tom Bishop
Date: Wed, 4 Oct 2023 23:36:57 -0400
Subject: [PATCH 04/11] CLDR-17128 GenerateProductionData should not
 add defaultNumberingSystem items to ar_001 (#3318)

-The method localeIsSpecial should return false for the default content locale ar_001
---
 .../main/java/org/unicode/cldr/tool/GenerateProductionData.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateProductionData.java b/tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateProductionData.java
index 9cf5504e4c3..7bfa3c578c9 100644
--- a/tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateProductionData.java
+++ b/tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateProductionData.java
@@ -575,7 +575,7 @@ private static boolean directoryIsSpecial(String directory) {
     }
     private static boolean localeIsSpecial(String localeId) {
-        return localeId.equals("ar") || localeId.startsWith("ar_");
+        return localeId.equals("ar") || (localeId.startsWith("ar_") && !"ar_001".equals(localeId));
     }
     private static final String[] SPECIAL_PATHS =
From bb2bdcd9b6cbfd21e9470923a98f6a596f2a2ef4 Mon Sep 17 00:00:00 2001
From: Peter Edberg <42151464+pedberg-icu@users.noreply.github.com>
Date: Wed, 4 Oct 2023 22:10:50 -0700
Subject: [PATCH 05/11] CLDR-16848 Note parse issues for week of year; update mods for Transforms (#3319)

---
 docs/ldml/tr35-dates.md | 4 +++-
 docs/ldml/tr35.md | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/ldml/tr35-dates.md b/docs/ldml/tr35-dates.md
index 99e04f38fb5..3288447af46 100644
--- a/docs/ldml/tr35-dates.md
+++ b/docs/ldml/tr35-dates.md
@@ -1144,7 +1144,9 @@ These values provide territory-specific information needed for week-of-year and
 …
 ```
-In order for a week to count as the first week of a new year for week-of-year calculations, it must include at least the number of days in the new year specified by the minDays value; otherwise the week will count as the last week of the previous year (and for week-of-month calculations, `minDays` also specifies the minimum number of days in the new month for a week to count as part of that month).
+In order for a week to count as the first week of a new year for week-of-year calculations, the week beginning with `firstDay` must include at least the number of days in the new year specified by the `minDays` value; otherwise the week will count as the last week of the previous year (and for week-of-month calculations, `minDays` also specifies the minimum number of days in the new month for a week to count as part of that month).
+
+> **Note:** For week-of-year calculations, Gregorian years may have 52 or 53 weeks. Changes in the value of `minDays` or `firstDay` can affect the year to which a date is assigned as well as the number of weeks in a given year; implementations that parse dates using week-of-year formats should be prepared to handle such cases. For example, when parsing a date in week 53 of a year for which current values of `minDays` and `firstDay` no longer result in a 53-week year, that date should be treated as in the first week of the following year.
 The day indicated by `firstDay` is the one that should be shown as the first day of the week in a calendar view. This is not necessarily the same as the first day after the weekend (or the first work day of the week), which should be determined from the weekend information. Currently, day-of-week numbering is based on `firstDay` (that is, day 1 is the day specified by `firstDay`), but in the future we may add a way to specify this separately. The `firstDay` value determined from the region can be overridden by the locale keyword "fw", see [Unicode First Day Identifier](tr35.md#UnicodeFirstDayIdentifier).
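The `minDays` / `firstDay` rule in the tr35-dates.md hunk above can be sketched in a few lines. This is only an illustrative sketch, not CLDR or ICU code: the function names are hypothetical, and `first_day` uses Python's `weekday()` numbering (0 = Monday, 6 = Sunday), which is an assumption of the example rather than CLDR's own encoding.

```python
from datetime import date, timedelta
from typing import Tuple


def first_week_start(year: int, first_day: int, min_days: int) -> date:
    """Return the date on which week 1 of `year` starts (hypothetical helper)."""
    jan1 = date(year, 1, 1)
    # Days of the week containing January 1 that fall in the previous year.
    days_before = (jan1.weekday() - first_day) % 7
    week_start = jan1 - timedelta(days=days_before)
    # That week is week 1 only if at least min_days of it lie in `year`.
    if 7 - days_before >= min_days:
        return week_start
    return week_start + timedelta(days=7)


def week_of_year(d: date, first_day: int, min_days: int) -> Tuple[int, int]:
    """Return (week-based year, week number) for the date d."""
    year = d.year
    if d >= first_week_start(year + 1, first_day, min_days):
        year += 1  # late-December days that already belong to next year's week 1
    elif d < first_week_start(year, first_day, min_days):
        year -= 1  # early-January days that still belong to last year's final week
    start = first_week_start(year, first_day, min_days)
    return year, (d - start).days // 7 + 1


def weeks_in_year(year: int, first_day: int, min_days: int) -> int:
    """Number of weeks (52 or 53) in the week-based year."""
    span = first_week_start(year + 1, first_day, min_days) - first_week_start(
        year, first_day, min_days
    )
    return span.days // 7
```

With Monday as first day and `min_days=4` (the ISO 8601 convention), 2021-01-01 falls in week 53 of 2020; with Sunday as first day and `min_days=1` the same date is in week 1 of 2021, and 2020 has only 52 weeks. That difference is the situation the added note asks parsers to handle.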
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 7f51d252c17..5b192781633 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -5,7 +5,7 @@
 |Version|44 (draft)|
 |-------|----------|
 |Editors|Mark Davis (markdavis@google.com) and other CLDR committee members|
-|Date|2023-10-03|
+|Date|2023-10-04|
 |This Version|https://www.unicode.org/reports/tr35/tr35-68/tr35.html|
 |Previous Version|https://www.unicode.org/reports/tr35/tr35-67/tr35.html|
 |Latest Version|https://www.unicode.org/reports/tr35/|
@@ -4034,6 +4034,7 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
 * [General](tr35-general.md#Contents)
     * Added new section [Unit Identifier Uniqueness](tr35-general.md#Unit_Identifier_Uniqueness), and added a relevant constraint on base_component in the [Syntax](tr35-general.md#syntax) section.
+    * Several clarifications were added in [Transform Rules Syntax](tr35-general.md#Transform_Rules_Syntax), and a new section [Transform Syntax Characters](tr35-general.md#transform-syntax-characters) was added with a table of the characters.
 * [Dates](tr35-dates.md#Contents)
     * New section [First Day Overrides](tr35-dates.md#first-day-overrides): Described the various locale ID elements that affect determination of the first day of the week (for week of year calculations), and the order in which they should be considered. Also noted in [Key/Type Definitions](#Key_Type_Definitions) which keys can affect determination of first day.
From 048d0635917280eb658e7f688bd99adfdca92e90 Mon Sep 17 00:00:00 2001
From: "Steven R.
Loomis" Date: Thu, 5 Oct 2023 10:29:31 -0500 Subject: [PATCH 06/11] CLDR-17092 kbd: drop fallback="omit" (#3309) --- docs/ldml/tr35-keyboards.md | 26 ++------------------------ keyboards/3.0/fr-t-k0-azerty.xml | 1 - keyboards/3.0/pcm.xml | 1 - keyboards/dtd/ldmlKeyboard3.dtd | 2 -- keyboards/dtd/ldmlKeyboard3.xsd | 8 -------- 5 files changed, 2 insertions(+), 36 deletions(-) diff --git a/docs/ldml/tr35-keyboards.md b/docs/ldml/tr35-keyboards.md index 6cd93351509..d2be0295a26 100644 --- a/docs/ldml/tr35-keyboards.md +++ b/docs/ldml/tr35-keyboards.md @@ -589,7 +589,7 @@ An element used to keep track of layout-specific settings by implementations. Th **Syntax** ```xml - + ``` > @@ -602,29 +602,7 @@ An element used to keep track of layout-specific settings by implementations. Th > > -_Attribute:_ `fallback="omit"` - -> The presence of this attribute means that when a modifier key combination goes unmatched, no output is produced. The default behavior (when this attribute is not present) is to fall back to the base map when the modifier key combination goes unmatched. - -If this attribute is present, it must have a value of `omit`. - -**Example** - -```xml - - … - - … - -``` - -Indicates that: - -1. When a modifier combination goes unmatched, do not output anything when a key is pressed. -2. If a transform is terminated, output the contents of the buffer. -3. During a transform, hide the contents of the buffer as the user is typing. - -_Attribute:_ `normalization` +_Attribute:_ `normalization="disabled"` > Normalization will not typically be the responsibility of the keyboard author, rather this will be managed by the implementation. > The implementation will apply normalization as appropriate when matching transform rules and `` value matching. 
diff --git a/keyboards/3.0/fr-t-k0-azerty.xml b/keyboards/3.0/fr-t-k0-azerty.xml index af1c042dbda..106a3970d2c 100644 --- a/keyboards/3.0/fr-t-k0-azerty.xml +++ b/keyboards/3.0/fr-t-k0-azerty.xml @@ -21,7 +21,6 @@ - diff --git a/keyboards/3.0/pcm.xml b/keyboards/3.0/pcm.xml index 5ef43e5d682..15242ed8121 100644 --- a/keyboards/3.0/pcm.xml +++ b/keyboards/3.0/pcm.xml @@ -6,7 +6,6 @@ - diff --git a/keyboards/dtd/ldmlKeyboard3.dtd b/keyboards/dtd/ldmlKeyboard3.dtd index 2cfe1513781..044112a4099 100644 --- a/keyboards/dtd/ldmlKeyboard3.dtd +++ b/keyboards/dtd/ldmlKeyboard3.dtd @@ -71,8 +71,6 @@ Please view the subcommittee page for the most recent information. - - diff --git a/keyboards/dtd/ldmlKeyboard3.xsd b/keyboards/dtd/ldmlKeyboard3.xsd index 1a772458658..0009638d985 100644 --- a/keyboards/dtd/ldmlKeyboard3.xsd +++ b/keyboards/dtd/ldmlKeyboard3.xsd @@ -132,13 +132,6 @@ Note: DTD @-annotations are not currently converted to .xsd. For full CLDR file - - - - - - - @@ -151,7 +144,6 @@ Note: DTD @-annotations are not currently converted to .xsd. For full CLDR file - From ce93a83bff950adceccd3238e5efc3f0f2681a0a Mon Sep 17 00:00:00 2001 From: Mark Davis Date: Thu, 5 Oct 2023 17:57:26 +0200 Subject: [PATCH 07/11] CLDR-16937 Minor clarification for dx (#3320) --- docs/ldml/tr35.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 5b192781633..0946fa589ff 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -789,8 +789,9 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other "dx" Dictionary break script exclusions unicode_script_subtag values -

One or more items of type SCRIPT_CODE, which are valid unicode_script_subtag values.

-

The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified.

+

One or more items of type SCRIPT_CODE (as usual, separated by hyphens), which are valid unicode_script_subtag values.

+

The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. + If others are included mistakenly, they are ignored.

A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji">. The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml. "em" From 609ed4215705670b3806ecb2a929e2d6c0fb1cf7 Mon Sep 17 00:00:00 2001 From: Mark Davis Date: Thu, 5 Oct 2023 17:57:49 +0200 Subject: [PATCH 08/11] CLDR-16038 fix spec constraints using unit id component (#3321) --- docs/ldml/tr35-general.md | 49 +++++++++++++++++++++++++-------------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md index 39b8e7d14fe..a0bc167cbee 100644 --- a/docs/ldml/tr35-general.md +++ b/docs/ldml/tr35-general.md @@ -913,16 +913,21 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver | long_unit_identifier core_unit_identifier:= - product_unit ("-per-" product_unit)*
- | "per-" product_unit ("-per-" product_unit)* + product_unit ("-" per "-" product_unit)*
+ | per "-" product_unit ("-" per "-" product_unit)*
  • Examples:
    • foot-per-second-per-second
    • per-second
  • Note: The normalized form will have only one "per"
  • -
  • Note: The token 'per' is the single value in <unitIdComponent type=”per”>
+per:= + "per" +
    +
  • Constraint: The token 'per' is the single value in <unitIdComponent type="per">
  • +
+ product_unit:= single_unit ("-" single_unit)* ("-" pu_single_unit)*
| pu_single_unit ("-" pu_single_unit)* @@ -935,9 +940,9 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver
  • Examples: square-meter, or 100-square-meter
pu_single_unit:= - “xxx-” single_unit | “x-” single_unit + "xxx-" single_unit | "x-" single_unit
  • Example: xxx-square-knuts (a Harry Potter unit)
  • -
  • Note: “x-” is only for backwards compatibility
  • +
  • Note: "x-" is only for backwards compatibility
  • See Private-Use Units
@@ -954,18 +959,19 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver dimensionality_prefix:= "square-"

| "cubic-"

| "pow" ([2-9]|1[0-5]) "-"

    +
  • Constraint: must be value in: <unitIdComponent type="power">.
  • Note: "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"
  • -
  • Note: These are values in <unitIdComponent type=”power”>
  • +
  • Note: These are values in <unitIdComponent type="power">
simple_unit:= (prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*
| currency_unit
- | “em” | “g” | “us” | “hg” | "of" + | "em" | "g" | "us" | "hg" | "of"
  • Examples: kilometer, meter, cup-metric, fluid-ounce, curr-chf, em
  • -
  • Note: Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, “g-force”). - We will likely deprecate those and add conformant aliases in the future: the “hg” and “of” are already only in deprecated simple_units.
  • +
  • Note: Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, "g-force"). + We will likely deprecate those and add conformant aliases in the future: the "hg" and "of" are already only in deprecated simple_units.
prefixed_unit @@ -984,16 +990,16 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver prefix_component:= [a-z]{3,∞} -
  • Constraint: must be value in: <unitIdComponent type=”prefix_component”>.
+
  • Constraint: must be value in: <unitIdComponent type="prefix">.
base_component:= [a-z]{3,∞}
  • Constraint: must not be a value in any of the following:
    - <unitIdComponent type=”prefix_component”>
    - or <unitIdComponent type=”suffix_component”>
    - or <unitIdComponent type=”power”>
    - or <unitIdComponent type=”and”>
    - or <unitIdComponent type=”per”>. + <unitIdComponent type="prefix">
    + or <unitIdComponent type="suffix">
    + or <unitIdComponent type="power">
    + or <unitIdComponent type="and">
    + or <unitIdComponent type="per">.
  • Constraint: must not have a prefix as an initial segment.
  • Constraint: no two different base_components will share the first 8 letters. @@ -1004,12 +1010,19 @@ Some of the constraints reference data from the unitIdComponents in [Unit_Conver suffix_component:= [a-z]{3,∞} -
    • Constraint: must be value in: <unitIdComponent type=”suffix_component”>
    +
      +
    • Constraint: must be value in: <unitIdComponent type="suffix">
    • +
    mixed_unit_identifier:= - (single_unit | pu_single_unit) ("-and-" (single_unit | pu_single_unit ))* + (single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
    • Example: foot-and-inch
    • -
    • Note: The token 'and' is the single value in <unitIdComponent type=”and”>
    • +
    + +and:= + "and" +
      +
    • Constraint: The token 'and' is the single value in <unitIdComponent type="and">
    long_unit_identifier:= From 2e03d3c30a3205729ced20294e3eee472b14b118 Mon Sep 17 00:00:00 2001 From: Mark Davis Date: Thu, 5 Oct 2023 17:58:43 +0200 Subject: [PATCH 09/11] CLDR-16249 Describe EBNF syntax more clearly (#3322) --- docs/ldml/tr35.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 0946fa589ff..943c86997c0 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -246,7 +246,13 @@ External specifications may also reference particular components of Unicode loca > _Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes._ +### EBNF +The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are: +1. Bounded repetition following Perl regex syntax is allowed, such as alphanum{3,8} +2. Constraints (well-formedness or validity) use separate notes + +In the text, this is sometimes referred to as "EBNF (Perl-based)". ## What is a Locale? From 873ab680b1736e80cc1b54dbf403ee70d6e7fdf1 Mon Sep 17 00:00:00 2001 From: Mark Davis Date: Thu, 5 Oct 2023 17:59:04 +0200 Subject: [PATCH 10/11] CLDR-16251 Add constraint on duplicate variant tags (#3323) * CLDR-16251 Add constraint on duplicate variant tags * CLDR-16251 Add note about tlang. --- docs/ldml/tr35.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 943c86997c0..7c295ab99af 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -323,6 +323,9 @@ A _Unicode language identifier_ has the following structure (provided in EBNF (P alphanum
    = [0-9 A-Z a-z] ;
    +> As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition: +> The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed). + The semantics of the various subtags is explained in _[Language Identifier Field Definitions](#Field_Definitions)_ ; there are also direct links from [`unicode_language_subtag`](#unicode_language_subtag) , etc. While theoretically the [`unicode_language_subtag`](#unicode_language_subtag) may have more than 3 letters through the IANA registration process, in practice that has not occurred. The [`unicode_language_subtag`](#unicode_language_subtag) "und" may be omitted when there is a [`unicode_script_subtag`](#unicode_script_subtag) ; for that reason [`unicode_language_subtag`](#unicode_language_subtag) values with 4 letters are not permitted. However, such [`unicode_language_id`](#unicode_language_id) values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier). For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers. @@ -353,6 +356,9 @@ As is often the case, the complete syntactic constraints are not easily captured | `tkey` | `= alpha digit ;` | | `tvalue` | `= (sep alphanum{3,8})+ ;` | +> As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition: +> The sequence of variant subtags in a tlang must not have any duplicates. + For historical reasons, this is called a Unicode locale identifier. 
However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
 As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.
From ba1c4f0cb14e669d6ffdc14bf48b6a18fabdff73 Mon Sep 17 00:00:00 2001
From: "Steven R. Loomis"
Date: Thu, 5 Oct 2023 12:50:30 -0500
Subject: [PATCH 11/11] CLDR-17145 BRS/kbd: update Modifications section about the keyboard spec (#3325)

---
 docs/ldml/tr35.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
index 7c295ab99af..729734cd6ec 100644
--- a/docs/ldml/tr35.md
+++ b/docs/ldml/tr35.md
@@ -4058,7 +4058,7 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
     * Rewrote and clarified the material in [Unit Preferences Overrides](tr35-info.md#Unit_Preferences_Data)
 * [Keyboards](tr35-keyboards.md#Contents)
-    * Complete revision, description TBS
+    * Complete rewrite of the specification by the Keyboard Subcommittee. Available as a technical preview in CLDR version 44. See [Part 7: Status](tr35-keyboards.md#status).
 * [Person Names](tr35-personNames.md#Contents)
     * Added material in [API Implementaion](tr35-personNames.md#api-implementation) on recommended implementation API options.
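The duplicate-variant constraint added in patch 10/11 (for example, `de-1996-fonipa-1996` is not syntactically well-formed) lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the input is a bare `unicode_language_id` with no `-u-` or `-t-` extensions, and the variant shape (`alphanum{5,8}` or `digit alphanum{3}`) is taken from the wider LDML grammar rather than from this patch series.

```python
import re

# Shape of a unicode_variant_subtag: alphanum{5,8} or digit alphanum{3}.
# Assumption: this shape comes from the LDML grammar, not from the patches above.
_VARIANT = re.compile(r"[0-9a-z]{5,8}|[0-9][0-9a-z]{3}")


def variant_subtags(language_id: str) -> list:
    """Return the variant subtags of a bare language identifier, lowercased."""
    subtags = re.split(r"[-_]", language_id.lower())
    # Skip the first subtag: a 5-8 letter language subtag has the same shape as a variant.
    return [s for s in subtags[1:] if _VARIANT.fullmatch(s)]


def has_duplicate_variants(language_id: str) -> bool:
    """True if any variant subtag occurs more than once (case-insensitively)."""
    variants = variant_subtags(language_id)
    return len(variants) != len(set(variants))
```

Comparison is done on lowercased subtags, so `rozaj` and `ROZAJ` count as the same variant. A full validator would also need to handle extensions and the `tlang` inside a `-t-` extension, which this sketch deliberately leaves out.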