From 61b74a36de8329daed152005133a699ae7f2012b Mon Sep 17 00:00:00 2001 From: "Steven R. Loomis" Date: Thu, 14 Sep 2023 08:33:40 +0100 Subject: [PATCH] CLDR-16825 kbd: drop \u1234 format escaping (#3228) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - for consistency, only \u{…} format escaping is supported. --- docs/ldml/tr35-keyboards.md | 106 +++++++++++++------------ keyboards/3.0/fr-t-k0-azerty.xml | 20 ++--- keyboards/test/fr-t-k0-azerty-test.xml | 10 +-- keyboards/test/ja-Latn-test.xml | 2 +- keyboards/test/pt-t-k0-abnt2-test.xml | 4 +- 5 files changed, 73 insertions(+), 69 deletions(-) diff --git a/docs/ldml/tr35-keyboards.md b/docs/ldml/tr35-keyboards.md index 2ce8ee865a7..2e42582e888 100644 --- a/docs/ldml/tr35-keyboards.md +++ b/docs/ldml/tr35-keyboards.md @@ -263,7 +263,11 @@ When explicitly specified, attribute values can contain escaped characters. This ### UnicodeSet Escaping -The _UnicodeSet_ notation is described in [UTS #35 section 5.3.3](tr35.md#Unicode_Sets) and allows for comprehensive character matching, including by character range, properties, names, or codepoints. Currently, the following attribute values allow _UnicodeSet_ notation: +The _UnicodeSet_ notation is described in [UTS #35 section 5.3.3](tr35.md#Unicode_Sets) and allows for comprehensive character matching, including by character range, properties, names, or codepoints. + +Note that the `\u1234` and `\x{C1}` format escaping is not supported, only the `\u{…}` format (using `bracketedHex`). + +Currently, the following attribute values allow _UnicodeSet_ notation: * `from` or `before` on the `` element * `from` or `before` on the `` element @@ -928,7 +932,7 @@ where a flick to the Northeast then South produces two code points. ```xml - + ``` @@ -1037,7 +1041,7 @@ For combining characters, U+25CC `◌` is used as a base. 
It is an error to use For example, a key which outputs a combining tilde (U+0303) can be represented as follows: ```xml - + ``` This way, a key which outputs a combining tilde (U+0303) will be represented as `◌̃` (a tilde on a dotted circle). @@ -1112,12 +1116,12 @@ This attribute may be escaped with `\u` notation, see [Escaping](#Escaping). ```xml - + - - + + @@ -1679,9 +1683,9 @@ _Attribute:_ `value` (required) ```xml - + - + ``` @@ -1776,7 +1780,7 @@ If the input context changes, such as if the cursor or mouse moves the insertion Ideally, markers are implemented entirely out-of-band from the normal text stream. However, implementations _may_ choose to map each marker to a [Unicode private-use character](https://www.unicode.org/glossary/#private_use_character) for use only within the implementation’s processing and temporary storage in the input context. -For example, the first marker encountered could be represented as U+E000, the second by U+E001 and so on. If a regex processing engine were used, then those PUA characters could be processed through the existing regex processing engine. `[^\uE000-\uE009]` could be used as an expression to match a character that is not a marker, and `[Ee]\u{E000}` could match `E` or `e` followed by the first marker. +For example, the first marker encountered could be represented as U+E000, the second by U+E001 and so on. If a regex processing engine were used, then those PUA characters could be processed through the existing regex processing engine. `[^\u{E000}-\u{E009}]` could be used as an expression to match a character that is not a marker, and `[Ee]\u{E000}` could match `E` or `e` followed by the first marker. Such implementations must take care to remove all such markers (see prior section) from the resultant text. As well, implementations must take care to avoid conflicts if applications themselves are using PUA characters, such as is often done with not-yet-encoded scripts or characters. 
@@ -1874,7 +1878,7 @@ _Attribute:_ `from` (required) - **Unicode codepoint escapes** - `\u1234 \u012A` + `\u{1234} \u{012A}` `\u{22} \u{012a} \u{1234A}` The hex escaping is case insensitive. The value may not match a surrogate or illegal character, nor a marker character. @@ -1886,13 +1890,13 @@ _Attribute:_ `from` (required) The value of these classes do not change with Unicode versions. - `\s` for example is exactly `[\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]` + `\s` for example is exactly `[\f\n\r\t\v\u{00a0}\u{1680}\u{2000}-\u{200a}\u{2028}\u{2029}\u{202f}\u{205f}\u{3000}\u{feff}]` `\\` and `\$` evaluate to `\` and `$`, respectively. - **Character classes** - `[abc]` `[^def]` `[a-z]` `[ॲऄ-आइ-ऋ]` `[\u093F-\u0944\u0962\u0963]` + `[abc]` `[^def]` `[a-z]` `[ॲऄ-आइ-ऋ]` `[\u{093F}-\u{0944}\u{0962}\u{0963}]` - supported - no Unicode properties such as `\p{…}` @@ -1995,7 +1999,7 @@ The following are additions to standard Regex syntax. Tooling may choose to suggest an expansion of properties, such as `\p{Mn}` to all non spacing marks for a certain Unicode version. As well, a set of variables could be constructed in an `import`-able file matching particularly useful Unicode properties. ```xml - + ``` - **Backreferences** @@ -2252,12 +2256,12 @@ Finally, the user might also type in the sequence with the tone _after_ the lowe We want all of these sequences to end up ordered as the first. To do this, we use the following rules: ```xml - - - - - - + + + + + + ``` The first reorder is the default ordering for the _sakot_ which allows for it to be placed anywhere in a sequence, but moves any non-consonants that may immediately follow it, back before it in the sequence. The next two rules give the orders for the top vowel component and tone marks respectively. The next three rules give the _sakot_ and _wa_ characters a primary order that places them before the _o_. 
Notice particularly the final reorder rule where the _sakot_+_wa_ is split by the tone mark. This rule is necessary in case someone types into the middle of previously normalized text. @@ -2293,21 +2297,21 @@ Consider this fragment from a shared reordering for the Myanmar script: - + - + - + - + - + - + ``` @@ -2317,17 +2321,17 @@ A particular Myanmar keyboard layout can have these `reorder` elements: - + - + - + ``` -The effect of this is that the _e-vowel_ will be identified as a prebase and will have an order of 30. Likewise a _medial-r_ will be identified as a prebase and will have an order of 20. Notice that a _shan-e-vowel_ (`\u1084`) will not be identified as a prebase (even if it should be!). The _kinzi_ is described in the layout since it moves something across a run boundary. By separating such movements (prebase or moving to in front of a base) from the shared ordering rules, the shared ordering rules become a self-contained combining order description that can be used in other keyboards or even in other contexts than keyboarding. +The effect of this is that the _e-vowel_ will be identified as a prebase and will have an order of 30. Likewise a _medial-r_ will be identified as a prebase and will have an order of 20. Notice that a _shan-e-vowel_ (`\u{1084}`) will not be identified as a prebase (even if it should be!). The _kinzi_ is described in the layout since it moves something across a run boundary. By separating such movements (prebase or moving to in front of a base) from the shared ordering rules, the shared ordering rules become a self-contained combining order description that can be used in other keyboards or even in other contexts than keyboarding. 
#### Example Post-reorder transforms @@ -2344,8 +2348,8 @@ First, a partial example from Khmer where split vowels are combined after reorde … - - + + ``` @@ -2360,7 +2364,7 @@ Another partial example allows a keyboard implementation to prevent people typin … - + ``` @@ -2403,7 +2407,7 @@ While this character is made up of three codepoints, the following rule causes a ```xml - + ``` @@ -2416,35 +2420,35 @@ A more complex example comes from a Burmese visually ordered keyboard: - + - + - + - + - + - + - + - + - + - + ``` @@ -2844,7 +2848,7 @@ Specifies the starting context. This text may be escaped with `\u` notation, see **Example** ```xml - + ``` @@ -2961,7 +2965,7 @@ This attribute specifies the expected resultant text in a document after process **Example** ```xml - + ``` @@ -2970,22 +2974,22 @@ This attribute specifies the expected resultant text in a document after process ```xml - + - + - + - + - + - + ``` diff --git a/keyboards/3.0/fr-t-k0-azerty.xml b/keyboards/3.0/fr-t-k0-azerty.xml index f91943f3b9b..b7162f6e01b 100644 --- a/keyboards/3.0/fr-t-k0-azerty.xml +++ b/keyboards/3.0/fr-t-k0-azerty.xml @@ -43,7 +43,7 @@ - + - - - + + + @@ -99,9 +99,9 @@ - - - + + + @@ -203,9 +203,9 @@ - - - + + + diff --git a/keyboards/test/fr-t-k0-azerty-test.xml b/keyboards/test/fr-t-k0-azerty-test.xml index 638fe3954fd..5c827d6b878 100644 --- a/keyboards/test/fr-t-k0-azerty-test.xml +++ b/keyboards/test/fr-t-k0-azerty-test.xml @@ -6,17 +6,17 @@ - + - + - + - + - + diff --git a/keyboards/test/ja-Latn-test.xml b/keyboards/test/ja-Latn-test.xml index 047795bad40..a1e3a185ebf 100644 --- a/keyboards/test/ja-Latn-test.xml +++ b/keyboards/test/ja-Latn-test.xml @@ -3,7 +3,7 @@ + chars="[a-z A-Z 0-9 !\u{0022}#$%\u{0026}\[\]\{\}=\-|¥~\^_\u{0020}\u{003c}>,./?`@\+\*]" type="simple" /> diff --git a/keyboards/test/pt-t-k0-abnt2-test.xml b/keyboards/test/pt-t-k0-abnt2-test.xml index 3457ab67fa7..4eb59db813b 100644 --- a/keyboards/test/pt-t-k0-abnt2-test.xml +++ 
b/keyboards/test/pt-t-k0-abnt2-test.xml @@ -3,7 +3,7 @@ @@ -35,7 +35,7 @@ - +
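
The mechanical part of this change, rewriting four-hex-digit `\u1234` escapes to the `\u{1234}` form, can be sketched roughly as below. This is a minimal illustration and not the tooling used for this commit: the names `to_braced_escapes` and `_LEGACY_ESCAPE` are hypothetical. It assumes only the `\uXXXX` form needs converting, leaves existing `\u{…}` escapes alone, skips a `\u` whose backslash is itself escaped, and does not try to merge surrogate pairs.

```python
import re

# Hypothetical helper, not part of CLDR tooling.
# Matches a legacy \uXXXX escape whose backslash is not itself escaped:
#   (?<!\\)      start at the beginning of a run of backslashes
#   ((?:\\\\)*)  any number of escaped-backslash pairs, kept as-is
#   \\u          the escape introducer
#   ([0-9A-Fa-f]{4})  exactly four hex digits
_LEGACY_ESCAPE = re.compile(r"(?<!\\)((?:\\\\)*)\\u([0-9A-Fa-f]{4})")


def to_braced_escapes(text: str) -> str:
    """Rewrite \\uXXXX escapes as \\u{XXXX}; \\u{...} escapes are left unchanged."""
    def repl(m: re.Match) -> str:
        # Keep any preceding escaped backslashes, then emit the braced form.
        return m.group(1) + "\\u{" + m.group(2) + "}"
    return _LEGACY_ESCAPE.sub(repl, text)


if __name__ == "__main__":
    print(to_braced_escapes(r"[^\uE000-\uE009]"))
    print(to_braced_escapes(r"[\u093F-\u0944\u0962\u0963]"))
```

Any automated rewrite of this kind would still need manual review, since the `\x{C1}` form also called out as unsupported in the spec text is not covered here.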