CLDR-16775 Fixes to transform spec (#3316)
* CLDR-16775 Fixes to transform spec

* CLDR-16775 tweaks

* CLDR-16775 formatting tweaks

* CLDR-16775 more formatting tweaks
macchiati authored Oct 4, 2023
1 parent 3493f87 commit 1789e1f
Showing 1 changed file with 63 additions and 2 deletions.
65 changes: 63 additions & 2 deletions docs/ldml/tr35-general.md
@@ -1834,7 +1834,8 @@ If the direction is `forward`, then an ID is composed from `target + "-" + sourc

The `visibility` attribute indicates whether the IDs should be externally visible, or whether they are only used internally.

In previous versions, the rules were expressed as fine-grained XML. That was discarded in CLDR version 29, in favor of a simpler format where the separate rules are simply terminated with ";".
Note: In CLDR v28 and before, the rules were expressed as fine-grained XML.
That was discarded in CLDR version 29, in favor of a simpler format where the separate rules are simply terminated with ";".

The transform rules are similar to regular-expression substitutions, but adapted to the specific domain of text transformations. The rules and comments in this discussion will be intermixed, with # marking the comments. The simplest rule is a conversion rule, which replaces one string of characters with another. The conversion rule takes the following form:

@@ -1859,6 +1860,8 @@ All of the ASCII characters except numbers and letters are reserved for use in t
'←' → arrow' 'sign ;
```

Note: The characters `→`, `←`, `↔` are preferred, but can be represented by the ASCII characters `>`, `<`, and `<>`, respectively.

Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around it to separate it from the surrounding text. The following is an example:

```
@@ -1940,7 +1943,9 @@ It will thus convert “-B A-B a-b” to “B AB a-b”.

#### <a name="Revisiting" href="#Revisiting">Revisiting</a>

If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text. Thus the | marks a "cursor" position. For example, if we have the following, then the string "xa" will convert to "w".
If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text.
Thus the | marks a "cursor" position.
For example, if we have the following, then the string "xa" will convert to "yw".

```
x → y | z ;
@@ -2108,6 +2113,12 @@ Conversion rules can be forward, backward, or double. The complete conversion ru
> b | c ← e { f g } h ;
> ```
The `completed_result` | `result_to_revisit` is also known as the `resulting_text`. Either or both of the values can be empty. For example, the following removes any a, b, or c.
```
[a-c] → ;
```
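As a rough illustration of how `completed_result` | `result_to_revisit` behaves, the following Python sketch applies literal forward rules only. The function name `apply_rules` and the tuple layout are assumptions made for this example, and the rule `z a → w` in the usage lines is a hypothetical companion rule, not one quoted from the spec text in this diff. Contexts, UnicodeSets, variables, and quantifiers are not handled.

```python
def apply_rules(text, rules):
    """Apply literal forward rules to text.

    Each rule is (source, completed_result, result_to_revisit).
    Assumes every source is non-empty. This is a simplified sketch,
    not the behavior of any real transform engine.
    """
    pos = 0
    while pos < len(text):
        for source, completed, revisit in rules:
            if text.startswith(source, pos):
                # Replace the matched source with the full resulting_text.
                text = text[:pos] + completed + revisit + text[pos + len(source):]
                # The cursor lands after completed_result, so the
                # result_to_revisit part is processed again.
                pos += len(completed)
                break
        else:
            # No rule matched here: leave the character unchanged.
            pos += 1
    return text


# x → y | z ;  followed by the hypothetical rule  z a → w ;
print(apply_rules("xa", [("x", "y", "z"), ("za", "w", "")]))  # yw

# [a-c] → ;  written out as three literal rules with empty results
print(apply_rules("abcd", [("a", "", ""), ("b", "", ""), ("c", "", "")]))  # d
```

Both parts of the result may be empty, which is why the removal rule needs no special handling in this sketch.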
#### <a name="Intermixing_Transform_Rules_and_Conversion_Rules" href="#Intermixing_Transform_Rules_and_Conversion_Rules">Intermixing Transform Rules and Conversion Rules</a>
Transform rules and conversion rules may be freely intermixed. Inserting a transform rule into the middle of a set of conversion rules has an important side effect.
@@ -2230,6 +2241,56 @@ m → r ;
Note how the irrelevant rules (the inverse filter rule and the rules containing ←) are omitted (ignored, actually) in the forward direction, and notice how things are reversed: the transform rules are inverted and happen in the opposite order, and the groups of conversion rules are also executed in the opposite relative order (although the rules within each group are executed in the same order).
Because the order of rules matters, the following will not work as expected:
```
c → s;
ch → kh;
```
The second rule can never execute, because it is "masked" by the first.
To help prevent errors, implementations should try to alert readers when this occurs, for example:
```
Rule {c > s;} masks {ch > kh;}
```
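A minimal sketch of such a check, assuming rules are reduced to literal `(source, result)` pairs with no contexts, sets, or quantifiers: an earlier rule masks a later one when its source is a prefix of the later source. The function name `find_masked_rules` and the message format are illustrative assumptions; a real implementation would need a more general notion of one pattern covering another.

```python
def find_masked_rules(rules):
    """Return (earlier, later) rule pairs where the earlier rule masks the later.

    rules is an ordered list of (source, result) literal pairs.
    Prefix comparison is a simplification that only works for literals.
    """
    masked = []
    for i, earlier in enumerate(rules):
        for later in rules[i + 1:]:
            # The earlier rule always matches first wherever the later could.
            if later[0].startswith(earlier[0]):
                masked.append((earlier, later))
    return masked


rules = [("c", "s"), ("ch", "kh")]
for (src1, res1), (src2, res2) in find_masked_rules(rules):
    print(f"Rule {{{src1} > {res1};}} masks {{{src2} > {res2};}}")
```

Reordering the rules so that `ch → kh` comes before `c → s` makes the check return an empty list.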
### Transform Syntax Characters
The following summarizes the syntax characters used in transforms.
| Character(s) | Description | Example |
| - | - | - |
| ; | End of a conversion rule, variable definition, or transform rule invocation | a → b ; |
| \:\: | Invoke a transform | :: Null ; |
| (, ) | In a transform rule invocation, marks the backwards transform | :: Null (NFD); |
| $ | Mark the start of a variable, when followed by an ASCII letter | $abc |
| = | Used to define variables | $a = abc ; |
| →, \> | Transform from left to right (only for forward conversion rules) | a → b ; |
| ←, \< | Transform from right to left (only for backward conversion rules) | a ← b ; |
| ↔, \<\> | Transform from left to right (for forward) and right to left (for backward) | a ↔ b ; |
| { | Mark the boundary between before_context and the text_to_replace | a {b} c → B ; |
| } | Mark the boundary between the text_to_replace and after_context | a {b} c → B ; |
| ' | Escape one or more characters, until the next ' | '\<\>' → x ; |
| " | Escape one or more characters, until the next " | "\<\>" → x ; |
| \\ | Escape the next character | \\\<\\\> → x ; |
| # | Comment (until the end of a line) | a → ; # remove a |
| \| | In the resulting_text, moves the cursor | a → A \| b; |
| @ | In the resulting_text, filler character used to move the cursor before the start or after the end of the result | a → Ab@\|; |
| (, ) | In text_to_replace, a capturing group | ([a-b]) > &hex($1); |
| $ | In replacement_text, when followed by 1..9, is replaced by the contents of a capture group | ([a-b]) > &hex($1); |
| ^ | In a before_context, by itself, equivalent to [$] **(deprecated)** | ... |
| ? | In a before_context, after_context, or text_to_replace, a possessive quantifier for zero or one | a?b → c ; |
| + | In a before_context, after_context, or text_to_replace, a possessive quantifier for one or more | a+b → c ; |
| * | In a before_context, after_context, or text_to_replace, a possessive quantifier for zero or more | a*b → c ; |
| & | Invoke a function in the replacement_text | ([a-b]) > &hex($1); |
| !, %, _, ~, -, ., / | Reserved for future syntax | ... |
| SPACE | Ignored except when quoted | a b # same as ab |
| \uXXXX | Hex notation: 4 Xs | \u0061 |
| \x{XX...} | Hex notation: 1-6 Xs | \x{61} |
| [, ] | Marks a UnicodeSet | [a-z] |
| \p{...} | Marks a UnicodeSet formed from a property | \p{di} |
| \P{...} | Marks a negative UnicodeSet formed from a property | \P{DI} |
| $ | Within a UnicodeSet (not before ASCII letter), matches the start or end of the source text (but is not replaced) | [$] b → c |
| Other | Many of these characters have special meanings inside a UnicodeSet | ... |
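The two hex notations in the table can be illustrated with a short Python sketch. The helper name `decode_hex_escapes` is an assumption for this example, and the sketch deliberately ignores quoting and backslash-escaped backslashes, which a real rule parser would have to respect.

```python
import re

# \uXXXX takes exactly 4 hex digits; \x{...} takes 1 to 6 hex digits.
_HEX_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x\{([0-9A-Fa-f]{1,6})\}")


def decode_hex_escapes(text):
    """Replace both hex notations with the characters they denote.

    Simplified sketch: values above U+10FFFF would raise ValueError
    in chr(), and quoted text is not treated specially.
    """
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))

    return _HEX_ESCAPE.sub(replace, text)


print(decode_hex_escapes(r"\u0061 \x{61}"))  # a a
```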
## <a name="ListPatterns" href="#ListPatterns">List Patterns</a>
```xml