From d331b753c1f52d72bd01b0d2896302edc57c0d22 Mon Sep 17 00:00:00 2001
From: Mark Davis
Date: Tue, 17 Oct 2023 08:45:48 +0200
Subject: [PATCH] CLDR-16617 Additional tweak to transforms spec (#3327)

---
 docs/ldml/tr35-general.md | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md
index 27c2e80e1fb..66e32fa592a 100644
--- a/docs/ldml/tr35-general.md
+++ b/docs/ldml/tr35-general.md
@@ -1966,12 +1966,43 @@ x → y | z ;
 z a → w ;
 ```
 
-First, "xa" is converted to "yza". Then the processing will continue from after the character "y", pick up the "za", and convert it. Had we not had the "|", the result would have been simply "yza". The '@' character can be used as filler character to place the revisiting point off the start or end of the string. Thus the following causes x to be replaced, and the cursor to be backed up by two characters.
+First, "xa" is converted to "yza". Then the processing will continue from after the character "y", pick up the "za", and convert it. Had we not had the "|", the result would have been simply "yza".
+
+The '@' character can be used as filler character to place the revisiting point off the start or end of the string — but only within the context. Consider the following rules, with the table afterwards showing how they work.
+
+```
+1. [a-z]{x > |@ab ;
+2. ab > J;
+3. ca > M;
+```
+The ⸠ indicates the virtual cursor:
+
+| Current text | Matching rule |
+| - | - |
+| ⸠cx | no match, cursor advances one code point |
+| c⸠x | matches rule 1, so the text is replaced and cursor backs up. |
+| ⸠cab | matches rule 3, so the text is replaced, with cursor at the end. |
+| Mb⸠ | cursor is at the end, so we are done. |
+
+Notice that rule 2 did not have a chance to trigger.
+
+There is a current restriction that @ cannot back up before the before_context or after the after_context.
+Consider the rules if rule 1 is adjusted to have no before_context.
 
 ```
-x → |@@y;
+1'. x > |@ab ;
+2. ab > J ;
+3. ca > M;
 ```
+In that case, the results are different.
 
+| Current text | Matching rule |
+| - | - |
+| ⸠cx | no match, cursor advances one code point |
+| c⸠x | matches rule 1, so the text is replaced and cursor backs up; but only to where |
+| c⸠ab | matches **rule 2**, so the text is replaced, with cursor at the end. |
+| cJ⸠ | cursor is at the end, so we are done. |
+
 #### Example
 
 The following shows how these features are combined together in the Transliterator "Any-Publishing". This transform converts the ASCII typewriter conventions into text more suitable for desktop publishing (in English). It turns straight quotation marks or UNIX style quotation marks into curly quotation marks, fixes multiple spaces, and converts double-hyphens into a dash.