CLDR-17192 kbd: update on Normalization: segmentation discussion
- added rationale for gluing
- further explanation
srl295 committed Jan 24, 2024
1 parent 61286bb commit ff30d78
Showing 1 changed file with 13 additions and 2 deletions.
15 changes: 13 additions & 2 deletions docs/ldml/tr35-keyboards.md
@@ -330,7 +330,8 @@ Regardless of the normalization form in the keyboard source file or in the edit

### Normalization and Markers

A special issue occurs when markers are involved. The markers may be reordered along with characters.
A special issue occurs when markers are involved. Markers are not text, and so are not themselves affected by the normalization algorithm.
However, the markers must be reordered by the implementation along with their surrounding characters.

**Example 1**

@@ -344,7 +345,11 @@ If we add markers:
- `e\u{0300}\m{marker}\u{0320}` (original)
- `e\m{marker}\u{0320}\u{0300}` (NFD)

The principle is that the marker stays 'glued' to the following character, in this case the `\u{0320}`. If a marker occurs at the end of input or at the end of a normalization-safe segment, the marker is 'glued' to the end of that segment during the course of normalization for the current keystroke. As another example:
During a processing step, such as appending a keystroke or one stage of applying a transform, the marker is 'glued' to the _following_ character. In the above example, `\m{marker}` was 'glued' to the `\u{0320}`. If a marker occurs at the end of input or at the end of a normalization-safe segment, the marker is 'glued' to the end of that segment during that normalization step.

The 'gluing' is only applicable during one particular processing step. It does not persist or affect further processing steps or future keystrokes.
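
The gluing behavior can be illustrated with a minimal Python sketch. It is not taken from the specification or from any implementation: the `Marker` class, the `nfd_with_markers` function, and the token-list representation are all hypothetical, and the sketch treats the whole input as a single processing step rather than splitting it into normalization-safe segments. Each marker is attached to the code point that follows it, the code points are put into canonical (NFD) order, and the markers are then re-emitted immediately before the code point they were attached to. Markers with nothing after them stay at the end.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Hypothetical stand-in for a \\m{...} marker; not part of any real API."""
    name: str


def nfd_with_markers(tokens):
    """Return tokens in NFD order, keeping each marker glued to the following code point."""
    units = []      # (markers glued in front, code point)
    pending = []    # markers seen since the last code point
    for tok in tokens:
        if isinstance(tok, Marker):
            pending.append(tok)
            continue
        # Fully decompose the character; pending markers glue to the first
        # code point of the decomposition.
        for ch in unicodedata.normalize("NFD", tok):
            units.append((tuple(pending), ch))
            pending = []
    trailing = pending  # markers at the end of input stay at the end

    ordered = []
    run = []  # current run of code points with a non-zero combining class

    def flush():
        # Canonical ordering: stable sort of the run by combining class.
        run.sort(key=lambda unit: unicodedata.combining(unit[1]))
        ordered.extend(run)
        run.clear()

    for unit in units:
        if unicodedata.combining(unit[1]) == 0:
            flush()
            ordered.append(unit)
        else:
            run.append(unit)
    flush()

    result = []
    for markers, ch in ordered:
        result.extend(markers)
        result.append(ch)
    result.extend(trailing)
    return result


marker = Marker("marker")
example_1 = nfd_with_markers(["e", "\u0300", marker, "\u0320"])
```

With the input from Example 1, `example_1` holds `e`, the marker, `\u{0320}`, `\u{0300}`, in that order. Because the function builds the glue afresh on every call, nothing carries over to a later processing step or keystroke.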

A second example:

**Example 2**

@@ -367,6 +372,12 @@ There are two normalization-safe segments here:

Normalization (and marker rearranging) occurs within each segment. While `\m{marker1}` is 'glued' to the `\u{0320}`, it is glued within the first segment and has no effect on the second segment.

#### Rationale for 'gluing' markers

It is recognized that the processing described here seems to be an innovation among Unicode normalization implementations.

This specification has markers 'glued' to (that is, remaining with) the following character, so that if a context ends with a marker, that marker is guaranteed to remain at the end after processing. Authors can keep a marker together with a character of interest by emitting the marker just before that character, that is, `output="\m{marker}X"` instead of `output="X\m{marker}"`.

### Normalization and Character Classes

If pre-composed (non-NFD) characters are used in [character classes](#regex-like-syntax), such as `[á-é]` or `[\u{00e1}-\u{00e9}]`, these may not match as keyboard authors expect, as the U+00E1 character will not occur in NFD form.
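
The mismatch can be seen in a short Python sketch. Python's `re` module is used here only as a stand-in for a keyboard implementation's matching engine, which is an assumption; the variable names are illustrative. Once U+00E1 is normalized to NFD it becomes U+0061 followed by U+0301, and neither code point falls inside the range U+00E1 to U+00E9.

```python
import re
import unicodedata

precomposed = "\u00e1"  # á as a single pre-composed code point
decomposed = unicodedata.normalize("NFD", precomposed)

# The NFD form is two code points: U+0061 (a) and U+0301 (combining acute).
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

# A class written with pre-composed characters matches the original text...
char_class = re.compile("[\u00e1-\u00e9]")
matches_precomposed = char_class.search(precomposed) is not None

# ...but finds nothing in the NFD text, since U+00E1 no longer occurs there.
matches_decomposed = char_class.search(decomposed) is not None
```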