16.0 normalization woes #619

eggrobin · 2023-12-01T21:54:41Z

UTC-178-A17 Set NFxC_Quick_Check=Maybe for characters like U+16D68 KIRAT RAI VOWEL SIGN AI which may change in NFxC normalization depending on context. For Unicode 16.0. See L2/24-009 item 5.1.

UTC-178-A18 Add test cases to NormalizationTest.txt that exercise composition with the components of U+16D68 KIRAT RAI VOWEL SIGN AI and similar characters. For Unicode 16.0. See L2/24-009 item 5.1.

UTC-178-A78 Add normalization tests for all potential decompositions of U+16D68 KIRAT RAI VOWEL SIGN AI, U+16D6A KIRAT RAI VOWEL SIGN AU (proposed in L2/22-043), and U+113C5 TULU-TIGALARI VOWEL SIGN AI (proposed in L2/22-031), for Unicode Version 16.0. [Ref. Section 5 of L2/24-013R]

See unicode-org/properties#206.

This PR fixes the derivation of the NFmeowC_QC properties (that is, NFC_QC and NFKC_QC), and adds:

- A test that the quickCheck algorithm defined in UAX15 is correct on all columns of all test cases in NormalizationTest.txt, for both normalization forms (failed thanks to the new parts before fixing the derivations; the existing parts would not have caught it).
- An invariant test that NFmeowC_QC≠Yes on characters whose meow decomposition starts with a character with NFmeowC_QC≠Yes (failed before fixing the derivations).
- Part 4 of NormalizationTest.txt « canonical closures (excluding Hangul) », all strings canonically equivalent to single code points, excluding NFD and NFC (already covered in Part 1) and excluding Hangul.
- Part 5 of NormalizationTest.txt « chained primary composites ». All chains of primary composites, defined below, appear in this part. Most of them appear in c1, but some appear only in c2 (NFC).
  - Let X and Y be primary composites or non-decomposable starters.
  - Let S be canonically equivalent to the concatenation of X and Y.
  - S is a chain of primary composites if both of the following hold:
    1. S does not split into two substrings canonically equivalent to X and Y;
    2. S is not canonically equivalent to a single code point.
       - Note: If S is canonically equivalent to a single code point, it is covered by the earlier parts.
       - Note: This implies that at least one of X and Y is a primary composite.
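As an illustration of the invariant test described above, here is a minimal Java sketch. It is not the test added by this PR: the class and method names are hypothetical, the table holds only the four Tulu-Tigalari canonical decompositions from the draft 16.0 UnicodeData.txt, and the quick-check values are supplied by the caller rather than read from the UCD.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of the quick-check invariant; not the code in this PR. */
final class QuickCheckInvariantSketch {
    /** Canonical decomposition mappings keyed by composite (sorted for stable output). */
    static final Map<Integer, int[]> DECOMPOSITION = new TreeMap<>();

    static {
        DECOMPOSITION.put(0x1138E, new int[] {0x1138B, 0x113C2}); // LETTER AI
        DECOMPOSITION.put(0x113C5, new int[] {0x113C2, 0x113C2}); // VOWEL SIGN AI
        DECOMPOSITION.put(0x113C7, new int[] {0x113C2, 0x113B8}); // VOWEL SIGN OO
        DECOMPOSITION.put(0x113C8, new int[] {0x113C2, 0x113C9}); // VOWEL SIGN AU
    }

    /**
     * Returns the composites that are quick-check Yes although their
     * decomposition starts with a character that is not Yes.
     *
     * @param notYes code points whose quick-check value is Maybe or No
     */
    static List<Integer> violations(Set<Integer> notYes) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, int[]> entry : DECOMPOSITION.entrySet()) {
            int first = entry.getValue()[0];
            if (notYes.contains(first) && !notYes.contains(entry.getKey())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With only U+113C2 marked Maybe, as in the failure log, `violations` reports U+113C5, U+113C7 and U+113C8; once those three are also marked Maybe it returns an empty list.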

Test failures before fixing the derivations:

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
NFC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI (@Part4 # Canonical closures (excluding Hangul))
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN OO (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AU (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI (@Part5 # Chained primary composites) ==> expected: <0> but was: <5>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+113C7 (𑏇) TULU-TIGALARI VOWEL SIGN OO has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+113C8 (𑏈) TULU-TIGALARI VOWEL SIGN AU has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+16121 (𖄡) GURUNG KHEMA VOWEL SIGN U has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16122 (𖄢) GURUNG KHEMA VOWEL SIGN UU has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16123 (𖄣) GURUNG KHEMA VOWEL SIGN E has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16124 (𖄤) GURUNG KHEMA VOWEL SIGN EE has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+16129 (𖄩) GURUNG KHEMA VOWEL LENGTH MARK, which has NFC_QC=Maybe
U+16125 (𖄥) GURUNG KHEMA VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16D68 (𖵨) KIRAT RAI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+16D67 (𖵧) KIRAT RAI VOWEL SIGN E, which has NFC_QC=Maybe ==> expected: <0> but was: <9>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
NFKC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI (@Part4 # Canonical closures (excluding Hangul))
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN OO (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AU (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI (@Part5 # Chained primary composites) ==> expected: <0> but was: <5>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+113C7 (𑏇) TULU-TIGALARI VOWEL SIGN OO has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+113C8 (𑏈) TULU-TIGALARI VOWEL SIGN AU has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+16121 (𖄡) GURUNG KHEMA VOWEL SIGN U has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16122 (𖄢) GURUNG KHEMA VOWEL SIGN UU has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16123 (𖄣) GURUNG KHEMA VOWEL SIGN E has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16124 (𖄤) GURUNG KHEMA VOWEL SIGN EE has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+16129 (𖄩) GURUNG KHEMA VOWEL LENGTH MARK, which has NFKC_QC=Maybe
U+16125 (𖄥) GURUNG KHEMA VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16D68 (𖵨) KIRAT RAI VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+16D67 (𖵧) KIRAT RAI VOWEL SIGN E, which has NFKC_QC=Maybe ==> expected: <0> but was: <9>
[INFO]
[ERROR] Tests run: 15, Failures: 4, Errors: 0, Skipped: 1

markusicu · 2023-12-01T22:20:37Z

Test failure:

[ERROR] TestProperties.TestQuickCheckConsistency:228->TestFmwkMinusMinus.assertEquals:45 U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe ==> expected: <Yes> but was: <Maybe>

Let's see...

1138B;TULU-TIGALARI LETTER EE;Lo;0;L;;;;;N;;;;;
1138E;TULU-TIGALARI LETTER AI;Lo;0;L;1138B 113C2;;;;N;;;;;
...
113C2;TULU-TIGALARI VOWEL SIGN EE;Mc;0;L;;;;;N;;;;;
113C5;TULU-TIGALARI VOWEL SIGN AI;Mc;0;L;113C2 113C2;;;;N;;;;;
113C7;TULU-TIGALARI VOWEL SIGN OO;Mc;0;L;113C2 113B8;;;;N;;;;;
113C8;TULU-TIGALARI VOWEL SIGN AU;Mc;0;L;113C2 113C9;;;;N;;;;;

So 113C2 (EE) is the second character in decomps for 1138E (letter AI) and 113C5 (vowel sign AI), and that's why it's NFC_QC=Maybe. In the latter decomp, it's also the first character, like Kirat Rai AI=E+E.

Similar to Kirat Rai, quickCheck_NFC(letter EE, vowel sign AI)=Yes, which is wrong because toNFC(letter EE, vowel sign AI)=letter AI, vowel sign EE.
We need NFC_QC(113C5 vowel sign AI)=Maybe.

(However, quickCheck_NFC(vowel sign EE, vowel sign AI)=Maybe already because of vowel sign EE.)
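The reasoning above can be replayed with a toy model in Java. This is only a sketch under stated assumptions: the class is hypothetical and is not the unicodetools Normalizer, it knows only the four decompositions quoted above, and it relies on every character involved having ccc=0, so that decomposition needs no reordering and only adjacent pairs can compose. The quick check is the UAX #15 loop reduced to that case, with the set of Maybe characters passed in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of the Tulu-Tigalari case; not the unicodetools Normalizer. */
final class TuluTigalariQuickCheckSketch {
    enum QuickCheck { YES, MAYBE }

    static final Map<Integer, int[]> DECOMPOSITION = new HashMap<>();
    static final Map<Long, Integer> PRIMARY_COMPOSITE = new HashMap<>();

    static {
        // Mappings from the UnicodeData.txt lines quoted above.
        add(0x1138E, 0x1138B, 0x113C2); // LETTER AI = LETTER EE + VOWEL SIGN EE
        add(0x113C5, 0x113C2, 0x113C2); // VOWEL SIGN AI = EE + EE
        add(0x113C7, 0x113C2, 0x113B8); // VOWEL SIGN OO
        add(0x113C8, 0x113C2, 0x113C9); // VOWEL SIGN AU
    }

    static long key(int first, int second) {
        return ((long) first << 32) | second;
    }

    static void add(int composite, int first, int second) {
        DECOMPOSITION.put(composite, new int[] {first, second});
        PRIMARY_COMPOSITE.put(key(first, second), composite);
    }

    /** Full canonical decomposition; no reordering because every ccc is assumed to be 0. */
    static List<Integer> toNFD(List<Integer> input) {
        List<Integer> output = new ArrayList<>();
        for (int codePoint : input) {
            decomposeInto(codePoint, output);
        }
        return output;
    }

    static void decomposeInto(int codePoint, List<Integer> output) {
        int[] mapping = DECOMPOSITION.get(codePoint);
        if (mapping == null) {
            output.add(codePoint);
        } else {
            for (int part : mapping) {
                decomposeInto(part, output);
            }
        }
    }

    /** Canonical composition, simplified: with ccc=0 everywhere, only adjacent pairs compose. */
    static List<Integer> toNFC(List<Integer> input) {
        List<Integer> output = new ArrayList<>();
        for (int codePoint : toNFD(input)) {
            Integer composite = null;
            if (!output.isEmpty()) {
                composite = PRIMARY_COMPOSITE.get(key(output.get(output.size() - 1), codePoint));
            }
            if (composite != null) {
                output.set(output.size() - 1, composite);
            } else {
                output.add(codePoint);
            }
        }
        return output;
    }

    /** The UAX #15 quick check, reduced to the ccc=0 case and to Yes/Maybe values. */
    static QuickCheck quickCheck(List<Integer> input, Set<Integer> maybe) {
        QuickCheck result = QuickCheck.YES;
        for (int codePoint : input) {
            if (maybe.contains(codePoint)) {
                result = QuickCheck.MAYBE;
            }
        }
        return result;
    }
}
```

With only U+113C2 in the Maybe set, `quickCheck` on (U+1138B, U+113C5) returns YES even though `toNFC` turns that string into (U+1138E, U+113C2); adding U+113C5 to the Maybe set makes the quick check return MAYBE.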

markusicu · 2023-12-01T22:46:31Z

(I edited/fixed my response.)

eggrobin · 2023-12-29T14:47:53Z

Since we are going to want good conformance tests for 16.0 α, I (very stupidly) generated NormalizationTest.txt test cases for the closure under canonical equivalence of every canonical decomposition. It’s mostly Hangul by volume, we easily could drop that part if we wanted.

The following test cases seem interesting, so far we had only looked at Kirat Rai and Tulu-Tigalari, but it looks like Gurung Khema does the multi-path thing too:

1611E 16123;16126;1611E 1611E 1611F;16126;1611E 1611E 1611F; # (◌𖄞◌𖄣; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN E
16121 1611F;16126;1611E 1611E 1611F;16126;1611E 1611E 1611F; # (◌𖄡◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN I
1611E 16124;16127;1611E 16129 1611F;16127;1611E 16129 1611F; # (◌𖄞◌𖄤; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN EE
16122 1611F;16127;1611E 16129 1611F;16127;1611E 16129 1611F; # (◌𖄢◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN UU, GURUNG KHEMA VOWEL SIGN I
1611E 16125;16128;1611E 1611E 16120;16128;1611E 1611E 16120; # (◌𖄞◌𖄥; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN AI
16121 16120;16128;1611E 1611E 16120;16128;1611E 1611E 16120; # (◌𖄡◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN II
16D63 16D68;16D6A;16D63 16D67 16D67;16D6A;16D63 16D67 16D67; # (𖵪; 𖵪; 𖵪; 𖵪; 𖵪; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI
16D69 16D67;16D6A;16D63 16D67 16D67;16D6A;16D63 16D67 16D67; # (𖵪; 𖵪; 𖵪; 𖵪; 𖵪; ) KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN E
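Each line above has five semicolon-separated columns (c1 to c5) of hexadecimal code points, followed by a comment. As a rough sketch of how such a line can be read and checked against the conformance invariants listed in the header of NormalizationTest.txt, the following Java uses `java.text.Normalizer` from the JDK. The class and method names are hypothetical. Whether the lines above pass depends on the Unicode version of the runtime's data, so the usage note relies on an older Latin case instead.

```java
import java.text.Normalizer;

/** Hypothetical helper for reading NormalizationTest.txt lines; not part of unicodetools. */
final class NormalizationTestLineSketch {
    /** Parses the five columns c1..c5 of a data line into strings, ignoring the comment. */
    static String[] parseColumns(String line) {
        int comment = line.indexOf('#');
        String data = comment >= 0 ? line.substring(0, comment) : line;
        String[] fields = data.split(";");
        String[] columns = new String[5];
        for (int i = 0; i < 5; ++i) {
            StringBuilder column = new StringBuilder();
            for (String hex : fields[i].trim().split("\\s+")) {
                column.appendCodePoint(Integer.parseInt(hex, 16));
            }
            columns[i] = column.toString();
        }
        return columns;
    }

    /** Checks that normalizing each of c[from..to] (0-based, inclusive) yields expected. */
    static boolean allNormalizeTo(
            String expected, Normalizer.Form form, String[] c, int from, int to) {
        for (int i = from; i <= to; ++i) {
            if (!expected.equals(Normalizer.normalize(c[i], form))) {
                return false;
            }
        }
        return true;
    }

    /** The conformance invariants from the file header; c[0] is c1, c[4] is c5. */
    static boolean conforms(String[] c) {
        return allNormalizeTo(c[1], Normalizer.Form.NFC, c, 0, 2)
                && allNormalizeTo(c[3], Normalizer.Form.NFC, c, 3, 4)
                && allNormalizeTo(c[2], Normalizer.Form.NFD, c, 0, 2)
                && allNormalizeTo(c[4], Normalizer.Form.NFD, c, 3, 4)
                && allNormalizeTo(c[3], Normalizer.Form.NFKC, c, 0, 4)
                && allNormalizeTo(c[4], Normalizer.Form.NFKD, c, 0, 4);
    }
}
```

For example, `conforms(parseColumns("1E0A;1E0A;0044 0307;1E0A;0044 0307;"))` holds on any JDK, since U+1E0A has long decomposed to U+0044 U+0307. `parseColumns` itself does not depend on the runtime's Unicode data, so it also reads the Kirat Rai and Gurung Khema lines.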

Additional test cases should be added to properly cover the spukhafte Fernwirkung (spooky action at a distance) from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.

markusicu · 2023-12-29T18:00:17Z

> Since we are going to want good conformance tests for 16.0 α, I (very stupidly) generated NormalizationTest.txt test cases for the closure under canonical equivalence of every canonical decomposition.

Thanks!

> It’s mostly Hangul by volume, we easily could drop that part if we wanted.

Yes, let's drop Hangul.
The file is too large to view in the browser, but that data is likely boring. Hangul normalization is done in code and easily tested with just a handful of cases.

> The following test cases seem interesting, so far we had only looked at Kirat Rai and Tulu-Tigalari, but it looks like Gurung Khema does the multi-path thing too:

I see... U=AA+AA, and several other vowel signs have decomps starting with AA. (U itself, UU, E, AI, O, OO, AU)

> Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.

Yes. In particular, we should hardcode some additional test cases like Kirat Rai AA+AI+E, AA+E+AI, E+AI+AI, O+AI, O+AI+AI.

markusicu · 2023-12-29T18:18:39Z

Gurung Khema has another overlap: UU=AA+length, and EE=length+I --> AA+EE=UU+I

I added Gurung Khema to the title and writeup of https://github.com/unicode-org/properties/issues/206
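The overlap noted above (UU=AA+length and EE=length+I, hence AA+EE=UU+I) can be checked mechanically by comparing full canonical decompositions. The Java below is a minimal sketch with a hypothetical class name; it encodes only the three Gurung Khema mappings involved, read off the test lines earlier in this thread, and assumes ccc=0 for all of these characters so that no canonical reordering is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of the Gurung Khema overlap; not the unicodetools Normalizer. */
final class GurungKhemaOverlapSketch {
    static final int AA = 0x1611E;
    static final int I = 0x1611F;
    static final int UU = 0x16122;
    static final int EE = 0x16124;
    static final int OO = 0x16127;
    static final int LENGTH_MARK = 0x16129;

    static final Map<Integer, int[]> DECOMPOSITION = new HashMap<>();

    static {
        DECOMPOSITION.put(UU, new int[] {AA, LENGTH_MARK});
        DECOMPOSITION.put(EE, new int[] {LENGTH_MARK, I});
        DECOMPOSITION.put(OO, new int[] {UU, I});
    }

    /** Full canonical decomposition; no reordering because every ccc is assumed to be 0. */
    static List<Integer> toNFD(int... input) {
        List<Integer> output = new ArrayList<>();
        for (int codePoint : input) {
            decomposeInto(codePoint, output);
        }
        return output;
    }

    static void decomposeInto(int codePoint, List<Integer> output) {
        int[] mapping = DECOMPOSITION.get(codePoint);
        if (mapping == null) {
            output.add(codePoint);
        } else {
            for (int part : mapping) {
                decomposeInto(part, output);
            }
        }
    }

    /** Two strings are canonically equivalent when their NFD forms are identical. */
    static boolean canonicallyEquivalent(int[] a, int[] b) {
        return toNFD(a).equals(toNFD(b));
    }
}
```

Both (AA, EE) and (UU, I) decompose to (AA, length mark, I), which is also the decomposition of OO, so all three are canonically equivalent in this model.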

eggrobin · 2023-12-29T18:22:54Z

> Yes, let's drop Hangul.

Done.

> Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.

> Yes. In particular, we should hardcode some additional test cases like Kirat Rai AA+AI+E, AA+E+AI, E+AI+AI, O+AI, O+AI+AI.

I tried generating that instead of hardcoding it, which found Gurung Khema:

1138B 113C5 113C2;1138E 113C5;1138B 113C2 113C2 113C2;1138E 113C5;1138B 113C2 113C2 113C2; # (𑎎𑏅; 𑎎𑏅; 𑎎𑏅; 𑎎𑏅; 𑎎𑏅; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN EE
1138B 113C5 113B8;1138E 113C7;1138B 113C2 113C2 113B8;1138E 113C7;1138B 113C2 113C2 113B8; # (𑎎𑏇; 𑎎𑏇; 𑎎𑏇; 𑎎𑏇; 𑎎𑏇; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN AA
1138B 113C5 113C9;1138E 113C8;1138B 113C2 113C2 113C9;1138E 113C8;1138B 113C2 113C2 113C9; # (𑎎𑏈; 𑎎𑏈; 𑎎𑏈; 𑎎𑏈; 𑎎𑏈; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI AU LENGTH MARK
113C2 113C5 113C2;113C5 113C5;113C2 113C2 113C2 113C2;113C5 113C5;113C2 113C2 113C2 113C2; # (𑏅𑏅; 𑏅𑏅; 𑏅𑏅; 𑏅𑏅; 𑏅𑏅; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN EE
113C2 113C5 113B8;113C5 113C7;113C2 113C2 113C2 113B8;113C5 113C7;113C2 113C2 113C2 113B8; # (𑏅𑏇; 𑏅𑏇; 𑏅𑏇; 𑏅𑏇; 𑏅𑏇; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN AA
113C2 113C5 113C9;113C5 113C8;113C2 113C2 113C2 113C9;113C5 113C8;113C2 113C2 113C2 113C9; # (𑏅𑏈; 𑏅𑏈; 𑏅𑏈; 𑏅𑏈; 𑏅𑏈; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI AU LENGTH MARK
1611E 16121 1611E;16121 16121;1611E 1611E 1611E 1611E;16121 16121;1611E 1611E 1611E 1611E; # (◌𖄞◌𖄡◌𖄞; ◌𖄡◌𖄡; ◌𖄞◌𖄞◌𖄞◌𖄞; ◌𖄡◌𖄡; ◌𖄞◌𖄞◌𖄞◌𖄞; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA
1611E 16121 16129;16121 16122;1611E 1611E 1611E 16129;16121 16122;1611E 1611E 1611E 16129; # (◌𖄞◌𖄡◌𖄩; ◌𖄡◌𖄢; ◌𖄞◌𖄞◌𖄞◌𖄩; ◌𖄡◌𖄢; ◌𖄞◌𖄞◌𖄞◌𖄩; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL LENGTH MARK
1611E 16126;16121 16123;1611E 1611E 1611E 1611F;16121 16123;1611E 1611E 1611E 1611F; # (◌𖄞◌𖄦; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN O
1611E 16121 1611F;16121 16123;1611E 1611E 1611E 1611F;16121 16123;1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN I
1611E 16128;16121 16125;1611E 1611E 1611E 16120;16121 16125;1611E 1611E 1611E 16120; # (◌𖄞◌𖄨; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN AU
1611E 16121 16120;16121 16125;1611E 1611E 1611E 16120;16121 16125;1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN II
1611E 16121 16123;16121 16126;1611E 1611E 1611E 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄣; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN E
1611E 16121 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN I
1611E 16121 16124;16121 16127;1611E 1611E 1611E 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F; # (◌𖄞◌𖄡◌𖄤; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN EE
1611E 16121 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F; # (◌𖄞◌𖄡◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL LENGTH MARK, GURUNG KHEMA VOWEL SIGN I
1611E 16121 16125;16121 16128;1611E 1611E 1611E 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄥; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AI
1611E 16121 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN II
1611E 16127;16121 16124;1611E 1611E 16129 1611F;16121 16124;1611E 1611E 16129 1611F; # (◌𖄞◌𖄧; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN OO
1611E 16122 1611F;16121 16124;1611E 1611E 16129 1611F;16121 16124;1611E 1611E 16129 1611F; # (◌𖄞◌𖄢◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN UU, GURUNG KHEMA VOWEL SIGN I
16D67 16D68 16D67;16D68 16D68;16D67 16D67 16D67 16D67;16D68 16D68;16D67 16D67 16D67 16D67; # (𖵨𖵨; 𖵨𖵨; 𖵨𖵨; 𖵨𖵨; 𖵨𖵨; ) KIRAT RAI VOWEL SIGN E, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D6A 16D67;16D6A 16D67;16D63 16D67 16D67 16D67;16D6A 16D67;16D63 16D67 16D67 16D67; # (𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; ) KIRAT RAI VOWEL SIGN AU, KIRAT RAI VOWEL SIGN E
16D63 16D68 16D67;16D6A 16D67;16D63 16D67 16D67 16D67;16D6A 16D67;16D63 16D67 16D67 16D67; # (𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D69 16D68 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67; # (𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; ) KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D63 16D67 16D68 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67; # (𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN E, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E

markusicu · 2023-12-29T18:25:03Z

Very good! Thank you!!

eggrobin · 2023-12-29T19:56:50Z

(The generator for the chained decompositions also picks up some weird old stuff that I didn’t mean to pick up, but I’ll try to figure that out next year.)

eggrobin · 2024-01-05T16:58:03Z

@markusicu I still need to fix the code that derives the NFmeowC_QC properties so that it passes the new test, but the changes to GenerateData.java (which generate the new parts of NormalizationTest.txt) should be reviewable.

eggrobin · 2024-01-07T06:33:03Z

@markusicu

> And most of [the Gurung Khema] vowel signs are NFC_QC=Yes but need to be Maybe.

Interestingly, none of the tests fail because of that. I suspect the exact set of decompositions here means we could get away with having them Yes, that is, the quickCheck algorithm would still return Maybe or No on a non-normalized sequence, because none of the Yes composites would recombine with each other, see L2/22-157 p. 11. But let’s not try to be too smart.

markusicu

Results look very good -- especially the NFxC_QC values!
I gave up trying to understand 100% of the generator...

unicodetools/src/main/java/org/unicode/text/UCD/GenerateData.java

unicodetools/src/main/java/org/unicode/text/UCD/Normalizer.java

markusicu

(partial rs)lgtm

macchiati · 2024-01-20T01:46:20Z

Note, about the following. I think it would help for us to do the following in ICU to any UTF16 method that has a corresponding Character method.

a) rewrite the method to call the JDK. That way any IDE refactoring will allow inlining to convert easily to the JDK method.
b) (eventually) deprecate the method

    - for (int i = UTF16.getCharCount(first);
    + for (int i = Character.charCount(first);
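A sketch of what that suggestion could look like, in Java. The class below is a hypothetical stand-in and is not ICU4J source: `getCharCount` plays the role of a UTF16-style helper rewritten to delegate to the JDK (step a), and `countCodePoints` shows a loop that calls `Character.charCount` directly, as in the diff line.

```java
/** Hypothetical illustration of the suggestion; this is not ICU4J source code. */
final class CharCountSketch {
    /**
     * Stand-in for a UTF16-style helper, rewritten to delegate to the JDK.
     * Step (b) of the suggestion would eventually mark such a method deprecated.
     */
    static int getCharCount(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** Counts code points, stepping over surrogate pairs with the JDK method directly. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            ++count;
        }
        return count;
    }
}
```

Because the helper is a one-line delegation, an IDE's inline refactoring would replace each call site with `Character.charCount(...)` without changing behaviour.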

A test which I expected to fail, but not in this way

cc76245

eggrobin changed the title ~~A test which I expected to fail, but not in this way~~ 16.0 Normalization woes Dec 1, 2023

eggrobin changed the title ~~16.0 Normalization woes~~ 16.0 normalization woes Dec 1, 2023

Pre-16 and NFKCQC

e23d1c1

eggrobin added 3 commits December 2, 2023 01:47

🤪

24fe8e1

Canonical closure tests

328c761

Generate canonical closures

2d0ceaf

eggrobin added 2 commits December 29, 2023 18:48

Some interesting sequences

3880c4f

Some very crappy code

b3b53c0

Drop Hangul and make sure we have all overlaps

22dfd8c

eggrobin added 10 commits January 3, 2024 17:19

Split it into its own part and look at chaining compositions, not dec…

a742327

…ompositions

despam

182cc3a

spots

53459b0

Regenerate UCD

5f16271

Some comments.

747f982

Allow a single non-decomposable starter at either end of the chain

695c95e

Deduplicate parts 4 and 5

9fea9ea

Remove redundant test cases in NFC (covered by the NFC column of othe…

7362f2d

…r cases)

Clean things up

cdd391a

more cleanup

7bcb9b4

more cleanup

cf4275c

eggrobin added 2 commits January 7, 2024 05:34

More testing

3cb23ac

Fix the QC properties

e41b3ea

eggrobin marked this pull request as ready for review January 7, 2024 05:17

eggrobin requested a review from markusicu January 7, 2024 05:17

eggrobin added 3 commits January 7, 2024 06:19

stray import

0c312ce

factor

361a977

report all failures

0380b27

eggrobin mentioned this pull request Jan 10, 2024

Partially test consistency of grapheme cluster segmentation with canonical equivalence, and fix it for LGCs #645

Merged

markusicu reviewed Jan 20, 2024

Markus’s suggestions

7a6220b

Co-authored-by: Markus Scherer

markusicu previously approved these changes Jan 20, 2024

Merge remote-tracking branch 'la-vache/main' into normalization-woes

89cdf7a

eggrobin added 3 commits January 20, 2024 02:51

More honest primaryCompositesByMeowNFDCodePoint maps

e1a01ed

Regenerate UCD

b0b4cf6

Merge branch 'normalization-woes' of https://github.com/eggrobin/unic…

910039c

…odetools into normalization-woes

eggrobin dismissed markusicu’s stale review via 910039c January 20, 2024 01:51

spotless

c21622e

eggrobin requested a review from markusicu January 22, 2024 22:31

markusicu approved these changes Jan 22, 2024

eggrobin merged commit 51c579b into unicode-org:main Jan 22, 2024
15 checks passed
