16.0 normalization woes #619

Merged on Jan 22, 2024 (30 commits)

Member

@eggrobin eggrobin commented Dec 1, 2023

UTC-178-A17 Set NFxC_Quick_Check=Maybe for characters like U+16D68 KIRAT RAI VOWEL SIGN AI which may change in NFxC normalization depending on context. For Unicode 16.0. See L2/24-009 item 5.1.

UTC-178-A18 Add test cases to NormalizationTest.txt that exercise composition with the components of U+16D68 KIRAT RAI VOWEL SIGN AI and similar characters. For Unicode 16.0. See L2/24-009 item 5.1.

UTC-178-A78 Add normalization tests for all potential decompositions of U+16D68 KIRAT RAI VOWEL SIGN AI, U+16D6A KIRAT RAI VOWEL SIGN AU (proposed in L2/22-043), and U+113C5 TULU-TIGALARI VOWEL SIGN AI (proposed in L2/22-031), for Unicode Version 16.0. [Ref. Section 5 of L2/24-013R]

See unicode-org/properties#206.

This PR fixes the derivation of the NFmeowC_QC properties, and adds:

  • A test that the quickCheck algorithm defined in UAX15 is correct on all columns of all test cases in NormalizationTest.txt, for both normalization forms (failed thanks to the new parts before fixing the derivations; the existing parts would not have caught it).
  • An invariant test that NFmeowC_QC≠Yes on characters whose meow decomposition starts with a character with NFmeowC_QC≠Yes (failed before fixing the derivations).
  • Part 4 of NormalizationTest.txt « canonical closures (excluding Hangul) », all strings canonically equivalent to single code points, excluding NFD and NFC (already covered in Part 1) and excluding Hangul.
  • Part 5 of NormalizationTest.txt « chained primary composites ». All chains of primary composites, defined below, appear in this part. Most of them appear in c1, but some appear only in c2 (NFC).
    • Let X and Y be primary composites or non-decomposable starters.
    • Let S be canonically equivalent to the concatenation of X and Y.
    • S is a chain of primary composites if both of the following hold:
      1. S does not split into two substrings canonically equivalent to X and Y;
      2. S is not canonically equivalent to a single code point.
        • Note: If S is canonically equivalent to a single code point, it is covered by the earlier parts.
        • Note: This implies that at least one of X and Y is a primary composite.
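
As a rough illustration of the chain definition above, here is a minimal Python sketch. The decomposition table is a toy subset copied from the Tulu-Tigalari UnicodeData.txt lines quoted later in this thread; `nfd`, `equivalent` and `is_chain` are hypothetical helper names, not functions from GenerateData.java. The sketch assumes every character involved has ccc=0 (so no canonical reordering is needed), and it does not verify that X and Y are primary composites or non-decomposable starters.

```python
# Toy canonical decompositions (a subset of Tulu-Tigalari, for illustration only).
DECOMP = {
    0x1138E: (0x1138B, 0x113C2),  # LETTER AI     = LETTER EE + VOWEL SIGN EE
    0x113C5: (0x113C2, 0x113C2),  # VOWEL SIGN AI = VOWEL SIGN EE + VOWEL SIGN EE
    0x113C7: (0x113C2, 0x113B8),  # VOWEL SIGN OO = VOWEL SIGN EE + VOWEL SIGN AA
    0x113C8: (0x113C2, 0x113C9),  # VOWEL SIGN AU = VOWEL SIGN EE + AU LENGTH MARK
}

def nfd(cps):
    """Full canonical decomposition; assumes ccc=0 everywhere (no reordering)."""
    out = []
    for cp in cps:
        out.extend(nfd(DECOMP[cp]) if cp in DECOMP else [cp])
    return out

def equivalent(a, b):
    """Canonical equivalence, tested by comparing full decompositions."""
    return nfd(a) == nfd(b)

def is_chain(s, x, y):
    """Hypothetical helper: does s meet the chain definition for code points x, y?"""
    if not equivalent(s, [x, y]):
        return False
    # Condition 1: no split point yields substrings equivalent to X and Y.
    splits = any(equivalent(s[:i], [x]) and equivalent(s[i:], [y])
                 for i in range(1, len(s)))
    # Condition 2: s is not equivalent to a single (decomposable) code point.
    single = any(equivalent(s, [cp]) for cp in DECOMP)
    return not splits and not single

# LETTER EE + VOWEL SIGN OO is equivalent to LETTER AI + VOWEL SIGN AA,
# but cannot be cut into a piece equivalent to each.
print(is_chain([0x1138B, 0x113C7], 0x1138E, 0x113B8))
```

The NFC form itself (LETTER AI, VOWEL SIGN AA) splits trivially, so under this reading it is not a chain; only the other members of its equivalence class are.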

Test failures before fixing the derivations:

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
NFC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI (@Part4 # Canonical closures (excluding Hangul))
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN OO (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AU (@Part5 # Chained primary composites)
NFC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI (@Part5 # Chained primary composites) ==> expected: <0> but was: <5>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+113C7 (𑏇) TULU-TIGALARI VOWEL SIGN OO has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+113C8 (𑏈) TULU-TIGALARI VOWEL SIGN AU has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe
U+16121 (𖄡) GURUNG KHEMA VOWEL SIGN U has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16122 (𖄢) GURUNG KHEMA VOWEL SIGN UU has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16123 (𖄣) GURUNG KHEMA VOWEL SIGN E has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16124 (𖄤) GURUNG KHEMA VOWEL SIGN EE has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+16129 (𖄩) GURUNG KHEMA VOWEL LENGTH MARK, which has NFC_QC=Maybe
U+16125 (𖄥) GURUNG KHEMA VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFC_QC=Maybe
U+16D68 (𖵨) KIRAT RAI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+16D67 (𖵧) KIRAT RAI VOWEL SIGN E, which has NFC_QC=Maybe ==> expected: <0> but was: <9>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
NFKC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI (@Part4 # Canonical closures (excluding Hangul))
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN OO (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AU (@Part5 # Chained primary composites)
NFKC quickCheck returns YES for non-normalized string KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI (@Part5 # Chained primary composites) ==> expected: <0> but was: <5>
[ERROR]   TestProperties>TestFmwkMinusMinus.tearDown:23 errln()
U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+113C7 (𑏇) TULU-TIGALARI VOWEL SIGN OO has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+113C8 (𑏈) TULU-TIGALARI VOWEL SIGN AU has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFKC_QC=Maybe
U+16121 (𖄡) GURUNG KHEMA VOWEL SIGN U has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16122 (𖄢) GURUNG KHEMA VOWEL SIGN UU has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16123 (𖄣) GURUNG KHEMA VOWEL SIGN E has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16124 (𖄤) GURUNG KHEMA VOWEL SIGN EE has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+16129 (𖄩) GURUNG KHEMA VOWEL LENGTH MARK, which has NFKC_QC=Maybe
U+16125 (𖄥) GURUNG KHEMA VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+1611E (𖄞) GURUNG KHEMA VOWEL SIGN AA, which has NFKC_QC=Maybe
U+16D68 (𖵨) KIRAT RAI VOWEL SIGN AI has NFKC_QC=Yes, but its Decomposition_Mapping starts with U+16D67 (𖵧) KIRAT RAI VOWEL SIGN E, which has NFKC_QC=Maybe ==> expected: <0> but was: <9>
[INFO]
[ERROR] Tests run: 15, Failures: 4, Errors: 0, Skipped: 1
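
The second and fourth failure groups in the log above come from the invariant test. A minimal sketch of that check follows, assuming a toy table with two of the reported characters; the dictionaries and the `violations` helper are illustrative names only, not the actual test code in this PR.

```python
# Toy data: two of the characters reported in the failure log above.
DECOMP = {
    0x16D68: (0x16D67, 0x16D67),  # KIRAT RAI VOWEL SIGN AI     = E + E
    0x113C5: (0x113C2, 0x113C2),  # TULU-TIGALARI VOWEL SIGN AI = EE + EE
}

# NFC_QC values before the fix, as stated in the log; unlisted characters are Yes.
NFC_QC_BEFORE = {0x16D67: "Maybe", 0x16D68: "Yes",
                 0x113C2: "Maybe", 0x113C5: "Yes"}

def violations(decomp, qc):
    """Characters with QC=Yes whose Decomposition_Mapping starts with a QC!=Yes character."""
    bad = []
    for cp, mapping in sorted(decomp.items()):
        first = mapping[0]
        if qc.get(cp, "Yes") == "Yes" and qc.get(first, "Yes") != "Yes":
            bad.append(cp)
    return bad

# After the fix, both composites are Maybe and the invariant holds.
NFC_QC_AFTER = {**NFC_QC_BEFORE, 0x16D68: "Maybe", 0x113C5: "Maybe"}

print([f"U+{cp:04X}" for cp in violations(DECOMP, NFC_QC_BEFORE)])
print(violations(DECOMP, NFC_QC_AFTER))
```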

@eggrobin eggrobin changed the title from “A test which I expected to fail, but not in this way” to “16.0 Normalization woes” Dec 1, 2023
@eggrobin eggrobin changed the title 16.0 Normalization woes 16.0 normalization woes Dec 1, 2023
@markusicu
Member

markusicu commented Dec 1, 2023

Test failure:

[ERROR] TestProperties.TestQuickCheckConsistency:228->TestFmwkMinusMinus.assertEquals:45 U+113C5 (𑏅) TULU-TIGALARI VOWEL SIGN AI has NFC_QC=Yes, but its (canonical) Decomposition_Mapping starts with U+113C2 (𑏂) TULU-TIGALARI VOWEL SIGN EE, which has NFC_QC=Maybe ==> expected: <Yes> but was: <Maybe>

Let's see...

1138B;TULU-TIGALARI LETTER EE;Lo;0;L;;;;;N;;;;;
1138E;TULU-TIGALARI LETTER AI;Lo;0;L;1138B 113C2;;;;N;;;;;
...
113C2;TULU-TIGALARI VOWEL SIGN EE;Mc;0;L;;;;;N;;;;;
113C5;TULU-TIGALARI VOWEL SIGN AI;Mc;0;L;113C2 113C2;;;;N;;;;;
113C7;TULU-TIGALARI VOWEL SIGN OO;Mc;0;L;113C2 113B8;;;;N;;;;;
113C8;TULU-TIGALARI VOWEL SIGN AU;Mc;0;L;113C2 113C9;;;;N;;;;;

So 113C2 (EE) is the second character in decomps for 1138E (letter AI) and 113C5 (vowel sign AI), and that's why it's NFC_QC=Maybe. In the latter decomp, it's also the first character, like Kirat Rai AI=E+E.

Similar to Kirat Rai, quickCheck_NFC(letter EE, vowel sign AI)=Yes which is wrong because toNFC(letter EE, vowel sign AI)=letter AI, vowel sign EE.
We need NFC_QC(113C5 vowel sign AI)=Maybe.

(However, quickCheck_NFC(vowel sign EE, vowel sign AI)=Maybe already because of vowel sign EE.)
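
To make the argument above concrete, here is a small Python sketch of the quickCheck loop from UAX #15 next to a toy NFC built from the UnicodeData lines quoted above. Assumptions: all characters in the table have ccc=0 and none is a composition exclusion, `QC_BEFORE` and `QC_AFTER` list only the values discussed in this comment (everything else defaults to Yes), and the function names are made up for this sketch.

```python
YES, MAYBE, NO = "Yes", "Maybe", "No"

DECOMP = {
    0x1138E: (0x1138B, 0x113C2),  # LETTER AI     = LETTER EE + VOWEL SIGN EE
    0x113C5: (0x113C2, 0x113C2),  # VOWEL SIGN AI = VOWEL SIGN EE + VOWEL SIGN EE
    0x113C7: (0x113C2, 0x113B8),  # VOWEL SIGN OO = VOWEL SIGN EE + VOWEL SIGN AA
    0x113C8: (0x113C2, 0x113C9),  # VOWEL SIGN AU = VOWEL SIGN EE + AU LENGTH MARK
}
COMPOSE = {pair: cp for cp, pair in DECOMP.items()}

def to_nfd(cps):
    out = []
    for cp in cps:
        out.extend(to_nfd(DECOMP[cp]) if cp in DECOMP else [cp])
    return out

def to_nfc(cps):
    """Toy NFC: with ccc=0 everywhere, only adjacent pairs can compose."""
    out = []
    for cp in to_nfd(cps):
        if out and (out[-1], cp) in COMPOSE:
            out[-1] = COMPOSE[(out[-1], cp)]
        else:
            out.append(cp)
    return out

def quick_check(cps, qc, ccc=lambda cp: 0):
    """The quickCheck loop as described in UAX #15, over a list of code points."""
    last_ccc, result = 0, YES
    for cp in cps:
        c = ccc(cp)
        if last_ccc > c and c != 0:
            return NO
        check = qc.get(cp, YES)
        if check == NO:
            return NO
        if check == MAYBE:
            result = MAYBE
        last_ccc = c
    return result

QC_BEFORE = {0x113C2: MAYBE}                  # vowel sign AI still Yes
QC_AFTER = {0x113C2: MAYBE, 0x113C5: MAYBE}   # vowel sign AI fixed to Maybe

s = [0x1138B, 0x113C5]  # letter EE, vowel sign AI
print([f"{cp:04X}" for cp in to_nfc(s)])
print(quick_check(s, QC_BEFORE), quick_check(s, QC_AFTER))
```

With the pre-fix values the sequence is reported as Yes even though the toy NFC changes it to letter AI, vowel sign EE; with vowel sign AI set to Maybe the caller falls back to full normalization.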

@markusicu
Member

(I edited/fixed my response.)

@eggrobin
Member Author

Since we are going to want good conformance tests for 16.0 α, I (very stupidly) generated NormalizationTest.txt test cases for the closure under canonical equivalence of every canonical decomposition. It’s mostly Hangul by volume, we easily could drop that part if we wanted.

The following test cases seem interesting, so far we had only looked at Kirat Rai and Tulu-Tigalari, but it looks like Gurung Khema does the multi-path thing too:

1611E 16123;16126;1611E 1611E 1611F;16126;1611E 1611E 1611F; # (◌𖄞◌𖄣; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN E
16121 1611F;16126;1611E 1611E 1611F;16126;1611E 1611E 1611F; # (◌𖄡◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ◌𖄦; ◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN I
1611E 16124;16127;1611E 16129 1611F;16127;1611E 16129 1611F; # (◌𖄞◌𖄤; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN EE
16122 1611F;16127;1611E 16129 1611F;16127;1611E 16129 1611F; # (◌𖄢◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ◌𖄧; ◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN UU, GURUNG KHEMA VOWEL SIGN I
1611E 16125;16128;1611E 1611E 16120;16128;1611E 1611E 16120; # (◌𖄞◌𖄥; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN AI
16121 16120;16128;1611E 1611E 16120;16128;1611E 1611E 16120; # (◌𖄡◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ◌𖄨; ◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN II
16D63 16D68;16D6A;16D63 16D67 16D67;16D6A;16D63 16D67 16D67; # (𖵪; 𖵪; 𖵪; 𖵪; 𖵪; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI
16D69 16D67;16D6A;16D63 16D67 16D67;16D6A;16D63 16D67 16D67; # (𖵪; 𖵪; 𖵪; 𖵪; 𖵪; ) KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN E

Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.
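
For readers unfamiliar with the layout of the lines quoted above: each data line carries five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD) followed by a comment. The sketch below parses such a line and checks the conformance invariants listed in the NormalizationTest.txt header. It uses Python's `unicodedata`, whose Unicode version may well predate 16.0, so the invariants are only exercised on a long-standing Latin line chosen for illustration; the 16.0 line from this thread is parsed but not normalized. `parse_line` and `check_conformance` are hypothetical names.

```python
import unicodedata

def parse_line(line):
    """Split a data line into its five columns c1..c5, each as a str."""
    data = line.split("#", 1)[0]          # drop the comment (it may contain ';')
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def check_conformance(cols, normalize=unicodedata.normalize):
    """Invariants from the NormalizationTest.txt header, for one test line."""
    c1, c2, c3, c4, c5 = cols
    return all([
        all(normalize("NFC", c) == c2 for c in (c1, c2, c3)),
        all(normalize("NFC", c) == c4 for c in (c4, c5)),
        all(normalize("NFD", c) == c3 for c in (c1, c2, c3)),
        all(normalize("NFD", c) == c5 for c in (c4, c5)),
        all(normalize("NFKC", c) == c4 for c in cols),
        all(normalize("NFKD", c) == c5 for c in cols),
    ])

# A pre-16.0 line, safe for any recent Python.
latin = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
print(check_conformance(parse_line(latin)))

# A line from this thread: parsed only, since the runtime may not know Unicode 16.0.
kirat = "16D63 16D68;16D6A;16D63 16D67 16D67;16D6A;16D63 16D67 16D67; # AA, AI"
print([[f"{ord(ch):04X}" for ch in col] for col in parse_line(kirat)][:2])
```

Passing a different `normalize` callable lets the same check run against an implementation that does carry 16.0 data.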

@markusicu
Member

Since we are going to want good conformance tests for 16.0 α, I (very stupidly) generated NormalizationTest.txt test cases for the closure under canonical equivalence of every canonical decomposition.

Thanks!

It’s mostly Hangul by volume, we easily could drop that part if we wanted.

Yes, let's drop Hangul.
The file is too large to view in the browser, but that data is likely boring. Hangul normalization is done in code and easily tested with just a handful of cases.

The following test cases seem interesting, so far we had only looked at Kirat Rai and Tulu-Tigalari, but it looks like Gurung Khema does the multi-path thing too:

I see... U=AA+AA, and several other vowel signs have decomps starting with AA. (U itself, UU, E, AI, O, OO, AU)

Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.

Yes. In particular, we should hardcode some additional test cases like Kirat Rai AA+AI+E, AA+E+AI, E+AI+AI, O+AI, O+AI+AI.
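
A hedged sketch of how such hardcoded Kirat Rai cases could be written out in NormalizationTest.txt column order. The decompositions for O and AU are inferred from the test lines quoted earlier in this thread (AI = E + E is stated explicitly); the table assumes ccc=0 and no compatibility mappings, so the NFKC and NFKD columns simply repeat NFC and NFD. `make_test_line` is a made-up helper, not part of GenerateData.java.

```python
AA, E, AI, O, AU = 0x16D63, 0x16D67, 0x16D68, 0x16D69, 0x16D6A

DECOMP = {
    AI: (E, E),    # VOWEL SIGN AI = E + E
    O:  (AA, E),   # VOWEL SIGN O  = AA + E  (inferred from the quoted test lines)
    AU: (O, E),    # VOWEL SIGN AU = O + E   (inferred from the quoted test lines)
}
COMPOSE = {pair: cp for cp, pair in DECOMP.items()}

def nfd(cps):
    out = []
    for cp in cps:
        out.extend(nfd(DECOMP[cp]) if cp in DECOMP else [cp])
    return out

def nfc(cps):
    """Toy NFC: all characters are starters, so only adjacent pairs compose."""
    out = []
    for cp in nfd(cps):
        if out and (out[-1], cp) in COMPOSE:
            out[-1] = COMPOSE[(out[-1], cp)]
        else:
            out.append(cp)
    return out

def make_test_line(source):
    """Format source;NFC;NFD;NFKC;NFKD; with space-separated hex code points."""
    fmt = lambda cps: " ".join(f"{cp:04X}" for cp in cps)
    c2, c3 = nfc(source), nfd(source)
    return ";".join(fmt(c) for c in (source, c2, c3, c2, c3)) + ";"

for case in ([AA, AI, E], [AA, E, AI], [E, AI, AI], [O, AI], [O, AI, AI]):
    print(make_test_line(case))
```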

@markusicu
Member

Gurung Khema has another overlap: UU=AA+length, and EE=length+I --> AA+EE=UU+I

I added Gurung Khema to the title and writeup of https://github.com/unicode-org/properties/issues/206
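
The Gurung Khema overlap noted above can be checked mechanically by comparing full decompositions. A minimal sketch, using only the two mappings named in this comment and assuming ccc=0 for all four characters:

```python
AA, I, UU, EE, LENGTH = 0x1611E, 0x1611F, 0x16122, 0x16124, 0x16129

DECOMP = {
    UU: (AA, LENGTH),  # VOWEL SIGN UU = AA + VOWEL LENGTH MARK
    EE: (LENGTH, I),   # VOWEL SIGN EE = VOWEL LENGTH MARK + I
}

def nfd(cps):
    """Full canonical decomposition over the toy table (no reordering needed)."""
    out = []
    for cp in cps:
        out.extend(nfd(DECOMP[cp]) if cp in DECOMP else [cp])
    return out

# AA+EE and UU+I share the same decomposition, so they are canonically equivalent.
print(nfd([AA, EE]) == nfd([UU, I]))
print([f"{cp:04X}" for cp in nfd([AA, EE])])
```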

@eggrobin
Member Author

Yes, let's drop Hangul.

Done.

Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC.

Yes. In particular, we should hardcode some additional test cases like Kirat Rai AA+AI+E, AA+E+AI, E+AI+AI, O+AI, O+AI+AI.

I tried generating that instead of hardcoding it, which found Gurung Khema:

1138B 113C5 113C2;1138E 113C5;1138B 113C2 113C2 113C2;1138E 113C5;1138B 113C2 113C2 113C2; # (𑎎𑏅; 𑎎𑏅; 𑎎𑏅; 𑎎𑏅; 𑎎𑏅; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN EE
1138B 113C5 113B8;1138E 113C7;1138B 113C2 113C2 113B8;1138E 113C7;1138B 113C2 113C2 113B8; # (𑎎𑏇; 𑎎𑏇; 𑎎𑏇; 𑎎𑏇; 𑎎𑏇; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN AA
1138B 113C5 113C9;1138E 113C8;1138B 113C2 113C2 113C9;1138E 113C8;1138B 113C2 113C2 113C9; # (𑎎𑏈; 𑎎𑏈; 𑎎𑏈; 𑎎𑏈; 𑎎𑏈; ) TULU-TIGALARI LETTER EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI AU LENGTH MARK
113C2 113C5 113C2;113C5 113C5;113C2 113C2 113C2 113C2;113C5 113C5;113C2 113C2 113C2 113C2; # (𑏅𑏅; 𑏅𑏅; 𑏅𑏅; 𑏅𑏅; 𑏅𑏅; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN EE
113C2 113C5 113B8;113C5 113C7;113C2 113C2 113C2 113B8;113C5 113C7;113C2 113C2 113C2 113B8; # (𑏅𑏇; 𑏅𑏇; 𑏅𑏇; 𑏅𑏇; 𑏅𑏇; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI VOWEL SIGN AA
113C2 113C5 113C9;113C5 113C8;113C2 113C2 113C2 113C9;113C5 113C8;113C2 113C2 113C2 113C9; # (𑏅𑏈; 𑏅𑏈; 𑏅𑏈; 𑏅𑏈; 𑏅𑏈; ) TULU-TIGALARI VOWEL SIGN EE, TULU-TIGALARI VOWEL SIGN AI, TULU-TIGALARI AU LENGTH MARK
1611E 16121 1611E;16121 16121;1611E 1611E 1611E 1611E;16121 16121;1611E 1611E 1611E 1611E; # (◌𖄞◌𖄡◌𖄞; ◌𖄡◌𖄡; ◌𖄞◌𖄞◌𖄞◌𖄞; ◌𖄡◌𖄡; ◌𖄞◌𖄞◌𖄞◌𖄞; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA
1611E 16121 16129;16121 16122;1611E 1611E 1611E 16129;16121 16122;1611E 1611E 1611E 16129; # (◌𖄞◌𖄡◌𖄩; ◌𖄡◌𖄢; ◌𖄞◌𖄞◌𖄞◌𖄩; ◌𖄡◌𖄢; ◌𖄞◌𖄞◌𖄞◌𖄩; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL LENGTH MARK
1611E 16126;16121 16123;1611E 1611E 1611E 1611F;16121 16123;1611E 1611E 1611E 1611F; # (◌𖄞◌𖄦; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN O
1611E 16121 1611F;16121 16123;1611E 1611E 1611E 1611F;16121 16123;1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄣; ◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN I
1611E 16128;16121 16125;1611E 1611E 1611E 16120;16121 16125;1611E 1611E 1611E 16120; # (◌𖄞◌𖄨; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN AU
1611E 16121 16120;16121 16125;1611E 1611E 1611E 16120;16121 16125;1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄥; ◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN II
1611E 16121 16123;16121 16126;1611E 1611E 1611E 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄣; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN E
1611E 16121 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F;16121 16126;1611E 1611E 1611E 1611E 1611F; # (◌𖄞◌𖄡◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ◌𖄡◌𖄦; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN I
1611E 16121 16124;16121 16127;1611E 1611E 1611E 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F; # (◌𖄞◌𖄡◌𖄤; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN EE
1611E 16121 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F;16121 16127;1611E 1611E 1611E 16129 1611F; # (◌𖄞◌𖄡◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄧; ◌𖄞◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL LENGTH MARK, GURUNG KHEMA VOWEL SIGN I
1611E 16121 16125;16121 16128;1611E 1611E 1611E 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄥; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AI
1611E 16121 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120;16121 16128;1611E 1611E 1611E 1611E 16120; # (◌𖄞◌𖄡◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ◌𖄡◌𖄨; ◌𖄞◌𖄞◌𖄞◌𖄞◌𖄠; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN U, GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN II
1611E 16127;16121 16124;1611E 1611E 16129 1611F;16121 16124;1611E 1611E 16129 1611F; # (◌𖄞◌𖄧; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN OO
1611E 16122 1611F;16121 16124;1611E 1611E 16129 1611F;16121 16124;1611E 1611E 16129 1611F; # (◌𖄞◌𖄢◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ◌𖄡◌𖄤; ◌𖄞◌𖄞◌𖄩◌𖄟; ) GURUNG KHEMA VOWEL SIGN AA, GURUNG KHEMA VOWEL SIGN UU, GURUNG KHEMA VOWEL SIGN I
16D67 16D68 16D67;16D68 16D68;16D67 16D67 16D67 16D67;16D68 16D68;16D67 16D67 16D67 16D67; # (𖵨𖵨; 𖵨𖵨; 𖵨𖵨; 𖵨𖵨; 𖵨𖵨; ) KIRAT RAI VOWEL SIGN E, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D6A 16D67;16D6A 16D67;16D63 16D67 16D67 16D67;16D6A 16D67;16D63 16D67 16D67 16D67; # (𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; ) KIRAT RAI VOWEL SIGN AU, KIRAT RAI VOWEL SIGN E
16D63 16D68 16D67;16D6A 16D67;16D63 16D67 16D67 16D67;16D6A 16D67;16D63 16D67 16D67 16D67; # (𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; 𖵪𖵧; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D69 16D68 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67; # (𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; ) KIRAT RAI VOWEL SIGN O, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E
16D63 16D67 16D68 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67;16D6A 16D68;16D63 16D67 16D67 16D67 16D67; # (𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; 𖵪𖵨; ) KIRAT RAI VOWEL SIGN AA, KIRAT RAI VOWEL SIGN E, KIRAT RAI VOWEL SIGN AI, KIRAT RAI VOWEL SIGN E

@markusicu
Member

Very good! Thank you!!

@eggrobin
Member Author

(The generator for the chained decompositions also picks up some weird old stuff that I didn’t mean to pick up, but I’ll try to figure that out next year.)

@eggrobin
Member Author

eggrobin commented Jan 5, 2024

@markusicu I still need to fix the code that derives the NFmeowC_QC properties so that it passes the new test, but the changes to GenerateData.java (which generate the new parts of NormalizationTest.txt) should be reviewable.

@eggrobin eggrobin marked this pull request as ready for review January 7, 2024 05:17
@eggrobin eggrobin requested a review from markusicu January 7, 2024 05:17
@eggrobin
Member Author

eggrobin commented Jan 7, 2024

@markusicu

And most of [the Gurung Khema] vowel signs are NFC_QC=Yes but need to be Maybe.

Interestingly, none of the tests fail because of that. I suspect the exact set of decompositions here means we could get away with having them Yes, that is, the quickCheck algorithm would still return Maybe or No on a non-normalized sequence, because none of the Yes composites would recombine with each other, see L2/22-157 p. 11. But let’s not try to be too smart.
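
The suspicion above can be probed by brute force on a toy model. The sketch below assumes the pre-fix derivation marked a character as Maybe exactly when it occurs second in a composite's decomposition, uses decomposition tables inferred from the test lines quoted in this thread, assumes ccc=0 throughout, and only enumerates sequences of up to three characters from each script's vowel signs. It is bounded evidence on assumed data, not a proof. `false_yes` is a hypothetical helper name.

```python
from itertools import product

def nfc(cps, decomp):
    """Toy NFC for all-starter data: decompose fully, then compose adjacent pairs."""
    compose = {pair: cp for cp, pair in decomp.items()}
    def nfd(seq):
        out = []
        for cp in seq:
            out.extend(nfd(decomp[cp]) if cp in decomp else [cp])
        return out
    out = []
    for cp in nfd(cps):
        if out and (out[-1], cp) in compose:
            out[-1] = compose[(out[-1], cp)]
        else:
            out.append(cp)
    return out

def false_yes(decomp, max_len=3):
    """Non-normalized sequences on which quickCheck would answer Yes (assumed pre-fix QC)."""
    maybe = {pair[1] for pair in decomp.values()}
    alphabet = sorted(set(decomp) | {cp for pair in decomp.values() for cp in pair})
    return [seq
            for n in range(1, max_len + 1)
            for seq in product(alphabet, repeat=n)
            if not maybe.intersection(seq) and nfc(seq, decomp) != list(seq)]

GURUNG_KHEMA = {
    0x16121: (0x1611E, 0x1611E),  # U  = AA + AA
    0x16122: (0x1611E, 0x16129),  # UU = AA + LENGTH MARK
    0x16123: (0x1611E, 0x1611F),  # E  = AA + I
    0x16124: (0x16129, 0x1611F),  # EE = LENGTH MARK + I
    0x16125: (0x1611E, 0x16120),  # AI = AA + II
    0x16126: (0x16121, 0x1611F),  # O  = U + I   (inferred)
    0x16127: (0x16122, 0x1611F),  # OO = UU + I  (inferred)
    0x16128: (0x16121, 0x16120),  # AU = U + II  (inferred)
}
KIRAT_RAI = {
    0x16D68: (0x16D67, 0x16D67),  # AI = E + E
    0x16D69: (0x16D63, 0x16D67),  # O  = AA + E  (inferred)
    0x16D6A: (0x16D69, 0x16D67),  # AU = O + E   (inferred)
}

print(len(false_yes(GURUNG_KHEMA)))
print((0x16D63, 0x16D68) in false_yes(KIRAT_RAI))
```

On these tables the Gurung Khema search finds nothing, while Kirat Rai reproduces the AA, AI failure from the test log, which is consistent with the comment above.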

Member

@markusicu markusicu left a comment

Results look very good -- especially the NFxC_QC values!
I gave up trying to understand 100% of the generator...

Co-authored-by: Markus Scherer <[email protected]>
markusicu previously approved these changes Jan 20, 2024
Member

@markusicu markusicu left a comment

(partial rs)lgtm

@macchiati
Member

macchiati commented Jan 20, 2024 via email

@eggrobin eggrobin requested a review from markusicu January 22, 2024 22:31
@eggrobin eggrobin merged commit 51c579b into unicode-org:main Jan 22, 2024
15 checks passed