-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
16.0 normalization woes #619
Conversation
Test failure:
Let's see...
So 113C2 (EE) is the second character in decomps for 1138E (letter AI) and 113C5 (vowel sign AI), and that's why it's NFC_QC=Maybe. In the latter decomp, it's also the first character, like Kirat Rai AI=E+E. Similar to Kirat Rai, quickCheck_NFC(letter EE, vowel sign AI)=Yes which is wrong because toNFC(letter EE, vowel sign AI)=letter AI, vowel sign EE. (However, quickCheck_NFC(vowel sign EE, vowel sign AI)=Maybe already because of vowel sign EE.) |
(I edited/fixed my response.) |
Since we are going to want good conformance tests for 16.0 α, I (very stupidly) generated NormalizationTest.txt test cases for the closure under canonical equivalence of every canonical decomposition. It’s mostly Hangul by volume, we easily could drop that part if we wanted. The following test cases seem interesting, so far we had only looked at Kirat Rai and Tulu-Tigalari, but it looks like Gurung Khema does the multi-path thing too:
Additional test cases should be added to properly cover the spukhafte Fernwirkung from the two E E = AI decompositions; that only gets really interesting when considering sequences that are more than one character long in NFC. |
Thanks!
Yes, let's drop Hangul.
I see... U=AA+AA, and several other vowel signs have decomps starting with AA. (U itself, UU, E, AI, O, OO, AU)
Yes. In particular, we should hardcode some additional test cases like Kirat Rai AA+AI+E, AA+E+AI, E+AI+AI, O+AI, O+AI+AI. |
Gurung Khema has another overlap: UU=AA+length, and EE=length+I --> AA+EE=UU+I I added Gurung Khema to the title and writeup of https://github.com/unicode-org/properties/issues/206 |
Done.
I tried generating that instead of hardcoding it, which found Gurung Khema:
|
Very good! Thank you!! |
(The generator for the chained decompositions also picks up some weird old stuff that I didn’t mean to pick up, but I’ll try to figure that out next year.) |
@markusicu I still need to fix the code that generates the NFmeowC_QC generator so it passes the new test, but the changes to GenerateData.java (to generate the new parts of NormalizationTest.txt) should be reviewable. |
Interestingly, none of the tests fail because of that. I suspect the exact set of decompositions here means we could get away with having them Yes, that is, the quickCheck algorithm would still return Maybe or No on a non-normalized sequence, because none of the Yes composites would recombine with each other, see L2/22-157 p. 11. But let’s not try to be too smart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Results look very good -- especially the NFxC_QC values!
I gave up trying to understand 100% of the generator...
unicodetools/src/main/java/org/unicode/text/UCD/GenerateData.java
Outdated
Show resolved
Hide resolved
unicodetools/src/main/java/org/unicode/text/UCD/GenerateData.java
Outdated
Show resolved
Hide resolved
unicodetools/src/main/java/org/unicode/text/UCD/Normalizer.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Markus Scherer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(partial rs)lgtm
Note, about the following. I think it would help for us to do the following
in ICU to any UTF16 method that has a corresponding Character method.
a) rewrite the method to call the JDK. That way any IDE refactoring will
allow inlining to convert easily to the JDK method.
b) (eventually) deprecate the method
- for (int i = UTF16.getCharCount(first);
+ for (int i = Character.charCount(first);
…On Fri, Jan 19, 2024 at 4:59 PM Markus Scherer ***@***.***> wrote:
***@***.**** approved this pull request.
(partial rs)lgtm
—
Reply to this email directly, view it on GitHub
<#619 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMA6QIRY2VAMTI25MVTYPMJH3AVCNFSM6AAAAABADOJB3SVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQMZUGE3DEOJZGI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
UTC-178-A17 Set NFxC_Quick_Check=Maybe for characters like U+16D68 KIRAT RAI VOWEL SIGN AI which may change in NFxC normalization depending on context. For Unicode 16.0. See L2/24-009 item 5.1.
UTC-178-A18 Add test cases to NormalizationTest.txt that exercise composition with the components of U+16D68 KIRAT RAI VOWEL SIGN AI and similar characters. For Unicode 16.0. See L2/24-009 item 5.1.
UTC-178-A78 Add normalization tests for all potential decompositions of U+16D68 KIRAT RAI VOWEL SIGN AI, U+16D6A KIRAT RAI VOWEL SIGN AU (proposed in L2/22-043), and U+113C5 TULU-TIGALARI VOWEL SIGN AI (proposed in L2/22-031), for Unicode Version 16.0. [Ref. Section 5 of L2/24-013R]
See unicode-org/properties#206.
This PR fixes the derivation of the NFmeowC_QC properties, and adds:
Test failures before fixing the derivations: