InCB derivation fix #679

Merged (3 commits), Apr 30, 2024
29 changes: 23 additions & 6 deletions unicodetools/data/ucd/dev/DerivedCoreProperties.txt
@@ -1,5 +1,5 @@
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-04-25, 17:06:11 GMT
# Date: 2024-04-25, 19:58:12 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see https://www.unicode.org/terms_of_use.html
@@ -12980,12 +12980,16 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
09BC ; InCB; Extend # Mn BENGALI SIGN NUKTA
09FE ; InCB; Extend # Mn BENGALI SANDHI MARK
0A3C ; InCB; Extend # Mn GURMUKHI SIGN NUKTA
0A4D ; InCB; Extend # Mn GURMUKHI SIGN VIRAMA
0ABC ; InCB; Extend # Mn GUJARATI SIGN NUKTA
0B3C ; InCB; Extend # Mn ORIYA SIGN NUKTA
0BCD ; InCB; Extend # Mn TAMIL SIGN VIRAMA
0C3C ; InCB; Extend # Mn TELUGU SIGN NUKTA
0C55..0C56 ; InCB; Extend # Mn [2] TELUGU LENGTH MARK..TELUGU AI LENGTH MARK
0CBC ; InCB; Extend # Mn KANNADA SIGN NUKTA
0CCD ; InCB; Extend # Mn KANNADA SIGN VIRAMA
0D3B..0D3C ; InCB; Extend # Mn [2] MALAYALAM SIGN VERTICAL BAR VIRAMA..MALAYALAM SIGN CIRCULAR VIRAMA
0DCA ; InCB; Extend # Mn SINHALA SIGN AL-LAKUNA
0E38..0E3A ; InCB; Extend # Mn [3] THAI CHARACTER SARA U..THAI CHARACTER PHINTHU
0E48..0E4B ; InCB; Extend # Mn [4] THAI CHARACTER MAI EK..THAI CHARACTER MAI CHATTAWA
0EB8..0EBA ; InCB; Extend # Mn [3] LAO VOWEL SIGN U..LAO SIGN PALI VIRAMA
@@ -13019,6 +13023,7 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
1AB0..1ABD ; InCB; Extend # Mn [14] COMBINING DOUBLED CIRCUMFLEX ACCENT..COMBINING PARENTHESES BELOW
1ABF..1ACE ; InCB; Extend # Mn [16] COMBINING LATIN SMALL LETTER W BELOW..COMBINING LATIN SMALL LETTER INSULAR T
1B34 ; InCB; Extend # Mn BALINESE SIGN REREKAN
1B44 ; InCB; Extend # Mc BALINESE ADEG ADEG
1B6B..1B73 ; InCB; Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1BAA ; InCB; Extend # Mc SUNDANESE SIGN PAMAAEH
1BAB ; InCB; Extend # Mn SUNDANESE SIGN VIRAMA
@@ -13046,11 +13051,14 @@ A66F ; InCB; Extend # Mn COMBINING CYRILLIC VZMET
A674..A67D ; InCB; Extend # Mn [10] COMBINING CYRILLIC LETTER UKRAINIAN IE..COMBINING CYRILLIC PAYEROK
A69E..A69F ; InCB; Extend # Mn [2] COMBINING CYRILLIC LETTER EF..COMBINING CYRILLIC LETTER IOTIFIED E
A6F0..A6F1 ; InCB; Extend # Mn [2] BAMUM COMBINING MARK KOQNDON..BAMUM COMBINING MARK TUKWENTIS
A806 ; InCB; Extend # Mn SYLOTI NAGRI SIGN HASANTA
A82C ; InCB; Extend # Mn SYLOTI NAGRI SIGN ALTERNATE HASANTA
A8C4 ; InCB; Extend # Mn SAURASHTRA SIGN VIRAMA
A8E0..A8F1 ; InCB; Extend # Mn [18] COMBINING DEVANAGARI DIGIT ZERO..COMBINING DEVANAGARI SIGN AVAGRAHA
A92B..A92D ; InCB; Extend # Mn [3] KAYAH LI TONE PLOPHU..KAYAH LI TONE CALYA PLOPHU
A953 ; InCB; Extend # Mc REJANG VIRAMA
A9B3 ; InCB; Extend # Mn JAVANESE SIGN CECAK TELU
A9C0 ; InCB; Extend # Mc JAVANESE PANGKON
AAB0 ; InCB; Extend # Mn TAI VIET MAI KANG
AAB2..AAB4 ; InCB; Extend # Mn [3] TAI VIET VOWEL I..TAI VIET VOWEL U
AAB7..AAB8 ; InCB; Extend # Mn [2] TAI VIET MAI KHIT..TAI VIET VOWEL IA
@@ -13074,34 +13082,43 @@ FE20..FE2F ; InCB; Extend # Mn [16] COMBINING LIGATURE LEFT HALF..COMBINING
10EFD..10EFF ; InCB; Extend # Mn [3] ARABIC SMALL LOW WORD SAKTA..ARABIC SMALL LOW WORD MADDA
10F46..10F50 ; InCB; Extend # Mn [11] SOGDIAN COMBINING DOT BELOW..SOGDIAN COMBINING STROKE BELOW
10F82..10F85 ; InCB; Extend # Mn [4] OLD UYGHUR COMBINING DOT ABOVE..OLD UYGHUR COMBINING TWO DOTS BELOW
11046 ; InCB; Extend # Mn BRAHMI VIRAMA
11070 ; InCB; Extend # Mn BRAHMI SIGN OLD TAMIL VIRAMA
1107F ; InCB; Extend # Mn BRAHMI NUMBER JOINER
110BA ; InCB; Extend # Mn KAITHI SIGN NUKTA
110B9..110BA ; InCB; Extend # Mn [2] KAITHI SIGN VIRAMA..KAITHI SIGN NUKTA
11100..11102 ; InCB; Extend # Mn [3] CHAKMA SIGN CANDRABINDU..CHAKMA SIGN VISARGA
11133..11134 ; InCB; Extend # Mn [2] CHAKMA VIRAMA..CHAKMA MAAYYAA
11173 ; InCB; Extend # Mn MAHAJANI SIGN NUKTA
111C0 ; InCB; Extend # Mc SHARADA SIGN VIRAMA
111CA ; InCB; Extend # Mn SHARADA SIGN NUKTA
11235 ; InCB; Extend # Mc KHOJKI SIGN VIRAMA
11236 ; InCB; Extend # Mn KHOJKI SIGN NUKTA
112E9..112EA ; InCB; Extend # Mn [2] KHUDAWADI SIGN NUKTA..KHUDAWADI SIGN VIRAMA
1133B..1133C ; InCB; Extend # Mn [2] COMBINING BINDU BELOW..GRANTHA SIGN NUKTA
1134D ; InCB; Extend # Mc GRANTHA SIGN VIRAMA
11366..1136C ; InCB; Extend # Mn [7] COMBINING GRANTHA DIGIT ZERO..COMBINING GRANTHA DIGIT SIX
11370..11374 ; InCB; Extend # Mn [5] COMBINING GRANTHA LETTER A..COMBINING GRANTHA LETTER PA
113CE ; InCB; Extend # Mn TULU-TIGALARI SIGN VIRAMA
113CF ; InCB; Extend # Mc TULU-TIGALARI SIGN LOOPED VIRAMA
113D0 ; InCB; Extend # Mn TULU-TIGALARI CONJOINER
11442 ; InCB; Extend # Mn NEWA SIGN VIRAMA
11446 ; InCB; Extend # Mn NEWA SIGN NUKTA
1145E ; InCB; Extend # Mn NEWA SANDHI MARK
114C3 ; InCB; Extend # Mn TIRHUTA SIGN NUKTA
115C0 ; InCB; Extend # Mn SIDDHAM SIGN NUKTA
114C2..114C3 ; InCB; Extend # Mn [2] TIRHUTA SIGN VIRAMA..TIRHUTA SIGN NUKTA
115BF..115C0 ; InCB; Extend # Mn [2] SIDDHAM SIGN VIRAMA..SIDDHAM SIGN NUKTA
1163F ; InCB; Extend # Mn MODI SIGN VIRAMA
116B6 ; InCB; Extend # Mc TAKRI SIGN VIRAMA
116B7 ; InCB; Extend # Mn TAKRI SIGN NUKTA
1172B ; InCB; Extend # Mn AHOM SIGN KILLER
1183A ; InCB; Extend # Mn DOGRA SIGN NUKTA
11839..1183A ; InCB; Extend # Mn [2] DOGRA SIGN VIRAMA..DOGRA SIGN NUKTA
1193D ; InCB; Extend # Mc DIVES AKURU SIGN HALANTA
1193E ; InCB; Extend # Mn DIVES AKURU VIRAMA
11943 ; InCB; Extend # Mn DIVES AKURU SIGN NUKTA
119E0 ; InCB; Extend # Mn NANDINAGARI SIGN VIRAMA
11A34 ; InCB; Extend # Mn ZANABAZAR SQUARE SIGN VIRAMA
11A47 ; InCB; Extend # Mn ZANABAZAR SQUARE SUBJOINER
11A99 ; InCB; Extend # Mn SOYOMBO SUBJOINER
11C3F ; InCB; Extend # Mn BHAIKSUKI SIGN VIRAMA
11D42 ; InCB; Extend # Mn MASARAM GONDI SIGN NUKTA
11D44..11D45 ; InCB; Extend # Mn [2] MASARAM GONDI SIGN HALANTA..MASARAM GONDI VIRAMA
11D97 ; InCB; Extend # Mn GUNJALA GONDI VIRAMA
@@ -13133,6 +13150,6 @@ FE20..FE2F ; InCB; Extend # Mn [16] COMBINING LIGATURE LEFT HALF..COMBINING
1E8D0..1E8D6 ; InCB; Extend # Mn [7] MENDE KIKAKUI COMBINING NUMBER TEENS..MENDE KIKAKUI COMBINING NUMBER MILLIONS
1E944..1E94A ; InCB; Extend # Mn [7] ADLAM ALIF LENGTHENER..ADLAM NUKTA

# Total code points: 908
# Total code points: 929

# EOF
@@ -1177,35 +1177,25 @@ public int getMaxWidth(boolean isShort) {
.addAll(script.getSet("Orya"))
.addAll(script.getSet("Beng"))
.addAll(script.getSet("Deva"));
final UnicodeSet incbLinker =
conjunctLinkingScripts
.cloneAsThawed()
.retainAll(isc.getSet(Indic_Syllabic_Category_Values.Virama));
final UnicodeSet incbConsonant =
conjunctLinkingScripts
.cloneAsThawed()
.retainAll(isc.getSet(Indic_Syllabic_Category_Values.Consonant));
final UnicodeMap<String> incbDefinition =
new UnicodeMap<String>()
.setErrorOnReset(true)
.putAll(
conjunctLinkingScripts
.cloneAsThawed()
.retainAll(
isc.getSet(
Indic_Syllabic_Category_Values.Virama)),
"Linker")
.putAll(
conjunctLinkingScripts
.cloneAsThawed()
.retainAll(
isc.getSet(
Indic_Syllabic_Category_Values
.Consonant)),
"Consonant")
.putAll(incbLinker, "Linker")
.putAll(incbConsonant, "Consonant")
.putAll(
gcb.getSet("Extend")
.removeAll(ccc.getSet("Not_Reordered"))
Review comment (Member):

@eggrobin This still removes ccc=0, so

  • [179-C31] Consensus: Change the derivation of InCB=Extend to [\p{gcb=Extend}\p{gcb=ZWJ}-\p{InCB=Linker}-\p{InCB=Consonant}-[\u200c]]. For Unicode Version 16.0. See document L2/24-064 item 5.8.

is not done yet in this code, right?
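
A minimal sketch of the set expression quoted in [179-C31], written with plain `java.util` sets rather than the ICU `UnicodeSet` calls this PR uses. The class name `InCBExtendSketch`, the method `deriveExtend`, and the small sample sets are hypothetical and exist only for illustration; real inputs would come from the UCD property data. The expression contains no `ccc` term, which is what the question above turns on.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the unicodetools implementation.
class InCBExtendSketch {
    static final int ZWNJ = 0x200C;

    // Computes [\p{gcb=Extend}\p{gcb=ZWJ}-\p{InCB=Linker}-\p{InCB=Consonant}-[\u200c]]
    // over caller-supplied sets of code points. No ccc-based filtering is applied.
    static Set<Integer> deriveExtend(
            Set<Integer> gcbExtend,
            Set<Integer> gcbZwj,
            Set<Integer> incbLinker,
            Set<Integer> incbConsonant) {
        Set<Integer> result = new TreeSet<>(gcbExtend); // copy, inputs stay untouched
        result.addAll(gcbZwj);
        result.removeAll(incbLinker);
        result.removeAll(incbConsonant);
        result.remove(Integer.valueOf(ZWNJ));
        return result;
    }

    public static void main(String[] args) {
        // Tiny hand-picked samples (assumed for the demo, not a full property dump).
        Set<Integer> gcbExtend = Set.of(
                0x0300,  // COMBINING GRAVE ACCENT
                0x094D,  // DEVANAGARI SIGN VIRAMA (also InCB=Linker)
                0x09BE,  // BENGALI VOWEL SIGN AA (gcb=Extend, ccc=0)
                0x0BCD,  // TAMIL SIGN VIRAMA (not a Linker)
                ZWNJ);
        Set<Integer> gcbZwj = Set.of(0x200D);
        Set<Integer> incbLinker = Set.of(0x094D);
        Set<Integer> incbConsonant = Set.of(0x0915); // DEVANAGARI LETTER KA

        for (int cp : deriveExtend(gcbExtend, gcbZwj, incbLinker, incbConsonant)) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

With these sample inputs, U+094D (a Linker) and U+200C are dropped, while U+0BCD, U+09BE, and U+200D remain in the result.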

.addAll(gcb.getSet("ZWJ"))
.removeAll(
isc.getSet(
Indic_Syllabic_Category_Values.Virama))
.removeAll(
isc.getSet(
Indic_Syllabic_Category_Values
.Consonant)),
.removeAll(incbLinker)
.removeAll(incbConsonant),
"Extend")
.setMissing("None");
add(