Commit

Test consistency of grapheme cluster segmentation with canonical equivalence some more (#839)
eggrobin authored Jul 25, 2024
1 parent 883157e commit ef8d616
Showing 4 changed files with 41 additions and 25 deletions.
30 changes: 18 additions & 12 deletions unicodetools/data/ucd/dev/DerivedCoreProperties.txt
@@ -1,5 +1,5 @@
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-05-02, 15:02:37 GMT
# Date: 2024-05-31, 18:09:32 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -10699,8 +10699,11 @@ E01F0..E0FFF ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<rese
0C81 ; Grapheme_Extend # Mn KANNADA SIGN CANDRABINDU
0CBC ; Grapheme_Extend # Mn KANNADA SIGN NUKTA
0CBF ; Grapheme_Extend # Mn KANNADA VOWEL SIGN I
0CC0 ; Grapheme_Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; Grapheme_Extend # Mc KANNADA VOWEL SIGN UU
0CC6 ; Grapheme_Extend # Mn KANNADA VOWEL SIGN E
0CC7..0CC8 ; Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CCC..0CCD ; Grapheme_Extend # Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIGN VIRAMA
0CD5..0CD6 ; Grapheme_Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0CE2..0CE3 ; Grapheme_Extend # Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANNADA VOWEL SIGN VOCALIC LL
@@ -10780,9 +10783,11 @@ E01F0..E0FFF ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<rese
1B34 ; Grapheme_Extend # Mn BALINESE SIGN REREKAN
1B35 ; Grapheme_Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B36..1B3A ; Grapheme_Extend # Mn [5] BALINESE VOWEL SIGN ULU..BALINESE VOWEL SIGN RA REPA
1B3B ; Grapheme_Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3C ; Grapheme_Extend # Mn BALINESE VOWEL SIGN LA LENGA
1B3D ; Grapheme_Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B42 ; Grapheme_Extend # Mn BALINESE VOWEL SIGN PEPET
1B44 ; Grapheme_Extend # Mc BALINESE ADEG ADEG
1B43..1B44 ; Grapheme_Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1B6B..1B73 ; Grapheme_Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1B80..1B81 ; Grapheme_Extend # Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PANGLAYAR
1BA2..1BA5 ; Grapheme_Extend # Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA..SUNDANESE VOWEL SIGN PANYUKU
@@ -11024,7 +11029,7 @@ FF9E..FF9F ; Grapheme_Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK.
E0020..E007F ; Grapheme_Extend # Cf [96] TAG SPACE..CANCEL TAG
E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256

# Total code points: 2185
# Total code points: 2193

# ================================================

@@ -11316,10 +11321,8 @@ E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELE
0CB5..0CB9 ; Grapheme_Base # Lo [5] KANNADA LETTER VA..KANNADA LETTER HA
0CBD ; Grapheme_Base # Lo KANNADA SIGN AVAGRAHA
0CBE ; Grapheme_Base # Mc KANNADA VOWEL SIGN AA
0CC0..0CC1 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN II..KANNADA VOWEL SIGN U
0CC1 ; Grapheme_Base # Mc KANNADA VOWEL SIGN U
0CC3..0CC4 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN VOCALIC R..KANNADA VOWEL SIGN VOCALIC RR
0CC7..0CC8 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CDD..0CDE ; Grapheme_Base # Lo [2] KANNADA LETTER NAKAARA POLLU..KANNADA LETTER FA
0CE0..0CE1 ; Grapheme_Base # Lo [2] KANNADA LETTER VOCALIC RR..KANNADA LETTER VOCALIC LL
0CE6..0CEF ; Grapheme_Base # Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
@@ -11526,9 +11529,7 @@ E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELE
1AA8..1AAD ; Grapheme_Base # Po [6] TAI THAM SIGN KAAN..TAI THAM SIGN CAANG
1B04 ; Grapheme_Base # Mc BALINESE SIGN BISAH
1B05..1B33 ; Grapheme_Base # Lo [47] BALINESE LETTER AKARA..BALINESE LETTER HA
1B3B ; Grapheme_Base # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D..1B41 ; Grapheme_Base # Mc [5] BALINESE VOWEL SIGN LA LENGA TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B43 ; Grapheme_Base # Mc BALINESE VOWEL SIGN PEPET TEDUNG
1B3E..1B41 ; Grapheme_Base # Mc [4] BALINESE VOWEL SIGN TALING..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B45..1B4C ; Grapheme_Base # Lo [8] BALINESE LETTER KAF SASAK..BALINESE LETTER ARCHAIC JNYA
1B4E..1B4F ; Grapheme_Base # Po [2] BALINESE INVERTED CARIK SIKI..BALINESE INVERTED CARIK PAREREN
1B50..1B59 ; Grapheme_Base # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
@@ -12811,7 +12812,7 @@ FFFC..FFFD ; Grapheme_Base # So [2] OBJECT REPLACEMENT CHARACTER..REPLACEME
30000..3134A ; Grapheme_Base # Lo [4939] CJK UNIFIED IDEOGRAPH-30000..CJK UNIFIED IDEOGRAPH-3134A
31350..323AF ; Grapheme_Base # Lo [4192] CJK UNIFIED IDEOGRAPH-31350..CJK UNIFIED IDEOGRAPH-323AF

# Total code points: 152738
# Total code points: 152730

# ================================================

@@ -13026,8 +13027,11 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
0C81 ; InCB; Extend # Mn KANNADA SIGN CANDRABINDU
0CBC ; InCB; Extend # Mn KANNADA SIGN NUKTA
0CBF ; InCB; Extend # Mn KANNADA VOWEL SIGN I
0CC0 ; InCB; Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; InCB; Extend # Mc KANNADA VOWEL SIGN UU
0CC6 ; InCB; Extend # Mn KANNADA VOWEL SIGN E
0CC7..0CC8 ; InCB; Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; InCB; Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CCC..0CCD ; InCB; Extend # Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIGN VIRAMA
0CD5..0CD6 ; InCB; Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0CE2..0CE3 ; InCB; Extend # Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANNADA VOWEL SIGN VOCALIC LL
@@ -13106,9 +13110,11 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
1B34 ; InCB; Extend # Mn BALINESE SIGN REREKAN
1B35 ; InCB; Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B36..1B3A ; InCB; Extend # Mn [5] BALINESE VOWEL SIGN ULU..BALINESE VOWEL SIGN RA REPA
1B3B ; InCB; Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3C ; InCB; Extend # Mn BALINESE VOWEL SIGN LA LENGA
1B3D ; InCB; Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B42 ; InCB; Extend # Mn BALINESE VOWEL SIGN PEPET
1B44 ; InCB; Extend # Mc BALINESE ADEG ADEG
1B43..1B44 ; InCB; Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1B6B..1B73 ; InCB; Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1B80..1B81 ; InCB; Extend # Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PANGLAYAR
1BA2..1BA5 ; InCB; Extend # Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA..SUNDANESE VOWEL SIGN PANYUKU
@@ -13351,6 +13357,6 @@ FF9E..FF9F ; InCB; Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HA
E0020..E007F ; InCB; Extend # Cf [96] TAG SPACE..CANCEL TAG
E0100..E01EF ; InCB; Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256

# Total code points: 2184
# Total code points: 2192

# EOF
11 changes: 8 additions & 3 deletions unicodetools/data/ucd/dev/PropList.txt
@@ -1,5 +1,5 @@
# PropList-16.0.0.txt
# Date: 2024-05-08, 03:40:06 GMT
# Date: 2024-05-31, 18:09:48 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1273,7 +1273,10 @@ FFFFE..FFFFF ; Noncharacter_Code_Point # Cn [2] <noncharacter-FFFFE>..<noncha
0B57 ; Other_Grapheme_Extend # Mc ORIYA AU LENGTH MARK
0BBE ; Other_Grapheme_Extend # Mc TAMIL VOWEL SIGN AA
0BD7 ; Other_Grapheme_Extend # Mc TAMIL AU LENGTH MARK
0CC0 ; Other_Grapheme_Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; Other_Grapheme_Extend # Mc KANNADA VOWEL SIGN UU
0CC7..0CC8 ; Other_Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Other_Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CD5..0CD6 ; Other_Grapheme_Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0D3E ; Other_Grapheme_Extend # Mc MALAYALAM VOWEL SIGN AA
0D57 ; Other_Grapheme_Extend # Mc MALAYALAM AU LENGTH MARK
@@ -1282,7 +1285,9 @@ FFFFE..FFFFF ; Noncharacter_Code_Point # Cn [2] <noncharacter-FFFFE>..<noncha
1715 ; Other_Grapheme_Extend # Mc TAGALOG SIGN PAMUDPOD
1734 ; Other_Grapheme_Extend # Mc HANUNOO SIGN PAMUDPOD
1B35 ; Other_Grapheme_Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B44 ; Other_Grapheme_Extend # Mc BALINESE ADEG ADEG
1B3B ; Other_Grapheme_Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D ; Other_Grapheme_Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B43..1B44 ; Other_Grapheme_Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1BAA ; Other_Grapheme_Extend # Mc SUNDANESE SIGN PAMAAEH
1BF2..1BF3 ; Other_Grapheme_Extend # Mc [2] BATAK PANGOLAT..BATAK PANONGONAN
200C ; Other_Grapheme_Extend # Cf ZERO WIDTH NON-JOINER
@@ -1312,7 +1317,7 @@ FF9E..FF9F ; Other_Grapheme_Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND
1D16D..1D172 ; Other_Grapheme_Extend # Mc [6] MUSICAL SYMBOL COMBINING AUGMENTATION DOT..MUSICAL SYMBOL COMBINING FLAG-5
E0020..E007F ; Other_Grapheme_Extend # Cf [96] TAG SPACE..CANCEL TAG

# Total code points: 152
# Total code points: 160

# ================================================

21 changes: 11 additions & 10 deletions unicodetools/data/ucd/dev/auxiliary/GraphemeBreakProperty.txt
@@ -1,5 +1,5 @@
# GraphemeBreakProperty-16.0.0.txt
# Date: 2024-04-30, 21:48:20 GMT
# Date: 2024-05-31, 18:09:38 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -164,8 +164,11 @@ E01F0..E0FFF ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
0C81 ; Extend # Mn KANNADA SIGN CANDRABINDU
0CBC ; Extend # Mn KANNADA SIGN NUKTA
0CBF ; Extend # Mn KANNADA VOWEL SIGN I
0CC0 ; Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; Extend # Mc KANNADA VOWEL SIGN UU
0CC6 ; Extend # Mn KANNADA VOWEL SIGN E
0CC7..0CC8 ; Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CCC..0CCD ; Extend # Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIGN VIRAMA
0CD5..0CD6 ; Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0CE2..0CE3 ; Extend # Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANNADA VOWEL SIGN VOCALIC LL
@@ -245,9 +248,11 @@ E01F0..E0FFF ; Control # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
1B34 ; Extend # Mn BALINESE SIGN REREKAN
1B35 ; Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B36..1B3A ; Extend # Mn [5] BALINESE VOWEL SIGN ULU..BALINESE VOWEL SIGN RA REPA
1B3B ; Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3C ; Extend # Mn BALINESE VOWEL SIGN LA LENGA
1B3D ; Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B42 ; Extend # Mn BALINESE VOWEL SIGN PEPET
1B44 ; Extend # Mc BALINESE ADEG ADEG
1B43..1B44 ; Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1B6B..1B73 ; Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1B80..1B81 ; Extend # Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PANGLAYAR
1BA2..1BA5 ; Extend # Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA..SUNDANESE VOWEL SIGN PANYUKU
@@ -490,7 +495,7 @@ FF9E..FF9F ; Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDT
E0020..E007F ; Extend # Cf [96] TAG SPACE..CANCEL TAG
E0100..E01EF ; Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256

# Total code points: 2190
# Total code points: 2198

# ================================================

@@ -527,10 +532,8 @@ E0100..E01EF ; Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
0C41..0C44 ; SpacingMark # Mc [4] TELUGU VOWEL SIGN U..TELUGU VOWEL SIGN VOCALIC RR
0C82..0C83 ; SpacingMark # Mc [2] KANNADA SIGN ANUSVARA..KANNADA SIGN VISARGA
0CBE ; SpacingMark # Mc KANNADA VOWEL SIGN AA
0CC0..0CC1 ; SpacingMark # Mc [2] KANNADA VOWEL SIGN II..KANNADA VOWEL SIGN U
0CC1 ; SpacingMark # Mc KANNADA VOWEL SIGN U
0CC3..0CC4 ; SpacingMark # Mc [2] KANNADA VOWEL SIGN VOCALIC R..KANNADA VOWEL SIGN VOCALIC RR
0CC7..0CC8 ; SpacingMark # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; SpacingMark # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CF3 ; SpacingMark # Mc KANNADA SIGN COMBINING ANUSVARA ABOVE RIGHT
0D02..0D03 ; SpacingMark # Mc [2] MALAYALAM SIGN ANUSVARA..MALAYALAM SIGN VISARGA
0D3F..0D40 ; SpacingMark # Mc [2] MALAYALAM VOWEL SIGN I..MALAYALAM VOWEL SIGN II
@@ -560,9 +563,7 @@ E0100..E01EF ; Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
1A57 ; SpacingMark # Mc TAI THAM CONSONANT SIGN LA TANG LAI
1A6D..1A72 ; SpacingMark # Mc [6] TAI THAM VOWEL SIGN OY..TAI THAM VOWEL SIGN THAM AI
1B04 ; SpacingMark # Mc BALINESE SIGN BISAH
1B3B ; SpacingMark # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D..1B41 ; SpacingMark # Mc [5] BALINESE VOWEL SIGN LA LENGA TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B43 ; SpacingMark # Mc BALINESE VOWEL SIGN PEPET TEDUNG
1B3E..1B41 ; SpacingMark # Mc [4] BALINESE VOWEL SIGN TALING..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B82 ; SpacingMark # Mc SUNDANESE SIGN PANGWISAD
1BA1 ; SpacingMark # Mc SUNDANESE CONSONANT SIGN PAMINGKAL
1BA6..1BA7 ; SpacingMark # Mc [2] SUNDANESE VOWEL SIGN PANAELAENG..SUNDANESE VOWEL SIGN PANOLONG
@@ -660,7 +661,7 @@ ABEC ; SpacingMark # Mc MEETEI MAYEK LUM IYEK
1612A..1612C ; SpacingMark # Mc [3] GURUNG KHEMA CONSONANT SIGN MEDIAL YA..GURUNG KHEMA CONSONANT SIGN MEDIAL HA
16F51..16F87 ; SpacingMark # Mc [55] MIAO SIGN ASPIRATION..MIAO VOWEL SIGN UI

# Total code points: 386
# Total code points: 378

# ================================================

@@ -1027,6 +1027,10 @@ Let $TwoVietnameseReadingMarks := [\p{U15.1.0:ccc=6}]
# an LV or V, respectively.
[\p{NFC_QC=Maybe}&\p{ccc=0}] ⊆ [\p{GCB=Extend}\p{GCB=T}\p{GCB=V}]

# Canonical decomposition preserves the initial GCB, except for LV and LVT.
In [\p{dt=canonical}-[\p{gcb=LV}\p{gcb=LVT}]], gcb * (take 1) * dm = gcb
In [\p{gcb=LV}\p{gcb=LVT}], (take 1) * dm ∈ [\p{gcb=L} \p{gcb=LV} \p{gcb=LVT}]

# ICU relies on this to avoid carrying data for HST which would be mostly
# redundant with GCB. If this breaks, it should be noted on the landing page,
# and ICU-TC should be notified.
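
The first invariant added in the hunk above (the GCB of the first character of a canonical decomposition mapping equals the GCB of the original character, outside LV and LVT) can be spot-checked against the data changes in this commit. The following is only a rough sketch, not the unicodetools test: it assumes `dm` means the one-step mapping returned by Python's `unicodedata.decomposition`, the two GCB tables are hand-copied from the GraphemeBreakProperty.txt hunks above and cover only the Kannada and Balinese code points involved, and the names `GCB_AFTER`, `GCB_BEFORE`, `MOVED`, `first_of_decomposition`, and `violations` are hypothetical. The Hangul invariant on the following line is not covered.

```python
import unicodedata

# GCB values hand-copied from the GraphemeBreakProperty.txt hunks in this commit.
# Only the code points needed for the check are listed (not the full property).
GCB_AFTER = {
    0x0CBF: "Extend", 0x0CC0: "Extend", 0x0CC6: "Extend",
    0x0CC7: "Extend", 0x0CC8: "Extend", 0x0CCA: "Extend", 0x0CCB: "Extend",
    0x1B3A: "Extend", 0x1B3B: "Extend", 0x1B3C: "Extend", 0x1B3D: "Extend",
    0x1B3E: "SpacingMark", 0x1B3F: "SpacingMark",
    0x1B40: "SpacingMark", 0x1B41: "SpacingMark",
    0x1B42: "Extend", 0x1B43: "Extend",
}

# The eight code points that the diff moves from SpacingMark to Extend.
MOVED = [0x0CC0, 0x0CC7, 0x0CC8, 0x0CCA, 0x0CCB, 0x1B3B, 0x1B3D, 0x1B43]
GCB_BEFORE = {**GCB_AFTER, **{cp: "SpacingMark" for cp in MOVED}}


def first_of_decomposition(cp):
    """First code point of the one-step canonical mapping, or None if there is none."""
    mapping = unicodedata.decomposition(chr(cp))
    if not mapping or mapping.startswith("<"):  # "<...>" marks a compatibility mapping
        return None
    return int(mapping.split()[0], 16)


def violations(gcb):
    """Code points in the table whose GCB differs from that of their mapping's first character."""
    bad = []
    for cp in sorted(gcb):
        first = first_of_decomposition(cp)
        if first is not None and gcb[first] != gcb[cp]:
            bad.append(cp)
    return bad


if __name__ == "__main__":
    print("before:", [f"U+{cp:04X}" for cp in violations(GCB_BEFORE)])
    print("after: ", [f"U+{cp:04X}" for cp in violations(GCB_AFTER)])
```

Under these assumptions the old values fail the check and the new values pass it. U+0CCB is not flagged by the one-step check on the old data, because its mapping starts with U+0CCA, which had the same value at the time; it has to move together with U+0CCA to keep the invariant once U+0CCA becomes Extend.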