Invariant for lb=AS numbers #537

Merged Nov 3, 2023 (6 commits)
Changes from 3 commits
8 changes: 4 additions & 4 deletions unicodetools/data/ucd/dev/LineBreak.txt
@@ -790,7 +790,7 @@
1B43 ; CM # Mc BALINESE VOWEL SIGN PEPET TEDUNG
1B44 ; VI # Mc BALINESE ADEG ADEG
1B45..1B4C ; AK # Lo [8] BALINESE LETTER KAF SASAK..BALINESE LETTER ARCHAIC JNYA
-1B50..1B59 ; ID # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
+1B50..1B59 ; AS # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1B5A..1B5B ; BA # Po [2] BALINESE PANTI..BALINESE PAMADA
1B5C ; ID # Po BALINESE WINDU
1B5D..1B60 ; BA # Po [4] BALINESE CARIK PAMUNGKAH..BALINESE PAMENENG
@@ -1643,7 +1643,7 @@ A9C1..A9C6 ; ID # Po [6] JAVANESE LEFT RERENGGAN..JAVANESE PADA WINDU
A9C7..A9C9 ; BA # Po [3] JAVANESE PADA PANGKAT..JAVANESE PADA LUNGSI
A9CA..A9CD ; ID # Po [4] JAVANESE PADA ADEG..JAVANESE TURNED PADA PISELEH
A9CF ; BA # Lm JAVANESE PANGRANGKEP
-A9D0..A9D9 ; ID # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
+A9D0..A9D9 ; AS # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
A9DE..A9DF ; ID # Po [2] JAVANESE PADA TIRTA TUMETES..JAVANESE PADA ISEN-ISEN
A9E0..A9E4 ; SA # Lo [5] MYANMAR LETTER SHAN GHA..MYANMAR LETTER SHAN BHA
A9E5 ; SA # Mn MYANMAR SIGN SHAN SAW
@@ -1662,7 +1662,7 @@ AA43 ; CM # Mn CHAM CONSONANT SIGN FINAL NG
AA44..AA4B ; BA # Lo [8] CHAM LETTER FINAL CH..CHAM LETTER FINAL SS
AA4C ; CM # Mn CHAM CONSONANT SIGN FINAL M
AA4D ; CM # Mc CHAM CONSONANT SIGN FINAL H
-AA50..AA59 ; ID # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
+AA50..AA59 ; AS # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
AA5C ; ID # Po CHAM PUNCTUATION SPIRAL
AA5D..AA5F ; BA # Po [3] CHAM PUNCTUATION DANDA..CHAM PUNCTUATION TRIPLE DANDA
AA60..AA6F ; SA # Lo [16] MYANMAR LETTER KHAMTI GA..MYANMAR LETTER KHAMTI FA
@@ -3031,7 +3031,7 @@ FFFD ; AI # So REPLACEMENT CHARACTER
11942 ; CM # Mc DIVES AKURU MEDIAL RA
11943 ; CM # Mn DIVES AKURU SIGN NUKTA
11944..11946 ; BA # Po [3] DIVES AKURU DOUBLE DANDA..DIVES AKURU END OF TEXT MARK
-11950..11959 ; ID # Nd [10] DIVES AKURU DIGIT ZERO..DIVES AKURU DIGIT NINE
+11950..11959 ; AS # Nd [10] DIVES AKURU DIGIT ZERO..DIVES AKURU DIGIT NINE
119A0..119A7 ; AL # Lo [8] NANDINAGARI LETTER A..NANDINAGARI LETTER VOCALIC RR
119AA..119D0 ; AL # Lo [39] NANDINAGARI LETTER E..NANDINAGARI LETTER RRA
119D1..119D3 ; CM # Mc [3] NANDINAGARI VOWEL SIGN AA..NANDINAGARI VOWEL SIGN II
@@ -525,6 +525,7 @@ Let $IDInclusions = [[:block=/Ideographs/:] [[\U00020000-\U0003FFFF][\U0001F000-
\p{Line_Break=Unknown} = [\p{General_Category=Unassigned} \p{GeneralCategory=PrivateUse} - $IDInclusions - [\u20C0-\u20CF]]

Let $BrahmicLineBreaking = [\p{sc=Balinese}\p{sc=Batak}\p{sc=Brahmi}\p{sc=Cham}\p{sc=DivesAkuru}\p{sc=Grantha}\p{sc=Javanese}\p{sc=Makasar}\p{sc=Kawi}\p{sc=Cham}\p{sc=Makasar}]
+Let $VFScripts = [\p{sc=Batak}]

Let $OPInclusions = [\u00A1\u00BF\u2E18\U00013258-\U0001325A\U00013286\U00013288\U00013379\U0001342F\U00013437\U0001343C\U0001343E\U000145CE\U0001E95E-\U0001E95F]
# 7.0 Removed hack - [\u2308\u230A]
@@ -538,7 +539,7 @@ Let $OPInclusions = [\u00A1\u00BF\u2E18\U00013258-\U0001325A\U00013286\U00013288

# See L2/22-086 for an explanation of the special case of Batak.
\p{LB=VI} = [[\p{Indic_Syllabic_Category=Virama}\p{Indic_Syllabic_Category=Invisible_Stacker}] & $BrahmicLineBreaking]
-\p{LB=VF} = [\p{Indic_Syllabic_Category=Pure_Killer} & \p{sc=Batak}]
+\p{LB=VF} = [\p{Indic_Syllabic_Category=Pure_Killer} & $VFScripts]

# 15.1: Action item UTC-176-A81: change [[:PCM:]-\u070F] lb=AL->NU
\p{LB=CM} = [[\u3035] \p{GC=Mn} \p{GC=Me} \p{GC=Mc} \p{GC=Cc} \p{GC=Cf} -[\U00013437\U00013438\U0001343C-\U0001343F] -\p{LB=SA} -\p{LB=WJ} -\p{LB=ZW} -\p{LB=BA} -\p{LB=LF} -\p{LB=BK} -\p{LB=CR} -\p{LB=NL} -\p{LB=GL} -\p{LB=AL} -\p{LB=ZWJ} - \p{LB=VI} - \p{LB=VF} - \p{LB=NU}]
@@ -553,6 +554,13 @@ Let $NUInclusions = [\u066B\u066C]
Let $NUFormats = [[:PCM:]-[\u070F]]
\p{LB=NU} = [\p{GC=Nd} $NUInclusions $NUFormats - \p{EA=F} - $BrahmicLineBreaking]

+# Digits are lb=AS in scripts with brahmic line breaking.
+[\p{GC=Nd} & $BrahmicLineBreaking] ⊆ \p{LB=AS}
+
+# Batak is the one case where assigning LB=AS to digits could lead to unexpected results, because
+# of the rule (AK | ◌ | AS) × (AK | ◌ | AS) VF. There are no Batak digits.
+[\p{GC=Nd} & $VFScripts] = []
+
Let $PRInclusions = [\u002b\u005c\u00b1\u2116\u2212\u2213\u20C0-\u20CF]
\p{LB=PR} = [\p{GC=Sc} $PRInclusions - \p{LB=PO}]
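
The two invariants added in this hunk can be illustrated outside the test harness with a rough Python sketch. It is not how the unicodetools invariant test evaluates them: it runs only on the post-change LineBreak.txt lines shown in this diff, takes the general category from each line's comment field, and guesses the script from the character-name prefix, where the real invariants use the Script property. The names `Record`, `parse`, `script_of`, and `check_invariants` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Post-change lines copied from the LineBreak.txt hunks in this diff.
LINE_BREAK_EXCERPT = """\
1B45..1B4C ; AK # Lo [8] BALINESE LETTER KAF SASAK..BALINESE LETTER ARCHAIC JNYA
1B50..1B59 ; AS # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
A9D0..A9D9 ; AS # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59 ; AS # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
11950..11959 ; AS # Nd [10] DIVES AKURU DIGIT ZERO..DIVES AKURU DIGIT NINE
"""

# Assumption: the script is approximated by the character-name prefix.
# The lists mirror $BrahmicLineBreaking and $VFScripts from the diff.
BRAHMIC = ("BALINESE", "BATAK", "BRAHMI", "CHAM", "DIVES AKURU",
           "GRANTHA", "JAVANESE", "MAKASAR", "KAWI")
VF_SCRIPTS = ("BATAK",)

LINE_RE = re.compile(
    r"^(?P<start>[0-9A-F]+)(?:\.\.(?P<end>[0-9A-F]+))?\s*;\s*(?P<lb>\w+)"
    r"\s*#\s*(?P<gc>\w{2})\s+(?:\[\d+\]\s+)?(?P<name>.+)$"
)


class Record(NamedTuple):
    start: int
    end: int
    line_break: str
    general_category: str
    name: str


def parse(text: str) -> list:
    """Parse LineBreak.txt-style lines, skipping anything that does not match."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        start = int(m["start"], 16)
        end = int(m["end"], 16) if m["end"] else start
        records.append(Record(start, end, m["lb"], m["gc"], m["name"]))
    return records


def script_of(name: str, scripts: tuple) -> Optional[str]:
    """Return the first script whose name prefixes the character name."""
    for script in scripts:
        if name.startswith(script + " "):
            return script
    return None


def check_invariants(records: list) -> list:
    """Return one message per violated invariant; an empty list means both hold."""
    violations = []
    for r in records:
        if r.general_category != "Nd":
            continue
        span = f"{r.start:04X}..{r.end:04X}"
        # [\p{GC=Nd} & $BrahmicLineBreaking] ⊆ \p{LB=AS}
        if script_of(r.name, BRAHMIC) and r.line_break != "AS":
            violations.append(f"{span}: expected AS, found {r.line_break}")
        # [\p{GC=Nd} & $VFScripts] = []
        if script_of(r.name, VF_SCRIPTS):
            violations.append(f"{span}: digit in a VF script")
    return violations


print(check_invariants(parse(LINE_BREAK_EXCERPT)))  # prints []
```

Running the same check on the pre-change lines (with `ID` in place of `AS`) reports one violation per digit range, which is the regression the first invariant is meant to catch.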
