Skip to content

Commit

Permalink
Merge remote-tracking branch 'la-vache/main' into 5-modifier-click-le…
Browse files Browse the repository at this point in the history
…tters
  • Loading branch information
eggrobin committed Jul 25, 2024
2 parents 6b9773d + ef8d616 commit fb2804d
Show file tree
Hide file tree
Showing 40 changed files with 1,966 additions and 519 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pipeline.yml
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ jobs:
uses: actions/checkout@v3
with:
sparse-checkout: py/pipeline-workflow
- name: Check L2 document
- name: Check L2 document and WG references
run: |
python3 py/pipeline-workflow/check-l2-document.py
utc-decision:
Expand Down
36 changes: 16 additions & 20 deletions docs/help/changes.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,42 +3,38 @@
The Unicode Utilities have been modified to support both properties from the
released version of Unicode (via ICU) and from the new Unicode beta.

To get the beta version of the property, insert β *after* the property name.
To get the beta version of the property, insert `Uβ:` *before* the property name.
The explicit version number for the β can be used;
the resulting property is then only valid when that specific β is current.
Examples:

| `\p{Word_Break=ALetter}` | Released version of Unicode |
| `\p{Word_Breakβ=ALetter}` | Beta version of Unicode |
| Query | Result |
|---|---|
| `\p{Word_Break=ALetter}` | Released version of Unicode. |
| `\p{Uβ:Word_Break=ALetter}` | Beta version of Unicode; error outside of beta review. |
| `\p{U16β:Word_Break=ALetter}` | Beta version of Unicode 16.0; error during the beta review of any other version. |


For example, to see additions to that property value in the beta version, use:

<center>

[`\p{Word_Breakβ=ALetter}-\\p{Word_Break=ALetter}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BWord_Break%CE%B2%3DALetter%7D-%5Cp%7BWord_Break%3DALetter%7D&g=&i=)
[`\p{Uβ:Word_Break=ALetter}-\p{Word_Break=ALetter}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BU%CE%B2%3AWord_Break%3DALetter%7D-%5Cp%7BWord_Break%3DALetter%7D&g=&i=)

</center>


## Caveats

The support is not complete done, and there are some known problems.

1. Some properties are not supported in beta versions. See
<https://util.unicode.org/UnicodeJsps/properties.jsp>
for the list.
2. When characters are listed, the new blocks and subheads don't show up.
3. If you use a property that has a β version but no ICU version, you get no
error: just an empty listing.
4. The beta properties don't yet have the "shorthands" for cases like \\p{Lu}.
So make sure the property is listed, eg \\p{gcβ=Lu}
1. Example:
[`\p{gcβ=Lu}-\\p{gc=Lu}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bgc%CE%B2%3DLu%7D-%5Cp%7Bgc%3DLu%7D&g=&i=)
5. Tools for segmentation, etc. use the release properties; there isn't a way
The support is not completely done, and there are some known problems.

1. The General_Category groupings such as \\p{Uβ:L} are not correctly implemented.
Only actual values, such as \\p{Uβ:Lu} etc., work.
2. Tools for segmentation, etc. use the release properties; there isn't a way
to have them use the beta properties.
6. There are probably others...
3. There are probably others...

If you find a problem, please file a ticket at
<https://cldr.unicode.org/index/bug-reports>: make sure to start the summary with
"Unicode Utilities: "
https://github.com/unicode-org/unicodetools/issues.

[Back to Unicode Utilities Help Home](index)
1 change: 1 addition & 0 deletions docs/pipeline.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ PR preparation:
- [ ] If from SAH — Link SAH issue
- [ ] If from ESC or CJK — Mention ESC or CJK in the PR description
- [ ] When for a UTC decision — Cite in the format UTC-\d\d\d-[MC]\d+ or with a link.
- [ ] Link RMG issue
- [ ] Whenever there is a Proposal document — Cite L2 number in the format L2/yy-nnn
- [ ] data-for-new — Set label
- [ ] pipeline-* — Set label to **pipeline-recommended-to-UTC** if the characters are not yet in the pipeline, and **pipeline-provisionally-assigned**, or **pipeline-`<version>`** depending on their status in [the Pipeline](https://unicode.org/alloc/Pipeline.html#future).
Expand Down
5 changes: 5 additions & 0 deletions py/pipeline-workflow/check-l2-document.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,9 @@
"PRs for character additions must include a link to the SAH issue, or "
"the mention ESC or CJK.")
errors += 1
if not re.search(r"(unicode-org/utc-release-management(#|/issues/)\d)", pr_body):
print("::error title=Need RMG reference::"
"PRs for character additions must include a link to the corresponding "
"RMG issue.")
errors += 1
exit(errors)
441 changes: 441 additions & 0 deletions unicodetools/data/emoji/dev/internal/emoji-ordering-rules.txt

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/DerivedAge.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# DerivedAge-16.0.0.txt

Check warning on line 1 in unicodetools/data/ucd/dev/DerivedAge.txt

View workflow job for this annotation

GitHub Actions / Draft unless approved

Not in the 16.0 pipeline

While the Unicode Technical Committee has provisionally assigned these characters, they have not been accepted for Unicode 16.0, nor for any specific version of Unicode. The Age property values for new characters are likely incorrect right now. They will be recomputed after the UTC accepts their encoding and this pull request is updated for the target version.
# Date: 2024-07-25, 16:09:29 GMT
# Date: 2024-07-25, 16:15:32 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
Expand Down
30 changes: 18 additions & 12 deletions unicodetools/data/ucd/dev/DerivedCoreProperties.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-07-25, 16:10:14 GMT
# Date: 2024-07-25, 16:16:16 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
Expand Down Expand Up @@ -10699,8 +10699,11 @@ E01F0..E0FFF ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<rese
0C81 ; Grapheme_Extend # Mn KANNADA SIGN CANDRABINDU
0CBC ; Grapheme_Extend # Mn KANNADA SIGN NUKTA
0CBF ; Grapheme_Extend # Mn KANNADA VOWEL SIGN I
0CC0 ; Grapheme_Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; Grapheme_Extend # Mc KANNADA VOWEL SIGN UU
0CC6 ; Grapheme_Extend # Mn KANNADA VOWEL SIGN E
0CC7..0CC8 ; Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Grapheme_Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CCC..0CCD ; Grapheme_Extend # Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIGN VIRAMA
0CD5..0CD6 ; Grapheme_Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0CE2..0CE3 ; Grapheme_Extend # Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANNADA VOWEL SIGN VOCALIC LL
Expand Down Expand Up @@ -10780,9 +10783,11 @@ E01F0..E0FFF ; Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<rese
1B34 ; Grapheme_Extend # Mn BALINESE SIGN REREKAN
1B35 ; Grapheme_Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B36..1B3A ; Grapheme_Extend # Mn [5] BALINESE VOWEL SIGN ULU..BALINESE VOWEL SIGN RA REPA
1B3B ; Grapheme_Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3C ; Grapheme_Extend # Mn BALINESE VOWEL SIGN LA LENGA
1B3D ; Grapheme_Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B42 ; Grapheme_Extend # Mn BALINESE VOWEL SIGN PEPET
1B44 ; Grapheme_Extend # Mc BALINESE ADEG ADEG
1B43..1B44 ; Grapheme_Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1B6B..1B73 ; Grapheme_Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1B80..1B81 ; Grapheme_Extend # Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PANGLAYAR
1BA2..1BA5 ; Grapheme_Extend # Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA..SUNDANESE VOWEL SIGN PANYUKU
Expand Down Expand Up @@ -11024,7 +11029,7 @@ FF9E..FF9F ; Grapheme_Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK.
E0020..E007F ; Grapheme_Extend # Cf [96] TAG SPACE..CANCEL TAG
E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256

# Total code points: 2185
# Total code points: 2193

# ================================================

Expand Down Expand Up @@ -11316,10 +11321,8 @@ E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELE
0CB5..0CB9 ; Grapheme_Base # Lo [5] KANNADA LETTER VA..KANNADA LETTER HA
0CBD ; Grapheme_Base # Lo KANNADA SIGN AVAGRAHA
0CBE ; Grapheme_Base # Mc KANNADA VOWEL SIGN AA
0CC0..0CC1 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN II..KANNADA VOWEL SIGN U
0CC1 ; Grapheme_Base # Mc KANNADA VOWEL SIGN U
0CC3..0CC4 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN VOCALIC R..KANNADA VOWEL SIGN VOCALIC RR
0CC7..0CC8 ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; Grapheme_Base # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CDD..0CDE ; Grapheme_Base # Lo [2] KANNADA LETTER NAKAARA POLLU..KANNADA LETTER FA
0CE0..0CE1 ; Grapheme_Base # Lo [2] KANNADA LETTER VOCALIC RR..KANNADA LETTER VOCALIC LL
0CE6..0CEF ; Grapheme_Base # Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
Expand Down Expand Up @@ -11526,9 +11529,7 @@ E0100..E01EF ; Grapheme_Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELE
1AA8..1AAD ; Grapheme_Base # Po [6] TAI THAM SIGN KAAN..TAI THAM SIGN CAANG
1B04 ; Grapheme_Base # Mc BALINESE SIGN BISAH
1B05..1B33 ; Grapheme_Base # Lo [47] BALINESE LETTER AKARA..BALINESE LETTER HA
1B3B ; Grapheme_Base # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D..1B41 ; Grapheme_Base # Mc [5] BALINESE VOWEL SIGN LA LENGA TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B43 ; Grapheme_Base # Mc BALINESE VOWEL SIGN PEPET TEDUNG
1B3E..1B41 ; Grapheme_Base # Mc [4] BALINESE VOWEL SIGN TALING..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B45..1B4C ; Grapheme_Base # Lo [8] BALINESE LETTER KAF SASAK..BALINESE LETTER ARCHAIC JNYA
1B4E..1B4F ; Grapheme_Base # Po [2] BALINESE INVERTED CARIK SIKI..BALINESE INVERTED CARIK PAREREN
1B50..1B59 ; Grapheme_Base # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
Expand Down Expand Up @@ -12811,7 +12812,7 @@ FFFC..FFFD ; Grapheme_Base # So [2] OBJECT REPLACEMENT CHARACTER..REPLACEME
30000..3134A ; Grapheme_Base # Lo [4939] CJK UNIFIED IDEOGRAPH-30000..CJK UNIFIED IDEOGRAPH-3134A
31350..323AF ; Grapheme_Base # Lo [4192] CJK UNIFIED IDEOGRAPH-31350..CJK UNIFIED IDEOGRAPH-323AF

# Total code points: 152743
# Total code points: 152735

# ================================================

Expand Down Expand Up @@ -13026,8 +13027,11 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
0C81 ; InCB; Extend # Mn KANNADA SIGN CANDRABINDU
0CBC ; InCB; Extend # Mn KANNADA SIGN NUKTA
0CBF ; InCB; Extend # Mn KANNADA VOWEL SIGN I
0CC0 ; InCB; Extend # Mc KANNADA VOWEL SIGN II
0CC2 ; InCB; Extend # Mc KANNADA VOWEL SIGN UU
0CC6 ; InCB; Extend # Mn KANNADA VOWEL SIGN E
0CC7..0CC8 ; InCB; Extend # Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
0CCA..0CCB ; InCB; Extend # Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
0CCC..0CCD ; InCB; Extend # Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIGN VIRAMA
0CD5..0CD6 ; InCB; Extend # Mc [2] KANNADA LENGTH MARK..KANNADA AI LENGTH MARK
0CE2..0CE3 ; InCB; Extend # Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANNADA VOWEL SIGN VOCALIC LL
Expand Down Expand Up @@ -13106,9 +13110,11 @@ ABED ; Grapheme_Link # Mn MEETEI MAYEK APUN IYEK
1B34 ; InCB; Extend # Mn BALINESE SIGN REREKAN
1B35 ; InCB; Extend # Mc BALINESE VOWEL SIGN TEDUNG
1B36..1B3A ; InCB; Extend # Mn [5] BALINESE VOWEL SIGN ULU..BALINESE VOWEL SIGN RA REPA
1B3B ; InCB; Extend # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3C ; InCB; Extend # Mn BALINESE VOWEL SIGN LA LENGA
1B3D ; InCB; Extend # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B42 ; InCB; Extend # Mn BALINESE VOWEL SIGN PEPET
1B44 ; InCB; Extend # Mc BALINESE ADEG ADEG
1B43..1B44 ; InCB; Extend # Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..BALINESE ADEG ADEG
1B6B..1B73 ; InCB; Extend # Mn [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
1B80..1B81 ; InCB; Extend # Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE SIGN PANGLAYAR
1BA2..1BA5 ; InCB; Extend # Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA..SUNDANESE VOWEL SIGN PANYUKU
Expand Down Expand Up @@ -13351,6 +13357,6 @@ FF9E..FF9F ; InCB; Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HA
E0020..E007F ; InCB; Extend # Cf [96] TAG SPACE..CANCEL TAG
E0100..E01EF ; InCB; Extend # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256

# Total code points: 2184
# Total code points: 2192

# EOF
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/DerivedNormalizationProps.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# DerivedNormalizationProps-16.0.0.txt
# Date: 2024-07-25, 16:10:21 GMT
# Date: 2024-07-25, 16:16:24 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
Expand Down
Loading

0 comments on commit fb2804d

Please sign in to comment.