Merge remote-tracking branch 'la-vache/main' into pipeline-gap-174-C23
eggrobin committed Jun 7, 2024
2 parents e12002f + 8215285 commit 99625b1
Showing 26 changed files with 203 additions and 41 deletions.
36 changes: 16 additions & 20 deletions docs/help/changes.md
@@ -3,42 +3,38 @@
The Unicode Utilities have been modified to support both properties from the
released version of Unicode (via ICU) and from the new Unicode beta.

To get the beta version of the property, insert β *after* the property name.
To get the beta version of the property, insert `Uβ:` *before* the property name.
The explicit version number for the β can be used;
the resulting property is then only valid when that specific β is current.
Examples:

| `\p{Word_Break=ALetter}` | Released version of Unicode |
| `\p{Word_Breakβ=ALetter}` | Beta version of Unicode |
| Query | Result |
|---|---|
| `\p{Word_Break=ALetter}` | Released version of Unicode. |
| `\p{Uβ:Word_Break=ALetter}` | Beta version of Unicode; error outside of beta review. |
| `\p{U16β:Word_Break=ALetter}` | Beta version of Unicode 16.0; error during the beta review of any other version. |


For example, to see additions to that property value in the beta version, use:

<center>

[`\p{Word_Breakβ=ALetter}-\\p{Word_Break=ALetter}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BWord_Break%CE%B2%3DALetter%7D-%5Cp%7BWord_Break%3DALetter%7D&g=&i=)
[`\p{Uβ:Word_Break=ALetter}-\p{Word_Break=ALetter}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BU%CE%B2%3AWord_Break%3DALetter%7D-%5Cp%7BWord_Break%3DALetter%7D&g=&i=)

</center>
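
As a rough sketch of how such a link can be assembled, the Java snippet below percent-encodes a UnicodeSet expression and places it in the `a` parameter of the `list-unicodeset.jsp` URL used above. The class and method names (`BetaQueryUrl`, `betaAdditions`, `listUrl`) are hypothetical and are not part of the Unicode Utilities; only the URL shape is taken from the example link. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Unicode Utilities.
final class BetaQueryUrl {
    // Base URL copied from the example link in this document.
    private static final String BASE =
            "https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp";

    /** Builds the set expression "beta minus released" for one property value. */
    static String betaAdditions(String property, String value) {
        // \u03b2 is the Greek small letter beta used in the "Uβ:" prefix.
        return "\\p{U\u03b2:" + property + "=" + value + "}-\\p{" + property + "=" + value + "}";
    }

    /** Percent-encodes the expression as UTF-8 and places it in the "a" parameter. */
    static String listUrl(String unicodeSet) {
        // URLEncoder performs form encoding (a space would become '+').
        String encoded = URLEncoder.encode(unicodeSet, StandardCharsets.UTF_8);
        return BASE + "?a=" + encoded + "&g=&i=";
    }

    public static void main(String[] args) {
        System.out.println(listUrl(betaAdditions("Word_Break", "ALetter")));
    }
}
```

Keeping the expression builder separate from the encoder means the same `listUrl` call can be reused for other set expressions, such as a versioned `U16β:` query.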


## Caveats

The support is not complete done, and there are some known problems.

1. Some properties are not supported in beta versions. See
<https://util.unicode.org/UnicodeJsps/properties.jsp>
for the list.
2. When characters are listed, the new blocks and subheads don't show up.
3. If you use a property that has a β version but no ICU version, you get no
error: just an empty listing.
4. The beta properties don't yet have the "shorthands" for cases like \\p{Lu}.
So make sure the property is listed, eg \\p{gcβ=Lu}
1. Example:
[`\p{gcβ=Lu}-\\p{gc=Lu}`](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bgc%CE%B2%3DLu%7D-%5Cp%7Bgc%3DLu%7D&g=&i=)
5. Tools for segmentation, etc. use the release properties; there isn't a way
The support is not completely done, and there are some known problems.

1. The General_Category groupings such as \\p{Uβ:L} are not correctly implemented.
Only actual values, such as \\p{Uβ:Lu} etc., work.
2. Tools for segmentation, etc. use the release properties; there isn't a way
to have them use the beta properties.
6. There are probably others...
3. There are probably others...

If you find a problem, please file a ticket at
<https://cldr.unicode.org/index/bug-reports>: make sure to start the summary with
"Unicode Utilities: "
https://github.com/unicode-org/unicodetools/issues.

[Back to Unicode Utilities Help Home](index)
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/DerivedAge.txt
@@ -1,5 +1,5 @@
# DerivedAge-16.0.0.txt

Check warning on line 1 in unicodetools/data/ucd/dev/DerivedAge.txt

GitHub Actions / Draft unless approved

Not in the 16.0 pipeline

While the Unicode Technical Committee has provisionally assigned these characters, they have not been accepted for Unicode 16.0, nor for any specific version of Unicode. The Age property values for new characters are likely incorrect right now. They will be recomputed after the UTC accepts their encoding and this pull request is updated for the target version.
# Date: 2024-06-06, 10:07:23 GMT
# Date: 2024-06-07, 16:34:38 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/DerivedCoreProperties.txt
@@ -1,5 +1,5 @@
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-06-06, 10:07:42 GMT
# Date: 2024-06-07, 16:34:58 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
11 changes: 9 additions & 2 deletions unicodetools/data/ucd/dev/PropertyAliases.txt
@@ -1,5 +1,5 @@
# PropertyAliases-16.0.0.txt
# Date: 2024-04-30, 21:48:30 GMT
# Date: 2024-06-06, 21:52:48 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -99,6 +99,11 @@ cjkIRG_VSource ; kIRG_VSource
cjkRSUnicode ; kRSUnicode ; Unicode_Radical_Stroke; URS
isc ; ISO_Comment
JSN ; Jamo_Short_Name
kEH_Cat ; kEH_Cat
kEH_Desc ; kEH_Desc
kEH_HG ; kEH_HG
kEH_IFAO ; kEH_IFAO
kEH_JSesh ; kEH_JSesh
na ; Name
na1 ; Unicode_1_Name
Name_Alias ; Name_Alias
@@ -179,6 +184,8 @@ IDSB ; IDS_Binary_Operator
IDST ; IDS_Trinary_Operator
IDSU ; IDS_Unary_Operator
Join_C ; Join_Control
kEH_NoMirror ; kEH_NoMirror
kEH_NoRotate ; kEH_NoRotate
LOE ; Logical_Order_Exception
Lower ; Lowercase
Math ; Math
@@ -213,6 +220,6 @@ XO_NFKC ; Expands_On_NFKC
XO_NFKD ; Expands_On_NFKD

# ================================================
# Total: 135
# Total: 142

# EOF
32 changes: 31 additions & 1 deletion unicodetools/data/ucd/dev/PropertyValueAliases.txt
@@ -1,5 +1,5 @@
# PropertyValueAliases-16.0.0.txt
# Date: 2024-06-06, 10:08:00 GMT
# Date: 2024-06-07, 16:35:15 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1676,4 +1676,34 @@ XIDS; Y ; Yes ; T

# @missing: 0000..10FFFF; cjkRSUnicode; <none>

# kEH_Cat (kEH_Cat)

# @missing: 0000..10FFFF; kEH_Cat; <none>

# kEH_Desc (kEH_Desc)

# @missing: 0000..10FFFF; kEH_Desc; <none>

# kEH_HG (kEH_HG)

# @missing: 0000..10FFFF; kEH_HG; <none>

# kEH_IFAO (kEH_IFAO)

# @missing: 0000..10FFFF; kEH_IFAO; <none>

# kEH_JSesh (kEH_JSesh)

# @missing: 0000..10FFFF; kEH_JSesh; <none>

# kEH_NoMirror (kEH_NoMirror)

kEH_NoMirror; N ; No ; F ; False
kEH_NoMirror; Y ; Yes ; T ; True

# kEH_NoRotate (kEH_NoRotate)

kEH_NoRotate; N ; No ; F ; False
kEH_NoRotate; Y ; Yes ; T ; True

# EOF
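
The entries added above follow the layout visible in the lines themselves: semicolon-separated fields, with `#` starting a comment. A minimal sketch of splitting such a line follows, assuming a hypothetical `AliasLine` helper; it is not the parser unicodetools uses, and it deliberately treats `# @missing` lines as plain comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; unicodetools has its own UcdLineParser.
final class AliasLine {
    /** Splits one line into trimmed fields; blank and comment-only lines yield an empty list. */
    static List<String> fields(String line) {
        // Drop everything from the first '#' onward, then trim surrounding whitespace.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        List<String> result = new ArrayList<>();
        if (data.isEmpty()) {
            return result;
        }
        for (String part : data.split(";")) {
            result.add(part.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("kEH_NoMirror; N ; No ; F ; False"));
        System.out.println(fields("# @missing: 0000..10FFFF; kEH_Cat; <none>"));
    }
}
```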
@@ -1,5 +1,5 @@
# GraphemeBreakProperty-16.0.0.txt
# Date: 2024-06-06, 10:07:48 GMT
# Date: 2024-06-07, 16:35:03 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1,5 +1,5 @@
# SentenceBreakProperty-16.0.0.txt
# Date: 2024-06-06, 10:08:13 GMT
# Date: 2024-06-07, 16:35:29 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/auxiliary/WordBreakProperty.txt
@@ -1,5 +1,5 @@
# WordBreakProperty-16.0.0.txt
# Date: 2024-06-06, 10:08:15 GMT
# Date: 2024-06-07, 16:35:31 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/extracted/DerivedBidiClass.txt
@@ -1,5 +1,5 @@
# DerivedBidiClass-16.0.0.txt
# Date: 2024-06-06, 10:07:40 GMT
# Date: 2024-06-07, 16:34:55 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1,5 +1,5 @@
# DerivedCombiningClass-16.0.0.txt
# Date: 2024-06-06, 10:07:41 GMT
# Date: 2024-06-07, 16:34:57 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1,5 +1,5 @@
# DerivedEastAsianWidth-16.0.0.txt
# Date: 2024-06-06, 10:07:44 GMT
# Date: 2024-06-07, 16:34:59 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1,5 +1,5 @@
# DerivedGeneralCategory-16.0.0.txt
# Date: 2024-06-06, 10:07:44 GMT
# Date: 2024-06-07, 16:34:59 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/extracted/DerivedJoiningType.txt
@@ -1,5 +1,5 @@
# DerivedJoiningType-16.0.0.txt
# Date: 2024-06-06, 10:07:45 GMT
# Date: 2024-06-07, 16:35:00 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/extracted/DerivedLineBreak.txt
@@ -1,5 +1,5 @@
# DerivedLineBreak-16.0.0.txt
# Date: 2024-06-06, 10:07:45 GMT
# Date: 2024-06-07, 16:35:01 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
2 changes: 1 addition & 1 deletion unicodetools/data/ucd/dev/extracted/DerivedName.txt
@@ -1,5 +1,5 @@
# DerivedName-16.0.0.txt
# Date: 2024-06-06, 10:07:45 GMT
# Date: 2024-06-07, 16:35:01 GMT
# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -516,7 +516,10 @@ static void parseSourceFile(
} else {
indexUnicodeProperties.getFileNames().add(fullFilename);
UcdLineParser parser = new UcdLineParser(FileUtilities.in("", fullFilename));
if (fileName.startsWith("Unihan") || fileName.startsWith("k")) {
if (fileName.startsWith("Unihan")
|| fileName.startsWith("Unikemet")
|| (fileName.endsWith("Sources") && !fileName.startsWith("Emoji"))
|| fileName.startsWith("k")) {
parser.withTabs(true);
}
PropertyParsingInfo propInfo;
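
The widened condition in this hunk can be read as a standalone predicate. The sketch below is a restatement for illustration only: the class `TabSeparatedFiles`, the method `usesTabs`, and the sample file names are assumptions, and the actual check remains inline in `parseSourceFile`.

```java
// Hypothetical restatement of the condition in the hunk above; not code from unicodetools.
final class TabSeparatedFiles {
    /** Returns true when a data file name matches the tab-separated cases in the hunk. */
    static boolean usesTabs(String fileName) {
        return fileName.startsWith("Unihan")
                || fileName.startsWith("Unikemet")
                // Files ending in "Sources", except those whose name starts with "Emoji".
                || (fileName.endsWith("Sources") && !fileName.startsWith("Emoji"))
                || fileName.startsWith("k");
    }

    public static void main(String[] args) {
        // Sample names are illustrative; whether each exists in the repository is an assumption.
        String[] samples = {"Unikemet", "NushuSources", "EmojiSources", "PropertyAliases"};
        for (String name : samples) {
            System.out.println(name + " -> " + usesTabs(name));
        }
    }
}
```

Pulling the test into a named method like this would make the file-name rules unit-testable on their own, though the commit itself keeps them inline.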
14 changes: 12 additions & 2 deletions unicodetools/src/main/java/org/unicode/props/PropertyStatus.java
@@ -131,6 +131,7 @@ public enum PropertyScope {
UcdProperty.Emoji_KDDI,
UcdProperty.Emoji_SB);

// TODO(egg): These lists are not up to date!
private static final EnumSet<UcdProperty> CONTRIBUTORY_PROPERTY =
EnumSet.of(
UcdProperty.Jamo_Short_Name,
@@ -230,7 +231,10 @@ public enum PropertyScope {
UcdProperty.Named_Sequences_Prov,
UcdProperty.Regional_Indicator,
UcdProperty.Standardized_Variant,
UcdProperty.Vertical_Orientation);
UcdProperty.Vertical_Orientation,
// Unikemet
UcdProperty.kEH_Cat,
UcdProperty.kEH_Desc);

private static final EnumSet<UcdProperty> NORMATIVE_PROPERTY =
EnumSet.of(
@@ -290,7 +294,13 @@ public enum PropertyScope {
UcdProperty.kIRG_MSource,
UcdProperty.kIRG_TSource,
UcdProperty.kIRG_USource,
UcdProperty.kIRG_VSource);
UcdProperty.kIRG_VSource,
// Unikemet
UcdProperty.kEH_HG,
UcdProperty.kEH_IFAO,
UcdProperty.kEH_JSesh,
UcdProperty.kEH_NoMirror,
UcdProperty.kEH_NoRotate);
private static final EnumSet<UcdProperty> IMMUTABLE_PROPERTY =
EnumSet.of(
UcdProperty.Name,
15 changes: 15 additions & 0 deletions unicodetools/src/main/java/org/unicode/props/UcdProperty.java
@@ -105,6 +105,14 @@ public enum UcdProperty {
kDaeJaweon(PropertyType.Miscellaneous, "cjkDaeJaweon"),
kDefinition(PropertyType.Miscellaneous, "cjkDefinition"),
kEACC(PropertyType.Miscellaneous, "cjkEACC"),
kEH_Cat(PropertyType.Miscellaneous, "kEH_Cat"),
kEH_Desc(PropertyType.Miscellaneous, "kEH_Desc"),
kEH_FVal(PropertyType.Miscellaneous, "kEH_FVal"),
kEH_Func(PropertyType.Miscellaneous, "kEH_Func"),
kEH_HG(PropertyType.Miscellaneous, "kEH_HG"),
kEH_IFAO(PropertyType.Miscellaneous, "kEH_IFAO"),
kEH_JSesh(PropertyType.Miscellaneous, "kEH_JSesh"),
kEH_UniK(PropertyType.Miscellaneous, "kEH_UniK"),
kFanqie(PropertyType.Miscellaneous, "cjkFanqie"),
kFenn(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkFenn"),
kFennIndex(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkFennIndex"),
@@ -182,13 +190,15 @@ public enum UcdProperty {
kRSKanWa(PropertyType.Miscellaneous, "cjkRSKanWa"),
kRSKangXi(PropertyType.Miscellaneous, "cjkRSKangXi"),
kRSKorean(PropertyType.Miscellaneous, "cjkRSKorean"),
kRSTUnicode(PropertyType.Miscellaneous, "kRSTUnicode"),
kRSUnicode(
PropertyType.Miscellaneous,
null,
ValueCardinality.Ordered,
"cjkRSUnicode",
"Unicode_Radical_Stroke",
"URS"),
kReading(PropertyType.Miscellaneous, "kReading"),
kSBGY(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkSBGY"),
kSMSZD2003Index(PropertyType.Miscellaneous, "cjkSMSZD2003Index"),
kSMSZD2003Readings(PropertyType.Miscellaneous, "cjkSMSZD2003Readings"),
@@ -200,9 +210,11 @@ public enum UcdProperty {
ValueCardinality.Unordered,
"cjkSpecializedSemanticVariant"),
kSpoofingVariant(PropertyType.Miscellaneous, "cjkSpoofingVariant"),
kSrc_NushuDuben(PropertyType.Miscellaneous, "kSrc_NushuDuben"),
kStrange(PropertyType.Miscellaneous, "cjkStrange"),
kTGH(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkTGH"),
kTGHZ2013(PropertyType.Miscellaneous, "cjkTGHZ2013"),
kTGT_MergedSrc(PropertyType.Miscellaneous, "kTGT_MergedSrc"),
kTaiwanTelegraph(PropertyType.Miscellaneous, "cjkTaiwanTelegraph"),
kTang(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkTang"),
kTotalStrokes(PropertyType.Miscellaneous, null, ValueCardinality.Ordered, "cjkTotalStrokes"),
@@ -341,6 +353,9 @@ public enum UcdProperty {
White_Space(PropertyType.Binary, Binary.class, null, "WSpace", "space"),
XID_Continue(PropertyType.Binary, Binary.class, null, "XIDC"),
XID_Start(PropertyType.Binary, Binary.class, null, "XIDS"),
kEH_Core(PropertyType.Binary, Binary.class, null, "kEH_Core"),
kEH_NoMirror(PropertyType.Binary, Binary.class, null, "kEH_NoMirror"),
kEH_NoRotate(PropertyType.Binary, Binary.class, null, "kEH_NoRotate"),

// Unknown
;
@@ -1434,6 +1434,14 @@ public static Joining_Type_Values forName(String name) {
// kDaeJaweon
// kDefinition
// kEACC
// kEH_Cat
// kEH_Desc
// kEH_Func
// kEH_FVal
// kEH_HG
// kEH_IFAO
// kEH_JSesh
// kEH_UniK
// kFanqie
// kFenn
// kFennIndex
@@ -1501,11 +1509,13 @@ public static Joining_Type_Values forName(String name) {
// kPhonetic
// kPrimaryNumeric
// kPseudoGB1
// kReading
// kRSAdobe_Japan1_6
// kRSJapanese
// kRSKangXi
// kRSKanWa
// kRSKorean
// kRSTUnicode
// kRSUnicode
// kSBGY
// kSemanticVariant
@@ -1514,11 +1524,13 @@ public static Joining_Type_Values forName(String name) {
// kSMSZD2003Readings
// kSpecializedSemanticVariant
// kSpoofingVariant
// kSrc_NushuDuben
// kStrange
// kTaiwanTelegraph
// kTang
// kTGH
// kTGHZ2013
// kTGT_MergedSrc
// kTotalStrokes
// kTraditionalVariant
// kUnihanCore2020
@@ -309,6 +309,13 @@ public String _getValue(int codepoint) {
"cjkIRG_VSource",
"cjkIRG_VSource",
"kIRG_VSource");
add(iup.getProperty("kEH_Cat"));
add(iup.getProperty("kEH_Desc"));
add(iup.getProperty("kEH_HG"));
add(iup.getProperty("kEH_IFAO"));
add(iup.getProperty("kEH_JSesh"));
add(iup.getProperty("kEH_NoMirror"));
add(iup.getProperty("kEH_NoRotate"));
add(iup.getProperty("Emoji"));
add(iup.getProperty("Emoji_Presentation"));
add(iup.getProperty("Emoji_Modifier"));