Work around a UnicodeSet bug #908
test failure: https://github.com/unicode-org/unicodetools/actions/runs/10322995005/job/28579439631?pr=908
ParseException.class,
() ->
        TestUnicodeInvariants.parseUnicodeSet(
                "TEST [\\N{MEOW}]", new ParsePosition(5)));
i see a character encoding proposal brewing ;-)
Is there an ICU ticket? Would you be willing to work on it?
@aheninger FYI
The ICU User Guide does not go into details about
Not yet. Could you file one?
Yes, if you file the ticket please assign it to me.
Actually, at a closer look, while there is some documentation for ICU regex, there is no documentation for
Trying this in an old C++ UnicodeSet demo gives a weird result: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFFF&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetk.x=12&gosetk.y=23 I think this is not an ICU bug because ICU UnicodeSet does not support \N. In the unicodetools, it must be handled by Mark's SymbolTable implementation. And that probably has no way of distinguishing between "set" and "character" results -- looking at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.XSymbolTable.html
Actually, clicking a different button on the old demo seems to recognize \N: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFF0&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetn.x=17&gosetn.y=17 Could you please try this in vanilla ICU4J UnicodeSet, without the Unicode Tools machinery?
@markusicu wrote:
Support for
Next thing to check in ICU behavior is whether we allow a subtraction like CHARACTER_CLASS '-' LITERAL. (Using UTS18 syntax rule names here.) If we do, then turning \N into a LITERAL would be harmless. If we don't, then we could potentially break existing patterns. --> ICU-22851 “UnicodeSet should treat \N like a character not like a set”
there are still CI failures
For ICU4C regex, \N{whatever} matches exactly one code point.
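For comparison only: a minimal JDK-only sketch of the "exactly one code point" behavior, using `java.util.regex` (Java 9 or later) rather than ICU4C regex. It assumes nothing about ICU; the class and method names are made up for illustration. The JDK regex engine also rejects an unknown name at compile time instead of silently matching nothing.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** JDK-only illustration (hypothetical class name); this does not call ICU. */
final class NamedEscapeRegexDemo {
    /** Returns true if the whole input matches the pattern. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Returns true if java.util.regex refuses to compile the pattern. */
    static boolean isRejected(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // One named code point against one and against two characters of input.
        System.out.println(matchesWhole("\\N{DIGIT FOUR}", "4"));
        System.out.println(matchesWhole("\\N{DIGIT FOUR}", "44"));
        // A name that is not a Unicode character name.
        System.out.println(isRejected("\\N{MEOW}"));
    }
}
```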
Co-authored-by: Markus Scherer <[email protected]>
lgtm tnx
See https://www.unicode.org/reports/tr35/#unicodeset-syntax: `\N{whatever}` is a `quoted`, which is a `char`, which is an `element`, so one should be able to use it in a `range`.
ICU treats it as a synonym for `\p{Name=whatever}`, which is a `unicodeSet`, so the thing that should be a range is instead a set difference (equal to its LHS in practice).
That also means that `\N{nonexistent name}` is silently empty.
We should fix it in ICU, but we urgently need to fix it here for UCD development, because we have already been bitten by it, with no-one noticing; the following only tests the first letter of the Garay alphabet:
unicodetools/unicodetools/src/main/resources/org/unicode/text/UCD/AdditionComparisons.txt
Lines 27 to 34 in cccbe93
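To make the two readings of `[\N{A}-\N{B}]` concrete, here is a small JDK-only model. It is a sketch, not the ICU or unicodetools implementation: `NamedRangeSketch` and its methods are hypothetical, name lookup goes through `Character.codePointOf` (Java 9+), and Latin letters stand in for Garay so the example does not depend on the JDK's Unicode version.

```java
import java.util.TreeSet;

/** Hypothetical model of the two interpretations; not ICU or unicodetools code. */
final class NamedRangeSketch {
    /** Strict lookup: throws IllegalArgumentException for an unknown name. */
    static int named(String name) {
        return Character.codePointOf(name);
    }

    /** Lenient lookup, modelling \p{Name=...}: an unknown name yields an empty set. */
    static TreeSet<Integer> namedSet(String name) {
        TreeSet<Integer> set = new TreeSet<>();
        try {
            set.add(Character.codePointOf(name));
        } catch (IllegalArgumentException e) {
            // Silently empty, as described in the PR text.
        }
        return set;
    }

    /** Intended reading: \N{first}-\N{last} is a range of code points. */
    static TreeSet<Integer> asRange(String first, String last) {
        TreeSet<Integer> set = new TreeSet<>();
        int end = named(last);
        for (int cp = named(first); cp <= end; cp++) {
            set.add(cp);
        }
        return set;
    }

    /** Set reading: two one-element sets joined by set difference. */
    static TreeSet<Integer> asDifference(String first, String last) {
        TreeSet<Integer> set = namedSet(first);
        set.removeAll(namedSet(last));
        return set;
    }

    public static void main(String[] args) {
        String a = "LATIN SMALL LETTER A";
        String z = "LATIN SMALL LETTER Z";
        System.out.println("range size:      " + asRange(a, z).size());
        System.out.println("difference size: " + asDifference(a, z).size());
        System.out.println("unknown name:    " + namedSet("MEOW"));
    }
}
```

Under the set reading only the left-hand character survives, which is how a test intended to cover a whole alphabet can end up checking a single letter.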