Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around a UnicodeSet bug #908

Merged
merged 7 commits into from
Aug 9, 2024

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Aug 9, 2024

See https://www.unicode.org/reports/tr35/#unicodeset-syntax, \N{whatever} is a quoted, which is a char, which is an element, so one should be able to use it in a range.

ICU treats it a a synonym for \p{Name=whatever}, which is a unicodeSet, so that the thing that should be a range is a set difference (equal to its LHS in practice).

That also means that \N{nonexistent name} is silently empty.

We should fix it in ICU, but we urgently need to fix it here for UCD development, because we have already been bitten by it, with no-one noticing; the following only tests the first letter of the Garay alphabet:

# Garay is a right-to-left cased script:
Propertywise [\N{GARAY SMALL LETTER A} - \N{GARAY SMALL LETTER OLD NA}]
: [\N{GARAY CAPITAL LETTER A} - \N{GARAY CAPITAL LETTER OLD NA}]
CorrespondTo [\N{OLD HUNGARIAN SMALL LETTER A}]
: [\N{OLD HUNGARIAN CAPITAL LETTER A}]
UpTo: Block (Garay vs Old_Hungarian),
Script (Garay vs Old_Hungarian),
Script_Extensions (Garay vs Old_Hungarian)

@eggrobin eggrobin requested review from macchiati and markusicu August 9, 2024 17:09
@markusicu
Copy link
Member

test failure: https://github.com/unicode-org/unicodetools/actions/runs/10322995005/job/28579439631?pr=908

ParseErrorCount=6
TestFailureCount=0
Error:  Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 63.737 s <<< FAILURE! - in org.unicode.text.UCD.TestTestUnicodeInvariants
Error:  org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants  Time elapsed: 63.719 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: TestUnicodeInvariants.testInvariants(default) failed ==> expected: <0> but was: <6>
	at org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants(TestTestUnicodeInvariants.java:39)

ParseException.class,
() ->
TestUnicodeInvariants.parseUnicodeSet(
"TEST [\\N{MEOW}]", new ParsePosition(5)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see a character encoding proposal brewing ;-)

@markusicu
Copy link
Member

Is there an ICU ticket? Would you be willing to work on it?

@markusicu
Copy link
Member

@aheninger FYI

@markusicu
Copy link
Member

The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

@eggrobin
Copy link
Member Author

eggrobin commented Aug 9, 2024

Is there an ICU ticket?

Not yet. Could you file one?

Would you be willing to work on it?

Yes, if you file the ticket please assign it to me.

@markusicu
Copy link
Member

Actually, at a closer look, while there is some documentation for ICU regex, there is no documentation for \N for UnicodeSet:

Trying this in an old C++ UnicodeSet demo gives a weird result: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFFF&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetk.x=12&gosetk.y=23

I think this is not an ICU bug because ICU UnicodeSet does not support \N.

In the unicodetools, it must be handled by Mark's SymbolTable implementation. And that probably has no way of distinguishing between "set" and "character" results -- looking at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.XSymbolTable.html

@markusicu
Copy link
Member

Actually, clicking a different button on the old demo seems to recognize \N: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFF0&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetn.x=17&gosetn.y=17

Could you please try this in vanilla ICU4J UnicodeSet, without the Unicode Tools machinery?

@eggrobin
Copy link
Member Author

eggrobin commented Aug 9, 2024

Support for \N was added on 2002-08-28, in unicode-org/icu@681c046, for ICU-1767.

@markusicu
Copy link
Member

Next thing to check in ICU behavior is whether we allow a subtraction like CHARACTER_CLASS '-' LITERAL. (Using UTS18 syntax rule names here.) If we do, then turning \N into a LITERAL would be harmless. If we don't, then we could potentially break existing patterns.

--> ICU-22851 “UnicodeSet should treat \N like a character not like a set”

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are still CI failures

@aheninger
Copy link

The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

For ICU4C regex, \N{whatever} matches exactly one code point.
The whatever string, delimited by the braces, is dumped into
UChar32 theChar = u_charFromName(U_UNICODE_CHAR_NAME, name, fStatus);
to get the character.

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm tnx

@eggrobin eggrobin merged commit 5845839 into unicode-org:main Aug 9, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants