Work around a UnicodeSet bug #908

eggrobin · 2024-08-09T17:08:57Z

Per https://www.unicode.org/reports/tr35/#unicodeset-syntax, \N{whatever} is a quoted, which is a char, which is an element, so one should be able to use it as a range endpoint.

ICU treats it as a synonym for \p{Name=whatever}, which is a unicodeSet, so what should be a range is instead parsed as a set difference (equal to its LHS in practice, since the two single-character sets are disjoint).

That also means that \N{nonexistent name} is silently empty.
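To make the two readings concrete, here is a toy model in plain Java. It does not use ICU: the class and method names are hypothetical, and `Character.codePointOf` (Java 9+) stands in for the name lookup. It only illustrates why a set difference between two single-character sets collapses to the left-hand side, and why an unknown name goes unnoticed under the set reading.

```java
import java.util.TreeSet;

/** Toy model of the two readings of [\N{X} - \N{Y}]; hypothetical, not ICU code. */
final class NameEscapeReadings {
    /** \N{name} read as a char: an unknown name is an error. */
    static int asChar(String name) {
        // Throws IllegalArgumentException if no character has this name.
        return Character.codePointOf(name);
    }

    /** \N{name} read as \p{Name=name}: a set with zero or one element. */
    static TreeSet<Integer> asSet(String name) {
        TreeSet<Integer> set = new TreeSet<>();
        try {
            set.add(Character.codePointOf(name));
        } catch (IllegalArgumentException e) {
            // Unknown name: the set is silently left empty.
        }
        return set;
    }

    /** UTS #35 reading: char '-' char is a range. */
    static TreeSet<Integer> range(String first, String last) {
        TreeSet<Integer> set = new TreeSet<>();
        int end = asChar(last);
        for (int cp = asChar(first); cp <= end; ++cp) {
            set.add(cp);
        }
        return set;
    }

    /** Set reading: set '-' set is a difference. */
    static TreeSet<Integer> difference(String lhs, String rhs) {
        TreeSet<Integer> set = asSet(lhs);
        set.removeAll(asSet(rhs));
        return set;
    }

    public static void main(String[] args) {
        String a = "LATIN SMALL LETTER A";
        String z = "LATIN SMALL LETTER Z";
        System.out.println(range(a, z).size());      // 26
        System.out.println(difference(a, z).size()); // 1: only the left-hand side
        System.out.println(asSet("MEOW").isEmpty()); // true
    }
}
```

The Latin letters are used instead of Garay only because the JDK running the sketch may predate the Unicode version that encodes Garay.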

We should fix it in ICU, but we urgently need to fix it here for UCD development, because we have already been bitten by it, with no-one noticing; the following only tests the first letter of the Garay alphabet:

unicodetools/unicodetools/src/main/resources/org/unicode/text/UCD/AdditionComparisons.txt

Lines 27 to 34 in cccbe93

    
    # Garay is a right-to-left cased script:
    Propertywise [\N{GARAY SMALL LETTER A} - \N{GARAY SMALL LETTER OLD NA}]
               : [\N{GARAY CAPITAL LETTER A} - \N{GARAY CAPITAL LETTER OLD NA}]
    CorrespondTo [\N{OLD HUNGARIAN SMALL LETTER A}]
               : [\N{OLD HUNGARIAN CAPITAL LETTER A}]
         UpTo: Block             (Garay vs Old_Hungarian),
               Script            (Garay vs Old_Hungarian),
               Script_Extensions (Garay vs Old_Hungarian)

markusicu · 2024-08-09T17:15:56Z

test failure: https://github.com/unicode-org/unicodetools/actions/runs/10322995005/job/28579439631?pr=908

ParseErrorCount=6
TestFailureCount=0
Error:  Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 63.737 s <<< FAILURE! - in org.unicode.text.UCD.TestTestUnicodeInvariants
Error:  org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants  Time elapsed: 63.719 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: TestUnicodeInvariants.testInvariants(default) failed ==> expected: <0> but was: <6>
	at org.unicode.text.UCD.TestTestUnicodeInvariants.testUnicodeInvariants(TestTestUnicodeInvariants.java:39)

unicodetools/src/main/java/org/unicode/text/UCD/TestUnicodeInvariants.java

markusicu · 2024-08-09T17:30:20Z

unicodetools/src/test/java/org/unicode/text/UCD/TestTestUnicodeInvariants.java

+                        ParseException.class,
+                        () ->
+                                TestUnicodeInvariants.parseUnicodeSet(
+                                        "TEST [\\N{MEOW}]", new ParsePosition(5)));


i see a character encoding proposal brewing ;-)
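For readers who want to see the behaviour this test expects in isolation, here is a minimal sketch of a strict `\N{...}` reader that throws a `ParseException` instead of yielding an empty set. It is not the unicodetools implementation: `StrictNameEscape` and its `parse` method are hypothetical, the name lookup is delegated to the JDK's `Character.codePointOf` (Java 9+), and, unlike `parseUnicodeSet`, it expects the position to point at the backslash rather than at the opening bracket.

```java
import java.text.ParseException;
import java.text.ParsePosition;

/** Hypothetical strict \N{...} reader; a sketch, not the unicodetools code. */
final class StrictNameEscape {
    /**
     * Reads \N{name} starting at pos and returns its code point, advancing
     * pos past the closing brace. Unknown names are reported as errors.
     */
    static int parse(String source, ParsePosition pos) throws ParseException {
        int start = pos.getIndex();
        if (!source.startsWith("\\N{", start)) {
            throw new ParseException("Expected \\N{", start);
        }
        int close = source.indexOf('}', start);
        if (close < 0) {
            throw new ParseException("Unterminated \\N{", start);
        }
        String name = source.substring(start + 3, close);
        try {
            int codePoint = Character.codePointOf(name);
            pos.setIndex(close + 1);
            return codePoint;
        } catch (IllegalArgumentException e) {
            // Point the error at the first character of the bad name.
            throw new ParseException("No character named " + name, start + 3);
        }
    }

    public static void main(String[] args) throws ParseException {
        ParsePosition pos = new ParsePosition(6);
        System.out.println(parse("TEST [\\N{DIGIT FOUR}]", pos)); // 52, i.e. U+0034
        try {
            parse("TEST [\\N{MEOW}]", new ParsePosition(6));
        } catch (ParseException e) {
            System.out.println(e.getMessage()); // No character named MEOW
        }
    }
}
```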

markusicu · 2024-08-09T17:34:17Z

Is there an ICU ticket? Would you be willing to work on it?

markusicu · 2024-08-09T17:35:32Z

@aheninger FYI

markusicu · 2024-08-09T17:37:21Z

The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

eggrobin · 2024-08-09T17:41:16Z

> Is there an ICU ticket?

Not yet. Could you file one?

> Would you be willing to work on it?

Yes, if you file the ticket please assign it to me.

markusicu · 2024-08-09T18:01:11Z

Actually, at a closer look, while there is some documentation for ICU regex, there is no documentation for \N for UnicodeSet:

Trying this in an old C++ UnicodeSet demo gives a weird result: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFFF&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetk.x=12&gosetk.y=23

I think this is not an ICU bug because ICU UnicodeSet does not support \N.

In the unicodetools, it must be handled by Mark's SymbolTable implementation. And that probably has no way of distinguishing between "set" and "character" results -- looking at https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.XSymbolTable.html

markusicu · 2024-08-09T18:09:51Z

Actually, clicking a different button on the old demo seems to recognize \N: https://icu4c-demos.unicode.org/icu-bin/ubrowse?go=FFF0&us=%5B%5CN%7BDIGIT+FOUR%7D%5D&gosetn.x=17&gosetn.y=17

Could you please try this in vanilla ICU4J UnicodeSet, without the Unicode Tools machinery?

eggrobin · 2024-08-09T18:59:00Z

@markusicu wrote:

> I think this is not an ICU bug because ICU UnicodeSet does not support \N.

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3741-L3751

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3791

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3802-L3812

https://github.com/unicode-org/icu/blob/b5b3e16afac61f9aa9b775aaf497f8cc88ce9481/icu4j/main/core/src/main/java/com/ibm/icu/text/UnicodeSet.java#L3835-L3850

eggrobin · 2024-08-09T19:02:57Z

Support for \N was added on 2002-08-28, in unicode-org/icu@681c046, for ICU-1767.

markusicu · 2024-08-09T20:17:35Z

Next thing to check in ICU behavior is whether we allow a subtraction like CHARACTER_CLASS '-' LITERAL. (Using UTS18 syntax rule names here.) If we do, then turning \N into a LITERAL would be harmless. If we don't, then we could potentially break existing patterns.

--> ICU-22851 “UnicodeSet should treat \N like a character not like a set”

markusicu

there are still CI failures

unicodetools/src/main/java/org/unicode/text/UCD/TestUnicodeInvariants.java

aheninger · 2024-08-09T21:09:50Z

> The ICU User Guide does not go into details about \N: https://unicode-org.github.io/icu/userguide/strings/regexp.html

For ICU4C regex, \N{whatever} matches exactly one code point. The whatever string, delimited by the braces, is passed to

    UChar32 theChar = u_charFromName(U_UNICODE_CHAR_NAME, name, fStatus);

to get the character.
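As a point of comparison only, the JDK's own `java.util.regex` (Java 9+) follows the same "one name, one code point" model. The sketch below uses the JDK rather than ICU, so it says nothing about ICU's implementation; `NamedCharRegexDemo` and `matchesNamed` are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;

/** JDK-only illustration of \N{name} matching a single code point; not ICU. */
final class NamedCharRegexDemo {
    /** True if the whole input is exactly the character with the given name. */
    static boolean matchesNamed(String name, String input) {
        return Pattern.matches("\\N{" + name + "}", input);
    }

    public static void main(String[] args) {
        // The JDK counterpart of a name-to-code-point lookup.
        System.out.println(Character.codePointOf("DIGIT FOUR") == '4'); // true
        System.out.println(matchesNamed("DIGIT FOUR", "4"));            // true
        System.out.println(matchesNamed("DIGIT FOUR", "44"));           // false
    }
}
```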

markusicu

lgtm tnx

Work around a UnicodeSet bug

853ca2c

eggrobin requested review from macchiati and markusicu August 9, 2024 17:09

Fix tests that relied on \N being a set

1dd90c5

markusicu reviewed Aug 9, 2024

After markus’s review

737784e

markusicu reviewed Aug 9, 2024

eggrobin and others added 4 commits August 9, 2024 23:13

more []

619428d

Comment.

708a46d

Co-authored-by: Markus Scherer <[email protected]>

Try accepting Markus’s comment again…

63388e6

Co-authored-by: Markus Scherer <[email protected]>

No class fields were harmed in the making of this list

357bde6

markusicu approved these changes Aug 9, 2024

eggrobin merged commit 5845839 into unicode-org:main Aug 9, 2024
16 checks passed
