Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21757 Bring in CollectionUtilities, depend only on icu4j #810

Merged
merged 6 commits into from
May 9, 2024
Merged

ICU-21757 Bring in CollectionUtilities, depend only on icu4j #810

merged 6 commits into from
May 9, 2024

Conversation

mihnita
Copy link
Contributor

@mihnita mihnita commented May 7, 2024

No description provided.

@mihnita mihnita marked this pull request as draft May 7, 2024 18:33
@mihnita mihnita marked this pull request as ready for review May 7, 2024 19:25
srl295
srl295 previously approved these changes May 7, 2024
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mihai and I are working on the TestCodeInvariants test failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Consider a follow-up PR to remove things from here which we don't need in the Unicode Tools.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Consider a follow-up PR to remove things from here which we don't need in the Unicode Tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

There are 90+ instances of CollectionUtilities.join, I didn't want to clutter this CL with it.

echeran
echeran previously approved these changes May 7, 2024
Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@markusicu
Copy link
Member

More test failures.

  • org.unicode.jsptest.TestProperties.TestCCC
  • org.unicode.jsptest.TestUnicodeSet.TestNF

I suspect that this test code also uses an ICU normalizer object, and needs adjustments for going from Unicode 15 to Unicode 16.
@eggrobin @macchiati FYI

@mihnita
Copy link
Contributor Author

mihnita commented May 8, 2024

Error: org.unicode.jsptest.TestProperties.TestCCC Time elapsed: 0.077 s <<< FAILURE!
[:ccc=/3/:] missing: [ࢗ𐵩-𐵭𞗮]

The missing regexp is this (escaped): [\u{897}\u{10D69}-\u{10D6D}\u{1E5EE}]

0897 => ARABIC PEPET
10D69-10D6D => Garay (Garay block: 10D40..10D8F)
1E5EE => Ol Onal (Ol Onal block: 1E5D0..1E5FF)

All added in UTF-16.

@mihnita
Copy link
Contributor Author

mihnita commented May 8, 2024

TestUnicodeSet.TestNF

[:isnfd:] == [:nfdqc!=N:] unexpected: [\u{105C9}\u{105E4}\u{11383}\u{11385}\u{1138E}\u{11391}\u{113C5}\u{113C7}\u{113C8}\u{16121}-\u{16128}\u{16D68}-\u{16D6A}]
[:isnfd:] == [:nfdqc!=N:][[:isnfd:]]!=[[:nfdqc!=N:]]
[:isnfkd:] == [:nfkdqc!=N:] unexpected: [\u{105C9}\u{105E4}\u{11383}\u{11385}\u{1138E}\u{11391}\u{113C5}\u{113C7}\u{113C8}\u{16121}-\u{16128}\u{16D68}-\u{16D6A}\u{1CCD6}-\u{1CCF9}]
[:isnfkd:] == [:nfkdqc!=N:][[:isnfkd:]]!=[[:nfkdqc!=N:]]
[:isnfkc:] == [:nfkcqc!=N:] unexpected: [\u{1CCD6}-\u{1CCF9}]
[:isnfkc:] == [:nfkcqc!=N:][[:isnfkc:]]!=[[:nfkcqc!=N:]] ==> expected: <0> but was: <6>

@eggrobin
Copy link
Member

eggrobin commented May 8, 2024

Wow, we are checking normalization properties that are typically from the previous version (via ICU) in TestCodeInvariants, of all places? This is terrifying.

@markusicu
Copy link
Member

org.unicode.jsptest.TestProperties.TestCCC

https://github.com/unicode-org/unicodetools/blob/main/UnicodeJsps/src/test/java/org/unicode/jsptest/TestProperties.java
line 286

This is not really a test of properties per se. It's a test that checks that you can use a text regex in a JSP “Unicode Utilities: UnicodeSet” pattern with a partial-value-string match. Concretely, it checks that [:ccc=/3/:] works by looking at the decimal string for each integer 0..255, and if that contains '3' then adding the set of ccc=integer.

It creates each of these UnicodeSets via a simple pattern string, which means that it uses the Unicode ccc data built into ICU.

What I find baffling is that it is failing now. It seems like it should have failed when the tools depended on ICU 72 and its Unicode 15.0, but the tools themselves had moved on to draft 16.0.

In principle, this test should get the ccc=integer set from the Unicode Tools, not from ICU, but that would also use draft 16.0 data, so roughly the same as what draft ICU 76 has.

@macchiati What's going on?

@eggrobin
Copy link
Member

eggrobin commented May 8, 2024

@macchiati What's going on?

I am not @macchiati, but I think I know what is going on.

In principle, this test should get the ccc=integer set from the Unicode Tools, not from ICU,

Now it should, yes.

but that would also use draft 16.0 data

No, because by default the JSPs use the last (released) version, not the latest (dev) version.

If you change line 290 to String test = "[:Udev:ccc=/3/:]"; it should unbreak for now (but obviously the ICU coupling on line 296 needs to go away eventually otherwise it will blow up again).

@markusicu
Copy link
Member

but that would also use draft 16.0 data

No, because by default the JSPs use the last (released) version, not the latest (dev) version.

🤦

And we didn't notice this version skew when we moved to Unicode 15.1 (while ICU remained stuck on 72/15.0) because there happened to not be any new characters with interesting normalization properties.

If you change line 290 to String test = "[:Udev:ccc=/3/:]"; it should unbreak for now (but obviously the ICU coupling on line 296 needs to go away eventually otherwise it will blow up again).

Yeah, but that would change the intent of the test.

For now, I propose that @mihnita disable this test case, as well as TestUnicodeSet.TestNF (which also fails for Unicode 16 characters), via a test method annotation like @Disabled("stop using ICU for properties: version skew")

@eggrobin
Copy link
Member

eggrobin commented May 8, 2024

And we didn't notice this version skew when we moved to Unicode 15.1 (while ICU remained stuck on 72/15.0) because there happened to not be any new characters with interesting normalization properties.

It might also have been ICU on both sides at the time; I moved some things to the tools by default recentlyish.

Yeah, but that would change the intent of the test.

Not meaningfully; we are not testing the version match (which cannot work), we are testing that the JSPs support regexes in Unicode sets.
I am happy to do a pass to clean it up afterwards (the fact that it cannot work is a little bit silly), but if possible I would feel better about tweaking them to make them green than disabling them until I can get around to them.
At least if they run, we will notice when they start rotting too much again…

@markusicu
Copy link
Member

Yeah, but that would change the intent of the test.

Not meaningfully; we are not testing the version match (which cannot work), we are testing that the JSPs support regexes in Unicode sets.

This test method should compare a set pattern processed like the JSPs do with more basic properties fetching for the same version of Unicode. I would leave the version out of the pattern and (in a future fix) fetch the properties for Settings.LAST_VERSION. I am also fine with changing this test method to use String test = "[:Udev:ccc=/3/:]";.

However, I don't see such an easy, quick fix for TestUnicodeSet.TestNF -- I didn't try to figure out which part of the code calls ICU vs. which part uses tools properties. In order not to block this PR here, I suggest we do disable that one for now.

@eggrobin
Copy link
Member

eggrobin commented May 9, 2024

This test method should compare a set pattern processed like the JSPs do with more basic properties fetching for the same version of Unicode.

Yes. Dependencies on ICU data in here are madness. But I am OK with fixing this one hackily now and cleanly later.

However, I don't see such an easy, quick fix for TestUnicodeSet.TestNF -- I didn't try to figure out which part of the code calls ICU vs. which part uses tools properties. In order not to block this PR here, I suggest we do disable that one for now.

Oh this one is a mess, because it will involve the much nastier question « which tool properties ? ».
And I do not want to subject Mihai to that much madness, so I am OK with disabling TestNF (while keeping TestCCC alive, hackily if needed).

@mihnita
Copy link
Contributor Author

mihnita commented May 9, 2024

Thank you both!

Implemented the :Udev and @Disabled, as suggested.
Tests now pass locally.

Mihai

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah!

@markusicu
Copy link
Member

@mihnita do you want me to squash-and-merge for you?

@mihnita
Copy link
Contributor Author

mihnita commented May 9, 2024

@mihnita do you want me to squash-and-merge for you?

Yes please. I don't have the right.
Thank you,
M

@markusicu markusicu merged commit 3e8adc4 into unicode-org:main May 9, 2024
15 checks passed
@mihnita mihnita deleted the mihai_bing_collectionutilities branch May 9, 2024 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants