Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22898 MF2: fix various parser bugs and add more tests #3092

Merged

Conversation

catamorphism
Copy link
Contributor

@catamorphism catamorphism commented Aug 8, 2024

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22898
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@mihnita
Copy link
Contributor

mihnita commented Aug 9, 2024

Sorry, the sharing of the test files PR started when I was in vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.


I general having Maven consume files from outside it's own folder tree is hacky and error prone.
And when it finally works, we discover that it breaks the import to various IDEs.
I think that we would want to support at least Eclipse, IntelliJ, VS Code.

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore)
I'm working on that...


The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times.
Having them work the same is in general a good thing, of course.

But becomes a PITA when something changes and we need to update the code.
Because (often) the C++ and Java devs might not be the same.

This PR being an example.


TLDR: I am tempted to keep two copies of the test data.
There is an ant task in tools/cldr/ that (copy-cldr-testdata) that copies some test data from CLDR.

These json files are probably in the same bucket: they live in CLDR, but must be tested against in ICU.

But the idea is that the ant task can copy the files in two places, icu4c and icu4j.
If code breaks, but one has time / expertise for C/C++ only, they can open a ticket against icu4j, priority zero,
but revert the test files in icu4j until that issue is fixed.

I don't know if that is a good idea or not.
But this is what we already do for other tests.

Here is a fragment of the ant script:

        <fileset id="cldrTestData" dir="${cldrDir}/common/testData">
            <!-- Add directories here to control which test data is installed. -->
            <include name="localeIdentifiers/**"/> <!-- ... -->
            <include name="personNameTest/**"/> <!-- Used in ExhaustivePersonNameTest -->
            <include name="units/**"/> <!-- Used in UnitsTest tests -->
       </fileset>

        <copy todir="${testDataDir4C}">
            <fileset refid="cldrTestData"/>
        </copy>
        <copy todir="${testDataDir4J}">
            <fileset refid="cldrTestData"/>
        </copy>

We copy the same test files in icu4c/source/test/testdata/cldr and icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr

@mihnita
Copy link
Contributor

mihnita commented Aug 9, 2024

Try:

diff -r icu4c/source/test/testdata/cldr icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr

There are 3 sub-folders there: localeIdentifiers, personNameTest, units
I would propose we add another one (messageformat2) and do the same thing.

@@ -62,10 +62,22 @@ int readCodePoint() {
return c;
}

// Backup a number of characters.
// Backup a number of code points.
void backup(int amount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method works on code units (Java char), not code points.

That's because int readCodePoint() also works on code units.
It returns a code point, but the current offset in the input (cursor) is expresses in code units.
So when it finds a code point above BMP it advances with 2

Reasons:

  • When one wants to parse something from an offset, it is faster to express the offset in the same kind of units we use for storage. We store using char, offset in code units. Otherwise we must iterate all the way to the offset in code points. Our input buffer uses char
  • Easier to deal with not-properly paired surrogates. The spec does not try to enforce the surrogate correctness, and I don't think it is the job of the MF2 to do that. If I pass ".....\uDC00...{$user}! there is no reason to not format to ".....\uDC00...John!

Even if there is disagreement about the reasons, and we decide to express offsets in code points, this fix is not sufficient.

We would also have to change getPosition(), skip(int amount), gotoPosition(int position)
Because we either work on code points, or on code units, we should not mix and match.

Copy link
Contributor Author

@catamorphism catamorphism Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I see that the issue isn't really with backup(), but how it's used in conjunction with readCodePoint() (reading a code point that happens to be a wide char, and then calling backup(1) to "push it back", is a bug).

In eee547c I reverted the change to backup() and fixed the three places I could find where readCodePoint() and backup() were being used in that way.

}
}

void backupOneCodePoint() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it is an unpaired low surrogate?
There is nothing to prevent such a thing from showing up in a message.

True, it is incorrect.
But we are not trying to reject that, we accept it.
And this code would misbehave.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we need a test for this! (I'll add one.)

Copy link
Contributor

@mihnita mihnita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, thank you very much for taking a stab at updating the Java implementation to the latest spec.
I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

I added some comments.

But if there is a decisions to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Or you can continue it, as you already did a big chunk of it (all?)
Your call.

Thanks again,
M

if (nr != null) {
return nr;
}
// "The resolution of a text or literal MUST resolve to a string."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this change breaks selection as it is currently implemented.
Because it means that "1.0" does not match "1.00"

Might be desired, to not match, I don't know.
But any change here should be accompanied by a change in the plural matcher.

One might say that this is desired.
But the format is locale sensitive.
For example formatting 1 dollar I might get "1.00 dollars", and does not match |1|.
But formatting 1 Japanese yen will result in "1 ...", and that matches |1|.

So the developer using |1| in their message creates something that behaves differently on different locales.

I proposed to interpret |1| as a string, and 1 as a number, and it is not resolved:
unicode-org/message-format-wg#712

And Eemeli argued for it recently:
unicode-org/message-format-wg#842

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the plural matcher has to be changed? This change is to address situations where a number literal isn't annotated, like:

{
    "src": "Format {123456789.9876} number",
    "locale": "en-IN",
    "exp": "Format 123456789.9876 number",
    "comment": "Number literals are not formatted as numbers by default"
 }

This change doesn't change the behavior for .match on something with a :number annotation, or at least, there aren't any tests that show that.

@@ -712,7 +721,7 @@ private MFDataModel.Declaration getDeclaration() throws MFParseException {
MFDataModel.Expression expression;
switch (declName) {
case "input":
skipMandatoryWhitespaces();
skipOptionalWhitespaces();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule is input-declaration = input [s] variable-expression
So the space is mandatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so? If it was mandatory, it would say input-declaration = input s variable-expression (without the square brackets.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not delete these extra tests.
They cover functionality not covered by the tests in the spec.
And that can't be covered by the spec.

They test that the implementation works end-to-end.

Some might be moved to the standard (CLDR).

But some can't, because they test that custom functions work, or that icu:skeletons work (namespaced),
or that various types work (Date, Calendar, java.time, or even long with :dateformat.

And that the type "magically selects" the proper function (For example "...{$exp}..." formats as a date if the exp is a date-like type (Date / Calendar / java.time)

This comment also (or mostly?) apply to the other deleted files (FunctionsTests, IcuFunctionsTest)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be less confusing to look at #3063 first. This PR includes that PR, plus additional changes.

In the description for #3063 I explained that I didn't delete any tests (unless they were duplicates), but rather, moved all the JSON filenames being read into a single file, because now the schema is the same for all the tests, so the same code can be re-used to read all of the files.

@@ -29,16 +29,5 @@ public void testNullInput() throws Exception {
MFParser.parse(null);
}

@Test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not much / anything left if we remove this.

So the same comment that I had for deleted files applies: do we really want to delete extra tests?
I think that the more tests we have the better.
Even some don't come from the official CLDR spec.

We can separate these tests in a different folder for cleanliness.
But not delete them, unless they are redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about deleting tests.

@catamorphism
Copy link
Contributor Author

Sorry, the sharing of the test files PR started when I was in vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.

I general having Maven consume files from outside it's own folder tree is hacky and error prone. And when it finally works, we discover that it breaks the import to various IDEs. I think that we would want to support at least Eclipse, IntelliJ, VS Code.

Of course; is there a way to add automated testing that changes don't break IDEs?

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore) I'm working on that...

The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times. Having them work the same is in general a good thing, of course.

But becomes a PITA when something changes and we need to update the code. Because (often) the C++ and Java devs might not be the same.

This PR being an example.

TLDR: I am tempted to keep two copies of the test data. There is an ant task in tools/cldr/ that (copy-cldr-testdata) that copies some test data from CLDR.

I don't want to unilaterally say yes or no to this; is it something to discuss in the TC meeting, maybe? (Note: I'll be on vacation from early next week until September 2, so I won't be at the next few meetings, but I can go back and read minutes later.)

@catamorphism
Copy link
Contributor Author

catamorphism commented Aug 9, 2024

First, thank you very much for taking a stab at updating the Java implementation to the latest spec. I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

Right, first it was precipitated by sharing test files; then I wrote a random test generator and decided it was only fair to run ICU4J on some of the generated tests as well as ICU4C, so that's where some of the other changes came from.

I added some comments.

But if there is a decisions to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Certainly (like I said in another comment, I think that has to be discussed with the TC).

I think I've addressed all your comments, but let me know if I missed anything!

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_data_model.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_function_registry.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_macros.h is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/messageformat2_serializer.cpp is different
  • icu4c/source/i18n/unicode/messageformat2_data_model.h is no longer changed in the branch
  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DataModelErrorsTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DefaultTestProperties.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FirstReleaseTests.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FunctionsTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MF2Test.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Param.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/ParserSmokeTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SelectorsWithVariousArgumentsTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Sources.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/StringToListAdapter.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SyntaxErrorsTest.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/TestUtils.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Unit.java is no longer changed in the branch
  • testdata/message2/alias-selector-annotations.json is no longer changed in the branch
  • testdata/message2/duplicate-declarations.json is no longer changed in the branch
  • testdata/message2/icu-parser-tests.json is no longer changed in the branch
  • testdata/message2/icu-test-functions.json is no longer changed in the branch
  • testdata/message2/icu-test-previous-release.json is no longer changed in the branch
  • testdata/message2/icu-test-selectors.json is no longer changed in the branch
  • testdata/message2/invalid-number-literals-diagnostics.json is no longer changed in the branch
  • testdata/message2/invalid-options.json is no longer changed in the branch
  • testdata/message2/markup.json is no longer changed in the branch
  • testdata/message2/matches-whitespace.json is no longer changed in the branch
  • testdata/message2/more-data-model-errors.json is no longer changed in the branch
  • testdata/message2/more-functions.json is no longer changed in the branch
  • testdata/message2/more-syntax-errors.json is no longer changed in the branch
  • testdata/message2/README.txt is no longer changed in the branch
  • testdata/message2/reserved-syntax.json is no longer changed in the branch
  • testdata/message2/resolution-errors.json is no longer changed in the branch
  • testdata/message2/runtime-errors.json is no longer changed in the branch
  • testdata/message2/spec/data-model-errors.json is no longer changed in the branch
  • testdata/message2/spec/functions/date.json is no longer changed in the branch
  • testdata/message2/spec/functions/datetime.json is no longer changed in the branch
  • testdata/message2/spec/functions/integer.json is no longer changed in the branch
  • testdata/message2/spec/functions/number.json is no longer changed in the branch
  • testdata/message2/spec/functions/string.json is no longer changed in the branch
  • testdata/message2/spec/functions/time.json is no longer changed in the branch
  • testdata/message2/spec/syntax-errors.json is no longer changed in the branch
  • testdata/message2/spec/test-core.json is no longer changed in the branch
  • testdata/message2/spec/test-functions.json is no longer changed in the branch
  • testdata/message2/syntax-errors-diagnostics-multiline.json is no longer changed in the branch
  • testdata/message2/syntax-errors-end-of-input.json is no longer changed in the branch
  • testdata/message2/tricky-declarations.json is no longer changed in the branch
  • testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism changed the title DRAFT: MF2: fix various parser bugs and add more tests DRAFT: ICU-22898: MF2: fix various parser bugs and add more tests Sep 18, 2024
@catamorphism catamorphism marked this pull request as ready for review September 18, 2024 18:53
@catamorphism catamorphism changed the title DRAFT: ICU-22898: MF2: fix various parser bugs and add more tests ICU-22898: MF2: fix various parser bugs and add more tests Sep 18, 2024
@catamorphism
Copy link
Contributor Author

catamorphism commented Sep 18, 2024

I've rebased this PR, so it's now ready for review.

Note that this includes several fixes related to unsupported expression/statement (reserved) parsing. Recently these features were removed from the spec. However, since the recent spec changes haven't been integrated yet, I'd prefer to land these changes. I'm open to suggestions, though. Changed my mind; since reserved syntax was removed from ICU4J already, I've removed reserved-related changes from this PR.

@catamorphism
Copy link
Contributor Author

/cc @echeran

@catamorphism catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from 2b5e263 to bf9d4c5 Compare September 19, 2024 02:22
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is now changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
  • testdata/message2/reserved-syntax-2.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_serializer.cpp is different
  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
  • testdata/message2/reserved-syntax-2.json is no longer changed in the branch
  • testdata/message2/syntax-errors-reserved.json is now changed in the branch
  • testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Contributor

@mihnita mihnita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you,
Mihai

ICU4C: Escape curly braces when serializing and normalizing
ICU4C: Escape '|' in patterns
ICU4C: When normalizing input, escape optionally-escaped characters in patterns
ICU4C/ICU4J: Allow trailing whitespace after a match
ICU4C: Fix parser to iterate over code points, not code units
Add tests with old reserved syntax as syntax-error tests
@catamorphism catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from 8c51816 to 4b376f1 Compare September 20, 2024 22:32
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism changed the title ICU-22898: MF2: fix various parser bugs and add more tests ICU-22898 MF2: fix various parser bugs and add more tests Sep 20, 2024
@catamorphism catamorphism merged commit 8f82fac into unicode-org:main Sep 20, 2024
100 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants