Review Apple Morphology API for use in Inflection project #33

nciric · 2024-06-17T20:09:40Z

nciric
Jun 17, 2024
Maintainer

This is the surface for the review (from George's email to the group):

Here are previous presentations that involve this wrapper code:

UTW Automatic Grammar Agreement in Message Formatting
WWDC23: Unlock the power of grammatical agreement | Apple

I'll add my input to this thread as I go through the APIs.

nciric · 2024-06-17T23:41:27Z

nciric
Jun 17, 2024
Maintainer Author

Let's start from the use case of supporting Message Format 2.0 as the next localization standard. That doesn't take us far from Apple's AttributedString. Both formats annotate the parts of a string that need grammatical support (plurals, gender, etc.), just in technically different ways, and both require runtime information to be passed in (names, values, etc.).

From the link above, the usage pattern for MF2.0 in Java is:

// 1. Build the immutable MF2.0 with a message and locale provided.
MessageFormatter mf2 = MessageFormatter.builder()
     .setPattern("{Hello {$name :inflect case=vocative number=singular},"
                 +"your card expires on {$exp :datetime skeleton=yMMMdE}!}")
     .setLocale("en-GB")
     .build();

// 2. Get the runtime arguments.
Map<String, Object> arguments = new HashMap<>();
arguments.put("name", "John");
arguments.put("exp", new Date(1679971371000L));

// 3. Resolve arguments and format the string for UI rendering.
assertEquals(
     "Hello John, your card expires on Mon, 27 Mar 2023!",
     mf2.formatToString(arguments));

The placeholder {$name :inflect case=vocative number=singular} says that the caller has to pass a $name value to formatToString(), which then calls the custom inflect function.

The default parameters for inflect are set to case=vocative number=singular and can be changed/removed by translators to match the target language.

Since arguments can carry any Java Object, we can define an option bag, like Apple's Morphology, to pass additional fields or override defaults. An additional benefit of an option bag is that it allows a variable number of parameters, as needed by a language, e.g. attributes of the noun from a dictionary, such as animate/inanimate.

My proposal:

Apple's InflectionRule becomes the inflect function. It takes a word, in this case $name, and additional parameters (an option bag) in the form of Morphology.
The internals of Morphology are TBD, but it should have case, number, gender, etc.
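
A minimal sketch of the proposal above, with hypothetical names throughout: the Morphology class here is a plain Java option bag (not Apple's Foundation type), and inflect is a stand-in that only reports the resolved options instead of using real inflection data. It shows how the defaults from the pattern (case=vocative number=singular) could be merged with overrides passed at runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical option bag; category names and values are free-form strings. */
final class Morphology {
    private final Map<String, String> grammemes = new LinkedHashMap<>();

    /** Sets one grammatical category, e.g. with("case", "vocative"). */
    Morphology with(String category, String value) {
        grammemes.put(category, value);
        return this;
    }

    String get(String category) {
        return grammemes.get(category);
    }

    /** Returns a new bag in which the values from overrides win over this bag's values. */
    Morphology mergedWith(Morphology overrides) {
        Morphology merged = new Morphology();
        merged.grammemes.putAll(this.grammemes);
        merged.grammemes.putAll(overrides.grammemes);
        return merged;
    }

    @Override
    public String toString() {
        return grammemes.toString();
    }
}

final class InflectSketch {
    /**
     * Stand-in for the custom inflect function (an assumption, not a real API).
     * It does no inflection; it appends the resolved options to the word
     * so that the merge of defaults and overrides is visible.
     */
    static String inflect(String word, Morphology patternDefaults, Morphology runtimeOverrides) {
        Morphology resolved = patternDefaults.mergedWith(runtimeOverrides);
        return word + resolved;
    }
}
```

Because the bag is keyed by category name, a language that needs an extra attribute such as animacy can add it without changing the signature of inflect.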

1 reply

macchiati Jun 18, 2024
Maintainer

Sounds like a pretty clean mapping between the APIs

grhoten · 2024-06-18T04:21:15Z

grhoten
Jun 18, 2024
Maintainer

I do want to make the following existing, less-than-ideal choices very clear. They have stuck around for backwards compatibility and stability reasons.

The code contribution uses "count" instead of "number" for the grammatical category called grammatical number, which is similar to, but not exactly the same as, grammatical count. Grammatical number is what is really meant.

The Morphology.GrammaticalNumber made the unfortunate choice of conflating CLDR plural rules with grammatical number. When you get to a language like Arabic or Russian, this distinction becomes very clear.

I recommend that "number" is used to refer to grammatical number, and it should not refer to the CLDR plural rules. CLDR plural rules should be referenced completely separately. Those CLDR plural rules can reference combinations of case, animacy, number and so forth, but not the other way around.

To highlight the importance of this distinction, please see the table in this comment in CLDR-11981.
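
To illustrate why the two should stay separate, here is a small sketch for Russian nouns counted in a nominative context. The class and method names are hypothetical. The plural rule is hand-written for non-negative integers only (fractions, which fall under "other", are left out), and the case and number labels follow the traditional description of Russian numeral phrases. The plural category is the input, and the grammemes are derived from it, not the other way around.

```java
final class RussianCountedNounSketch {
    /** CLDR cardinal plural categories that Russian uses for integers. */
    enum PluralCategory { ONE, FEW, MANY }

    /** Hand-written version of the Russian integer plural rule; assumes n >= 0. */
    static PluralCategory pluralCategory(long n) {
        long mod10 = n % 10;
        long mod100 = n % 100;
        if (mod10 == 1 && mod100 != 11) {
            return PluralCategory.ONE;
        }
        if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14)) {
            return PluralCategory.FEW;
        }
        return PluralCategory.MANY;
    }

    /**
     * Case and grammatical number of the counted noun in a nominative context.
     * Note that FEW maps to a singular form, so the plural category
     * cannot double as the grammatical number.
     */
    static String nounGrammemes(PluralCategory category) {
        switch (category) {
            case ONE:
                return "nominative singular"; // 1 стол, 21 стол
            case FEW:
                return "genitive singular";   // 2 стола, 22 стола
            default:
                return "genitive plural";     // 5 столов, 11 столов
        }
    }
}
```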

0 replies

grhoten · 2024-06-18T04:31:27Z

grhoten
Jun 18, 2024
Maintainer

Another design choice that differs in this API is a concept called lemmaless inflection, which the public usage requires. The older implementation used lemma-based inflection, where you had to specify all of the grammemes to switch the lemma to the desired surface form. With lemmaless inflection, you start with any surface form (any cell in a declension table for a lemma) and modify only the relevant grammemes. So if you're using a word in a sentence, you may want to keep the case the same but change the gender or number of the word. This works well when each surface form is unique in spelling and easy to deduce. It becomes harder when there is ambiguity. In such cases, you have to guide the inflection by stating what the current form is and what the desired form is. Usually, this keeps the translation simpler: a translator who knows their language well, but may not know the English name for a given case, doesn't have to know all of the grammatical cases.
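
The idea can be sketched with a toy declension table. Everything here is an assumption made for illustration: the class names, the string-keyed grammemes, and the hard-coded table for German "Kind" (child). The sketch finds every cell that matches the given surface form, overrides only the requested grammemes, and looks up the resulting cell. When different source cells lead to different results, it asks for a constraint on the current form instead of guessing.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LemmalessSketch {
    /** One cell of a declension table: a surface form plus the grammemes it expresses. */
    static final class Cell {
        final String surface;
        final Map<String, String> grammemes;

        Cell(String surface, String grammaticalCase, String number) {
            this.surface = surface;
            this.grammemes = Map.of("case", grammaticalCase, "number", number);
        }
    }

    /** Toy table for German "Kind"; "Kind" itself fills three singular cells. */
    static final List<Cell> KIND = List.of(
            new Cell("Kind", "nominative", "singular"),
            new Cell("Kindes", "genitive", "singular"),
            new Cell("Kind", "dative", "singular"),
            new Cell("Kind", "accusative", "singular"),
            new Cell("Kinder", "nominative", "plural"),
            new Cell("Kinder", "genitive", "plural"),
            new Cell("Kindern", "dative", "plural"),
            new Cell("Kinder", "accusative", "plural"));

    /**
     * Changes only the grammemes in desired, starting from any surface form.
     * currentForm optionally narrows down which cell the surface form came from.
     */
    static String inflect(String surface, Map<String, String> currentForm, Map<String, String> desired) {
        Set<String> results = new LinkedHashSet<>();
        for (Cell source : KIND) {
            boolean matchesConstraints = source.grammemes.entrySet().containsAll(currentForm.entrySet());
            if (!source.surface.equals(surface) || !matchesConstraints) {
                continue;
            }
            Map<String, String> target = new HashMap<>(source.grammemes);
            target.putAll(desired); // keep everything else as it was
            for (Cell candidate : KIND) {
                if (candidate.grammemes.equals(target)) {
                    results.add(candidate.surface);
                }
            }
        }
        if (results.size() != 1) {
            throw new IllegalStateException("Ambiguous or unknown form: " + surface + " -> " + results);
        }
        return results.iterator().next();
    }
}
```

With this table, pluralizing "Kindes" needs no hint because it fills a single cell, while pluralizing "Kind" is ambiguous ("Kinder" or "Kindern") until the caller adds a constraint such as case=dative.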

0 replies

grhoten · 2024-06-18T04:35:30Z

grhoten
Jun 18, 2024
Maintainer

A design choice that remains in transition is the notion of semantic features. For example, you may want to add a definite article or a preposition to a word. These semantic features do not scale: you can't chain them, and each one is tied to specific prepositions and articles. What scales much better is the ability to use grammatical categories. You can chain them, and you don't have to handcraft rules for each semantic feature. That doesn't mean there isn't a need for semantic features, but I'd discourage exposing them.

0 replies

grhoten · 2024-06-18T04:53:16Z

grhoten
Jun 18, 2024
Maintainer

A design choice that remains incompatible with both MF2 and this public API is the notion of the spoken form. You have print and speak strings. When you have a quantity, like "1 kilometer" and "2 kilometers", you need to be able to disambiguate the pronunciation of the number in numerous languages. You need to be able to have a spoken and printed form. This is important for a VUI. It's not as important in a GUI where the reader can infer it while reading.

This code contribution can handle this situation, but it's not exposed in the API being referenced. MF2 can discard this information, but I'd prefer to keep that functionality around. This functionality relies heavily on ICU and CLDR RBNF. When you get to numbers, a data resource using Wikidata becomes unscalable, and you have to use RBNF.

Based on previous discussions, this design point was initially different between Google and Apple when naming rules for RBNF. It might be good to note this non-obvious point of difference.
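
As a rough sketch of what keeping both forms could look like, assuming a hypothetical SpeakableString type that is not part of MF2 or of the API under review: each formatted piece carries a print string and a speak string, and concatenation keeps the two in step. The German spellout of "1" is hard-coded for the nominative case here; a real implementation would get it from RBNF data.

```java
/** Hypothetical pair of printed and spoken forms for the same piece of text. */
final class SpeakableString {
    final String print;
    final String speak;

    SpeakableString(String print, String speak) {
        this.print = print;
        this.speak = speak;
    }

    /** Concatenates both forms in parallel so they never drift apart. */
    SpeakableString concat(SpeakableString other) {
        return new SpeakableString(print + other.print, speak + other.speak);
    }
}

final class SpokenFormSketch {
    /**
     * German "1" in the nominative case: printed the same way for every noun,
     * spoken as "eine" before a feminine noun and "ein" otherwise.
     */
    static SpeakableString one(String gender) {
        String spoken = gender.equals("feminine") ? "eine" : "ein";
        return new SpeakableString("1", spoken);
    }

    /** Builds "1 <unit>" with a spoken form that agrees with the unit's gender. */
    static SpeakableString quantityOfOne(String unit, String gender) {
        SpeakableString unitPart = new SpeakableString(" " + unit, " " + unit);
        return one(gender).concat(unitPart);
    }
}
```

A GUI would render only the print field, while a VUI or screen reader would consume the speak field.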

0 replies

nciric · 2024-06-18T17:42:51Z

nciric
Jun 18, 2024
Maintainer Author

@mihnita could you confirm my expectation on how custom functions and parameters work in MF2.0 or point me to the docs?

0 replies

macchiati · 2024-06-18T20:19:00Z

macchiati
Jun 18, 2024
Maintainer

Thanks for the clarifications, George. A few questions.

- The lemmaless inflections sound good. I'm curious how the disambiguation would work. If X could be the genitive of word X or the dative of word Y, how would the software disambiguate?
- For the definiteness, can your API treat that as an option to the inflection function, e.g. (with MF2 syntax) I see {$toy :inflect definiteness=definite count=other}. => "I see the teddy bears."
- The GUI vs VUI distinction is interesting. Is the issue that "1" needs to be inflected differently for speech? E.g., in German, where it would be pronounced as "ein" vs "eine" (or other forms in oblique cases).


1 reply

grhoten Jun 21, 2024
Maintainer

> Thanks for the clarifications, George. A few questions.
>
> The lemmaless inflections sound good. I'm curious how the disambiguation would work. If X could be the genitive of word X or the dative of word Y, how would the software disambiguate?

This is still up for discussion. The current code does not completely solve this part of the problem. Disambiguation based on part of speech is possible when provided as a constraint.

> For the definiteness, can your API treat that as an option to the inflection function, e.g. (with MF2 syntax) I see {$toy :inflect definiteness=definite count=other}. => "I see the teddy bears."

I think you really meant definiteness=definite count=plural. As mentioned earlier, usage of "other" is a misnomer specific only to the usage in Foundation. It gets mapped to plural, but that won't scale to all languages.

> The GUI vs VUI distinction is interesting. Is the issue that "1" needs to be inflected differently for speech? E.g., in German, where it would be pronounced as "ein" vs "eine" (or other forms in oblique cases).

Yes, that's a prime example. There are many ways to pronounce a number in many languages, especially for values of 1. Separation of the printed and spoken forms is ideal for inflection, but it's more important when retaining the known pronunciation of a word. For example, there are 3 ways of pronouncing Jorge. There's the Spanish way, the Portuguese way and the Portuguese way trying to blend in with an English environment. This isn't a made-up example. If I generate a message that says "Jorge's phone number is 1-408-098-7654", how would the genitive form of Jorge sound? Was that string of numbers pronounced as a phone number or as a math equation? If you only return a single string, this distinction becomes impossible to support. A VUI environment also includes accessibility scenarios with screen readers for blind & low vision people.

mihnita · 2024-06-20T20:04:39Z

mihnita
Jun 20, 2024
Maintainer

Although the spec proper does not mention grammatical inflections, they were always part of our use cases, and often used as examples / theoretical testing for the design.

I strongly believe that the existing extensibility mechanisms should be able to handle most inflections.

One mechanism is the one Nebojša described, with custom functions to format the placeholders.
We can also use custom selectors (similar to the old ICU select, but specialized for grammar)
The results from formatting to parts should also provide context to modify a result post-formatting (for example converting "la abeille" to "l'abeille" in French).
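
The post-formatting step could look roughly like this. The Part type below is a hypothetical stand-in for formatted parts, not ICU's actual API, and the elision rule is deliberately naive: it only turns "le" or "la" into "l'" before a placeholder that starts with a vowel, and ignores mute h and other French cases.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;

final class ElisionSketch {
    /** Hypothetical formatted part: literal text or the value of a placeholder. */
    static final class Part {
        final String text;
        final boolean placeholder;

        Part(String text, boolean placeholder) {
            this.text = text;
            this.placeholder = placeholder;
        }
    }

    /** Joins the parts, eliding a trailing "le " or "la " before a vowel-initial placeholder. */
    static String join(List<Part> parts) {
        StringBuilder out = new StringBuilder();
        for (Part part : parts) {
            if (part.placeholder && startsWithVowel(part.text)) {
                elideTrailingArticle(out);
            }
            out.append(part.text);
        }
        return out.toString();
    }

    private static boolean startsWithVowel(String text) {
        if (text.isEmpty()) {
            return false;
        }
        // NFD splits accented letters, so "é" is checked as a plain "e".
        String base = Normalizer.normalize(text, Normalizer.Form.NFD).toLowerCase(Locale.ROOT);
        return "aeiou".indexOf(base.charAt(0)) >= 0;
    }

    private static void elideTrailingArticle(StringBuilder out) {
        int n = out.length();
        if (n < 3) {
            return;
        }
        String tail = out.substring(n - 3).toLowerCase(Locale.ROOT);
        boolean atWordStart = n == 3 || out.charAt(n - 4) == ' ';
        if (atWordStart && (tail.equals("la ") || tail.equals("le "))) {
            out.setLength(n - 2); // keep the "l" (or "L"), drop the vowel and the space
            out.append('\'');
        }
    }
}
```

The point of working on parts instead of the final string is that the code knows where the placeholder value begins, so it only touches the article directly in front of it.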

We also played with the idea of agreement, between placeholders (one placeholder accessing grammatical properties of another placeholder)
For example something like: "The {$item} is {$color :grammar agreement='gender,number,case' with='item'}"

There was some opposition to that, because it would mean that a placeholder (in the example $color) would need to have access to another placeholder.

So this is not in the spec. But it is not explicitly forbidden either, so it should be possible to add it later, and it would not be against the spec.
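
To make the agreement idea concrete, here is a toy sketch. Nothing in it is part of the MF2 spec or of ICU: the Term type, the agree function and the tiny French adjective table are all assumptions. The adjective is resolved by looking up the gender of another argument, named by the with option, which is exactly the cross-placeholder access described above.

```java
import java.util.Map;

final class AgreementSketch {
    /** Hypothetical runtime argument that carries its own grammatical properties. */
    static final class Term {
        final String text;
        final String gender;

        Term(String text, String gender) {
            this.text = text;
            this.gender = gender;
        }
    }

    /** Toy French adjective forms, keyed by an English label and then by gender. */
    static final Map<String, Map<String, String>> ADJECTIVES = Map.of(
            "green", Map.of("masculine", "vert", "feminine", "verte"));

    /** Picks the adjective form that agrees with the argument named by with. */
    static String agree(String adjective, String with, Map<String, Term> arguments) {
        String gender = arguments.get(with).gender;
        return ADJECTIVES.get(adjective).get(gender);
    }

    /** Formats a fixed French pattern equivalent to "{$item} est {$color}." */
    static String format(Map<String, Term> arguments, String color) {
        Term item = arguments.get("item");
        return item.text + " est " + agree(color, "item", arguments) + ".";
    }
}
```

The article is kept inside the item text here to sidestep article agreement, which a real implementation would also have to handle.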

One of the unit tests for MF2 in ICU4J implements (a very naive) inflection for Romanian names:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CustomFormatterGrammarCaseTest.java

But the inflection algorithm is not the important part, of course.
The test shows that this kind of use case is possible, and how it would work.

0 replies
