CLDR-17382 languagematch Ukrainian should not fall back to Russian #3993

Merged
macchiati merged 2 commits into main from CLDR-17382-languagematch-ukrainian
Aug 30, 2024

Conversation

stenshamn
Contributor

CLDR-17382

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@AEApple
Contributor

AEApple commented Aug 28, 2024

I didn't think we were changing the match to English; I think it should match what was done for Macedonian. I thought the goal was to remove the language match, not to explicitly cause it to match to English.

@AEApple AEApple requested a review from markusicu August 28, 2024 17:41
@stenshamn
Contributor Author

> I didn't think we were changing the match to English; I think it should match what was done for Macedonian. I thought the goal was to remove the language match, not to explicitly cause it to match to English.

Before making the change, I looked at some of the other language matches and previous changes, and it seemed that a change, not deletion, was the right thing to do in this case.
For Estonian they commented it out because (according to ticket/PR comments) they couldn't seem to decide between Russian and English and decided not to have a fallback at all. Many other languages have explicit fallback to English. I didn't want to leave the fallback to be random or unpredictable so I went with explicit.

- <languageMatch desired="uk" supported="ru" distance="20" oneway="true"/> <!-- Ukrainian ⇒ Russian -->
+ <languageMatch desired="uk" supported="en" distance="30" oneway="true"/> <!-- Ukrainian ⇒ English -->
+ <!-- CLDR-17382: languageMatch: Ukrainian should not fall back to Russian -->
Contributor

In that case then can we have the comment say that Ukrainian should match with English if we are adding an explicit match?

Member

I think it is ok to have the explicit fallback to en.

Member

I disagree very much. That's not how this is designed, nor advertised.

macchiati
macchiati previously approved these changes Aug 28, 2024
@macchiati macchiati requested a review from markusicu August 29, 2024 01:30
@macchiati
Member

I think we can wait til tomorrow to see what Markus says, rather than Resolve the Conversation now.

@markusicu
Member

The information we got is "Ukrainian-language users don't want to be matched with Russian-language contents". The corresponding data change is to remove (comment out) the languageMatch for this pair.

This is language matcher data which feeds into an implementation (e.g., ICU LocaleMatcher) which a caller sets up with a list of supported languages plus an optional default language. When there is no match, the matcher returns the default language if one is set; otherwise it returns a "no match" result.

The default language is chosen by the caller. It need not be English. And not setting one at all is a valid, important choice. Some callers have special strategies for what to do next.

When we overcorrect and force a "fallback" to English, then we short-circuit the functioning of the algorithm and defeat the caller's intent.

We should handle this like the other geopolitical cases in the past, like Macedonian.
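
A minimal Python sketch of the behavior described above, under stated assumptions: `best_match`, the three toy distance tables, and `NO_MATCH_DISTANCE` are hypothetical and are neither the ICU LocaleMatcher API nor the real CLDR data. The threshold of 50 is the default value mentioned elsewhere in this thread. It only illustrates how an explicit Ukrainian ⇒ English entry could take precedence over a default chosen by the caller.

```python
from typing import Optional

NO_MATCH_DISTANCE = 80  # assumed stand-in for "no languageMatch entry applies"
THRESHOLD = 50          # default maximum distance cited in this thread


def best_match(desired: str, supported: list[str],
               table: dict[tuple[str, str], int],
               default: Optional[str] = None) -> Optional[str]:
    """Toy matcher: closest supported language under the threshold, else default."""
    best, best_distance = None, THRESHOLD
    for have in supported:
        distance = 0 if have == desired else table.get((desired, have), NO_MATCH_DISTANCE)
        if distance < best_distance:
            best, best_distance = have, distance
    # No candidate under the threshold: hand back whatever the caller chose
    # as default, which may be None ("no match").
    return best if best is not None else default


supported = ["de", "en", "ru"]
old_data = {("uk", "ru"): 20}        # the one-way entry under discussion
commented_out = {}                   # no uk entry at all
forced_english = {("uk", "en"): 30}  # the explicit fallback first proposed

print(best_match("uk", supported, old_data, default="de"))        # ru
print(best_match("uk", supported, commented_out, default="de"))   # de
print(best_match("uk", supported, commented_out))                 # None
print(best_match("uk", supported, forced_english, default="de"))  # en
```

In the last call the caller asked for German as the default, but the toy matcher returns English because the data entry matches first.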


I haven't reviewed other "fallbacks to English" in detail. I assumed that they were generally matches based on some information, like populations are actually somewhat likely to understand English because it's one of the local government and entertainment languages, or remaining influence in colonies, etc. I would expect similar data for one-way matches to French (e.g., Breton --> French), Portuguese, Chinese, and Arabic.

Some of these might not make sense. At a glance, I see a one-way match from Esperanto to English; that looks bogus.

A review of existing data deserves a separate ticket.

@macchiati
Member

Upshot

I don't think it matters much whether we include an explicit fallback or not, and we don't for many languages. So to be conservative, we could omit the fallback mapping to English, then revisit this in the next cycle.

Background

I looked at this a bit. If someone uses the default settings with the proposed change, here is what happens.

  • the default direction includes ONE_WAY
  • the default max threshold distance is 50
  • the default locale is the first supported language (the caller really has to supply the supported locales)

If the user's desired languages are <Ukrainian, French> and the default language is set to German, then the priority order among the app's supported languages would be:

<Ukrainian, French, English, German>

So on systems that allow for secondary desired languages, such as iOS, Android, and macOS, it is easy for users to get the desired result, if their favored fallback language is French (or Russian) rather than German. (Of course, that doesn't necessarily mean that users take advantage of this ability.)

The difference would be that English would come before what the system has set as a default.
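
The ordering in this example can be reproduced with a small Python sketch. Everything here is an assumption for illustration: `priority_order` is a hypothetical helper, the per-position `DEMOTION` penalty of 5 is an invented value, and the table holds only the one entry proposed in this PR. It is not how ICU's LocaleMatcher is implemented.

```python
THRESHOLD = 50  # default maximum distance listed above
DEMOTION = 5    # assumed penalty for each later position in the desired list


def priority_order(desired, table, default):
    """Rank languages a user could be served, best first (toy model)."""
    scores = {}
    for position, want in enumerate(desired):
        penalty = position * DEMOTION
        candidates = {want: 0}  # an exact match has distance 0
        for (rule_desired, rule_supported), distance in table.items():
            if rule_desired == want:
                candidates[rule_supported] = distance
        for lang, distance in candidates.items():
            total = distance + penalty
            if total < THRESHOLD and total < scores.get(lang, THRESHOLD):
                scores[lang] = total
    ordered = sorted(scores, key=scores.get)
    if default not in ordered:
        ordered.append(default)  # the caller's default comes last
    return ordered


print(priority_order(["uk", "fr"], {("uk", "en"): 30}, "de"))
# ['uk', 'fr', 'en', 'de']
print(priority_order(["uk", "fr"], {}, "de"))
# ['uk', 'fr', 'de']
```

With the explicit entry, English is ranked ahead of the German default; without it, the default follows directly after the user's own languages.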

Now, if the system doesn't set a specific default, or doesn't order the supported languages to put a reasonable default as the first one, the results would be a bit random. On the other hand, that is the case for many of our current languages, since we don't always have a fallback for all major languages. I suspect that a very large number of users of LocaleMatcher will use a system default of English; although that might be different in Ukraine.

I think that's why we made an effort to have fallback locales for most locales that are not "top tier" (i.e., those supported by most applications), so that the user would get some reasonable result for systems that don't set the default based on the likelihood that users in that country would understand the language.

@stenshamn
Contributor Author

@macchiati, I'm not sure if I can discern a tie-breaking call in your reply. Should we default to English or leave it commented out like Markus suggests?

@markusicu
Member

> I don't think it matters much whether we include an explicit fallback or not, and we don't for many languages. So to be conservative, we could omit the fallback mapping to English, then revisit this in the next cycle.

Yes please let's just comment out the offensive mapping as requested in the ticket.

> I think that's why we made an effort to have fallback locales for most locales that are not "top tier" (i.e., those supported by most applications), so that the user would get some reasonable result for systems that don't set the default based on the likelihood that users in that country would understand the language.

That last part is important. If it's reasonable to assume that people who understand language x might also understand English or French or... then we should have a medium-high-distance one-way match for that. If not, then we shouldn't have a languageMatch entry. I would argue that Esperanto-->English is the latter. (And I am not asking for changing that in this PR nor under this ticket!)
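
As a rough illustration of what a one-way match means, here is a Python sketch under stated assumptions: `Rule`, `rule_distance`, and `NO_MATCH_DISTANCE` are hypothetical names, and `xx`/`yy` are placeholder codes rather than real data. A one-way entry applies only from desired to supported, while a symmetric entry applies in both directions.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    desired: str
    supported: str
    distance: int
    oneway: bool


NO_MATCH_DISTANCE = 80  # assumed value for pairs without any entry


def rule_distance(desired: str, supported: str, rules: list[Rule]) -> int:
    """Distance from a desired language to a supported one (toy model)."""
    if desired == supported:
        return 0
    for rule in rules:
        forward = (rule.desired, rule.supported) == (desired, supported)
        backward = (rule.supported, rule.desired) == (desired, supported)
        # A one-way rule is never applied in the reverse direction.
        if forward or (backward and not rule.oneway):
            return rule.distance
    return NO_MATCH_DISTANCE


rules = [
    Rule("uk", "en", 30, oneway=True),   # the entry proposed earlier in this PR
    Rule("xx", "yy", 10, oneway=False),  # placeholder codes, symmetric entry
]

print(rule_distance("uk", "en", rules))  # 30
print(rule_distance("en", "uk", rules))  # 80
print(rule_distance("yy", "xx", rules))  # 10
```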

@macchiati
Member

> @macchiati, I'm not sure if I can discern a tie-breaking call in your reply. Should we default to English or leave it commented out like Markus suggests?

Let's comment it out for now, and revisit next cycle.

@macchiati
Member

> If it's reasonable to assume that people who understand language x might also understand English or French or... then we should have a medium-high-distance one-way match for that.

I don't see it that way. The goal for the fallbacks should be: based on the best information we have, in the absence of any other information, what are people most likely to understand if the language in question is not available. [Caveat geopolitical]

So when people can't supply secondary languages, is it better to:

  • pick that fallback
    • note that when that fallback is not supported, it will fall further back to the system default.
  • or go with a system default.

Changing approach to being open-ended instead of explicit to English.

Co-authored-by: Markus Scherer <[email protected]>
Contributor

@AEApple AEApple left a comment

LGTM

@AEApple AEApple dismissed markusicu’s stale review August 29, 2024 22:41

Please re-review updated commit

@AEApple AEApple requested a review from macchiati August 29, 2024 22:41
@macchiati
Member

I'll go ahead and merge

@macchiati macchiati merged commit eb4b003 into main Aug 30, 2024
13 checks passed
@macchiati macchiati deleted the CLDR-17382-languagematch-ukrainian branch August 30, 2024 03:12
haytenf pushed a commit to haytenf/cldr that referenced this pull request Sep 17, 2024
conradarcturus pushed a commit that referenced this pull request Sep 25, 2024