CLDR-17897 Make ConvertLanguageData Consistent #4015

conradarcturus
Contributor

If we re-run ConvertLanguageData after changing unrelated data, it also changes the order and values of some other data. This PR fixes those inconsistencies so that the XML output matches expectations. The biggest change is in language_script.tsv, where script variations are demoted to secondary when they really are not expected. Furthermore, I added explicit annotations to country_language_population.tsv where the writing system for a country is a variant.

CLDR-17897

  • This PR completes the ticket.
  • This PR is pre-work. It fixes the data so it's easier to make updates without side effects happening when working on that ticket.

Scripts ran

  • mvn package -DskipTests=true
  • java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
  • java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

ALLOW_MANY_COMMITS=true

If we re-run ConvertLanguageData on unrelated data, it will update the order and values of some other data -- this fixes inconsistencies with the XML outputs to match expectations.
@conradarcturus conradarcturus force-pushed the CLDR-17897-Make-ConvertLanguageData-Consistent branch from 7f3bd83 to 49aae9c Compare September 4, 2024 04:15
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati
macchiati previously approved these changes Sep 4, 2024
@conradarcturus
Contributor Author

Sorry Mark, I had to update the Chinese language tags, hard-coding them a bit; otherwise they flipped between Hans and Hant in the XML and failed tests.

@conradarcturus conradarcturus changed the base branch from main to ddl/v47 September 5, 2024 17:18
Member

@srl295 srl295 left a comment

Seems OK if nan->nan_Hant_TW instead of nan->nan_Hans_CN is intended.

@@ -497,7 +501,9 @@ not be patched by hand, as any changes made in that fashion may be lost.
<likelySubtag from="myz" to="myz_Mand_IR"/> <!--Classical Mandaic‧?‧? ➡ Classical Mandaic‧Mandaean‧Iran-->
<likelySubtag from="mzn" to="mzn_Arab_IR"/> <!--Mazanderani‧?‧? ➡ Mazanderani‧Arabic‧Iran-->
<likelySubtag from="na" to="na_Latn_NR"/> <!--Nauru‧?‧? ➡ Nauru‧Latin‧Nauru-->
<likelySubtag from="nan" to="nan_Hans_CN"/> <!--Min Nan Chinese‧?‧? ➡ Min Nan Chinese‧Simplified‧China-->
Member

this is the main change that I think needs to be confirmed.

Contributor Author

@conradarcturus conradarcturus Sep 9, 2024

Ah sorry, I can hard-code this back -- let me do that quickly and we can worry about that down the road.

Contributor Author

Ah yea, this was going back and forth depending on when you ran the script. I'll keep it as it was, but it was subject to population counts and to which script was the "default".

Member

I don't know which is correct.

Contributor Author

Okay, it looks like it's ..._Hans_CN in v45 and v46, and this PR now keeps it that way.

@conradarcturus conradarcturus force-pushed the CLDR-17897-Make-ConvertLanguageData-Consistent branch from 8dd134f to 5a4ac87 Compare September 10, 2024 00:05
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@@ -1647,7 +1647,7 @@ XXX Code for transations where no currency is involved
<language type="ha" scripts="Arab Latn"/>
<language type="ha" territories="NE NG" alt="secondary"/>
<language type="hai" scripts="Latn"/>
<language type="hak" scripts="Hans"/>
<language type="hak" scripts="Hans Hant" territories="TW"/>
Member

Note: the items in scripts=... and territories=... are ordered. For the script, this matters for computing likely subtags. So we need to make sure that the first script is the most likely one because it affects how the population data is read when there is no explicit script (the territory doesn't matter).
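
The ordering rule in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CLDR tooling's actual code: `ScriptOrderSketch`, `parseScripts`, and `resolveScript` are hypothetical names, and the input is assumed to be the space-separated `scripts` attribute shown in the diff above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the CLDR tooling.
final class ScriptOrderSketch {

    /** Splits a space-separated scripts attribute, keeping the listed order. */
    static List<String> parseScripts(String scriptsAttribute) {
        return Arrays.asList(scriptsAttribute.trim().split("\\s+"));
    }

    /**
     * Picks the script for a population-data locale such as "hak" or "hak_Hant".
     * An explicit four-letter script subtag wins; otherwise the first listed
     * script is used, which is why the order of the attribute matters.
     */
    static String resolveScript(String locale, String scriptsAttribute) {
        String[] subtags = locale.split("_");
        for (int i = 1; i < subtags.length; i++) {
            String subtag = subtags[i];
            // Script subtags are four letters, e.g. "Hant"; territories are shorter.
            if (subtag.length() == 4 && Character.isLetter(subtag.charAt(0))) {
                return subtag;
            }
        }
        return parseScripts(scriptsAttribute).get(0);
    }
}
```

Under this sketch, a row for `hak_TW` with `scripts="Hans Hant"` falls back to the first listed script, because the territory plays no part in the choice; only an explicit script such as `hak_Hant_TW` overrides it.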

Member

We might want to make this cleaner, and have the non-secondary element have only one script value.

Member

Also, we could enforce that if a language is ever written in multiple scripts (aside from non-standard use, like Shavian with English), the population data must include the script, e.g., nan_Hant in TW and nan_Hans in CN. Enforce that with tests. Then it would be crystal clear.
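
A check along these lines might look like the sketch below. Everything here is an assumption made for illustration: `MultiScriptCheckSketch` and `missingScript` are hypothetical, the script map is assumed to hold only a language's standard scripts (so non-standard uses are already excluded), and the population rows are modeled as plain language codes rather than the real TSV columns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical check for illustration only; not an existing CLDR test.
final class MultiScriptCheckSketch {

    /**
     * Returns the population-data codes that omit the script even though
     * the language is listed with more than one standard script.
     */
    static List<String> missingScript(
            Map<String, List<String>> standardScriptsByLanguage,
            List<String> populationCodes) {
        List<String> problems = new ArrayList<>();
        for (String code : populationCodes) {
            String[] subtags = code.split("_");
            String language = subtags[0];
            // A four-letter second subtag is treated as an explicit script.
            boolean hasScript = subtags.length > 1 && subtags[1].length() == 4;
            List<String> scripts =
                    standardScriptsByLanguage.getOrDefault(language, List.of());
            if (scripts.size() > 1 && !hasScript) {
                problems.add(code);
            }
        }
        return problems;
    }
}
```

A test could then assert that the returned list is empty for the whole population file, which would flag a bare `nan` row while accepting `nan_Hant` and `nan_Hans`.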

Member

@srl295 srl295 Sep 12, 2024

See https://unicode-org.atlassian.net/browse/CLDR-11224 for an item (needs design) on making the status of 'multi script' locales crystal clear; I would say it is right now somewhat nebulous, ad hoc, and not well documented. I added an xref back to here.

Contributor Author

This field is alphabetically ordered in the generating script. It would be the same output if it were ordered by usage. If we presume 100% Hans in CN and 100% Hant in TW, there are more Hans speakers of both Hakka and Min Nan.
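
For comparison, ordering by usage could be sketched as below. This is not how the generating script works (it sorts alphabetically, as noted above); `ScriptUsageOrderSketch` and `orderByUsage` are hypothetical, and the per-script speaker totals are assumed to be computed elsewhere. The alphabetical tie-break is an added assumption intended to keep the output identical across repeated runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical ordering for illustration only; not the generating script's logic.
final class ScriptUsageOrderSketch {

    /**
     * Orders scripts by descending speaker count. Ties fall back to
     * alphabetical order so the result does not depend on map iteration order.
     */
    static List<String> orderByUsage(Map<String, Long> speakersByScript) {
        List<String> scripts = new ArrayList<>(speakersByScript.keySet());
        scripts.sort((a, b) -> {
            int byUsage = Long.compare(speakersByScript.get(b), speakersByScript.get(a));
            return byUsage != 0 ? byUsage : a.compareTo(b);
        });
        return scripts;
    }
}
```

Whenever Hans has the larger total, this yields the same `Hans Hant` order as the alphabetical sort, which matches the point made in the comment.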

Member

The question is always written usage, which is hard to determine as we know.

Contributor Author

Mind if I follow up on designing that in another change? The non-idempotence of the aforementioned scripts is blocking a lot of PRs waiting on my server.

@macchiati
Member

OK, we can land this. We'll want to do a comparison of the end data to 46 before we merge the branch in, though. And I think we'll want the structure/tooling beforehand. But that isn't pressing.

@conradarcturus conradarcturus merged commit 8ac1a2f into unicode-org:ddl/v47 Sep 18, 2024
9 checks passed
conradarcturus added a commit to conradarcturus/cldr that referenced this pull request Sep 27, 2024
@@ -1036,6 +1046,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
<likelySubtag from="und_Ahom" to="aho_Ahom_IN"/> <!--?‧Ahom‧? ➡ Ahom‧Ahom‧India-->
<likelySubtag from="und_Arab" to="ar_Arab_EG"/> <!--?‧Arabic‧? ➡ Arabic‧Arabic‧Egypt-->
<likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/> <!--?‧Arabic‧Afghanistan ➡ Persian‧Arabic‧Afghanistan-->
<likelySubtag from="und_Arab_AZ" to="az_Arab_AZ"/> <!--?‧Arabic‧Azerbaijan ➡ Azerbaijani‧Arabic‧Azerbaijan-->
Member

@macchiati wrote:

I think @roozbehp had some comments about this one, since it would be exceedingly rare to have Azerbaijan written in Arabic script these days.

Contributor Author

Roozbeh's comment in the previous attempt was about und_Arab_AZ mapping to tly_Arab_AZ -- he said that was very unlikely and that it should probably be az_Arab_AZ, as you see in this PR. Because of multiple scripts, this gets automatically generated, so we have to have some value here -- at least in this PR it's the right one.

srl295 pushed a commit that referenced this pull request Oct 25, 2024
3 participants