CLDR-17897 Make ConvertLanguageData Consistent #4015

conradarcturus · 2024-09-04T04:04:03Z

Re-running ConvertLanguageData after an unrelated data change currently also changes the order and values of other data. This PR fixes those inconsistencies so that the XML outputs match expectations. The biggest change was updating values in language_script.tsv to demote script variations to secondary when they are not really expected. Furthermore, I added explicit annotations to country_language_population.tsv where the writing system for a country is a variant.

CLDR-17897

This PR completes the ticket.
This PR is pre-work. It fixes the data so it's easier to make updates without side effects when working on that ticket.

Scripts ran

mvn package -DskipTests=true
java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-09-04T04:15:36Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

conradarcturus · 2024-09-05T17:18:08Z

Sorry, Mark, I had to update the Chinese language tags, hard-coding them a bit; otherwise they flipped between Hans and Hant in the XML and the tests failed.

srl295

Seems OK if nan->nan_Hant_TW instead of nan->nan_Hans_CN is intended.

srl295 · 2024-09-09T17:26:06Z

common/supplemental/likelySubtags.xml

@@ -497,7 +501,9 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="myz" to="myz_Mand_IR"/>		<!--Classical Mandaic‧?‧?	➡ Classical Mandaic‧Mandaean‧Iran-->
 		<likelySubtag from="mzn" to="mzn_Arab_IR"/>		<!--Mazanderani‧?‧?	➡ Mazanderani‧Arabic‧Iran-->
 		<likelySubtag from="na" to="na_Latn_NR"/>		<!--Nauru‧?‧?	➡ Nauru‧Latin‧Nauru-->
-		<likelySubtag from="nan" to="nan_Hans_CN"/>		<!--Min Nan Chinese‧?‧?	➡ Min Nan Chinese‧Simplified‧China-->


This is the main change that I think needs to be confirmed.

conradarcturus · 2024-09-09

Ah sorry, I can hard-code this back. Let me do that quickly and we can worry about that down the road.

conradarcturus · 2024-09-09

Ah yeah, this was going back and forth depending on when you ran the script. I'll keep it as it was, but it was subject to population counts and to which script was the "default".

srl295 · 2024-09-10

I don't know which is correct.

conradarcturus · 2024-09-10

Okay, it looks like it's ..._Hans_CN in v45 and v46, and it is now fixed in this PR to stay like that.

jira-pull-request-webhook · 2024-09-10T00:05:26Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2024-09-12T15:11:23Z

common/supplemental/supplementalData.xml

@@ -1647,7 +1647,7 @@ XXX Code for transations where no currency is involved
 		<language type="ha" scripts="Arab Latn"/>
 		<language type="ha" territories="NE NG" alt="secondary"/>
 		<language type="hai" scripts="Latn"/>
-		<language type="hak" scripts="Hans"/>
+		<language type="hak" scripts="Hans Hant" territories="TW"/>


Note: the items in scripts=... and territories=... are ordered. For the script, this matters for computing likely subtags. So we need to make sure that the first script is the most likely one because it affects how the population data is read when there is no explicit script (the territory doesn't matter).
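As a rough illustration of the ordering point above, the sketch below reads a few `<language>` elements like those in the hunk and treats the first entry of `scripts` on the non-secondary element as the default script. The `<languageData>` wrapper, the `primary_scripts` helper, and the choice of Python are assumptions made for illustration; this is not taken from ConvertLanguageData or GenerateLikelySubtags.

```python
import xml.etree.ElementTree as ET

# Sample elements copied from the hunk above; the wrapper element is assumed.
SAMPLE = """
<languageData>
    <language type="ha" scripts="Arab Latn"/>
    <language type="ha" territories="NE NG" alt="secondary"/>
    <language type="hai" scripts="Latn"/>
    <language type="hak" scripts="Hans Hant" territories="TW"/>
</languageData>
"""


def primary_scripts(xml_text):
    """Map each language to the first script on its non-secondary element."""
    result = {}
    for elem in ET.fromstring(xml_text).iter("language"):
        if elem.get("alt") == "secondary":
            continue  # secondary entries do not define the default script
        scripts = (elem.get("scripts") or "").split()
        if scripts:
            # Order matters: only the first listed script is used.
            result.setdefault(elem.get("type"), scripts[0])
    return result


print(primary_scripts(SAMPLE))
# {'ha': 'Arab', 'hai': 'Latn', 'hak': 'Hans'}
```

Under this reading, `ha` resolves to `Arab` only because it is listed first, which is why the order of the attribute value has to reflect likelihood and not just sort order.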

We might want to make this cleaner, and have the non-secondary element have only one script value.

Also, we could enforce that if a language is ever written in multiple scripts (aside from non-standard use, like Shavian with English), the population data must include the script, e.g. nan_Hant in TW and nan_Hans in CN. Enforce that with tests. Then it would be crystal clear.
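A test along those lines might look roughly like the sketch below. The row format (territory, language code), the `missing_script_rows` helper, and the sample rows are all hypothetical; they are not the real layout or contents of country_language_population.tsv.

```python
def missing_script_rows(multi_script_languages, population_rows):
    """Return rows whose language is multi-script but whose code has no script subtag."""
    problems = []
    for territory, code in population_rows:
        parts = code.split("_")
        language = parts[0]
        # A script subtag is four letters, e.g. "Hant" in "nan_Hant".
        has_script = len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha()
        if language in multi_script_languages and not has_script:
            problems.append((territory, code))
    return problems


# Hypothetical rows for illustration only.
rows = [("TW", "nan_Hant"), ("CN", "nan_Hans"), ("TW", "hak"), ("NR", "na")]
print(missing_script_rows({"nan", "hak"}, rows))
# [('TW', 'hak')]
```

A unit test could then assert that the returned list is empty for the real data, so a new multi-script row without an explicit script fails fast.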

srl295 · 2024-09-12

See https://unicode-org.atlassian.net/browse/CLDR-11224 for an item (needs design) on clarifying the status of 'multi script' locales, which I would say is currently somewhat nebulous, ad hoc, and not well documented. I added an xref back to here.

conradarcturus · 2024-09-12

This field is ordered alphabetically in the generating script. The output would be the same if it were ordered by usage: if we presume 100% Hans in CN and 100% Hant in TW, there are more Hans speakers of both Hakka and Min Nan.

macchiati · 2024-09-12

The question is always written usage, which is hard to determine, as we know.

conradarcturus · 2024-09-12

Mind if I follow up on designing that in another change? The non-idempotence of the aforementioned scripts is blocking a lot of PRs waiting on my server.
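One way to check for this kind of non-idempotence is to snapshot the generated files, re-run the tools with no input changes, and compare. The sketch below is a minimal version of that idea; the `snapshot` and `changed_files` helpers are hypothetical and not part of the CLDR tooling.

```python
import hashlib
from pathlib import Path


def snapshot(root, pattern="*.xml"):
    """Return {relative path: SHA-256 hex digest} for matching files under root."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def changed_files(before, after):
    """List paths that were added, removed, or changed between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

For example, one could call `snapshot("common/supplemental")`, re-run the ConvertLanguageData and GenerateLikelySubtags commands from the PR description, take a second snapshot, and expect `changed_files` to return an empty list if the tools are stable.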

macchiati · 2024-09-18T17:11:18Z

OK, we can land this. We'll want to do a comparison of the end data to 46 before we merge the branch in, though. And I think we'll want the structure/tooling beforehand. But that isn't pressing.

srl295 · 2024-09-30T14:34:06Z

common/supplemental/likelySubtags.xml

@@ -1036,6 +1046,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="und_Ahom" to="aho_Ahom_IN"/>		<!--?‧Ahom‧?	➡ Ahom‧Ahom‧India-->
 		<likelySubtag from="und_Arab" to="ar_Arab_EG"/>		<!--?‧Arabic‧?	➡ Arabic‧Arabic‧Egypt-->
 		<likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/>		<!--?‧Arabic‧Afghanistan	➡ Persian‧Arabic‧Afghanistan-->
+		<likelySubtag from="und_Arab_AZ" to="az_Arab_AZ"/>		<!--?‧Arabic‧Azerbaijan	➡ Azerbaijani‧Arabic‧Azerbaijan-->


@macchiati wrote:

I think @roozbehp had some comments about this one, since it would be exceedingly rare to have Azerbaijan written in Arabic script these days.

conradarcturus · 2024-10-01

Roozbeh's comment in the previous attempt was about und_Arab_AZ mapping to tly_Arab_AZ: he said that was very unlikely and that it should probably be az_Arab_AZ, as you see in this PR. Because of multiple scripts, this gets generated automatically, so we have to have some value here; at least in this PR it's the right one.

github-actions bot assigned conradarcturus Sep 4, 2024

CLDR-17897 Make ConvertLanguageData Consistent

49aae9c

If we re-run ConvertLanguageData on unrelated data, it will update the order and values of some other data -- this fixes inconsistencies with the XML outputs to match expectations.

conradarcturus force-pushed the CLDR-17897-Make-ConvertLanguageData-Consistent branch from 7f3bd83 to 49aae9c Compare September 4, 2024 04:15

conradarcturus mentioned this pull request Sep 4, 2024

CLDR-17897 Fix unstable scripts when running GenerateLikelySubtags and ConvertLanguageData #3998

Closed

2 tasks

conradarcturus requested review from macchiati and srl295 September 4, 2024 16:46

macchiati previously approved these changes Sep 4, 2024

conradarcturus dismissed macchiati’s stale review via 8dd134f September 4, 2024 18:05

conradarcturus requested a review from macchiati September 5, 2024 17:18

conradarcturus changed the base branch from main to ddl/v47 September 5, 2024 17:18

srl295 reviewed Sep 9, 2024

conradarcturus added 2 commits September 9, 2024 17:00

CLDR-17897 Update Chinese tags

47f9bcd

CLDR-17897 Keep Chinese defaults

5a4ac87

conradarcturus force-pushed the CLDR-17897-Make-ConvertLanguageData-Consistent branch from 8dd134f to 5a4ac87 Compare September 10, 2024 00:05

conradarcturus added 2 commits September 10, 2024 16:37

CLDR-17897 Keep Chinese defaults

a4fecbd

CLDR-17897 merge with remote

2915009

macchiati reviewed Sep 12, 2024

macchiati approved these changes Sep 18, 2024

conradarcturus merged commit 8ac1a2f into unicode-org:ddl/v47 Sep 18, 2024
9 checks passed

srl295 reviewed Sep 30, 2024
