CLDR-17897 Make ConvertLanguageData Consistent #4015
If we re-run ConvertLanguageData on unrelated data, it will update the order and values of some other data -- this fixes inconsistencies with the XML outputs to match expectations.
Force-push: 7f3bd83 to 49aae9c
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot
Sorry Mark, I had to update the Chinese language tags -- hardcoding them a bit, otherwise they flipped between Hans and Hant in the XML and failed tests.
Seems OK if `nan->nan_Hant_TW` instead of `nan->nan_Hans_CN` is intended.
@@ -497,7 +501,9 @@ not be patched by hand, as any changes made in that fashion may be lost.
<likelySubtag from="myz" to="myz_Mand_IR"/> <!--Classical Mandaic‧?‧? ➡ Classical Mandaic‧Mandaean‧Iran-->
<likelySubtag from="mzn" to="mzn_Arab_IR"/> <!--Mazanderani‧?‧? ➡ Mazanderani‧Arabic‧Iran-->
<likelySubtag from="na" to="na_Latn_NR"/> <!--Nauru‧?‧? ➡ Nauru‧Latin‧Nauru-->
<likelySubtag from="nan" to="nan_Hans_CN"/> <!--Min Nan Chinese‧?‧? ➡ Min Nan Chinese‧Simplified‧China-->
this is the main change that I think needs to be confirmed.
Ah sorry, I can hard-code this back -- let me do that quickly and we can worry about that down the road.
Ah yea, this was going back and forth depending on when you ran the script. I'll keep it as it was -- but it was subject to population counts and which script was the "default".
I don't know which is correct.
Okay, it looks like it's `..._Hans_CN` in v45 and v46, and it is now fixed in this PR to stay like that.
Force-push: 8dd134f to 5a4ac87
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot
@@ -1647,7 +1647,7 @@ XXX Code for transations where no currency is involved
<language type="ha" scripts="Arab Latn"/>
<language type="ha" territories="NE NG" alt="secondary"/>
<language type="hai" scripts="Latn"/>
<language type="hak" scripts="Hans"/>
<language type="hak" scripts="Hans Hant" territories="TW"/>
Note: the items in scripts=... and territories=... are ordered. For the script, this matters for computing likely subtags. So we need to make sure that the first script is the most likely one because it affects how the population data is read when there is no explicit script (the territory doesn't matter).
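To illustrate why the order matters, here is a minimal sketch under stated assumptions. It is a toy model, not the actual ConvertLanguageData or GenerateLikelySubtags logic: the names `LANGUAGE_SCRIPTS`, `default_script`, and `resolve_population_locale` are hypothetical, and the only behavior modeled is "when a population entry has no explicit script, assume the first listed script".

```python
# Hypothetical in-memory form of <language type="hak" scripts="Hans Hant" .../>.
# The list order mirrors the attribute order; the first entry is treated as most likely.
LANGUAGE_SCRIPTS = {
    "hak": ["Hans", "Hant"],
    "hai": ["Latn"],
}


def default_script(language, scripts_by_language):
    """Return the first listed script for a language, or None if it is unknown."""
    scripts = scripts_by_language.get(language)
    return scripts[0] if scripts else None


def resolve_population_locale(locale, scripts_by_language):
    """Insert the default script into a population entry that lacks one.

    Assumes tags look like 'hak', 'hak_Hant', or 'hak_TW' (illustrative shapes only).
    """
    parts = locale.split("_")
    if len(parts) > 1 and len(parts[1]) == 4:
        return locale  # an explicit four-letter script is already present
    script = default_script(parts[0], scripts_by_language)
    if script is None:
        return locale  # nothing known about this language; leave it alone
    return "_".join([parts[0], script] + parts[1:])


print(resolve_population_locale("hak", LANGUAGE_SCRIPTS))
print(resolve_population_locale("hak", {"hak": ["Hant", "Hans"]}))
```

In this toy model, swapping the order of the two scripts is enough to flip the result for an entry without a script, which is the kind of Hans/Hant flip discussed in this thread.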
We might want to make this cleaner, and have the non-secondary element have only one script value.
Also, we could enforce that if a language is ever written in multiple scripts (aside from non-standard use, like Shavian with English), the population data must have the script, e.g. nan_Hant in TW and nan_Hans in CN. Enforce that with tests. Then it would be crystal clear.
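A check along those lines could look roughly like the following sketch. Everything here is an assumption for illustration: the function names, the `(territory, language_tag)` row shape, and the sample rows are made up and do not reflect the real layout of `country_language_population.tsv` or any existing CLDR test.

```python
def has_explicit_script(language_tag):
    """True if the tag's second subtag looks like a four-letter script code."""
    parts = language_tag.split("_")
    return len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha()


def find_entries_missing_script(scripts_by_language, population_rows):
    """Return rows for multi-script languages whose tag omits the script."""
    problems = []
    for territory, language_tag in population_rows:
        language = language_tag.split("_")[0]
        is_multi_script = len(scripts_by_language.get(language, [])) > 1
        if is_multi_script and not has_explicit_script(language_tag):
            problems.append((territory, language_tag))
    return problems


# Illustrative data only, not taken from the CLDR files.
scripts = {"nan": ["Hans", "Hant"], "hai": ["Latn"]}
rows = [("TW", "nan_Hant"), ("CN", "nan"), ("XX", "hai")]
print(find_entries_missing_script(scripts, rows))
```

A test built on this idea would fail whenever the returned list is non-empty, so a bare `nan` row would have to be rewritten as `nan_Hans` or `nan_Hant` before the data could land.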
see https://unicode-org.atlassian.net/browse/CLDR-11224 for an item (needs design) on making crystal clear the status of 'multi script' locales, which I would say is right now actually somewhat nebulous and ad-hoc and not well documented. I added an xref back to here.
This field is alphabetically ordered in the generating script. It would be the same output if it were ordered by usage: if we presume 100% Hans in CN and 100% Hant in TW, there are more Hans speakers of both Hakka and Min Nan.
The question is always written usage, which is hard to determine as we know.
Mind if I follow up on designing that in another change? The non-idempotence of the aforementioned scripts is blocking a lot of PRs waiting on my server.
OK, we can land this. We'll want to do a comparison of the end data to 46 before we merge the branch in, though. And I think we'll want the structure/tooling beforehand. But that isn't pressing.
If we re-run ConvertLanguageData on unrelated data, it will update the order and values of some other data -- this fixes inconsistencies with the XML outputs to match expectations.

The biggest change was updating values in `language_script.tsv` to demote script variations to secondary when they really are not expected. Furthermore I added explicit annotations to `country_language_population.tsv` when the writing system for a country was a variant.

Scripts ran:
mvn package -DskipTests=true
java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags
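The property this commit is after (re-running the generator should not reshuffle unrelated output) can be sketched in a few lines. This is only an illustration under assumptions: `render_language_elements` is a hypothetical stand-in, not the real ConvertLanguageData code, and it sorts alphabetically because a comment in this thread says the field is alphabetically ordered in the generating script.

```python
def render_language_elements(scripts_by_language):
    """Render toy <language> elements in an order independent of input order."""
    lines = []
    for language in sorted(scripts_by_language):
        # Sort and de-duplicate so sets, lists, and any insertion order agree.
        scripts = " ".join(sorted(set(scripts_by_language[language])))
        lines.append(f'<language type="{language}" scripts="{scripts}"/>')
    return "\n".join(lines)


# The same data supplied in two different orders (one value is even a set).
first_run = render_language_elements({"hak": {"Hant", "Hans"}, "ha": ["Latn", "Arab"]})
second_run = render_language_elements({"ha": ["Arab", "Latn"], "hak": ["Hans", "Hant"]})

print(first_run)
print(first_run == second_run)
```

Fixing the order at serialization time is one way to keep the output stable; it does not by itself settle which script should be considered the most likely one.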
@@ -1036,6 +1046,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
<likelySubtag from="und_Ahom" to="aho_Ahom_IN"/> <!--?‧Ahom‧? ➡ Ahom‧Ahom‧India-->
<likelySubtag from="und_Arab" to="ar_Arab_EG"/> <!--?‧Arabic‧? ➡ Arabic‧Arabic‧Egypt-->
<likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/> <!--?‧Arabic‧Afghanistan ➡ Persian‧Arabic‧Afghanistan-->
<likelySubtag from="und_Arab_AZ" to="az_Arab_AZ"/> <!--?‧Arabic‧Azerbaijan ➡ Azerbaijani‧Arabic‧Azerbaijan-->
@macchiati wrote:
I think @roozbehp had some comments about this one, since it would be exceedingly rare to have Azerbaijan written in Arabic script these days.
Roozbeh's comment in the previous attempt was about `und_Arab_AZ` mapping to `tly_Arab_AZ` -- he said that was very unlikely and it probably should be `az_Arab_AZ`, as you see in this PR. Because of multiple scripts this gets automatically generated, so we have to have some value here -- at least in this PR it's the right one.
ALLOW_MANY_COMMITS=true