CLDR-17115 Update languages/codes #3538

btangmu · 2024-02-28T15:12:16Z

-Numerous changes based on following instructions in Update Language/Script/Region Subtags

-Update world_bank_data.csv by downloading

-URL is https://databank.worldbank.org/reports.aspx?source=world-development-indicators

-Note: running AddPopulationData caused no changes

-Update supplementalData.xml by running ConvertLanguageData

-This caused removal of tok (Toki Pona) and vo (Volapük), and revision of comment for ace (Achinese to Acehnese)

CLDR-17115

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

-Numerous changes from instructions in Update Language/Script/Region Subtags

macchiati

Looks good, but remove the scripts.xml file, because we're addressing that in a different PR.

-Revert scripts.xml back to main; to be addressed in a different PR

btangmu · 2024-02-28T18:47:15Z

@macchiati as you requested, I reverted scripts.xml in the last commit

srl295

ISO 3166 seems wrong

tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/iso_3166_status.txt

btangmu · 2024-03-13T17:14:13Z

Per discussion, I've reverted iso_3166_status.txt to main

reverted iso_3166_status.txt

macchiati · 2024-03-13T18:21:34Z

It looks like there are a surprising number of errors. I think it is best for me to walk you through this, and you can capture these notes in the instructions.

It appears that ISO had an unexpected number of deprecations, so you're seeing more issues than we normally see.

For lines like the following:
Error: (TestLocale.java:921) Error: : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
Error: (TestLocale.java:927) Error: : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"

What is happening is that likelySubtags.xml is handling languages that are now deprecated. That is to be expected, because ISO does that occasionally, but because we added a lot of SIL language data, the number may be larger each year. To fix that, go to the file and delete the line where the language is handled, in this case the likelySubtag line for ajp.

TestMacrolanguages
Error: (TestSupplementalInfo.java:1328) Error: Macrolanguage sa Sanskrit Historical

It looks like the classification changed in ISO. We still use 'sa', because the Indian government disagrees that it is only historical! Add 'sa' to the special cases in:
if (language.equals("no") || language.equals("sh")) continue; // special cases

TestCompatibility
Error: (TestValidity.java:284) Error: language:dzd:deprecated => regular // add to exception list (ALLOWED_UNDELETIONS) if really un-deprecated

Check the diff in the iso-639 files to verify that dzd is really de-deprecated. Then add dzd to ALLOWED_UNDELETIONS

The "ERROR:" values below in the listing all look like keyboard stuff; I don't think those are counted. I'll file a ticket for Steven to clean those up.

btangmu · 2024-03-14T15:26:32Z

common/validity/language.xml

 			baz bbz bcc bcl bgm bh bhk bic bij bjd bjq bkb blg bmy bpb btb btl bxk bxr bxx byy
 			cbe cbh cca ccq cdg cjr cka cld cmk cmn cnr coy cqu cug cum cwd
-			daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd
+			daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl


"dzd" was removed here

Web search for "dzd deprecated" turns up this file:

https://www.iana.org/assignments/lang-subtags-templates/dzd-2023-03-17.txt

which reads as follows:

FOR ARCHIVING: Registration form for 'dzd'

LANGUAGE SUBTAG REGISTRATION FORM

Name of requester: Doug Ewell

E-mail address of requester: doug at ewellic.org

Record Requested:

Type: language
Subtag: dzd
Description: Daza

Intended meaning of the subtag:

Reference to published description of the language (book or article):

Any other relevant information:

This registration tracks a change made to ISO 639-3 effective
2023-01-20, adding the code element 'dzd' for Daza, which had been
retired in 2015 as non-existent. The net effect of this registration is
to remove the Deprecated value from this record.

For more information on the ISO 639-3 change, refer to:
https://iso639-3.sil.org/request/2022-027

macchiati · 2024-03-14T16:03:50Z

Good work. That verifies that it is indeed an intentional change.
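
The same diff check can be done mechanically when the list is long. A minimal sketch, assuming the old and new deprecated lists from common/validity/language.xml have been copied out as whitespace-separated strings; the class and method names are hypothetical and not part of the CLDR tools:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not CLDR tooling: finds codes dropped from a deprecated list. */
final class UndeprecatedCodes {
    /** Returns the codes present in oldList but absent from newList, sorted. */
    static Set<String> removedFrom(String oldList, String newList) {
        Set<String> removed = new TreeSet<>(Arrays.asList(oldList.trim().split("\\s+")));
        removed.removeAll(Arrays.asList(newList.trim().split("\\s+")));
        return removed;
    }

    public static void main(String[] args) {
        // The two versions of the line quoted from the diff above.
        String before = "daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd";
        String after = "daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl";
        // Each reported code still needs to be verified by hand before it
        // goes into ALLOWED_UNDELETIONS.
        System.out.println(removedFrom(before, after));
    }
}
```

Anything this reports is only a candidate; the registry form is what confirms that the change is intentional.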

-Macrolanguage sa Sanskrit Historical, treat as exception

-Take into account the official undeprecation of dzd Daza

btangmu · 2024-03-15T14:30:16Z

@macchiati my latest commit fixes the problems with sa (Sanskrit) and dzd (Daza). It does not fix the problems with ajp and others in this output:

    testLanguageTagParserIsValid {
      Error: (TestLocale.java:921) : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:927) : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:921) : kgm: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:927) : kgm_Latn_BR: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:921) : ksa: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:927) : ksa_Latn_NG: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:921) : nom: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:927) : nom_Latn_PE: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:921) : plj: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:927) : plj_Latn_NG: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:921) : prp: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:927) : prp_Gujr_IN: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:921) : slq: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:927) : slq_Arab_IR: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:921) : szd: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:927) : szd_Latn_MY: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:921) : tmk: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:927) : tmk_Deva_NP: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:921) : xss: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:927) : xss_Cyrl_RU: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:921) : zkb: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:927) : zkb_Cyrl_RU: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:921) : zua: expected "", got "Disallowed language=zua, status=deprecated"
      Error: (TestLocale.java:927) : zua_Latn_NG: expected "", got "Disallowed language=zua, status=deprecated"

You addressed these errors in your last comment, but I still don't understand; they're different from the "sa" error.

"ajp" occurs in languageGroup.xml, languageInfo.xml, and likelySubtags.xml. Should it be deleted from languageInfo.xml, and/or likelySubtags.xml, and then should languageGroup.xml be regenerated?

macchiati · 2024-03-16T14:50:00Z

Here is what to do in more detail.

Case 1, replaced by old:

Take ajp

Look at language-subtag-registry (the diff from the old one)

You see that ajp has 2 items added:

Deprecated: 2023-03-17
Preferred-Value: apc

That means that wherever ajp occurs, "apc" should be substituted. However, if you look at apc, it is not new. So the action is to delete ajp in those files where it occurs. Search the supplemental directory. You find:

languageGroup.xml
93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii ajp akk am …</languageGroup> 

languageInfo.xml
170: <languageMatch desired="ajp" supported="ar" distance="10" oneway="true"/> <!-- South Levantine Arabic --> 

likelySubtags.xml (6 matches)
2,883: <likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/> <!-- South Levantine Arabic ➡︎ South Levantine Arabic (Arabic, Jordan) --> 
4,461: <likelySubtag from="gra" to="gra_Deva_IN" origin="sil1"/> <!-- Rajput Garasia ➡︎ Rajput Garasia (Devanagari, India) --> 
4,462: <likelySubtag from="gra_Gujr" to="gra_Gujr_IN" origin="sil1"/> <!-- Rajput Garasia (Gujarati) ➡︎ Rajput Garasia (Gujarati, India) -->

In languageGroup:
If 'apc' didn't exist in that file, you would replace 'ajp' with it.
Since it does, you just delete 'ajp' (leaving the rest of the line alone).

93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii akk am …</languageGroup>

Same in languageInfo.xml and likelySubtags.xml. 'apc' exists in each, so just delete the lines.

Suppose it were in supplementalData in the territory information (it isn't, so this is just an illustration!)

<territory type="PS" gdp="21220000000" literacyPercent="95.3" population="4818260">	<!--Palestinian Territories-->
  <languagePopulation type="ar" populationPercent="100" officialStatus="official"/>	<!--Arabic-->
  <languagePopulation type="apc" populationPercent="87" references="R1173"/>	<!--Levantine Arabic-->
  <languagePopulation type="ajp" populationPercent="2" references="..."/>	<!--South Levantine Arabic-->

In that case you would combine the two figures to get:

  <languagePopulation type="apc" populationPercent="89" references="R1173"/>	<!--Levantine Arabic-->

Use your judgment: sometimes language counts are doubled for bilingual speakers, so if it adds to a crazy amount, don't add it. (These figures are 'best available', so that's ok.)
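
A small sketch of that merge rule, assuming the percentages are plain numbers. The 100% cap is only one assumed reading of "a crazy amount", and the class and method names are hypothetical:

```java
/** Hypothetical helper, not CLDR tooling: folds a deprecated code's share into its replacement. */
final class PopulationMerge {
    /**
     * Adds the deprecated code's percentage to the replacement's percentage,
     * unless the sum would exceed 100 (an assumed threshold), in which case
     * the replacement's figure is kept unchanged.
     */
    static double mergePercent(double replacementPercent, double deprecatedPercent) {
        double sum = replacementPercent + deprecatedPercent;
        return sum > 100.0 ? replacementPercent : sum;
    }

    public static void main(String[] args) {
        // apc at 87 plus ajp at 2, as in the illustration above.
        System.out.println(mergePercent(87, 2));
        // A sum above 100 is treated as double counting and not added.
        System.out.println(mergePercent(100, 2));
    }
}
```

Judgment still applies below the cap; this only rules out the obviously impossible totals.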

Case 2, no preferred

In this case, just drop the lines.

Case 3, split

Subtag: ksa
Description: Shuwa-Zamani
Added: 2009-07-29
Deprecated: 2023-03-17
Comments: see izm, rsw

Look at iso-639-3_Retirements.tab for ksa

You'll see "Split into [rsw] Rishiwa and [izm] Kizamani"

Take the first one, and treat this case like Case 1.
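
The three cases can be told apart from the registry record itself. A rough sketch, assuming one record is available as "Field: value" lines; treating a "see x, y" comment as the sign of a split is an assumption and should be confirmed in iso-639-3_Retirements.tab. The class and enum names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not CLDR tooling: sorts a registry record into the cases above. */
final class DeprecationCase {
    enum Kind { NOT_DEPRECATED, REPLACED, NO_PREFERRED, SPLIT }

    /** Parses the "Field: value" lines of one registry record into a map. */
    static Map<String, String> parseRecord(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** Classifies a record as Case 1 (REPLACED), Case 2 (NO_PREFERRED) or Case 3 (SPLIT). */
    static Kind classify(String record) {
        Map<String, String> fields = parseRecord(record);
        if (!fields.containsKey("Deprecated")) {
            return Kind.NOT_DEPRECATED;
        }
        if (fields.containsKey("Preferred-Value")) {
            return Kind.REPLACED; // Case 1
        }
        // Assumption: "Comments: see x, y" marks a split (Case 3).
        String comments = fields.getOrDefault("Comments", "");
        return comments.startsWith("see ") ? Kind.SPLIT : Kind.NO_PREFERRED;
    }

    public static void main(String[] args) {
        String ksa = "Subtag: ksa\nDescription: Shuwa-Zamani\nAdded: 2009-07-29\n"
                + "Deprecated: 2023-03-17\nComments: see izm, rsw";
        System.out.println(classify(ksa));
    }
}
```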

btangmu · 2024-03-19T14:54:25Z

@macchiati I've started to follow your directions for "ajp", ...

likelySubtags.xml says "Likely subtags data is generated programatically from CLDR's language/territory/population data using the GenerateMaximalLocales tool. Under normal circumstances, this file should not be patched by hand, as any changes made in that fashion may be lost."

So I tried to run GenerateMaximalLocales and got "IllegalArgumentException: Don't run this tool until it is fixed":

    public static void main(String[] args) throws IOException {
        if (true) {
            throw new IllegalArgumentException("Don't run this tool until it is fixed");
        }

So I'll try hand-editing likelySubtags.xml anyway...

macchiati · 2024-03-19T15:22:10Z

Right, we disabled the tool for now. It should be easy to regex-search for (ajp|...) to find all the lines, although you want to look at each one rather than automatically deleting.
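
One caution on the regex: a plain substring search for ajp also hits the gra lines quoted earlier, because "Rajput" contains "ajp". A sketch of a stricter search, assuming the file has already been read into a list of lines; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not CLDR tooling: finds lines that mention deprecated codes. */
final class DeprecatedCodeFinder {
    /** Matches any of the codes only when not embedded in a longer run of letters. */
    static Pattern patternFor(List<String> codes) {
        return Pattern.compile("(?<![A-Za-z])(" + String.join("|", codes) + ")(?![A-Za-z])");
    }

    /** Returns the lines to review by hand; nothing is deleted automatically. */
    static List<String> matchingLines(List<String> lines, List<String> codes) {
        Pattern pattern = patternFor(codes);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Abbreviated versions of two likelySubtags.xml lines from this thread.
        List<String> lines = List.of(
                "<likelySubtag from=\"ajp\" to=\"ajp_Arab_JO\" origin=\"sil1\"/>",
                "<likelySubtag from=\"gra\" to=\"gra_Deva_IN\" origin=\"sil1\"/> <!-- Rajput Garasia -->");
        matchingLines(lines, List.of("ajp", "kgm")).forEach(System.out::println);
    }
}
```

The lookarounds still allow ajp_Arab_JO to match, since the underscore is not a letter.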

btangmu · 2024-03-19T15:22:18Z

@macchiati FYI you wrote that iso-639-3_Retirements.tab says "Split into [rsw] Rishiwa and [izm] Kizamani" but the version I'm seeing (in the branch for this ticket) doesn't say anything like that -- because that file is changed in this PR! So I need to look at the version of that file before this PR. Just something to be aware of when we update the instructions...

macchiati · 2024-03-19T15:24:39Z

Right. What I do is look at the diffs in the PR. BTW, as you go through this, please jot down in a doc or text file what you are doing, so that we can use that as a basis for updating the instructions.

btangmu · 2024-03-19T16:09:57Z

@macchiati these two files disagree on the replacement for prp (Parsi), whether to change to gu or guj:

language-subtag-registry:
Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu

iso-639-3_Retirements.tab:
prp Parsi M guj 2023-01-20

Since your comments mainly refer to language-subtag-registry I'm guessing "gu", but it's just a wild guess so please confirm or correct!

Actually likelySubtags.xml already has

		<likelySubtag from="gu" to="gu_Gujr_IN"/>
		<!--{ Gujarati; ?; ? } => { Gujarati; Gujarati; India }-->

So I'm just deleting the prp line from that file

DavidLRowe · 2024-03-19T17:05:26Z

I think it should be "gu". "guj" is the ISO 639-3 equivalent of "gu". The ISO 639-1 (two-letter) code is preferred if it exists.

macchiati · 2024-03-19T17:14:20Z

gu is the right choice. (guj is the 3-letter code, but BCP 47 uses the 2-letter code whenever one exists.)
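
As a quick cross-check of that rule, the JDK's own ISO 639 table can map a 3-letter code back to its 2-letter form. This is only a sketch with a hypothetical class name; the JDK table is a stand-in here, and the language subtag registry remains the authority:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

/** Hypothetical helper, not CLDR tooling: prefers a 2-letter code when the JDK knows one. */
final class ShortestLanguageCode {
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();

    static {
        // Build the reverse of the JDK's 2-letter to 3-letter mapping.
        for (String two : Locale.getISOLanguages()) {
            Locale locale = Locale.forLanguageTag(two);
            try {
                THREE_TO_TWO.put(locale.getISO3Language(), locale.getLanguage());
            } catch (MissingResourceException e) {
                // The JDK has no 3-letter form for this code; skip it.
            }
        }
    }

    /** Returns the 2-letter code if the JDK has one for this 3-letter code, else the input. */
    static String preferTwoLetter(String code) {
        return THREE_TO_TWO.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println(preferTwoLetter("guj")); // the prp case from this thread
        System.out.println(preferTwoLetter("umi")); // no 2-letter form, so unchanged
    }
}
```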

btangmu · 2024-03-19T17:16:22Z

Another disagreement, for szd -- replace with uki or umi?

iso-639-3_Retirements.tab:
szd Seru M uki 2023-01-20

language-subtag-registry:
Type: language
Subtag: szd
Description: Seru
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: umi

likelySubtags.xml has only umi, not uki; I'm just deleting szd from that file

macchiati · 2024-03-19T17:27:16Z

When in doubt, go by the language subtag registry

btangmu · 2024-03-19T17:30:48Z

I made another commit. Locally there's a new set of errors, which I'll work on next:

    TestLstrConsistency {
      Error: (TestValidity.java:537) Missing aliases for supplementalMetadata: 10
<languageAlias type="ajp" replacement="apc" reason="deprecated"/> <!-- South Levantine Arabic ⇒ Levantine Arabic -->
<languageAlias type="kgm" replacement="plu" reason="deprecated"/> <!-- Karipúna ⇒ Palikúr -->
<languageAlias type="nom" replacement="cbr" reason="deprecated"/> <!-- Nocamán ⇒ Cashibo-Cacataibo -->
<languageAlias type="pmk" replacement="crr" reason="deprecated"/> <!-- Pamlico ⇒ Carolina Algonquian -->
<languageAlias type="prp" replacement="gu" reason="deprecated"/> <!-- Parsi ⇒ Gujarati -->
<languageAlias type="szd" replacement="umi" reason="deprecated"/> <!-- Seru ⇒ Ukit -->
<languageAlias type="tmk" replacement="tdg" reason="deprecated"/> <!-- Northwestern Tamang ⇒ Western Tamang -->
<languageAlias type="tpw" replacement="tpn" reason="deprecated"/> <!-- Tupí ⇒ Tupinambá -->
<languageAlias type="xss" replacement="zko" reason="deprecated"/> <!-- Assan ⇒ Kott -->
<languageAlias type="zkb" replacement="kjh" reason="deprecated"/> <!-- Koibal ⇒ Khakas -->

macchiati · 2024-03-19T17:53:18Z

Good of it to tell you exactly which lines to add!
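
Each missing line has the same shape, so the list could also be rebuilt from (deprecated code, Preferred-Value) pairs taken from the registry. A sketch only, with a hypothetical class name; the test output remains the list to trust:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not CLDR tooling: formats languageAlias elements. */
final class LanguageAliasLines {
    /** Formats one alias element in the shape printed by the test output above. */
    static String aliasLine(String deprecated, String replacement) {
        return String.format(
                "<languageAlias type=\"%s\" replacement=\"%s\" reason=\"deprecated\"/>",
                deprecated, replacement);
    }

    public static void main(String[] args) {
        // Pairs taken from the registry entries discussed in this thread.
        Map<String, String> preferred = new TreeMap<>();
        preferred.put("ajp", "apc");
        preferred.put("prp", "gu");
        preferred.put("szd", "umi");
        preferred.forEach((from, to) -> System.out.println(aliasLine(from, to)));
    }
}
```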

btangmu · 2024-03-19T18:03:55Z

> Good of it to tell you exactly which lines to add!

Add where?

supplementalMetadata.xml?

btangmu · 2024-03-19T18:45:36Z

Tests are passing!

srl295 · 2024-03-19T18:56:42Z

common/validity/language.xml

@@ -1,7 +1,7 @@
 <?xml version='1.0' encoding='UTF-8' ?>
 <!DOCTYPE supplementalData SYSTEM '../../common/dtd/ldmlSupplemental.dtd'>
 <!--
-	Copyright © 1991-2022 Unicode, Inc.
+	Copyright © 1991-2024 Unicode, Inc.
 	For terms of use, see http://www.unicode.org/copyright.html
 	SPDX-License-Identifier: Unicode-DFS-2016


Should be Unicode-3.0

I'll do another pass replacing identifiers.

CLDR-17115 Update languages/codes

3422177

-Numerous changes from instructions in Update Language/Script/Region Subtags

btangmu self-assigned this Feb 28, 2024

btangmu requested review from srl295, yumaoka and macchiati February 28, 2024 15:13

macchiati reviewed Feb 28, 2024

CLDR-17115 Revert scripts.xml

bc0df8b

-Revert scripts.xml back to main; to be addressed in a different PR

macchiati previously approved these changes Feb 28, 2024

srl295 previously requested changes Feb 29, 2024
