CLDR-17115 Update languages/codes #3538

Merged (6 commits) on Mar 20, 2024

Conversation

btangmu
Member

@btangmu btangmu commented Feb 28, 2024

-Numerous changes based on following instructions in Update Language/Script/Region Subtags

-Update world_bank_data.csv by downloading

-URL is https://databank.worldbank.org/reports.aspx?source=world-development-indicators

-Note: running AddPopulationData caused no changes

-Update supplementalData.xml by running ConvertLanguageData

-This caused removal of tok (Toki Pona) and vo (Volapük), and revision of comment for ace (Achinese to Acehnese)

CLDR-17115

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

-Numerous changes from instructions in Update Language/Script/Region Subtags
@btangmu btangmu self-assigned this Feb 28, 2024
Member

@macchiati macchiati left a comment

Looks good, but remove the scripts.xml change from this PR, because we're addressing that in a different PR.

-Revert scripts.xml back to main; to be addressed in a different PR
@btangmu
Member Author

btangmu commented Feb 28, 2024

@macchiati as you requested, I reverted scripts.xml in the last commit

macchiati previously approved these changes Feb 28, 2024
srl295 previously requested changes Feb 29, 2024
Member

@srl295 srl295 left a comment

ISO 3166 seems wrong

@btangmu
Member Author

btangmu commented Mar 13, 2024

Per discussion, I've reverted iso_3166_status.txt to main

@btangmu btangmu marked this pull request as ready for review March 13, 2024 17:16
@btangmu btangmu requested a review from srl295 March 13, 2024 17:16
@btangmu btangmu dismissed srl295’s stale review March 13, 2024 17:18

reverted iso_3166_status.txt

macchiati
macchiati previously approved these changes Mar 13, 2024
@macchiati macchiati requested a review from pedberg-icu March 13, 2024 18:21
@macchiati
Member

It looks like there are a surprising number of errors. I think it is best for me to walk you through this, and you can capture these notes in the instructions.

It appears that ISO had an unexpected number of deprecations, so you're seeing more issues than we normally see.

For lines like the following:
Error:  (TestLocale.java:921)  Error: : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
Error:  (TestLocale.java:927)  Error: : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"

What is happening is that likelySubtags.xml is handling languages that are now deprecated. That is to be expected, because ISO does that occasionally, but because we added a lot of SIL language data, the number may be larger each year. To fix that, go to the file and delete the line where the language is handled, in this case the likelySubtag line for ajp.

    TestMacrolanguages
Error:  (TestSupplementalInfo.java:1328)  Error: Macrolanguage sa Sanskrit Historical

It looks like the classification changed in ISO. We still use 'sa', because the Indian government disagrees that it is only historical! Add 'sa' to the special cases in this line:

    if (language.equals("no") || language.equals("sh")) continue; // special cases

    TestCompatibility
Error:  (TestValidity.java:284)  Error: language:dzd:deprecated => regular // add to exception list (ALLOWED_UNDELETIONS) if really un-deprecated

Check the diff in the iso-639 files to verify that dzd is really de-deprecated. Then add dzd to ALLOWED_UNDELETIONS
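
For the notes, here is a minimal sketch of the kind of check that error message describes. The class name `UndeletionCheck` and the method are hypothetical, and the real TestValidity logic and the actual type of ALLOWED_UNDELETIONS may differ; this only illustrates why adding dzd to the exception list silences the error.

```java
import java.util.Set;

// Hypothetical sketch, not the actual TestValidity code.
final class UndeletionCheck {
    // Assumption: codes confirmed as un-deprecated in the ISO 639 diff.
    static final Set<String> ALLOWED_UNDELETIONS = Set.of("dzd");

    /** Returns true when a deprecated => regular transition should be reported as an error. */
    static boolean isUnexpectedUndeletion(String code, String oldStatus, String newStatus) {
        boolean undeleted = "deprecated".equals(oldStatus) && "regular".equals(newStatus);
        return undeleted && !ALLOWED_UNDELETIONS.contains(code);
    }

    public static void main(String[] args) {
        // dzd is on the exception list, so it is not reported.
        System.out.println(isUnexpectedUndeletion("dzd", "deprecated", "regular"));
        // Any other code making the same transition would be reported.
        System.out.println(isUnexpectedUndeletion("ajp", "deprecated", "regular"));
    }
}
```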

The "ERROR:" values below in the listing all look like keyboard stuff; I don't think those are counted. I'll file a ticket for Steven to clean those up.

baz bbz bcc bcl bgm bh bhk bic bij bjd bjq bkb blg bmy bpb btb btl bxk bxr bxx byy
cbe cbh cca ccq cdg cjr cka cld cmk cmn cnr coy cqu cug cum cwd
-daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd
+daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl
Member Author

"dzd" was removed here

Web search for "dzd deprecated" turns up this file:

https://www.iana.org/assignments/lang-subtags-templates/dzd-2023-03-17.txt

which reads as follows:

FOR ARCHIVING: Registration form for 'dzd'


LANGUAGE SUBTAG REGISTRATION FORM

  1. Name of requester: Doug Ewell

  2. E-mail address of requester: doug at ewellic.org

  3. Record Requested:

Type: language
Subtag: dzd
Description: Daza

  4. Intended meaning of the subtag:

  5. Reference to published description of the language (book or article):

  6. Any other relevant information:

This registration tracks a change made to ISO 639-3 effective
2023-01-20, adding the code element 'dzd' for Daza, which had been
retired in 2015 as non-existent. The net effect of this registration is
to remove the Deprecated value from this record.

For more information on the ISO 639-3 change, refer to:
https://iso639-3.sil.org/request/2022-027

@macchiati
Member

macchiati commented Mar 14, 2024 via email

-Macrolanguage sa Sanskrit Historical, treat as exception

-Take into account the official undeprecation of dzd Daza
@btangmu
Member Author

btangmu commented Mar 15, 2024

@macchiati my latest commit fixes the problems with sa (Sanskrit) and dzd (Daza). It does not fix the problems with ajp and others in this output:

    testLanguageTagParserIsValid {
      Error: (TestLocale.java:921) : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:927) : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:921) : kgm: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:927) : kgm_Latn_BR: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:921) : ksa: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:927) : ksa_Latn_NG: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:921) : nom: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:927) : nom_Latn_PE: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:921) : plj: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:927) : plj_Latn_NG: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:921) : prp: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:927) : prp_Gujr_IN: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:921) : slq: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:927) : slq_Arab_IR: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:921) : szd: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:927) : szd_Latn_MY: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:921) : tmk: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:927) : tmk_Deva_NP: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:921) : xss: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:927) : xss_Cyrl_RU: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:921) : zkb: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:927) : zkb_Cyrl_RU: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:921) : zua: expected "", got "Disallowed language=zua, status=deprecated"
      Error: (TestLocale.java:927) : zua_Latn_NG: expected "", got "Disallowed language=zua, status=deprecated"
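
To get the list of affected codes out of a listing like this without reading it by eye, a small helper along these lines could be used. It is only a sketch: `DeprecatedCodeExtractor` is a hypothetical name, and the regular expression assumes the message format shown in the output above.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the CLDR tools.
final class DeprecatedCodeExtractor {
    // Assumes the "Disallowed language=xxx, status=deprecated" wording from the test output.
    private static final Pattern DISALLOWED =
            Pattern.compile("Disallowed language=([a-z]{2,3}), status=deprecated");

    /** Collects each deprecated language code once, in order of first appearance. */
    static Set<String> deprecatedCodes(List<String> logLines) {
        Set<String> codes = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = DISALLOWED.matcher(line);
            if (m.find()) {
                codes.add(m.group(1));
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "Error: (TestLocale.java:921) : ajp: expected \"\", got \"Disallowed language=ajp, status=deprecated\"",
                "Error: (TestLocale.java:927) : ajp_Arab_JO: expected \"\", got \"Disallowed language=ajp, status=deprecated\"",
                "Error: (TestLocale.java:921) : kgm: expected \"\", got \"Disallowed language=kgm, status=deprecated\"");
        System.out.println(deprecatedCodes(log));
    }
}
```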

You addressed these errors in your last comment, but I still don't understand; they're different from the "sa" error.

"ajp" occurs in languageGroup.xml, languageInfo.xml, and likelySubtags.xml. Should it be deleted from languageInfo.xml, and/or likelySubtags.xml, and then should languageGroup.xml be regenerated?

@macchiati
Member

Here is what to do in more detail.

Case 1, replaced by old:

Take ajp

Look at language-subtag-registry (the diff from the old one)

You see that ajp has 2 items added:

Deprecated: 2023-03-17
Preferred-Value: apc

That means that wherever ajp occurs, "apc" should be substituted. However, if you look at apc, it is not new. So the action is to delete ajp in those files where it occurs and apc is already present. Search the directory supplemental. You find:

languageGroup.xml
93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii ajp akk am …</languageGroup> 

languageInfo.xml
170: <languageMatch desired="ajp" supported="ar" distance="10" oneway="true"/> <!-- South Levantine Arabic --> 

likelySubtags.xml (6 matches)
2,883: <likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/> <!-- South Levantine Arabic ➡︎ South Levantine Arabic (Arabic, Jordan) --> 
4,461: <likelySubtag from="gra" to="gra_Deva_IN" origin="sil1"/> <!-- Rajput Garasia ➡︎ Rajput Garasia (Devanagari, India) --> 
4,462: <likelySubtag from="gra_Gujr" to="gra_Gujr_IN" origin="sil1"/> <!-- Rajput Garasia (Gujarati) ➡︎ Rajput Garasia (Gujarati, India) --> 

In languageGroup:
If 'apc' didn't exist in that file, you would replace 'ajp' with it.
Since it does, you just delete 'ajp' (leaving the rest of the line alone).

93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii akk am …</languageGroup> 

Same in languageInfo.xml and likelySubtags.xml. 'apc' exists in each, so just delete the lines.
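
The replace-or-delete rule for a languageGroup member list can be sketched as follows. This is an illustration only: `LanguageGroupEdit` is a hypothetical name, the input is assumed to be the space-separated text content of one languageGroup element, and the sketch does not re-sort the list after a substitution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the Case 1 rule; not a CLDR tool.
final class LanguageGroupEdit {
    /**
     * If the replacement is already a member, the deprecated code is dropped;
     * otherwise the deprecated code is substituted in place.
     */
    static String retire(String members, String deprecated, String replacement) {
        List<String> codes = new ArrayList<>(Arrays.asList(members.trim().split("\\s+")));
        int index = codes.indexOf(deprecated);
        if (index >= 0) {
            if (codes.contains(replacement)) {
                codes.remove(index);            // replacement already present: just delete
            } else {
                codes.set(index, replacement);  // replacement absent: substitute
            }
        }
        return String.join(" ", codes);
    }

    public static void main(String[] args) {
        System.out.println(retire("aao ajp akk apc", "ajp", "apc"));
        System.out.println(retire("aao ajp akk", "ajp", "apc"));
    }
}
```

With the first input the result is "aao akk apc" (apc was already there, so ajp is deleted); with the second it is "aao apc akk" (apc was missing, so it takes the place of ajp).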

Suppose it were in supplementalData in the territory information (it isn't, so this is just an illustration!)

<territory type="PS" gdp="21220000000" literacyPercent="95.3" population="4818260">	<!--Palestinian Territories-->
  <languagePopulation type="ar" populationPercent="100" officialStatus="official"/>	<!--Arabic-->
  <languagePopulation type="apc" populationPercent="87" references="R1173"/>	<!--Levantine Arabic-->
  <languagePopulation type="ajp" populationPercent="2" references="..."/>	<!--South Levantine Arabic-->

In that case you would combine the two figures to get:

  <languagePopulation type="apc" populationPercent="89" references="R1173"/>	<!--Levantine Arabic-->

Use your judgment: sometimes language counts are doubled for bilingual speakers, so if it adds to a crazy amount, don't add it. (These figures are 'best available', so that's ok.)
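
As a rough sketch of that arithmetic: the helper below adds the two percentages unless the sum would exceed 100, in which case it keeps the existing figure. The 100% cutoff is an assumption standing in for "a crazy amount"; the actual decision is a judgment call, and `PopulationMerge` is a hypothetical name.

```java
// Hypothetical sketch; the 100% cutoff is an assumption, not a CLDR rule.
final class PopulationMerge {
    /** Returns the combined percent, or the kept percent unchanged if the sum would exceed 100. */
    static double merge(double keptPercent, double retiredPercent) {
        double sum = keptPercent + retiredPercent;
        return sum > 100.0 ? keptPercent : sum;
    }

    public static void main(String[] args) {
        // apc at 87% plus ajp at 2%, as in the illustration above.
        System.out.println(merge(87, 2));
    }
}
```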

Case 2, no preferred

In this case, just drop the lines.
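
Whether a deprecated record names a preferred value can be read from its registry fields. The sketch below parses one record into field/value pairs and returns the Preferred-Value if there is one. It is a simplification under stated assumptions: `RegistryRecord` is a hypothetical name, folded continuation lines and repeated fields (such as multiple Description lines) are not handled, and the record with subtag `xyz` in `main` is invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not a full language-subtag-registry parser.
final class RegistryRecord {
    /** Splits "Field: value" lines into a map (the last value wins for repeated fields). */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** Preferred-Value of a deprecated record; empty if not deprecated or none is given. */
    static Optional<String> preferredValue(String record) {
        Map<String, String> fields = parse(record);
        if (!fields.containsKey("Deprecated")) {
            return Optional.empty();
        }
        return Optional.ofNullable(fields.get("Preferred-Value"));
    }

    public static void main(String[] args) {
        String ajp = "Subtag: ajp\nDeprecated: 2023-03-17\nPreferred-Value: apc";
        String noPreferred = "Subtag: xyz\nDeprecated: 2023-03-17"; // invented example record
        System.out.println(preferredValue(ajp));
        System.out.println(preferredValue(noPreferred));
    }
}
```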

Case 3, split

Subtag: ksa
Description: Shuwa-Zamani
Added: 2009-07-29
Deprecated: 2023-03-17
Comments: see izm, rsw

Look at iso-639-3_Retirements.tab for ksa

You'll see "Split into [rsw] Rishiwa and [izm] Kizamani"

Take the first one, and treat this case like Case 1.
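
Picking the first code out of a "Split into" remark can be sketched like this. `SplitRemedy` is a hypothetical name, and the pattern assumes the three-letter codes appear in square brackets as in the remark quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for Case 3; not a CLDR tool.
final class SplitRemedy {
    // Assumes bracketed three-letter codes, e.g. "[rsw]".
    private static final Pattern BRACKETED_CODE = Pattern.compile("\\[([a-z]{3})\\]");

    /** Returns every bracketed code in the remark, in order of appearance. */
    static List<String> splitTargets(String remark) {
        List<String> codes = new ArrayList<>();
        Matcher m = BRACKETED_CODE.matcher(remark);
        while (m.find()) {
            codes.add(m.group(1));
        }
        return codes;
    }

    /** Returns the first bracketed code, which is then handled like a Case 1 replacement. */
    static String firstTarget(String remark) {
        List<String> codes = splitTargets(remark);
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("No bracketed code in: " + remark);
        }
        return codes.get(0);
    }

    public static void main(String[] args) {
        System.out.println(firstTarget("Split into [rsw] Rishiwa and [izm] Kizamani"));
    }
}
```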

@btangmu
Member Author

btangmu commented Mar 19, 2024

@macchiati I've started to follow your directions for "ajp", ...

likelySubtags.xml says "Likely subtags data is generated programatically from CLDR's language/territory/population data using the GenerateMaximalLocales tool. Under normal circumstances, this file should not be patched by hand, as any changes made in that fashion may be lost."

So I tried to run GenerateMaximalLocales and got "IllegalArgumentException: Don't run this tool until it is fixed":

    public static void main(String[] args) throws IOException {
        if (true) {
            throw new IllegalArgumentException("Don't run this tool until it is fixed");
        }

So I'll try hand-editing likelySubtags.xml anyway...

@macchiati
Member

macchiati commented Mar 19, 2024 via email

@btangmu
Member Author

btangmu commented Mar 19, 2024

@macchiati FYI you wrote that iso-639-3_Retirements.tab says "Split into [rsw] Rishiwa and [izm] Kizamani" but the version I'm seeing (in the branch for this ticket) doesn't say anything like that -- because that file is changed in this PR! So I need to look at the version of that file before this PR. Just something to be aware of when we update the instructions...

@macchiati
Member

macchiati commented Mar 19, 2024 via email

@btangmu
Member Author

btangmu commented Mar 19, 2024

@macchiati these two files disagree on the replacement for prp (Parsi), whether to change to gu or guj:

language-subtag-registry:
Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu

iso-639-3_Retirements.tab:
prp Parsi M guj 2023-01-20

Since your comments mainly refer to language-subtag-registry I'm guessing "gu", but it's just a wild guess so please confirm or correct!

Actually likelySubtags.xml already has

		<likelySubtag from="gu" to="gu_Gujr_IN"/>
		<!--{ Gujarati; ?; ? } => { Gujarati; Gujarati; India }-->

So I'm just deleting the prp line from that file

@DavidLRowe
Contributor

I think it should be "gu". "guj" is the ISO 639-3 equivalent of "gu". The ISO 639-1 (two-letter) code is preferred if it exists.
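
A tiny sketch of that normalization, for illustration only: the map below is a hand-written one-entry excerpt (an assumption; real tooling would derive the mapping from the ISO 639 tables), and `ShortestCode` is a hypothetical name.

```java
import java.util.Map;

// Hypothetical sketch; the mapping is a one-entry excerpt, not real data loading.
final class ShortestCode {
    // Assumption: ISO 639-3 code mapped to its ISO 639-1 equivalent where one exists.
    private static final Map<String, String> ISO3_TO_ISO1 = Map.of("guj", "gu");

    /** Returns the two-letter code when one is known, otherwise the input unchanged. */
    static String preferred(String code) {
        return ISO3_TO_ISO1.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println(preferred("guj")); // the Retirements file's "guj" becomes "gu"
        System.out.println(preferred("umi")); // not in the excerpt, so it stays as is
    }
}
```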

@macchiati
Member

macchiati commented Mar 19, 2024 via email

@btangmu
Member Author

btangmu commented Mar 19, 2024

Another disagreement, for szd -- replace with uki or umi?

iso-639-3_Retirements.tab:
szd Seru M uki 2023-01-20

language-subtag-registry:
Type: language
Subtag: szd
Description: Seru
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: umi

likelySubtags.xml has only umi, not uki; I'm just deleting szd from that file

@macchiati
Member

macchiati commented Mar 19, 2024 via email

@btangmu
Member Author

btangmu commented Mar 19, 2024

I made another commit. Locally there's a new set of errors, which I'll work on next:

    TestLstrConsistency {
      Error: (TestValidity.java:537) Missing aliases for supplementalMetadata: 10
<languageAlias type="ajp" replacement="apc" reason="deprecated"/> <!-- South Levantine Arabic ⇒ Levantine Arabic -->
<languageAlias type="kgm" replacement="plu" reason="deprecated"/> <!-- Karipúna ⇒ Palikúr -->
<languageAlias type="nom" replacement="cbr" reason="deprecated"/> <!-- Nocamán ⇒ Cashibo-Cacataibo -->
<languageAlias type="pmk" replacement="crr" reason="deprecated"/> <!-- Pamlico ⇒ Carolina Algonquian -->
<languageAlias type="prp" replacement="gu" reason="deprecated"/> <!-- Parsi ⇒ Gujarati -->
<languageAlias type="szd" replacement="umi" reason="deprecated"/> <!-- Seru ⇒ Ukit -->
<languageAlias type="tmk" replacement="tdg" reason="deprecated"/> <!-- Northwestern Tamang ⇒ Western Tamang -->
<languageAlias type="tpw" replacement="tpn" reason="deprecated"/> <!-- Tupí ⇒ Tupinambá -->
<languageAlias type="xss" replacement="zko" reason="deprecated"/> <!-- Assan ⇒ Kott -->
<languageAlias type="zkb" replacement="kjh" reason="deprecated"/> <!-- Koibal ⇒ Khakas -->
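
Conceptually, a check like this compares the registry's deprecated-to-preferred pairs against the aliases already present. The sketch below is a guess at that idea, not the actual TestLstrConsistency code: `MissingAliases` is a hypothetical name, the inputs are assumed to be already extracted from the registry and from supplementalMetadata, and the trailing name comments are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; not the actual test implementation.
final class MissingAliases {
    /** Builds a languageAlias line for every deprecated code that has no alias yet. */
    static List<String> find(Map<String, String> preferredValues, Set<String> existingAliasTypes) {
        List<String> missing = new ArrayList<>();
        // TreeMap gives a stable alphabetical order, like the listing above.
        for (Map.Entry<String, String> e : new TreeMap<>(preferredValues).entrySet()) {
            if (!existingAliasTypes.contains(e.getKey())) {
                missing.add(String.format(
                        "<languageAlias type=\"%s\" replacement=\"%s\" reason=\"deprecated\"/>",
                        e.getKey(), e.getValue()));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> preferred = Map.of("ajp", "apc", "prp", "gu");
        // Pretend prp already has an alias; only ajp is reported.
        find(preferred, Set.of("prp")).forEach(System.out::println);
    }
}
```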

@macchiati
Member

macchiati commented Mar 19, 2024 via email

@btangmu
Member Author

btangmu commented Mar 19, 2024

Good of it to tell you exactly which lines to add!

Add where?

supplementalMetadata.xml?

@btangmu btangmu requested a review from macchiati March 19, 2024 18:45
@btangmu
Member Author

btangmu commented Mar 19, 2024

Tests are passing!

@@ -1,7 +1,7 @@
 <?xml version='1.0' encoding='UTF-8' ?>
 <!DOCTYPE supplementalData SYSTEM '../../common/dtd/ldmlSupplemental.dtd'>
 <!--
-Copyright © 1991-2022 Unicode, Inc.
+Copyright © 1991-2024 Unicode, Inc.
 For terms of use, see http://www.unicode.org/copyright.html
 SPDX-License-Identifier: Unicode-DFS-2016
Member

Should be Unicode-3.0

Member

I'll do another pass replacing identifiers.

Member Author

Thanks!

@btangmu btangmu merged commit e9f86d0 into unicode-org:main Mar 20, 2024
10 checks passed
@btangmu btangmu deleted the t17115_c branch March 20, 2024 14:17