CLDR-17115 Update languages/codes #3538
-Numerous changes from instructions in Update Language/Script/Region Subtags
Looks good, but remove the scripts.xml file, because we're addressing that in a different PR.
-Revert scripts.xml back to main; to be addressed in a different PR
@macchiati as you requested, I reverted scripts.xml in the last commit
ISO 3166 seems wrong
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/iso_3166_status.txt
-Update world_bank_data.csv by downloading
-URL is https://databank.worldbank.org/reports.aspx?source=world-development-indicators
-Note: running AddPopulationData caused no changes
-Update supplementalData.xml by running ConvertLanguageData
-This caused removal of tok (Tok Pona) and vo (Volapük), and revision of comment for ace (Achinese to Acehnese)
Per discussion, I've reverted iso_3166_status.txt to main
It looks like there are a surprising number of errors. I think it is best for me to walk you through this, and you can capture these notes in the instructions.
It appears that ISO had an unexpected number of deprecations, so you're seeing more issues than we normally see. For lines like the following:
What is happening is that likelySubtags.xml is handling languages that are now deprecated. That is to be expected, because ISO does that occasionally, but because we added a lot of SIL language data, the number may be larger each year. To fix that, go to the file, find the line where the language is handled, and delete that line, in this case:
TestMacrolanguages
It looks like the classification changed in ISO. We still use 'sa', because the Indian government disagrees that it is only historical! Add it to the special cases in:
if (language.equals("no") || language.equals("sh")) continue; // special cases
TestCompatibility
Check the diff in the iso-639 files to verify that dzd is really de-deprecated. Then add dzd to ALLOWED_UNDELETIONS.
The "ERROR:" values below in the listing all look like keyboard stuff; I don't think those are counted. I'll file a ticket for Steven to clean those up.
 baz bbz bcc bcl bgm bh bhk bic bij bjd bjq bkb blg bmy bpb btb btl bxk bxr bxx byy
 cbe cbh cca ccq cdg cjr cka cld cmk cmn cnr coy cqu cug cum cwd
-daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd
+daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl
"dzd" was removed here
Web search for "dzd deprecated" turns up this file:
https://www.iana.org/assignments/lang-subtags-templates/dzd-2023-03-17.txt
which reads as follows:
FOR ARCHIVING: Registration form for 'dzd'
LANGUAGE SUBTAG REGISTRATION FORM
- Name of requester: Doug Ewell
- E-mail address of requester: doug at ewellic.org
- Record Requested:
  Type: language
  Subtag: dzd
  Description: Daza
- Intended meaning of the subtag:
- Reference to published description of the language (book or article):
- Any other relevant information:
  This registration tracks a change made to ISO 639-3 effective
  2023-01-20, adding the code element 'dzd' for Daza, which had been
  retired in 2015 as non-existent. The net effect of this registration is
  to remove the Deprecated value from this record.
  For more information on the ISO 639-3 change, refer to:
  https://iso639-3.sil.org/request/2022-027
Good work. That verifies that it is indeed an intentional change.
-Macrolanguage sa Sanskrit Historical, treat as exception
-Take into account the official undeprecation of dzd Daza
@macchiati my latest commit fixes the problems with sa (Sanskrit) and dzd (Daza). It does not fix the problems with ajp and others in this output:
You addressed these errors in your last comment, but I still don't understand; they're different from the "sa" error. "ajp" occurs in languageGroup.xml, languageInfo.xml, and likelySubtags.xml. Should it be deleted from languageInfo.xml, and/or likelySubtags.xml, and then should languageGroup.xml be regenerated?
Here is what to do in more detail.
Case 1, replaced by old:
Take ajp. Look at language-subtag-registry (the diff from the old one). You see that ajp has 2 items added:
That means that wherever it occurs, "apc" should be substituted. However, if you look at apc, it is not new. So the action is to delete ajp in those files where it occurs. Search the directory supplemental. You find:
In languageGroup:
Same in languageInfo.xml and likelySubtags.xml. 'apc' exists in each, so just delete the lines.
Suppose it were in supplementalData in the territory information (it isn't, so this is just an illustration!):
In that case you would combine the two figures to get:
Use your judgment: sometimes language counts are doubled for bilingual speakers, so if it adds up to a crazy amount, don't add it. (These figures are 'best available', so that's ok.)
Case 2, no preferred:
In this case, just drop the lines.
Case 3, split:
Look at iso-639-3_Retirements.tab for ksa. You'll see "Split into [rsw] Rishiwa and [izm] Kizamani". Take the first one, and treat this case like Case 1.
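The three cases above could be captured roughly as follows. This is only a sketch of one reading of the instructions: the class `DeprecatedCodeTriage`, its methods, and the `RENAME_TO_REPLACEMENT` action are hypothetical names and are not part of the CLDR tools. The rename action is inferred from "wherever it occurs, apc should be substituted" for the situation where the replacement has no line of its own yet.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for triaging a deprecated language code; not CLDR code. */
final class DeprecatedCodeTriage {
    enum Action { DELETE_LINES, RENAME_TO_REPLACEMENT }

    /**
     * Picks the replacement for a deprecated code, or empty if there is none.
     * Case 1: the registry gives a Preferred-Value, so use it.
     * Case 3: no preferred value but the code was split, so take the first target.
     * Case 2: neither applies, so there is no replacement.
     */
    static Optional<String> replacementFor(String preferredValue, List<String> splitInto) {
        if (preferredValue != null) {
            return Optional.of(preferredValue);
        }
        if (!splitInto.isEmpty()) {
            return Optional.of(splitInto.get(0));
        }
        return Optional.empty();
    }

    /**
     * Decides what to do with a data line keyed by the deprecated code.
     * With no replacement, or a replacement that already has its own line,
     * the deprecated line is simply deleted.
     */
    static Action actionFor(Optional<String> replacement, boolean replacementAlreadyPresent) {
        if (!replacement.isPresent() || replacementAlreadyPresent) {
            return Action.DELETE_LINES;
        }
        return Action.RENAME_TO_REPLACEMENT;
    }
}
```

For ajp, the replacement is apc and apc already has a line in each file, so the result is `DELETE_LINES`. Combining population figures still needs human judgment and is deliberately left out.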
@macchiati I've started to follow your directions for "ajp", ... likelySubtags.xml says "Likely subtags data is generated programatically from CLDR's language/territory/population data using the GenerateMaximalLocales tool. Under normal circumstances, this file should not be patched by hand, as any changes made in that fashion may be lost." So I tried to run GenerateMaximalLocales and got "IllegalArgumentException: Don't run this tool until it is fixed":
public static void main(String[] args) throws IOException {
    if (true) {
        throw new IllegalArgumentException("Don't run this tool until it is fixed");
    }
So I'll try hand-editing likelySubtags.xml anyway...
Right, we disabled the tool for now. It should be easy to regex-search for (ajp|...) to find all the lines, although you want to look at each one rather than automatically deleting.
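A minimal sketch of that regex search, assuming the file has already been read into a list of lines. `DeprecatedCodeFinder` and `findMentions` are hypothetical names, not CLDR tools. The code only reports matching lines so that each one can be reviewed by hand, and nothing is deleted automatically. It assumes the codes are plain lowercase letters, so they need no regex escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper that lists lines mentioning deprecated codes for manual review. */
final class DeprecatedCodeFinder {
    /** Returns "lineNumber: text" for each line that contains one of the codes. */
    static List<String> findMentions(List<String> lines, List<String> codes) {
        // Lookarounds keep "ajp" from matching inside a longer run of letters,
        // while still matching forms such as ajp_Arab or "ajp".
        Pattern pattern = Pattern.compile(
                "(?<![A-Za-z])(" + String.join("|", codes) + ")(?![A-Za-z])");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (pattern.matcher(lines.get(i)).find()) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }
}
```

Line numbers in the result are 1-based, to match what an editor shows.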
@macchiati FYI you wrote that iso-639-3_Retirements.tab says "Split into [rsw] Rishiwa and [izm] Kizamani" but the version I'm seeing (in the branch for this ticket) doesn't say anything like that -- because that file is changed in this PR! So I need to look at the version of that file before this PR. Just something to be aware of when we update the instructions...
Right. What I do is look at the diffs in the PR.
BTW, as you go through this, please jot down in a doc or text file what you
are doing, so that we can use that as a basis for updating the instructions.
@macchiati these two files disagree on the replacement for prp (Parsi), whether to change to gu or guj:
language-subtag-registry:
Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu
iso-639-3_Retirements.tab:
prp Parsi M guj 2023-01-20
Since your comments mainly refer to language-subtag-registry I'm guessing "gu", but it's just a wild guess so please confirm or correct! Actually likelySubtags.xml already has
So I'm just deleting the prp line from that file
I think it should be "gu". "guj" is the ISO 639-3 equivalent of "gu". The ISO 639-1 (two-letter) code is preferred if it exists.
gu is the right choice. (guj is the 3-letter code, but BCP 47 uses the 2-letter code whenever it exists.)
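To read the Preferred-Value out of a registry record like the prp one quoted in this thread, something along these lines might do. It is a simplified sketch: `RegistryRecord` is a hypothetical class, and it assumes one "Field: value" pair per line. A real language-subtag-registry record can repeat a field (for example Description) or fold long values across lines, and this code handles neither.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified parser for one language-subtag-registry record. */
final class RegistryRecord {
    /** Parses "Field: value" lines into a map, keeping the original field order. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(": ");
            if (colon > 0) {
                String name = line.substring(0, colon).trim();
                String value = line.substring(colon + 2).trim();
                fields.put(name, value); // a repeated field overwrites the earlier value
            }
        }
        return fields;
    }
}
```

Calling `RegistryRecord.parse(record).get("Preferred-Value")` on the prp record gives the two-letter replacement directly, so no guj-to-gu conversion is needed when the registry is the source.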
Another disagreement, for szd -- replace with uki or umi?
iso-639-3_Retirements.tab:
szd Seru M uki 2023-01-20
language-subtag-registry:
Type: language
Subtag: szd
Description: Seru
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: umi
likelySubtags.xml has only umi, not uki; I'm just deleting szd from that file
When in doubt, go by the language subtag registry
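The precedence rule could be written down as a small sketch like the one below. `ReplacementChooser` is a hypothetical name. Falling back to the iso-639-3_Retirements.tab value when the registry has no Preferred-Value is an assumption, since the thread only covers the case where both files give a value.

```java
/** Hypothetical helper: the language subtag registry wins when the two sources disagree. */
final class ReplacementChooser {
    /**
     * Returns the registry's Preferred-Value when it is present.
     * Otherwise falls back to the retirements-file value (an assumption, see above);
     * such a value may still be a 3-letter code that has a 2-letter equivalent.
     */
    static String choose(String registryPreferredValue, String retirementsChangeTo) {
        if (registryPreferredValue != null && !registryPreferredValue.isEmpty()) {
            return registryPreferredValue;
        }
        return retirementsChangeTo;
    }
}
```

For szd this gives umi rather than uki.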
I made another commit. Locally there's a new set of errors, which I'll work on next:
TestLstrConsistency {
Error: (TestValidity.java:537) Missing aliases for supplementalMetadata: 10
<languageAlias type="ajp" replacement="apc" reason="deprecated"/> <!-- South Levantine Arabic ⇒ Levantine Arabic -->
<languageAlias type="kgm" replacement="plu" reason="deprecated"/> <!-- Karipúna ⇒ Palikúr -->
<languageAlias type="nom" replacement="cbr" reason="deprecated"/> <!-- Nocamán ⇒ Cashibo-Cacataibo -->
<languageAlias type="pmk" replacement="crr" reason="deprecated"/> <!-- Pamlico ⇒ Carolina Algonquian -->
<languageAlias type="prp" replacement="gu" reason="deprecated"/> <!-- Parsi ⇒ Gujarati -->
<languageAlias type="szd" replacement="umi" reason="deprecated"/> <!-- Seru ⇒ Ukit -->
<languageAlias type="tmk" replacement="tdg" reason="deprecated"/> <!-- Northwestern Tamang ⇒ Western Tamang -->
<languageAlias type="tpw" replacement="tpn" reason="deprecated"/> <!-- Tupí ⇒ Tupinambá -->
<languageAlias type="xss" replacement="zko" reason="deprecated"/> <!-- Assan ⇒ Kott -->
<languageAlias type="zkb" replacement="kjh" reason="deprecated"/> <!-- Koibal ⇒ Khakas -->
Good of it to tell you exactly which lines to add!
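For reference, the alias lines in that test output follow one fixed shape, which can be reproduced with a formatter like this sketch. `LanguageAliasFormatter` is a hypothetical name and is not the code in TestValidity.java that prints the message.

```java
/** Hypothetical formatter that reproduces the languageAlias line shape from the test output. */
final class LanguageAliasFormatter {
    /** Builds one alias element followed by an "old name, arrow, new name" comment. */
    static String format(String type, String replacement, String oldName, String newName) {
        // U+21D2 is the double arrow used in the comments of the test output.
        return String.format(
                "<languageAlias type=\"%s\" replacement=\"%s\" reason=\"deprecated\"/>"
                        + " <!-- %s \u21D2 %s -->",
                type, replacement, oldName, newName);
    }
}
```

`LanguageAliasFormatter.format("ajp", "apc", "South Levantine Arabic", "Levantine Arabic")` yields the first line of the listing.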
Add where? supplementalMetadata.xml?
Tests are passing!
@@ -1,7 +1,7 @@
 <?xml version='1.0' encoding='UTF-8' ?>
 <!DOCTYPE supplementalData SYSTEM '../../common/dtd/ldmlSupplemental.dtd'>
 <!--
-Copyright © 1991-2022 Unicode, Inc.
+Copyright © 1991-2024 Unicode, Inc.
 For terms of use, see http://www.unicode.org/copyright.html
 SPDX-License-Identifier: Unicode-DFS-2016
Should be Unicode-3.0
I'll do another pass replacing identifiers.
Thanks!
-Numerous changes based on following instructions in Update Language/Script/Region Subtags
-Update world_bank_data.csv by downloading
-URL is https://databank.worldbank.org/reports.aspx?source=world-development-indicators
-Note: running AddPopulationData caused no changes
-Update supplementalData.xml by running ConvertLanguageData
-This caused removal of tok (Tok Pona) and vo (Volapük), and revision of comment for ace (Achinese to Acehnese)
CLDR-17115
ALLOW_MANY_COMMITS=true