CLDR-17566 md files and original text
chpy04 committed Sep 1, 2024
1 parent eb4b003 commit aaabace
Showing 14 changed files with 514 additions and 0 deletions.
5 changes: 5 additions & 0 deletions docs/site/TEMP-TEXT-FILES/external-version-metadata.txt
@@ -0,0 +1,5 @@
Updating External Version Metadata
Updating Metadata
CLDR-15005 tracks updating the process for external metadata versions. The following table is out of date with respect to common/properties/external_data_versions.tsv.
TODO: Need to add instructions for updating external metadata
The following tells how to get the version info for imported data used in a CLDR release.
10 changes: 10 additions & 0 deletions docs/site/TEMP-TEXT-FILES/language-script-description.txt
@@ -0,0 +1,10 @@
Language Script Description
The language_script spreadsheet should list all of the language / script combinations that are in common modern use. The countries are not important, since their function has been overtaken by the country_language_population spreadsheet.
If the language and script are both modern, and the script is a major way to write the language in some country, then we should see that line marked as primary.
Otherwise it should be marked secondary.
Every language that is in official use in any country according to country_language_population should have at least one primary script in the language_script spreadsheet.
If a language has multiple primary scripts, then it should not appear without the script tag in the country_language_population.tsv. For example, we should not see "az", but rather "az_Cyrl", "az_Latn", and so on. For each country where the language is used, we should see figures on the script-specific values. The values may overlap, that is, we may see az_Cyrl at 60% and az_Latn at 55%. However, the combination with the predominantly used script must have a larger figure than the others.
This is also reflected in CLDR main: languages with multiple scripts will have that reflected in their structure (e.g. sr-Cyrl-RS), with aliases for the language-region combinations.
Files in https://github.com/unicode-org/cldr/tree/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data:
country_language_population.tsv
language_script.tsv
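The script rules above can be sketched as a small check. This is only an illustration: the function name, its parameters, and the in-memory dictionary of percentages are hypothetical, and it does not read the actual column layout of country_language_population.tsv.

```python
def check_language_entries(language, primary_scripts, predominant_script, shares):
    """Check one language in one country against the rules described above.

    language: bare language code, e.g. "az" (hypothetical input shape).
    primary_scripts: set of script codes marked primary for the language.
    predominant_script: script code predominantly used in this country.
    shares: dict mapping locale code (e.g. "az_Cyrl") to a percentage.
    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    if len(primary_scripts) <= 1:
        return problems  # the script-tag rule only applies to multi-script languages

    # Rule: a language with several primary scripts must not appear bare.
    if language in shares:
        problems.append(f"{language} appears without a script tag")

    # Rule: the predominant script must have a larger figure than the others.
    top_key = f"{language}_{predominant_script}"
    top = shares.get(top_key)
    if top is None:
        problems.append(f"{top_key} has no figure")
    else:
        for code, value in sorted(shares.items()):
            if code != top_key and code.startswith(language + "_") and value >= top:
                problems.append(f"{code} ({value}%) is not below {top_key} ({top}%)")
    return problems
```

With the overlapping figures from the example, `check_language_entries("az", {"Cyrl", "Latn"}, "Cyrl", {"az_Cyrl": 60, "az_Latn": 55})` reports no problems, because overlap is allowed as long as the predominant script has the larger figure.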
17 changes: 17 additions & 0 deletions docs/site/TEMP-TEXT-FILES/likelysubtags-and-default-content.txt
@@ -0,0 +1,17 @@
LikelySubtags and Default Content
First, make sure that you have completed Update Language/Script/Region Subtags.
Run GenerateMaximalLocales with VM argument -DCLDR_DIR set to your cldr directory to generate the likely subtag data AND the default content locales.
If you are trying to debug, add the VM argument -DGenerateMaximalLocalesDebug
Input data:
Data comes from territory/language information in supplemental data.
However, it is supplemented by LANGUAGE_OVERRIDES in GenerateMaximalLocales.java
If there is no territory/language information in supplemental data for a language, add it to LANGUAGE_OVERRIDES.
If the mapping changes when it shouldn't (there are some special cases), add to LANGUAGE_OVERRIDES.
Output:
Creates {CLDR_DIR}/../Generated/cldr/supplemental/likelySubtags.xml and {CLDR_DIR}/../Generated/cldr/supplemental/supplementalMetadata.xml
Diff with {CLDR_DIR}/common/supplemental/likelySubtags.xml and {CLDR_DIR}/common/supplemental/supplementalMetadata.xml
Be very careful to diff everything and check for errors.
Watch especially for backwards incompatible changes; that is, changes rather than just additions.
See the input data notes above for how to handle those with LANGUAGE_OVERRIDES.
Run tests, fix input data, and iterate as necessary.
Copy into the svn workspace and commit.
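As an aid to the diff step, the following sketch separates pure additions from changed or removed mappings, since the latter are the backwards-incompatible cases. It assumes only that mappings are `<likelySubtag from="..." to="..."/>` elements; the function names are hypothetical and this is not part of the CLDR tooling.

```python
import xml.etree.ElementTree as ET


def load_likely_subtags(xml_text):
    """Return a dict of from -> to for every <likelySubtag> element."""
    root = ET.fromstring(xml_text)
    return {el.get("from"): el.get("to") for el in root.iter("likelySubtag")}


def classify_changes(old, new):
    """Split the differences between two mappings into three groups.

    added:   keys only in the new data (usually harmless).
    removed: keys only in the old data (needs review).
    changed: keys whose target differs (needs review).
    """
    added = {k: v for k, v in new.items() if k not in old}
    removed = {k: v for k, v in old.items() if k not in new}
    changed = {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}
    return added, removed, changed
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Anything reported under `removed` or `changed` is a candidate for a LANGUAGE_OVERRIDES entry.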
55 changes: 55 additions & 0 deletions docs/site/TEMP-TEXT-FILES/update-currency-codes.txt
@@ -0,0 +1,55 @@
Update Currency Codes
Go to https://www.six-group.com/en/products-services/financial-information/data-standards.html#scrollTo=currency-codes
Take the link for "Current Currency and Funds": "List one (XML)"
Save the page as {cldr}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a1.xml
curl 'https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list_one.xml' > tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a1.xml
Take the link for "Historic denominations": "List three (XML)"
Save the page as {cldr}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a3.xml
curl 'https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list_three.xml' > tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a3.xml
Use git diff to sanity check the two XML files against the old, and check them in.
"git diff -w" is helpful to ignore whitespace. If there are only whitespace changes, there's no need to check them in.
Check the ISO amendments to get changes that will happen during the current cycle.
Example: https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/amendments/dl_currency_iso_amendment_170.pdf
There currently appears to be no good way to collect all applicable amendments other than incrementing "170" in the above link until a 404 error results. So:
Review all amendments that are dated after the previous update, and patch the XML files and the supplementalData.xml as below.
Record the last number viewed in the URL above.
(There is a "download all amendments" link now that has a spreadsheet summary.)
Record the version: See Updating External Metadata
If there are no diffs in the two iso tables, and no relevant changes in the amendments, you are done.
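The "increment until 404" approach to finding amendments can be sketched as below. The URL pattern is taken from the example link above; the function names, the `limit` safety cap, and the assumption that the server answers a missing amendment with a plain HTTP 404 are illustrative and may not match the site's actual behavior.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

AMENDMENT_URL = (
    "https://www.six-group.com/dam/download/financial-information/"
    "data-center/iso-currrency/amendments/dl_currency_iso_amendment_{n}.pdf"
)


def url_exists(url):
    """Return True if the URL can be fetched, False on HTTP 404."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other HTTP error needs a human to look at it


def find_amendments(start, exists=url_exists, limit=50):
    """Return consecutive amendment numbers from `start` until one is missing.

    `exists` is injectable so the loop can be tried without network access;
    `limit` caps the number of probes (a hypothetical safety margin).
    """
    found = []
    for n in range(start, start + limit):
        if not exists(AMENDMENT_URL.format(n=n)):
            break
        found.append(n)
    return found
```

The last number in the returned list is the one to record for the next update cycle.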
Run CountItems -Dmethod=generateCurrencyItems to generate the new currency list.
If any currency is missing from ISO4217.txt, the program will throw an exception and will print a list of items at the end that need to be added to the ISO4217.txt file. Add as described below.
Once the necessary codes are added to ISO4217.txt, repeat the CountItems -Dmethod=generateCurrencyItems until it runs cleanly.
If any country changes the use of a currency, verify that there is a corresponding entry in SupplementalData.
Since ISO doesn't publish the exact date change (usually just a month), you may need to do some additional research to see if you can determine the exact date when a new currency becomes active, or when an old currency becomes inactive. If you can't find the exact date, use the last day of the month ISO publishes for an old currency expiring.
For new currencies, see below.
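When only the month of expiry is known, the fallback date described above (the last day of that month) can be computed rather than looked up. A minimal sketch; the function name is hypothetical.

```python
import calendar
from datetime import date


def last_day_of_month(year, month):
    """Return the last calendar day of the given month as a date."""
    # monthrange returns (weekday of the first day, number of days in month).
    days_in_month = calendar.monthrange(year, month)[1]
    return date(year, month, days_in_month)
```

Calling `.isoformat()` on the result gives the YYYY-MM-DD form used in the `from` and `to` attributes.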
Adding a currency:
Make sure the new code exists in common/bcp47/currency.xml. The currency code should be in lower case, and make sure the "since" release corresponds to the next release of CLDR that will publish using this data.
In SupplementalData:
If it has unusual rounding or number of digits, add to:
<fractions>
<info iso4217="ADP" digits="0" rounding="0"/>
...
For each country in which it comes into use, add a line for when it becomes valid:
<region iso3166="TR">
<currency iso4217="TRY" from="2005-01-01"/>
Add the code to the file java/org/unicode/cldr/util/data/ISO4217.txt. This is important, since it is used to get the valid codes for the survey tool.
Example:
currency | TRY | new Turkish Lira | TR | TURKEY | C
Mark the old code in java/org/unicode/cldr/util/data/ISO4217.txt as deprecated.
currency | TRL | Old Turkish Lira | TR | TURKEY | O
Changing currency.
If the currency goes out of use in a country, then add the last day of use, such as:
<region iso3166="TR">
<currency iso4217="TRL" from="1922-11-01"/>
=>
<region iso3166="TR">
<currency iso4217="TRL" from="1922-11-01" to="2005-12-31"/>
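The before/after change shown above amounts to adding a `to` attribute to the open-ended entry. The sketch below illustrates that on an isolated `<region>` fragment; the function name is hypothetical, and because ElementTree drops comments and the DOCTYPE when it rewrites a document, it is not a substitute for editing supplementalData.xml by hand.

```python
import xml.etree.ElementTree as ET


def close_currency(region_xml, code, last_day):
    """Add to="last_day" to the open-ended <currency> entry for `code`.

    region_xml: a single <region> element as text.
    Entries that already have a `to` attribute are left untouched.
    """
    region = ET.fromstring(region_xml)
    for currency in region.iter("currency"):
        if currency.get("iso4217") == code and currency.get("to") is None:
            currency.set("to", last_day)
    return ET.tostring(region, encoding="unicode")
```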
Edit common/main/en.xml to add the new names (or change old ones) based on the descriptions.
If there is a collision between a new and old name, the old one typically changes to the currency name with the date range
"currency_name (1983-2003)".
Check in your changes:
common/bcp47/currency.xml
tools/java/org/unicode/cldr/util/data/ISO4217.txt
common/main/en.xml
common/supplemental/supplementalData.xml
Note: We no longer maintain the list of currencies in supplementalMetadata.xml (#4298). The list is currently maintained in bcp47/currency.xml. We need to move the code used for checking the list of ISO currencies (and their numeric code mapping), which is currently in the ICU tools repository (http://source.icu-project.org/repos/icu/tools/trunk/currency/).
31 changes: 31 additions & 0 deletions docs/site/TEMP-TEXT-FILES/update-language-script-info.txt
@@ -0,0 +1,31 @@
Update Language Script Info
Main
https://github.com/unicode-org/cldr/tree/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data has files with this form:
country_language_population.tsv
language_script.tsv
For a description of the contents, see Language Script Guidelines.
Do not edit the above files with a plain text editor; they are tab-delimited UTF-8 with many fields and should be imported/edited with a spreadsheet editor. Excel or Google Sheets should work fine.
The World Bank, UN, and Factbook data should be updated as per Updating Population, GDP, Literacy.
Note that there is an auxiliary file util/data/external/other_country_data.txt, which contains data that supplements the others. If there are errors below because the country population is less than the language population, then that file may need updating.
Run the tool ConvertLanguageData.
Use -DADD_POP=true for error messages.
If there are any different country names, you'll get an error: edit external/alternate_country_names.txt to add them.
Look for failures in the language vs script data, following the line:
Problems in language_script.tsv
Look for Territory Language data, following the line:
Possible Failures ...
In Basic Data but not Population > 20%
and the reverse.
Look for general problems, following the line:
Failures in Output.
It will also warn if a country doesn't have an official or de facto official language.
Work until resolved.
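When the tool output is long, it can help to group it under the marker lines listed above. The sketch below assumes that each marker starts a line and that the non-blank lines following it belong to that marker until the next one appears; the real output layout of ConvertLanguageData may differ, and the function name is hypothetical.

```python
MARKERS = (
    "Problems in language_script.tsv",
    "Possible Failures",
    "In Basic Data but not Population > 20%",
    "Failures in Output",
)


def group_by_marker(lines):
    """Group output lines under the most recent marker line seen.

    Lines before the first marker and blank lines are skipped.
    Returns a dict mapping marker text to the list of lines under it.
    """
    sections = {}
    current = None
    for line in lines:
        text = line.rstrip("\n")
        marker = next((m for m in MARKERS if text.startswith(m)), None)
        if marker is not None:
            current = marker
            sections.setdefault(current, [])
        elif current is not None and text.strip():
            sections[current].append(text)
    return sections
```

An empty list under a marker means that section of the output reported nothing.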
The tool updates {cldrdata}/common/supplemental/supplementalData.xml in place.
Carefully diff
Then run QuickCheck to verify that the DTD is in order, and commit.
Update the supplementalData.xml <territoryContainment>
For UN M.49 codes, see Updating UN Codes
For the UN, go to http://www.un.org/en/member-states/index.html. Copy the table, and paste into util/data/external/un_member_states_raw.txt. Diff with old. BROKEN LINK
For the EU, see instructions on Updating UN Codes
For the EZ, do the same with http://ec.europa.eu/economy_finance/euro/adoption/euro_area/index_en.htm, into util/data/external/ez_member_states_raw.txt  BROKEN LINK
If there are changes, update <territoryContainment>
73 changes: 73 additions & 0 deletions docs/site/TEMP-TEXT-FILES/update-languagescriptregion-subtags.txt
@@ -0,0 +1,73 @@
Update Language/Script/Region Subtags
Updated 2021-02-17 by Yoshito Umaoka
This updates language codes, script codes, and territory codes.
First get the latest ISO 639-3 from http://www-01.sil.org/iso639-3/download.asp
Download the zip file containing the UTF-8 tables; it will have a name like iso-639-3_Code_Tables_20210202.zip
Unpack the zip file and update files below with the latest version:
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/iso-639-3.tab
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/iso-639-3_Name_Index.tab
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/iso-639-3-macrolanguages.tab
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/iso-639-3_Retirements.tab
Take the latest version number of the zip files (e.g. iso-639-3_Code_Tables_20210202.zip), and paste into
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/iso-639-3-version.tab
Go to http://www.iana.org/assignments/language-subtag-registry
(you can set up a watch for changes in this page with http://www.watchthatpage.com )
Save as {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language-subtag-registry
Go to http://data.iana.org/TLD/
Right-click on tlds-alpha-by-domain.txt and save as
{CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/tlds-alpha-by-domain.txt
If using Eclipse, refresh the files
Diff each with the old copy to check for consistency
Some of the steps below require that you note specific differences.
Check if there is a new macrolanguage (marked with M in the second column of the iso-639-3.tab file). (Should automate this, but there typically aren't that many new/changed entries).
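The macrolanguage check could be automated along the following lines. This sketch assumes the table has a header row with columns named `Id` and `Scope`, and that macrolanguages carry the value `M` in `Scope`; verify those names against the downloaded file before relying on it. The function names are hypothetical.

```python
import csv
import io


def macrolanguage_ids(tab_text):
    """Return the set of Ids whose Scope is "M" in a tab-delimited table."""
    reader = csv.DictReader(
        io.StringIO(tab_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # language names may contain quote characters
    )
    return {row["Id"] for row in reader if row["Scope"] == "M"}


def new_macrolanguages(old_text, new_text):
    """Return macrolanguage Ids present in the new table but not the old one."""
    return sorted(macrolanguage_ids(new_text) - macrolanguage_ids(old_text))
```

The two arguments would be the contents of the previous and the freshly downloaded iso-639-3.tab, read as UTF-8 text.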
Update tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/iso_3166_status.txt
Go to https://www.iso.org/obp/ui/#iso:pub:PUB500001:en
Click Full List of Country Codes
Run the tool CompareIso3166_1Status
Click on the "Officially Assigned" code type and also the "Other Codes" code type
Compare total counts with the tool output. For example, "formerly_used ||  22" should coincide with 22 Formerly Used codes.
If something is wrong, you'll have to scroll through the code list and/or dig around for the updates
Check if ISO has done something destabilizing with codes: you need to handle it specially.
Record the version: See Updating External Metadata
Do validity checks and regenerate: for details see Validity
You'll have to do this again in Updating Subdivision Codes.
Edit common/main/en.xml to add any new names, based on the Descriptions in the registry file.
You only need to add new languages and scripts that we add to supplementalMetadata.
But you need all territories.
Any new macrolanguages need a language alias.
Diff for sanity check
If the code becomes deprecated, then add to supplementalMetadata under <alias>
If there is a single replacement add it.
Territories can have multiple replacements. Put them in population order.
There are a few territories that don't yet have a top level domain (TLD) assigned, such as "BQ" or "SS".
If there are new ones added in tlds-alpha-by-domain.txt for a territory already in CLDR, update {cldrdata}\tools\java\org\unicode\cldr\util\data\territory_codes.txt with the new TLD (usually the same as the country code).
For new territories (regions) // TODO: automate this more
Add to the territoryContainment in supplementalData.xml
The data for that is at the UN site: http://unstats.un.org/unsd/methods/m49/m49regin.htm
With data from the EU at http://europa.eu/abc/european_countries/index_en.htm
Add to territory_codes.txt
Use the UN mapping above for the 3letter and 3number codes.
FIPS is a withdrawn standard as of 2008, so any new territories won't have a FIPS10 code.
Look at tlds-alpha-by-domain.txt to see if the new territory has a TLD assigned yet.
Rerun CountItems above.
Add metazone mappings as needed. (Usually John - requires research)
Add the country/lang/population data (Usually Rick - requires research)
Add the currency data (Usually John - requires research)
Update util/data/territory_codes.txt
This step will be different once the data is moved into SupplementalData.xml
Todo: fix GenerateEnums around Utility.getUTF8Data("territory_codes.txt");
Then run GenerateEnums.java, and make sure it completes with no exceptions. Fix any necessary results.
Missing alpha3 for: xx, or "In RFC 4646 but not in CLDR: [EA, EZ, IC, UN]"
Ignore if it is {EA, EZ, IC, UN}. Otherwise, it means you need to do "For new territories" above.
Collision with: xx
Ignore if it is {{MM, BU, 104}, {TP, TL, 626}, {YU, CS, 891}, {ZR, CD, 180}}
Not in World but in CLDR: [002, 003, 005, 009, 011, 013, 014, 015, 017... Ignore 3-digit codes.
(The tool should have exception lists for the ignored cases above.)
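Until the tool has such exception lists, the ignore rules above can be written down as data and applied to the reported codes. This is a sketch only: the constant and function names are hypothetical, and extracting the codes from the GenerateEnums messages is left to the reader.

```python
# Codes that may be reported as missing alpha3 / not in CLDR and can be ignored.
IGNORED_MISSING = {"EA", "EZ", "IC", "UN"}

# Known collision groups that can be ignored (order within a group is irrelevant).
IGNORED_COLLISIONS = {
    frozenset({"MM", "BU", "104"}),
    frozenset({"TP", "TL", "626"}),
    frozenset({"YU", "CS", "891"}),
    frozenset({"ZR", "CD", "180"}),
}


def missing_needing_attention(codes):
    """Return reported missing codes that are not on the ignore list."""
    return sorted(set(codes) - IGNORED_MISSING)


def collision_needs_attention(group):
    """Return True unless the collision group is one of the known ones."""
    return frozenset(group) not in IGNORED_COLLISIONS


def not_in_world_needing_attention(codes):
    """Return reported codes other than the ignorable 3-digit ones."""
    return sorted(c for c in codes if not (len(c) == 3 and c.isdigit()))
```

Anything these functions return is a sign that the "For new territories" steps still need to be done for that code.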
Run ConsoleCheckCLDR -f en -z FINAL_TESTING -e
If you missed any codes, you will get an error message: "Unexpected Attribute Value"
Run all the unit tests.
If you get a failure in LikelySubtagsTest because of a new region, you can hack around it with something like:
<likelySubtag from="und_202" to="en_Latn_NG"/>
<!-- hack until rebuilt -->
You may also have to fix the coverageLevels.txt file for an error like:
Error: (TestCoverageLevel.java:604) Comprehensive & no exception for path => //ldml/localeDisplayNames/territories/territory[@type="202"]
27 changes: 27 additions & 0 deletions docs/site/TEMP-TEXT-FILES/update-time-zone-data-for-zoneparser.txt
@@ -0,0 +1,27 @@
Update Time Zone Data for ZoneParser
Note: This is usually done as a part of full time zone data update process.
1. Download the latest version of the IANA Time Zone Database from https://www.iana.org/time-zones
There are 3 links available for the latest version. Select the complete distribution tzdb-<version>.tar.lz (e.g. tzdb-2021a.tar.lz).
Extract entire contents to a work directory.
Note: The data-only distribution contains the minimum set of files you really need. However, you cannot use the convenient make target without the code. The complete distribution package contains the code.
2. Run make target - rearguard_tarballs_version
This target creates a "rearguard" version of the zoneinfo files under the directory tzdataunknown-rearguard.dir.
Note: If you specify a version (e.g. VERSION=2021a) when invoking the target, "unknown" will be replaced with the specified version (e.g. tzdata2021a-rearguard.dir), but this is not important for these instructions.
A standard zoneinfo file may use negative daylight saving time offsets. CLDR code currently cannot handle negative daylight saving time offsets. The "rearguard" version is designed for tools without negative daylight saving time support.
3. Copy files generated by previous step to {CLDR_DIR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data
Below is the list of files to include:
africa
antarctica
asia
australasia
backward
etcetera
europe
leapseconds
northamerica
southamerica
zone.tab
Note: leapseconds might be removed from the list later.
4. Edit the file {CLDR_DIR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/tzdb-version.txt
This file contains a single line of text specifying the version of the Time Zone Database, e.g. 2021a.
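Steps 3 and 4 can be sketched as one small helper that copies the listed files and writes the version file. The function name and the trailing newline in tzdb-version.txt are assumptions; the file list is the one given in step 3.

```python
import shutil
from pathlib import Path

# The files listed in step 3.
TZ_FILES = [
    "africa", "antarctica", "asia", "australasia", "backward", "etcetera",
    "europe", "leapseconds", "northamerica", "southamerica", "zone.tab",
]


def copy_rearguard_files(rearguard_dir, data_dir, version):
    """Copy the listed zoneinfo files and record the tzdb version.

    rearguard_dir: directory produced by the make target.
    data_dir: the CLDR util/data directory named in step 3.
    version: the Time Zone Database version, e.g. "2021a".
    Raises FileNotFoundError before copying anything if a file is missing.
    """
    src = Path(rearguard_dir)
    dst = Path(data_dir)
    missing = [name for name in TZ_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    for name in TZ_FILES:
        shutil.copyfile(src / name, dst / name)
    (dst / "tzdb-version.txt").write_text(version + "\n", encoding="utf-8")
```

Checking for missing files first avoids leaving the data directory with a mix of old and new zoneinfo files.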
5. Record the version: See Updating External Metadata
29 changes: 29 additions & 0 deletions docs/site/development/updating-codes/external-version-metadata.md
@@ -0,0 +1,29 @@
---
title: Updating External Version Metadata
---

# Updating External Version Metadata

## Updating Metadata

[CLDR\-15005](https://unicode-org.atlassian.net/browse/CLDR-15005) tracks updating the process for external metadata versions. The following table is out of date with respect to [common/properties/external\_data\_versions.tsv](https://github.com/unicode-org/cldr/blob/main/common/properties/external_data_versions.tsv).

### TODO: Need to add instructions for updating external metadata

~~The following tells how to get the version info for imported data used in a CLDR release.~~

| Data | File | Version Info | Date |
|---|---|---|---|
| UN literacy data | [un_literacy.csv](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/un_literacy.csv) | Date at top | 2012-08 |
| Worldbank data | [world_bank_data.csv](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/world_bank_data.csv) | Date at bottom | 2020-12-16 |
| Factbook data | [factbook_population.txt](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/factbook_population.txt) | record when downloaded in TBD | |
| ISO 639 (language) data | [iso-639-3-version.tab](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/iso-639-3-version.tab) | Date in YYYYMMDD format | 2021-02-02 |
| ISO subdivision codes | iso subdivision codes | record when downloaded in TBD | |
| ISO subdivision names | iso subdivision names | record when downloaded in TBD | |
| ISO currency data | iso currency data | record when downloaded in TBD | |
| Timezone IDs (tzdb) | timezones (tz) | Release date on [IANA time zone DB](https://www.iana.org/time-zones) | 2021-01-24 (2021a) |
| Top level domains | [tlds-alpha-by-domain.txt](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/tlds-alpha-by-domain.txt) | Date at top | 2021-02-17 |
| Language Groups | TBD | Record when downloaded in TBD | |
| UN / EU Codes | TBD | Record when downloaded in TBD | |

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)