From ffe4224dc36d1fa645f9318fe29de5f495d4eef7 Mon Sep 17 00:00:00 2001 From: Chris Pyle Date: Sun, 1 Sep 2024 18:24:20 -0400 Subject: [PATCH 1/3] CLDR-17566 Initial text and md files --- .../TEMP-TEXT-FILES/update-validity-xml.txt | 16 +++ .../updating-population-gdp-literacy.txt | 88 ++++++++++++ .../updating-script-metadata.txt | 63 +++++++++ .../updating-subdivision-codes.txt | 85 ++++++++++++ .../updating-subdivision-translations.txt | 18 +++ .../TEMP-TEXT-FILES/updating-un-codes.txt | 22 +++ .../updating-codes/update-validity-xml.md | 23 ++++ .../updating-population-gdp-literacy.md | 108 +++++++++++++++ .../updating-script-metadata.md | 85 ++++++++++++ .../updating-subdivision-codes.md | 129 ++++++++++++++++++ .../updating-subdivision-translations.md | 26 ++++ .../updating-codes/updating-un-codes.md | 30 ++++ 12 files changed, 693 insertions(+) create mode 100644 docs/site/TEMP-TEXT-FILES/update-validity-xml.txt create mode 100644 docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt create mode 100644 docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt create mode 100644 docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt create mode 100644 docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt create mode 100644 docs/site/TEMP-TEXT-FILES/updating-un-codes.txt create mode 100644 docs/site/development/updating-codes/update-validity-xml.md create mode 100644 docs/site/development/updating-codes/updating-population-gdp-literacy.md create mode 100644 docs/site/development/updating-codes/updating-script-metadata.md create mode 100644 docs/site/development/updating-codes/updating-subdivision-codes.md create mode 100644 docs/site/development/updating-codes/updating-subdivision-translations.md create mode 100644 docs/site/development/updating-codes/updating-un-codes.md diff --git a/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt b/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt new file mode 100644 index 00000000000..91af54d4edb --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt @@ -0,0 +1,16 @@ +Update Validity XML +Create the archive (Creating the Archive) with at least the last release (if you don't have it already) +Run GenerateValidityXML.java +This updates files in cldr/common/validity/. (If you set -DSHOW_FILES, you'll see this on the console.) +New files should not be generated. If there are any, something has gone wrong, so raise this as an issue on cldr-dev. Note: cldr/common/validity/currency.xml contains a comment line - ) of the form: + +Run the following (you must have all the archived versions loaded, back to cldr-28.0!) +TestValidity -e9 +If they are ok, replace and checkin \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt b/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt new file mode 100644 index 00000000000..c9c0cbc6eb6 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt @@ -0,0 +1,88 @@ +Updating Population, GDP, Literacy +Updated 2021-02-10 by Yoshito +Instructions are based on Chrome browser. +Load the World DataBank +The World DataBank is at (http://databank.worldbank.org/data/views/variableselection/selectvariables.aspx?source=world-development-indicators). Unfortunately, they keep changing the link. If the page has been moved, try to get to it by doing the following. Each of the links are what currently works, but that again may change. 
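+(Side note, before the navigation steps below: the same two indicators can also be spot-checked through the World Bank API. This is not the documented download path, just a quick cross-check of the numbers; the URL shape and parameters here are assumptions to verify against the current API documentation. The indicator codes are the same ones that appear in the CSV.)
+curl 'https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?format=json&date=2021&per_page=400'
+curl 'https://api.worldbank.org/v2/country/all/indicator/NY.GNP.MKTP.PP.CD?format=json&date=2021&per_page=400'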
+Go to http://worldbank.org +Click "View More Data" in the Data section (http://data.worldbank.org/) +Click "Data Catalog" (http://datacatalog.worldbank.org/) +Search "World Development Indicators" (http://data.worldbank.org/data-catalog/world-development-indicators) +In "Data & Resources" tab, click on the blue "Databank" link. It should open a new Window - https://databank.worldbank.org/reports.aspx?source=world-development-indicators +Once you are there, generate a file by using the following steps. There are 3 collapsible sections, "Country", "Series", and "Time" +Countries +Expand the "Country" section, click the "Countries" tab, and then click the "Select All" button on the left. You do NOT want the aggregates here, just the countries. There were 217 countries on the list when these instructions were written; if substantially more than that, you may have mistakenly included aggregates. +Series +Expand the "Series" section. +Select "Population, total" +Select "GNI, PPP (current international $)" +Time +Select all years starting at 2000 up to the latest available year. The latest as of this writing was "2021". Be careful here, because sometimes it will list a year as being available, but there will be no real data there, which messes up our tooling. +The tooling will automatically handle new years. +Click the "Download Options" link in the upper right. +A small "Download options" box will appear. +Select "CSV" +Instruct your browser to the save the file. +You will receive a ZIP file named "Data_Extract_From_World_Development_Indicators.zip". +Unpack this zip file. It will contain two files. +(From a unix command line, you can unpack it with +"unzip -j -a -a Data_Extract_From_World_Development_Indicators.zip" +to junk subdirectories and force the file to LF line endings.) +The larger file (126kb as of 2021-02-10) contains the actual data we are interested in. The file name should be something like f17e18f5-e161-45a9-b357-cba778a279fd_Data.csv +The smaller file is just a field definitions file that we don't care about. +Verify that the data file is of the form: +Country Name,Country Code,Series Name,Series Code,2000 [YR2000],2001 [YR2001],2004 [YR2004],... +Afghanistan,AFG,"Population, total",SP.POP.TOTL,19701940,20531160,23499850,24399948,25183615,... +Afghanistan,AFG,"GNI, PPP (current international $)",NY.GNP.MKTP.PP.CD,..,..,22134851020.6294,25406550418.3726,27761871367.4836,32316545463.8146,... +Albania,ALB,"Population, total",SP.POP.TOTL,3089027,3060173,3026939,3011487,2992547,2970017,... +... +Rename it to world_bank_data.csv and and save in {cldr}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/ +Diff the old version vs. the current. +If the format changes, you'll have to modify WBLine in AddPopulationData.java to have the right order and contents. +Load UN Literacy Data +Goto http://unstats.un.org/unsd/demographic/products/socind/default.htm +Click on "Education" +Click in "Table 4a - Literacy" +Download data - save as temporary file +Open in Excel, OpenOffice, or Numbers - save as cldr/tools/java/org/unicode/cldr/util/data/external/un_literacy.csv (Windows Comma Separated) +If it has multiple sheets, you want the one that says "Data", and looks like: +Table 4a. Literacy +Last update: December 2012 +Country or area Year Adult (15+) literacy rate Youth (15-24) literacy rate +Total Men Women Total Men Women +Albania 2008 96 97 95 99 99 99 +Diff the old version vs. the current. 
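+For example (a sketch; the path is the one used above, adjust it if the file has moved in your checkout):
+git diff -- tools/java/org/unicode/cldr/util/data/external/un_literacy.csv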
+If the format changes, you'll have to modify the loadUnLiteracy() method in org/unicode/cldr/tool/AddPopulationData.java +Note that the content does not seem to have changed since 2012, but the page says "Please note this page is currently under revision." +If there is no change to the data (still no change 10 years later), there is no reason to commit a new version of the file. +See also CLDR-15923 +Load CIA Factbook +Note: Pages in original instruction were moved to below. These pages no longer provide text version compatible with files in CLDR. (CLDR-14470) +Population: https://www.cia.gov/the-world-factbook/field/population +Real GDP (purchasing power parity): https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity +All files are saved in cldr/tools/java/org/unicode/cldr/util/data/external/ +Goto: https://www.cia.gov/library/publications/the-world-factbook/index.html +Goto the "References" tab, and click on "Guide to Country Comparisons" +Expand "People and Society" and click on "Population" - +There's a "download" icon in the right side of the header. Right click it, Save Link As... call it +factbook_population.txt +You may need to delete header lines. The first line should begin with "1 China … " or similar. +Back up a page, then Expand "Economy" and click on "GDP (purchasing power parity)" +Right Click on DownloadData, Save Link As... call it +factbook_gdp_ppp.txt +You may need to delete header lines. The first line should begin with "1 China … " or similar. +Literacy - No longer works, so we need to revise program - They are still publishing updates to the data at this page, we just need to write some code to put the data into a form we can use, see CLDR-9756 (comment 4) +https://www.cia.gov/library/publications/the-world-factbook/fields/2103.html maybe https://www.cia.gov/library/publications/the-world-factbook/fields/370.html ? +Right Click on "Download Data", Save Link As... Call it +factbook_literacy.txt +Diff the old version vs. the current. +If the format changes, you'll have to modify the loadFactbookLiteracy()) method in org/unicode/cldr/tool/AddPopulationData.java +Convert the data +If you saw any different country names above, you'll need to edit external/alternate_country_names.txt to add them. +For example, we needed to add Czechia in 2016. +Q: How would I know? +If two-letter non-countries are added, then you'll need to adjust StandardCodes.isCountry. +Q: How would I know? +Run "AddPopulationData -DADD_POP=true" and look for errors. +java -jar -DADD_POP=true -DCLDR_DIR=${HOME}/src/cldr cldr.jar org.unicode.cldr.tool.AddPopulationData +Once everything looks ok, check everything in to git. +Once done, then run the ConvertLanguageData tool as on Update Language Script Info \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt b/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt new file mode 100644 index 00000000000..a1914fd7e19 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt @@ -0,0 +1,63 @@ +Updating Script Metadata +New Unicode scripts +We should work on script metadata early for a Unicode version, so that it is available for tools (such as Mark's "UCA" tools). +Unicode 9/CLDR 29: New scripts in CLDR but not yet in ICU caused trouble. +Unicode 10: Working on a pre-CLDR-31 branch, plan to merge into CLDR trunk after CLDR 31 is done. +Should the script metadata code live in the Unicode Tools, so that we don't need a CLDR branch during early Unicode next-version work? 
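+(For the preliminary PropertyValueAliases.txt entries described in the next paragraph: new Script and Block lines follow the existing UCD pattern of "property; short value; long name". A sketch using Toto, the Unicode 14 script mentioned later on this page; copy the exact aliases and alignment from the released file rather than from here.)
+sc ; Toto       ; Toto
+blk; Toto       ; Toto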
+If the new Unicode version's PropertyValueAliases.txt does not have lines for Block and Script properties yet, then create a preliminary version. Diff the Blocks.txt file and UnicodeData.txt to find new scripts. Get the script codes from http://www.unicode.org/iso15924/codelists.html . Follow existing patterns for block and script names, especially for abbreviations. Do not add abbreviations (which differ from the long forms) unless there is a well-established pattern in the existing data. +Aside from instructions below for all script metadata changes, new script codes need English names (common/main/en.xml) and need to be added to common/supplemental/coverageLevels, under key %script100, so that the new script names will show up in the survey tool. For example, see the changes for new Unicode 8 scripts. +Can we add new scripts in CLDR trunk before or only after adding them to CLDR's copy of ICU4J? We did add new Unicode 9 scripts in CLDR 29 before adding them to ICU4J. The CLDR unit tests do not fail any more for scripts that are newer than the Unicode version in CLDR's copy of ICU. +Sample characters +We need sample characters for the "UCA" tools for generating FractionalUCA.txt. +Look for patterns of what kinds of characters we have picked for other scripts, for example the script's letter "KA". We basically want a character where people say "that looks Greek", and the same shape should not be used in multiple scripts. So for Latin we use "L", not "A". We usually prefer consonants, if applicable, but it is more important that a character look unique across scripts. It does want to be a letter, and if possible should not be a combining mark. It would be nice if the letters were commonly used in the majority language, if there are multiple. Compare with the charts for existing scripts, especially related ones. +Editing the spreadsheet +Google Spreadsheet: Script Metadata +Use and copy cell formulas rather than duplicating contents, if possible. Look for which cells have formulas in existing data, especially for Unicode 1.1 and 7.0 scripts. +For example, +Script names should only be entered on the LikelyLanguage sheet. Other sheets should use a formula to map from the script code. +On the Samples sheet, use a formula to map from the code point to the actual character. This is especially important for avoiding mistakes since almost no one will have font support for the new scripts, which means that most people will see "Tofu" glyphs for the sample characters. +Script Metadata properties file +Go to the spreadsheet Script Metadata +File>Download as>Comma Separated Values +Location/Name = {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/Script_Metadata.csv +Refresh files (eclipse), then compare with previous version for sanity check. If there are no new scripts for target Unicode version of CLDR release you're working on, then skip the rest of steps below. For example, script "Toto" is ignore for CLDR 39 because target Unicode release of CLDR 39 is Unicode 13 and "Toto" will be added in Unicode 14. +Note: VM arguments +Each tool (and test) needs   -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src   (or wherever your repo root is) +It is easiest to set this once in the global Preferences, rather than in the Run Configuration for each tool. +Most of these tools also need   -DSCRIPT_UNICODE_VERSION=14   (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE_VERSION variable. 
+Run {cldr}/tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestScriptMetadata.java +A common error is if some of the data from the spreadsheet is missing, or has incorrect values. +Run GenerateScriptMetadata, which will produce a modified common/properties/scriptMetadata.txt file. +If this ignores the new scripts: Check the -DSCRIPT_UNICODE_VERSION or the ScriptMetadata.java UNICODE_VERSION. +Add the English script names (from the script metadata spreadsheet) to common/main/en.xml. +Add the French script names from ISO 15924 to common/main/fr.xml, but mark them as draft="provisional". +Add the script codes to common/supplemental/coverageLevels.xml (under key %script100) so that the new script names will show up in the CLDR survey tool. +See #8109#comment:4 r11491 +See changes for Unicode 10: http://unicode.org/cldr/trac/review/9882 +See changes for Unicode 12: CLDR-11478 commit/647ce01 +Maybe add the script codes to TestCoverageLevel.java variable script100. +Starting with cldr/pull/1296 we should not need to list a script here explicitly unless it is Identifier_Type=Recommended. +Remove new script codes from $scriptNonUnicode in common/supplemental/attributeValueValidity.xml if needed +For the following step to work as expected, the CLDR copy of the IANA BCP 47 language subtag registry must be updated (at least with the new script codes). +Copy the latest version of https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry to {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language-subtag-registry +Consider copying only the new script subtags (and making a note near the top of the CLDR file, or lines like "Comments: Unicode 14 script manually added 2021-06-01") to avoid having to update other parts of CLDR. +Run GenerateValidityXML.java like this: +See Update Validity XML +This needs the previous version of CLDR in a sibling folder. +see Creating the Archive for details on running the CheckoutArchive tool +Now run GenerateValidityXML.java +If this crashes with a NullPointerException trying to create a Validity object, check that ToolConstants.LAST_RELEASE_VERSION is set to the actual last release. +Currently, the CHART_VERSION must be a simple integer, no ".1" suffix. +At least script.xml should show the new scripts. The generator overwrites the source data file; use git diff or git difftool to make sure the new scripts have been added. +Run GenerateMaximalLocales, as described on the likelysubtags page, which generates another two files. +Compare the latest git master files with the generated ones:  meld  common/supplemental  ../Generated/cldr/supplemental +Copy likelySubtags.xml and supplementalMetadata.xml to the latest git master if they have changes. +Compare generated files with previous versions for sanity check. +Run the CLDR unit tests. +Project cldr-core: Debug As > Maven test +These tests have sometimes failed: +LikelySubtagsTest +TestInheritance +They may need special adjustments, for example in GenerateMaximalLocales.java adding an extra entry to its MAX_ADDITIONS or LANGUAGE_OVERRIDES. +Check in the updated files. +Problems are typically because a non-standard name is used for a territory name. That can be fixed and the process rerun. 
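+To re-run just one of the sometimes-failing tests from the command line, the TestShim invocation shown on Updating UN Codes can be reused with a different -filter value; a sketch (it assumes the filter accepts the unit-test class name, and that CLDR_DIR points at your checkout):
+mvn -DCLDR_DIR=$HOME/src/cldr -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:LikelySubtagsTest' --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim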
\ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt new file mode 100644 index 00000000000..bfa675a7534 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt @@ -0,0 +1,85 @@ +Updating Subdivision Codes +Main Process +Get the latest version of the iso subdivision xml file from https://www.iso.org/obp/ui/ (you'll need a password) and add it to a cldr-private directory: +Click on the XML button to download a zip, and unzip into folder iso_country_code_ALL_xml +Open iso_country_codes.xml in that folder. Find the generated line, eg +Add that date to the folder name, 2016-12-09_iso_country_code_ALL_xml +Post that folder into /cldr-private/external/iso_country_codes/ if not already there. +Copy the contents of the folder to {cldr-private}/iso_country_codes/iso_country_codes.xml also (overriding current contents. +Make sure that you have defined -DCLDR_PRIVATE_DATA="/cldr-private/" +Diff just to see what's new. +Actually, this step is too painful, because ISO doesn't have a canonical XML format. So elements of a table come in random order... Sometimes +AZ-ORD +AZ-SAD +And sometimes the reverse! +May add diffs generation to GenerateSubdivisions... +Run GenerateSubdivisions; it will create a number of files. The important ones are: +{generated}/subdivision/subdivisions.xml +{generated}/subdivision/subdivisionAliases.txt +{generated}/subdivision/en.xml +Diff {generated}subdivisions.xml and {workspace}/cldr/common/supplemental/subdivisions.xml +If they not different (other than date/version/revision), skip to Step 4. +Copy the generated contents into the cldr file, and save. +Make sure the added IDs make sense. +Verify that we NEVER remove an ID. See #8735. +An ID may be deprecated; in that case it should show up in subdivisionAliases.txt if there is a good substitute. +We may need to add a 4-letter code in case ISO messes up. +In either of these cases, change GenerateSubdivisions.java to do the right thing. +Save the Diffs, since they are useful for updating aliases. See example at end. +Open up {workspace}/cldr/common/supplemental/supplementalMetadata.xml +Search for +Replace the line after that up to the line before with the contents of subdivisionAliases.txt +Do a diff with the last release version. The new file should preserve the old aliases. +Note: there is a tool problem where some lines are duplicated. For now, check and fix them. +If a line is duplicated, when you run the tests they will show as errors. +Make sure the changes make sense. +IN PARTICULAR, make sure that NO former types (in uncommented lines) disappear!That is, restore any such lines before committing.) Put them below the line: + +(Ideally the tool would do that, but we're not quite there.) +Use the names to add more aliases. (See Fixing). Check https://www.iso.org/obp/ui/#iso:code:3166:TW (replacing TW by the country code) to see notes there. +Put en.xml into {workspace}/cldr/common/subdivisions/ +You'll overwrite the one there. The new one reuses all the old names where they exist. +Do a diff with the last release. +Make sure the added names (from ISO) are consistent. +Verify that we NEVER remove an ID. (The deprecated ones move down, but don't disappear). +Run the Update Validity XML steps to produce a new {workspace}/cldr/common/validity/subdivision.xml +Don't bother with the others, but diff and update that one. +A code may move to deprecated, but it should never disappear. 
If you find that, then revisit #4 (supplementalMetadata) above +Run the tests +You may get some collisions in English. Those need to be fixed. +Google various combinations like [country code ] to find articles like ISO_3166-2:UG, then make a fix. +Often a sub-subdivision has the same name as a subdivision. When that is the case add a qualifier to the lesser know one, like "City" or "District". +Sometimes a name will change in ISO to correct a mistake, which can cause a collision. +Fix the ?? in supplemental data (where possible; see below) +Fixing ?? +If there are not known new subdivisions that the old ones should map to, you'll see commented-out lines in supplementalMetadata like: + +As many of these as possible, see if there is a mapping to one or more new subdivisions. That is, where possible, track down the best code(s) to map all of these to, and uncomment the line, and move BELOW +Note that for the name comment, change + + + + + + +... +New data + + + + + + + +... +Exact matches +From this, we can see that items have been renamed. Easiest to add the type values and contains values to a spreadsheet (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. +Partial Matches +Rearrange the leftovers to see if there is any OLD => NEW1+NEW2... cases or OLD1 = NEW, OLD2=NEW cases. For example, for FR we get Q=>NOR and P=>NOR. Remember that these are "best fit", so there may be small discrepancies. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt b/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt new file mode 100644 index 00000000000..27afa54f9ae --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt @@ -0,0 +1,18 @@ +Updating Subdivision Translations +Make sure that that the subdivisions are updated first as per Updating Subdivision Codes +Make sure you have completed Maven Setup +Run tool WikiSubdivisionLanguages +mvn -DCLDR_DIR=________/cldr -Dexec.mainClass=org.unicode.cldr.tool.GenerateLanguageContainment exec:java -pl cldr-rdf +STEVEN LOOMIS 2022-0829 - this does not make sense here. +Sanity check result, run tests. +NOTES +Should only add values, never change what is there beforehand. +Currently excludes items: +That fail exemplar check (broad test, allows any letters in script). +Many of these are reparable, but need manual work. +Currently renames items that collide within country. +Uses superscript 2, 3 for alternates. More than 3 alternates, it excludes since there is probably a more serious problem. +Needs a couple more locales: zh_Hant, de_CH, fil not working yet. +The Language List is in the query file {workspace}cldr/tools/cldr-rdf/src/main/resources/org/unicode/cldr/rdf/sparql/wikidata-wikisubdivisionLanguages.sparql +Check in +Make sure you also check in {workspace}/cldr/tools/cldr-rdf/external/*.tsv ( intermediate tables, for tracking) \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt new file mode 100644 index 00000000000..2aa1579e95a --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt @@ -0,0 +1,22 @@ +Updating UN Codes +UM M19 +Open https://unstats.un.org/unsd/methodology/m49/overview/# +Hit the Copy button, to copy all the data to the clipboard +Open ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/UnCodes.txt +Hit paste. 
you should see tab-separated fields +Save +Note: "git diff --word-diff" is helpful for finding that, for example, only a column was added. +EU +Go to  https://europa.eu/european-union/about-eu/countries_en +Note: The instructions below don't work. Manually update +tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/EuCode.txt +(Old instructions:  do the same with https://europa.eu/european-union/about-eu/countries/member-countries_en, into util/data/external/eu_member_states_raw.txt  BROKEN LINK ) +Find the section "The XX member countries of the EU: (may be a link at the bottom or sidebar) +Copy and past into ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/EuCodes.txt +Compare with last revision; if there are differences, update containment. +If there are no real differences, do not bother updating EuCodes.txt +Note: "git diff --word-diff" is helpful for finding that, for example, only whitespace changed. +Record the latest version that's been synced as a meta-data//This is new (Aug 2020)! +Q: Not sure how or where to do this? +Run TestUnContainment +mvn -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:TestUnContainment'  --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim \ No newline at end of file diff --git a/docs/site/development/updating-codes/update-validity-xml.md b/docs/site/development/updating-codes/update-validity-xml.md new file mode 100644 index 00000000000..39eecb1e2ff --- /dev/null +++ b/docs/site/development/updating-codes/update-validity-xml.md @@ -0,0 +1,23 @@ +--- +title: Update Validity XML +--- + +# Update Validity XML + +1. Create the archive ([Creating the Archive](https://cldr.unicode.org/development/creating-the-archive)) with at least the last release (if you don't have it already) +2. Run GenerateValidityXML.java +3. This updates files in cldr/common/validity/. (If you set \-DSHOW\_FILES, you'll see this on the console.) + 1. New files should not be generated. If there are any, something has gone wrong, so raise this as an issue on cldr\-dev. **Note:** cldr/common/validity/currency.xml contains a comment line \- *\) of the form: + - \ + 5. Run the following (you must have all the archived versions loaded, back to cldr\-28\.0!) + 1. TestValidity \-e9 + 6. If they are ok, replace and checkin + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/updating-codes/updating-population-gdp-literacy.md b/docs/site/development/updating-codes/updating-population-gdp-literacy.md new file mode 100644 index 00000000000..fe5c0d674f3 --- /dev/null +++ b/docs/site/development/updating-codes/updating-population-gdp-literacy.md @@ -0,0 +1,108 @@ +--- +title: Updating Population, GDP, Literacy +--- + +# Updating Population, GDP, Literacy + +**Updated 2021\-02\-10 by Yoshito** + +Instructions are based on Chrome browser. + +## Load the World DataBank + +**The World DataBank is at (http://databank.worldbank.org/data/views/variableselection/selectvariables.aspx?source=world-development-indicators). Unfortunately, they keep changing the link. If the page has been moved, try to get to it by doing the following. Each of the links are what currently works, but that again may change.** + +1. Go to http://worldbank.org +2. Click "View More Data" in the Data section (http://data.worldbank.org/) +3. Click "Data Catalog" (http://datacatalog.worldbank.org/) +4. Search "World Development Indicators" (http://data.worldbank.org/data-catalog/world-development-indicators) +5. 
In "Data \& Resources" tab, click on the blue "Databank" link. It should open a new Window \- https://databank.worldbank.org/reports.aspx?source\=world\-development\-indicators + +Once you are there, generate a file by using the following steps. There are 3 collapsible sections, "Country", "Series", and "Time" + +- Countries + - Expand the "Country" section, click the "Countries" tab, and then click the "Select All" button on the left. You do NOT want the aggregates here, just the countries. There were 217 countries on the list when these instructions were written; if substantially more than that, you may have mistakenly included aggregates. +- Series + - Expand the "Series" section. + - Select "Population, total" + - Select "GNI, PPP (current international $)" +- Time + - Select all years starting at 2000 up to the latest available year. The latest as of this writing was "2021". Be careful here, because sometimes it will list a year as being available, but there will be no real data there, which messes up our tooling. + - The tooling will automatically handle new years. +- Click the "Download Options" link in the upper right. + - A small "Download options" box will appear. + - Select "CSV" + - Instruct your browser to the save the file. +- You will receive a ZIP file named "**Data\_Extract\_From\_World\_Development\_Indicators.zip**". + - Unpack this zip file. It will contain two files. + - (From a unix command line, you can unpack it with + - "unzip \-j \-a \-a **Data\_Extract\_From\_World\_Development\_Indicators.zip"** + - to junk subdirectories and force the file to LF line endings.) + - The larger file (126kb as of 2021\-02\-10\) contains the actual data we are interested in. The file name should be something like f17e18f5\-e161\-45a9\-b357\-cba778a279fd\_Data.csv + - The smaller file is just a field definitions file that we don't care about. +- Verify that the data file is of the form: + - Country Name,Country Code,Series Name,Series Code,2000 \[YR2000],2001 \[YR2001],2004 \[YR2004],... + - Afghanistan,AFG,"Population, total",SP.POP.TOTL,19701940,20531160,23499850,24399948,25183615,... + - Afghanistan,AFG,"GNI, PPP (current international $)",NY.GNP.MKTP.PP.CD,..,..,22134851020\.6294,25406550418\.3726,27761871367\.4836,32316545463\.8146,... + - Albania,ALB,"Population, total",SP.POP.TOTL,3089027,3060173,3026939,3011487,2992547,2970017,... + - ... +- Rename it to **world\_bank\_data.csv** and and save in {**cldr}/tools/cldr\-code/src/main/resources/org/****unicode****/cldr/util/data/external/** +- Diff the old version vs. the current. +- If the format changes, you'll have to modify WBLine in AddPopulationData.java to have the right order and contents. + +## Load UN Literacy Data + +1. Goto http://unstats.un.org/unsd/demographic/products/socind/default.htm +2. Click on "Education" +3. Click in "Table 4a \- Literacy" +4. Download data \- save as temporary file +5. Open in Excel, OpenOffice, or Numbers \- save as cldr/tools/java/org/unicode/cldr/util/data/external/un\_literacy.csv (Windows Comma Separated) + 1. If it has multiple sheets, you want the one that says "Data", and looks like: +6. Table 4a. Literacy +7. Last update: December 2012 +8. Country or area Year Adult (15\+) literacy rate Youth (15\-24\) literacy rate +9. Total Men Women Total Men Women +10. Albania 2008 96 97 95 99 99 99 +11. Diff the old version vs. the current. +12. If the format changes, you'll have to modify the loadUnLiteracy() method in **org/unicode/cldr/tool/AddPopulationData.java** +13. 
Note that the content does not seem to have changed since 2012, but the page says "*Please note this page is currently under revision*." + 1. If there is no change to the data (still no change 10 years later), there is no reason to commit a new version of the file. + 2. See also [CLDR\-15923](https://unicode-org.atlassian.net/browse/CLDR-15923) + +## Load CIA Factbook + +**Note:** Pages in original instruction were moved to below. These pages no longer provide text version compatible with files in CLDR. ([CLDR\-14470](https://unicode-org.atlassian.net/browse/CLDR-14470)) + +- Population: https://www.cia.gov/the-world-factbook/field/population +- Real GDP (purchasing power parity): https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity +1. All files are saved in **cldr/tools/java/org/unicode/cldr/util/data/external/** +2. Goto: https://www.cia.gov/library/publications/the-world-factbook/index.html +3. Goto the "References" tab, and click on "Guide to Country Comparisons" +4. Expand "People and Society" and click on "Population" \- + 1. There's a "download" icon in the right side of the header. Right click it, Save Link As... call it + 2. **factbook\_population.txt** + 3. **You may need to delete header lines. The first line should begin with "1 China … " or similar.** +5. Back up a page, then Expand "Economy" and click on "GDP (purchasing power parity)" + 1. Right Click on DownloadData, Save Link As... call it + 2. **factbook\_gdp\_ppp.txt** + 3. **You may need to delete header lines. The first line should begin with "1 China … " or similar.** +6. Literacy \- **No longer works, so we need to revise program \- They are still publishing updates to the data at this page, we just need to write some code to put the data into a form we can use, see** [**CLDR\-9756 (comment 4\)**](https://unicode-org.atlassian.net/browse/CLDR-9756?focusedCommentId=118608) + 1. ~~https://www.cia.gov/library/publications/the-world-factbook/fields/2103.html~~ maybe https://www.cia.gov/library/publications/the-world-factbook/fields/370.html ? + 2. ~~Right Click on "Download Data", Save Link As... Call it~~ + 3. ~~**factbook\_literacy.txt**~~ +7. Diff the old version vs. the current. +8. If the format changes, you'll have to modify the loadFactbookLiteracy()) method in **org/unicode/cldr/tool/AddPopulationData.java** + +## Convert the data + +1. If you saw any different country names above, you'll need to edit external/alternate\_country\_names.txt to add them. + 1. For example, we needed to add Czechia in 2016\. +2. Q: How would I know? + 1. If two\-letter non\-countries are added, then you'll need to adjust StandardCodes.isCountry. +3. Q: How would I know? + 1. Run "AddPopulationData *\-DADD\_POP*\=**true"** and look for errors. +4. **java \-jar \-DADD\_POP\=true \-DCLDR\_DIR\=${HOME}/src/cldr cldr.jar org.unicode.cldr.tool.AddPopulationData** +5. Once everything looks ok, check everything in to git. +6. 
Once done, then run the ConvertLanguageData tool as on [Update Language Script Info](https://cldr.unicode.org/development/updating-codes/update-language-script-info) + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/updating-codes/updating-script-metadata.md b/docs/site/development/updating-codes/updating-script-metadata.md new file mode 100644 index 00000000000..7ab8a0342e0 --- /dev/null +++ b/docs/site/development/updating-codes/updating-script-metadata.md @@ -0,0 +1,85 @@ +--- +title: Updating Script Metadata +--- + +# Updating Script Metadata + +### New Unicode scripts + +We should work on script metadata early for a Unicode version, so that it is available for tools (such as Mark's "UCA" tools). + +- Unicode 9/CLDR 29: New scripts in CLDR but not yet in ICU caused trouble. +- Unicode 10: Working on a pre\-CLDR\-31 branch, plan to merge into CLDR trunk after CLDR 31 is done. +- Should the script metadata code live in the Unicode Tools, so that we don't need a CLDR branch during early Unicode next\-version work? + +If the new Unicode version's PropertyValueAliases.txt does not have lines for Block and Script properties yet, then create a preliminary version. Diff the Blocks.txt file and UnicodeData.txt to find new scripts. Get the script codes from . Follow existing patterns for block and script names, especially for abbreviations. Do not add abbreviations (which differ from the long forms) unless there is a well\-established pattern in the existing data. + +Aside from instructions below for all script metadata changes, new script codes need English names (common/main/en.xml) and need to be added to common/supplemental/coverageLevels, under key %script100, so that the new script names will show up in the survey tool. For example, see the [changes for new Unicode 8 scripts](https://unicode-org.atlassian.net/browse/CLDR-8109). + +Can we add new scripts in CLDR *trunk* before or only after adding them to CLDR's copy of ICU4J? We did add new Unicode 9 scripts in CLDR 29 before adding them to ICU4J. The CLDR unit tests do not fail any more for scripts that are newer than the Unicode version in CLDR's copy of ICU. + +### Sample characters + +We need sample characters for the "UCA" tools for generating FractionalUCA.txt. + +Look for patterns of what kinds of characters we have picked for other scripts, for example the script's letter "KA". We basically want a character where people say "that looks Greek", and the same shape should not be used in multiple scripts. So for Latin we use "L", not "A". We usually prefer consonants, if applicable, but it is more important that a character look unique across scripts. It does want to be a *letter*, and if possible should not be a combining mark. It would be nice if the letters were commonly used in the majority language, if there are multiple. Compare with the [charts for existing scripts](http://www.unicode.org/charts/), especially related ones. + +### Editing the spreadsheet + +Google Spreadsheet: [Script Metadata](https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit#gid=0) + +Use and copy cell formulas rather than duplicating contents, if possible. Look for which cells have formulas in existing data, especially for Unicode 1\.1 and 7\.0 scripts. + +For example, + +- Script names should only be entered on the LikelyLanguage sheet. Other sheets should use a formula to map from the script code. 
+- On the Samples sheet, use a formula to map from the code point to the actual character. This is especially important for avoiding mistakes since almost no one will have font support for the new scripts, which means that most people will see "Tofu" glyphs for the sample characters. + +### Script Metadata properties file +1. Go to the spreadsheet [Script Metadata](https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit#gid=0) + 1. File\>Download as\>Comma Separated Values + 2. Location/Name \= {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/Script\_Metadata.csv + 3. Refresh files (eclipse), then compare with previous version for sanity check. If there are no new scripts for target Unicode version of CLDR release you're working on, then skip the rest of steps below. For example, script "Toto" is ignore for CLDR 39 because target Unicode release of CLDR 39 is Unicode 13 and "Toto" will be added in Unicode 14\. +2. **Note: VM arguments** + 1. Each tool (and test) needs   \-DCLDR\_DIR\=/usr/local/google/home/mscherer/cldr/uni/src   (or wherever your repo root is) + 2. It is easiest to set this once in the global Preferences, rather than in the Run Configuration for each tool. + 3. Most of these tools also need   \-DSCRIPT\_UNICODE\_VERSION\=14   (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE\_VERSION variable. + 4. Run {cldr}/tools/cldr\-code/src/test/java/org/unicode/cldr/unittest/TestScriptMetadata.java + 5. A common error is if some of the data from the spreadsheet is missing, or has incorrect values. +3. Run GenerateScriptMetadata, which will produce a modified [common/properties/scriptMetadata.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt) file. + 1. If this ignores the new scripts: Check the \-DSCRIPT\_UNICODE\_VERSION or the ScriptMetadata.java UNICODE\_VERSION. + 2. Add the English script names (from the script metadata spreadsheet) to common/main/en.xml. + 3. Add the French script names from [ISO 15924](https://www.unicode.org/iso15924/iso15924-codes.html) to common/main/fr.xml, but mark them as draft\="provisional". + 4. Add the script codes to common/supplemental/coverageLevels.xml (under key %script100\) so that the new script names will show up in the CLDR survey tool. + 1. See [\#8109\#comment:4](https://unicode-org.atlassian.net/browse/CLDR-8109#comment:4) [r11491](https://github.com/unicode-org/cldr/commit/1d6f2a4db84cc449983c7a01e5a2679dc1827598) + 2. See changes for Unicode 10: + 3. See changes for Unicode 12: [CLDR\-11478](https://unicode-org.atlassian.net/browse/CLDR-11478) [commit/647ce01](https://github.com/unicode-org/cldr/commit/be3000629ca3af2ae77de6304480abefe647ce01) + 5. Maybe add the script codes to TestCoverageLevel.java variable script100\. + 1. Starting with [cldr/pull/1296](https://github.com/unicode-org/cldr/pull/1296) we should not need to list a script here explicitly unless it is Identifier\_Type\=Recommended. + 6. Remove new script codes from $scriptNonUnicode in common/supplemental/attributeValueValidity.xml if needed + 7. For the following step to work as expected, the CLDR copy of the IANA BCP 47 language subtag registry must be updated (at least with the new script codes). + 1. Copy the latest version of https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry to {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/language\-subtag\-registry + 2. 
Consider copying only the new script subtags (and making a note near the top of the CLDR file, or lines like "Comments: Unicode 14 script manually added 2021\-06\-01") to avoid having to update other parts of CLDR. + 8. Run GenerateValidityXML.java like this: + 1. See [Update Validity XML](https://cldr.unicode.org/development/updating-codes/update-validity-xml) + 2. This needs the previous version of CLDR in a sibling folder. + 1. see [Creating the Archive](https://cldr.unicode.org/development/creating-the-archive) for details on running the CheckoutArchive tool + 3. Now run GenerateValidityXML.java + 4. If this crashes with a NullPointerException trying to create a Validity object, check that ToolConstants.LAST\_RELEASE\_VERSION is set to the actual last release. + 1. Currently, the CHART\_VERSION must be a simple integer, no ".1" suffix. + 9. At least script.xml should show the new scripts. The generator overwrites the source data file; use ```git diff``` or ```git difftool``` to make sure the new scripts have been added. + 10. Run GenerateMaximalLocales, [as described on the likelysubtags page](https://cldr.unicode.org/development/updating-codes/likelysubtags-and-default-content), which generates another two files. + 11. Compare the latest git master files with the generated ones:  meld  common/supplemental  ../Generated/cldr/supplemental + 1. Copy likelySubtags.xml and supplementalMetadata.xml to the latest git master if they have changes. + 12. Compare generated files with previous versions for sanity check. + 13. Run the CLDR unit tests. + 1. Project cldr\-core: Debug As \> Maven test + 14. These tests have sometimes failed: + 1. LikelySubtagsTest + 2. TestInheritance + 3. They may need special adjustments, for example in GenerateMaximalLocales.java adding an extra entry to its MAX\_ADDITIONS or LANGUAGE\_OVERRIDES. +4. Check in the updated files. + +Problems are typically because a non\-standard name is used for a territory name. That can be fixed and the process rerun. + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/updating-codes/updating-subdivision-codes.md b/docs/site/development/updating-codes/updating-subdivision-codes.md new file mode 100644 index 00000000000..b344b388c91 --- /dev/null +++ b/docs/site/development/updating-codes/updating-subdivision-codes.md @@ -0,0 +1,129 @@ +--- +title: Updating Subdivision Codes +--- + +# Updating Subdivision Codes + +## Main Process + +1. Get the latest version of the iso subdivision xml file from https://www.iso.org/obp/ui/ (you'll need a password) and add it to a cldr\-private directory: + 1. Click on the XML button to download a zip, and unzip into folder **iso\_country\_code\_ALL\_xml** + 2. Open **iso\_country\_codes.xml** in that folder. Find the generated line, eg \ + 3. Add that date to the folder name, **2016\-12\-09\_iso\_country\_code\_ALL\_xml** + 4. Post that folder into [/cldr\-private/external/iso\_country\_codes](https://goto.google.com/isocountrycodes)/ if not already there. + 5. Copy the contents of the folder to {cldr\-private}/iso\_country\_codes/iso\_country\_codes.xml also (overriding current contents. + 6. Make sure that you have defined \-DCLDR\_PRIVATE\_DATA\="\/cldr\-private/" + 7. ~~Diff just to see what's new.~~ + 1. Actually, this step is too painful, because ISO doesn't have a canonical XML format. So elements of a table come in random order... Sometimes + 1. \AZ\-ORD\ + 2. \AZ\-SAD\ + 2. And sometimes the reverse! + 3. 
May add diffs generation to GenerateSubdivisions... + 8. Run GenerateSubdivisions; it will create a number of files. The important ones are: + 9. {generated}/subdivision/subdivisions.xml + 10. {generated}/subdivision/subdivisionAliases.txt + 11. {generated}/subdivision/en.xml + 12. Diff {generated}**subdivisions.xml** and {workspace}/cldr/common/supplemental/**subdivisions.xml** + 1. If they not different (other than date/version/revision), skip to Step 4\. + 2. Copy the generated contents into the cldr file, and save. + 3. Make sure the added IDs make sense. + 4. Verify that we NEVER remove an ID. See [\#8735](http://unicode.org/cldr/trac/ticket/8735). + 1. An ID may be deprecated; in that case it should show up in **subdivisionAliases.txt** *if there is a good substitute.* + 2. We may need to add a 4\-letter code in case ISO messes up. + 3. In either of these cases, change GenerateSubdivisions.java to do the right thing. + 5. Save the Diffs, since they are useful for updating aliases. See example at end. + 13. Open up {workspace}/cldr/common/supplemental/**supplementalMetadata.xml** + 1. Search for \ + 2. Replace the line after that up to the line before \ with the contents of **subdivisionAliases.txt** + 3. Do a diff with the last release version. The new file should preserve the old aliases. + 1. *Note: there is a tool problem where some lines are duplicated. For now, check and fix them.* + 2. *If a line is duplicated, when you run the tests they will show as errors.* + 3. Make sure the changes make sense. + 4. ***IN PARTICULAR, make sure that NO former types (in*** ***uncommented*** ***lines) disappear!That is, restore any such lines before committing.) Put them below the line:*** + - \ + 5. ***(Ideally the tool would do that, but we're not quite there.)*** + 14. Use the names to add more aliases. (See Fixing). Check https://www.iso.org/obp/ui/#iso:code:3166:TW (replacing TW by the country code) to see notes there. +2. Put **en.xml** into {workspace}/cldr/common/subdivisions/ + 1. You'll overwrite the one there. The new one reuses all the old names where they exist. + 2. Do a diff with the last release. + 1. Make sure the added names (from ISO) are consistent. + 2. Verify that we NEVER remove an ID. (The deprecated ones move down, but don't disappear). +3. Run the [Update Validity XML](https://cldr.unicode.org/development/updating-codes/update-validity-xml) steps to produce a new {workspace}/cldr/common/validity/subdivision.xml + 1. Don't bother with the others, but diff and update that one. + 2. A code may move to deprecated, but it should never disappear. If you find that, then revisit \#4 (supplementalMetadata) above +4. Run the tests + 1. You may get some collisions in English. Those need to be fixed. + 2. Google various combinations like \[country code \ \] to find articles like[ISO\_3166\-2:UG](https://en.wikipedia.org/wiki/ISO_3166-2:UG), then make a fix. + 3. Often a sub\-subdivision has the same name as a subdivision. When that is the case add a qualifier to the lesser know one, like "City" or "District". + 4. Sometimes a name will change in ISO to correct a mistake, which can cause a collision. +5. Fix the ?? in supplemental data (where possible; see below) + +## Fixing ?? + +1. If there are not known new subdivisions that the old ones should map to, you'll see commented\-out lines in **supplementalMetadata** like: + - \ \ ?? \-\-\> +2. As many of these as possible, see if there is a mapping to one or more new subdivisions. 
That is, where possible, track down the best code(s) to map all of these to, and uncomment the line, and move BELOW \ + - Note that for the name comment, change \ + +**\** + +\ + +**\** + +\ + +\ + +\ + +... + +### New data + +\ + +\ + +\ + +**\** + +\ + +**\** + +\ + +... + +### Exact matches + +From this, we can see that items have been renamed. Easiest to add the type values and contains values to a[spreadsheet](https://docs.google.com/spreadsheets/d/1i3YAhD9ADP6d4j6p4s3lY0psNdlOuknBr4ZrX1mihCw/edit) (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. + +| Source | | old | new | contents | Mechanical | +|---|---|---|---|---|---| +| \ | FR | "H" | | "2A 2B" | \ | +| \ | FR | | "COR" | "2A 2B" | | + +### Partial Matches + +Rearrange the leftovers to see if there is any OLD \=\> NEW1\+NEW2\... cases or OLD1 \= NEW, OLD2\=NEW cases. For example, for FR we get Q\=\>NOR and P\=\>NOR. Remember that these are "best fit", so there may be small discrepancies. + +| Source | | old | new | contents | Mechanical | | +|---|---|---|---|---|---|---| +| Source | | old | new | contents | Mechanical | Fixed ?? cases | +| \ | FR | "Q" | | "27 76" | \ | \ | +| \ | FR | "P" | | "14 50 61" | \ | \ | +| \ | FR | | "NOR" | "14 27 50 61 76" | | | + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/updating-codes/updating-subdivision-translations.md b/docs/site/development/updating-codes/updating-subdivision-translations.md new file mode 100644 index 00000000000..d15d9718236 --- /dev/null +++ b/docs/site/development/updating-codes/updating-subdivision-translations.md @@ -0,0 +1,26 @@ +--- +title: Updating Subdivision Translations +--- + +# Updating Subdivision Translations + +1. Make sure that that the subdivisions are updated first as per [Updating Subdivision Codes](https://cldr.unicode.org/development/updating-codes/updating-subdivision-codes) +2. Make sure you have completed [Maven Setup](https://cldr.unicode.org/development/maven) +3. Run tool WikiSubdivisionLanguages +4. *~~mvn \-DCLDR\_DIR\=~~**~~**\_\_\_\_\_\_\_\_/cldr**~~* *~~\-Dexec.mainClass\=org.unicode.cldr.tool.GenerateLanguageContainment exec:java \-pl cldr\-rdf~~* + 1. STEVEN LOOMIS 2022\-0829 \- this does not make sense here. +5. Sanity check result, run tests. + +### NOTES +1. Should only add values, never change what is there beforehand. + 1. Currently excludes items: + 1. That fail exemplar check (broad test, allows any letters in script). + 2. Many of these are reparable, but need manual work. + 2. Currently renames items that collide *within country*. + 1. Uses superscript 2, 3 for alternates. More than 3 alternates, it excludes since there is probably a more serious problem. + 3. Needs a couple more locales: zh\_Hant, de\_CH, fil not working yet. + 4. The Language List is in the query file **{workspace}cldr/tools/cldr\-rdf/src/main/resources/org/unicode/cldr/rdf/sparql/wikidata\-wikisubdivisionLanguages.sparql** +2. Check in + 1. 
Make sure you also check in **{workspace}/cldr/tools/cldr\-rdf/external/\*.tsv** (intermediate tables, for tracking)
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/updating-un-codes.md b/docs/site/development/updating-codes/updating-un-codes.md
new file mode 100644
index 00000000000..112a29c398c
--- /dev/null
+++ b/docs/site/development/updating-codes/updating-un-codes.md
@@ -0,0 +1,30 @@
+---
+title: Updating UN Codes
+---
+
+# Updating UN Codes
+
+1. UN M49
+    1. Open https://unstats.un.org/unsd/methodology/m49/overview/
+    2. Hit the Copy button to copy all the data to the clipboard
+    3. Open ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/UnCodes.txt
+    4. Hit paste. You should see tab\-separated fields
+    5. Save
+2. Note: "git diff \-\-word\-diff" is helpful for finding that, for example, only a column was added.
+
+### EU
+1. Go to [https://europa.eu/european\-union/about\-eu/countries\_en](https://european-union.europa.eu/principles-countries-history/eu-countries_en)
+2. **Note: The instructions below don't work. Manually update tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/external/EuCode.txt**
+3. ~~(Old instructions: do the same with https://europa.eu/european\-union/about\-eu/countries/member\-countries\_en, into util/data/external/eu\_member\_states\_raw.txt BROKEN LINK)~~
+4. ~~Find the section "The XX member countries of the EU" (may be a link at the bottom or sidebar)~~
+5. ~~Copy and paste into ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/EuCodes.txt~~
+6. ~~Compare with last revision; if there are differences, update containment.~~
+    1. ~~If there are no real differences, do not bother updating EuCodes.txt~~
+    2. ~~Note: "git diff \-\-word\-diff" is helpful for finding that, for example, only whitespace changed.~~
+    3. ~~Record the latest version that's been synced as meta\-data // This is new (Aug 2020\)!~~
+    4. ~~Q: Not sure how or where to do this?~~
+
+### Run TestUnContainment
+1. 
```mvn \-Dorg.unicode.cldr.unittest.testArgs\='\-n \-q \-filter:TestUnContainment'  \-\-file\=tools/pom.xml \-pl cldr\-code test \-Dtest\=TestShim``` + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file From ce2e6943cb9d95a52b55e1e49ebc7b1b07c456dc Mon Sep 17 00:00:00 2001 From: Chris Pyle Date: Sun, 1 Sep 2024 18:35:52 -0400 Subject: [PATCH 2/3] CLDR-17566 text diffs and minor changes --- .../updating-population-gdp-literacy.txt | 18 +++++++++--------- .../updating-script-metadata.txt | 6 +++--- .../updating-subdivision-codes.txt | 10 +++++++++- .../site/TEMP-TEXT-FILES/updating-un-codes.txt | 11 +++++------ .../updating-subdivision-codes.md | 4 ++-- .../updating-subdivision-translations.md | 2 +- .../updating-codes/updating-un-codes.md | 2 +- 7 files changed, 30 insertions(+), 23 deletions(-) diff --git a/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt b/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt index c9c0cbc6eb6..a3f8ab424c9 100644 --- a/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt +++ b/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt @@ -8,22 +8,22 @@ Click "View More Data" in the Data section (http://data.worldbank.org/) Click "Data Catalog" (http://datacatalog.worldbank.org/) Search "World Development Indicators" (http://data.worldbank.org/data-catalog/world-development-indicators) In "Data & Resources" tab, click on the blue "Databank" link. It should open a new Window - https://databank.worldbank.org/reports.aspx?source=world-development-indicators -Once you are there, generate a file by using the following steps. There are 3 collapsible sections, "Country", "Series", and "Time" +Once you are there, generate a file by using the following steps. There are 3 collapsible sections, "Country", "Series", and "Time" Countries -Expand the "Country" section, click the "Countries" tab, and then click the "Select All" button on the left. You do NOT want the aggregates here, just the countries. There were 217 countries on the list when these instructions were written; if substantially more than that, you may have mistakenly included aggregates. +Expand the "Country" section, click the "Countries" tab, and then click the "Select All" button on the left. You do NOT want the aggregates here, just the countries. There were 217 countries on the list when these instructions were written; if substantially more than that, you may have mistakenly included aggregates. Series Expand the "Series" section. Select "Population, total" Select "GNI, PPP (current international $)" Time -Select all years starting at 2000 up to the latest available year. The latest as of this writing was "2021". Be careful here, because sometimes it will list a year as being available, but there will be no real data there, which messes up our tooling. +Select all years starting at 2000 up to the latest available year. The latest as of this writing was "2021". Be careful here, because sometimes it will list a year as being available, but there will be no real data there, which messes up our tooling. The tooling will automatically handle new years. Click the "Download Options" link in the upper right. A small "Download options" box will appear. Select "CSV" Instruct your browser to the save the file. You will receive a ZIP file named "Data_Extract_From_World_Development_Indicators.zip". -Unpack this zip file. It will contain two files. +Unpack this zip file. It will contain two files. 
(From a unix command line, you can unpack it with "unzip -j -a -a Data_Extract_From_World_Development_Indicators.zip" to junk subdirectories and force the file to LF line endings.) @@ -48,7 +48,7 @@ If it has multiple sheets, you want the one that says "Data", and looks like: Table 4a. Literacy Last update: December 2012 Country or area Year Adult (15+) literacy rate Youth (15-24) literacy rate -Total Men Women Total Men Women +Total Men Women Total Men Women Albania 2008 96 97 95 99 99 99 Diff the old version vs. the current. If the format changes, you'll have to modify the loadUnLiteracy() method in org/unicode/cldr/tool/AddPopulationData.java @@ -65,12 +65,12 @@ Goto the "References" tab, and click on "Guide to Country Comparisons" Expand "People and Society" and click on "Population" - There's a "download" icon in the right side of the header. Right click it, Save Link As... call it factbook_population.txt -You may need to delete header lines. The first line should begin with "1 China … " or similar. +You may need to delete header lines. The first line should begin with "1 China … " or similar. Back up a page, then Expand "Economy" and click on "GDP (purchasing power parity)" Right Click on DownloadData, Save Link As... call it factbook_gdp_ppp.txt -You may need to delete header lines. The first line should begin with "1 China … " or similar. -Literacy - No longer works, so we need to revise program - They are still publishing updates to the data at this page, we just need to write some code to put the data into a form we can use, see CLDR-9756 (comment 4) +You may need to delete header lines. The first line should begin with "1 China … " or similar. +Literacy - No longer works, so we need to revise program - They are still publishing updates to the data at this page, we just need to write some code to put the data into a form we can use, see CLDR-9756 (comment 4) https://www.cia.gov/library/publications/the-world-factbook/fields/2103.html maybe https://www.cia.gov/library/publications/the-world-factbook/fields/370.html ? Right Click on "Download Data", Save Link As... Call it factbook_literacy.txt @@ -83,6 +83,6 @@ Q: How would I know? If two-letter non-countries are added, then you'll need to adjust StandardCodes.isCountry. Q: How would I know? Run "AddPopulationData -DADD_POP=true" and look for errors. -java -jar -DADD_POP=true -DCLDR_DIR=${HOME}/src/cldr cldr.jar org.unicode.cldr.tool.AddPopulationData +java -jar -DADD_POP=true -DCLDR_DIR=${HOME}/src/cldr cldr.jar org.unicode.cldr.tool.AddPopulationData Once everything looks ok, check everything in to git. Once done, then run the ConvertLanguageData tool as on Update Language Script Info \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt b/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt index a1914fd7e19..1ad3689107e 100644 --- a/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt +++ b/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt @@ -22,9 +22,9 @@ File>Download as>Comma Separated Values Location/Name = {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/Script_Metadata.csv Refresh files (eclipse), then compare with previous version for sanity check. If there are no new scripts for target Unicode version of CLDR release you're working on, then skip the rest of steps below. For example, script "Toto" is ignore for CLDR 39 because target Unicode release of CLDR 39 is Unicode 13 and "Toto" will be added in Unicode 14. 
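A quick way to do the "compare with previous version" sanity check above is a plain git diff of the refreshed CSV against the committed copy (a sketch; run from the CLDR repository root):
git diff -- tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/Script_Metadata.csv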
Note: VM arguments -Each tool (and test) needs   -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src   (or wherever your repo root is) +Each tool (and test) needs -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src (or wherever your repo root is) It is easiest to set this once in the global Preferences, rather than in the Run Configuration for each tool. -Most of these tools also need   -DSCRIPT_UNICODE_VERSION=14   (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE_VERSION variable. +Most of these tools also need -DSCRIPT_UNICODE_VERSION=14 (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE_VERSION variable. Run {cldr}/tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestScriptMetadata.java A common error is if some of the data from the spreadsheet is missing, or has incorrect values. Run GenerateScriptMetadata, which will produce a modified common/properties/scriptMetadata.txt file. @@ -50,7 +50,7 @@ If this crashes with a NullPointerException trying to create a Validity object, Currently, the CHART_VERSION must be a simple integer, no ".1" suffix. At least script.xml should show the new scripts. The generator overwrites the source data file; use git diff or git difftool to make sure the new scripts have been added. Run GenerateMaximalLocales, as described on the likelysubtags page, which generates another two files. -Compare the latest git master files with the generated ones:  meld  common/supplemental  ../Generated/cldr/supplemental +Compare the latest git master files with the generated ones: meld common/supplemental ../Generated/cldr/supplemental Copy likelySubtags.xml and supplementalMetadata.xml to the latest git master if they have changes. Compare generated files with previous versions for sanity check. Run the CLDR unit tests. diff --git a/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt index bfa675a7534..cc66f3be64a 100644 --- a/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt +++ b/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt @@ -81,5 +81,13 @@ New data ... Exact matches From this, we can see that items have been renamed. Easiest to add the type values and contains values to a spreadsheet (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. +Source old new contents Mechanical + FR "H" "2A 2B" + FR "COR" "2A 2B" Partial Matches -Rearrange the leftovers to see if there is any OLD => NEW1+NEW2... cases or OLD1 = NEW, OLD2=NEW cases. For example, for FR we get Q=>NOR and P=>NOR. Remember that these are "best fit", so there may be small discrepancies. \ No newline at end of file +Rearrange the leftovers to see if there is any OLD => NEW1+NEW2... cases or OLD1 = NEW, OLD2=NEW cases. For example, for FR we get Q=>NOR and P=>NOR. Remember that these are "best fit", so there may be small discrepancies. +Source old new contents Mechanical +Source old new contents Mechanical Fixed ?? 
cases + FR "Q" "27 76" + FR "P" "14 50 61" + FR "NOR" "14 27 50 61 76" \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt index 2aa1579e95a..d16a161f28f 100644 --- a/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt +++ b/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt @@ -1,16 +1,15 @@ Updating UN Codes UM M19 -Open https://unstats.un.org/unsd/methodology/m49/overview/# +Open https://unstats.un.org/unsd/methodology/m49/overview/ Hit the Copy button, to copy all the data to the clipboard Open ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/UnCodes.txt Hit paste. you should see tab-separated fields Save Note: "git diff --word-diff" is helpful for finding that, for example, only a column was added. EU -Go to  https://europa.eu/european-union/about-eu/countries_en -Note: The instructions below don't work. Manually update -tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/EuCode.txt -(Old instructions:  do the same with https://europa.eu/european-union/about-eu/countries/member-countries_en, into util/data/external/eu_member_states_raw.txt  BROKEN LINK ) +Go to https://europa.eu/european-union/about-eu/countries_en +Note: The instructions below don't work. Manually update tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/EuCode.txt +(Old instructions: do the same with https://europa.eu/european-union/about-eu/countries/member-countries_en, into util/data/external/eu_member_states_raw.txt BROKEN LINK ) Find the section "The XX member countries of the EU: (may be a link at the bottom or sidebar) Copy and past into ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/EuCodes.txt Compare with last revision; if there are differences, update containment. @@ -19,4 +18,4 @@ Note: "git diff --word-diff" is helpful for finding that, for example, only whit Record the latest version that's been synced as a meta-data//This is new (Aug 2020)! Q: Not sure how or where to do this? Run TestUnContainment -mvn -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:TestUnContainment'  --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim \ No newline at end of file +mvn -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:TestUnContainment' --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim \ No newline at end of file diff --git a/docs/site/development/updating-codes/updating-subdivision-codes.md b/docs/site/development/updating-codes/updating-subdivision-codes.md index b344b388c91..0b8c1455939 100644 --- a/docs/site/development/updating-codes/updating-subdivision-codes.md +++ b/docs/site/development/updating-codes/updating-subdivision-codes.md @@ -53,7 +53,7 @@ title: Updating Subdivision Codes 2. A code may move to deprecated, but it should never disappear. If you find that, then revisit \#4 (supplementalMetadata) above 4. Run the tests 1. You may get some collisions in English. Those need to be fixed. - 2. Google various combinations like \[country code \ \] to find articles like[ISO\_3166\-2:UG](https://en.wikipedia.org/wiki/ISO_3166-2:UG), then make a fix. + 2. Google various combinations like \[country code \ \] to find articles like [ISO\_3166\-2:UG](https://en.wikipedia.org/wiki/ISO_3166-2:UG), then make a fix. 3. Often a sub\-subdivision has the same name as a subdivision. When that is the case add a qualifier to the lesser know one, like "City" or "District". 4. 
Sometimes a name will change in ISO to correct a mistake, which can cause a collision. 5. Fix the ?? in supplemental data (where possible; see below) @@ -108,7 +108,7 @@ title: Updating Subdivision Codes ### Exact matches -From this, we can see that items have been renamed. Easiest to add the type values and contains values to a[spreadsheet](https://docs.google.com/spreadsheets/d/1i3YAhD9ADP6d4j6p4s3lY0psNdlOuknBr4ZrX1mihCw/edit) (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. +From this, we can see that items have been renamed. Easiest to add the type values and contains values to a [spreadsheet](https://docs.google.com/spreadsheets/d/1i3YAhD9ADP6d4j6p4s3lY0psNdlOuknBr4ZrX1mihCw/edit) (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. | Source | | old | new | contents | Mechanical | |---|---|---|---|---|---| diff --git a/docs/site/development/updating-codes/updating-subdivision-translations.md b/docs/site/development/updating-codes/updating-subdivision-translations.md index d15d9718236..96cc8fdf813 100644 --- a/docs/site/development/updating-codes/updating-subdivision-translations.md +++ b/docs/site/development/updating-codes/updating-subdivision-translations.md @@ -7,7 +7,7 @@ title: Updating Subdivision Translations 1. Make sure that that the subdivisions are updated first as per [Updating Subdivision Codes](https://cldr.unicode.org/development/updating-codes/updating-subdivision-codes) 2. Make sure you have completed [Maven Setup](https://cldr.unicode.org/development/maven) 3. Run tool WikiSubdivisionLanguages -4. *~~mvn \-DCLDR\_DIR\=~~**~~**\_\_\_\_\_\_\_\_/cldr**~~* *~~\-Dexec.mainClass\=org.unicode.cldr.tool.GenerateLanguageContainment exec:java \-pl cldr\-rdf~~* +4. ~~mvn \-DCLDR\_DIR\=**\_\_\_\_\_\_\_\_/cldr**\-Dexec.mainClass\=org.unicode.cldr.tool.GenerateLanguageContainment exec:java \-pl cldr\-rdf~~ 1. STEVEN LOOMIS 2022\-0829 \- this does not make sense here. 5. Sanity check result, run tests. diff --git a/docs/site/development/updating-codes/updating-un-codes.md b/docs/site/development/updating-codes/updating-un-codes.md index 112a29c398c..a2aec71a012 100644 --- a/docs/site/development/updating-codes/updating-un-codes.md +++ b/docs/site/development/updating-codes/updating-un-codes.md @@ -25,6 +25,6 @@ title: Updating UN Codes 4. ~~Q: Not sure how or where to do this?~~ ### Run TestUnContainment -1. ```mvn \-Dorg.unicode.cldr.unittest.testArgs\='\-n \-q \-filter:TestUnContainment'  \-\-file\=tools/pom.xml \-pl cldr\-code test \-Dtest\=TestShim``` +1. 
```mvn -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:TestUnContainment' --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim``` ![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file From 69ab72fca3f19dc86087c5355b37251bdd5a6b7d Mon Sep 17 00:00:00 2001 From: Chris Pyle Date: Sun, 1 Sep 2024 18:36:22 -0400 Subject: [PATCH 3/3] CLDR-17566 deleting text files --- .../TEMP-TEXT-FILES/update-validity-xml.txt | 16 ---- .../updating-population-gdp-literacy.txt | 88 ------------------ .../updating-script-metadata.txt | 63 ------------- .../updating-subdivision-codes.txt | 93 ------------------- .../updating-subdivision-translations.txt | 18 ---- .../TEMP-TEXT-FILES/updating-un-codes.txt | 21 ----- 6 files changed, 299 deletions(-) delete mode 100644 docs/site/TEMP-TEXT-FILES/update-validity-xml.txt delete mode 100644 docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt delete mode 100644 docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt delete mode 100644 docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt delete mode 100644 docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt delete mode 100644 docs/site/TEMP-TEXT-FILES/updating-un-codes.txt diff --git a/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt b/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt deleted file mode 100644 index 91af54d4edb..00000000000 --- a/docs/site/TEMP-TEXT-FILES/update-validity-xml.txt +++ /dev/null @@ -1,16 +0,0 @@ -Update Validity XML -Create the archive (Creating the Archive) with at least the last release (if you don't have it already) -Run GenerateValidityXML.java -This updates files in cldr/common/validity/. (If you set -DSHOW_FILES, you'll see this on the console.) -New files should not be generated. If there are any, something has gone wrong, so raise this as an issue on cldr-dev. Note: cldr/common/validity/currency.xml contains a comment line - ) of the form: - -Run the following (you must have all the archived versions loaded, back to cldr-28.0!) -TestValidity -e9 -If they are ok, replace and checkin \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt b/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt deleted file mode 100644 index a3f8ab424c9..00000000000 --- a/docs/site/TEMP-TEXT-FILES/updating-population-gdp-literacy.txt +++ /dev/null @@ -1,88 +0,0 @@ -Updating Population, GDP, Literacy -Updated 2021-02-10 by Yoshito -Instructions are based on Chrome browser. -Load the World DataBank -The World DataBank is at (http://databank.worldbank.org/data/views/variableselection/selectvariables.aspx?source=world-development-indicators). Unfortunately, they keep changing the link. If the page has been moved, try to get to it by doing the following. Each of the links are what currently works, but that again may change. -Go to http://worldbank.org -Click "View More Data" in the Data section (http://data.worldbank.org/) -Click "Data Catalog" (http://datacatalog.worldbank.org/) -Search "World Development Indicators" (http://data.worldbank.org/data-catalog/world-development-indicators) -In "Data & Resources" tab, click on the blue "Databank" link. It should open a new Window - https://databank.worldbank.org/reports.aspx?source=world-development-indicators -Once you are there, generate a file by using the following steps. 
There are 3 collapsible sections, "Country", "Series", and "Time" -Countries -Expand the "Country" section, click the "Countries" tab, and then click the "Select All" button on the left. You do NOT want the aggregates here, just the countries. There were 217 countries on the list when these instructions were written; if substantially more than that, you may have mistakenly included aggregates. -Series -Expand the "Series" section. -Select "Population, total" -Select "GNI, PPP (current international $)" -Time -Select all years starting at 2000 up to the latest available year. The latest as of this writing was "2021". Be careful here, because sometimes it will list a year as being available, but there will be no real data there, which messes up our tooling. -The tooling will automatically handle new years. -Click the "Download Options" link in the upper right. -A small "Download options" box will appear. -Select "CSV" -Instruct your browser to the save the file. -You will receive a ZIP file named "Data_Extract_From_World_Development_Indicators.zip". -Unpack this zip file. It will contain two files. -(From a unix command line, you can unpack it with -"unzip -j -a -a Data_Extract_From_World_Development_Indicators.zip" -to junk subdirectories and force the file to LF line endings.) -The larger file (126kb as of 2021-02-10) contains the actual data we are interested in. The file name should be something like f17e18f5-e161-45a9-b357-cba778a279fd_Data.csv -The smaller file is just a field definitions file that we don't care about. -Verify that the data file is of the form: -Country Name,Country Code,Series Name,Series Code,2000 [YR2000],2001 [YR2001],2004 [YR2004],... -Afghanistan,AFG,"Population, total",SP.POP.TOTL,19701940,20531160,23499850,24399948,25183615,... -Afghanistan,AFG,"GNI, PPP (current international $)",NY.GNP.MKTP.PP.CD,..,..,22134851020.6294,25406550418.3726,27761871367.4836,32316545463.8146,... -Albania,ALB,"Population, total",SP.POP.TOTL,3089027,3060173,3026939,3011487,2992547,2970017,... -... -Rename it to world_bank_data.csv and and save in {cldr}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/ -Diff the old version vs. the current. -If the format changes, you'll have to modify WBLine in AddPopulationData.java to have the right order and contents. -Load UN Literacy Data -Goto http://unstats.un.org/unsd/demographic/products/socind/default.htm -Click on "Education" -Click in "Table 4a - Literacy" -Download data - save as temporary file -Open in Excel, OpenOffice, or Numbers - save as cldr/tools/java/org/unicode/cldr/util/data/external/un_literacy.csv (Windows Comma Separated) -If it has multiple sheets, you want the one that says "Data", and looks like: -Table 4a. Literacy -Last update: December 2012 -Country or area Year Adult (15+) literacy rate Youth (15-24) literacy rate -Total Men Women Total Men Women -Albania 2008 96 97 95 99 99 99 -Diff the old version vs. the current. -If the format changes, you'll have to modify the loadUnLiteracy() method in org/unicode/cldr/tool/AddPopulationData.java -Note that the content does not seem to have changed since 2012, but the page says "Please note this page is currently under revision." -If there is no change to the data (still no change 10 years later), there is no reason to commit a new version of the file. -See also CLDR-15923 -Load CIA Factbook -Note: Pages in original instruction were moved to below. These pages no longer provide text version compatible with files in CLDR. 
(CLDR-14470) -Population: https://www.cia.gov/the-world-factbook/field/population -Real GDP (purchasing power parity): https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity -All files are saved in cldr/tools/java/org/unicode/cldr/util/data/external/ -Goto: https://www.cia.gov/library/publications/the-world-factbook/index.html -Goto the "References" tab, and click on "Guide to Country Comparisons" -Expand "People and Society" and click on "Population" - -There's a "download" icon in the right side of the header. Right click it, Save Link As... call it -factbook_population.txt -You may need to delete header lines. The first line should begin with "1 China … " or similar. -Back up a page, then Expand "Economy" and click on "GDP (purchasing power parity)" -Right Click on DownloadData, Save Link As... call it -factbook_gdp_ppp.txt -You may need to delete header lines. The first line should begin with "1 China … " or similar. -Literacy - No longer works, so we need to revise program - They are still publishing updates to the data at this page, we just need to write some code to put the data into a form we can use, see CLDR-9756 (comment 4) -https://www.cia.gov/library/publications/the-world-factbook/fields/2103.html maybe https://www.cia.gov/library/publications/the-world-factbook/fields/370.html ? -Right Click on "Download Data", Save Link As... Call it -factbook_literacy.txt -Diff the old version vs. the current. -If the format changes, you'll have to modify the loadFactbookLiteracy()) method in org/unicode/cldr/tool/AddPopulationData.java -Convert the data -If you saw any different country names above, you'll need to edit external/alternate_country_names.txt to add them. -For example, we needed to add Czechia in 2016. -Q: How would I know? -If two-letter non-countries are added, then you'll need to adjust StandardCodes.isCountry. -Q: How would I know? -Run "AddPopulationData -DADD_POP=true" and look for errors. -java -jar -DADD_POP=true -DCLDR_DIR=${HOME}/src/cldr cldr.jar org.unicode.cldr.tool.AddPopulationData -Once everything looks ok, check everything in to git. -Once done, then run the ConvertLanguageData tool as on Update Language Script Info \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt b/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt deleted file mode 100644 index 1ad3689107e..00000000000 --- a/docs/site/TEMP-TEXT-FILES/updating-script-metadata.txt +++ /dev/null @@ -1,63 +0,0 @@ -Updating Script Metadata -New Unicode scripts -We should work on script metadata early for a Unicode version, so that it is available for tools (such as Mark's "UCA" tools). -Unicode 9/CLDR 29: New scripts in CLDR but not yet in ICU caused trouble. -Unicode 10: Working on a pre-CLDR-31 branch, plan to merge into CLDR trunk after CLDR 31 is done. -Should the script metadata code live in the Unicode Tools, so that we don't need a CLDR branch during early Unicode next-version work? -If the new Unicode version's PropertyValueAliases.txt does not have lines for Block and Script properties yet, then create a preliminary version. Diff the Blocks.txt file and UnicodeData.txt to find new scripts. Get the script codes from http://www.unicode.org/iso15924/codelists.html . Follow existing patterns for block and script names, especially for abbreviations. Do not add abbreviations (which differ from the long forms) unless there is a well-established pattern in the existing data. 
-Aside from instructions below for all script metadata changes, new script codes need English names (common/main/en.xml) and need to be added to common/supplemental/coverageLevels, under key %script100, so that the new script names will show up in the survey tool. For example, see the changes for new Unicode 8 scripts. -Can we add new scripts in CLDR trunk before or only after adding them to CLDR's copy of ICU4J? We did add new Unicode 9 scripts in CLDR 29 before adding them to ICU4J. The CLDR unit tests do not fail any more for scripts that are newer than the Unicode version in CLDR's copy of ICU. -Sample characters -We need sample characters for the "UCA" tools for generating FractionalUCA.txt. -Look for patterns of what kinds of characters we have picked for other scripts, for example the script's letter "KA". We basically want a character where people say "that looks Greek", and the same shape should not be used in multiple scripts. So for Latin we use "L", not "A". We usually prefer consonants, if applicable, but it is more important that a character look unique across scripts. It does want to be a letter, and if possible should not be a combining mark. It would be nice if the letters were commonly used in the majority language, if there are multiple. Compare with the charts for existing scripts, especially related ones. -Editing the spreadsheet -Google Spreadsheet: Script Metadata -Use and copy cell formulas rather than duplicating contents, if possible. Look for which cells have formulas in existing data, especially for Unicode 1.1 and 7.0 scripts. -For example, -Script names should only be entered on the LikelyLanguage sheet. Other sheets should use a formula to map from the script code. -On the Samples sheet, use a formula to map from the code point to the actual character. This is especially important for avoiding mistakes since almost no one will have font support for the new scripts, which means that most people will see "Tofu" glyphs for the sample characters. -Script Metadata properties file -Go to the spreadsheet Script Metadata -File>Download as>Comma Separated Values -Location/Name = {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/Script_Metadata.csv -Refresh files (eclipse), then compare with previous version for sanity check. If there are no new scripts for target Unicode version of CLDR release you're working on, then skip the rest of steps below. For example, script "Toto" is ignore for CLDR 39 because target Unicode release of CLDR 39 is Unicode 13 and "Toto" will be added in Unicode 14. -Note: VM arguments -Each tool (and test) needs -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src (or wherever your repo root is) -It is easiest to set this once in the global Preferences, rather than in the Run Configuration for each tool. -Most of these tools also need -DSCRIPT_UNICODE_VERSION=14 (set to the upcoming Unicode version), but it is easier to edit the ScriptMetadata.java line that sets the UNICODE_VERSION variable. -Run {cldr}/tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestScriptMetadata.java -A common error is if some of the data from the spreadsheet is missing, or has incorrect values. -Run GenerateScriptMetadata, which will produce a modified common/properties/scriptMetadata.txt file. -If this ignores the new scripts: Check the -DSCRIPT_UNICODE_VERSION or the ScriptMetadata.java UNICODE_VERSION. -Add the English script names (from the script metadata spreadsheet) to common/main/en.xml. 
-Add the French script names from ISO 15924 to common/main/fr.xml, but mark them as draft="provisional". -Add the script codes to common/supplemental/coverageLevels.xml (under key %script100) so that the new script names will show up in the CLDR survey tool. -See #8109#comment:4 r11491 -See changes for Unicode 10: http://unicode.org/cldr/trac/review/9882 -See changes for Unicode 12: CLDR-11478 commit/647ce01 -Maybe add the script codes to TestCoverageLevel.java variable script100. -Starting with cldr/pull/1296 we should not need to list a script here explicitly unless it is Identifier_Type=Recommended. -Remove new script codes from $scriptNonUnicode in common/supplemental/attributeValueValidity.xml if needed -For the following step to work as expected, the CLDR copy of the IANA BCP 47 language subtag registry must be updated (at least with the new script codes). -Copy the latest version of https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry to {CLDR}/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language-subtag-registry -Consider copying only the new script subtags (and making a note near the top of the CLDR file, or lines like "Comments: Unicode 14 script manually added 2021-06-01") to avoid having to update other parts of CLDR. -Run GenerateValidityXML.java like this: -See Update Validity XML -This needs the previous version of CLDR in a sibling folder. -see Creating the Archive for details on running the CheckoutArchive tool -Now run GenerateValidityXML.java -If this crashes with a NullPointerException trying to create a Validity object, check that ToolConstants.LAST_RELEASE_VERSION is set to the actual last release. -Currently, the CHART_VERSION must be a simple integer, no ".1" suffix. -At least script.xml should show the new scripts. The generator overwrites the source data file; use git diff or git difftool to make sure the new scripts have been added. -Run GenerateMaximalLocales, as described on the likelysubtags page, which generates another two files. -Compare the latest git master files with the generated ones: meld common/supplemental ../Generated/cldr/supplemental -Copy likelySubtags.xml and supplementalMetadata.xml to the latest git master if they have changes. -Compare generated files with previous versions for sanity check. -Run the CLDR unit tests. -Project cldr-core: Debug As > Maven test -These tests have sometimes failed: -LikelySubtagsTest -TestInheritance -They may need special adjustments, for example in GenerateMaximalLocales.java adding an extra entry to its MAX_ADDITIONS or LANGUAGE_OVERRIDES. -Check in the updated files. -Problems are typically because a non-standard name is used for a territory name. That can be fixed and the process rerun. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt deleted file mode 100644 index cc66f3be64a..00000000000 --- a/docs/site/TEMP-TEXT-FILES/updating-subdivision-codes.txt +++ /dev/null @@ -1,93 +0,0 @@ -Updating Subdivision Codes -Main Process -Get the latest version of the iso subdivision xml file from https://www.iso.org/obp/ui/ (you'll need a password) and add it to a cldr-private directory: -Click on the XML button to download a zip, and unzip into folder iso_country_code_ALL_xml -Open iso_country_codes.xml in that folder. 
Find the generated line, eg -Add that date to the folder name, 2016-12-09_iso_country_code_ALL_xml -Post that folder into /cldr-private/external/iso_country_codes/ if not already there. -Copy the contents of the folder to {cldr-private}/iso_country_codes/iso_country_codes.xml also (overriding current contents. -Make sure that you have defined -DCLDR_PRIVATE_DATA="/cldr-private/" -Diff just to see what's new. -Actually, this step is too painful, because ISO doesn't have a canonical XML format. So elements of a table come in random order... Sometimes -AZ-ORD -AZ-SAD -And sometimes the reverse! -May add diffs generation to GenerateSubdivisions... -Run GenerateSubdivisions; it will create a number of files. The important ones are: -{generated}/subdivision/subdivisions.xml -{generated}/subdivision/subdivisionAliases.txt -{generated}/subdivision/en.xml -Diff {generated}subdivisions.xml and {workspace}/cldr/common/supplemental/subdivisions.xml -If they not different (other than date/version/revision), skip to Step 4. -Copy the generated contents into the cldr file, and save. -Make sure the added IDs make sense. -Verify that we NEVER remove an ID. See #8735. -An ID may be deprecated; in that case it should show up in subdivisionAliases.txt if there is a good substitute. -We may need to add a 4-letter code in case ISO messes up. -In either of these cases, change GenerateSubdivisions.java to do the right thing. -Save the Diffs, since they are useful for updating aliases. See example at end. -Open up {workspace}/cldr/common/supplemental/supplementalMetadata.xml -Search for -Replace the line after that up to the line before with the contents of subdivisionAliases.txt -Do a diff with the last release version. The new file should preserve the old aliases. -Note: there is a tool problem where some lines are duplicated. For now, check and fix them. -If a line is duplicated, when you run the tests they will show as errors. -Make sure the changes make sense. -IN PARTICULAR, make sure that NO former types (in uncommented lines) disappear!That is, restore any such lines before committing.) Put them below the line: - -(Ideally the tool would do that, but we're not quite there.) -Use the names to add more aliases. (See Fixing). Check https://www.iso.org/obp/ui/#iso:code:3166:TW (replacing TW by the country code) to see notes there. -Put en.xml into {workspace}/cldr/common/subdivisions/ -You'll overwrite the one there. The new one reuses all the old names where they exist. -Do a diff with the last release. -Make sure the added names (from ISO) are consistent. -Verify that we NEVER remove an ID. (The deprecated ones move down, but don't disappear). -Run the Update Validity XML steps to produce a new {workspace}/cldr/common/validity/subdivision.xml -Don't bother with the others, but diff and update that one. -A code may move to deprecated, but it should never disappear. If you find that, then revisit #4 (supplementalMetadata) above -Run the tests -You may get some collisions in English. Those need to be fixed. -Google various combinations like [country code ] to find articles like ISO_3166-2:UG, then make a fix. -Often a sub-subdivision has the same name as a subdivision. When that is the case add a qualifier to the lesser know one, like "City" or "District". -Sometimes a name will change in ISO to correct a mistake, which can cause a collision. -Fix the ?? in supplemental data (where possible; see below) -Fixing ?? 
-If there are not known new subdivisions that the old ones should map to, you'll see commented-out lines in supplementalMetadata like: - -As many of these as possible, see if there is a mapping to one or more new subdivisions. That is, where possible, track down the best code(s) to map all of these to, and uncomment the line, and move BELOW -Note that for the name comment, change - - - - - - -... -New data - - - - - - - -... -Exact matches -From this, we can see that items have been renamed. Easiest to add the type values and contains values to a spreadsheet (use regex to extract), marking with old/new. Then sort, and pick out the ones that match. -Source old new contents Mechanical - FR "H" "2A 2B" - FR "COR" "2A 2B" -Partial Matches -Rearrange the leftovers to see if there is any OLD => NEW1+NEW2... cases or OLD1 = NEW, OLD2=NEW cases. For example, for FR we get Q=>NOR and P=>NOR. Remember that these are "best fit", so there may be small discrepancies. -Source old new contents Mechanical -Source old new contents Mechanical Fixed ?? cases - FR "Q" "27 76" - FR "P" "14 50 61" - FR "NOR" "14 27 50 61 76" \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt b/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt deleted file mode 100644 index 27afa54f9ae..00000000000 --- a/docs/site/TEMP-TEXT-FILES/updating-subdivision-translations.txt +++ /dev/null @@ -1,18 +0,0 @@ -Updating Subdivision Translations -Make sure that that the subdivisions are updated first as per Updating Subdivision Codes -Make sure you have completed Maven Setup -Run tool WikiSubdivisionLanguages -mvn -DCLDR_DIR=________/cldr -Dexec.mainClass=org.unicode.cldr.tool.GenerateLanguageContainment exec:java -pl cldr-rdf -STEVEN LOOMIS 2022-0829 - this does not make sense here. -Sanity check result, run tests. -NOTES -Should only add values, never change what is there beforehand. -Currently excludes items: -That fail exemplar check (broad test, allows any letters in script). -Many of these are reparable, but need manual work. -Currently renames items that collide within country. -Uses superscript 2, 3 for alternates. More than 3 alternates, it excludes since there is probably a more serious problem. -Needs a couple more locales: zh_Hant, de_CH, fil not working yet. -The Language List is in the query file {workspace}cldr/tools/cldr-rdf/src/main/resources/org/unicode/cldr/rdf/sparql/wikidata-wikisubdivisionLanguages.sparql -Check in -Make sure you also check in {workspace}/cldr/tools/cldr-rdf/external/*.tsv ( intermediate tables, for tracking) \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt b/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt deleted file mode 100644 index d16a161f28f..00000000000 --- a/docs/site/TEMP-TEXT-FILES/updating-un-codes.txt +++ /dev/null @@ -1,21 +0,0 @@ -Updating UN Codes -UM M19 -Open https://unstats.un.org/unsd/methodology/m49/overview/ -Hit the Copy button, to copy all the data to the clipboard -Open ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/UnCodes.txt -Hit paste. you should see tab-separated fields -Save -Note: "git diff --word-diff" is helpful for finding that, for example, only a column was added. -EU -Go to https://europa.eu/european-union/about-eu/countries_en -Note: The instructions below don't work. 
Manually update tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/EuCode.txt -(Old instructions: do the same with https://europa.eu/european-union/about-eu/countries/member-countries_en, into util/data/external/eu_member_states_raw.txt BROKEN LINK ) -Find the section "The XX member countries of the EU: (may be a link at the bottom or sidebar) -Copy and past into ...workspace/cldr/tools/java/org/unicode/cldr/util/data/external/EuCodes.txt -Compare with last revision; if there are differences, update containment. -If there are no real differences, do not bother updating EuCodes.txt -Note: "git diff --word-diff" is helpful for finding that, for example, only whitespace changed. -Record the latest version that's been synced as a meta-data//This is new (Aug 2020)! -Q: Not sure how or where to do this? -Run TestUnContainment -mvn -Dorg.unicode.cldr.unittest.testArgs='-n -q -filter:TestUnContainment' --file=tools/pom.xml -pl cldr-code test -Dtest=TestShim \ No newline at end of file