CLDR-18108 Give all languages a primary script: trivial cases
This change adds "primary" scripts to many languages in language_script.tsv.

This won't change likely subtags; rather, it future-proofs our data by recognizing a single primary script, avoiding cases where ambiguity served customers the wrong script.

I also added the missing scripts for languages listed in country_language_population.tsv.
conradarcturus committed Nov 20, 2024
1 parent c0001a2 commit 3cb6b86
Showing 4 changed files with 90 additions and 63 deletions.
14 changes: 14 additions & 0 deletions common/supplemental/supplementalData.xml
@@ -1315,7 +1315,9 @@ XXX Code for transations where no currency is involved
<language type="ann" scripts="Latn"/>
<language type="anp" scripts="Deva"/>
<language type="aoz" scripts="Latn"/>
<language type="apc" scripts="Arab"/>
<language type="apc" territories="IL JO LB PS SY TR" alt="secondary"/>
<language type="apd" scripts="Arab"/>
<language type="apd" territories="SD" alt="secondary"/>
<language type="ar" scripts="Arab" territories="AE BH DJ DZ EG EH ER IL IQ JO KM KW LB LY MA MR OM PS QA SA SD SO SY TD TN YE"/>
<language type="ar" scripts="Syrc" territories="IR SS" alt="secondary"/>
@@ -1397,6 +1399,7 @@ XXX Code for transations where no currency is involved
<language type="bjj" territories="IN" alt="secondary"/>
<language type="bjn" scripts="Latn"/>
<language type="bjn" territories="ID" alt="secondary"/>
<language type="bjt" scripts="Latn"/>
<language type="bjt" territories="SN" alt="secondary"/>
<language type="bkm" scripts="Latn"/>
<language type="bku" scripts="Latn"/>
@@ -1422,6 +1425,7 @@ XXX Code for transations where no currency is involved
<language type="brx" scripts="Deva"/>
<language type="brx" territories="IN" alt="secondary"/>
<language type="bs" scripts="Cyrl Latn" territories="BA"/>
<language type="bsc" scripts="Latn"/>
<language type="bsc" territories="SN" alt="secondary"/>
<language type="bss" scripts="Latn"/>
<language type="bto" scripts="Latn"/>
@@ -1445,6 +1449,7 @@ XXX Code for transations where no currency is involved
<language type="cay" scripts="Latn"/>
<language type="cch" scripts="Latn"/>
<language type="ccp" scripts="Beng Cakm"/>
<language type="ccr" scripts="Latn" alt="secondary"/>
<language type="ce" scripts="Cyrl"/>
<language type="ce" territories="RU" alt="secondary"/>
<language type="ceb" scripts="Latn"/>
@@ -1712,6 +1717,7 @@ XXX Code for transations where no currency is involved
<language type="ilo" territories="PH" alt="secondary"/>
<language type="inh" scripts="Cyrl"/>
<language type="inh" scripts="Arab Latn" territories="RU" alt="secondary"/>
<language type="io" scripts="Latn" alt="secondary"/>
<language type="is" scripts="Latn" territories="IS"/>
<language type="it" scripts="Latn" territories="CH IT SM VA"/>
<language type="it" territories="DE FR HR MT US" alt="secondary"/>
@@ -1721,6 +1727,7 @@ XXX Code for transations where no currency is involved
<language type="ja" scripts="Jpan" territories="JP"/>
<language type="jam" scripts="Latn"/>
<language type="jam" territories="JM" alt="secondary"/>
<language type="jbo" scripts="Latn" alt="secondary"/>
<language type="jgo" scripts="Latn"/>
<language type="jmc" scripts="Latn"/>
<language type="jml" scripts="Deva"/>
@@ -1749,6 +1756,7 @@ XXX Code for transations where no currency is involved
<language type="kdt" scripts="Thai"/>
<language type="kea" scripts="Latn"/>
<language type="kea" territories="CV" alt="secondary"/>
<language type="ken" scripts="Latn"/>
<language type="kfo" scripts="Latn"/>
<language type="kfr" scripts="Deva"/>
<language type="kfr" territories="IN" alt="secondary"/>
@@ -1786,6 +1794,7 @@ XXX Code for transations where no currency is involved
<language type="kmb" territories="AO" alt="secondary"/>
<language type="kn" scripts="Knda"/>
<language type="kn" territories="IN" alt="secondary"/>
<language type="knf" scripts="Latn"/>
<language type="knf" territories="SN" alt="secondary"/>
<language type="knn" scripts="Deva"/>
<language type="knn" territories="IN" alt="secondary"/>
@@ -1845,6 +1854,7 @@ XXX Code for transations where no currency is involved
<language type="lbe" territories="RU" alt="secondary"/>
<language type="lbw" scripts="Latn"/>
<language type="lcp" scripts="Thai"/>
<language type="len" scripts="Latn" alt="secondary"/>
<language type="lep" scripts="Lepc"/>
<language type="lez" scripts="Cyrl"/>
<language type="lez" scripts="Aghb" territories="RU" alt="secondary"/>
@@ -1925,6 +1935,7 @@ XXX Code for transations where no currency is involved
<language type="mfa" territories="TH" alt="secondary"/>
<language type="mfe" scripts="Latn"/>
<language type="mfe" territories="MU" alt="secondary"/>
<language type="mfv" scripts="Latn"/>
<language type="mfv" territories="SN" alt="secondary"/>
<language type="mg" scripts="Latn" territories="MG"/>
<language type="mgh" scripts="Latn"/>
@@ -2087,6 +2098,7 @@ XXX Code for transations where no currency is involved
<language type="pnt" scripts="Cyrl Latn" alt="secondary"/>
<language type="pon" scripts="Latn"/>
<language type="pon" territories="FM" alt="secondary"/>
<language type="ppl" scripts="Latn"/>
<language type="pqm" scripts="Latn"/>
<language type="prd" scripts="Arab"/>
<language type="prg" scripts="Latn" alt="secondary"/>
@@ -2206,6 +2218,7 @@ XXX Code for transations where no currency is involved
<language type="sms" scripts="Latn"/>
<language type="sms" territories="FI" alt="secondary"/>
<language type="sn" scripts="Latn" territories="ZW"/>
<language type="snf" scripts="Latn"/>
<language type="snf" territories="SN" alt="secondary"/>
<language type="snk" scripts="Latn"/>
<language type="snk" territories="ML" alt="secondary"/>
@@ -2294,6 +2307,7 @@ XXX Code for transations where no currency is involved
<language type="tmh" territories="NE" alt="secondary"/>
<language type="tn" scripts="Latn" territories="BW"/>
<language type="tn" territories="ZA" alt="secondary"/>
<language type="tnr" scripts="Latn"/>
<language type="tnr" territories="SN" alt="secondary"/>
<language type="to" scripts="Latn" territories="TO"/>
<language type="tog" scripts="Latn"/>
@@ -4,19 +4,16 @@ title: Language Script Description

# Language Script Description

The language\_script spreadsheet should list all of the language / script combinations that are in common modern use. The countries are not important, since their function has been overtaken by the country\_language\_population spreadsheet.
The [`language_script.tsv`](https://github.com/unicode-org/cldr/blob/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language_script.tsv) data file should list all of the language / script combinations that are in common use. Usage by country is indicated in the [`country_language_population.tsv`](https://github.com/unicode-org/cldr/blob/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv) spreadsheet.

1. If the language and script are both modern, and the script is a major way to write the language in some country, then we should see that line marked as **primary**.
2. Otherwise it should be marked **secondary**.

Every language that is in official use in any country according to country\_language\_population  should have at least one primary script in the language\_script spreadsheet.
1. Every language needs at least 1 script considered the **primary** script.
    1. This data is used to determine [the most Likely language and region](likelysubtags-and-default-content) so there needs to be at least 1 primary value.
    2. [Changed in v47] Include a primary script for historical languages (e.g., Ancient Greek, Coptic). The primary script should reflect where the majority of the written corpus originates.
2. Languages written by significant populations with different scripts in different countries can have multiple **primary** scripts. The [likely subtags](https://www.unicode.org/cldr/charts/latest/supplemental/likely_subtags.html) patterns will use population counts to disambiguate the default script for each locale.
3. Other scripts used for a language should be marked **secondary**.
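
The primary/secondary rules in this list can be spot-checked against the generated XML. The following is a minimal Python sketch, not part of the CLDR tooling: it assumes that a `<language>` element without `alt="secondary"` carries primary data (as the `supplementalData.xml` entries in this commit suggest), and the `<languageData>` wrapper and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the <language> elements in supplementalData.xml.
# The <languageData> wrapper is assumed for this sketch.
SAMPLE = """
<languageData>
  <language type="apc" scripts="Arab"/>
  <language type="apc" territories="IL JO LB PS SY TR" alt="secondary"/>
  <language type="bs" scripts="Cyrl Latn" territories="BA"/>
  <language type="ccr" scripts="Latn" alt="secondary"/>
  <language type="inh" scripts="Cyrl"/>
  <language type="inh" scripts="Arab Latn" territories="RU" alt="secondary"/>
</languageData>
"""


def scripts_by_status(xml_text):
    """Return {language: {"primary": set_of_scripts, "secondary": set_of_scripts}}."""
    result = {}
    for elem in ET.fromstring(xml_text).iter("language"):
        entry = result.setdefault(
            elem.get("type"), {"primary": set(), "secondary": set()}
        )
        # Assumption: alt="secondary" marks secondary data; anything else is primary.
        status = "secondary" if elem.get("alt") == "secondary" else "primary"
        entry[status].update(elem.get("scripts", "").split())
    return result


def languages_without_primary(xml_text):
    """List the languages that have no primary script at all."""
    return sorted(
        lang
        for lang, scripts in scripts_by_status(xml_text).items()
        if not scripts["primary"]
    )


print(languages_without_primary(SAMPLE))
```

In this sample only `ccr` is reported, because its single entry is marked `alt="secondary"`; `bs` is an example of a language with two primary scripts.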

If a language has multiple primary scripts, then it should not appear without the script tag in the country\_language\_population.tsv. For example, we should not see "az", but rather "az\_Cyrl", "az\_Latn", and so on. For each country where the language is used, we should see figures on the script\-specific values. The values may overlap, that is, we may see az\_Cyrl at 60% and az\_Latn at 55%. However, the combination with the predominantly used script **must** have a larger figure than the others.
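
The two constraints in this paragraph could be checked mechanically. The sketch below is illustrative only: the row shape `(territory, locale, percent)`, the placeholder territory names `T1` to `T3`, and the function name are all hypothetical and do not mirror the real column layout of `country_language_population.tsv`. It flags a multi-script language that appears without a script tag, and a territory where the top two script-specific figures tie, so that no single script has the larger figure.

```python
from collections import defaultdict


def check_population_rows(rows, multi_primary_languages):
    """Check hypothetical (territory, locale, percent) rows against the rules above."""
    problems = []
    figures = defaultdict(dict)  # (territory, language) -> {script: percent}
    for territory, locale, percent in rows:
        language, _, script = locale.partition("_")
        if language not in multi_primary_languages:
            continue  # single-script languages may appear without a script tag
        if not script:
            problems.append(f"{territory}: '{language}' needs a script tag")
        else:
            figures[(territory, language)][script] = percent
    for (territory, language), by_script in figures.items():
        ranked = sorted(by_script.values(), reverse=True)
        # The predominant script must have a strictly larger figure than the rest.
        if len(ranked) > 1 and ranked[0] == ranked[1]:
            problems.append(
                f"{territory}: '{language}' has no single predominant script"
            )
    return problems


# Made-up rows for illustration; T1..T3 are placeholder territories.
rows = [
    ("T1", "az_Cyrl", 60),
    ("T1", "az_Latn", 55),  # overlapping figures are allowed
    ("T2", "az", 20),       # bare code for a multi-script language
    ("T3", "az_Cyrl", 40),
    ("T3", "az_Latn", 40),  # tie: no predominant script
    ("T3", "ru", 90),       # not a multi-script language, ignored
]
problems = check_population_rows(rows, {"az"})
for problem in problems:
    print(problem)
```

The set of multi-script languages is passed in explicitly so the check does not have to guess which languages have more than one primary script.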

This is also reflected in CLDR main: languages with multiple scripts will have that reflected in their structure (eg sr\-Cyrl\-RS), with aliases for the language\-region combinations.

Files in https://github.com/unicode-org/cldr/tree/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data

1. country\_language\_population.tsv
2. language\_script.tsv

To regenerate the XML data, use ConvertLanguageData, as described in [the article about updating the language scripts](.../update-language-script-info.md).