CLDR-17566 md and initial txt files
chpy04 committed Sep 2, 2024
1 parent eb4b003 commit 5c22b1b
Showing 10 changed files with 1,379 additions and 0 deletions.
20 changes: 20 additions & 0 deletions docs/site/TEMP-TEXT-FILES/core-data-for-new-locales.txt
@@ -0,0 +1,20 @@
Core Data for New Locales
This document describes the minimal data needed for a new locale. There are two kinds of data that are relevant for new locales:
Core Data - This is data that the CLDR committee needs from the proposer before a new locale is added. The proposer is also expected to get a Survey Tool account and contribute towards the Basic Data.
Basic Data - The Core Data is just the first step; a locale is only created under the expectation that people will engage in supplying data at the Basic Coverage Level. If the locale does not meet the Basic Coverage Level in the next Survey Tool cycle, the committee may remove the locale.
Core Data
Collect and submit the following data, using the Core Data Submission Form. Note to translators: if you have difficulties or questions about the following data, please contact us: file a new bug, or post a follow-up comment to your existing bug.
The correct language code according to Picking the Right Language Identifier.
The four exemplar sets: main, auxiliary, numbers, punctuation.
These must reflect the Unicode model. For more information, see tr35-general.html#Character_Elements.
Verified country data (i.e., the population of speakers in the regions (countries) in which the language is commonly used)
There must be at least one country, and enough others should be included to cover approximately 75% or more of the users of the language.
"Users of the language" includes those who use it as either a first or second language. The main focus is on written language.
Default content script and region (normally the region is the country with the largest population using that language, and the script is the one customarily used for that language in that country).
[supplemental/supplementalMetadata.xml]
See: http://cldr.unicode.org/translation/translation-guide-general/default-content
The correct time cycle used with the language in the default content region
In common/supplemental/supplementalData.xml, this is the "timeData" element
The value should be h (1-12), H (0-23), k (1-24), or K (0-11), as defined in https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
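As an illustration only (not part of the submission form), the Python sketch below shows how the preferred time cycle for a region could be read out of supplementalData.xml; it assumes the hours elements under timeData carry preferred and regions attributes as in the current file, and the path is a placeholder for a local CLDR checkout.

import xml.etree.ElementTree as ET

def preferred_hour_cycle(supplemental_path, region):
    # Each <hours> element under <timeData> gives a preferred cycle
    # (h, H, k, or K) and the space-separated regions it applies to.
    root = ET.parse(supplemental_path).getroot()
    time_data = root.find("timeData")
    if time_data is None:
        return None
    for hours in time_data.findall("hours"):
        if region in hours.get("regions", "").split():
            return hours.get("preferred")
    return None

# Example (path is a placeholder):
# preferred_hour_cycle("common/supplemental/supplementalData.xml", "US")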
When requesting a new locale, you must commit to supplying the data required for it to reach the Basic level during the next open CLDR submission cycle.
For more information on the other coverage levels, refer to Coverage Levels.
76 changes: 76 additions & 0 deletions docs/site/TEMP-TEXT-FILES/coverage-levels.txt
@@ -0,0 +1,76 @@
Coverage Levels
There are four main coverage levels, as defined in UTS #35: Unicode Locale Data Markup Language (LDML) Part 6: Supplemental: 8 Coverage Levels. They are described more fully below.
Usage
You can use the file common/properties/coverageLevels.txt (added in v41) for a given release to filter the locales that you support. For example, see coverageLevels.txt. (This and other links to data files are to the development versions; see the specific version for the release you are working with.) For a detailed chart of the coverage levels, see the locale_coverage.html file for the respective release.
The file format is semicolon-delimited, with three fields per line.
Locale ID ; Coverage Level ; Name
Each locale ID also covers all the locales that inherit from it, so to get the locales at a desired coverage level or above, use the following process:
1. Always include the root locale file, root.xml
2. Include all of the locale files listed in coverageLevels.txt at that level or above.
3. Recursively include all other files that inherit from the files in #2.
Warning: Inheritance is not simple truncation; the parentLocale information in supplementalData.xml needs to be applied also. See Parent_Locales.
For example, if you include fr.xml in #2, you would also include fr_CA.xml; if you include no.xml in #2 you would also include nn.xml.
Filtering
To filter "at that level or above", you use the fact that basic ⊂ moderate ⊂ modern, so
to filter for basic and above, filter for basic|moderate|modern
to filter for moderate and above, filter for moderate|modern
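For example, here is a minimal Python sketch of this filtering step. It assumes the semicolon-delimited format described above and skips blank and "#" comment lines (an assumption about the file's header); adding root.xml and the inheriting locales from step 3 is left out.

def locales_at_or_above(path, wanted=("moderate", "modern")):
    # Returns the locale IDs listed in coverageLevels.txt at the wanted levels,
    # using the level names as they appear in the file.
    selected = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments (assumption)
            if not line:
                continue
            locale_id, level, _name = (field.strip() for field in line.split(";"))
            if level in wanted:
                selected.add(locale_id)
    return selected

# "Moderate and above" = moderate|modern. Remember to add root.xml and then
# recursively add the locales that inherit (including via parentLocale) from
# the selected IDs, as described in steps 1-3 above.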
Migration
As of v43, the files in /seed/ have been moved to /common/. Older versions of CLDR separated some locale files into a 'seed' directory, and some implementations used that split for filtering, but the criteria for moving from seed to common were not rigorous. To maintain compatibility with the set of locales used from previous versions, an implementation may use the above process for Basic and above, but then also add locales that were previously included. For more information, see the CLDR 43 Release Note.
Core Data
The data needed for a new locale to be added. See Core Data for New Locales for details on Core Data and how to submit for new locales.
It is expected that during the next Survey Tool cycle after a new locale is added, the data for the Basic Coverage Level will be supplied.
Basic Data
Suitable for locale selection and minimal support, e.g., the choice of language on a mobile phone
This includes very minimal data to support the language: basic dates, times, autonyms, and so on:
Delimiter Data — Quotation start/end, including alternates
Numbering system — default numbering system + native numbering system (if default = Latin and native ≠ Latin)
Locale Pattern Info — Locale pattern and separator, and code pattern
Language Names — in the native language for the native language and for English
Script Name(s) — Scripts customarily used to write the language
Country Name(s) — For the countries where the language is commonly used (see "Core XML Data")
Measurement System — metric vs UK vs US
Full Month and Day of Week names
AM/PM period names
Date and Time formats
Date/Time interval patterns — fallback
Timezone baseline formats — region, gmt, gmt-zero, hour, fallback
Number symbols — decimal and grouping separators; plus, minus, percent sign (for Latin number system, plus native if different)
Number patterns — decimal, currency, percent, scientific
Moderate Data
Suitable for “document content” internationalization, e.g., content in a spreadsheet
Before submitting data above the Basic Level, the following must be in place:
Plural and Ordinal rules
As in [supplemental/plurals.xml] and [supplemental/ordinals.xml]
Must also include minimal pairs
For more information, see cldr-spec/plural-rules.
Casing information (only where the language uses a cased script according to ScriptMetadata.txt)
This will go into common/casing
Collation rules [non-Survey Tool]
This can be supplied as a list of characters, or as a rule file.
The list is a space-delimited list of the characters used by the language (in the given script). The list may include multiple-character strings, where those are treated specially. For example, if "ch" is sorted after "h", one might see "a b c d .. g h ch i j ..."
More sophisticated users can do a better job, supplying a file of rules as in cldr-spec/collation-guidelines.
The result will be a file like: common/collation/ar.xml or common/collation/da.xml.
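As a sketch of what such a tailoring does in practice (assuming PyICU is available; the single rule below is illustrative, not any real locale's data):

import icu  # PyICU

# Sort the contraction "ch" as a separate letter immediately after "h".
collator = icu.RuleBasedCollator("&h < ch")
words = ["chico", "cosa", "hora", "idea"]
print(sorted(words, key=collator.getSortKey))
# -> ['cosa', 'hora', 'chico', 'idea']: "chico" now sorts after "hora".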
The data for the Moderate Level includes subsets of the Modern data, both in depth and breadth.
Modern Data
Suitable for full UI internationalization
Before submitting data above the Moderate Level, the following must be in place:
Grammatical Features
The grammatical cases and other information, as in supplemental/grammaticalFeatures.xml
Must include minimal pair values.
Romanization table (non-Latin scripts only)
This can be supplied as a spreadsheet or as a rule file.
If a spreadsheet, it should give, for each letter (or sequence) in the exemplars, the corresponding Latin letter (or sequence).
More sophisticated users can do a better job, supplying a file of rules like transforms/Arabic-Latin-BGN.xml.
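As an illustration of the rule-file approach (assuming PyICU; the two rules below are placeholders, not an actual BGN table):

import icu  # PyICU

# Two illustrative forward rules: Arabic beh -> b, teh -> t.
rules = "\u0628 > b ; \u062A > t ;"
translit = icu.Transliterator.createFromRules(
    "demo-Arabic-Latin", rules, icu.UTransDirection.FORWARD)
print(translit.transliterate("\u0628\u062A"))  # -> "bt"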
The data for the Modern Level includes:
### TBD
References
For the coverage in the latest released version of CLDR, see Locale Coverage Chart.
To see the development version of the rules used to determine coverage, see coverageLevels.xml. For a list of the locales at a given level, see coverageLevels.txt.
53 changes: 53 additions & 0 deletions docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt
@@ -0,0 +1,53 @@
Picking the Right Language Identifier
Within programs and structured data, languages are indicated with stable identifiers of the form en, fr-CA, or zh-Hant. The standard Unicode language identifiers follow IETF BCP 47, with some small differences defined in UTS #35: Locale Data Markup Language (LDML). Locale identifiers use the same format, with certain possible extensions.
Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah' and the formal name "Lahnda". There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry. Moreover, a language identifier uses not only the 'base' language code, like 'en' for English or 'ku' for Kurdish, but also certain modifiers such as en-CA for Canadian English, or ku-Latn for Kurdish written in Latin script. Each of these modifiers is called a subtag (or sometimes a code), and they are separated by "-" or "_". The language identifier itself is also called a language tag, and sometimes a language code.
Here is an example of the steps to take to find the right language identifier to use. Let's say you want to find the identifier for a language called "Ganda", which you know is spoken in Uganda. You'll first pick the base language subtag as described below, then add any necessary script/territory subtags, and then verify. If you can't find the name after following these steps, or have other questions, ask on the Unicode CLDR Mailing List.
If you are looking at a prospective language code, like "swh", the process is similar; follow the steps below, starting with the verification.
Choosing the Base Language Code
Go to iso639-3 to find the language. Typically you'll look under Name starting with G for Ganda.
There may be multiple entries for the item you want, so you'll need to look at all of them. For example, on the page for names starting with “P”, there are three records: “Panjabi”, “Mirpur Panjabi” and “Western Panjabi” (it is the last of these that corresponds to Lahnda). You can also try a search, but be careful.
You'll find an entry for the language, showing its three-letter code (for Ganda, lug).
While you may think that you are done, you have to verify that the three-letter code is correct.
Click on the "more..." in this case and you'll find id=lug. You can also use the URL http://www.sil.org/iso639-3/documentation.asp?id=XXX, where you replace XXX by the three-letter code.
Click on "See corresponding entry in Ethnologue." and you get to code=lug
Verify that it is indeed the language:
Look at the information on the ethnologue page
Check Wikipedia and other web sources
AND IMPORTANTLY: Review Caution! below
Once you have the right three-letter code, you are still not done. Unicode (and BCP 47) uses the two-letter ISO code if one exists. Unicode also uses the "macrolanguage" where suitable. So:
Use the two-letter code if there is one. In the example above, the two-letter code for Ganda is "lg".
Verify that the code is in http://www.iana.org/assignments/language-subtag-registry
If the code occurs in http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalMetadata.xml in the type attribute of a languageAlias element, then use the replacement instead.
For example, because "swh" occurs in <languageAlias type="swh" replacement="sw"/>, "sw" must be used instead of "swh".
Choosing Script/Territory Subtags
If you need a particular variant of a language, then you'll add additional subtags, typically script or territory. Consult Sample Subtags for the most common choices. Again, review Caution! below.
Verifying Your Choice
Verify your choice by using the online language identifier demo.
You need to fix the identifier and try again if the demo shows any of the following:
the language identifier is illegal, or
one of the subtags is invalid, or
there are any replacement values.
Documenting Your Choice
If you are requesting a new locale / language in CLDR, please include the links to the particular pages above so that we can process your request more quickly, as we have to double-check before any addition. The links will be of the form:
http://www.sil.org/iso639-3/documentation.asp?id=xxx
http://www.ethnologue.com/show_language.asp?code=xxx
http://en.wikipedia.org/wiki/Western_Punjabi
and so on
Caution!
Canonical Form
Unicode language and locale IDs are based on BCP 47, but differ in a few ways. The canonical form is produced by applying the canonicalization based on BCP 47 (thus changing iw → he, and zh-yue → yue), plus a few other steps:
Replacing the most prominent encompassed subtag by the macrolanguage (cmn → zh)
Canonicalizing overlong 3 letter codes (eng-840 → en-US)
Minimizing according to the likely subtag data (ru-Cyrl → ru, en-US → en).
BCP 47 also provides for "variant subtags", such as zh-Latn-pinyin. When there are multiple variant subtags, the canonical format for Unicode language identifiers puts them in alphabetical order.
Note that the CLDR likely subtag data is used to minimize scripts and regions, not the IANA Suppress-Script. The latter had a much more constrained design goal, and is more limited.
In some cases, systems (or companies) may have different conventions than the Preferred-Values in BCP 47 -- such as those in the Replacement column in the online language identifier demo. For example, for backwards compatibility, "iw" is used with Java instead of "he" (Hebrew). When picking the right subtags, be aware of these compatibility issues. If a target system uses a different canonical form for locale IDs than CLDR, the CLDR data needs to be processed by remapping its IDs to the target system's.
For compatibility, it is strongly recommended that all implementations accept both the preferred values and their alternates: for example, both "iw" and "he". Similarly, although BCP 47 itself only allows "-" as a separator, Unicode language identifiers allow both "-" and "_" for compatibility, and implementations should accept both.
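A minimal sketch of that acceptance logic, using a hand-picked subset of legacy replacements for illustration (the full set is in the languageAlias data mentioned earlier):

# Illustrative subset of legacy codes and their preferred values.
LEGACY = {"iw": "he", "in": "id", "ji": "yi"}

def accept(tag):
    # Accept "-" or "_" as separators, and legacy or preferred language subtags.
    subtags = tag.replace("_", "-").split("-")
    subtags[0] = LEGACY.get(subtags[0].lower(), subtags[0].lower())
    return "-".join(subtags)

# accept("iw_IL") -> "he-IL"; accept("he-IL") -> "he-IL"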
Macrolanguages
ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that, because that is in the north, you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 639 Deprecation Requests.
Unicode language identifiers do not allow the "extlang" form defined in BCP 47. For example, use "yue" instead of "zh-yue" for Cantonese.
Ethnologue
When searching, such as site:ethnologue.com ganda, be sure to completely disregard matches in Ethnologue 14 -- these are out of date, and do not have the right codes!
The Ethnologue is a great source of information, but it must be approached with a certain degree of caution. Many of the population figures are far out of date, or not well substantiated. The Ethnologue also focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and on fluent speakers (not necessarily native speakers). So, for example, it would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language subtag for the Arabic used in Egypt is "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which is the one used for document and UI translation.
Wikipedia
Wikipedia is also a great source of information, but it must be approached with a certain degree of caution as well. Be sure to follow up on references, not just look at articles.