diff --git a/docs/site/TEMP-TEXT-FILES/core-data-for-new-locales.txt b/docs/site/TEMP-TEXT-FILES/core-data-for-new-locales.txt new file mode 100644 index 00000000000..5a373515029 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/core-data-for-new-locales.txt @@ -0,0 +1,20 @@ +Core Data for New Locales +This document describes the minimal data needed for a new locale. There are two kinds of data that are relevant for new locales: +Core Data - This is data that the CLDR committee needs from the proposer before a new locale is added. The proposer is expected to also get a Survey Tool account, and contribute towards the Basic Data. +Basic Data - The Core data is just the first step. It is only created under the expectation that people will engage in supplying data, at a Basic Coverage Level. If the locale does not meet the Basic Coverage Level in the next Survey Tool cycle, the committee may remove the locale. +Core Data +Collect and submit the following data, using the Core Data Submission Form. Note to translators: If you have difficulties or questions about the following data, please contact us: file a new bug, or post a follow-up comment to your existing bug. +The correct language code according to Picking the Right Language Identifier. +The four exemplar sets: main, auxiliary, numbers, punctuation. +These must reflect the Unicode model. For more information, see tr35-general.html#Character_Elements. +Verified country data (i.e. the population of speakers in the regions (countries) in which the language is commonly used) +There must be at least one country, but the data should include enough others that they cover approximately 75% or more of the users of the language. +"Users of the language" includes those who use it as either a 1st or 2nd language. The main focus is on written language. +Default content script and region (normally the region is the country with largest population using that language, and the customary script used for that language in that country). 
+[supplemental/supplementalMetadata.xml] +See: http://cldr.unicode.org/translation/translation-guide-general/default-content +The correct time cycle used with the language in the default content region +In common/supplemental/supplementalData.xml, this is the "timeData" element +The value should be h (1-12), H (0-23), k (1-24), or K (0-11); as defined in https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table +You must commit to supplying the data required for the new locale to reach Basic level during the next open CLDR submission when requesting a new locale to be added. +For more information on the other coverage levels refer to Coverage Levels \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/coverage-levels.txt b/docs/site/TEMP-TEXT-FILES/coverage-levels.txt new file mode 100644 index 00000000000..ccaf9331d75 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/coverage-levels.txt @@ -0,0 +1,76 @@ +Coverage Levels +There are four main coverage levels as defined in the UTS #35: Unicode Locale Data Markup Language (LDML) Part 6: Supplemental: 8 Coverage Levels. They are described more fully below. +Usage +You can use the file common/properties/coverageLevels.txt (added in v41) for a given release to filter the locales that they support. For example, see coverageLevels.txt. (This and other links to data files are to the development versions; see the specific version for the release you are working with.) For a detailed chart of the coverage levels, see the locale_coverage.html file for the respective release. +The file format is semicolon delimited, with 3 fields per line. +Locale ID ; Coverage Level ; Name +Each locale ID also covers all the locales that inherit from it. So to get locales at a desired coverage level or above, the following process is used. +Always include the root locale file, root.xml +Include all of the locale files listed in coverageLevels.txt at that level or above. 
+Recursively include all other files that inherit from the files in #2. +Warning: Inheritance is not simple truncation; the parentLocale information in supplementalData.xml needs to be applied also. See Parent_Locales. +For example, if you include fr.xml in #2, you would also include fr_CA.xml; if you include no.xml in #2 you would also include nn.xml. +Filtering +To filter "at that level or above", you use the fact that basic ⊂ moderate ⊂ modern, so +to filter for basic and above, filter for basic|moderate|modern +to filter for moderate and above, filter for moderate|modern +Migration +As of v43, the files in /seed/ have been moved to /common/. Older versions of CLDR separated some locale files into a 'seed' directory. Some implementations used this separation for filtering, but the criteria for moving from seed to common were not rigorous. To maintain compatibility with the set of locales used from previous versions, an implementation may use the above process for Basic and above, but then also add locales that were previously included. For more information, see CLDR 43 Release Note. +Usage +Filtering +Migration +Core Data +Basic Data +Moderate Data +Modern Data +References +Core Data +The data needed for a new locale to be added. See Core Data for New Locales for details on Core Data and how to submit for new locales. +It is expected that during the next Survey Tool cycle after a new locale is added, the data for the Basic Coverage Level will be supplied. +Basic Data +Suitable for locale selection and minimal support, e.g. 
choice of language on mobile phone +This includes very minimal data for support of the language: basic dates, times, autonyms: +Delimiter Data — Quotation start/end, including alternates +Numbering system — default numbering system + native numbering system (if default = Latin and native ≠ Latin) +Locale Pattern Info — Locale pattern and separator, and code pattern +Language Names — in the native language for the native language and for English +Script Name(s) — Scripts customarily used to write the language +Country Name(s) — For countries where commonly used (see "Core XML Data") +Measurement System — metric vs UK vs US +Full Month and Day of Week names +AM/PM period names +Date and Time formats +Date/Time interval patterns — fallback +Timezone baseline formats — region, gmt, gmt-zero, hour, fallback +Number symbols — decimal and grouping separators; plus, minus, percent sign (for Latin number system, plus native if different) +Number patterns — decimal, currency, percent, scientific +Moderate Data +Suitable for “document content” internationalization, e.g. content in a spreadsheet +Before submitting data above the Basic Level, the following must be in place: +Plural and Ordinal rules +As in [supplemental/plurals.xml] and [supplemental/ordinals.xml] +Must also include minimal pairs +For more information, see cldr-spec/plural-rules. +Casing information (only where the language uses a cased script according to ScriptMetadata.txt) +This will go into common/casing +Collation rules [non-Survey Tool] +This can be supplied as a list of characters, or as a rule file. +The list is a space-delimited list of the characters used by the language (in the given script). The list may include multiple-character strings, where those are treated specially. For example, if "ch" is sorted after "h" one might see "a b c d .. g h ch i j ..." +More sophisticated users can do a better job, supplying a file of rules as in cldr-spec/collation-guidelines. 
+The result will be a file like: common/collation/ar.xml or common/collation/da.xml. +The data for the Moderate Level includes subsets of the Modern data, both in depth and breadth. +Modern Data +Suitable for full UI internationalization +Before submitting data at the Moderate Level, the following must be in place: +Grammatical Features +The grammatical cases and other information, as in supplemental/grammaticalFeatures.xml +Must include minimal pair values. +Romanization table (non-Latin scripts only) +This can be supplied as a spreadsheet or as a rule file. +If a spreadsheet, for each letter (or sequence) in the exemplars, what is the corresponding Latin letter (or sequence). +More sophisticated users can do a better job, supplying a file of rules like transforms/Arabic-Latin-BGN.xml. +The data for the Modern Level includes: +### TBD +References +For the coverage in the latest released version of CLDR, see Locale Coverage Chart. +To see the development version of the rules used to determine coverage, see coverageLevels.xml. For a list of the locales at a given level, see coverageLevels.txt. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt b/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt new file mode 100644 index 00000000000..051e074a159 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt @@ -0,0 +1,53 @@ +Picking the Right Language Identifier +Within programs and structured data, languages are indicated with stable identifiers of the form en, fr-CA, or zh-Hant. The standard Unicode language identifiers follow IETF BCP 47, with some small differences defined in UTS #35: Locale Data Markup Language (LDML). Locale identifiers use the same format, with certain possible extensions. +Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah', and formal name "Lahnda". 
There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry. Moreover, a language identifier uses not only the 'base' language code, like 'en' for English or 'ku' for Kurdish, but also certain modifiers such as en-CA for Canadian English, or ku-Latn for Kurdish written in Latin script. Each of these modifiers is called a subtag (or sometimes a code), and subtags are separated by "-" or "_". The language identifier itself is also called a language tag, and sometimes a language code. +Here is an example of the steps to take to find the right language identifier to use. Let's say you want to find the identifier for a language called "Ganda" which you know is spoken in Uganda. You'll first pick the base language subtag as described below, then add any necessary script/territory subtags, and then verify. If you can't find the name after following these steps or have other questions, ask on the Unicode CLDR Mailing List. +If you are looking at a prospective language code, like "swh", the process is similar; follow the steps below, starting with the verification. +Choosing the Base Language Code +Go to iso639-3 to find the language. Typically you'll look under Name starting with G for Ganda. +There may be multiple entries for the item you want, so you'll need to look at all of them. For example, on the page for names starting with “P”, there are three records: “Panjabi”, “Mirpur Panjabi” and “Western Panjabi” (it is the last of these that corresponds to Lahnda). You can also try a search, but be careful. +You'll find an entry like: +While you may think that you are done, you have to verify that the three-letter code is correct. +Click on the "more..." in this case and you'll find id=lug. You can also use the URL http://www.sil.org/iso639-3/documentation.asp?id=XXX, where you replace XXX by the three-letter code. +Click on "See corresponding entry in Ethnologue." 
and you get to code=lug +Verify that this is indeed the language: +Look at the information on the Ethnologue page +Check Wikipedia and other web sources +AND IMPORTANTLY: Review Caution! below +Once you have the right three-letter code, you are still not done. Unicode (and BCP 47) uses the two-letter ISO code if it exists. Unicode also uses the "macro language" where suitable. So: +Use the two-letter code if there is one. In the example above, the highlighted "lg" from the first table. +Verify that the code is in http://www.iana.org/assignments/language-subtag-registry +If the code occurs in http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalMetadata.xml in the type attribute of a languageAlias element, then use the replacement instead. +For example, because "swh" occurs in , "sw" must be used instead of "swh". +Choosing Script/Territory Subtags +If you need a particular variant of a language, then you'll add additional subtags, typically script or territory. Consult Sample Subtags for the most common choices. Again, review Caution! below. +Verifying Your Choice +Verify your choice by using the online language identifier demo. +You need to fix the identifier and try again if the demo shows any of the following: +the language identifier is illegal, or +one of the subtags is invalid, or +there are any replacement values. +Documenting Your Choice +If you are requesting a new locale / language in CLDR, please include the links to the particular pages above so that we can process your request more quickly, as we have to double check before any addition. The links will be of the form: +http://www.sil.org/iso639-3/documentation.asp?id=xxx +http://www.ethnologue.com/show_language.asp?code=xxx +http://en.wikipedia.org/wiki/Western_Punjabi +and so on +Caution! +Canonical Form +Unicode language and locale IDs are based on BCP 47, but differ in a few ways. 
The canonical form is produced by using the canonicalization based on BCP 47 (thus changing iw → he, and zh-yue → yue), plus a few other steps: +Replacing the most prominent encompassed subtag by the macrolanguage (cmn → zh) +Canonicalizing overlong 3-letter codes (eng-840 → en-US) +Minimizing according to the likely subtag data (ru-Cyrl → ru, en-US → en). +BCP 47 also provides for "variant subtags", such as zh-Latn-pinyin. When there are multiple variant subtags, the canonical format for Unicode language identifiers puts them in alphabetical order. +Note that the CLDR likely subtag data is used to minimize scripts and regions, not the IANA Suppress-Script. The latter had a much more constrained design goal, and is more limited. +In some cases, systems (or companies) may have different conventions than the Preferred-Values in BCP 47 -- such as those in the Replacement column in the online language identifier demo. For example, for backwards compatibility, "iw" is used with Java instead of "he" (Hebrew). When picking the right subtags, be aware of these compatibility issues. If a target system uses a different canonical form for locale IDs than CLDR, the CLDR data needs to be processed by remapping its IDs to the target system's. +For compatibility, it is strongly recommended that all implementations accept both the preferred values and their alternates: for example, both "iw" and "he". Although BCP 47 itself only allows "-" as a separator, for compatibility Unicode language identifiers allow both "-" and "_". Implementations should also accept both. +Macrolanguages +ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). 
Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that, because that region is in the north, you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 636 Deprecation Requests. +Unicode language identifiers do not allow the "extlang" form defined in BCP 47. For example, use "yue" instead of "zh-yue" for Cantonese. +Ethnologue +When searching, such as site:ethnologue.com ganda, be sure to completely disregard matches in Ethnologue 14 -- these are out of date, and do not have the right codes! +The Ethnologue is a great source of information, but it must be approached with a certain degree of caution. Many of the population figures are far out of date, or not well substantiated. The Ethnologue also focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and on fluent speakers (not necessarily native speakers). So, for example, it would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language subtag for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation. +Wikipedia +Wikipedia is also a great source of information, but it must be approached with a certain degree of caution as well. Be sure to follow up on references, not just look at articles. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/plural-rules.txt b/docs/site/TEMP-TEXT-FILES/plural-rules.txt new file mode 100644 index 00000000000..b7794a909f3 --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/plural-rules.txt @@ -0,0 +1,172 @@ +Plural Rules +Languages vary in how they handle plurals of nouns or unit expressions ("hour" vs "hours", and so on). 
Some languages have two forms, like English; some languages have only a single form; and some languages have multiple forms. CLDR uses short, mnemonic tags for these plural categories: +zero +one (singular) +two (dual) +few (paucal) +many (also used for fractions if they have a separate class) +other (required—general plural form—also used if the language only has a single form) +See Language Plural Rules for the categories for each language in CLDR. +These categories are used to provide localized units, with more natural ways of expressing phrases that vary in plural form, such as "1 hour" vs "2 hours". While they cannot express all the intricacies of natural languages, they allow for more natural phrasing than constructions like "1 hour(s)". +Reporting Defects +When you find errors or omissions in this data, please report the information with a bug report. Please give examples of how the forms may differ. You don't have to give the exact rules, but it is extremely helpful! Here's an example: +Sample Bug Report +The draft Ukrainian (uk) plural rules are: +one: 1, 21, 31, 41, 51, 61... +few: 2-4, 22-24, 32-34... +other: 0, 5-20, 25-30, 35-40...; 1.31, 2.31, 5.31... +Although the rules for integer values are correct, there need to be four categories, +with an extra one for fractions. For example: +1 день +2 дні +5 днів +1.31 дня +2.31 дня +5.31 дня +Determining Plural Categories +The CLDR plural categories do not necessarily match the traditional grammatical categories. Instead, the categories are determined by changes required in a phrase or sentence if a numeric placeholder changes value. +Minimal pairs +The categories are verified by looking at minimal pairs: where a change in numeric value (expressed in digits) forces a change in the other words. For example, the following is a minimal pair for English, establishing a difference in category between "1" and "2". 
+Warning for Vetters +The Category (Code) values indicate a certain range of numbers that differ between languages. To see the meaning of each Code value for your language see Language Plural Rules chart. +The minimal pairs in the Survey Tool are not direct translations of English. They may be translations of English, such as in German, but must be different if those words or terms do not show the right plural differences for your language. For example, if we look at Belarusian, they are quite different, corresponding to “{0} books in {0} days”, while Welsh has the equivalent of “{0} dog, {0} cat”. Be sure to read the following examples carefully and pay attention to error messages. +For example, English has no separate plural form for "sheep". It would be wrong for the two phrases to be: +one: {0} sheep +other: {0} sheep +You have to pick a different phrase if that is the case in your language. Do not change the sentence in other ways, such as an "unforced change". For example, don't have the 'one' phrase be "{0} sheep" and the 'other' be "{0} deer". +The {0} will always have just a number composed of pure digits in it, such as 0, 1, 2, 3, … 11, 12, … 21, 22, .… 99, 100, …. For example, “1 dog, 1 cat” or “21 dog, 21 cat”. If there are multiple instances of {0}, they will always have the same number. The sentences must be parallel, with exactly the same construction except for what is forced by a change in digits. That is, for a language that has "one" and "other" categories: +take the phrase for "other" +change the {0} to "1" +make only the other changes to the phrase that are grammatically necessary because of that change +change the "1" back to "{0}" +you should then have the phrase for "one" +Gender is irrelevant. Do not contort your phrasing so that it could cover some (unspecified) item of a different gender. 
(E.g., don't have “Prenez la {0}re à droite; Prenez le {0}er à droite.”) The exception to that is where two nouns of different genders are needed to cover all plural categories, such as Russian “из {0} книг за {0} дня”. +Non-inflecting Nouns—Verbs +Some languages, like Bengali, do not change the form of the following noun when the numeric value changes. Even where nouns are invariant, other parts of a sentence might change. That is sufficient to establish a minimal pair. For example, even if all nouns in English were invariant (like 'fish' or 'sheep'), the verb changes are sufficient to establish a minimal pair: +Non-inflecting Nouns—Pronouns +In other cases, even the verb doesn't change, but referents (such as pronouns) change. So a minimal pair in such a language might look something like: +Multiple Nouns +In many cases, a single noun doesn't exhibit all the numeric forms. For example, in Welsh the following is a minimal pair that separates 1 and 2: +Category +one +two +Resolved String +1 ci +2 gi +But the form of this word is the same for 1 and 4. We need a separate word to get a minimal pair that separates 1 and 4: +Category +one +two +Resolved String +1 gath +1 cath +These combine into a single Minimal Pair Template that can be used to separate all 6 forms in Welsh. +Russian is similar, needing two different nouns: +The minimal pairs are those that are required for correct grammar. So because 0 and 1 don't have to form a minimal pair (it is ok—even though often not optimal—to say "0 people"), 0 doesn't establish a separate category. However, implementations are encouraged to provide the ability to have special plural messages for 0 in particular, so that more natural language can be used: +None of your friends are online. +rather than +You have 0 friends online. +Fractions +In some languages, fractions require a separate category. For example, Russian 'other' in the example above. 
In some languages, they are all in a single category with some integers, and in some languages they are in multiple categories. In any case, they also need to be examined to make sure that there are sufficient minimal pairs. +Rules +The next step is to determine the rules: which numbers go into which categories. +Integers +Test a variety of integers. Look for cases where the 'teens' (11-19) behave differently. Many languages care only about the last 2 digits, or only about the last digit. +Fractions +Fractions are often a bit tricky to determine: languages have very different behavior for them. In some languages the fraction is ignored (when selecting the category), in some languages the final digits of the fraction are important, in some languages a number changes category just if there are visible trailing zeros. Make sure to try out a range of fractions to see how the numbers behave: values like 1 vs 1.0 may behave differently, as may numbers like 1.1 vs 1.2 vs 1.21, and so on. +Choosing Plural Category Names +In some sense, the names for the categories are somewhat arbitrary. Yet for consistency across languages, the following guidelines should be used when selecting the plural category names. +If no forms change, then stop (there are no plural rules — everything gets 'other') +'one': Use the category 'one' for the form used with 1. +'other': Use the category 'other' for the form used with the most integers. +'two': Use the category 'two' for the form used with 2, if it is limited to numbers whose integer values end with '2'. +If everything else has the same form, stop (everything else gets 'other') +'zero': Use the category 'zero' for the form used with 0, if it is limited to numbers whose integer values end with '0'. 
+If everything else has the same form, stop (everything else gets 'other') +'few': Use the category 'few' for the form used with the least remaining number (such as '4') +If everything else has the same form, stop (everything else gets 'other') +'many': Use the category 'many' for the form used with the least remaining number (such as '10') +If everything else has the same form, stop (everything else gets 'other') +If there needs to be a category for items that only have fractional values, use 'many' +If there are more categories needed for the language, describe what those categories need to cover in the bug report. +See Language Plural Rules for examples of rules, such as for Czech, and for comparisons of values. Note that in the integer comparison chart, most languages have 'x' (other—gray) for most integers. There are some exceptions (Russian and Arabic, for example), where the categories of 'many' and 'other' should have been swapped when they were defined, but it is now too late to change them. +Important Notes +These categories are only mnemonics -- the names don't necessarily imply the exact contents of the category. For example, for both English and French the number 1 has the category one (singular). In English, every other number has a plural form, and is given the category other. French is similar, except that the number 0 also has the category one and not other or zero, because the form of units qualified by 0 is also singular. +This is worth emphasizing: A common mistake is to think that "one" is only for the number 1. Instead, "one" is a category for any number that behaves like 1. So in some languages, for example, one → numbers that end in "1" (like 1, 21, 151) but that don't end in 11 (like 11, 111, 10311). +Note that these categories may be different from the forms used for pronouns or other parts of speech. 
In particular, they are solely concerned with changes that would need to be made if different numbers, expressed with decimal digits, are used with a sentence. If there is a dual form in the language, but it isn't used with decimal numbers, it should not be reflected in the categories. That is, the key feature to look for is: +If you were to substitute a different number for "1" in a sentence or phrase, would the rest of the text be required to change? For example, in a caption for a video: +"Duration: 1 hour" → "Duration: 3.2 hours" +Plural Rule Syntax +See LDML Language Plural Rules. +Plural Message Migration +The plural categories are used not only within CLDR, but also for localizing messages for different products. When the plural rules change (such as in CLDR 24), the following issues should be considered. Fractional support in plurals is new in CLDR 24. Because the fractions didn't work before, the changes in categories from 23 to 24 should not cause an issue for implementations. The other changes can be categorized as Splitting or Merging categories. +There are some more complicated cases, but the following outlines the main issues to watch for, using examples. For illustration, assume a language uses "" for singular, "u" for dual, and "s" for other.​ ​ +OLD Rules & OLD Messages marks the situation before the change, +NEW Rules & OLD Messages marks the situation after the change (but before any fixes to messages), and +NEW Rules & NEW Messages shows the changes to the messages +Merging +The language really doesn't need 3 cases, because the dual is always identical to one of the other forms. +OLD Rules & OLD Messages +one: book +two: books +other: books +1  ➞ book, 2 ➞ books, 3 ➞ ​ books​ +NEW Rules & OLD or NEW Messages +one: book +other: books +1  ➞ book, 2 ➞ books, 3  ➞​ books​ +This is fairly harmless; merging two of the categories shouldn't affect anyone because the messages for the merged category should not have material differences. 
The old messages for 'two' are ignored in processing. They could be deleted if desired. +This was done in CLDR 24 for Russian, for example. +Splitting Other +In this case, the 'other' needs to be fixed by moving some numbers to a 'two' category. The way plurals are defined in CLDR, when a message (eg for 'two') is missing, it always falls back to 'other'. So the translation is no worse than before. There are two subcases. +Specific Other Message +In this case, the other message is appropriate for the other case, and not for the new 'two' case. +OLD Rules & OLD Messages +one: book +other: books +1  ➞ book, 2 ➞ books, 3  ➞​ books​ +NEW Rules & OLD Messages +one: book +two: books +other: books +1  ➞ book, 2 ➞ books, 3  ➞​ books​ +The quality is no different than previously. The message can be improved by adding the correct message for 'two', so that the result is: +NEW Rules & NEW Messages +one: book +two: booku +other: books +1  ➞ book, 2 ➞ booku, 3  ➞​ books​ +However, if the translated message is not missing, but has some special text like "UNUSED MESSAGE", then it will need to be fixed; otherwise the special text will show up to users! +Generic Other Message +In this case, the other message was written to be generic by trying to handle (with parentheses or some other textual device) both the plural and dual categories. +OLD Rules & OLD Messages +one: book +other: book(u/s) +1  ➞ book, 2 ➞ book(u/s), 3  ➞​ book(u/s) +NEW Rules & OLD Messages +one: book +two: book(u/s) +other: book(u/s) +1  ➞ book, 2 ➞ book(u/s), 3  ➞​ book(u/s) +The message can be improved by adding a message for 'two', and fixing the message for 'other' to not have the (u/s) workaround: +NEW Rules & NEW Messages +one: book +two: booku +other: books +1  ➞ book, 2 ➞ booku, 3  ➞​ books +Splitting Non-Other +In this case, the 'one' category needs to be fixed by moving some numbers to a 'two' category. 
+OLD Rules & OLD Messages +one: book/u +other: books +1 ➞ book/u, 2 ➞ book/u, 3 ➞ books +NEW Rules & OLD Messages +one: book/u +other: books +1 ➞ book/u, 2 ➞ books, 3 ➞ books +This is the one case where there is a regression in quality. In order to fix the problem, the message for 'two' needs to be fixed. If the message for 'one' was written to be generic, then it needs to be fixed as well. +NEW Rules & NEW Messages +one: book +two: booku +other: books +1 ➞ book, 2 ➞ booku, 3 ➞ books \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt b/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt new file mode 100644 index 00000000000..6cd5122f65d --- /dev/null +++ b/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt @@ -0,0 +1,136 @@ +Unicode Transliteration Guidelines +Introduction +This document describes guidelines for the creation and use of CLDR transliterations. Please file any feedback on this document or those charts at Locale Bugs. +Transliteration is the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script. For example, "Phobos" and "Deimos" are transliterations of Greek mythological "Φόβος" and "Δεῖμος" into Latin letters, used to name the moons of Mars. +Transliteration is not translation. Rather, transliteration is the conversion of letters from one script to another without translating the underlying words. The following shows a sample of transliteration systems: +Display. Some of the characters in this document may not be visible in your browser, and with some fonts the diacritics will not be correctly placed on the base letters. See Display Problems. +While an English speaker may not recognize that the Japanese word kyanpasu is equivalent to the English word campus, the word kyanpasu is still far easier to recognize and interpret than if the letters were left in the original script. 
There are several situations where this transliteration is especially useful, such as the following. See the sidebar for examples. +When a user views names that are entered in a world-wide database, it is extremely helpful to view and refer to the names in the user's native script. +When the user performs searching and indexing tasks, transliteration can retrieve information in a different script. +When a service engineer is sent a program dump that is filled with characters from foreign scripts, it is much easier to diagnose the problem when the text is transliterated and the service engineer can recognize the characters. +The term transliteration is sometimes given a narrow meaning, implying that the transformation is reversible (sometimes called lossless). In CLDR this is not the case; the term transliteration is interpreted broadly to mean both reversible and non-reversible transforms of text. (Note that even if theoretically a transliteration system is supposed to be reversible, in source standards it is often not specified in sufficient detail in the edge cases to actually be reversible.) A non-reversible transliteration is often called a transcription, or called a lossy or ambiguous transcription. +Note that reversibility is generally only in one direction, so a transliteration from a native script to Latin may be reversible, but not the other way around. For example, Hangul is reversible, in that any Hangul to Latin to Hangul should provide the same Hangul as the input. Thus we have the following: +갗 → gach → 갗 +However, for completeness, many Latin characters have fallbacks. This means that more than one Latin character may map to the same Hangul. Thus from Latin we don't have reversibility, because two different Latin source strings round-trip back to the same Latin string. +gach → 갗 → gach +gac → 갗 → gach +Transliteration can also be used to convert unfamiliar letters within the same script, such as converting Icelandic THORN (þ) to th. 
These are not typically reversible. +There is an online demo using released CLDR data at ICU Transform Demo. +Variants +There are many systems for transliteration between languages: the same text can be transliterated in many different ways. For example, for the Greek example above, the transliteration is classical, while the UNGEGN alternate has different correspondences, such as φ → f instead of φ → ph. +CLDR provides for generic mappings from script to script (such as Cyrillic-Latin), and also language-specific variants (Russian-French, or Serbian-German). There can also be semi-generic mappings, such as Russian-Latin or Cyrillic-French. These can be referred to, respectively, as script transliterations, language-specific transliterations, or script-language transliterations. Transliterations from other scripts to Latin are also called Romanizations. +Even within particular languages, there can be variant systems according to different authorities, or even varying across time (if the authority for a system changes its recommendation). The canonical identifier that CLDR uses for these has the form: +source-target/variant +The source (and target) can be a language or script, either using the English name or a locale code. The variant should specify the authority for the system, and if necessary for disambiguation, the year. For example, the identifier for the Russian to Latin transliteration according to the UNGEGN system would be: +ru-und_Latn/UNGEGN, or +Russian-Latin/UNGEGN +If there were multiple versions of these over time, the variant would be, say, UNGEGN2006. +The assumption is that implementations will allow the use of fallbacks, if the exact transliteration specified is unavailable. For example, the following would be the fallback chain for the identifier Russian-English/UNGEGN. 
This is similar to the Lookup Fallback Pattern used in BCP 47 Tags for Identifying Languages, except that it uses a "stepladder approach" to progressively handle the fallback among source, target, and variant, with priorities being the target, source, and variant, in that order. +Russian-English/UNGEGN +Russian-English +Cyrillic-English/UNGEGN +Cyrillic-English +Russian-Latin/UNGEGN +Russian-Latin +Cyrillic-Latin/UNGEGN +Cyrillic-Latin +Guidelines +There are a number of generally desirable guidelines for script transliterations. These guidelines are rarely satisfied simultaneously, so constructing a reasonable transliteration is always a process of balancing different requirements. These requirements are most important for people who are building transliterations, but are also useful as background information for users. +The following lists the general guidelines for Unicode CLDR transliterations: +standard: follow established systems (standards, authorities, or de facto practice) where possible, deviating sometimes where necessary for reversibility. In CLDR, the systems are generally described in the comments in the XML data files found in the transforms folder online. For example, the system for Arabic transliteration in CLDR is described in the comments in Arabic-Latin.xml; there is a reference to the UNGEGN Arabic Tables. Similarly for Hebrew, which also follows the Hebrew UNGEGN Tables. +complete: every well-formed sequence of characters in the source script should transliterate to a sequence of characters from the target script, and vice versa. +predictable: the letters themselves (without any knowledge of the languages written in that script) should be sufficient for the transliteration, based on a relatively small number of rules. This allows the transliteration to be performed mechanically. +pronounceable: the resulting characters have reasonable pronunciations in the target script.
Transliteration is not as useful if the process simply maps the characters without any regard to their pronunciation. Simply mapping by alphabetic order ("αβγδεζηθ..." to "abcdefgh...") could yield strings that might be complete and unambiguous, but the pronunciation would be completely unexpected. +reversible: it is possible to recover the text in the source script from the transliteration in the target script. That is, someone who knows the transliteration rules would be able to recover the precise spelling of the original source text. For example, it is possible to go from Elláda back to the original Ελλάδα, while if the transliteration were Ellada (with no accent), it would not be possible. +Some of these principles may not be achievable simultaneously; in particular, adherence to a standard system and reversibility. Often small changes in existing systems can be made to accommodate reversibility. However, where a particular system specifies a fundamentally non-reversible transliteration, that transliteration as represented in CLDR may not be reversible. +Ambiguity +In transliteration, multiple characters may produce ambiguities (non-reversible mappings) unless the rules are carefully designed. For example, the Greek character PSI (ψ) maps to ps, but ps could also result from the sequence PI, SIGMA (πσ) since PI (π) maps to p and SIGMA (σ) maps to s. +The Japanese transliteration standards provide a good mechanism for handling these kinds of ambiguities. Using the Japanese transliteration standards, whenever an ambiguous sequence in the target script does not result from a single letter, the transform uses an apostrophe to disambiguate it. For example, it uses that procedure to distinguish between man'ichi and manichi. Using this procedure, the Greek sequence PI SIGMA (πσ) maps to p's. This method is recommended for all script transliteration methods, although sometimes the character may vary: for example, "-" is used in Korean. 
+Note: We've had a recent proposal to consistently use the hyphenation dot for this code, thus we'd have πσ → p‧s. +A second problem is that some characters in a target script are not normally found outside of certain contexts. For example, the small Japanese "ya" character, as in "kya" (キャ), is not normally found in isolation. To handle such characters, the Unicode transliterations currently use different conventions. +Tilde: "ャ" in isolation is represented as "~ya" +Diacritics: Greek "ς" in isolation is represented as s̱ +Note: The CLDR committee is considering converging on a common representation for this. The advantage of a common representation is that it allows for easy filtering. +For the default script transforms, the goal is to have unambiguous mappings, with variants for any common use mappings that are ambiguous (non-reversible). In some cases, however, case may not be preserved. For example, +The following shows Greek text that is mapped to fully reversible Latin: +Greek-Latin +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. +tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. +If the user wants a version without certain accents, then CLDR's chaining rules can be used to remove the accents. For example, the following transliterates to Latin but removes the macron accents on the long vowels. +Greek-Latin; nfd; [\u0304] remove; nfc +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. +tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. +The above chaining rules, separated by semi-colons, perform the following commands in order: +The following transliterates to Latin but removes all accents. Note that the only change is to expand the filter for the remove command. 
+Greek-Latin; nfd; [:nonspacing marks:] remove; nfc +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. +ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. +Pronunciation +Standard transliteration methods often do not follow the pronunciation rules of any particular language in the target script. For example, the Japanese Hepburn system uses a "j" that has the English phonetic value (as opposed to French, German, or Spanish), but uses vowels that do not have the standard English sounds. A transliteration method might also require some special knowledge to have the correct pronunciation. For example, in the Japanese kunrei-siki system, "ti" is pronounced as English "chee". +This is similar to situations where there are different languages within the same script. For example, knowing that the word Gewalt comes from German allows a knowledgeable reader to pronounce the "w" as a "v".  When encountering a foreign word like jawa, there is little assurance how it is to be pronounced even when it is not a transliteration (it is just from another Latin-script language). The j could be pronounced (for an English speaker) as in jump, or Junker, or jour; and so on. Transcriptions are only roughly phonetic, and only so when the specific pronunciation rules are understood. +The pronunciation of the characters in the original script may also be influenced by context, which may be particularly misleading in transliteration. For example, in the Bengali নিঃশব, transliterated as niḥśaba, the visarga ḥ is not pronounced itself (whereas elsewhere it may be) but lengthens the ś sound, and the final inherent a is pronounced (whereas it commonly is not), and the two inherent a's are pronounced as ɔ and ô, respectively. +In some cases, transliteration may be heavily influenced by tradition. For example, the modern Greek letter beta (β) sounds like a "v", but a transliteration may use a b (as in biology).
In that case, the user would need to know that a "b" in the transliterated word corresponded to beta (β) and is to be pronounced as a v in modern Greek. +Letters may also be transliterated differently according to their context to make the pronunciation more predictable. For example, since the Greek sequence GAMMA GAMMA (γγ) is pronounced as ng, the first GAMMA can be transcribed as an "n" in that context. Similarly, the transliteration can give other guidance to the pronunciation in the source language, for example, using "n" or "m" for the same Japanese character (ん) depending on context, even though there is no distinction in the source script. +In general, predictability means that when transliterating Latin script to other scripts using reversible transliterations, English text will not produce phonetic results. This is because the pronunciation of English cannot be predicted easily from the letters in a word: e.g. grove, move, and love all end with "ove", but are pronounced very differently. +Cautions +Reversibility may require modifications of traditional transcription methods. For example, there are two standard methods for transliterating Japanese katakana and hiragana into Latin letters. The kunrei-siki method is unambiguous. The Hepburn method can be more easily pronounced by foreigners but is ambiguous. In the Hepburn method, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". A slightly amended version of Hepburn, that uses "dji" for DI and "dzu" for DU, is unambiguous. +When a sequence of two letters map to one, case mappings (uppercase and lowercase) must be handled carefully to ensure reversibility. For cased scripts, the two letters may need to have different cases, depending on the next letter. For example, the Greek letter PHI (Φ) maps to PH in Latin, but Φο maps to Pho, and not to PHo. +Some scripts have characters that take on different shapes depending on their context. 
Usually, this is done at the display level (such as with Arabic) and does not require special transliteration support. However, in a few cases this is represented with different character codes, such as in Greek and Hebrew. For example, a Greek SIGMA is written in a final form (ς) at the end of words, and a non-final form (σ) in other locations. This also requires the transform to map different characters based on the context. +Another thing to look out for when dealing with cased scripts is that some of the characters in the target script may not be able to represent case distinctions, such as some of the IPA characters in the Latin script. +It is useful for the reverse mapping to be complete so that arbitrary strings in the target script can be reasonably mapped back to the source script. Complete reverse mapping makes it much easier to do mechanical quality checks and so on. For example, even though the letter "q" might not be necessary in a transliteration of Greek, it can be mapped to a KAPPA (κ). Such reverse mappings will not, in general, be unambiguous. +Available Transliterations +Currently Unicode CLDR offers Romanizations for certain scripts, plus transliterations between the Indic scripts (excluding Urdu). Additional script transliterations will be added in the future. +Except where otherwise noted, all of these systems are designed to be reversible. For bicameral scripts (those with uppercase and lowercase), however, case may not be completely preserved. +The transliterations are also designed to be complete for any sequence of the Latin letters a-z. A fallback is used for a letter that is not covered by the transliteration, and default letters may be inserted as required. For example, in the Hangul transliteration, rink → 린크 → linkeu. That is, "r" is mapped to the closest other letter, and a default vowel is inserted at the end (since "nk" cannot end a syllable). +Preliminary charts are available for the available transliterations. 
Be sure to read the known issues described there. +Korean +There are many Romanizations of Korean. The default transliteration in Unicode CLDR follows the Korean Ministry of Culture & Tourism Transliteration regulations (see also English summary). There is an optional clause 8 variant for reversibility: +"제 8 항 학술 연구 논문 등 특수 분야에서 한글 복원을 전제로 표기할 경우에는 한글 표기를 대상으로 적는다. 이때 글자 대응은 제2장을 따르되 'ㄱ, ㄷ, ㅂ, ㄹ'은 'g, d, b, l'로만 적는다. 음가 없는 'ㅇ'은 붙임표(-)로 표기하되 어두에서는 생략하는 것을 원칙으로 한다. 기타 분절의 필요가 있을 때에도 붙임표(-)를 쓴다." +translation: "Clause 8: When it is required to recover the original Hangul representation faithfully as in scholarly articles, 'ㄱ, ㄷ, ㅂ, ㄹ' must always be romanized as 'g, d, b, l' while the mapping for the rest of the letters remains the same as specified in clause 2. The placeholder 'ㅇ' at the beginning of a syllable should be represented with '-', but should be omitted at the beginning of a word. In addition, '-' should be used in other cases where a syllable boundary needs to be explicitly marked (disambiguated)." +There are a number of cases where this Romanization may be ambiguous, because sometimes multiple Latin letters map to a single entity (jamo) in Hangul. This happens with vowels and consonants, the latter being slightly more complicated because there are both initial and final consonants: +CLDR uses the following rules for disambiguation of the possible boundaries between letters, in order. The first rule comes from Clause 8. +Don't break so as to require an implicit vowel or null consonant (if possible) +Don't break within Initial-Only or Initial-Or-Final sequences (if possible) +Favor longest match first. +If there is a single consonant between vowels, then Rule #1 will group it with the following vowel if there is one (this is the same as the first part of Clause 8). If there is a sequence of four consonants between vowels, then there is only one possible break (with well-formed text).
So the only ambiguities lie with two or three consonants between vowels, where there are possible multi-character consonants involved. Even there, in most cases the resolution is simple, because there isn't a possible multi-character consonant in the case of two, or two possible multi-character consonants in the case of three. For example, in the following cases, the left side is unambiguous: +angda = ang-da → 앙다 +apda = ap-da → 앞다 +There are a relatively small number of possible ambiguities, listed below using "a" as a sample vowel. +For vowel sequences, the situation is simpler. Only Rule #3 applies, so aeo = ae-o → 애오. +Japanese +The default transliteration for Japanese uses a slight variant of the Hepburn system. With the Hepburn system, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended slightly for reversibility by using "dji" for DI and "dzu" for DU. +Greek +The default transliteration uses a standard transcription for Greek which is aimed at preserving etymology. The ISO 843 variant includes the following differences: +* before γ, κ, ξ, χ +Cyrillic +Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative plans to add extended Cyrillic characters in the future, plus variants for GOST and other national standards. +Indic +Transliteration of Indic scripts follows the ISO 15919 Transliteration of Devanagari and related Indic scripts into Latin characters. Internally, all Indic scripts are transliterated by converting first to an internal form, called Inter-Indic, then from Inter-Indic to the target script. Inter-Indic thus provides a pivot between the different scripts, and contains a superset of correspondences for all of them. +ISO 15919 differs from ISCII 91 in application of diacritics for certain characters.
These differences are shown in the following example (illustrated with Devanagari, although the same principles apply to the other Indic scripts): +Transliteration rules from Indic to Latin are reversible with the exception of the ZWJ and ZWNJ used to request explicit rendering effects. For example: +Transliteration between Indic scripts is roundtrip where there are corresponding letters. Otherwise, there may be fallbacks. +There are two particular instances where transliterations may produce unexpected results: (1) where the final vowel is suppressed in speech, and (2) with the transliteration of 'c'. +For example: +Others +Unicode CLDR provides other transliterations based on the U.S. Board on Geographic Names (BGN) transliterations. These are currently unidirectional — to Latin only. The goal is to make them bidirectional in future versions of CLDR. +Other transliterations are generally based on the UNGEGN: Working Group on Romanization Systems transliterations. These systems are in wider actual implementation than most ISO standardized transliterations, and are freely available on the web (http://www.eki.ee/wgrs/) and thus easily accessible to all. The UNGEGN also has good documentation. For example, the UNGEGN Arabic Tables not only present the UN system, but compare it with the BGN/PCGN 1956 system, the I.G.N. System 1973, ISO 233:1984, the Royal Jordanian Geographic Centre System, and the Survey of Egypt System. +Submitting Transliterations +If you are interested in providing transliterations for one or more scripts, file an initial bug report at Locale Bugs. The initial bug should contain the scripts and/or languages involved, the system being followed (with a link to a full description of the proposed transliteration system), and a brief example. The proposed data can also be in that bug, or be added in a Reply to that bug. You can also file a bug in Locale Bugs if you find a problem in an existing transliteration. 
+For submission to CLDR, the data needs to be supplied in the correct XML format or in the ICU format, and should follow an accepted standard (like UNGEGN, BGN, or others). +The format for rules is specified in Transform_Rules. It is best if the results are tested using the ICU Transform Demo first, since if the data doesn't validate it would not be accepted into CLDR. +As mentioned above, even if a transliteration is only used in certain countries or contexts, CLDR can provide for them with different variant tags. +For comparison, you can see what is currently in CLDR in the transforms folder online. For example, see Hebrew-Latin.xml. +Script transliterators should cover every character in the exemplar sets for the CLDR locales using that script. +Romanizations (Script-Latin) should cover all the ASCII letters (some of these can be fallback mappings, such as the 'x' below). +If the rules are very simple, they can be supplied in a spreadsheet, with two columns, such as +More Information +For more information, see: +BGN: U.S. Board on Geographic Names +UNGEGN: UNITED NATIONS GROUP OF EXPERTS ON GEOGRAPHICAL NAMES: Working Group on Romanization Systems +Transliteration of Non-Roman Alphabets and Scripts (Thomas T. Pedersen) +Standards for Archival Description: Romanization +ISO-15915 (Hindi) +ISO-15915 (Gujarati) +ISO-15915 (Kannada) +ISCII-91 +UTS #35: Locale Data Markup Language (LDML) \ No newline at end of file diff --git a/docs/site/index/cldr-spec/core-data-for-new-locales.md b/docs/site/index/cldr-spec/core-data-for-new-locales.md new file mode 100644 index 00000000000..b2d1c813989 --- /dev/null +++ b/docs/site/index/cldr-spec/core-data-for-new-locales.md @@ -0,0 +1,33 @@ +--- +title: Core Data for New Locales +--- + +# Core Data for New Locales + +This document describes the minimal data needed for a new locale. There are two kinds of data that are relevant for new locales: + +1. 
**Core Data** \- This is data that the CLDR committee needs from the proposer ***before*** a new locale is added. The proposer is expected to also get a Survey Tool account, and contribute towards the Basic Data. +2. **Basic Data** \- The Core data is just the first step. It is only created under the expectation that people will engage in supplying data at a [Basic Coverage Level](https://cldr.unicode.org/index/cldr-spec/coverage-levels#h.yi1eiryx7yl4). **If the locale does not meet the [Basic Coverage Level](https://cldr.unicode.org/index/cldr-spec/coverage-levels#h.yi1eiryx7yl4) in the next Survey Tool cycle, the committee may remove the locale.** + +## Core Data + +Collect and submit the following data, using the [Core Data Submission Form](https://docs.google.com/forms/d/e/1FAIpQLSfSyz0VUSXD93IJQQdjzUCnbQwC2nwz6eiLjTaFjASQZzpoSg/viewform). *Note to translators: If you are having difficulties or questions about the following data, please contact us: [file a new bug](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket), or post a follow\-up comment to your existing bug.* + +1. The correct language code according to [Picking the Right Language Identifier](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code). +2. The four exemplar sets: main, auxiliary, numbers, punctuation.  + - These must reflect the Unicode model. For more information, see [tr35\-general.html\#Character\_Elements](http://www.unicode.org/reports/tr35/tr35-general.html#Character_Elements). +3. Verified country data (i.e., the population of speakers in the regions (countries) in which the language is commonly used)  + - There must be at least one country, but should include enough others that they cover approximately 75% or more of the users of the language. + - "Users of the language" includes those using it as either a 1st or 2nd language. The main focus is on written language. +4. 
Default content script and region (normally the region is the country with largest population using that language, and the customary script used for that language in that country).  + - **\[[supplemental/supplementalMetadata.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/supplementalMetadata.xml#LC1654:~:text=%3CdefaultContent)]** + - *See*: [http://cldr.unicode.org/translation/translation\-guide\-general/default\-content](https://cldr.unicode.org/translation/translation-guide-general/default-content) +5. The correct time cycle used with the language in the default content region + - In common/supplemental/supplementalData.xml, this is the "timeData" element + - The value should be h (1\-12\), H (0\-23\), k (1\-24\), or K (0\-11\); as defined in [https://www.unicode.org/reports/tr35/tr35\-dates.html\#Date\_Field\_Symbol\_Table](https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table) + +***You must commit to supplying [the data required for the new locale to reach Basic level](https://cldr.unicode.org/index/cldr-spec/core-data-for-new-locales#h.yaraq3qjxnns) during the next open CLDR submission when requesting a new locale to be added.*** + +For more information on the other coverage levels refer to [Coverage Levels](https://cldr.unicode.org/index/cldr-spec/coverage-levels)  + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/index/cldr-spec/coverage-levels.md b/docs/site/index/cldr-spec/coverage-levels.md new file mode 100644 index 00000000000..6ec633c07f5 --- /dev/null +++ b/docs/site/index/cldr-spec/coverage-levels.md @@ -0,0 +1,107 @@ +--- +title: Coverage Levels +--- + +# Coverage Levels + +There are four main coverage levels as defined in the [UTS \#35: Unicode Locale Data Markup Language (LDML) Part 6: Supplemental: 8 Coverage Levels](https://www.unicode.org/reports/tr35/tr35-info.html#Coverage_Levels). They are described more fully below. 
+ +## Usage + +You can use the file **common/properties/coverageLevels.txt** (added in v41\) for a given release to filter the locales that they support. For example, see [coverageLevels.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/coverageLevels.txt). (This and other links to data files are to the development versions; see the specific version for the release you are working with.) For a detailed chart of the coverage levels, see the [locale\_coverage.html](https://unicode-org.github.io/cldr-staging/charts/43/supplemental/locale_coverage.html) file for the respective release. + +The file format is semicolon delimited, with 3 fields per line. + + +```Locale ID ; Coverage Level ; Name``` + +Each locale ID also covers all the locales that inherit from it. So to get locales at a desired coverage level or above, the following process is used. + +1. Always include the root locale file, **root.xml** +2. Include all of the locale files listed in **coverageLevels.txt** at that level or above. +3. Recursively include all other files that inherit from the files in \#2\. + - **Warning**: Inheritance is not simple truncation; the **parentLocale** information in [supplementalData.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/supplementalData.xml) needs to be applied also. See [Parent\_Locales](https://www.unicode.org/reports/tr35/tr35.html#Parent_Locales). + - For example, if you include fr.xml in \#2, you would also include fr\_CA.xml; if you include no.xml in \#2 you would also include nn.xml. + +### Filtering + +To filter "at that level or above", you use the fact that basic ⊂ moderate ⊂ modern, so  + +1. to filter for basic and above, filter for basic\|moderate\|modern +2. to filter for moderate and above, filter for moderate\|modern + +### Migration + +As of v43, the files in **/seed/** have been moved to **/common/**. Older versions of CLDR separated some locale files into a 'seed' directory. 
Some implementations used this for filtering, but the criteria for moving from seed to common were not rigorous. To maintain compatibility with the set of locales used from previous versions, an implementation may use the above process for Basic and above, but then also add locales that were previously included. For more information, see [CLDR 43 Release Note](https://cldr.unicode.org/index/downloads/cldr-43).  + +## Core Data + +**The data needed for a new locale to be added. See [Core Data for New Locales](https://cldr.unicode.org/index/cldr-spec/core-data-for-new-locales) for details on Core Data and how to submit for new locales.** + +**It is expected that during the next Survey Tool cycle after a new locale is added, the data for the Basic Coverage Level will be supplied.** + +## Basic Data + +**Suitable for locale selection and minimal support, e.g., choice of language on mobile phone** + +This includes very minimal data for support of the language: basic dates, times, autonyms: + +1. Delimiter Data — Quotation start/end, including alternates +2. Numbering system — default numbering system \+ native numbering system (if default \= Latin and native ≠ Latin) +3. Locale Pattern Info — Locale pattern and separator, and code pattern +4. Language Names — in the native language for the native language and for English +5. Script Name(s) — Scripts customarily used to write the language +6. Country Name(s) — For countries where commonly used (see "Core XML Data") +7. Measurement System — metric vs UK vs US +8. Full Month and Day of Week names +9. AM/PM period names +10. Date and Time formats +11. Date/Time interval patterns — fallback +12. Timezone baseline formats — region, gmt, gmt\-zero, hour, fallback +13. Number symbols — decimal and grouping separators; plus, minus, percent sign (for Latin number system, plus native if different) +14. 
Number patterns — decimal, currency, percent, scientific + +## Moderate Data + +**Suitable for “document content” internationalization, e.g., content in a spreadsheet** + +Before submitting data above the Basic Level, the following must be in place: + +1. Plural and Ordinal rules + - As in \[supplemental/plurals.xml] and \[supplemental/ordinals.xml] + - Must also include minimal pairs + - For more information, see [cldr\-spec/plural\-rules](https://cldr.unicode.org/index/cldr-spec/plural-rules). +2. Casing information (only where the language uses a cased script according to [ScriptMetadata.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/scriptMetadata.txt)) + - This will go into [common/casing](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/casing/) +3. Collation rules \[non\-Survey Tool] + - This can be supplied as a list of characters, or as a rule file. + - The list is a space\-delimited list of the characters used by the language (in the given script). The list may include multiple\-character strings, where those are treated specially. For example, if "ch" is sorted after "h" one might see "a b c d .. g h ch i j ..." + - More sophisticated users can do a better job, supplying a file of rules as in [cldr\-spec/collation\-guidelines](https://cldr.unicode.org/index/cldr-spec/collation-guidelines). +4. The result will be a file like: [common/collation/ar.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/ar.xml) or [common/collation/da.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/da.xml). +The data for the Moderate Level includes subsets of the Modern data, both in depth and breadth. + +## Modern Data + +**Suitable for full UI internationalization** + +Before submitting data at the Moderate Level, the following must be in place: + +1. Grammatical Features + 1. 
The grammatical cases and other information, as in [supplemental/grammaticalFeatures.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/grammaticalFeatures.xml) + 2. Must include minimal pair values. +2. Romanization table (non\-Latin scripts only) + 1. This can be supplied as a spreadsheet or as a rule file. + 2. If a spreadsheet, for each letter (or sequence) in the exemplars, what is the corresponding Latin letter (or sequence). + 3. More sophisticated users can do a better job, supplying a file of rules like [transforms/Arabic\-Latin\-BGN.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/transforms/Arabic-Latin-BGN.xml). + +The data for the Modern Level includes: + +**\#\#\# TBD** + +## References + +For the coverage in the latest released version of CLDR, see [Locale Coverage Chart](https://unicode-org.github.io/cldr-staging/charts/latest/supplemental/locale_coverage.html). + +To see the development version of the rules used to determine coverage, see [coverageLevels.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/coverageLevels.xml). For a list of the locales at a given level, see [coverageLevels.txt](https://github.com/unicode-org/cldr/blob/main/common/properties/coverageLevels.txt).  
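As a rough illustration of how the list in coverageLevels.txt might be used to select locales at a given level or above, here is a minimal sketch. It assumes three semicolon-separated fields per line (Locale ID ; Coverage Level ; Name), that `#` starts a comment, and that the level names are ordered core < basic < moderate < modern, following the section headings above. The function name and the sample lines are hypothetical, not part of CLDR.

```python
from typing import Iterable, Set

# Assumed ordering of level names, lowest to highest (taken from the headings above).
LEVELS = ["core", "basic", "moderate", "modern"]


def locales_at_or_above(lines: Iterable[str], minimum: str) -> Set[str]:
    """Return the locale IDs whose coverage level is `minimum` or higher.

    The root locale is always included.
    """
    threshold = LEVELS.index(minimum)
    selected = {"root"}
    for raw in lines:
        # Drop comments (assumed to start with '#') and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            continue  # skip anything that is not "Locale ID ; Coverage Level ; Name"
        locale_id, level, _name = fields
        if level in LEVELS and LEVELS.index(level) >= threshold:
            selected.add(locale_id)
    return selected


# Hypothetical sample lines, for illustration only.
sample = [
    "# Locale ID ; Coverage Level ; Name",
    "de ; modern ; German",
    "cv ; basic ; Chuvash",
    "xx ; core ; Example",
]
print(sorted(locales_at_or_above(sample, "moderate")))
```

Because each locale ID also covers the locales that inherit from it, a real implementation would additionally include the child locale files of every selected ID.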
+ +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/index/cldr-spec/picking-the-right-language-code.md b/docs/site/index/cldr-spec/picking-the-right-language-code.md new file mode 100644 index 00000000000..3bb83669538 --- /dev/null +++ b/docs/site/index/cldr-spec/picking-the-right-language-code.md @@ -0,0 +1,93 @@ +--- +title: Picking the Right Language Identifier +--- + +# Picking the Right Language Identifier + +Within programs and structured data, languages are indicated with stable identifiers of the form [en](http://unicode.org/cldr/utility/languageid.jsp?a=en), [fr\-CA](http://unicode.org/cldr/utility/languageid.jsp?a=fr-CA), or [zh\-Hant](http://unicode.org/cldr/utility/languageid.jsp?a=zh-Hant&l=en). The standard Unicode language identifiers follow IETF BCP 47, with some small differences defined in [UTS \#35: Locale Data Markup Language (LDML)](http://www.unicode.org/reports/tr35/). Locale identifiers use the same format, with certain possible extensions. + +Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code '[lah](http://unicode.org/cldr/utility/languageid.jsp?a=lah)', and formal name "Lahnda". There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry. Moreover, a language identifier uses not only the 'base' language code, like '[en](http://unicode.org/cldr/utility/languageid.jsp?a=en)' for English or '[ku](http://unicode.org/cldr/utility/languageid.jsp?a=ku)' for Kurdish, but also certain modifiers such as [en\-CA](http://unicode.org/cldr/utility/languageid.jsp?a=en-CA) for *Canadian English*, or [ku\-Latn](http://ku-Latn) for *Kurdish written in Latin script*. Each of these modifiers are called *subtags* (or sometimes *codes*), and are separated by "\-" or "\_". 
The language identifier itself is also called a *language tag*, and sometimes a *language code*. + +Here is an example of the steps to take to find the right language identifier to use. Let's say you to find the identifier for a language called "Ganda" which you know is spoken in Uganda. You'll first pick the base language subtag as described below, then add any necessary script/territory subtags, and then verify. If you can't find the name after following these steps or have other questions, ask on the [Unicode CLDR Mailing List](http://www.unicode.org/consortium/distlist.html#cldr_list). + +If you are looking at a prospective language code, like "swh", the process is similar; follow the steps below, starting with the verification. + +## Choosing the Base Language Code + +1. Go to [iso639\-3](http://www-01.sil.org/iso639-3/codes.asp) to find the language. Typically you'll look under **Name** starting with **G** for Ganda. +2. There may be multiple entries for the item you want, so you'll need to look at all of them. For example, on the page for names starting with “P”, there are three records: “Panjabi”, “Mirpur Panjabi” and “Western Panjabi” (it is the last of these that corresponds to Lahnda). You can also try a search, but be [careful](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code). +3. You'll find an entry like: + + lug  lug  **lg**  Ganda  Individual  Living  more ... + +While you may think that you are done, you have to verify that the three\-letter code is correct. + +1. Click on the "more..." in this case and you'll find [id\=lug](http://www.sil.org/iso639-3/documentation.asp?id=lug). You can also use the URL http://www.sil.org/iso639\-3/documentation.asp?id\=XXX, where you replace XXX by the three\-letter code. +2. Click on "See corresponding entry in [Ethnologue](http://www.ethnologue.com/show_language.asp?code=lug)." and you get to [code\=lug](http://www.ethnologue.com/show_language.asp?code=lug) +3. 
Verify that is indeed the language: + 1. Look at the information on the ethnologue page + 2. Check Wikipedia and other web sources +4. ***AND IMPORTANTLY: Review [Caution!](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code) below*** + +Once you have the right three\-letter code, you are still not done. Unicode (and BCP 47\) uses the 2 letter ISO code if it exists. Unicode also uses the "macro language" where suitable. *So* + +1. Use the two\-letter code if there is one. In the example above, the highlighted "lg" from the first table. +2. Verify that the code is in http://www.iana.org/assignments/language-subtag-registry +3. If the code occurs in http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalMetadata.xml in the type attribute of a languageAlias element, then use the replacement instead. + - For example, because "swh" occurs in \, "sw" must be used instead of "swh". + +## Choosing Script/Territory Subtags + +If you need a particular variant of a language, then you'll add additional subtags, typically script or territory. Consult [Sample Subtags](http://unicode.org/cldr/utility/sample_subtags.html) for the most common choices. ***Again, review*** [***Caution!***](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code) ***below.*** + +## Verifying Your Choice + +1. Verify your choice by using the [online language identifier](http://unicode.org/cldr/utility/languageid.jsp) demo. +2. You need to fix the identifier and try again in *any* if the demo shows any of the following: + 1. the language identifer is illegal, or + 2. one of the subtags is invalid, or + 3. there are any replacement values. 
[\*\*](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code) + +## Documenting Your Choice + +If you are requesting a new locale / language in CLDR, please include the links to the particular pages above so that we can process your request more quickly, as we have to double check before any addition. The links will be of the form: + +- http://www.sil.org/iso639-3/documentation.asp?id=xxx +- http://www.ethnologue.com/show_language.asp?code=xxx +- http://en.wikipedia.org/wiki/Western_Punjabi +- and so on + +## Caution! + +### Canonical Form + +Unicode language and locale IDs are based on BCP 47, but differ in a few ways. The canonical form is produced by using the canonicalization based on BCP47 (thus changing iw → he, and zh\-yue → yue), plus a few other steps: + +1. Replacing the most prominent encompassed subtag by the macrolanguage (cmn → zh) +2. Canonicalizing overlong 3 letter codes (eng\-840 → en\-US) +3. Minimizing according to the likely subtag data (ru\-Cyrl → ru, en\-US → en). +4. BCP 47 also provides for "variant subtags", such as [zh\-Latn\-pinyin](http://unicode.org/cldr/utility/languageid.jsp?a=zh-Latn-pinyin). When there are multiple variant subtags, the canonical format for Unicode language identifiers puts them in alphabetical order. + +Note that the CLDR likely subtag data is used to minimize scripts and regions, *not* the IANA Suppress\-Script. The latter had a much more constrained design goal, and is more limited. + +In some cases, systems (or companies) may have different conventions than the Preferred\-Values in BCP 47 \-\- such as those in the Replacement column in the the [online language identifier](http://unicode.org/cldr/utility/languageid.jsp) demo. For example, for backwards compatibility, "iw" is used with Java instead of "he" (Hebrew). When picking the right subtags, be aware of these compatibility issues. 
*If a target system uses a different canonical form for locale IDs than CLDR, the CLDR data needs to be processed by remapping its IDs to the target system's.* + +For compatibility, it is strongly recommended that all implementations accept both the preferred values and their alternates: for example, both "iw" and "he". Although BCP 47 itself only allows "\-" as a separator; for compatibility, Unicode language identifiers allows both "\-" and "\_". Implementations should also accept both. + +### Macrolanguages + +ISO (and hence BCP 47\) has the notion of an individual language (like en \= English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that because that is in the north, that you should use the subtag 'kmr' for Northern Kurdish. You should instead use [ku\-Latn\-TR](http://ku-latn/). See also: [ISO 636 Deprecation Requests](https://cldr.unicode.org/development/development-process/design-proposals/iso-636-deprecation-requests-draft). + +Unicode language identifiers do not allow the "extlang" form defined in BCP 47\. For example, use "yue" instead of "zh\-yue" for Cantonese. + +### Ethnologue + +*When searching, such as* [*site:ethnologue.com ganda*](http://www.google.com/search?q=site%3Aethnologue.com+ganda)*, be sure to completely disregard matches in* [*Ethnologue 14*](http://www.ethnologue.com/14/) *\-\- these are out of date, and do not have the right codes!* + +The Ethnologue is a great source of information, but it must be approached with a certain degree of caution. Many of the population figures are far out of date, or not well substantiated. 
The Ethnologue also focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and on fluent speakers (not necessarily native speakers). So, for example, it would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language subtag for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation. + +### Wikipedia + +Wikipedia is also a great source of information, but it must be approached with a certain degree of caution as well. Be sure to follow up on references, not just look at articles. + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/index/cldr-spec/plural-rules.md b/docs/site/index/cldr-spec/plural-rules.md new file mode 100644 index 00000000000..6e1b96d7eaf --- /dev/null +++ b/docs/site/index/cldr-spec/plural-rules.md @@ -0,0 +1,335 @@ +--- +title: Plural Rules +--- + +# Plural Rules + +Languages vary in how they handle plurals of nouns or unit expressions ("hour" vs "hours", and so on). Some languages have two forms, like English; some languages have only a single form; and some languages have multiple forms. CLDR uses short, mnemonic tags for these plural categories: + +- zero +- one (singular) +- two (dual) +- few (paucal) +- many (also used for fractions if they have a separate class) +- other (required—general plural form—also used if the language only has a single form) + +*See [Language Plural Rules](https://www.unicode.org/cldr/charts/45/supplemental/language_plural_rules.html) for the categories for each language in CLDR.* + +These categories are used to provide localized units, with more natural ways of expressing phrases that vary in plural form, such as "1 hour" vs "2 hours". 
While they cannot express all the intricacies of natural languages, they allow for more natural phrasing than constructions like "1 hour(s)". + +## Reporting Defects + +When you find errors or omissions in this data, please report the information with a [bug report](https://cldr.unicode.org/index/bug-reports#TOC-Filing-a-Ticket). Please give examples of how the forms may differ. You don't have to give the exact rules, but it is extremely helpful! Here's an example:   + +**Sample Bug Report** + +The draft Ukrainian (uk) plural rules are: + +one: 1, 21, 31, 41, 51, 61\... + +few: 2\-4, 22\-24, 32\-34\... + +other: 0, 5\-20, 25\-30, 35\-40\...; 1\.31, 2\.31, 5\.31\... + +Although rules for integer values are correct, there needs to be four categories, + +with an extra one for fractions. For example: + +1 день
+2 дні
+5 днів
+1\.31 дня
+2\.31 дня
+5\.31 дня + +## Determining Plural Categories + +The CLDR plural categories do not necessarily match the traditional grammatical categories. Instead, the categories are determined by changes required in a phrase or sentence if a numeric placeholder changes value. + +### Minimal pairs + +The categories are verified by looking at minimal pairs, where a change in numeric value (expressed in digits) forces a change in the other words. For example, the following is a minimal pair for English, establishing a difference in category between "1" and "2". + +| Category | Resolved String | Minimal Pair Template | +|---|---|---| +| one | 1 day | {NUMBER} day | +| other | 2 days | {NUMBER} days | + +Warning for Vetters + +The Category (Code) values indicate a certain range of numbers that differs between languages. To see the meaning of each Code value for your language, see the [Language Plural Rules](https://www.unicode.org/cldr/charts/45/supplemental/language_plural_rules.html) chart. + +*The minimal pairs in the Survey Tool are not direct translations of English*. They *may* be translations of English, such as in [German](https://st.unicode.org/cldr-apps/v#/de/MinimalPairs/), but must be different if those words or terms do not show the right plural differences for your language. For example, if we look at [Belarusian](https://st.unicode.org/cldr-apps/v#/be/MinimalPairs/), they are quite different, corresponding to “{0} books in {0} days”, while [Welsh](https://st.unicode.org/cldr-apps/v#/cy/MinimalPairs/43b7793f1f673abe) has the equivalent of “{0} dog, {0} cat”. *Be sure to read the following examples carefully and pay attention to error messages.* + +For example, English has no separate plural form for "sheep". It would be wrong for the two phrases to be: + +- one: {0} sheep +- other: {0} sheep + +You have to pick a different phrase if that is the case in your language. Do not change the sentence in other ways, such as an "unforced change". 
For example, don't have the 'one' phrase be "{0} sheep" and the 'other' be "{0} deer". + +The {0} will always have just a number composed of pure digits in it, such as 0, 1, 2, 3, … 11, 12, … 21, 22, … 99, 100, …. For example, “1 dog, 1 cat” or “21 dog, 21 cat”. If there are multiple instances of {0}, they will always have the same number. The sentences must be parallel, with exactly the same construction except for what is forced by a change in digits. That is, for a language that has "one" and "other" categories: + +- take the phrase for "other" +- change the {0} to "1" +- make only the other changes to the phrase that are grammatically necessary because of that change +- change the "1" back to "{0}" +- you should then have the phrase for "one" + +Gender is irrelevant. Do not contort your phrasing so that it could cover some (unspecified) item of a different gender. (Eg, don't have “Prenez la {0}re à droite; Prenez le {0}er à droite.”) The exception to that is where two nouns of different genders are needed to cover all plural categories, such as Russian “из {0} книг за {0} дня”. + +Non\-inflecting Nouns—Verbs + +Some languages, like Bengali, do not change the form of the following noun when the numeric value changes. Even where nouns are invariant, other parts of a sentence might change. That is sufficient to establish a minimal pair. For example, even if all nouns in English were invariant (like 'fish' or 'sheep'), the verb changes are sufficient to establish a minimal pair: + +| Category | Resolved String | Minimal Pair Template | +|---|---|---| +| one | 1 fish is swimming | {NUMBER} fish is swimming | +| other | 2 fish **are** swimming | {NUMBER} fish **are** swimming | + +Non\-inflecting Nouns—Pronouns + +In other cases, even the verb doesn't change, but *referents* (such as pronouns) change. 
So a minimal pair in such a language might look something like: + +| Category | Resolved String | Minimal Pair Template | +|---|---|---| +| one | You have 1 fish in your cart; do you want to buy **it**? | You have {NUMBER} fish in your cart; do you want to buy **it**? | +| other | You have 2 fish in your cart; do you want to buy **them**? | You have {NUMBER} fish in your cart; do you want to buy **them**? | + +Multiple Nouns + +In many cases, a single noun doesn't exhibit all the numeric forms. For example, in Welsh the following is a minimal pair that separates 1 and 2: + +| **Category** | **Resolved String** | +|---|---| +| one | 1 ci | +| two | 2 **g**i | + +But the form of this word is the same for 1 and 4\. We need a separate word to get a minimal pair that separates 1 and 4: + +| **Category** | **Resolved String** | +|---|---| +| one | 1 gath | +| other | 4 cath | + +These combine into a single Minimal Pair Template that can be used to separate all 6 forms in Welsh. + +| Category | Resolved String | Minimal Pair Template | +|---|---|---| +| zero | 0 cŵn, 0 cathod | {NUMBER} cŵn, {NUMBER} cathod | +| one | 1 ci, 1 gath | {NUMBER} ci, {NUMBER} gath | +| two | 2 gi, 2 gath | {NUMBER} gi, {NUMBER} gath | +| few | 3 chi, 3 cath | {NUMBER} chi, {NUMBER} cath | +| many | 6 chi, 6 chath | {NUMBER} chi, {NUMBER} chath | +| other | 4 ci, 4 cath | {NUMBER} ci, {NUMBER} cath | + +Russian is similar, needing two different nouns: + +| Category | Resolved String | Minimal Pair Template | +|---|---|---| +| one | из 1 книги за 1 день | из {NUMBER} книги за {NUMBER} день | +| few | из 2 книг за 2 дня | из {NUMBER} книг за {NUMBER} дня | +| many | из 5 книг за 5 дней | из {NUMBER} книг за {NUMBER} дней | +| other | из 1,5 книги за 1,5 дня | из {NUMBER} книги за {NUMBER} дня | + +The minimal pairs are those that are required for correct grammar. 
So because 0 and 1 don't have to form a minimal pair (it is ok—even though often not optimal—to say "0 people"), 0 doesn't establish a separate category. However, implementations are encouraged to provide the ability to have special plural messages for 0 in particular, so that more natural language can be used: + +- None of your friends are online. +- *rather than* +- You have 0 friends online. + +Fractions + +In some languages, fractions require a separate category. For example, Russian 'other' in the example above. In some languages, they are all in a single category with some integers, and in some languages they are in multiple categories. In any case, they also need to be examined to make sure that there are sufficient minimal pairs. + +### Rules + +The next step is to determine the rules: which numbers go into which categories. + +Integers + +Test a variety of integers. Look for cases where the 'teens' (11\-19\) behave differently. Many languages care only about the last 2 digits, or only about the last digit. + +Fractions + +Fractions are often a bit tricky to determine: languages have very different behavior for them. In some languages the fraction is ignored (when selecting the category), in some languages the final digits of the fraction are important, and in some languages a number changes category just if there are visible trailing zeros. Make sure to try out a range of fractions to see how the numbers behave: values like 1 vs 1\.0 may behave differently, as may numbers like 1\.1 vs 1\.2 vs 1\.21, and so on. + +### Choosing Plural Category Names + +In some sense, the names for the categories are somewhat arbitrary. Yet for consistency across languages, the following guidelines should be used when selecting the plural category names. + +1. If no forms change, then stop (there are no plural rules — everything gets '**other**') +2. '**one**': Use the category '**one**' for the form used with 1\. +3. 
'**other**': Use the category '**other**' for the form used with the most integers. +4. '**two**': Use the category '**two**' for the form used with 2, *if it is limited to numbers whose integer values end with '2'.* + - If everything else has the same form, stop (everything else gets '**other**') +5. '**zero**': Use the category '**zero**' for the form used with 0, *if it is limited to numbers whose integer values end with '0'.* + - If everything else has the same form, stop (everything else gets '**other**') +6. '**few**': Use the category '**few**' for the form used with the least remaining number (such as '4') + - If everything else has the same form, stop (everything else gets '**other**') +7. '**many**': Use the category '**many**' for the form used with the least remaining number (such as '10') + - If everything else has the same form, stop (everything else gets '**other**') + - If there needs to be a category for items that only have fractional values, use '**many**' +8. If there are more categories needed for the language, describe what those categories need to cover in the bug report. + +See [*Language Plural Rules*](http://www.unicode.org/cldr/data/charts/supplemental/language_plural_rules.html) for examples of rules, such as for [Czech](https://www.unicode.org/cldr/charts/45/supplemental/language_plural_rules.html#cs), and for [comparisons of values](https://www.unicode.org/cldr/charts/45/supplemental/language_plural_rules.html#cs-comp). Note that in the integer comparison chart, most languages have 'x' (other—gray) for most integers. There are some exceptions (Russian and Arabic, for example), where the categories of 'many' and 'other' should have been swapped when they were defined, but it is too late now to change them. + +## Important Notes + +*These categories are only mnemonics \-\- the names don't necessarily imply the exact contents of the category.* For example, for both English and French the number 1 has the category one (singular). 
In English, every other number has a plural form, and is given the category other. French is similar, except that the number 0 also has the category one and not other or zero, because the form of units qualified by 0 is also singular. + +*This is worth emphasizing:* A common mistake is to think that "one" is only for the number 1\. Instead, "one" is a category for any number that behaves like 1\. So in some languages, for example, one → numbers that end in "1" (like 1, 21, 151\) but that don't end in "11" (like 11, 111, 10311\). + +Note that these categories may be different from the forms used for pronouns or other parts of speech. *In particular, they are solely concerned with changes that would need to be made if different numbers, expressed with decimal digits,* are used with a sentence. If there is a dual form in the language, but it isn't used with decimal numbers, it should not be reflected in the categories. That is, the key feature to look for is: + +If you were to substitute a different number for "1" in a sentence or phrase, would the rest of the text be required to change? For example, in a caption for a video: + + "Duration: 1 hour" → "Duration: 3\.2 hours" + +## Plural Rule Syntax + +See [LDML Language Plural Rules](http://unicode.org/reports/tr35/tr35-numbers.html#Language_Plural_Rules). + +## Plural Message Migration + +The plural categories are used not only within CLDR, but also for localizing messages for different products. When the plural rules change (such as in [CLDR 24](https://cldr.unicode.org/index/downloads/cldr-24-release-note)), the following issues should be considered. Fractional support in plurals is new in CLDR 24\. Because the fractions didn't work before, the changes in categories from 23 to 24 should not cause an issue for implementations. The other changes can be categorized as Splitting or Merging categories. + +There are some more complicated cases, but the following outlines the main issues to watch for, using examples. 
For illustration, assume a language uses "" for singular, "u" for dual, and "s" for other.​ ​ + +- **OLD Rules \& OLD Messages** marks the situation before the change, +- **NEW Rules \& OLD Messages** marks the situation after the change (but before any fixes to messages), and +- **NEW Rules \& NEW Messages** shows the changes to the messages + +### Merging + +The language really doesn't need 3 cases, because the dual is always identical to one of the other forms.  + +**OLD Rules \& OLD Messages** + +one: book + +two: books + +other: books + +1  ➞ book, 2 ➞ books, 3 ➞ ​ books​ + +**NEW Rules \& OLD or NEW Messages** + +one: book + +other: books + +1  ➞ book, 2 ➞ books, 3  ➞​ books​ + +This is fairly harmless; merging two of the categories shouldn't affect anyone because the messages for the merged category should not have material differences. The old messages for 'two' are ignored in processing. They could be deleted if desired. + +This was done in CLDR 24 for Russian, for example. + +### Splitting Other + +In this case, the 'other' needs to be fixed by moving some numbers to a 'two' category. The way plurals are defined in CLDR, when a message (eg for 'two') is missing, it always falls back to 'other'. So the translation is no worse than before. There are two subcases. + +Specific Other Message + +In this case, the *other* message is appropriate for the other case, and not for the new 'two' case. + +**OLD Rules \& OLD Messages** + +one: book + +other: books + +1  ➞ book, 2 ➞ books, 3  ➞​ books​ + +**NEW Rules \& OLD Messages** + +one: book + +two: **books** + +other: books + +1  ➞ book, 2 ➞ **books**, 3  ➞​ books​ + +The quality is no different than previously. 
The message can be improved by adding the correct message for 'two', so that the result is: + +**NEW Rules \& NEW Messages** + +one: book + +two: booku + +other: books + +1 ➞ book, 2 ➞ **booku**, 3 ➞ books + +***However, if the translated message is not missing, but has some special text like "UNUSED MESSAGE", then it will need to be fixed; otherwise the special text will show up to users!*** + +Generic Other Message + +In this case, the *other* message was written to be generic by trying to handle (with parentheses or some other textual device) both the plural and dual categories. + +**OLD Rules \& OLD Messages** + +one: book + +other: book(u/s) + +1 ➞ book, 2 ➞ **book(u/s)**, 3 ➞ **book(u/s)** + +**NEW Rules \& OLD Messages** + +one: book + +two: book(u/s) + +other: book(u/s) + +1 ➞ book, 2 ➞ **book(u/s)**, 3 ➞ **book(u/s)** + +The message can be improved by adding a message for 'two', and fixing the message for 'other' to not have the (u/s) workaround: + +**NEW Rules \& NEW Messages** + +one: book + +two: booku + +other: books + +1 ➞ book, 2 ➞ booku, 3 ➞ books + +### Splitting Non\-Other + +In this case, the 'one' category needs to be fixed by moving some numbers to a 'two' category. + +**OLD Rules \& OLD Messages** + +one: book/u + +other: books + +1 ➞ book/u, 2 ➞ book/u, 3 ➞ books + +**NEW Rules \& OLD Messages** + +one: book/u + +other: books + +1 ➞ **book/u**, 2 ➞ **books**, 3 ➞ books + +This is the one case where there is a regression in quality. In order to fix the problem, a message for 'two' needs to be added. If the message for 'one' was written to be generic, then it needs to be fixed as well. 
+ +**NEW Rules \& NEW Messages** + +one: book + +two: booku + +other: books + +1  ➞ **book**, 2 ➞ **booku**, 3  ➞​ books​ + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/index/cldr-spec/transliteration-guidelines.md b/docs/site/index/cldr-spec/transliteration-guidelines.md new file mode 100644 index 00000000000..9cf742bff67 --- /dev/null +++ b/docs/site/index/cldr-spec/transliteration-guidelines.md @@ -0,0 +1,354 @@ +--- +title: Unicode Transliteration Guidelines +--- + +# Unicode Transliteration Guidelines + +## Introduction + +*This document describes guidelines for the creation and use of CLDR transliterations. Please file any feedback on this document or those charts at [Locale Bugs](https://github.com/unicode-org/cldr/blob/main/docs/requesting_changes.md).* + +Transliteration is the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script. For example, "Phobos" and "Deimos" are transliterations of Greek mythological "Φόβος" and "Δεῖμος" into Latin letters, used to name the moons of Mars. + +Transliteration is *not* translation. Rather, transliteration is the conversion of letters from one script to another without translating the underlying words. The following shows a sample of transliteration systems: + +Sample Transliteration Systems +| Source | Translation | Transliteration | System | +|:---:|:---:|:---:|:---:| +| Αλφαβητικός | Alphabetic | Alphabētikós | Classic | +| | | Alfavi̱tikós | UNGEGN | +| しんばし | new bridge (district in Tokyo) | shimbashi | Hepburn | +| | | sinbasi | Kunrei | +| яйца Фаберже | Fabergé eggs | yaytsa Faberzhe | BGN/PCGN | +| | | jajca Faberže | Scholarly | +| | | âjca Faberže | ISO | + +***Display**. Some of the characters in this document may not be visible in your browser, and with some fonts the diacritics will not be correctly placed on the base letters. 
See [Display Problems](http://www.unicode.org/help/display_problems.html).* + +While an English speaker may not recognize that the Japanese word kyanpasu is equivalent to the English word campus, the word kyanpasu is still far easier to recognize and interpret than if the letters were left in the original script. There are several situations where this transliteration is especially useful, such as the following. See the sidebar for examples. + +- When a user views names that are entered in a world\-wide database, it is extremely helpful to view and refer to the names in the user's native script. +- When the user performs searching and indexing tasks, transliteration can retrieve information in a different script. +- When a service engineer is sent a program dump that is filled with characters from foreign scripts, it is much easier to diagnose the problem when the text is transliterated and the service engineer can recognize the characters. + +Sample Transliterations +| Source | Transliteration | +|---|---| +| 김, 국삼 | Gim, Gugsam | +| 김, 명희 | Gim, Myeonghyi | +| 정, 병호 | Jeong, Byeongho | +| ... | ... | +| たけだ, まさゆき | Takeda, Masayuki | +| ますだ, よしひこ | Masuda, Yoshihiko | +| やまもと, のぼる | Yamamoto, Noboru | +| ... | ... | +| Ρούτση, Άννα | Roútsē, Ánna | +| Καλούδης, Χρήστος | Kaloúdēs, Chrḗstos | +| Θεοδωράτου, Ελένη | Theodōrátou, Elénē | + +The term *transliteration* is sometimes given a narrow meaning, implying that the transformation is *reversible* (sometimes called *lossless*). In CLDR this is not the case; the term *transliteration* is interpreted broadly to mean both reversible and non\-reversible transforms of text. (Note that even if theoretically a transliteration system is supposed to be reversible, in source standards it is often not specified in sufficient detail in the edge cases to actually be reversible.) A non\-reversible transliteration is often called a *transcription*, or called a *lossy* or *ambiguous* transcription. 
+ +Note that reversibility is generally only in one direction, so a transliteration from a native script to Latin may be reversible, but not the other way around. For example, Hangul is reversible, in that any Hangul to Latin to Hangul should provide the same Hangul as the input. Thus we have the following: + + 갗 → gach → 갗 + +However, for completeness, many Latin characters have fallbacks. This means that more than one Latin character may map to the same Hangul. Thus from Latin we don't have reversibility, because two different Latin source strings round\-trip back to the same Latin string. + + gach → 갗 → gach + + gac → 갗 → gach + +Transliteration can also be used to convert unfamiliar letters within the same script, such as converting Icelandic THORN (þ) to th. These are not typically reversible. + + *There is an online demo using released CLDR data at [ICU Transform Demo](https://icu4c-demos.unicode.org/icu-bin/translit).* + +## Variants + +There are many systems for transliteration between languages: the same text can be transliterated in many different ways. For example, for the Greek example above, the transliteration is classical, while the [UNGEGN](https://arhiiv.eki.ee/wgrs/) alternate has different correspondences, such as φ → f instead of φ → ph. + +CLDR provides for generic mappings from script to script (such as Cyrillic\-Latin), and also language\-specific variants (Russian\-French, or Serbian\-German). There can also be semi\-generic mappings, such as Russian\-Latin or Cyrillic\-French. These can be referred to, respectively, as script transliterations, language\-specific transliterations, or script\-language transliterations. Transliterations from other scripts to Latin are also called *Romanizations*. + +Even within particular languages, there can be variant systems according to different authorities, or even varying across time (if the authority for a system changes its recommendation). 
The canonical identifier that CLDR uses for these has the form: + + *source\-target/variant* + +The source (and target) can be a language or script, either using the English name or a locale code. The variant should specify the authority for the system, and if necessary for disambiguation, the year. For example, the identifier for the Russian to Latin transliteration according to the UNGEGN system would be: + +- ru\-und\_Latn/UNGEGN, or +- Russian\-Latin/UNGEGN + +If there were multiple versions of these over time, the variant would be, say, UNGEGN2006\. + +The assumption is that implementations will allow the use of fallbacks, if the exact transliteration specified is unavailable. For example, the following would be the fallback chain for the identifier Russian\-English/UNGEGN. This is similar to the *Lookup Fallback Pattern* used in [BCP 47 Tags for Identifying Languages](https://www.rfc-editor.org/info/bcp47), except that it uses a "stepladder approach" to progressively handle the fallback among source, target, and variant, with priorities being the target, source, and variant, in that order. + +- Russian\-English/UNGEGN +- Russian\-English +- Cyrillic\-English/UNGEGN +- Cyrillic\-English +- Russian\-Latin/UNGEGN +- Russian\-Latin +- Cyrillic\-Latin/UNGEGN +- Cyrillic\-Latin + +## Guidelines + +There are a number of generally desirable guidelines for script transliterations. These guidelines are rarely satisfied simultaneously, so constructing a reasonable transliteration is always a process of balancing different requirements. These requirements are most important for people who are building transliterations, but are also useful as background information for users. + +The following lists the general guidelines for Unicode CLDR transliterations: + +- *standard*: follow established systems (standards, authorities, or de facto practice) where possible, deviating sometimes where necessary for reversibility. 
In CLDR, the systems are generally described in the comments in the XML data files found in the [transforms](https://github.com/unicode-org/cldr/tree/main/common/transforms) folder online. For example, the system for Arabic transliteration in CLDR is described in the comments in [Arabic\-Latin.xml](https://github.com/unicode-org/cldr/blob/main/common/transforms/Arabic-Latin.xml), which reference the [UNGEGN Arabic Tables](https://arhiiv.eki.ee/wgrs/rom1_ar.pdf). The same holds for Hebrew, which follows the [Hebrew UNGEGN Tables](https://arhiiv.eki.ee/wgrs/rom1_he.pdf). +- *complete*: every well\-formed sequence of characters in the source script should transliterate to a sequence of characters from the target script, and vice versa. +- *predictable*: the letters themselves (without any knowledge of the languages written in that script) should be sufficient for the transliteration, based on a relatively small number of rules. This allows the transliteration to be performed mechanically. +- *pronounceable*: the resulting characters have reasonable pronunciations in the target script. Transliteration is not as useful if the process simply maps the characters without any regard to their pronunciation. Simply mapping by alphabetic order ("αβγδεζηθ..." to "abcdefgh...") could yield strings that might be complete and unambiguous, but the pronunciation would be completely unexpected. +- *reversible*: it is possible to recover the text in the source script from the transliteration in the target script. That is, someone who knows the transliteration rules would be able to recover the precise spelling of the original source text. For example, it is possible to go from *Elláda* back to the original Ελλάδα, while if the transliteration were *Ellada* (with no accent), it would not be possible. + +Some of these principles may not be achievable simultaneously; in particular, adherence to a standard system *and* reversibility.
Often small changes in existing systems can be made to accommodate reversibility. However, where a particular system specifies fundamentally non\-reversible transliterations, those transliterations as represented in CLDR may not be reversible. + +### Ambiguity + +In transliteration, multiple characters may produce ambiguities (non\-reversible mappings) unless the rules are carefully designed. For example, the Greek character PSI (ψ) maps to ps, but ps could also result from the sequence PI, SIGMA (πσ) since PI (π) maps to p and SIGMA (σ) maps to s. + +The Japanese transliteration standards provide a good mechanism for handling these kinds of ambiguities. Using the Japanese transliteration standards, whenever an ambiguous sequence in the target script does not result from a single letter, the transform uses an apostrophe to disambiguate it. For example, it uses that procedure to distinguish between *man'ichi* and *manichi*. Using this procedure, the Greek sequence PI SIGMA (πσ) maps to p's. This method is recommended for all script transliteration methods, although sometimes the character may vary: for example, "\-" is used in Korean. + +**Note**: We've had a recent proposal to consistently use the hyphenation dot for this purpose, thus we'd have πσ → p‧s. + +A second problem is that some characters in a target script are not normally found outside of certain contexts. For example, the small Japanese "ya" character, as in "kya" (キャ), is not normally found in isolation. To handle such characters, the Unicode transliterations currently use different conventions. + +- Tilde: "ャ" in isolation is represented as "\~ya" +- Diacritics: Greek "ς" in isolation is represented as s̱ + +**Note**: The CLDR committee is considering converging on a common representation for this. The advantage of a common representation is that it allows for easy filtering.
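
The apostrophe convention described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter table, the `DIGRAPHS` set, and the function names `to_latin` and `to_greek` are hypothetical and are not CLDR's Greek\-Latin rules. The sketch ignores case and the final form of sigma, and it only checks whether two adjacent outputs together spell a multi\-letter mapping.

```python
# Toy table: just the three letters from the PSI / PI SIGMA example.
# These names are illustrative assumptions, not CLDR data.
GREEK_TO_LATIN = {"ψ": "ps", "π": "p", "σ": "s"}
LATIN_TO_GREEK = {latin: greek for greek, latin in GREEK_TO_LATIN.items()}

# Latin sequences that a single Greek letter produces on its own.
DIGRAPHS = {latin for latin in GREEK_TO_LATIN.values() if len(latin) > 1}


def to_latin(greek: str) -> str:
    """Transliterate, inserting an apostrophe wherever two separate letters
    would otherwise spell a sequence that a single letter also produces."""
    pieces = []
    for letter in greek:
        piece = GREEK_TO_LATIN[letter]
        if pieces and pieces[-1] + piece in DIGRAPHS:
            pieces.append("'")
        pieces.append(piece)
    return "".join(pieces)


def to_greek(latin: str) -> str:
    """Reverse mapping: longest match first, apostrophes act only as separators."""
    keys = sorted(LATIN_TO_GREEK, key=len, reverse=True)
    out = []
    i = 0
    while i < len(latin):
        if latin[i] == "'":
            i += 1
            continue
        for key in keys:
            if latin.startswith(key, i):
                out.append(LATIN_TO_GREEK[key])
                i += len(key)
                break
        else:
            raise ValueError(f"no mapping for {latin[i]!r}")
    return "".join(out)


print(to_latin("ψ"), to_latin("πσ"))
print(to_greek("ps"), to_greek("p's"))
```

With the separator in place, the two Latin strings are distinct and each maps back to the Greek it came from. Swapping the apostrophe for another separator, such as the hyphenation dot mentioned above, would change only the inserted character.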
+ +For the default script transforms, the goal is to have unambiguous mappings, with variants for any common use mappings that are ambiguous (non\-reversible). In some cases, however, case may not be preserved. For example, + +| Latin | Greek | Latin | +|:---:|:---:|:---:| +| ps PS | ψ Ψ | ps PS | +| psa Psa **PsA** | ψα Ψα **ΨΑ** | psa Psa **PSA** | +| psA PSA **PSa** | ψΑ ΨΑ **Ψα** | psA PSA **Psa** | + +The following shows Greek text that is mapped to fully reversible Latin: + +| **Greek\-Latin** | | +|---|---| +| τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. | + +If the user wants a version without certain accents, then CLDR's chaining rules can be used to remove the accents. For example, the following transliterates to Latin but removes the macron accents on the long vowels. + +| **Greek\-Latin; nfd; \[\\u0304] remove; nfc** | | +|---|---| +| τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. | + +The above chaining rules, separated by semi\-colons, perform the following commands in order: + +| Rule | Description | +|---|---| +| Greek-Latin | transliterate Greek to Latin | +| nfd | convert to Unicode NFD format (separating accents from base characters) | +| [\u0304] remove | remove accents, but filter the command to only apply to a single character: [U+0304](http://unicode.org/cldr/utility/character.jsp?a=0304) ( ̄ ) COMBINING MACRON | +| nfc | convert to Unicode NFC format (rejoining accents to base characters) | + +The following transliterates to Latin but removes *all* accents. Note that the only change is to expand the filter for the remove command. 
+ +| **Greek\-Latin; nfd; \[:nonspacing marks:] remove; nfc** | | +|---|---| +| τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. | + +### Pronunciation + +Standard transliteration methods often do not follow the pronunciation rules of any particular language in the target script. For example, the Japanese Hepburn system uses a "j" that has the English phonetic value (as opposed to French, German, or Spanish), but uses vowels that do not have the standard English sounds. A transliteration method might also require some special knowledge to have the correct pronunciation. For example, in the Japanese kunrei\-siki system, "ti" is pronounced as English "chee". + +This is similar to situations where there are different languages within the same script. For example, knowing that the word *Gewalt* comes from German allows a knowledgeable reader to pronounce the "w" as a "v". When encountering a foreign word like *jawa*, there is little assurance how it is to be pronounced even when it is not a transliteration (it is just from another Latin\-script language). The *j* could be pronounced (for an English speaker) as in *jump*, or *Junker*, or *jour*; and so on. Transcriptions are only roughly phonetic, and only so when the specific pronunciation rules are understood. + +The pronunciation of the characters in the original script may also be influenced by context, which may be particularly misleading in transliteration. For example, in the Bengali নিঃশব, transliterated as niḥśaba, the *visarga ḥ* is not pronounced itself (whereas elsewhere it may be) but lengthens the ś sound, and the final inherent *a* is pronounced (whereas it commonly is not), and the two inherent a's are pronounced as ɔ and ô, respectively. + +In some cases, transliteration may be heavily influenced by tradition.
For example, the modern Greek letter beta (β) sounds like a "v", but a transliteration may use a b (as in biology). In that case, the user would need to know that a "b" in the transliterated word corresponded to beta (β) and is to be pronounced as a v in modern Greek. + +Letters may also be transliterated differently according to their context to make the pronunciation more predictable. For example, since the Greek sequence GAMMA GAMMA (γγ) is pronounced as *ng*, the first GAMMA can be transcribed as an "n" in that context. Similarly, the transliteration can give other guidance to the pronunciation in the source language, for example, using "n" or "m" for the same Japanese character (ん) depending on context, even though there is no distinction in the source script. + +In general, predictability means that when transliterating Latin script to other scripts using reversible transliterations, English text will not produce phonetic results. This is because the pronunciation of English cannot be predicted easily from the letters in a word: e.g. *grove*, *move*, and *love* all end with "ove", but are pronounced very differently. + +### Cautions + +Reversibility may require modifications of traditional transcription methods. For example, there are two standard methods for transliterating Japanese katakana and hiragana into Latin letters. The *kunrei\-siki* method is unambiguous. The Hepburn method can be more easily pronounced by foreigners but is ambiguous. In the Hepburn method, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". A slightly amended version of Hepburn, that uses "dji" for DI and "dzu" for DU, is unambiguous. + +When a sequence of two letters map to one, case mappings (uppercase and lowercase) must be handled carefully to ensure reversibility. For cased scripts, the two letters may need to have different cases, depending on the next letter. 
For example, the Greek letter PHI (Φ) maps to PH in Latin, but Φο maps to Pho, and not to PHo. + +Some scripts have characters that take on different shapes depending on their context. Usually, this is done at the display level (such as with Arabic) and does not require special transliteration support. However, in a few cases this is represented with different character codes, such as in Greek and Hebrew. For example, a Greek SIGMA is written in a final form (ς) at the end of words, and a non\-final form (σ) in other locations. This also requires the transform to map different characters based on the context. + +Another thing to look out for when dealing with cased scripts is that some of the characters in the target script may not be able to represent case distinctions, such as some of the IPA characters in the Latin script. + +It is useful for the reverse mapping to be complete so that arbitrary strings in the target script can be reasonably mapped back to the source script. Complete reverse mapping makes it much easier to do mechanical quality checks and so on. For example, even though the letter "q" might not be necessary in a transliteration of Greek, it can be mapped to a KAPPA (κ). Such reverse mappings will not, in general, be unambiguous. + +## Available Transliterations + +Currently Unicode CLDR offers Romanizations for certain scripts, plus transliterations between the Indic scripts (excluding Urdu). Additional script transliterations will be added in the future. + +Except where otherwise noted, all of these systems are designed to be reversible. For bicameral scripts (those with uppercase and lowercase), however, case may not be completely preserved. + +The transliterations are also designed to be complete for any sequence of the Latin letters a\-z. A fallback is used for a letter that is not covered by the transliteration, and default letters may be inserted as required. For example, in the Hangul transliteration, rink → 린크 → linkeu. 
That is, "r" is mapped to the closest other letter, and a default vowel is inserted at the end (since "nk" cannot end a syllable). + +*Preliminary [charts](http://www.unicode.org/cldr/data/charts/transforms/index.html) are available for these transliterations. Be sure to read the known issues described there.* + +### Korean + +There are many Romanizations of Korean. The default transliteration in Unicode CLDR follows the [Korean Ministry of Culture \& Tourism Transliteration](http://www.korean.go.kr/06_new/rule/rule06.jsp) regulations (see also [English summary](https://web.archive.org/web/20070916025652/http://www.korea.net/korea/kor_loca.asp?code=A020303)). There is an optional clause 8 variant for reversibility: + +"제 8 항 학술 연구 논문 등 특수 분야에서 한글 복원을 전제로 표기할 경우에는 한글 표기를 대상으로 적는다. 이때 글자 대응은 제2장을 따르되 'ㄱ, ㄷ, ㅂ, ㄹ'은 'g, d, b, l'로만 적는다. 음가 없는 'ㅇ'은 붙임표(\-)로 표기하되 어두에서는 생략하는 것을 원칙으로 한다. 기타 분절의 필요가 있을 때에도 붙임표(\-)를 쓴다." + +*translation*: "Clause 8: When it is required to recover the original Hangul representation faithfully as in scholarly articles, 'ㄱ, ㄷ, ㅂ, ㄹ' must always be romanized as 'g, d, b, l' while the mapping for the rest of the letters remains the same as specified in clause 2\. The placeholder 'ㅇ' at the beginning of a syllable should be represented with '\-', but should be omitted at the beginning of a word. In addition, '\-' should be used in other cases where a syllable boundary needs to be explicitly marked (disambiguated)." + +There are a number of cases where this Romanization may be ambiguous, because sometimes multiple Latin letters map to a single entity (jamo) in Hangul.
This happens with vowels and consonants, the latter being slightly more complicated because there are both initial and final consonants: + +| Type | Multi-Character Consonants | +|---|---| +| Initial-Only | tt pp jj | +| Initial-or-Final | kk ch ss | +| Final-Only | gs nj nh lg lm lb ls lt lp lh bs ng | + +CLDR uses the following rules for disambiguation of the possible boundaries between letters, in order. The first rule comes from Clause 8\. + +1. Don't break so as to require an implicit vowel or null consonant (if possible) +2. Don't break within Initial\-Only or Initial\-Or\-Final sequences (if possible) +3. Favor longest match first. + +If there is a single consonant between vowels, then Rule \#1 will group it with the following vowel if there is one (this is the same as the first part of Clause 8\). If there is a sequence of four consonants between vowels, then there is only one possible break (with well\-formed text). So the only ambiguities lie with two or three consonants between vowels, where there are possible multi\-character consonants involved. Even there, in most cases the resolution is simple, because there isn't a possible multi\-character consonant in the case of two, or two possible multi\-character consonants in the case of 3\. For example, in the following cases, the left side is unambiguous: + + angda \= ang\-da → 앙다 + + apda \= ap\-da → 앞다 + +There are a relatively small number of possible ambiguities, listed below using "a" as a sample vowel. + +| No. of Cons. 
| Latin | CLDR Disambiguation | Hangul | Comments | | +|---|---|---|---|---|---| +| 2 | atta | = a-tta | 아따 | Rule 1, then 2 | | +| | appa | = a-ppa | 아빠 | | | +| | ajja | = a-jja | 아짜 | | | +| | akka | = a-kka | 아까 | Rule 1, then 2 | | +| | assa | = a-ssa | 아싸 | | | +| | acha | = a-cha | 아차 | | | +| | agsa | = ag-sa | 악사 | Rule 1 | | +| | anja | = an-ja | 안자 | | | +| | anha | = an-ha | 안하 | | | +| | alga | = al-ga | 알가 | | | +| | alma | = al-ma | 알마 | | | +| | alba | = al-ba | 알바 | | | +| | alsa | = al-sa | 알사 | | | +| | alta | = al-ta | 알타 | | | +| | alpa | = al-pa | 알파 | | | +| | alha | = al-ha | 알하 | | | +| | absa | = ab-sa | 압사 | | | +| | anga | = an-ga | 안가 | | | +| 3 | agssa | = ag-ssa | 악싸 | Rule 1, then 2 | | +| | anjja | = an-jja | 안짜 | | | +| | alssa | = al-ssa | 알싸 | | | +| | altta | = al-tta | 알따 | | | +| | alppa | = al-ppa | 알빠 | | | +| | abssa | = ab-ssa | 압싸 | | | +| | akkka | = akk-ka | 앆카 | Rule 1, then 2, then 3 | | +| | asssa | = ass-sa | 았사 | | | + +For vowel sequences, the situation is simpler. Only Rule \#3 applies, so aeo \= ae\-o → 애오. + +### Japanese + +The default transliteration for Japanese uses a slight variant of the Hepburn system. With the Hepburn system, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended slightly for reversibility by using "dji" for DI and "dzu" for DU. + +### Greek + +The default transliteration uses a standard transcription for Greek which is aimed at preserving etymology. The ISO 843 variant includes the following differences: + +| Greek | Default | ISO 843 | +|---|---|---| +| β | b | v | +| γ* | n | g | +| η | ē | ī | +| ̔ | h | (omitted) | +| ̀ | ̀ | (omitted) | +| ~ | ~ | (omitted) | + +\* before γ, κ, ξ, χ + +### Cyrillic + +Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative plans to add extended Cyrillic characters in the future, plus variants for GOST and other national standards.
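
As a rough illustration of why a letter\-for\-letter system of this kind round\-trips, the following Python sketch builds a partial table from the ISO row of the Sample Transliteration Systems table near the top of this document (яйца Фаберже ↔ âjca Faberže) and applies it in both directions. It is a sketch under stated assumptions, not the CLDR Cyrillic\-Latin transform: the table covers only the letters in that one phrase, the helper names `build_table` and `apply_table` are hypothetical, and the one\-letter\-to\-one\-letter behavior is assumed from that sample rather than taken from the ISO 9 text.

```python
import unicodedata

# Sample pair copied from the ISO row of the table earlier in this document.
SAMPLE_CYRILLIC = "яйца Фаберже"
SAMPLE_LATIN = "âjca Faberže"


def build_table(source: str, target: str) -> dict:
    """Pair up letters position by position (hypothetical helper).

    Raises ValueError if the pairing is not a consistent letter-for-letter map.
    """
    source = unicodedata.normalize("NFC", source).lower()
    target = unicodedata.normalize("NFC", target).lower()
    if len(source) != len(target):
        raise ValueError("sample strings differ in length")
    table = {}
    for src, tgt in zip(source, target):
        if table.setdefault(src, tgt) != tgt:
            raise ValueError(f"{src!r} maps to more than one letter")
    return table


def apply_table(text: str, table: dict) -> str:
    """Map each letter through the table, carrying case across.

    Letters missing from this partial table pass through unchanged.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        mapped = table.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


TO_LATIN = build_table(SAMPLE_CYRILLIC, SAMPLE_LATIN)
TO_CYRILLIC = build_table(SAMPLE_LATIN, SAMPLE_CYRILLIC)

latin = apply_table(SAMPLE_CYRILLIC, TO_LATIN)
print(latin)
print(apply_table(latin, TO_CYRILLIC))
```

Because both tables build without raising an error, each letter in the sample pairs with exactly one letter on the other side, which is what lets the reverse pass recover the original spelling.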
+ +### Indic + +Transliteration of Indic scripts follows the ISO 15919 *Transliteration of Devanagari and related Indic scripts into Latin characters*. Internally, all Indic scripts are transliterated by converting first to an internal form, called Inter\-Indic, then from Inter\-Indic to the target script. Inter\-Indic thus provides a pivot between the different scripts, and contains a superset of correspondences for all of them. + +ISO 15919 differs from ISCII 91 in application of diacritics for certain characters. These differences are shown in the following example (illustrated with Devanagari, although the same principles apply to the other Indic scripts): + +| Devanagari | ISCII 91 | ISO 15919 | +|---|---|---| +| ऋ | ṛ | r̥ | +| ऌ | ḻ | l̥ | +| ॠ | ṝ | r̥̄ | +| ॡ | ḻ̄ | l̥̄ | +| ढ़ | d̂ha | ṛha | +| ड़ | d̂a | ṛa | + +Transliteration rules from Indic to Latin are reversible with the exception of the ZWJ and ZWNJ used to request explicit rendering effects. For example: + +| Devanagari | Romanization | Note | +|---|---|---| +| क्ष | kṣa | normal | +| क्‍ष | kṣa | explicit halant requested | +| क्‌ष | kṣa | half-consonant requested | + +Transliteration between Indic scripts are roundtrip where there are corresponding letters. Otherwise, there may be fallbacks. + +There are two particular instances where transliterations may produce unexpected results: (1\) where the final vowel is suppressed in speech, and (2\) with the transliteration of 'c'. + +For example: + +| Devanagari | Romanization | Notes | +|---|---|---| +| सेन्गुप्त | Sēngupta | | +| सेनगुप्त | Sēnagupta | The final 'a' is not pronounced | +| मोनिक | Monika | | +| मोनिच | Monica | The 'c' is pronounced "ch" | + +### Others + +Unicode CLDR provides other transliterations based on the [U.S. Board on Geographic Names](https://www.usgs.gov/us-board-on-geographic-names) (BGN) transliterations. These are currently unidirectional — to Latin only. 
The goal is to make them bidirectional in future versions of CLDR. + +Other transliterations are generally based on the [UNGEGN: Working Group on Romanization Systems](https://arhiiv.eki.ee/wgrs/) transliterations. These systems are more widely implemented than most ISO\-standardized transliterations, and are published freely on the web and thus easily accessible to all. The UNGEGN also has good documentation. For example, the [UNGEGN Arabic Tables](https://arhiiv.eki.ee/wgrs/rom1_ar.pdf) not only present the UN system, but compare it with the BGN/PCGN 1956 system, the I.G.N. System 1973, ISO 233:1984, the Royal Jordanian Geographic Centre System, and the Survey of Egypt System. + +## Submitting Transliterations + +If you are interested in providing transliterations for one or more scripts, file an initial bug report at [*Locale Bugs*](http://www.unicode.org/cldr/bugs/locale-bugs). The initial bug should contain the scripts and/or languages involved, the system being followed (with a link to a full description of the proposed transliteration system), and a brief example. The proposed data can also be in that bug, or be added in a Reply to that bug. You can also file a bug in [*Locale Bugs*](http://www.unicode.org/cldr/bugs/locale-bugs) if you find a problem in an existing transliteration. + +For submission to CLDR, the data needs to be supplied in the correct XML format or in the ICU format, and should follow an accepted standard (like UNGEGN, BGN, or others). + +- The format for rules is specified in [Transform\_Rules](http://www.unicode.org/reports/tr35/#Transform_Rules). It is best if the results are tested using the [ICU Transform Demo](https://icu4c-demos.unicode.org/icu-bin/translit) first, since data that doesn't validate will not be accepted into CLDR. +- As mentioned above, even if a transliteration is only used in certain countries or contexts, CLDR can provide for it with different variant tags.
+- For comparison, you can see what is currently in CLDR in the [transforms](https://github.com/unicode-org/cldr/tree/main/common/transforms) folder online. For example, see Hebrew\-Latin.xml in that folder. +- Script transliterators should cover every character in the exemplar sets for the CLDR locales using that script. +- Romanizations (Script\-Latin) should cover all the ASCII letters (some of these can be fallback mappings, such as the 'x' below). +- If the rules are very simple, they can be supplied in a spreadsheet with columns such as the following: + +| Shavian | Relation | Latin | Comments | +|:---:|:---:|:---:|---| +| \𐑐 | ↔ | p | Map all uppercase to lowercase first | +| \𐑚 | ↔ | b | | +| \𐑑 | ↔ | t | | +| \𐑒\𐑕 | ← | x | fallback | +| ... | | | | + +## More Information + +For more information, see: + +- BGN: [U.S. Board on Geographic Names](https://www.usgs.gov/us-board-on-geographic-names) +- UNGEGN: [UNITED NATIONS GROUP OF EXPERTS ON GEOGRAPHICAL NAMES: Working Group on Romanization Systems](https://arhiiv.eki.ee/wgrs/) +- [Transliteration of Non\-Roman Alphabets and Scripts (Thomas T. Pedersen)](http://transliteration.eki.ee/) +- [Standards for Archival Description: Romanization](http://www.archivists.org/catalog/stds99/chapter8.html) +- [ISO\-15919 (Hindi)](http://transliteration.eki.ee/pdf/Hindi-Marathi-Nepali.pdf) +- [ISO\-15919 (Gujarati)](http://transliteration.eki.ee/pdf/Gujarati.pdf) +- [ISO\-15919 (Kannada)](http://transliteration.eki.ee/pdf/Kannada.pdf) +- [ISCII\-91](http://www.cdacindia.com/html/gist/down/iscii_d.asp) +- [UTS \#35: Locale Data Markup Language (LDML)](http://www.unicode.org/reports/tr35/) + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file