From 3cbc86caa7d2491f5ce5b6dbb54582f891367044 Mon Sep 17 00:00:00 2001
From: Chris Pyle
Date: Mon, 2 Sep 2024 11:13:04 -0400
Subject: [PATCH] CLDR-17566 txt diffs and minor change

---
 docs/site/TEMP-TEXT-FILES/coverage-levels.txt | 8 --
 .../picking-the-right-language-code.txt | 5 +-
 docs/site/TEMP-TEXT-FILES/plural-rules.txt | 63 +++++----
 .../transliteration-guidelines.txt | 122 +++++++++++++++---
 docs/site/index/cldr-spec/coverage-levels.md | 1 +
 5 files changed, 148 insertions(+), 51 deletions(-)

diff --git a/docs/site/TEMP-TEXT-FILES/coverage-levels.txt b/docs/site/TEMP-TEXT-FILES/coverage-levels.txt
index ccaf9331d75..a0710866169 100644
--- a/docs/site/TEMP-TEXT-FILES/coverage-levels.txt
+++ b/docs/site/TEMP-TEXT-FILES/coverage-levels.txt
@@ -16,14 +16,6 @@ to filter for basic and above, filter for basic|moderate|modern
 to filter for moderate and above, filter for moderate|modern
 Migration
 As of v43, the files in /seed/ have been moved to /common/. Older versions of CLDR separated some locale files into a 'seed' directory. Some implementations used for filtering, but the criteria for moving from seed to common were not rigorous. To maintain compatibility with the set of locales used from previous versions, an implementation may use the above process for Basic and above, but then also add locales that were previously included. For more information, see CLDR 43 Release Note.
-Usage
-Filtering
-Migration
-Core Data
-Basic Data
-Moderate Data
-Modern Data
-References
 Core Data
 The data needed for a new locale to be added. See Core Data for New Locales for details on Core Data and how to submit for new locales. It is expected that during the next Survey Tool cycle after a new locale is added, the data for the Basic Coverage Level will be supplied.
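The context lines of the coverage-levels.txt hunk above describe filtering by coverage level: "basic and above" is the alternation basic|moderate|modern, and "moderate and above" is moderate|modern. A minimal Python sketch of that rule follows. The names `LEVELS`, `levels_at_or_above`, and `filter_locales`, and the (locale, level) row layout, are assumptions made for illustration only; they are not part of CLDR or of this patch.

```python
import re

# Coverage levels in ascending order, as they appear in the filter patterns
# quoted in the hunk. The list name and ordering constant are illustrative.
LEVELS = ["basic", "moderate", "modern"]


def levels_at_or_above(minimum: str) -> str:
    """Return a '|'-joined alternation of `minimum` and every higher level."""
    start = LEVELS.index(minimum)  # raises ValueError for an unknown level
    return "|".join(LEVELS[start:])


def filter_locales(rows, minimum):
    """Keep (locale, level) rows whose level matches the alternation.

    `rows` is a hypothetical list of (locale_id, level) tuples; the real
    data source and its layout are not specified in the text.
    """
    pattern = re.compile(levels_at_or_above(minimum))
    return [(loc, lvl) for loc, lvl in rows if pattern.fullmatch(lvl)]
```

Calling `levels_at_or_above("moderate")` yields the same pattern the text gives for "moderate and above". Using `fullmatch` keeps a level such as a hypothetical "premodern" from matching by substring.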
diff --git a/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt b/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt index 051e074a159..2535f3bfe61 100644 --- a/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt +++ b/docs/site/TEMP-TEXT-FILES/picking-the-right-language-code.txt @@ -7,6 +7,7 @@ Choosing the Base Language Code Go to iso639-3 to find the language. Typically you'll look under Name starting with G for Ganda. There may be multiple entries for the item you want, so you'll need to look at all of them. For example, on the page for names starting with “P”, there are three records: “Panjabi”, “Mirpur Panjabi” and “Western Panjabi” (it is the last of these that corresponds to Lahnda). You can also try a search, but be careful. You'll find an entry like: + lug  lug  lg  Ganda  Individual  Living  more ... While you may think that you are done, you have to verify that the three-letter code is correct. Click on the "more..." in this case and you'll find id=lug. You can also use the URL http://www.sil.org/iso639-3/documentation.asp?id=XXX, where you replace XXX by the three-letter code. Click on "See corresponding entry in Ethnologue." and you get to code=lug @@ -26,7 +27,7 @@ Verify your choice by using the online language identifier demo. You need to fix the identifier and try again in any if the demo shows any of the following: the language identifer is illegal, or one of the subtags is invalid, or -there are any replacement values.** +there are any replacement values. ** Documenting Your Choice If you are requesting a new locale / language in CLDR, please include the links to the particular pages above so that we can process your request more quickly, as we have to double check before any addition. 
The links will be of the form: http://www.sil.org/iso639-3/documentation.asp?id=xxx @@ -44,7 +45,7 @@ Note that the CLDR likely subtag data is used to minimize scripts and regions, n In some cases, systems (or companies) may have different conventions than the Preferred-Values in BCP 47 -- such as those in the Replacement column in the the online language identifier demo. For example, for backwards compatibility, "iw" is used with Java instead of "he" (Hebrew). When picking the right subtags, be aware of these compatibility issues. If a target system uses a different canonical form for locale IDs than CLDR, the CLDR data needs to be processed by remapping its IDs to the target system's. For compatibility, it is strongly recommended that all implementations accept both the preferred values and their alternates: for example, both "iw" and "he". Although BCP 47 itself only allows "-" as a separator; for compatibility, Unicode language identifiers allows both "-" and "_". Implementations should also accept both. Macrolanguages -ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that because that is in the north, that you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 636 Deprecation Requests. +ISO (and hence BCP 47) has the notion of an individual language (like en = English) versus a Collection or Macrolanguage. For compatibility, Unicode language and locale identifiers always use the Macrolanguage to identify the predominant form. Thus the Macrolanguage subtag "zh" (Chinese) is used instead of "cmn" (Mandarin). 
Similarly, suppose that you are looking for Kurdish written in Latin letters, as in Turkey. It is a mistake to think that because that is in the north, that you should use the subtag 'kmr' for Northern Kurdish. You should instead use ku-Latn-TR. See also: ISO 636 Deprecation Requests. Unicode language identifiers do not allow the "extlang" form defined in BCP 47. For example, use "yue" instead of "zh-yue" for Cantonese. Ethnologue When searching, such as site:ethnologue.com ganda, be sure to completely disregard matches in Ethnologue 14 -- these are out of date, and do not have the right codes! diff --git a/docs/site/TEMP-TEXT-FILES/plural-rules.txt b/docs/site/TEMP-TEXT-FILES/plural-rules.txt index b7794a909f3..af9a884d722 100644 --- a/docs/site/TEMP-TEXT-FILES/plural-rules.txt +++ b/docs/site/TEMP-TEXT-FILES/plural-rules.txt @@ -27,6 +27,9 @@ Determining Plural Categories The CLDR plural categories do not necessarily match the traditional grammatical categories. Instead, the categories are determined by changes required in a phrase or sentence if a numeric placeholder changes value. Minimal pairs The categories are verified by looking a minimal pairs: where a change in numeric value (expressed in digits) forces a change in the other words. For example, the following is a minimal pair for English, establishing a difference in category between "1" and "2". +Category Resolved String Minimal Pair Template +one 1 day {NUMBER} day +other 2 day s {NUMBER} day s Warning for Vetters The Category (Code) values indicate a certain range of numbers that differ between languages. To see the meaning of each Code value for your language see Language Plural Rules chart. The minimal pairs in the Survey Tool are not direct translations of English. They may be translations of English, such as in German, but must be different if those words or terms do not show the right plural differences for your language. 
For example, if we look at Belarusian, they are quite different, corresponding to “{0} books in {0} days”, while Welsh has the equivalent of “{0} dog, {0} cat”. Be sure to read the following examples carefully and pay attention to error messages. @@ -43,25 +46,37 @@ you should then have the phrase for "one" Gender is irrelevant. Do not contort your phrasing so that it could cover some (unspecified) item of a different gender. (Eg, don't have “Prenez la {0}re à droite; Prenez le {0}er à droite.”) The exception to that is where two nouns of different genders to cover all plural categories, such as Russian “из {0} книг за {0} дня”. Non-inflecting Nouns—Verbs Some languages, like Bengali, do not change the form of the following noun when the numeric value changes. Even where nouns are invariant, other parts of a sentence might change. That is sufficient to establish a minimal pair. For example, even if all nouns in English were invariant (like 'fish' or 'sheep'), the verb changes are sufficient to establish a minimal pair: +Category Resolved String Minimal Pair Template +one 1 fish is swimming {NUMBER} fish is swimming +other 2 fish are swimming {NUMBER} fish are swimming Non-inflecting Nouns—Pronouns In other cases, even the verb doesn't change, but referents (such as pronouns) change. So a minimal pair in such a language might look something like: +Category Resolved String Minimal Pair Template +one You have 1 fish in your cart; do you want to buy it? You have {NUMBER} fish in your cart; do you want to buy it? +other You have 2 fish in your cart; do you want to buy them? You have {NUMBER} fish in your cart; do you want to buy them? Multiple Nouns In many cases, a single noun doesn't exhibit all the numeric forms. For example, in Welsh the following is a minimal pair that separates 1 and 2: -Category -one -two -Resolved String -1 ci -2 gi +Category Resolved String +one 1 ci +two 2 gi But the form of this word is the same for 1 and 4. 
We need a separate word to get a minimal pair that separates 1 and 4: -Category -one -two -Resolved String -1 gath -1 cath +Category Resolved String +one 1 gath +two 1 cath These combine into a single Minimal Pair Template that can be used to separate all 6 forms in Welsh. +Category Resolved String Minimal Pair Template +zero 0 cŵn, 0 cathod {NUMBER} cŵn, {NUMBER} cathod +one 1 ci, 1 gath {NUMBER} ci, {NUMBER} gath +two 2 gi, 2 gath {NUMBER} gi, {NUMBER} gath +few 3 chi, 3 cath {NUMBER} chi, {NUMBER} cath +many 6 chi, 6 chath {NUMBER} chi, {NUMBER} chath +other 4 ci, 4 cath {NUMBER} ci, {NUMBER} cath Russian is similar, needing two different nouns: +Category Resolved String Minimal Pair Template +one из 1 книги за 1 день из {NUMBER} книги за {NUMBER} день +few из 2 книг за 2 дня из {NUMBER} книг за {NUMBER} дня +many из 5 книг за 5 дней из {NUMBER} книг за {NUMBER} дней +other из 1,5 книги за 1,5 дня из {NUMBER} книги за {NUMBER} дня The minimal pairs are those that are required for correct grammar. So because 0 and 1 don't have to form a minimal pair (it is ok—even though often not optimal—to say "0 people") , 0 doesn't establish a separate category. However, implementations are encouraged to provide the ability to have special plural messages for 0 in particular, so that more natural language can be used: None of your friends are online. rather than @@ -95,7 +110,7 @@ These categories are only mnemonics -- the names don't necessarily imply the exa This is worth emphasizing: A common mistake is to think that "one" is only for only the number 1. Instead, "one" is a category for any number that behaves like 1. So in some languages, for example, one → numbers that end in "1" (like 1, 21, 151) but that don't end in 11 (like "11, 111, 10311). Note that these categories may be different from the forms used for pronouns or other parts of speech. 
In particular, they are solely concerned with changes that would need to be made if different numbers, expressed with decimal digits, are used with a sentence. If there is a dual form in the language, but it isn't used with decimal numbers, it should not be reflected in the categories. That is, the key feature to look for is: If you were to substitute a different number for "1" in a sentence or phrase, would the rest of the text be required to change? For example, in a caption for a video: -"Duration: 1 hour" → "Duration: 3.2 hours" + "Duration: 1 hour" → "Duration: 3.2 hours" Plural Rule Syntax See LDML Language Plural Rules. Plural Message Migration @@ -110,11 +125,11 @@ OLD Rules & OLD Messages one: book two: books other: books -1  ➞ book, 2 ➞ books, 3 ➞ ​ books​ +1 ➞ book, 2 ➞ books, 3 ➞ ​ books​ NEW Rules & OLD or NEW Messages one: book other: books -1  ➞ book, 2 ➞ books, 3  ➞​ books​ +1 ➞ book, 2 ➞ books, 3 ➞​ books​ This is fairly harmless; merging two of the categories shouldn't affect anyone because the messages for the merged category should not have material differences. The old messages for 'two' are ignored in processing. They could be deleted if desired. This was done in CLDR 24 for Russian, for example. Splitting Other @@ -124,49 +139,49 @@ In this case, the other message is appropriate for the other case, and not for t OLD Rules & OLD Messages one: book other: books -1  ➞ book, 2 ➞ books, 3  ➞​ books​ +1 ➞ book, 2 ➞ books, 3 ➞​ books​ NEW Rules & OLD Messages one: book two: books other: books -1  ➞ book, 2 ➞ books, 3  ➞​ books​ +1 ➞ book, 2 ➞ books, 3 ➞​ books​ The quality is no different than previously. 
The message can be improved by adding the correct message for 'two', so that the result is: NEW Rules & NEW Messages one: book two: booku other: books -1  ➞ book, 2 ➞ booku, 3  ➞​ books​ +1 ➞ book, 2 ➞ booku, 3 ➞​ books​ However, if the translated message is not missing, but has some special text like "UNUSED MESSAGE", then it will need to be fixed; otherwise the special text will show up to users! Generic Other Message In this case, the other message was written to be generic by trying to handle (with parentheses or some other textual device) both the plural and dual categories. OLD Rules & OLD Messages one: book other: book(u/s) -1  ➞ book, 2 ➞ book(u/s), 3  ➞​ book(u/s) +1 ➞ book, 2 ➞ book(u/s), 3 ➞​ book(u/s) NEW Rules & OLD Messages one: book two: book(u/s) other: book(u/s) -1  ➞ book, 2 ➞ book(u/s), 3  ➞​ book(u/s) +1 ➞ book, 2 ➞ book(u/s), 3 ➞​ book(u/s) The message can be improved by adding a message for 'two', and fixing the message for 'other' to not have the (u/s) workaround: NEW Rules & NEW Messages one: book two: booku other: books -1  ➞ book, 2 ➞ booku, 3  ➞​ books +1 ➞ book, 2 ➞ booku, 3 ➞​ books Splitting Non-Other In this case, the 'one' category needs to be fixed by moving some numbers to a 'two' category. OLD Rules & OLD Messages one: book/u other: books -1  ➞ book/u, 2 ➞ book/u, 3  ➞​ books​ +1 ➞ book/u, 2 ➞ book/u, 3 ➞​ books​ NEW Rules & OLD Messages one: book/u other: books -1  ➞ book/u, 2 ➞ books, 3  ➞​ books​ +1 ➞ book/u, 2 ➞ books, 3 ➞​ books​ This is the one case where there is a regression in quality. In order to fix the problem, the message for 'two' needs to be fixed. If the messages for 'one' was written to be generic, then it needs to be fixed as well. 
NEW Rules & NEW Messages one: book two: booku other: books -1  ➞ book, 2 ➞ booku, 3  ➞​ books​ \ No newline at end of file +1 ➞ book, 2 ➞ booku, 3 ➞​ books​ \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt b/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt index 6cd5122f65d..8d7b063413f 100644 --- a/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt +++ b/docs/site/TEMP-TEXT-FILES/transliteration-guidelines.txt @@ -3,24 +3,46 @@ Introduction This document describes guidelines for the creation and use of CLDR transliterations. Please file any feedback on this document or those charts at Locale Bugs. Transliteration is the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script. For example, "Phobos" and "Deimos" are transliterations of Greek mythological "Φόβος" and "Δεῖμος" into Latin letters, used to name the moons of Mars. Transliteration is not translation. Rather, transliteration is the conversion of letters from one script to another without translating the underlying words. The following shows a sample of transliteration systems: +Sample Transliteration Systems +Source Translation Transliteration System +Αλφαβητικός Alphabetic Alphabētikós Classic +Alfavi̱tikós UNGEGN +しんばし new bridge (district in Tokyo) shimbashi Hepburn +sinbasi Kunrei +яйца Фаберже Fabergé eggs yaytsa Faberzhe BGN/PCGN +jajca Faberže Scholarly +âjca Faberže ISO Display. Some of the characters in this document may not be visible in your browser, and with some fonts the diacritics will not be correctly placed on the base letters. See Display Problems. While an English speaker may not recognize that the Japanese word kyanpasu is equivalent to the English word campus, the word kyanpasu is still far easier to recognize and interpret than if the letters were left in the original script. 
There are several situations where this transliteration is especially useful, such as the following. See the sidebar for examples. When a user views names that are entered in a world-wide database, it is extremely helpful to view and refer to the names in the user's native script. When the user performs searching and indexing tasks, transliteration can retrieve information in a different script. When a service engineer is sent a program dump that is filled with characters from foreign scripts, it is much easier to diagnose the problem when the text is transliterated and the service engineer can recognize the characters. +Sample Transliterations +Source Transliteration +김, 국삼 Gim, Gugsam +김, 명희 Gim, Myeonghyi +정, 병호 Jeong, Byeongho +... ... +たけだ, まさゆき Takeda, Masayuki +ますだ, よしひこ Masuda, Yoshihiko +やまもと, のぼる Yamamoto, Noboru +... ... +Ρούτση, Άννα Roútsē, Ánna +Καλούδης, Χρήστος Kaloúdēs, Chrḗstos +Θεοδωράτου, Ελένη Theodōrátou, Elénē The term transliteration is sometimes given a narrow meaning, implying that the transformation is reversible (sometimes called lossless). In CLDR this is not the case; the term transliteration is interpreted broadly to mean both reversible and non-reversible transforms of text. (Note that even if theoretically a transliteration system is supposed to be reversible, in source standards it is often not specified in sufficient detail in the edge cases to actually be reversible.) A non-reversible transliteration is often called a transcription, or called a lossy or ambiguous transcription. Note that reversibility is generally only in one direction, so a transliteration from a native script to Latin may be reversible, but not the other way around. For example, Hangul is reversible, in that any Hangul to Latin to Hangul should provide the same Hangul as the input. Thus we have the following: -갗 → gach → 갗 + 갗 → gach → 갗 However, for completeness, many Latin characters have fallbacks. 
This means that more than one Latin character may map to the same Hangul. Thus from Latin we don't have reversibility, because two different Latin source strings round-trip back to the same Latin string. -gach → 갗 → gach -gac → 갗 → gach + gach → 갗 → gach + gac → 갗 → gach Transliteration can also be used to convert unfamiliar letters within the same script, such as converting Icelandic THORN (þ) to th. These are not typically reversible. -There is an online demo using released CLDR data at ICU Transform Demo. + There is an online demo using released CLDR data at ICU Transform Demo. Variants There are many systems for transliteration between languages: the same text can be transliterated in many different ways. For example, for the Greek example above, the transliteration is classical, while the UNGEGN alternate has different correspondences, such as φ → f instead of φ → ph. CLDR provides for generic mappings from script to script (such as Cyrillic-Latin), and also language-specific variants (Russian-French, or Serbian-German). There can also be semi-generic mappings, such as Russian-Latin or Cyrillic-French. These can be referred to, respectively, as script transliterations, language-specific transliterations, or script-language transliterations. Transliterations from other scripts to Latin are also called Romanizations. Even within particular languages, there can be variant systems according to different authorities, or even varying across time (if the authority for a system changes its recommendation). The canonical identifier that CLDR uses for these has the form: -source-target/variant + source-target/variant The source (and target) can be a language or script, either using the English name or a locale code. The variant should specify the authority for the system, and if necessary for disambiguation, the year. 
For example, the identifier for the Russian to Latin transliteration according to the UNGEGN system would be: ru-und_Latn/UNGEGN, or Russian-Latin/UNGEGN @@ -52,22 +74,28 @@ Tilde: "ャ" in isolation is represented as "~ya" Diacritics: Greek "ς" in isolation is represented as s̱ Note: The CLDR committee is considering converging on a common representation for this. The advantage of a common representation is that it allows for easy filtering. For the default script transforms, the goal is to have unambiguous mappings, with variants for any common use mappings that are ambiguous (non-reversible). In some cases, however, case may not be preserved. For example, +Latin Greek Latin +ps PS ψ Ψ ps PS +psa Psa PsA ψα Ψα ΨΑ psa Psa PSA +psA PSA PSa ψΑ ΨΑ Ψα psA PSA Psa The following shows Greek text that is mapped to fully reversible Latin: -Greek-Latin -τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. -tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. +Greek-Latin +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. If the user wants a version without certain accents, then CLDR's chaining rules can be used to remove the accents. For example, the following transliterates to Latin but removes the macron accents on the long vowels. -Greek-Latin; nfd; [\u0304] remove; nfc -τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. -tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. +Greek-Latin; nfd; [\u0304] remove; nfc +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. 
The above chaining rules, separated by semi-colons, perform the following commands in order: +Rule Description +Greek-Latin transliterate Greek to Latin +nfd convert to Unicode NFD format (separating accents from base characters) +[\u0304] remove remove accents, but filter the command to only apply to a single character: U+0304 ( ̄ ) COMBINING MACRON +nfc convert to Unicode NFC format (rejoining accents to base characters) The following transliterates to Latin but removes all accents. Note that the only change is to expand the filter for the remove command. -Greek-Latin; nfd; [:nonspacing marks:] remove; nfc -τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. -ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. +Greek-Latin; nfd; [:nonspacing marks:] remove; nfc +τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. Pronunciation Standard transliteration methods often do not follow the pronunciation rules of any particular language in the target script. For example, the Japanese Hepburn system uses a "j" that has the English phonetic value (as opposed to French, German, or Spanish), but uses vowels that do not have the standard English sounds. A transliteration method might also require some special knowledge to have the correct pronunciation. For example, in the Japanese kunrei-siki system, "ti" is pronounced as English "chee". -This is similar to situations where there are different languages within the same script. For example, knowing that the word Gewalt comes from German allows a knowledgeable reader to pronounce the "w" as a "v".  When encountering a foreign word like jawa, there is little assurance how it is to be pronounced even when it is not a transliteration (it is just from /span>another Latin-script language). 
The j could be pronounced (for an English speaker) as in jump, or Junker, or jour; and so on. Transcriptions are only roughly phonetic, and only so when the specific pronunciation rules are understood. +This is similar to situations where there are different languages within the same script. For example, knowing that the word Gewalt comes from German allows a knowledgeable reader to pronounce the "w" as a "v". When encountering a foreign word like jawa, there is little assurance how it is to be pronounced even when it is not a transliteration (it is just from /span>another Latin-script language). The j could be pronounced (for an English speaker) as in jump, or Junker, or jour; and so on. Transcriptions are only roughly phonetic, and only so when the specific pronunciation rules are understood. The pronunciation of the characters in the original script may also be influenced by context, which may be particularly misleading in transliteration. For, in the Bengali নিঃশব, transliterated as niḥśaba, the visarga ḥ is not pronounced itself (whereas elsewhere it may be) but lengthens the ś sound, and the final inherent a is pronounced (whereas it commonly is not), and the two inherent a's are pronounced as ɔ and ô, respectively. In some cases, transliteration may be heavily influenced by tradition. For example, the modern Greek letter beta (β) sounds like a "v", but a transliteration may use a b (as in biology). In that case, the user would need to know that a "b" in the transliterated word corresponded to beta (β) and is to be pronounced as a v in modern Greek. Letters may also be transliterated differently according to their context to make the pronunciation more predictable. For example, since the Greek sequence GAMMA GAMMA (γγ) is pronounced as ng, the first GAMMA can be transcribed as an "n" in that context. 
Similarly, the transliteration can give other guidance to the pronunciation in the source language, for example, using "n" or "m" for the same Japanese character (ん) depending on context, even though there is no distinction in the source script. @@ -88,29 +116,83 @@ There are many Romanizations of Korean. The default transliteration in Unicode C "제 8 항 학술 연구 논문 등 특수 분야에서 한글 복원을 전제로 표기할 경우에는 한글 표기를 대상으로 적는다. 이때 글자 대응은 제2장을 따르되 'ㄱ, ㄷ, ㅂ, ㄹ'은 'g, d, b, l'로만 적는다. 음가 없는 'ㅇ'은 붙임표(-)로 표기하되 어두에서는 생략하는 것을 원칙으로 한다. 기타 분절의 필요가 있을 때에도 붙임표(-)를 쓴다." translation: "Clause 8: When it is required to recover the original Hangul representation faithfully as in scholarly articles, ' ㄱ, ㄷ, ㅂ, ㄹ' must be always romanized as 'g, d, b, l' while the mapping for the rest of the letters remains the same as specified in clause 2. The placeholder 'ㅇ' at the beginning of a syllable should be represented with '-', but should be omitted at the beginning of a word. In addition, '-' should be used in other cases where a syllable boundary needs to be explicitly marked (be disambiguated." There are a number of cases where this Romanization may be ambiguous, because sometimes multiple Latin letters map to a single entity (jamo) in Hangul. This happens with vowels and consonants, the latter being slightly more complicated because there are both initial and final consonants: +Type Multi-Character Consonants +Initial-Only tt pp jj +Initial-or-Final kk ch ss +Final-Only gs nj nh lg lm lb ls lt lp lh bs ng CLDR uses the following rules for disambiguation of the possible boundaries between letters, in order. The first rule comes from Clause 8. Don't break so as to require an implicit vowel or null consonant (if possible) Don't break within Initial-Only or Initial-Or-Final sequences (if possible) Favor longest match first. If there is a single consonant between vowels, then Rule #1 will group it with the following vowel if there is one (this is the same as the first part of Clause 8). 
If there is a sequence of four consonants between vowels, then there is only one possible break (with well-formed text). So the only ambiguities lie with two or three consonants between vowels, where there are possible multi-character consonants involved. Even there, in most cases the resolution is simple, because there isn't a possible multi-character consonant in the case of two, or two possible multi-character consonants in the case of 3. For example, in the following cases, the left side is unambiguous: -angda = ang-da → 앙다 -apda = ap-da → 앞다 + angda = ang-da → 앙다 + apda = ap-da → 앞다 There are a relatively small number of possible ambiguities, listed below using "a" as a sample vowel. +No. of Cons. Latin CLDR Disambiguation Hangul Comments +2 atta = a-tta 아따 Rule 1, then 2 +appa = a-ppa 아빠 +ajja = a-jja 아짜 +akka = a-kka 아까 Rule 1, then 2 +assa = a-ssa 아싸 +acha = a-cha 아차 +agsa = ag-sa 악사 Rule 1 +anja = an-ja 안자 +anha = an-ha 안하 +alga = al-ga 알가 +alma = al-ma 알마 +alba = al-ba 알바 +alsa = al-sa 알사 +alta = al-ta 알타 +alpa = al-pa 알파 +alha = al-ha 알하 +absa = ab-sa 압사 +anga = an-ga 안가 +3 agssa = ag-ssa 악싸 Rule 1, then 2 +anjja = an-jja 안짜 +alssa = al-ssa 알싸 +altta = al-tta 알따 +alppa = al-ppa 알빠 +abssa = ab-ssa 압싸 +akkka = akk-ka 앆카 Rule 1, then 2, then 3 +asssa = ass-sa 았사 For vowel sequences, the situation is simpler. Only Rule #3 applies, so aeo = ae-o → 애오. Japanese The default transliteration for Japanese uses the a slight variant of the Hepburn system. With Hepburn system, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended slightly for reversibility by using "dji" for DI and "dzu" for DU. Greek The default transliteration uses a standard transcription for Greek which is aimed at preserving etymology. 
The ISO 843 variant includes following differences: +Greek Default ISO 843 +β b v +γ* n g +η ē ī +̔ h (omitted) +̀ ̀ (omitted) +~ ~ (omitted) * before γ, κ, ξ, χ Cyrillic Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative plans to add extended Cyrillic characters in the future, plus variants for GOST and other national standards. Indic Transliteration of Indic scripts follows the ISO 15919 Transliteration of Devanagari and related Indic scripts into Latin characters. Internally, all Indic scripts are transliterated by converting first to an internal form, called Inter-Indic, then from Inter-Indic to the target script. Inter-Indic thus provides a pivot between the different scripts, and contains a superset of correspondences for all of them. ISO 15919 differs from ISCII 91 in application of diacritics for certain characters. These differences are shown in the following example (illustrated with Devanagari, although the same principles apply to the other Indic scripts): +Devanagari ISCII 91 ISO 15919 +ऋ ṛ r̥ +ऌ ḻ l̥ +ॠ ṝ r̥̄ +ॡ ḻ̄ l̥̄ +ढ़ d̂ha ṛha +ड़ d̂a ṛa Transliteration rules from Indic to Latin are reversible with the exception of the ZWJ and ZWNJ used to request explicit rendering effects. For example: +Devanagari Romanization Note +क्ष kṣa normal +क्‍ष kṣa explicit halant requested +क्‌ष kṣa half-consonant requested Transliteration between Indic scripts are roundtrip where there are corresponding letters. Otherwise, there may be fallbacks. There are two particular instances where transliterations may produce unexpected results: (1) where the final vowel is suppressed in speech, and (2) with the transliteration of 'c'. For example: +Devanagari Romanization Notes +सेन्गुप्त Sēngupta +सेनगुप्त Sēnagupta The final 'a' is not pronounced +मोनिक Monika +मोनिच Monica The 'c' is pronounced "ch" Others Unicode CLDR provides other transliterations based on the U.S. Board on Geographic Names (BGN) transliterations. 
These are currently unidirectional — to Latin only. The goal is to make them bidirectional in future versions of CLDR. Other transliterations are generally based on the UNGEGN: Working Group on Romanization Systems transliterations. These systems are in wider actual implementation than most ISO standardized transliterations, and are published freely available on the web (http://www.eki.ee/wgrs/) and thus easily accessible to all. The UNGEGN also has good documentation. For example, the UNGEGN Arabic Tables not only presents the UN system, but compares it with the BGN/PCGN 1956 system, the I.G.N. System 1973, ISO 233:1984, the royal Jordanian Geographic Centre System, and the Survey of Egypt System. @@ -123,6 +205,12 @@ For comparison, you can see what is currently in CLDR in the transforms folder o Script transliterators should cover every character in the exemplar sets for the CLDR locales using that script. Romanizations (Script-Latin) should cover all the ASCII letters (some of these can be fallback mappings, such as the 'x' below). If the rules are very simple, they can be supplied in a spreadsheet, with two columns, such as +Shavian Relation Latin Comments +𐑐 ↔ p Map all uppercase to lowercase first +𐑚 ↔ b +𐑑 ↔ t +𐑒𐑕 ← x fallback +... More Information For more information, see: BGN: U.S. Board on Geographic Names diff --git a/docs/site/index/cldr-spec/coverage-levels.md b/docs/site/index/cldr-spec/coverage-levels.md index 6ec633c07f5..4b70ea1228f 100644 --- a/docs/site/index/cldr-spec/coverage-levels.md +++ b/docs/site/index/cldr-spec/coverage-levels.md @@ -78,6 +78,7 @@ Before submitting data above the Basic Level, the following must be in place: - The list is a space\-delimited list of the characters used by the language (in the given script). The list may include multiple\-character strings, where those are treated specially. For example, if "ch" is sorted after "h" one might see "a b c d .. g h ch i j ..." 
- More sophisticated users can do a better job, supplying a file of rules as in [cldr\-spec/collation\-guidelines](https://cldr.unicode.org/index/cldr-spec/collation-guidelines). 4. The result will be a file like: [common/collation/ar.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/ar.xml) or [common/collation/da.xml](https://home.unicode.org/basic-info/projects/#!/repos/cldr/trunk/common/collation/da.xml). + The data for the Moderate Level includes subsets of the Modern data, both in depth and breadth. ## Modern Data