diff --git a/docs/site/development/development-process/design-proposals/alternate-time-formats.md b/docs/site/development/development-process/design-proposals/alternate-time-formats.md new file mode 100644 index 00000000000..cfc288219d1 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/alternate-time-formats.md @@ -0,0 +1,31 @@ +--- +title: Alternate Time Formats +--- + +# Alternate Time Formats + +This design proposal addresses the problem that the desired time separator for a pattern may vary depending on the numbering system used. Rather than adding another type of number symbol just for the time separator, a more general solution is to expand the syntax for numbering system overrides in patterns so that a literal in the pattern can be replaced based on the numbering system. The following description of the numbering system override is from the current TR35: + +\ + +The numbers attribute is used to indicate that numeric quantities in the pattern are to be rendered using a numbering system other than the default numbering system defined for the given locale. The attribute can be in one of two forms. If the alternate numbering system is intended to apply to ALL numeric quantities in the pattern, then simply use the numbering system ID as found in Section C.13 [Numbering Systems](http://www.unicode.org/reports/tr35/#Numbering_Systems). To apply the alternate numbering system only to a single field, the syntax "\=\" can be used one or more times, separated by semicolons. 
+ +Examples: + +\dd/mm/yyyy\ + +\ + +\dd/mm/yyyy\ + +\ + +\dd/mm/yyyy\ + +\ + +**Proposed Extension** + +In addition to the syntax, allow symbol or string replacements of the form "\=\=\" + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/bcp-47-changes-draft.md b/docs/site/development/development-process/design-proposals/bcp-47-changes-draft.md new file mode 100644 index 00000000000..fa7afbd59f7 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/bcp-47-changes-draft.md @@ -0,0 +1,214 @@ +--- +title: BCP 47 Changes (DRAFT) +--- + +# BCP 47 Changes (DRAFT) + +With the release of the new version of [BCP 47](http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-18.html), there are various changes we need to make in Unicode CLDR and LDML. Already in CLDR 1.7 we have made modifications anticipating the release: see [BCP 47 Tag Conversion](http://unicode.org/reports/tr35/#BCP_47_Tag_Conversion) in the spec (and the original [design proposal](https://cldr.unicode.org/development/development-process/design-proposals/bcp47-syntax-mapping)), but more changes need to be made. + +## Formula + +We need to take another look at which languages we show in the survey tool for translation, because the new version is [very large](http://tools.ietf.org/html/draft-ietf-ltru-4645bis), around 7,000 languages. Showing all of those languages in the Survey tool would be good neither for the usability of the tool for most translators nor for tool performance, so we need some formula for picking which languages to show by default. + +For feedback on this document, please file a Reply under [http://www.unicode.org/cldr/bugs/locale-bugs?findid=1977](http://www.unicode.org/cldr/bugs/locale-bugs?findid=1977). For discussion of issues, please send email to [cldr-users@unicode.org](mailto:cldr-users@unicode.org). + +**Draft Formula** + +A. 
We show a language code X for translation if any of the following conditions are true: + +1. X is a qualified language\*\*, **and** has at least **100K** speakers, **and** at least one of the following is true: + 1. X has official status\* in any country + 2. X exceeds a threshold population† of literate users worldwide: **10M** + 3. X exceeds a threshold population† in some country Z: **1M** ***and*** **1/3** of Z's population†. +2. X has ***non-draft*** minimal language coverage‡ in CLDR itself. +3. *Only for translation in locale Y:* X is a qualified language\*\* that already has a translation in CLDR data in Y. +4. X is an exception explicitly approved by the committee, either in root, or in some language Y. + 1. Current examples: Latin, Sanskrit + +If a translator finds that X is needed for translation in language Y, then a bug can be filed. If we find the volume is high, we may need to add some way for a translator to add a language in the survey tool. + +B. We show a script code S for translation if and only if it is one of the scripts used by one of the languages shown. + +**Notes** + +\*\* qualified language: excluding collection (except for macrolanguages with predominant forms), ancient, historic, and extinct languages: see [Scope](http://www.sil.org/iso639-3/scope.asp) and [Types](http://www.sil.org/iso639-3/types.asp). Some could be added as exceptions as needed. + +‡ minimal coverage - see [Coverage Levels](http://www.unicode.org/reports/tr35/#Coverage_Levels) - at a non-draft level. + +\* *official status* means official, de facto official, official regional, or de facto official regional. + +† *population* means literate 14-day active users (well, theoretically - we can only get an approximation of that), based on [CLDR figures](http://www.unicode.org/cldr/data/charts/supplemental/language_territory_information.html). 
Our concern is with written language, not spoken, and so we don't focus on variants that don't have much written usage; moreover, the population figures we want to focus on are the literate population. For this reason and others, we don't rely on the Ethnologue figures. See also [Picking the Right Language Code](https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code). + + +**Please review the generated lists in** [**Filtered Scripts and Languages**](https://cldr.unicode.org/development/development-process/design-proposals/bcp-47-changes-draft)**.** A spreadsheet with some details is on. The first column is the language, the second is the world population of the language (literate), and the remaining columns are the reasons (data for 1.1, 1.2, 1.3 from the above). + +Known issues: + +- need to add Norwegian [no], resolve tl vs fil, ... +- Tokelau has no speakers (bug filed) + +### Survey Tool Changes + +The above would only require a small tool change: the main change is that the approved list from #1 and #2 would be in CODE\_FALLBACK, and nothing else would. Languages would get #3 cases by virtue of there being a translated tag already in the language, even if Root doesn't have anything (because it is not in CODE\_FALLBACK). Thus if the locale doesn't already contain a translation for, say, Ancient Greek, it would not show up in the survey tool. + +We would add the lists to the supplemental metadata for access by the tools. The Coverage tool and spec also need to be aligned with the above. + +## Other Changes + +We also need to make other changes to the spec with regard to the new version of BCP 47. In particular, for those [macrolanguages](http://www.sil.org/iso639-3/macrolanguages.asp) with an encompassed language that is a "predominant form", CLDR treats the predominant form and the macrolanguage as aliases. See [Locale Field Definitions](http://unicode.org/reports/tr35/#Locale_Field_Definitions) in the spec. 
We need to flesh that table out to include all of the [macrolanguages](http://www.sil.org/iso639-3/macrolanguages.asp) that are in the [Included Languages](https://cldr.unicode.org/development/development-process/design-proposals/bcp-47-changes-draft), such as Azerbaijani. Here is a start at that (but still just draft). The first part of this list is from a draft of BCP 47bis. The last three are codes that are in the current (2006) version of BCP 47. + +Macrolanguage Table + +| Macrolanguage | Encompassed Language | Comments | +|---|---|---| +| Arabic ' ar ' | Standard Arabic ' arb ' | | +| Konkani ' kok ' | Konkani ( individual language) ' knn ' | | +| Malay ' ms ' | Standard Malay ' zsm ' | | +| Swahili ' sw ' | Swahili ( individual language) ' swh ' | | +| Uzbek ' uz ' | Northern Uzbek ' uzn ' | | +| Chinese ' zh ' | Mandarin Chinese ' cmn ' | | +| Norwegian ' no ' | Norwegian Bokmal ' nob ' = nb | To regularize, we may want to switch in CLDR from nb as the 'norm' to no. | +| Serbo-croatian ' sh ' | | *This is a complex situation, and we'll probably leave as is.* | +| Kurdish 'ku' | Northern Kurdish 'kmr'? | We probably want to change the default content locale to ku-Latn | +| Akan ' ak ' | Twi ' tw ' and Fanti ' fat' | This appears to be a mistake in ISO 639. See: ISO 636 Deprecation Requests . | +| Persian fas (fa) | Western Farsi pes and prs Dari | This appears to be a mistake in ISO 639. See: ISO 636 Deprecation Requests . | + +These would also go into the \ element of the supplemental metadata. We may add more such aliases over time, as we find new predominant forms. Note that we still need to offer both aliases for translation in many cases. For example, we want to show both 'no' and 'nb'. + +## Lenient Parsing + +There are many circumstances where we get less than perfect language identifiers coming in. I think we should have some guidelines as to how to do this. Here are the possibilities: + +1. case / hyphen insensitivity +2. 
map valid non-canonical forms to their canonical equivalents (zh-cmn, cmn => zh) +3. map certain common invalid forms to their canonical equivalents: + 1. UK => GB + 2. eng => en // and other illegal 3-letter 639 codes that correspond to 2-letter codes + 3. 840 => US // other numeric region codes that correspond to 2-letter codes +4. map away extlangs. Formally, en-yue is valid (this slipped by us in doing BCP 47), and canonicalizes in BCP 47 to yue, the same as zh-yue does. In any event, the simplest thing for us to do is if there is a syntactic extlang: + 1. Verify that the base language and extlang are both valid language subtags + 2. Remove the base language + 3. This avoids having to store which languages are also extlangs, and what their prefixes are. + +People have to do #1. We should recommend #2, and make it easy to support #3. + +See demo at [http://unicode.org/cldr/utility/languageid.jsp](http://unicode.org/cldr/utility/languageid.jsp) + +Also, we should consider modifying the canonical form of language identifiers so as to have lowercase variants (with the exception of some set of grandfathered codes). The following are generated by GenerateMaximalLocales, plus 7 hand modifications for the last line. + +## Filtered Scripts and Languages + +The following script/language names would be included (/excluded) from default translation. For the method used to get this list, see [Formula](https://cldr.unicode.org/development/development-process/design-proposals/bcp-47-changes-draft). + +The languages are listed in the format Abkhazian [ab]-OR, where [xx] is the code, and "OR" is the abbreviated "best" status in some territory: **U**nknown, **O**fficial **R**egional, **O**fficial **M**inority, **D**e facto official, **O**fficial. 
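
The entry format described above (name, bracketed code, optional status suffix) is regular enough to be read mechanically. The following is a minimal sketch in Python, not part of the proposal: the names `ListEntry`, `parse_entry`, and `parse_line` are hypothetical, and the status abbreviation is returned as-is without being interpreted.

```python
import re
from typing import List, NamedTuple, Optional


class ListEntry(NamedTuple):
    """One item from the lists below (hypothetical helper type)."""
    name: str
    code: str
    status: Optional[str]  # None for script entries, which carry no status


# Name, then a bracketed code, then an optional "-STATUS" suffix.
_ENTRY = re.compile(
    r"^\s*(?P<name>.+?)\s*\[(?P<code>[A-Za-z]+)\](?:-(?P<status>[A-Z]+))?\s*$"
)


def parse_entry(text: str) -> ListEntry:
    """Parse a single entry such as 'Abkhazian [ab]-OR' or 'Arabic [Arab]'."""
    match = _ENTRY.match(text)
    if match is None:
        # Entries without a bracketed code cannot be parsed by this sketch.
        raise ValueError(f"not in 'Name [code]-STATUS' form: {text!r}")
    return ListEntry(match["name"], match["code"], match["status"])


def parse_line(line: str) -> List[ListEntry]:
    """Parse one bullet line containing comma-separated entries."""
    body = line.lstrip("-* ").strip()
    return [parse_entry(part) for part in body.split(",") if part.strip()]


if __name__ == "__main__":
    for entry in parse_line("- Japanese [ja]-O, Javanese [jv]-U"):
        print(entry)
```

The sketch deliberately raises `ValueError` on entries that lack a bracketed code, so that incomplete list items are noticed rather than silently dropped.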
+ +### Included Script Names: 41+ + +- Arabic [Arab], Armenian [Armn] +- Bengali [Beng] +- Cyrillic [Cyrl] +- Devanagari [Deva] +- Ethiopic [Ethi] +- Georgian [Geor], Greek [Grek], Gujarati [Gujr], Gurmukhi [Guru] +- Hebrew [Hebr], Han [Hani] + - Simplified Han [Hans], Traditional Han [Hant], Bopomofo [Bopo] +- Japanese [Jpan] + - Hiragana [Hira], Katakana [Kana] +- Kannada [Knda], Khmer [Khmr], Korean [Kore] + - Hangul [Hang] +- Lao [Laoo], Latin [Latn] +- Malayalam [Mlym], Mongolian [Mong], Myanmar [Mymr] +- Oriya [Orya] +- Sinhala [Sinh] +- Tamil [Taml], Telugu [Telu], Thaana [Thaa], Thai [Thai], Tibetan [Tibt] +- Special codes: + - Common [Zyyy], Symbols [Zsym], Unwritten [Zxxx], Unknown or Invalid Script [Zzzz] + - Braille [Brai] +- *Possibly also in the future:* + - Tifinagh [Tfng], Yi [Yiii], + - Unified Canadian Aboriginal Syllabics [Cans], + +### Excluded Script Names: + +- Avestan [Avst] +- Balinese [Bali], Batak [Batk], Blissymbols [Blis], Book Pahlavi [Phlv], Brahmi [Brah], Buginese [Bugi], Buhid [Buhd] +- Carian [Cari], Chakma [Cakm], Cham, Cherokee [Cher], Cirth [Cirt], Coptic [Copt], Cypriot [Cprt] +- Deseret [Dsrt] +- Eastern Syriac [Syrn], Egyptian demotic [Egyd], Egyptian hieratic [Egyh], Egyptian hieroglyphs [Egyp], Estrangelo Syriac [Syre] +- Fraktur Latin [Latf] +- Gaelic Latin [Latg], Georgian Khutsuri [Geok], Glagolitic [Glag], Gothic [Goth] +- Hanunoo [Hano] +- Imperial Aramaic [Armi], Indus [Inds], Inherited [Qaai], Inscriptional Pahlavi [Phli], Inscriptional Parthian [Prti] +- Javanese [Java] +- Kaithi [Kthi], Katakana or Hiragana [Hrkt], Kayah Li [Kali], Kharoshthi [Khar] +- Lanna [Lana], Lepcha [Lepc], Limbu [Limb], Linear A [Lina], Linear B [Linb], Lisu, Lycian [Lyci], Lydian [Lydi] +- Mandaean [Mand], Manichaean [Mani], Mathematical Notation [Zmth], Mayan hieroglyphs [Maya], Meitei Mayek [Mtei], Meroitic [Mero], Moon +- N’Ko [Nkoo], New Tai Lue [Talu], Nkgb +- Ogham [Ogam], Ol Chiki [Olck], Old Church Slavonic Cyrillic [Cyrs], Old 
Hungarian [Hung], Old Italic [Ital], Old Permic [Perm], Old Persian [Xpeo], Orkhon [Orkh], Osmanya [Osma] +- Pahawh Hmong [Hmng], Phags-pa [Phag], Phoenician [Phnx], Pollard Phonetic [Plrd], Psalter Pahlavi [Phlp] +- Rejang [Rjng], Rongorongo [Roro], Runic [Runr] +- Samaritan [Samr], Sarati [Sara], Saurashtra [Saur], Shavian [Shaw], SignWriting [Sgnw], Sumero-Akkadian Cuneiform [Xsux], Sundanese [Sund], Syloti Nagri [Sylo], Syriac [Syrc] +- Tagalog [Tglg], Tagbanwa [Tagb], Tai Le [Tale], Tai Viet [Tavt], Tengwar [Teng], Tifinagh [Tfng] +- Ugaritic [Ugar] +- Vai [Vaii], Visible Speech [Visp] +- Western Syriac [Syrj] +- Yi [Yiii] +- Inherited [Zinh] + +### Included Languages: 202 + +- Abkhazian [ab]-OR, Adyghe [ady]-OR, Afrikaans [af]-O, Akan [ak]-U, Albanian [sq]-O, Amharic [am]-O, Arabic [ar]-O, Armenian [hy]-O, Assamese [as]-O, Asturian [ast]-OR, Avaric [av]-OR, Awadhi [awa]-U, Aymara [ay]-O, Azerbaijani [az]-O +- Bambara [bm]-U, Bashkir [ba]-OR, Basque [eu]-OR, Belarusian [be]-O, Bengali [bn]-O, Bhojpuri [bho]-U, Bislama [bi]-O, Bosnian [bs]-O, Bulgarian [bg]-O, Burmese [my]-O +- Catalan [ca]-O, Cebuano [ceb]-OR, Chamorro [ch]-O, Chechen [ce]-OR, Chinese [zh]-O, Chuukese [chk]-O, Croatian [hr]-O, Czech [cs]-O +- Danish [da]-O, Divehi [dv]-O, Dutch [nl]-O, Dzongkha [dz]-O +- Efik [efi]-O, English [en]-O, Erzya [myv]-OR, Estonian [et]-O, Ewe [ee]-OR +- Faroese [fo]-O, Fijian [fj]-O, Filipino [fil]-O, Finnish [fi]-O, French [fr]-O +- Ga [gaa]-OR, Gagauz [gag]-OR, Galician [gl]-OR, Georgian [ka]-O, German [de]-O, Gilbertese [gil]-O, Greek [el]-O, Guarani [gn]-O, Gujarati [gu]-O +- Haitian [ht]-O, Hausa [ha]-O, Hawaiian [haw]-OR, Hebrew [he]-O, Hiligaynon [hil]-OR, Hindi [hi]-O, Hiri Motu [ho]-O, Hungarian [hu]-O +- Icelandic [is]-O, Igbo [ig]-O, Iloko [ilo]-OR, Indonesian [id]-O, Ingush [inh]-OR, Inuktitut [iu]-OR, Irish [ga]-O, Italian [it]-O +- Japanese [ja]-O, Javanese [jv]-U +- Kabardian [kbd]-OR, Kalaallisut [kl]-O, Kannada [kn]-O, Karachay-Balkar [krc]-OR, 
Kashmiri [ks]-O, Kazakh [kk]-O, Khasi [kha]-OR, Khmer [km]-O, Kinyarwanda [rw]-O, Kirghiz [ky]-O, Komi-Permyak [koi]-OR, Komi-Zyrian [kpv]-OR, Konkani [kok]-OR, Korean [ko]-O, Kosraean [kos]-O, Krio [kri]-U, Kumyk [kum]-OR, Kurdish [ku]-OR +- Lahnda [lah]-U, Lak [lbe]-OR, Lao [lo]-O, Latin [la]-DO, Latvian [lv]-O, Lezghian [lez]-OR, Lingala [ln]-O, Lithuanian [lt]-O, Luxembourgish [lb]-O +- Macedonian [mk]-O, Madurese [mad]-U, Maguindanao [mdh]-OR, Maithili [mai]-OR, Malagasy [mg]-O, Malay [ms]-O, Malayalam [ml]-O, Maltese [mt]-O, Maore Comorian [swb]-O, Maori [mi]-O, Marathi [mr]-O, Marshallese [mh]-O, Moksha [mdf]-OR, Mongolian [mn]-O, Mossi [mos]-U +- Nauru [na]-O, Nepali [ne]-O, Niuean [niu]-O, Northeastern Thai [tts]-U, Northern Sami [se]-OR, Northern Sotho [nso]-O, Norwegian Bokmål [nb]-O, Norwegian Nynorsk [nn]-O, Nyanja [ny]-O +- Oriya [or]-O, Oromo [om]-U, Ossetic [os]-OR +- Palauan [pau]-O, Pangasinan [pag]-OR, Papiamento [pap]-DO, Pashto [ps]-O, Persian [fa]-O, Plains Cree [crk]-OR, Pohnpeian [pon]-O, Polish [pl]-O, Portuguese [pt]-O, Punjabi [pa]-O +- Quechua [qu]-O +- Rhaeto-Romance [rm]-O, Romanian [ro]-O, Rundi [rn]-O, Russian [ru]-O +- Samoan [sm]-O, Sango [sg]-O, Sanskrit [sa]-O, Santali [sat]-OR, Scottish Gaelic [gd]-OR, Serbian [sr]-O, Shona [sn]-U, Sindhi [sd]-O, Sinhala [si]-O, Slovak [sk]-O, Slovenian [sl]-O, Somali [so]-O, Southern Sotho [st]-O, Spanish [es]-O, Sundanese [su]-O, Swahili [sw]-O, Swati [ss]-O, Swedish [sv]-O, Swiss German [gsw]-U +- Tagalog [tl]-OR, Tahitian [ty]-O, Tajik [tg]-O, Tamil [ta]-O, Tatar [tt]-OR, Tausug [tsg]-OR, Telugu [te]-O, Tetum [tet]-O, Thai [th]-O, Tibetan [bo]-OR, Tigrinya [ti]-DO, Tok Pisin [tpi]-O, Tokelau [tkl]-O, Tonga [to]-O, Tsonga [ts]-O, Tswana [tn]-O, Turkish [tr]-O, Turkmen [tk]-O, Tuvalu [tvl]-O, Tuvinian [tyv]-OR, Twi [tw]-OR +- Udmurt [udm]-OR, Uighur [ug]-OR, Ukrainian [uk]-O, Ulithian [uli]-O, Unknown or Invalid Language [und]-S, Urdu [ur]-O, Uzbek [uz]-O +- Venda [ve]-O, Vietnamese [vi]-O +- 
Waray [war]-OR, Welsh [cy]-OR, Western Frisian [fy]-OR, Wolof [wo]-O, Woods Cree [cwd]-OR +- Xhosa [xh]-O +- Yakut [sah]-OR, Yapese [yap]-O, Yoruba [yo]-O +- Zhuang [za]-OR, Zulu [zu]-O + +### Excluded Languages: 299 + +- Achinese [ace]-U, Acoli [ach]-U, Adangme [ada]-U, Afar [aa]-U, Afrihili [afh]-U, Afro-Asiatic Language [afa]-U, Ainu [ain]-U, Akkadian [akk]-U, Aleut [ale]-U, Algonquian Language [alg]-U, Altaic Language [tut]-U, Ancient Egyptian [egy]-U, Ancient Greek [grc]-U, Angika [anp]-U, Apache Language [apa]-U, Aragonese [an]-U, Aramaic [arc]-U, Arapaho [arp]-U, Araucanian [arn]-U, Arawak [arw]-U, Aromanian [rup]-U, Artificial Language [art]-U, Athapascan Language [ath]-U, Atsam [cch]-U, Australian Language [aus]-U, Austronesian Language [map]-U, Avestan [ae]-U +- Balinese [ban]-U, Baltic Language [bat]-U, Baluchi [bal]-U, Bamileke Language [bai]-U, Banda [bad]-U, Bantu [bnt]-U, Basa [bas]-U, Batak [btk]-U, Beja [bej]-U, Bemba [bem]-U, Berber [ber]-U, Bihari [bh]-U, Bikol [bik]-U, Bini [bin]-U, Blin [byn]-U, Blissymbols [zbl]-U, Braj [bra]-U, Breton [br]-U, Buginese [bug]-U, Buriat [bua]-U +- Caddo [cad]-U, Carib [car]-U, Caucasian Language [cau]-U, Celtic Language [cel]-U, Central American Indian Language [cai]-U, Chagatai [chg]-U, Chamic Language [cmc]-U, Cherokee [chr]-U, Cheyenne [chy]-U, Chibcha [chb]-U, Chinook Jargon [chn]-U, Chipewyan [chp]-U, Choctaw [cho]-U, Church Slavic [cu]-U, Chuvash [cv]-U, Classical Newari [nwc]-U, Classical Syriac [syc]-U, Coptic [cop]-U, Cornish [kw]-U, Corsican [co]-U, Cree [cr]-U, Creek [mus]-U, Creole or Pidgin [crp]-U, Crimean Turkish [crh]-U, Cushitic Language [cus]-U +- Dakota [dak]-U, Dargwa [dar]-U, Dayak [day]-U, Delaware [del]-U, Dinka [din]-U, Dogri [doi]-U, Dogrib [dgr]-U, Dravidian Language [dra]-U, Duala [dua]-U, Dyula [dyu]-U +- Eastern Frisian [frs]-U, Ekajuk [eka]-U, Elamite [elx]-U, English-based Creole or Pidgin [cpe]-U, Esperanto [eo]-U, Ewondo [ewo]-U +- Fang [fan]-U, Fanti [fat]-U, Finno-Ugrian 
Language [fiu]-U, Fon [fon]-U, French-based Creole or Pidgin [cpf]-U, Friulian [fur]-U, Fulah [ff]-U +- Ganda [lg]-U, Gayo [gay]-U, Gbaya [gba]-U, Geez [gez]-U, Germanic Language [gem]-U, Gondi [gon]-U, Gorontalo [gor]-U, Gothic [got]-U, Grebo [grb]-U, Gwichʼin [gwi]-U +- Haida [hai]-U, Herero [hz]-U, Himachali [him]-U, Hittite [hit]-U, Hmong [hmn]-U, Hupa [hup]-U +- Iban [iba]-U, Ido [io]-U, Ijo [ijo]-U, Inari Sami [smn]-U, Indic Language [inc]-U, Indo-European Language [ine]-U, Interlingua [ia]-U, Interlingue [ie]-U, Inupiaq [ik]-U, Iranian Language [ira]-U, Iroquoian Language [iro]-U +- Jju [kaj]-U, Judeo-Arabic [jrb]-U, Judeo-Persian [jpr]-U +- Kabyle [kab]-U, Kachin [kac]-U, Kalmyk [xal]-U, Kamba [kam]-U, Kanuri [kr]-U, Kara-Kalpak [kaa]-U, Karelian [krl]-U, Karen [kar]-U, Kashubian [csb]-U, Kawi [kaw]-U, Khoisan Language [khi]-U, Khotanese [kho]-U, Kikuyu [ki]-U, Kimbundu [kmb]-U, Klingon [tlh]-U, Komi [kv]-U, Kongo [kg]-U, Koro [kfo]-U, Kpelle [kpe]-U, Kru [kro]-U, Kuanyama [kj]-U, Kurukh [kru]-U, Kutenai [kut]-U +- Ladino [lad]-U, Lamba [lam]-U, Limburgish [li]-U, Lojban [jbo]-U, Low German [nds]-U, Lower Sorbian [dsb]-U, Lozi [loz]-U, Luba-Katanga [lu]-U, Luba-Lulua [lua]-U, Luiseno [lui]-U, Lule Sami [smj]-U, Lunda [lun]-U, Luo [luo]-U, Lushai [lus]-U +- Magahi [mag]-U, Makasar [mak]-U, Manchu [mnc]-U, Mandar [mdr]-U, Mandingo [man]-U, Manipuri [mni]-U, Manobo Language [mno]-U, Manx [gv]-U, Mari [chm]-U, Marwari [mwr]-U, Masai [mas]-U, Mayan Language [myn]-U, Mende [men]-U, Micmac [mic]-U, Middle Dutch [dum]-U, Middle English [enm]-U, Middle French [frm]-U, Middle High German [gmh]-U, Middle Irish [mga]-U, Minangkabau [min]-U, Mirandese [mwl]-U, Miscellaneous Language [mis]-S, Mohawk [moh]-U, Mon-Khmer Language [mkh]-U, Mongo [lol]-U, Multiple Languages [mul]-S, Munda Language [mun]-U +- N’Ko [nqo]-U, Nahuatl [nah]-U, Navajo [nv]-U, Ndonga [ng]-U, Neapolitan [nap]-U, Newari [new]-U, Nias [nia]-U, Niger-Kordofanian Language [nic]-U, Nilo-Saharan Language 
[ssa]-U, No linguistic content [zxx]-S, Nogai [nog]-U, North American Indian Language [nai]-U, North Ndebele [nd]-U, Northern Frisian [frr]-U, Norwegian [no]-U, Nubian Language [nub]-U, Nyamwezi [nym]-U, Nyankole [nyn]-U, Nyasa Tonga [tog]-U, Nyoro [nyo]-U, Nzima [nzi]-U +- Occitan [oc]-U, Ojibwa [oj]-U, Old English [ang]-U, Old French [fro]-U, Old High German [goh]-U, Old Irish [sga]-U, Old Norse [non]-U, Old Persian [peo]-U, Old Provençal [pro]-U, Osage [osa]-U, Otomian Language [oto]-U, Ottoman Turkish [ota]-U +- Pahlavi [pal]-U, Pali [pi]-U, Pampanga [pam]-U, Papuan Language [paa]-U, Philippine Language [phi]-U, Phoenician [phn]-U, Portuguese-based Creole or Pidgin [cpp]-U, Prakrit Language [pra]-U +- Rajasthani [raj]-U, Rapanui [rap]-U, Rarotongan [rar]-U, Romance Language [roa]-U, Romany [rom]-U, Root [root]-U +- Salishan Language [sal]-U, Samaritan Aramaic [sam]-U, Sami Language [smi]-U, Sandawe [sad]-U, Sardinian [sc]-U, Sasak [sas]-U, Scots [sco]-U, Selkup [sel]-U, Semitic Language [sem]-U, Serer [srr]-U, Shan [shn]-U, Sichuan Yi [ii]-U, Sicilian [scn]-U, Sidamo [sid]-U, Sign Language [sgn]-U, Siksika [bla]-U, Sino-Tibetan Language [sit]-U, Siouan Language [sio]-U, Skolt Sami [sms]-U, Slave [den]-U, Slavic Language [sla]-U, Sogdien [sog]-U, Songhai [son]-U, Soninke [snk]-U, Sorbian Language [wen]-U, South American Indian Language [sai]-U, South Ndebele [nr]-U, Southern Altai [alt]-U, Southern Sami [sma]-U, Sranan Tongo [srn]-U, Sukuma [suk]-U, Sumerian [sux]-U, Susu [sus]-U, Syriac [syr]-U +- Tai Language [tai]-U, Tamashek [tmh]-U, Tereno [ter]-U, Tigre [tig]-U, Timne [tem]-U, Tiv [tiv]-U, Tlingit [tli]-U, Tsimshian [tsi]-U, Tumbuka [tum]-U, Tupi Language [tup]-U, Tyap [kcg]-U +- Ugaritic [uga]-U, Umbundu [umb]-U, Upper Sorbian [hsb]-U +- Vai [vai]-U, Volapük [vo]-U, Votic [vot]-U +- Wakashan Language [wak]-U, Walamo [wal]-U, Walloon [wa]-U, Washo [was]-U +- Yao [yao]-U, Yiddish [yi]-U, Yupik Language [ypk]-U +- Zande [znd]-U, Zapotec [zap]-U, Zaza 
[zza]-U, Zenaga [zen]-U, Zuni [zun]-U + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/bcp47-syntax-mapping.md b/docs/site/development/development-process/design-proposals/bcp47-syntax-mapping.md new file mode 100644 index 00000000000..12c0646299b --- /dev/null +++ b/docs/site/development/development-process/design-proposals/bcp47-syntax-mapping.md @@ -0,0 +1,218 @@ +--- +title: BCP47 Syntax Mapping +--- + +# BCP47 Syntax Mapping + +In the current LDML specification, a Unicode Locale Identifier is composed of a Unicode Language Identifier plus optional locale extensions. A Unicode Language Identifier is fully compatible with a BCP47 language tag, but the syntax of locale extensions ("@" key "=" type (";" key "=" type)\* ) is not. LDML tries to define a systematic mapping, but the current definition may truncate a key or type value to 8 characters (and/or remove "-" from some type values) because of the syntax restrictions on BCP47 language subtags. The current definition uses BCP47 private use features, but we want to make locale extensions formal (by writing a new RFC to reserve a singleton letter for this usage), so that we avoid conflicts with other private use values and allow software developers to write a parser for Unicode locale extensions with confidence. + +BCP 47 is undergoing a revision which should be done soon: + +- [Current version (4646)](http://tools.ietf.org/html/rfc4646) +- [Latest draft of next version](http://inter-locale.com/ID/draft-ietf-ltru-4646bis-21.html) + +Once we define a formal representation of Unicode locale extensions in BCP47 syntax, we no longer have any good reason to use the @key1=type1;key2=type2... syntax for Unicode Locale Identifiers other than backward compatibility. 
This document proposes that we retire the proprietary syntax and migrate fully to the new syntax, which BCP47 language tags fully support. + +There are several options for representing keyword key/type pairs in BCP47 syntax. The examples in the following proposal assume that the letter "u" is reserved for the Unicode locale extensions; however, we could use any of the [possible extensions](http://inter-locale.com/ID/draft-ietf-ltru-4646bis-21.html#syntax): [0-9 a-w y z]. + +The table below shows the locale extension keys/values currently defined by the LDML specification. + +## Key/Type Definitions + +| key | type | Description | +|---|---|---| +| collation | standard | The default ordering for each language. For root it is [ [UCA](http://www.unicode.org/reports/tr35/#UCA) ] order; for each other locale it is the same as UCA ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they only have effect in those locales. | +| | phonebook | For a phonebook-style ordering (used in German). | +| | pinyin | Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese) | +| | traditional | For a traditional-style sort (as in Spanish) | +| | stroke | Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese) | +| | direct | Hindi variant | +| | posix | A "C"-based locale. (no longer in CLDR data) | +| | big5han | Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese) | +| | gb2312han | Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese) | +| | unihan | Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese) | +| calendar

(*For information on the calendar algorithms associated with the data used with the above types, see [ [Calendars](https://www.unicode.org/reports/tr35/#Calendars) ].*) | gregorian | (default) | +| | islamic

*alias: arabic* | Astronomical Arabic | +| | chinese | Traditional Chinese calendar | +| | islamic-civil

*alias: civil-arabic* | Civil (algorithmic) Arabic calendar | +| | hebrew | Traditional Hebrew Calendar | +| | japanese | Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor) | +| | buddhist

*alias: thai-buddhist* | Thai Buddhist Calendar (same as Gregorian except for the year) | +| | persian | Persian Calendar | +| | coptic | Coptic Calendar | +| | ethiopic | Ethiopic Calendar | +| collation parameters:

  colStrength
  colAlternate
  colBackwards
  colNormalization
  colCaseLevel
  colCaseFirst,
  colHiraganaQuaternary
  colNumeric
  variableTop | *Associated values as defined in: 5.14.1 <[collation](http://www.unicode.org/reports/tr35/#Collation_Element)>* | Semantics as defined in: 5.14.1 <[collation](http://www.unicode.org/reports/tr35/#Collation_Element)> | +| currency

(also known as a Unicode currency code ) | *ISO 4217 code,*

*plus others in common use* | Currency value identified by ISO 4217 code, plus others in common use. Also uses XXX as *Unknown or Invalid Currency* .

See [Appendix K: Valid Attribute Values](http://www.unicode.org/reports/tr35/#Valid_Attribute_Values) and also [ [Data Formats](http://www.unicode.org/reports/tr35/#DataFormats) ] | +| time zone

(also known as a Unicode time zone code ) | *TZID, plus the value:*

*Etc/Unknown* | Identification for time zone according to the TZ Database, plus the value Etc/Unknown .

Unicode LDML supports all of the time zone IDs by mapping all equivalent time zone IDs to a canonical ID for translation. This canonical time zone ID is not the same as the zone.tab time zone ID found in [ [Olson](http://www.unicode.org/reports/tr35/#Olson) ].

For more information, see [Section 5.9.2 Time Zone Names](http://www.unicode.org/reports/tr35/#Timezone_Names) , [Appendix F: Date Format Patterns](http://www.unicode.org/reports/tr35/#Date_Format_Patterns) , and [Appendix J: Time Zone Display Names](http://www.unicode.org/reports/tr35/#Time_Zone_Fallback) . | + +### Collation Parameters + +| Attribute | Options | Basic Example | XML Example | Description | +|---|---|---|---|---| +| strength | primary (1)
secondary (2)
tertiary (3)
quaternary (4)
identical (5) | [strength 1] | strength = " primary " | Sets the default strength for comparison, as described in the UCA. | +| alternate | *non-ignorable shifted* | [alternate non-ignorable] | alternate = " non-ignorable " | Sets alternate handling for variable weights, as described in UCA | +| backwards | on
*off* | [backwards 2] | backwards = " on " | Sets the comparison for the second level to be backwards ("French"), as described in UCA | +| normalization | on
off | [normalization on] | normalization = " off " | If *on* , then the normal UCA algorithm is used. If *off* , then all strings that are in [ [FCD](http://www.unicode.org/reports/tr35/#FCD) ] will sort correctly, but others will not necessarily sort correctly. So it should only be set *off* if the strings to be compared are in FCD. | +| caseLevel | on
off | [caseLevel on] | caseLevel = " off " | If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on . | +| caseFirst | upper
lower
off | [caseFirst off] | caseFirst = " off " | If set to *upper* , causes upper case to sort before lower case. If set to *lower* , lower case will sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels. | +| hiraganaQuaternary | on
off | [hiraganaQ on] | hiraganaQuaternary = " on " | Controls special treatment of Hiragana code points on quaternary level. If turned *on* , Hiragana code points will get lower values than all the other non-variable code points. The strength must be greater than or equal to quaternary if you want this attribute to take effect. | +| numeric | on
off | [numeric on] | numeric = " on " | If set to *on* , any sequence of Decimal Digits (General_Category = Nd in the [ [UCD](http://www.unicode.org/reports/tr35/#UCD) ]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123". | +| variableTop | uXXuYYYY | & \u00XX\uYYYY < [variable top] | variableTop = "uXXuYYYY" | The parameter value is an encoded Unicode string, with code points in hex, leading zeros removed, and 'u' inserted between successive elements.

Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable, and thus affected by the alternate handling. | +| match-boundaries: | none whole-character whole-word | n/a | match-boundaries = "whole-word" | The meaning is according to the descriptions in UTS #10 [Searching](https://unicode.org/reports/tr10/#Searching) . | +| match-style | minimal medial maximal | n/a | match-style = "medial" | The meaning is according to the descriptions in UTS #10 [Searching](https://unicode.org/reports/tr10/#Searching) . | + +## 1. Proposed BCP47 subtag syntax + +This document proposes the syntax described by the BNF below. + +locale-extensions = locale-singleton "-" extension \*("-" extension) + +extension = key "-" type + +locale-singleton = "u" + +key = 2alphanum + +type = 3\*8alphanum + +alphanum = (ALPHA / DIGIT) + +Example: + +en-US-u-ca-islamicc-co-phonebk + +This corresponds to the former syntax: + +en-US@calendar=islamic-civil;collation=phonebook + +| Current | Proposed | +|---|---| +| collation | co | +| calendar | ca | +| currency | cu | +| numbers | nu | +| time zone | tz | +| colStrength | ks | +| colAlternate | ka | +| colBackwards | kb | +| colNormalization | kk | +| colCaseLevel | kc | +| colCaseFirst | kf | +| colHiraganaQuaternary | kh | +| colNumeric | kn | +| variableTop | kv | + +### 2. Keys + +Key names, and only key names, are always exactly 2 characters long, and types (values) are always longer than 2 characters. The table above lists the new canonical key names defined by this proposal. + +The motivation is to reduce string size and to make sure that keys and values don't overlap syntactically. + +### 3.
Types + +3.1 Collation + +3.1.1 Collation (co) types + +| Current | Proposed | +|---|---| +| big5han | big5han | +| digits-after | **digitaft** | +| direct | direct | +| gb2312han | **gb2312** | +| phonebook | **phonebk** | +| pinyin | pinyin | +| reformed | reformed | +| standard | standard | +| stroke | stroke | +| traditional | **trad** | + +3.1.2 Collation Strength (ks) types + +| Current | Proposed | +|---|---| +| primary | level1 | +| secondary | level2 | +| tertiary | level3 | +| quaternary | level4 | +| identical | identic | + +3.1.3 Collation Alternate (ka) types + +| Current | Proposed | +|---|---| +| non-ignorable | **noignore** | +| secondary | level2 | +| shifted | shifted | + +3.1.4 Collation Backwards (kb) / Normalization (kk) / Case Level (kc) / Hiragana Quaternary (kh) / Numeric (kn) types + +| Current | Proposed | +|---|---| +| yes | **true** | +| no | **false** | + +3.1.5 Collation Case First (kf) types + +| Current | Proposed | +|---|---| +| upper | upper | +| lower | lower | +| no | **false** | + + +3.1.6 Collation Variable Top (kv) type + +The variable top parameter is specified by a code point in the format *uXXuYYYY*. No changes are required. + +3.2 Calendar (ca) + +| Current | Proposed | +|---|---| +| buddhist | buddhist | +| coptic | coptic | +| ethiopic | ethiopic | +| ethiopic-amete-alem | **ethiopaa** | +| chinese | chinese | +| gregorian | **gregory** | +| hebrew | hebrew | +| indian | indian | +| islamic | islamic | +| islamic-civil | **islamicc** | +| japanese | japanese | +| persian | persian | +| roc | roc | + +3.3 Currency (cu) types + +ISO4217 code (3-letter alpha) is used for currency. No changes required. + +3.4 Number System (nu) types + +The current CVS snapshot implementation uses CSS3 names. This proposal changes all of the type names to script codes, with one exception (arabext).
+ +| Current (CVS snapshot) | Proposed | +|---|---| +| arabic-indic | arab | +| bengali | beng | +| cambodian | khmr | +| decimal | latn | +| devanagari | deva | +| gujarati | gujr | +| gurmukhi | guru | +| hebrew | hebr | +| kannada | knda | +| lao | laoo | +| malayalam | mlym | +| mongolian | mong | +| myanmar | mymr | +| oriya | orya | +| persian | arabext | +| telugu | telu | +| thai | thai | + +3.5 Time Zone (tz) types + +CLDR uses Olson tzids. These IDs are usually made from \+"/"+\ and are relatively long. To satisfy the syntax requirement discussed in this document, we need to map these IDs uniquely to relatively short IDs. The UN LOCODE is designed to assign unique location codes, and it satisfies most of the requirements. A LOCODE consists of a 2-letter ISO country code and a 3-letter location code. This proposal suggests that a 5-letter LOCODE be used as the short time zone ID if the exemplar city has an exact match in the LOCODE repertoire. Some Olson tzids do not have a direct mapping in LOCODE. In this case, we assign our own codes to them, using 3-4/6-8 letter codes to distinguish them from LOCODEs. For the Olson tzids Etc/GMT\*, this proposal suggests "UTC" + ["E" | "W"] + nn (hour offset); for example, UTCE01 means 1 hour east of UTC (Etc/GMT-1). The proposed short ID list is attached in this [document](https://drive.google.com/file/d/1O9B_hO6uD4m7dtb-hU9euBkgP8nQxJ9X/view?usp=sharing).
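
The Etc/GMT\* rule in section 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `etc_gmt_short_id` and both regular expressions are hypothetical, "W" is assumed to be used for `Etc/GMT+n` (those zones lie west of UTC under the POSIX sign convention), and zero offsets are rejected because the proposal text does not say how they should be written. The result is also checked against the `type = 3*8alphanum` production from section 1.

```python
import re

# type = 3*8alphanum, from the BNF in section 1.
TYPE_RE = re.compile(r"[A-Za-z0-9]{3,8}")

# Olson IDs of the form Etc/GMT+n or Etc/GMT-n (hypothetical helper pattern).
ETC_GMT_RE = re.compile(r"Etc/GMT([+-])(\d{1,2})")


def etc_gmt_short_id(tzid: str) -> str:
    """Build the proposed "UTC" + E/W + nn short ID for an Etc/GMT offset zone."""
    match = ETC_GMT_RE.fullmatch(tzid)
    if match is None:
        raise ValueError(f"not an Etc/GMT offset zone: {tzid!r}")
    sign, hours = match.group(1), int(match.group(2))
    if hours == 0:
        # Assumption: the proposal gives no short form for a zero offset.
        raise ValueError("zero offset is not covered by this sketch")
    # Etc/GMT-1 is one hour EAST of UTC (the Olson sign is inverted).
    direction = "E" if sign == "-" else "W"
    short_id = f"UTC{direction}{hours:02d}"
    if TYPE_RE.fullmatch(short_id) is None:
        raise ValueError(f"{short_id!r} does not fit the type syntax")
    return short_id


if __name__ == "__main__":
    print(etc_gmt_short_id("Etc/GMT-1"))
```

The short IDs produced this way are always 6 characters long, which keeps them inside the 3 to 8 character range for types and distinct from 5-letter LOCODEs.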
+ +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/bcp47-validation-and-canonicalization.md b/docs/site/development/development-process/design-proposals/bcp47-validation-and-canonicalization.md new file mode 100644 index 00000000000..62239abcbb1 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/bcp47-validation-and-canonicalization.md @@ -0,0 +1,148 @@ +--- +title: BCP47 Validation and Canonicalization +--- + +# BCP47 Validation and Canonicalization + +The proposal is to add two tables of precomputed values to CLDR for each release, plus a table of language code mappings. + +## Validation Data + +**Language subtag.** These can be 2-letter, 3-letter, or registered (>3 letter). We were looking at validation of 7,000 base language entries and Markus had an idea. Algorithmically map the two-letter codes onto values from 0..675, and the three-letter codes onto 676..18251 (just over 14 bits). + +The set of all valid language subtags can be put into a bit-set using 2,282 bytes. That allows for fast validation with a small table. Registered codes would just use an exception table. + +An alternative mapping would be 26\*26\*27, e.g. + +- "xy" => (x-0x61)\*26\*27 + (y-0x61)\*27 +- "xyz" => (x-0x61)\*26\*27 + (y-0x61)\*27 + (z-0x60) + +However, it is better to have the two-letter codes as smaller numbers, for compression, since they occur far more often. + +**Region subtag.** One could do the same for region codes, with two-letter codes from 0..675, and then 3-digit codes from 676 to 1,675 (about 10.7 bits). A bitset that can cover all values is a 210-byte table. + +**Script subtag.** John Cowan suggested that except for Teng/Tfng, the second letter of the script code is redundant, so you can special-case those two, remove the second letter, and use the same algorithm as for ISO 639-2.
However, we can't expect that the JAC would follow any particular restrictions, and the set of scripts is still a relatively small number, so this probably isn't worth it. + +Note: The generation of a table is simply a convenience, since it can be computed from the IANA registry, so it may not be worth doing as a part of CLDR, but we can suggest it as an implementation technique. + +## Canonicalization Data + +We also provide data for validation and canonicalization. The basic canonicalization is as per BCP47, with the following additions: + +1. We canonicalize the case, with variants getting uppercase, so en\_foobar => en\_FOOBAR +2. We alphabetize the variants so that irrelevant differences in order don't cause problems, so en-FOOBAR-ABCDE => en\_ABCDE\_FOOBAR + - Note: the uppercasing of variants is for compatibility, since the basis for the CLDR work predated BCP47. + +Data for doing the preferred value mapping is in the supplemental data, extracted from the IANA registry. + +We also provide data for a lenient canonicalization, which involves the following additional steps: + +1. maps the 3-letter language subtags that have 2-letter equivalents into those 2-letter equivalents; so eng-US => en\_US +2. maps the 3-digit region codes that have 2-letter equivalents into those 2-letter equivalents; so eng-840 => en\_US +3. combines identical extensions; en-a-foo-a-fii => en\_a\_foo\_fii + +The data for #2 are in http://unicode.org/cldr/data/common/supplemental/supplementalData.xml, in codeMappings/territoryCodes. However, we need an extra table for doing #1, the language code mappings. Suggest adding: + +\ + + \ + +... + +## Sample Structure + +\ + + \ + +  -2122061011208687, e00d48015863b67, 15fb9fb2095c00, 340400f7818068d,
+  -2b07ebe0bd4e300, 100086b25d7fffc, 43fff001538f3c40, -4044cc58020eaf00,
+  4085570410419a, 18ffffffc04002, 2eea2e908400418, 6260008c6,
+  -33d4000000000000, 10000, 0, 0,
+  0, 80
+ \
+ + \ + +  91019c747263433, 1c68108800045364, 4443028094090c84, -7ffffbe3baa63970,
+  -3ff1e7af28980bf0, 61204489a16d0e6d, 10000003024040, -648bb808222ebe40,
+  1001044202044053, 4100000020000400, -1220200fbffdfc00, -5010004244000101,
+  -78c6890a8c3e0081, -fc408f0000001, -200169dfafa301, -880800009,
+  -8171c0000001, -4187fd4fbb1, -2000000800011, 9fe75970f1b42bd,
+  1490f9feddf20051, -114007e, -2800000008080001, -80280000001,
+  -40180000001, -400000000010003, -c0000200000001, -1000000d0041,
+  -20000080000041, -1200000208000001, -42000002000005, 7fffffffffefefff,
+  -3bfc2522b0640841, 4124082843c19cf, -d00447fffbbb00, 488349bd64542b49,
+  -3f182aaabe898841, -7a20060100c8ec8f, -400043effff79ae, -3878c1e88b08201,
+  8005b0008100ffd, 2000040030000000, -301210082002fde7, -3eee729ffdfffbc,
+  665df000000227bd, -4200010e261ad97, c100860c01149fc, 75689565b65c5500,
+  20003efedb, -3fe966da82589400, 7f7ffff07a540, 460801000,
+  a12510714b, 600000490100000, 4440000100000001, 4000010048010000,
+  5100000042880000, 4f553243564102dd, 800001cdc2bd5, -6e8fbffffffffffc,
+  a798218157d9013, 4000000824000000, 4001020000000, 39abfeb000004,
+  40000000, 4400020000, 5a88110000000020, 6042000000000000,
+  500108000408a, 400631080, 4081003f50300400, 13b33be00000000,
+  1100800, -5000000000000000, 283cedffffffffff, 3c51fcf24dfffc0a,
+  -4daba593a4fdff00, 409403cff84f039, -774ab6e1cfff5fef, -20000069208824fd,
+  -c3fddefefff7ff9, 40444f850ffd4, 7f41be8d6fffdff0, 2397b2000000da23,
+  fe7ffff00084050, 800000008104180, 11c941, 5feb408040040100,
+  -3fbfbeeffffbff36, -7198804000f9fffa, 1100036edb9f051, -77d596afbc000000,
+  -2001daefbfffbe, 40050809053, 10000, -7ebb71a93fd5f000,
+  78047c0208, 844785244050cc0, 1885000000000204, 11d1350ee8cd1001,
+  833eb5906691, 4100000001040052, 74481a71dd649964, 800008001000fb9,
+  10002010045400, -60a140ff7f9ff7fc, 1040c00000698843, 1a2dd20200,
+  -20feffbffffbfee0, 400000000013959, 2486290f00000401, -7fffbb9fff5fefd3,
+  68000c02000000, -370ffffffffff80, 10000080215e, 1000011000000,
+  -21c2801000, -110008000001, -1, -224000004003,
+  -1801, -200002000000001, -231, -401001,
+  -2000100001, -10000000001, -4000000000000001, -7fa6743002322029,
+  -7ff8000000ff800e, 72e6061b3dc3000, -460ffddbbc083001, -c0000da80802f1a,
+  204100000004, -1010248847ff5e, -1000000000001, -80001,
+  7dffffff7fffffff, -100000008000009, -40000421, -80000080002803,
+  -10000400001, 7efbfffdf7ffffff, -1, -800000011,
+  -1000080020003001, 27ef77ffffeffbe9, -2800000020fb7e0, -98611d000800002,
+  -13642a901081, -820a02001000001, -1aa3ff8bbfbae5e0, -2a4a3bc0002225d6,
+  1170004203fffbff, -1db71a9a7df, 400220005450540d, 6010041040810696,
+  5605100000000000, 200106000001040, 64b67f9b19201180, -462b32512a8ff6c0,
+  421500406204837, -3efa4067cac00000, 3113f7df46c98, 900000000,
+  400100100a000, -3045824002201, -5a2004f9e7bebbd5, 7251425410002047,
+  200003fefbff1edf, -482004483e11f390, -20a0300038a02081, 7c00100107ff445e,
+  -479faa0062208009, 30000003dbd77f5e, 3010000400042704, -fff7ffe70,
+  -1, -1, -1, -1,
+  -1, -1, -1, 7bdeffffffffffff,
+  140c1085d13ee57e, 800117fa2, bffffef00, 3b53000000800002,
+  16241100000010e, 5001a3831002010, 81183010, -780efd5ed8010201,
+  20800052d, 2000009020010400, 319c5f61004, 254010,
+  -2040fffffffebe0, -230a120c00000001, 400b7fffffa7bfb, -400000a1250228ef,
+  -800e840e9000201, -4004010081100801, 757bf7efdffdfffd, -20823e08f7e981,
+  -a8e000010400001, -19de7efa010f, 17d467d07159f159, -40000081080e7d6,
+  -90240402202001, -3e00fd67bfe000b, 7fdfffdffef6f1f, -412043c4fae3b,
+  -1000080080000081, -7fd010022808241, -860000001003a5, -f98420400200001,
+  67dc75ddbf8ff531, 201ca06beab91, 4480404023, 4000400000406d44,
+  80050060130000, -3effffbffdc00000, 6f14f040f49c5588, 400a0a51641,
+  -40823fffffff7ffc, 4a0044108581fe, 1000224012300, 1000000000,
+  936fbf1010800, 0, 2d80c, 2c286c00000000a0,
+  18df010000000, 25003f79fff0120, a00000, 90003100040000,
+  8300000100002, 100000, -1000000000000000, 1011ae7ecffffff,
+  5ef18d040092000, 1105404141000010, 4057a3ff6040, 19f7d9755450080b,
+  46757f0435578cc7, 72c0000000000011, 5480360501fdae, 10000001ba388b1,
+  2100300000044240, 157ff95f00000010, 40117a78d630944, 100000041e984a40,
+  -56fcf17ff8bfdf7e, -1000ffffffede, -7cf0000100bac7a9, 45bd55042f54c019,
+  -460e973bb3f7ffff, 1ad243d7fed7d37f, 80248550118440, 242000008000281,
+  -7f9fe01900020010, -1d4eafefee9fff58, 2442980000000013, 3bfbfe100020,
+  -bb61abaf5a2bf00, 3d42051b3668ffdd, -3ffffff7fc2cf27c, 200506e80c110b44,
+  200007dbbf7f002, 5b2801, -2000fffeefdf00, -7ffffff7eedfdfad,
+  20c10000, 310084280230030, -53b3cffffffdbdcc, 67ffffff004c8023,
+  -3f8af7bfefef, 10138104000010ff, 3081676e140121c0, 1000000100,
+  80400a242200000
+ \
+ +\
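
The index mapping described under Validation Data can be sketched as follows. This is a minimal illustration, not the CLDR implementation: the names `language_index`, `build_bitset`, and `is_valid` are hypothetical, the table is stored as bytes with the least significant bit first (an assumption; the sample structure above lists 64-bit values whose bit order is not stated), and registered subtags longer than three letters are left to the exception table mentioned in the text.

```python
TWO_LETTER_COUNT = 26 * 26                       # indexes 0..675
TABLE_BITS = TWO_LETTER_COUNT + 26 ** 3          # 18252 possible indexes
TABLE_BYTES = (TABLE_BITS + 7) // 8              # 2282 bytes


def language_index(subtag: str) -> int:
    """Map a 2- or 3-letter language subtag onto 0..18251."""
    s = subtag.lower()
    if len(s) not in (2, 3) or not (s.isascii() and s.isalpha()):
        raise ValueError(f"not a 2- or 3-letter subtag: {subtag!r}")
    value = 0
    for ch in s:
        value = value * 26 + (ord(ch) - ord("a"))
    # Two-letter codes keep the small indexes; three-letter codes follow them.
    return value if len(s) == 2 else TWO_LETTER_COUNT + value


def build_bitset(subtags) -> bytearray:
    """Set one bit per valid subtag (bit order is an assumption of this sketch)."""
    bits = bytearray(TABLE_BYTES)
    for subtag in subtags:
        i = language_index(subtag)
        bits[i >> 3] |= 1 << (i & 7)
    return bits


def is_valid(bits: bytearray, subtag: str) -> bool:
    """Check a subtag against the bit-set; malformed input is simply invalid."""
    try:
        i = language_index(subtag)
    except ValueError:
        return False
    return bool(bits[i >> 3] & (1 << (i & 7)))


if __name__ == "__main__":
    table = build_bitset(["en", "fr", "eng", "haw"])
    print(is_valid(table, "en"), is_valid(table, "xx"))
```

The table size works out to the 2,282 bytes quoted in the text, since 18,252 bits round up to 2,282 bytes.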
+ +Here is the data that they replace: + +\ + +\ + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/bidi-handling-of-structured-text.md b/docs/site/development/development-process/design-proposals/bidi-handling-of-structured-text.md new file mode 100644 index 00000000000..e6100a0ebbf --- /dev/null +++ b/docs/site/development/development-process/design-proposals/bidi-handling-of-structured-text.md @@ -0,0 +1,360 @@ +--- +title: BIDI handling of Structured Text +--- + +# BIDI handling of Structured Text + +**Author: Tomer Mahlin - IBM Bidi Competency Center - e-mail comments to John Emmons - emmo@us.ibm.com** + +**0. Terminology** + +***Structured text*** (aka complex expressions) - text having implied and/or inherent structure, such as: URL, file path, email, Java code, XML, regular expression, date/time stamp etc. + +This term is abbreviated as STT. + +**1. The problem** + +When structured text includes characters from languages with bidirectional scripts, the default display system on the logical platform does not preserve the structure, and as a result the text on the screen becomes incomprehensible. The problem affects display only; background text processing is not affected. However, the way the text is displayed on the screen makes it very hard to work with (in the best case). + +**2. What solution would we like to have ?** + +In most general terms the solution should allow: + +1. Preparation of a string for display in a way that assures its structure is preserved. This entails the development of a flexible set of parsers which will analyze the structure of the text and inject UCC (Unicode Control Characters) when appropriate. The framework should be extendable in order to allow adding new parsers (either custom or built-in) in the future. +2. Taking into consideration user preferences (default rules for STT handling): Arabic vs. Hebrew.
Users in different geographies are accustomed to different rules for STT display. For example math formulas are always displayed from left-to-right for Hebrew users, while for Arabic users, they may be displayed from right-to-left. +3. Ability to leverage default rules for STT handling (defined in CLDR) and also customize them. + +**3. What this solution will NOT address (static vs. dynamic context)** + +Static context is context in which text can't be directly edited by the end user (e.g. label, not editable cell of grid, tree branch label etc.). + +Here is an example for file path: + +![image](../../../images/design-proposals/st1.jpg) + +Dynamic context is context in which text can be directly edited (e.g. input field, editable area, editable cell of grid etc.) . + +Here is an example for file path: + +![image](../../../images/design-proposals/st2.jpg) + +Due to the ICU/CLDR nature (stand alone library without any widgetry) the solution will relate to **static** cases of text only. + +Namely, the solution will prepare the text to be displayed in the static context. + +Addressing dynamic context is beyond what ICU can provide. The main reason is heterogeneous environment in which ICU can be deployed and + +different approaches for resolution of problem for STT. For example, even for Java world the difference between SWT and Swing will pose a challenge. For .Net world (in which ICU4C can be used), addressing the problem is beyond the scope of ICU since it will necessitate invocation of .Net level API. + +**4. How the solution will be used (scope, applicability for ICU itself) ?** + +In overwhelming majority of cases the deployers leveraging ICU will call an API similar to the following one: + +**prepareForDisplay**( inputText, STT\_Type, additional\_parameters) + +This API will inject UCC ( Unicode Control Characters ) into inputText according to STT\_Type and additional\_parameters and will return the result ready for display. 
+ +inputText - structured text subject to display + +STT\_Type: file path, URL, email, Java etc. + +additional\_parameters - external parameters which might affect STT processing and which are not directly available to ICU (e.g. GUI direction etc.) + +The result string will be embedded by the caller into a graphical element which will render it on the screen. + +ICU users will be able to either accept default STT rules (associated with locale and defined originally in CLDR) or customize them (via a convenient API + +allowing setting / getting values for different aspects of such rules). + +ICU users will be able to extend the list of parsers provided out of the box in order to address additional types of STT. + +One additional usage of this functionality is in serialization / formatting of dates / time stamps provided by ICU itself. + +A date / time stamp is considered structured text as well. + +To assure backward compatibility ICU should not inject UCC into date / time stamps by default. However, it can provide either a flag or an additional + +signature of the same function (responsible for date / time stamp generation) which will prepare them for display (by injecting UCC appropriately). + +This will assure proper display of date / time stamps generated by ICU. For example in Eclipse: + +![image](../../../images/design-proposals/st3.jpg) + +**5. Factors affecting the display of STT** + +Several kinds of factors can affect the proper display of structured text. Some types of STT, such as URLs or file paths, have a strong LTR directionality associated with them, while others have different display semantics depending on the directionality of the underlying GUI, or upon the content of the text itself. This is further complicated by the fact that the culturally accepted preferences for proper display of STT differ, even between Hebrew and Arabic users.
Thus, we need to be able to define these rules within the CLDR that describe these and be able to use that data in order to be able to format STT properly in a given context. + +The following table summarizes the recognized preferences for the proper display of STT in Hebrew and Arabic: + +![image](../../../images/design-proposals/st6.jpg) + +![image](../../../images/design-proposals/st5.jpg) + +![image](../../../images/design-proposals/st4.jpg) + +**6. Proposed additions to CLDR** + +The following schema is proposed for addition into CLDR, most likely as sub-elements under the \ category. This structure defines the sets of rules in CLDR necessary to describe proper display behavior for various types of structured text: + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +\ + +7\. Sample data for various types of STT: + + a). STT with strong directionality, such as filepath, would have the following rules in root: + +\ + +\ + +\ + +\LTR\ + +\LEFT\ + +\LEFT\ + +\ + +\ + +\ + +\ + +\LTR\ + +\RIGHT\ + +\LEFT\ + +\ + +\ + +\ + + b). The "message" type has some variations based on the content, but the rules remain constant across locales, as follows: + +root.xml: + +\ + + \ + + \ + + \LTR\ + + \LEFT\ + + \LEFT\ + + \ + + \ + + \RTL\ + + \LEFT\ + + \RIGHT\ + + \ + + \ + + \RTL\ + + \LEFT\ + + \RIGHT\ + + \ + + \ + + \ + + \ + + \LTR\ + + \RIGHT\ + + \LEFT\ + + \ + + \ + + \RTL\ + + \RIGHT\ + + \RIGHT\ + + \ + + \ + + \RTL\ + + \RIGHT\ + + \RIGHT\ + + \ + + \ + +\ + + c). 
Rules for date and time stamps differ between Hebrew and Arabic, so we would have: + +In he.xml: + +\ + + \ + + \ + + \RTL\ + + \LEFT\ + + \RIGHT\ + + \ + + \ + + \RTL\ + + \LEFT\ + + \RIGHT\ + + \ + + \ + + \LTR\ + + \LEFT\ + + \LEFT\ + + \ + + \ + + \ + + \ + + \RTL\ + + \RIGHT\ + + \RIGHT\ + + \ + + \ + + \RTL\ + + \RIGHT\ + + \RIGHT\ + + \ + + \ + + \LTR\ + + \RIGHT\ + + \LEFT\ + + \ + + \ + +\ + +The following rules in ar.xml ( Arabic ): + +\ + + \ + + \ + + \LTR\ + + \LEFT\ + + \LEFT\ + + \ + + \ + + \ + + \ + + \RTL\ + + \RIGHT\ + + \RIGHT\ + + \ + + \ + +\ + + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/images/design-proposals/st1.jpg b/docs/site/images/design-proposals/st1.jpg new file mode 100644 index 00000000000..4c8788ca357 Binary files /dev/null and b/docs/site/images/design-proposals/st1.jpg differ diff --git a/docs/site/images/design-proposals/st2.jpg b/docs/site/images/design-proposals/st2.jpg new file mode 100644 index 00000000000..bd7088ef524 Binary files /dev/null and b/docs/site/images/design-proposals/st2.jpg differ diff --git a/docs/site/images/design-proposals/st3.jpg b/docs/site/images/design-proposals/st3.jpg new file mode 100644 index 00000000000..5e460da4e81 Binary files /dev/null and b/docs/site/images/design-proposals/st3.jpg differ diff --git a/docs/site/images/design-proposals/st4.jpg b/docs/site/images/design-proposals/st4.jpg new file mode 100644 index 00000000000..46fbe84bb9f Binary files /dev/null and b/docs/site/images/design-proposals/st4.jpg differ diff --git a/docs/site/images/design-proposals/st5.jpg b/docs/site/images/design-proposals/st5.jpg new file mode 100644 index 00000000000..cb5505213a5 Binary files /dev/null and b/docs/site/images/design-proposals/st5.jpg differ diff --git a/docs/site/images/design-proposals/st6.jpg b/docs/site/images/design-proposals/st6.jpg new file mode 100644 index 00000000000..79c09c349cc Binary files /dev/null and 
b/docs/site/images/design-proposals/st6.jpg differ