From 6128e2a71930aaa914674054bee89071e118d569 Mon Sep 17 00:00:00 2001
From: Chris Pyle
Date: Sat, 6 Jul 2024 14:22:38 -0400
Subject: [PATCH] CLDR-17566 removing text files

---
 .../new-bcp47-extension-t-fields.txt         |  89 ---------------
 .../new-time-zone-patterns.txt               | 107 ------------------
 docs/site/TEMP-TEXT-FILES/path-filtering.txt |  30 -----
 .../pattern-character-for-related-year.txt   |  17 ---
 docs/site/TEMP-TEXT-FILES/pinyin-fixes.txt   |  17 ---
 5 files changed, 260 deletions(-)
 delete mode 100644 docs/site/TEMP-TEXT-FILES/new-bcp47-extension-t-fields.txt
 delete mode 100644 docs/site/TEMP-TEXT-FILES/new-time-zone-patterns.txt
 delete mode 100644 docs/site/TEMP-TEXT-FILES/path-filtering.txt
 delete mode 100644 docs/site/TEMP-TEXT-FILES/pattern-character-for-related-year.txt
 delete mode 100644 docs/site/TEMP-TEXT-FILES/pinyin-fixes.txt

diff --git a/docs/site/TEMP-TEXT-FILES/new-bcp47-extension-t-fields.txt b/docs/site/TEMP-TEXT-FILES/new-bcp47-extension-t-fields.txt
deleted file mode 100644
index cc0defd8d47..00000000000
--- a/docs/site/TEMP-TEXT-FILES/new-bcp47-extension-t-fields.txt
+++ /dev/null
@@ -1,89 +0,0 @@
-New BCP47 Extension T Fields
-Proposed Additions
-BCP47 language tags can use Extension T for identifying transformed content, or indicating requests for transformed content, as described in rfc6497. If you have any comments on proposals, please circulate them on the cldr-users mailing list. Instructions for joining are at cldr list.
-There are no proposed additions at this time.
-Approved Proposals
-The following proposals have been approved for the next version of the BCP47 T Extension registry, after being distributed for public review.
-Approved May 9, 2012
-The following proposal was distributed for public review on April 26, 2012. The longer descriptions in can be used as a basis for enhancing the XML description or for the LDML spec.
-key = i0 (IME) - -  - -  - -  - -  -key = k0 (keyboard) - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  - -  -Approved April 4, 2012 -The following proposal was distributed for public review on March 26, 2012. The CLDR committee concluded that the best way of representing different kinds of identifiers for use in requesting input transforms was to have separate fields, and not subfields of the existing m0. Using different 'namespaces' allows users of the T extensions to ignore types of subfields that are not relevant, and to group related subfields in an organized fashion. On that basis, the following additions to the BCP47 T Extension registry (see bcp47/transform.xml) were approved. The longer descriptions in bullets can be used as a basis for enhancing the XML description or for the LDML spec. - -Used to indicate a keyboard transformation, such as one used by a client-side virtual keyboard. The first subfield in a sequence would typically be a 'platform' or vendor designation. - -Used to indicate an input method transformation, such as one used by a client-side input method. The first subfield in a sequence would typically be a 'platform' or vendor designation. - -Used to indicate content that has been machine translated, or a request for a particular type of machine translation of content. The first subfield in a sequence would typically be a 'platform' or vendor designation. - -Used when the only information known (or requested) is that the text was machine translated. - -Used for implementation-specific transforms. All subfields consistent with rfc6497 (that is, subtags of 3-8 alphanum characters) are valid, and do not require registration. -Note: RFC6497 interprets transforms that result in content broadly, including speech recognition and other instances where the source is not simply text. 
For the case of keyboards, the source content can be viewed as keystrokes, but may also be text—for the case of virtual web-based keyboards. For example, such a keyboard may translate the text in the following way. Suppose the user types a key that produces a "W" on a qwerty keyboard. A web-based tool using an azerty virtual keyboard can map that text ("W") to the text that would have resulted from typing a key on an azerty keyboard, by transforming "W" to "Z". Such transforms are in fact performed in existing web applications. The standardized extension can be used to communicate, internally or externally, a request for a particular keyboard mapping that is to be used to transform either text or keystrokes, and then use that data to perform the requested actions. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/new-time-zone-patterns.txt b/docs/site/TEMP-TEXT-FILES/new-time-zone-patterns.txt deleted file mode 100644 index 8323eeab345..00000000000 --- a/docs/site/TEMP-TEXT-FILES/new-time-zone-patterns.txt +++ /dev/null @@ -1,107 +0,0 @@ -New Time Zone Patterns -This proposal was reviewed in the CLDR TC meetings on 2013-01-09 and 2013-01-16, and approved by the CLDR TC. -Z 1..3 -0800 Time Zone - RFC 822 GMT format. For more information about timezone formats, see Appendix J: Time Zone Display Names . -4 HPG-8:00 Time Zone - The localized GMT format. For more information about timezone formats, see Appendix J: Time Zone Display Names . -5 -08:00 Time Zone - ISO8601 time zone format. For more information about timezone formats, see Appendix J: Time Zone Display Names . -Summary -This design proposal includes following new pattern letters in the LDML date format pattern definition for time zone formatting. -X and x for ISO 8601 style non localizable UTC offset format -'X' uses UTC designator "Z" when UTC offset is 0 -'x' uses difference between local time and UTC always - i.e. format like "+0000" is used when UTC offset is 0. 
-O for localized GMT format variations -V for time zone ID (V - short / VV - IANA) and exemplar city (VVV). -Background -JDK and LDML started from the same base, but they have been extended independently. LDML is not necessarily 100% compatible with JDK or vise versa, but using completely different definitions for the same purpose introduce confusion among consumers. Now, Java folks are trying to integrate JSR-310 into Java 8 release and extend some date format pattern definitions to support things missing in the current SimpleDateFormat. -For example, LDML specification uses pattern letter 'Z' for UTC offset based time zone format. In CLDR 22.1, letter 'Z' is defined as below: -However, JDK supports ISO8601 style format using letter 'X'. JDK also introduced letter 'I' for direct use of zone ID. -This proposal is created for sorting out time zone patterns and hopefully filling a gap between LDML specification and future JDK and JSR-310 releases, as well as adding necessary pattern for CLDR requirements. -Non-Localized Local Time Offset Format -The pattern letter 'X' was added in JDK 7 for supporting ISO8601 style time zone format. In JDK 7, the behavior of pattern letter 'X' is described as below: -ISO8601TimeZone: OneLetterISO8601TimeZone TwoLetterISO8601TimeZone ThreeLetterISO8601TimeZone OneLetterISO8601TimeZone: Sign TwoDigitHours Z TwoLetterISO8601TimeZone: Sign TwoDigitHours Minutes Z ThreeLetterISO8601TimeZone: Sign TwoDigitHours : Minutes Z TwoDigitHours: Digit Digit Sign: one of + - Hours: Digit Digit Digit Minutes: Digit Digit Digit: one of 0 1 2 3 4 5 6 7 8 9 -In the current LDML specification, ISO8601 style format is specified by pattern "ZZZZZ" (5 'Z's), but followings are not supported -Offset without separator, such as -0800 -Hour only format, such as -08 -The JSR-310 proposal is trying to extend the definition to support seconds field in offset, probably because it is necessary to handle LMT in the time zone database. 
So the proposal adds "XXXX" and "XXXXX" for supporting optional second field. -Note: ISO8601 specification does not support seconds field in local time offset. -In LDML, we could define "ZZZZZZ" (6 'Z's), "ZZZZZZZ" (7 'Z's)... to support these requirements, but it would become so messy. Luckily, pattern letter 'X' is not yet used by the LDML specification, I propose to use the letter for supporting these requirements and deprecate "ZZZZZ" (5 'Z's). -The JSR-310 definition (compatible with JDK 7 SimpleDateFormat, with some enhancements for seconds field) might be used also for LDML, but I think there are several issues. -Single 'X' is used for limiting offset to be hour field only. Such usage is practically questionable. There are some active time zones using offsets with non-zero minutes field. So such format is highly discouraged when a zone has non-zero minutes field. ISO8601 specification also says "The minutes time element of the difference may only be omitted if the difference between the time scales is exactly an integral number of hours.". -When non-zero minutes (or seconds) field is truncated and hour field is 0, the output becomes +00/-00/+0000/-0000/+00:00/-00:00. Use of negative sign for offset equivalent to UTC (-00/-0000/-00:00) is illegal in ISO8601. -When to use "Z" or "+00"/"+0000"/"+00:00" is also a design question. JSR-310 seems to extend pattern letter Z to support format without ISO8601 UTC indicator "Z". -In ISO8601 specification, "Z" is specifically defined for UTC of day. "+00"/"+0000"/"+00:00" is a valid format for difference between local time and UTC (Section 4.2.5.1). Semantically, "Z" expresses UTC itself, while "+00"/"+0000"/"+00:00" expresses a local time with UTC offset of zero. LDML specification might handle zone "Etc/UTC" different from winter time of "Europe/London" (the former is formatted as "Z" and the latter is formatted as "+00"/"+0000"/"+00:00"). However, it is not clear how "Etc/GMT", "Etc/GMT+0"... should be handled. 
(Etc/GMT might be just an alias of Etc/UTC, but Etc/GMT+0 is probably not in this scope...). It would be much easier to understand to control the behavior through different pattern letter. -For this reason, this proposal defines two pattern letter - 'X' and 'x'. 'X' is mostly upward compatible with JDK 7 SimpleDateFormat's 'X' and the current JSR-310 proposal, but slightly modified for resolving some design issues. 'x' only differs that the format always use difference between local time and UTC ("+00"/"+0000"/"+00:00" when UTC offset is 0). -Proposed Definition -X 1 Z --08 --0830 Variable length ISO8601 'difference between local time and UTC - basic format' including hours field and optional minutes field, or "Z" (ISO8601 UTC designator) when the offset is 0. -Note: The seconds field in the offset is truncated. -2 Z --0800 Fixed length ISO8601 'difference between local time and UTC - basic format' including hours field and optional minutes field, or "Z" (ISO8601 UTC designator) when the offset is 0. -Note: The seconds field in the offset is truncated. -3 Z --08:00 Fixed length ISO8601 'difference between local time and UTC - extended format' including hours field and optional minutes field, or "Z" (ISO8601 UTC designator) when the offset is 0. -Note: The seconds field in the offset is truncated. This pattern is equivalent to "ZZZZZ". -4 Z --0800 --083015 Variable length format based on ISO8601 'difference between local time and UTC - basic format' including hours/minutes field and optional seconds field, or "Z" (ISO8601 UTC designator) when the offset is 0. -Note: When seconds field value is no 0, the result format is not a legal ISO8601 local time offset format. -5 Z --08:00 --08:30:15 Variable length format based on ISO8601 'difference between local time and UTC - extended format' including hours/minutes field and optional seconds field, or "Z" (ISO8601 UTC designator) when the offset is 0. 
-Note: When seconds field value is no 0, the result format is not a legal ISO8601 local time offset format. -x 1 -08 --0830 Variable length ISO8601 'difference between local time and UTC - basic format' including hours field and optional minutes field. -Note: The seconds field in the offset is truncated. -2 -0800 Fixed length ISO8601 'difference between local time and UTC - basic format' including hours field and optional minutes field. -Note: The seconds field in the offset is truncated. This pattern is equivalent to "Z" /"ZZ"/"ZZZ". -3 -08:00 Fixed length ISO8601 'difference between local time and UTC - extended format' including hours field and optional minutes field. -Note: The seconds field in the offset is truncated. -4 -0800 --083015 Variable length format based on ISO8601 'difference between local time and UTC - basic format' including hours/minutes field and optional seconds field. -Note: When seconds field value is no 0, the result format is not a legal ISO8601 local time offset format. -5 -08:00 --08:30:15 Variable length format based on ISO8601 'difference between local time and UTC - extended format' including hours/minutes field and optional seconds field. -Note: When seconds field value is no 0, the result format is not a legal ISO8601 local time offset format. -O 1 GMT -GMT-5 -HPG+8:30 -UTC-8.30.15 Short localized GMT format. -4 GMT -GMT-05:00 -HPG+08:30 -UTC-08.30.15 Long localized GMT format including hours/minutes fields and optional seconds field. -Note: This format is equivalent to the current "ZZZZ", except optional seconds field might be appended. -X, XX, and XXX always produce valid ISO8601 formats, but may lose the information of the second fields. -XXXX and XXXXX produce valid ISO8601 format for practical use cases, but the outputs may include the second fields (when offset is not exact minutes). -When UTC offset is 1 to 59 seconds, X, XX, and XXX interpret as UTC offset of zero and emit "Z". 
That means, X, XX and XXX never emit "+00" or "+0000". -The table below illustrates the behavior of pattern letter 'X' in JDK 7, JSR-310 proposal and this proposal. -Proposed 'x' only differs when offset (or truncated offset) is 0 - using +00/+0000/+00:00 instead of "Z" -Note: JDK 7 uses +00 and -00. -00 is illegal in ISO8601 specification. -Localized GMT Format Variants -Pattern "ZZZZ" is currently used for localized GMT format. This format is constructed with following elements: -+HH:mm;-HH:mm -GMT{0} -GMT -For example, UTC offset is -3:00, the output is "GMT-03:00" with above data. Unlike non-localized local time offset format, this format uses local decimal digits for hours/minutes field. -This format sometimes tend to be longer than what people expect. For example, Bulgarian locale in CLDR 22 has "Гриинуич{0}" for gmtFormat. CLDR#5382 proposes to add shorter version and we need a pattern for this purpose. -Again, we could use "ZZZZZZ" (6 'Z's) for this purpose, but it's a little bit ugly. We may also want other variants, for example, using numeric/symbol only offset format, such as "(+3)" in future. Therefore, this proposal allocate a new pattern letter 'O' for the purpose. -Note: "OOOO" (instead of "OO") is used for long format to be consistent with other patterns. Pattern "OO" and "OOO" are reserved for future enhancement. -Time Zone ID -JSR-310 proposal includes pattern letter 'I' (capital ai) for time zone ID itself. This is a little bit beyond "locale repository" purpose, but it is the most robust way to preserve date/time information in a text representation. -In CLDR, we're afraid of burning one letter just for this purpose. In the CLDR TC meeting on Jan 16, 2003, we agreed to use "VV" for this purpose. At the same time, we also agreed to redefine already deprecated pattern "V" for CLDR/BCP 47 short time zone ID, and newly define "VVV" for exemplar city (location - localizable). 
As the result, the series of all "V" patterns, including existing "VVVV" (generic location format), will be a set of formats supporting canonical time zone ID round trip. -Proposed Definition -JDK 7 (SimpleDateFormat) JSR-310(proposed) LDML(proposed) -UTC Offset 00:00:00 -00:00:30 -00:30:00 -00:30:30 -01:00:00 00:00:00 -00:00:30 -00:30:00 -00:30:30 -01:00:00 00:00:00 -00:00:30 -00:30:00 -00:30:30 -01:00:00 -X Z +00 -00 -00 -01 Z ? ? ? -01 Z Z -00:30 -00:30 -01 -XX Z +0000 -0030 -0030 -0100 Z ? -0030 -0030 -0100 Z Z -0030 -0030 -0100 -XXX Z +00:00 -00:30 -00:30 -01:00 Z ? -00:30 -00:30 -01:00 Z Z -00:30 -00:30 -01:00 -XXXX - - - - - Z -000030 -0030 -003030 -0100 Z -000030 -0030 -003030 -0100 -XXXXX - - - - - Z -00:00:30 -00:30 -00:30:30 -01:00 Z -00:00:30 -00:30 -00:30:30 -01:00 -V 1 uslax -utc Short time zone identifier (BCP 47 unicode locale extension, time zone value) -fallback: If there is no mapping to BCP 47 time zone value, format for pattern "xxxx" is used as a fallback, such as "-0500" -2 America/Los_Angeles -Etc/GMT Time zone identifier (IANA Time Zone Database, or user defined ID) -3 Los Angeles -東京 Localized exemplar location (city) name for time zone -If a time zone is not associated with any specific locations (e.g. Etc/GMT+1), localized exemplar city name for time zone "Etc/Unknown" is used. \ No newline at end of file diff --git a/docs/site/TEMP-TEXT-FILES/path-filtering.txt b/docs/site/TEMP-TEXT-FILES/path-filtering.txt deleted file mode 100644 index 6de6d5230e5..00000000000 --- a/docs/site/TEMP-TEXT-FILES/path-filtering.txt +++ /dev/null @@ -1,30 +0,0 @@ -Path Filtering -Inside of CoverageLevel, and in the LDML2ICUConverter, and in various other places, we are filtering the XML files based on paths and values. However, these tend to be ad hoc mechanisms, and especially in the case of CoverageLevel, with a lot of hard-coded strings. This is a proposal for making a general, data-driven mechanism for handling this. 
-The data is a list of pairs, where the first of each pair is a result, and the second is a regex. Logically, the list is traversed until there is a match, and then the result for that pair is returned.
-For example, here is what the start of the list for CoverageLevel might look like:
-posix ; posix/messages/(yes|no)str
-posix ; characters/exemplarCharacters
-minimal ; timeZoneNames/(hourFormat|gmtFormat|regionFormat)
-minimal ; unitPattern
-basic ; measurementSystemName
-The results do not need to be grouped together. Thus an inclusion/exclusion list can be formed like:
-true ; posix
-false ; examplarCharacter.*auxiliary
-true ; exemplarCharacters
-...
-You can also have special purpose pairs, such as the following to remove the alts at the front.
-skip ; \[@alt="[^"]*proposed
-Specialized Wildcards
-There are a couple of extra features of the regex. For the coverage level (and perhaps others), we need some additional matches.
-[TBD - add more]
-Variable Description
-$locale the locale of the XML file in question
-$eu the EU languages
-$localeScripts the scripts used in this locale, eg (Latn|Cyrl|Arab)
-$modernCurrencies currencies that are currently valid tender in some country
-$localeRegions countries/regions that have the locale's language as an official language
-$localeCurrencies modern currencies for the $localeRegions
-$modernMetazones metazones ...
-Issue:
-I'm thinking that we may want to append the value to the path (eg .../_VALUE="...") to allow for matching on that.
-Use XML instead of ; format?
\ No newline at end of file
diff --git a/docs/site/TEMP-TEXT-FILES/pattern-character-for-related-year.txt b/docs/site/TEMP-TEXT-FILES/pattern-character-for-related-year.txt
deleted file mode 100644
index 234f4a9414d..00000000000
--- a/docs/site/TEMP-TEXT-FILES/pattern-character-for-related-year.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-Pattern character for “related year”
-Author Peter Edberg
-Date 2014-Feb-11
-Status Proposal
-Feedback to pedberg (at) apple (dot) com
-Bugs # 6938
-In locales that use non-Gregorian calendars, it is often common to display some portion of a Gregorian or Gregorian-like date along with the formatted non-Gregorian date. This can take various forms:
-For non-Gregorian calendars that are “Gregorian-like” (only the current-era year is different) such as the Japanese emperor-year calendar, the Gregorian year may be shown along with the calendar’s current year. For example, using the Japanese calendar in the Japanese locale, a common display format would be something like “2012(平成24)年1月15日” (year portion in yellow).
-For non-Gregorian luni-solar calendars in which years are named in a 60-year cycle but eras are not used, the formatted date is ambiguous (it is not clear which era it belongs to, so it cannot be parsed reliably). Furthermore, users of the calendar may be unsure of the mapping between the named year and nearby years in the Gregorian or Gregorian-like calendar that they use for other purposes. To address this, there are two common conventions:
-Along with (or instead of) the year name, use the Gregorian year in which the lunar-calendar year began. For example, in the Chinese lunar calendar:
-the lunar date corresponding to 2013-Jan-15, which is in the 12th month of the lunar year ren-chen that began 2012-Jan-23, could be represented as 2012壬辰年腊月初四 or just 2012年腊月初四 (year portion in yellow).
-the lunar date corresponding to 2013-Feb-15, which is in the 1st month of the lunar year gui-si that began 2013-Feb-10, could be represented as 2013癸巳年正月初六 or just 2013年正月初六 (year portion in yellow).
-Along with the lunar year date (using the cyclic year name), show a complete year-month-day date in a Gregorian or Gregorian-like calendar. Which Gregorian-like calendar is used depends on the locale; for Japan it might be the Japanese imperial calendar, for Taiwan it might be the Minguo calendar, elsewhere it would typically be the Gregorian calendar (and possibly other calendars as well, in parts of China an Islamic calendar date may be shown too).
-To address the format requirements for sections 1 and 2.1 above, I propose using pattern character 'r' to designate the related Gregorian year, which for any solar or luni-solar calendar would always be a fixed offset from the extended year 'u', and would correspond to the Gregorian year in which the calendar’s year begins (for the Gregorian calendar, 'r' would behave just 'u', i.e. the offset would be 0). For calendars completely unlinked from the solar year, such as the various Islamic calendars, 'r' could still correspond to the Gregorian year in which the Islamic year started, but the difference from extended year would not be a fixed offset (such formatting using Gregorian year is not common for the Islamic calendar anyway).
-In ICU, each calendar could provide an internal method that would return that offset, then formatting or parsing using the 'r' character should just use that offset in conjunction with the EXTENDED_YEAR calendar field. In CLDR, all that would be required is documenting the new pattern character.
-The data and format requirements for section 2.2 above are more complex and not addressed here
-See also the earlier discussion of these issues in section F.11 of the proposal “Chinese (and other) calendar support, intercalary months, year cycles.”
\ No newline at end of file
diff --git a/docs/site/TEMP-TEXT-FILES/pinyin-fixes.txt b/docs/site/TEMP-TEXT-FILES/pinyin-fixes.txt
deleted file mode 100644
index 424d37f64a7..00000000000
--- a/docs/site/TEMP-TEXT-FILES/pinyin-fixes.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-Pinyin Fixes
-As a part of the CLDR updates for Unicode 5.2, I've been looking at the pinyin support. This is in two areas:
-We transform from Han characters to Pinyin
-We sort according to Pinyin
-According to the directions from Richard Cook, the best algorithm to get the most frequently used pinyin reading is to use all kHanyuPinlu readings first; then take all kXHC1983; then kHanyuPinyin. Using a program to get this, and compare against the pinyin sorting and transforms, we get discrepancies. For example, for sorting there are about 1500 cases (see attachment). The format is:
-?? for items that look out of place (using a heuristic algorithm). Example:
-?? 606 * kē (607) 錒
-The 606 is the "distance" from surrounding cases, the 607 is the rank order of the pinyin.
-Where there are multiple readings in Unihan, they are given in the format with --:
-?? 1 ào (20) 坳 垇
- -- 坳 {ào=[xh, pn, ma], āo=[pn, ma], yǒu=[pn]}
- -- 垇 {ào=[xh, ma], āo=[ma]}
-lu is kHanyuPinlu
-xh is kXHC1983
-pn is kHanyuPinyin
-ma is kMandarin
-pinyinSortComparison.txt
\ No newline at end of file