CLDR-17566 text diffs and minor changes
chpy04 committed Jun 27, 2024
1 parent 02af3d3 commit 7d6c205
Showing 6 changed files with 68 additions and 24 deletions.
27 changes: 15 additions & 12 deletions docs/site/TEMP-TEXT-FILES/characters.txt
@@ -2,12 +2,18 @@ Alphabetic Information
Ellipsis Patterns
Ellipsis patterns are used in a display when the text is too long to be shown. It will be used in environments where there is very little space, so it should be just one character; where that really can't work, it should be as short as possible.
There are three different possible patterns that need to be translated. Typically the same character is used in all three, but three choices are provided just in case different characters would be appropriate in different contexts, for some languages.
English Pattern English Example Meaning
{0}… or { FIRST_PART_OF_TEXT }… The quick brown f... The end of the string is being truncated.
{0}…{1} or { FIRST_PART_OF_TEXT }…{ LAST_PART_OF_TEXT } The quic…azy dog. The middle of the string is being truncated.
…{1} or …{ LAST_PART_OF_TEXT } …ver the lazy dog. The start of the string is being truncated.
English uses the same basic text for all three cases, and just changes the placeholders. An example of where a language might use different characters is where a space should come between the placeholder and the ellipsis. In that case, the patterns would be as in the second column below.
English Pattern With Spaces
{0}… {0} …
{0}…{1} {0} … {1}
…{1} … {1}
English uses the ellipsis character (Unicode U+2026), which is preferred over three periods in a row. The latter may have a different appearance, as in the following table.
Ellipsis Character
Three dots (periods/full-stops)
...
Ellipsis Character …
Three dots (periods/full-stops) ...
If your language also uses three dots to indicate that some text is being elided, then you should also use the ellipsis character unless three separate dots are strongly preferred.
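As an illustration of how the three patterns might be applied, here is a minimal Python sketch. The `ELLIPSIS_PATTERNS` table, its keys, and the `truncate` function are hypothetical names chosen for this example; they are not part of CLDR or the Survey Tool.

```python
# Hypothetical pattern table mirroring the three English patterns above.
ELLIPSIS_PATTERNS = {
    "final": "{0}…",      # end of the string is truncated
    "medial": "{0}…{1}",  # middle of the string is truncated
    "initial": "…{1}",    # start of the string is truncated
}


def truncate(text: str, max_len: int, position: str = "final") -> str:
    """Shorten text to at most max_len characters using an ellipsis pattern."""
    if len(text) <= max_len:
        return text
    pattern = ELLIPSIS_PATTERNS[position]
    # Characters the pattern itself occupies (the ellipsis and any spaces).
    overhead = len(pattern.replace("{0}", "").replace("{1}", ""))
    keep = max(max_len - overhead, 0)
    if position == "final":
        return pattern.format(text[:keep])
    if position == "initial":
        return pattern.format("", text[len(text) - keep:])
    # Medial: split the remaining budget between the head and the tail.
    head = keep // 2
    tail = keep - head
    return pattern.format(text[:head], text[len(text) - tail:])


sentence = "The quick brown fox jumps over the lazy dog."
print(truncate(sentence, 17, "medial"))
print(truncate(sentence, 18, "initial"))
```

Because the overhead is computed from the pattern itself, a locale that uses the spaced variants (such as `{0} … {1}`) would only need different entries in the table.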
Parse (Parse Lenient)
These are the characters that should be treated the same when a program (or system) reads them as input. An example would be when you type a date into a browser URL field.
@@ -21,14 +27,11 @@ The delimiters are the characters used for quoting text. For example, for Englis
BIDI languages (Arabic, Hebrew,…):
“Start” means the character that starts the quotation, and “end” the one that finishes it. With most languages, the start quotation will appear on the left, while with BIDI languages, it will appear on the right.
Valid Delimiters
Currently the CLDR survey tool checks input delimiters against a predefined set of possibilities. The following delimiters are considered "valid" by the CLDR survey tool.
U+2018 LEFT SINGLE QUOTATION MARK U+2019 RIGHT SINGLE QUOTATION MARK U+201A SINGLE LOW-9 QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK U+201E DOUBLE LOW-9 QUOTATION MARK U+300C LEFT CORNER BRACKET U+300D RIGHT CORNER BRACKET U+300E LEFT WHITE CORNER BRACKET U+300F RIGHT WHITE CORNER BRACKET U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK « U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK » U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
Currently the CLDR survey tool checks input delimiters against a predefined set of possibilities. The following delimiters are considered "valid" by the CLDR survey tool.
‘ U+2018 LEFT SINGLE QUOTATION MARK U+2019 RIGHT SINGLE QUOTATION MARK U+201A SINGLE LOW-9 QUOTATION MARK U+201C LEFT DOUBLE QUOTATION MARK U+201D RIGHT DOUBLE QUOTATION MARK U+201E DOUBLE LOW-9 QUOTATION MARK U+300C LEFT CORNER BRACKET U+300D RIGHT CORNER BRACKET U+300E LEFT WHITE CORNER BRACKET U+300F RIGHT WHITE CORNER BRACKET U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK « U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK » U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
If you need to enter a delimiter that is not one of the characters on this list, please file a new ticket by following these instructions.
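The check described above can be pictured as a simple set lookup. The following Python sketch is only an illustration built from the code points listed in this section; `VALID_DELIMITERS` and `describe_delimiter` are hypothetical names, and this is not the Survey Tool's actual implementation.

```python
import unicodedata

# Code points taken from the "Valid Delimiters" list above.
VALID_DELIMITERS = {
    "\u2018", "\u2019", "\u201A",  # single quotation marks
    "\u201C", "\u201D", "\u201E",  # double quotation marks
    "\u300C", "\u300D", "\u300E", "\u300F",  # corner brackets
    "\u2039", "\u203A",  # single angle quotation marks
    "\u00AB", "\u00BB",  # double angle quotation marks
}


def describe_delimiter(ch: str) -> str:
    """Report the code point, Unicode name, and list membership of ch."""
    status = "valid" if ch in VALID_DELIMITERS else "not in list"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {status}"


print(describe_delimiter("\u201C"))
print(describe_delimiter('"'))
```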
Yes/No
There are special versions of "Yes" and "No" used in POSIX (Portable Operating System Interface) context or other similar applications. Please supply the full word in your language (in lowercase if applicable), followed by a colon, then a common abbreviation separated by colons.
Name
Yes
No
English Example
yes:y
no:n
Name English Example
Yes yes:y
No no:n
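To make the colon-separated format concrete, here is a small Python sketch of how a consumer might split such a value into the full word and its abbreviations. The function name `parse_posix_answer` is an assumption made for this example, not an existing CLDR API.

```python
def parse_posix_answer(value: str) -> tuple[str, list[str]]:
    """Split a value such as "yes:y" into the full word and its abbreviations."""
    parts = [part.strip() for part in value.split(":")]
    # Require a full word plus at least one non-empty abbreviation.
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected 'word:abbreviation', got {value!r}")
    return parts[0], parts[1:]


print(parse_posix_answer("yes:y"))
print(parse_posix_answer("no:n"))
```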
53 changes: 44 additions & 9 deletions docs/site/TEMP-TEXT-FILES/exemplars.txt
@@ -14,23 +14,58 @@ See the table on the left; you can copy an escape from the left column to insert
The ➖, ❰, and ❱ characters are chosen to be unusual, so that it is unlikely that they would be normally among the characters you would want to have in a set such as the punctuation characters used in your language.
You can add characters in any order: they'll be displayed in the default order for your locale. Exceptions are very large character sets like Korean Hangul, which use a code point order so that they can make use of the ➖ character.
In CLDR 43 and previous versions, a different format was used, one that required special "escapes" for certain characters and for strings. This caused problems for many people, and was replaced by the simpler format above.
Key to Escapes
Abbr. Code Point Name
❰TAB❱ U+0009 tab
❰LF❱ U+000A line feed
❰CR❱ U+000D carriage return
❰SP❱ U+0020 space
❰NSP❱ U+2009 narrow/thin space
❰NBSP❱ U+00A0 no-break space
❰NNBSP❱ U+202F narrow/thin no-break space
❰WNJ❱ U+200B allow line wrap after, aka ZWSP
❰WJ❱ U+2060 prevent line wrap
❰SHY❱ U+00AD soft hyphen
❰ZWNJ❱ U+200C cursive non-joiner
❰ZWJ❱ U+200D cursive joiner
❰ALM❱ U+061C Arabic letter mark
❰LRM❱ U+200E left-right mark
❰RLM❱ U+200F right-left mark
❰LRO❱ U+202D left-right override
❰RLO❱ U+202E right-left override
❰PDF❱ U+202C end override
❰BOM❱ U+FEFF byte-order mark
❰ANS❱ U+0600 Arabic number sign
❰ASNS❱ U+0601 Arabic sanah sign
❰AFM❱ U+0602 Arabic footnote marker
❰ASFS❱ U+0603 Arabic safha sign
❰SAM❱ U+070F Syriac abbreviation mark
❰KIAQ❱ U+17B4 Khmer inherent aq
❰KIAA❱ U+17B5 Khmer inherent aa
❰RANGE❱ U+2796 range syntax mark
❰ESCS❱ U+2770 escape start
❰ESCE❱ U+2771 escape end
❰…❱ U+… Other; … = hex notation
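As a rough illustration of how the escapes in this key map to characters, the Python sketch below expands a few of them. The `ESCAPES` table holds only a subset of the key, and treating any other name as bare hexadecimal digits is an assumption based on the "hex notation" row; `expand_escapes` is a hypothetical helper, not part of the Survey Tool.

```python
import re

# A subset of the abbreviations from the key above.
ESCAPES = {
    "TAB": "\u0009",
    "SP": "\u0020",
    "NBSP": "\u00A0",
    "NNBSP": "\u202F",
    "ZWJ": "\u200D",
    "LRM": "\u200E",
    "RLM": "\u200F",
}

# U+2770 starts an escape and U+2771 ends it, per the ESCS and ESCE rows.
ESCAPE_RE = re.compile("\u2770([^\u2770\u2771]+)\u2771")


def expand_escapes(text: str) -> str:
    """Replace each escape with the character it names."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in ESCAPES:
            return ESCAPES[name]
        # Assumption: anything else is a code point written in hex.
        return chr(int(name, 16))

    return ESCAPE_RE.sub(replace, text)


print(repr(expand_escapes("a\u2770NBSP\u2771b")))
```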
Examples
In the info panel, a mouse hover over the non-winning values shows a comparison to the Winning value. The ➕ { } indicates that { and } are additions to the Winning value, and ➖ ‐ – … ' ‘ ’ " “ ” § @ * / & # † ′ ″ indicates that ‐, –, …, and so on are subtractions from the Winning value. That makes it much easier to see what the difference in the outcome would be.
The very last line shows an internal UnicodeSet format. You can normally ignore this. However, if you want more details about the characters you can copy the [...] from that line in the Info Panel and paste that into the Input box on UnicodeSet (and hit Show Set) to see more information about the characters, such as [!(),-.\:;?\[\]\{\}‑].
Table of Contents
Format
Examples
Exemplar Characters
Parse Characters
Handling Warnings in Exemplar characters
Key to Escapes
Exemplar Examples
Exemplar Characters
The exemplar character sets contain the commonly used letters for a given modern form of a language. These are used for testing and for determining the appropriate repertoire of letters for various tasks, like choosing charset converters that can handle a given language. The term “letter” is interpreted broadly, and includes characters used to form words, such as 是 or 가. It should not include presentation forms, like U+FE90 ( ‎ﺐ‎ ) ARABIC LETTER BEH FINAL FORM, or isolated Jamo characters (for Hangul).
For charts of the standard (non-CJK) exemplar characters, see a chart of the standard exemplar characters.
For more information, please see Section 5.6 Character Elements in UTS#35: Locale Data Markup Language (LDML).
There are different categories:
Exemplar Examples
Category English Example Meaning
standard a b c d e f g h i j k l m n o p q r s t u v w x y z The minimal characters required for your language (other than punctuation).
The test to see whether or not a letter belongs in the main set is based on whether it is acceptable in your language to always use spellings that avoid that character. For example, English characters do not contain the accented letters that are sometimes seen in words like résumé or naïve , because it is acceptable in common practice to spell those words without the accents.
If your language has both upper and lowercase letters, only include the lowercase (and İ for Turkish and similar languages).
punctuation ‐ – — , ; : ! ? . … ‘ ' ’ ′ ″ “ " ” ( ) [ ] / @ & # § † ‡ * The punctuation characters customarily used with your language.
For example, compared to the English list, Arabic might remove ; , ? /, and add ؟ \ ، ؛.
Don't include purely math symbols such as +, =, ±, and so on.
auxiliary á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ Additional letters and punctuation (beyond the minimal set) used in foreign or technical words found in typical magazines, newspapers, &c.
For example, you could see the name Schröder in English in a magazine, so ö is in the set. However, it is very uncommon to see ł , so that isn't in the auxiliary set for English. Publication style guides, such as The Economist Style Guide for English, are useful for this.
If your language has both upper and lowercase letters, only include the lowercase (and İ for Turkish and similar languages).
index A B C D E F G H I J K L M N O P Q R S T U V W X Y Z The “shortcut” letters for quickly jumping to sections of a sorted, indexed list (for an example, see mu.edu).
The choice of letters should be appropriate for your language. Unlike the minimal or additional characters, it should have either uppercase or lowercase, depending on what is typical for your language (typically uppercase).
Parse Characters
These are sets of characters that are treated as equivalent in parsing. In the Code column you'll see a description of the characters with a sample in parentheses. For example, the following indicates that in date/time parsing, when someone types any of the characters in the Winning column, they should be treated as equivalent to ":".
Note that if your language doesn't use any of these characters in date and times, the value doesn't really matter, and you can simply vote for the default value. For example, if a time is represented by "3.20" instead of "3:20", then it doesn't matter which characters are equivalent to ":".
5 changes: 5 additions & 0 deletions docs/site/TEMP-TEXT-FILES/numbering-systems.txt
@@ -6,6 +6,11 @@ The default numbering system for a locale is the numbering system that is normal
The native numbering system for a locale is the numbering system used for native digits, and is normally in the script for the locale's language. Native numbering systems can only use numeric positional decimal digits, like for Latin numbers (0123456789). If the numbering system in your language uses an algorithm to spell out numbers in the language's script, label it as a traditional numbering system instead. The traditional numbering system does not need to be specified if it is the same as the native numbering system.
The default, native and traditional numbering systems for a locale may be different. For example, in Tamil the default numbering system is latn, the native numbering system is tamldec and the traditional numbering system is taml.
Codes are used to represent numbering systems in the Survey tool. Below are some examples of common codes:
Code Description Digits
arab Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩
fullwide Full width digits 0123456789
hant Traditional Chinese numerals — non-decimal algorithmic
latn Latin digits 0123456789
For further reference, see the complete list of numbering system codes and their corresponding rules.
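For decimal numbering systems such as latn and arab, rendering a number comes down to a digit-for-digit substitution. The Python sketch below illustrates this with two of the codes from the table; the `DIGITS` table and `to_digits` function are hypothetical names for this example, and algorithmic systems such as hant are deliberately out of scope.

```python
# Digit strings for two decimal numbering systems from the table above.
DIGITS = {
    "latn": "0123456789",
    # Arabic-Indic digits U+0660 through U+0669.
    "arab": "".join(chr(0x0660 + i) for i in range(10)),
}


def to_digits(number: str, system: str) -> str:
    """Convert Latin digits in number to the digits of the given system."""
    table = str.maketrans(DIGITS["latn"], DIGITS[system])
    return number.translate(table)


print(to_digits("2024", "arab"))
```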
Minimum digits for grouping
In some languages, the grouping separator is suppressed in certain cases. For example, see china-auf-wachstumskurs.gif, where there is a grouping separator in "12 080" but not in "4720". The minimumGroupingDigits determines what the default for a locale is. In this case the value should be "2" to illustrate that the separator only appears once the number of thousands goes into the double-digits (i.e. 10 thousand or above) and not for single-digit thousands (i.e. anything below 10 thousand).
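The effect of minimumGroupingDigits can be sketched in a few lines of Python. This is a simplified illustration that assumes groups of three digits and a space as the separator; `format_grouped` and its parameters are hypothetical names, not a CLDR or ICU API.

```python
def format_grouped(n: int, min_grouping_digits: int = 1, sep: str = " ") -> str:
    """Group digits in threes, but only for sufficiently long numbers."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    # The separator appears only when at least min_grouping_digits digits
    # would sit to the left of the first separator.
    if len(digits) < 3 + min_grouping_digits:
        return sign + digits
    groups = []
    while digits:
        groups.insert(0, digits[-3:])
        digits = digits[:-3]
    return sign + sep.join(groups)


print(format_grouped(4720, min_grouping_digits=2))
print(format_grouped(12080, min_grouping_digits=2))
```

With a value of 2, as in the example above, "4720" stays ungrouped while "12 080" gets a separator; with a value of 1, both numbers would be grouped.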
2 changes: 1 addition & 1 deletion docs/site/translation/core-data/characters.md
@@ -51,7 +51,7 @@ The English value is “?”, but another character might be better for your lan

## Delimiters

The delimiters are the characters used for quoting text. For example, for English they are the “curly” right and left forms as in **“this phrase.”** The alternate forms are for embedded quotations, such as ****He yelled **‘Stop!’**, and turned around.”
The delimiters are the characters used for quoting text. For example, for English they are the “curly” right and left forms as in **“this phrase.”** The alternate forms are for embedded quotations, such as He yelled **‘Stop!’**, and turned around.”

*BIDI languages (Arabic, Hebrew,…):*

3 changes: 2 additions & 1 deletion docs/site/translation/core-data/exemplars.md
@@ -62,7 +62,7 @@ Certain fields have _**sets**_ of characters (and strings) as values, called **U

In the info panel, a mouse hover over the non-winning values shows a comparison to the Winning value. The ➕ { } indicates that { and } are additions to the Winning value, and ➖ ‐ – … ' ‘ ’ " “ ” § @ \* / & # † ′ ″ indicates that ‐, –, …, and so on are subtractions from the Winning value. That makes it much easier to see what the difference in the outcome would be.

The very last line shows an internal UnicodeSet format. You can normally ignore this. However, if you want more details about the characters you can copy the [...] from that line in the Info Panel and paste that into the Input box on [UnicodeSet](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp) (and hit Show Set) to see more information about the characters, such as [[!(),-.\:;?\[\]\{\}]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B!%28%29,-.%5C:;?%5C%5B%5C%5D%5C%7B%5C%7D%E2%80%91%5D).
The very last line shows an internal UnicodeSet format. You can normally ignore this. However, if you want more details about the characters you can copy the [...] from that line in the Info Panel and paste that into the Input box on [UnicodeSet](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp) (and hit Show Set) to see more information about the characters, such as [[!(),-.\\:;?\\[\\]\\{\\}‑]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B!%28%29,-.%5C:;?%5C%5B%5C%5D%5C%7B%5C%7D%E2%80%91%5D).

![image](../../images/core-data/Screenshot-2024-06-27-at-3.59.26.png)

@@ -101,6 +101,7 @@ For example:

- Suppose the currency code XAF is translated as "Φράγκο BEAC CFA" in Greek. That raises a warning because the "BEAC CFA" are not in the Greek exemplars.
- Suppose that a currency symbol contains ৲ (BENGALI RUPEE MARK). That also raises a warning, even though it is a symbol and not a letter, because it has a script (Bengali).

Three possible solutions:

1. If the character really is used in the language, add it to the appropriate exemplar set (**standard, auxiliary,…**).
2 changes: 1 addition & 1 deletion docs/site/translation/core-data/numbering-systems.md
@@ -25,7 +25,7 @@ Codes are used to represent numbering systems in the Survey tool. Below are some
| hant | Traditional Chinese numerals — non-decimal | algorithmic |
| latn | Latin digits | 0123456789 |

For further reference, see the [complete list](http://www.unicode.org/repos/cldr/trunk/common/bcp47/number.xml) of numbering system codes and their corresponding[rules](http://www.unicode.org/repos/cldr/trunk/common/supplemental/numberingSystems.xml).
For further reference, see the [complete list](http://www.unicode.org/repos/cldr/trunk/common/bcp47/number.xml) of numbering system codes and their corresponding [rules](http://www.unicode.org/repos/cldr/trunk/common/supplemental/numberingSystems.xml).

## Minimum digits for grouping

