---
title: Transform keywords
---

# Transform keywords

An internet draft is now being developed; see http://tools.ietf.org/html/draft-davis-t-langtag-ext

**OLDER DRAFT**

We often need a language/locale identifier to indicate the source of transformed (transliterated or translated) content. For example, in map data, the names of Italian or Russian cities need to be represented in Katakana for Japanese users.

It is important to be able to indicate not only the resulting content language, but also the source. Transforms such as transliteration may vary not only on the basis of the source and target scripts, but also on the languages involved. Thus the Russian "Пу́тин" transliterates into "Putin" in English but "Poutine" in French. The identifier may be used to indicate a desired mechanical transliteration in an API, or it may be used to tag data that has been converted (mechanically or by hand) according to a transliteration method.

The specification uses the BCP47 extension 't' and the Unicode extension key "ts", such as in the following examples:

| | |
|---|---|
| ja-Kana-**t-it-u-ts-ungegn-2007** | the content is transliterated from Italian to Katakana (Japanese) according to the corresponding UNGEGN transliteration dated 2007 |
| und-Kana-**t-und-cyrl** | the content source was Cyrillic, translated or transliterated to Katakana, but the mechanism was not known (or not specified) |
| en-**t-fr-u-ts-mech** | the content was mechanically translated from French to English; the mechanism is unspecified |

The extension **t** indicates a source for the transformed content (transliterated or translated). It takes any Unicode language identifier, thus a subset of the registered BCP47 language tags:

- lang (-script)? (-region)? (-variant)\*

For script transliteration, or for specialized transform variants, the language tag 'und' is used. For example, a general romanization will have the language tag 'und-Latn'. The language tag 'und' is also used where the source or target is currently "Any" in CLDR. A new section of the LDML specification describes this tag.

The Unicode extension key **tm** is a keyword specifying the mechanism for the transform, where that is useful or necessary. It takes a sequence of subtypes, in order, that represent the transliteration variant. As usual, the subtypes will be listed in the bcp47 subdirectory of CLDR, adding a description field and optional aliases. The initial registrations are common transliteration standards such as ISO15915, KMOCT, ..., UNGEGN, ..., plus specialized variants. See http://www.unicode.org/reports/tr35/#Transforms.

Any final subtype of 4, 6, or 8 digits represents a date in the format yyyy(MM(dd)?)?, such as 2010, or 201009, or 20100930. So, for example, und-Latn-t-und-hebr-tm-ungegn-2007 represents the transliteration as described in http://www.eki.ee/wgrs/rom1\_he.htm. The date should only be used where necessary, and, if present, should only be as specific as necessary. So if the only dated variants for the given mechanism, source, and result are 1977 and 2007, the month and day in 2007 should not be present.
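
For illustration only, here is a minimal Python sketch (not part of the proposal) of checking a final subtype against the yyyy(MM(dd)?)? date format; the function name and sample values are ours:

```python
import re

# Matches yyyy, yyyyMM, or yyyyMMdd, e.g. "2010", "201009", "20100930".
DATE_SUBTAG = re.compile(r"^\d{4}(\d{2}(\d{2})?)?$")

def parse_date_subtag(subtag):
    """Return (year, month, day), with month/day possibly None,
    or None if the subtag is not a well-formed date."""
    if not DATE_SUBTAG.match(subtag):
        return None
    year = int(subtag[0:4])
    month = int(subtag[4:6]) if len(subtag) >= 6 else None
    day = int(subtag[6:8]) if len(subtag) == 8 else None
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None and not 1 <= day <= 31:
        return None
    return (year, month, day)

# The final subtype of und-Latn-t-und-hebr-tm-ungegn-2007:
print(parse_date_subtag("2007"))      # (2007, None, None)
print(parse_date_subtag("20100930"))  # (2010, 9, 30)
print(parse_date_subtag("ungegn"))    # None -- not a date subtype
```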

---
title: Unihan Data
---

# Unihan Data

## Background

In CLDR, we use this data for sorting and romanization of Chinese data. Both of these need to be weighted for proper names, since those are the items most commonly needed (contact names, map locations, etc.).

1. Sorting
1. A major collation for simplified Chinese compares characters first by pinyin, then (if the same pinyin) by total strokes. It thus needs the most common (simplified) pinyin values, and total (simplified) strokes.
2. A major collation for traditional Chinese compares characters by total (traditional) strokes. It needs reliable total (traditional) strokes.
3. For both of these, we use the Unicode radical-stroke value (kRSUnicode) as a tie-breaker; a sketch of this comparison follows the list. The pinyin values need to be the best single-character readings (without context).
2. Romanization
1. We need to have the most common pinyin values. These can have contextual readings (e.g. readings that span more than one character).
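
As referenced above, here is a minimal Python sketch of the simplified-Chinese sort key from item 1. The data tables are hypothetical placeholders for what is actually loaded from Unihan (kMandarin, kTotalStrokes, kRSUnicode) plus the patch files, and the naive string comparison of pinyin stands in for real pinyin collation:

```python
# Placeholder per-character data; in practice this comes from Unihan
# plus the patch files listed below.
PINYIN = {"中": "zhōng", "文": "wén", "们": "men"}
TOTAL_STROKES = {"中": 4, "文": 4, "们": 5}               # kTotalStrokes
RS_UNICODE = {"中": (2, 3), "文": (67, 0), "们": (9, 3)}  # kRSUnicode (radical, residual)

def sort_key(ch):
    # Compare by pinyin, then total strokes, then radical-stroke.
    return (PINYIN.get(ch, "\uffff"),        # unknown readings sort last
            TOTAL_STROKES.get(ch, 10**6),
            RS_UNICODE.get(ch, (10**6, 0)))

print(sorted("文中们", key=sort_key))  # ['们', '文', '中']: men < wén < zhōng
```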

## Tool

There is a file called **GenerateUnihanCollators.java** which is currently used to generate the CLDR data, making use of Unihan data plus some special data files. The code is pretty crufty, since it was mostly designed to synthesize data from different sources before kMandarin and kTotalStrokes were expanded in Unihan. It is currently in the unicodetools project since it needs to be run against draft versions of the UCD.

As input, it uses the Unicode properties, plus the following:

- [bihua-chinese-sorting.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/bihua-chinese-sorting.txt)
- [CJK\_Radicals.csv](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/CJK_Radicals.csv)
- [patchPinyin.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchPinyin.txt)
- [patchStroke.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchStroke.txt)
- [patchStrokeT.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchStrokeT.txt)
- [pinyinHeader.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/pinyinHeader.txt)

It creates a number of files under {Generated}/cldr/han/ (e.g. kMandarin.txt). To apply them:

1. Take Han-Latin.txt, and insert it into /cldr/common/transforms/Han-Latin.xml, replacing the lines between
- \# START AUTOGENERATED Han-Latin.xml
- \# END AUTOGENERATED Han-Latin.xml
2. Diff to sanity check. Run the Transform tests (or just all of them), then check in.
3. Take the strokeT\*.txt and pinyin\*.txt files, and insert them into the appropriate slots:
    1. pinyin.txt → \# START AUTOGENERATED PINYIN LONG (sort by pinyin then kTotalStrokes then kRSUnicode)
    2. pinyin\_short.txt → \# START AUTOGENERATED PINYIN SHORT (sort by pinyin then kTotalStrokes then kRSUnicode)
    3. strokeT.txt → \# START AUTOGENERATED STROKE LONG
    4. strokeT\_short.txt → \# START AUTOGENERATED STROKE SHORT
4. Diff to sanity check.
5. Run tests, check in.
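
The insertion steps above are mechanical; here is a minimal Python sketch, under the assumption that each target file contains exactly one START/END marker pair (the paths in the commented example are illustrative):

```python
def replace_between(path, start_marker, end_marker, new_lines):
    """Replace the lines strictly between start_marker and end_marker."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    start = next(i for i, line in enumerate(lines) if start_marker in line)
    end = next(i for i, line in enumerate(lines)
               if end_marker in line and i > start)
    lines[start + 1:end] = new_lines
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)

# e.g., splicing the generated Han-Latin rules into the transform file:
# with open("Generated/cldr/han/Han-Latin.txt", encoding="utf-8") as f:
#     generated = f.readlines()
# replace_between("cldr/common/transforms/Han-Latin.xml",
#                 "# START AUTOGENERATED Han-Latin.xml",
#                 "# END AUTOGENERATED Han-Latin.xml",
#                 generated)
```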

The tool also generates some files that we should take back to the Unihan people. Either changes should be made in Unihan, or we should drop the items from our patch files. Examples:

1. [kTotalStrokesReplacements.txt](http://unicode.org/repos/cldr-tmp/trunk/dropbox/han/kTotalStrokesReplacements.txt)

It shows the cases where the bihua values differ from the Unihan values.

2. [imputedStrokes.txt](http://unicode.org/repos/cldr-tmp/trunk/dropbox/han/imputedStrokes.txt)

It shows the cases where a stroke count is synthesized from radical/stroke information. This is only approximate, but better than sorting them all at the bottom. It is only used if there is no Unihan or bihua information.

### Stopgap

As a proxy for the best pinyin, we pick from the many pinyin choices in Unihan using an algorithm that Richard supplied. There is a small patch file based on having native Chinese speakers look over the data. Any patches should be pulled back into Unihan. The algorithm is:

Take the first pinyin from the following sources, in order. Where there are multiple choices in a field, use the first:

1. patchFile
2. kMandarin // moved up in CLDR 30.
3. kHanyuPinlu
4. kXHC1983
5. kHanyuPinyin
6. bihua

Then, if the reading is still missing, try to map to a character that does have a pinyin, using the following mappings in order. If we find one, stop and use it (a sketch of the full selection follows the list).

1. Radical => Unified
2. kTraditionalVariant
3. kSimplifiedVariant
4. NFKD
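
A minimal Python sketch of this selection order, with placeholder data standing in for the patch file and the Unihan fields (the real tool is GenerateUnihanCollators.java):

```python
import unicodedata

# Ordered pinyin sources; each maps a character to its list of readings.
# All data here is placeholder -- in practice it is parsed from the patch
# file and the Unihan fields named below.
SOURCES = [
    ("patchFile",    {}),
    ("kMandarin",    {"中": ["zhōng", "zhòng"]}),  # moved up in CLDR 30
    ("kHanyuPinlu",  {}),
    ("kXHC1983",     {}),
    ("kHanyuPinyin", {}),
    ("bihua",        {}),
]

# Fallback mappings (also placeholders), tried in order when no source
# has a reading for the character.
RADICAL_TO_UNIFIED = {"⼈": "人"}   # Kangxi radical -> unified ideograph
TRADITIONAL_VARIANT = {}           # kTraditionalVariant
SIMPLIFIED_VARIANT = {}            # kSimplifiedVariant

def best_pinyin(ch, seen=frozenset()):
    for _name, table in SOURCES:
        readings = table.get(ch)
        if readings:
            return readings[0]     # multiple choices in a field: use the first
    for mapping in (RADICAL_TO_UNIFIED, TRADITIONAL_VARIANT, SIMPLIFIED_VARIANT):
        alt = mapping.get(ch)
        if alt and alt not in seen:
            result = best_pinyin(alt, seen | {ch})
            if result:
                return result
    nfkd = unicodedata.normalize("NFKD", ch)
    if nfkd != ch and nfkd not in seen:
        return best_pinyin(nfkd, seen | {ch})
    return None

print(best_pinyin("中"))  # zhōng -- the first kMandarin reading
print(best_pinyin("⼈"))  # None -- the mapped 人 has no reading in this placeholder data
```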

## OLD

~~**DRAFT!!**~~

~~In 1.9, we converted to using Unihan data for CLDR collation and transliteration. We've run into some problems (pedberg - see for example [#3428](http://unicode.org/cldr/trac/ticket/3428)), and this is a draft proposal for how to resolve them.~~

### Longer Term

~~The following are (draft) recommendations for the UTC.~~

1. ~~Define the kMandarin field to contain one or two values. If there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). If the values would be the same, there is only one value. (pedberg - it is already defined that way)~~
2. ~~The preferred value should be the one that is most commonly used, with a focus on proper names (persons or places). For example, if reading X has 30% of the frequency of Y, but X is used with proper names but Y is not, X would be preferred.~~
3. ~~Define the kTotalStrokes field to be what is most appropriate for use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to be what is most appropriate for use with zh-Hans. pedberg- The kTotalStrokes field is already defined to be the value "for the character as drawn in the Unicode charts" which may not match the value for zh-Hant; we may need to add 2 stroke count fields.~~
4. ~~Get a commitment from the IRG to supply these values for all new characters. Set in place a program to add/fix values for existing characters.~~

~~Once this is in place, remove the now-superfluous patch files in the CLDR collation/transliteration generation.~~

### Short Term (1.9.1)

1. ~~Modify the pinyin to choose the 1.8 CLDR transliteration value first, then fall back to the others.~~
2. ~~Have two transliteration pinyin variants: Names and General. Make the default for pinyin be "Names". (There are only currently 2 differences.) (pedberg - Yes, but there is a ticket to add more, see [#3381](http://unicode.org/cldr/trac/ticket/3381), which covers some of the problems from #3428 above)~~
3. ~~Use the default pinyin for collation.~~
4. ~~Add two total-strokes patch files for the collation generator, one for simplified and one for traditional.~~
5. ~~In the generator, have two different total-strokes used for simplified vs traditional.~~

~~pedberg comments:~~

1. ~~We need to ensure that the transliteration value is consistent with the pinyin collator.~~
2. ~~The 1.8 transliterator had many errors, I don't think a wholesale fallback to that is a good idea.~~
3. ~~Using the name reading rather than the general reading for standard pinyin collation might produce unexpected results.~~
4. ~~Why not just specify the name reading when that is desired? No need to make it the default if it is the less common reading.~~

---
title: "Units: pixels, ems, display resolution"
---

# Units: pixels, ems, display resolution

| | |
|---|---|
| Author | Peter Edberg |
| Date | 2019-05-21 |
| Status | Proposal |
| Feedback to | pedberg (at) unicode (dot) org |
| Bugs | [CLDR-9996](https://unicode-org.atlassian.net/browse/CLDR-9996) Add units for pixels, ems, and display resolution<br /> [CLDR-8076](https://unicode-org.atlassian.net/browse/CLDR-8076) Proposal: add units for dot density and pixel density |

This is a proposal to add 7 units related to graphics, imaging, and typography.

| unit | abbr (en) | proposed category | notes |
|---|---|---|---|
| pixel<br /> megapixel | px<br /> MP | new category? | A pixel is the smallest resolvable element of a bitmap image. In many usages it does not have a fixed size, and is just used for counting (e.g. an image has 360 pixels vertically and 480 horizontally). In some graphic usages it is specifically 1/96 inch. In CSS it is the smallest resolvable element on a display, but specifically means 1/96 inch on a printer. Thus sometimes it is a property associated with an image, and sometimes with a device. |
| dots-per-inch<br /> dots-per-centimeter | dpi<br /> dpcm | new category?<br /> or concentr | A dot is the smallest displayable element on a device, typically used for printers. Measurements using dots per inch or centimeter are used to indicate printer resolution, and are a kind of linear density. |
| pixels-per-inch<br /> pixels-per-centimeter | ppi<br /> ppcm | new category?<br /> or concentr | Measurements using pixels per inch or centimeter are sometimes used to indicate display resolution, and are a kind of linear density. |
| em | em | new category? or length | A typographic em is a unit of length that corresponds to the point size of a font (so it does not have a fixed size). |

We could consider adding a new category for these, say “graphics”; otherwise some of them do not fit reasonably into any existing category.
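
As an illustrative aside (not part of the proposal), the fixed relationships mentioned above give straightforward conversions; a minimal sketch:

```python
CM_PER_INCH = 2.54

def dpi_to_dpcm(dpi):
    """Dots per inch -> dots per centimeter (same math for ppi -> ppcm)."""
    return dpi / CM_PER_INCH

def css_px_to_inches(px):
    """CSS reference pixels -> inches, using the CSS definition 1px = 1/96 in."""
    return px / 96.0

print(round(dpi_to_dpcm(300), 2))  # 118.11 -- a 300 dpi printer
print(css_px_to_inches(96))        # 1.0
```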

Per TC meeting 2019-05-22:

- Use singular for the internal key name, e.g. “dot-per-inch”
- Put all of these in a new category “graphics”

Some reference material:

- https://www.w3.org/Style/Examples/007/units.en.html
- https://en.wikipedia.org/wiki/Pixel
- https://en.wikipedia.org/wiki/Display_resolution
- https://en.wikipedia.org/wiki/Dots_per_inch
- https://en.wikipedia.org/wiki/Pixel_density
- https://en.wikipedia.org/wiki/Em_(typography)

---
title: "UTS #35 Splitting"
---

# UTS #35 Splitting

Strawman here for discussion.

1. Divide up the spec by functional lines:
- Dates and Times
- Numbers & Currencies
- Collation
- ...
- Misc.
- Other supplemental data
- Supplemental metadata

Important features:

- Collaboration
- Many authors
- Cheap tools, accessible to everyone
- Easy to edit.
- Must be able to snapshot.
- Stylesheets (or equivalent mechanisms) are critical.
- ...

2. Options.
1. Use HTML, but break into Part1, Part2, .... Still have to muck with tagging; non-WYSIWYG editing.
2. Eric strongly recommends DocBook (see http://wiki.docbook.org/topic/DocBookAuthoringTools).
3. Ask Richard Ishida about how W3C documents work. [Mark]
4. Use Sites for the subdocuments. We've done this in ICU, and it makes it easier to edit, and thus easier to add new material.

The release would consist of taking a snapshot of the site and copying it to a different number (e.g. ldmlspec2.1).

4.1. There is a *rough* prototype:
1. http://sites.google.com/site/ldmlspec/home?previewAsViewer=1
2. http://unicode.org/repos/cldr-tmp/trunk/dropbox/mark/LDML.1.pdf
3. http://sites.google.com/site/ldmlspec/home

4.2. Discussion
1. Mark to look at whether we can make a copy for a snapshot of a version. DONE (easy to do)
2. Advantages:
1. any of us can edit easily
3. Disadvantages:
1. Numbering couldn't be within chapters (e.g. Chapter 2, section 1 would be numbered just "1", not "2.1")
1. Could only approximate the TR format.
2. CSS doesn't yet work.

---
title: Voting
---

# Voting

We have been trying to tune the voting process; see [2260](http://www.unicode.org/cldr/bugs-private/locale-bugs-private?findid=2260). The goal is to balance stability with flexibility.

## Issues

1. Occurrence of "draft" in country locales: http://www.unicode.org/cldr/data/dropbox/misc/draft\_in\_country\_locales.txt
2. And non-country locales: http://www.unicode.org/cldr/data/dropbox/misc/draft\_in\_noncountry\_locales.txt
3. Gratuitous changes in country locales: pt\_PT has many such changes relative to pt (= pt\_BR)
4. We want people to be able to make fixes where needed (it is frustrating to request a fix, but not have it confirmed because people don't look at it)
5. But we don't want to have "wiki-battles": how do we "protect" against bogus data, while allowing needed changes?

## Suggestions

1. Set a higher bar
    1. on changes to "critical" locales (union of the principals' main tiers, intersected with those that are fleshed out well); see http://www.unicode.org/cldr/data/charts/supplemental/coverage\_goals.html
    2. on country locales.
2. Allow multiple separate votes for TC organizations for non-critical locales. For Vetter status, two sets of eyes should be sufficient. Downside is "deck-stacking".
3. Vote on "logical groups" (e.g. sets of months) as a whole.
4. Show country locale differences as "alts".
5. Save votes for "active members" across releases. See [2095](http://www.unicode.org/cldr/bugs-private/locale-bugs-private/design?id=2095;_=). (Not feasible for this release.)

## Background

Our current voting process is at http://cldr.unicode.org/index/process#TOC-Draft-Status-of-Optimal-Field-Value

The key points are:

- For each value, each organization gets a vote based on the *maximum* (not cumulative) strength of the votes of its users who voted on that item.
- If there is a dispute (votes for different values) within an organization, then the majority vote for that organization is chosen. If there is a tie, then no vote is counted for the organization.
- Let **O** be the optimal value's vote, **N** be the vote of the next best value, and **G** be the number of organizations that voted for the optimal value.
- Assign the draft status according to the first of the conditions below that applies:

| Resulting Draft Status | Condition |
|---|:---:|
| *approved - critical loc* | **O ≥ 8** and O > N |
| *approved - non-critical loc* | **O ≥ 4** and O > N |
| *contributed* | O ≥ 2 and O > N and G ≥ 2 |
| *provisional* | O ≥ 2 and O ≥ N |
| *unconfirmed* | otherwise |

- If the draft status of the previously released value is better than the new draft status, then no change is made. Otherwise, the optimal value and its draft status are made part of the new release.
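
For illustration, a minimal Python sketch of the computation described above; `critical_locale` is passed in by the caller, and the organization-level resolution assumes votes arrive as (value, strength) pairs:

```python
from collections import Counter

def org_vote(user_votes):
    """Resolve one organization's vote from its users' (value, strength) votes.
    Returns (value, strength) or None on an intra-organization tie."""
    if not user_votes:
        return None
    counts = Counter(value for value, _ in user_votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                      # tie: the organization casts no vote
    winner = counts[0][0]
    # Maximum (not cumulative) strength among the users who voted for it.
    strength = max(s for value, s in user_votes if value == winner)
    return (winner, strength)

def draft_status(o, n, g, critical_locale):
    """o: optimal value's vote, n: next-best vote, g: organizations for optimal."""
    threshold = 8 if critical_locale else 4
    if o >= threshold and o > n:
        return "approved"
    if o >= 2 and o > n and g >= 2:
        return "contributed"
    if o >= 2 and o >= n:
        return "provisional"
    return "unconfirmed"

print(org_vote([("A", 4), ("A", 1), ("B", 4)]))            # ('A', 4) -- majority wins
print(draft_status(o=8, n=4, g=2, critical_locale=True))   # approved
print(draft_status(o=4, n=4, g=2, critical_locale=False))  # provisional (tie)
```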

In our previous version, *approved* required O ≥ 8.
