CLDR-17566 converting design proposals p8 (#3853)
chpy04 authored Jul 15, 2024
1 parent a8aabd3 commit 8a41de0
Showing 6 changed files with 374 additions and 0 deletions.
@@ -0,0 +1,59 @@
---
title: Post Mortem
---

# Post Mortem

Drivers marked in [...]. Drivers are to file bugs, put together plan for how to handle.

**Post Mortem (Phase I - not translators)**

1. Spec done late, little review, little time for public review. We should have a formal release and announcement ahead of time. Splitting up the spec into functional pieces, maintained in Sites, could let us put the introductory info into the pieces at the start. Move the design doc into the spec once approved. Point the survey tool to relevant parts of the spec. Sync up spec for each milestone. **[Peter, Mark]**
2. How to get metazone translated. Not enough coverage. We ask for translations that are not needed in the survey tool; those also contribute to the lack of coverage.
    1. Make coverage be data driven.
    2. Make metazone coverage exclude items that never need coverage; be dependent on the language (and the territories that use it). [http://cldr.unicode.org/development/design-proposals/coverage-revision](http://cldr.unicode.org/development/design-proposals/coverage-revision)
    3. Make coverage visible in ST.
    4. Make metazones easier to understand in the ST. Better examples of how they affect zones.
    5. CommonlyUsed can't be effectively entered in ST.
    6. **[John, (Mark for Coverage)]**
3. Avoid Xmas; don't do release in March. Release "clean up" tools release in Oct, include U6.0; have regular release in next June. The Oct release can have data changes, but it wouldn't use the ST to gather data. Agreement on that as tentative dates. **[DONE]**
4. Flexible data formatting also needs better presentation, and examples. **[Chris, Peter; #**[**2133**](http://unicode.org/cldr/trac/ticket/2133)**]**
5. Bulk import problems. Major problem was timing; need to come in at the latest a few weeks after data submission. Got bad data. Some language tags wrong (both inside the XML and as the name of the files); syntax wrong, choice of tags, and using wrong aliases. Character encoding problems; not UTF-8 or mixed. Need a tool that is more strict than regular tests, that prevents bulk import if there are problems. Need better gatekeeper for both checkin to SVN and import into ST. Need clearer policy on bulk import.
    1. [2424](http://unicode.org/cldr/trac/ticket/2424) - JSP to test bulk import
    2. [2579](http://unicode.org/cldr/trac/ticket/2579#comment:1) - comments on ConsoleCheckCLDR vs bulk import
    3. Example of bulk import; large changes of structure, example: casing changes.
    4. No bulk imports after certain date.
    5. Add comments on bulk changes; attached to each change, eg proposed-u666-r12 (r12 points to a string with background).
        1. srl: consider Wikipedia's [Bot Policy#Good Communication](http://en.wikipedia.org/wiki/Wikipedia:Bot_policy#Good_communication)
    6. **Delay until 2.0 (no bulk except brand-new locales). [John to file bug].**
6. Late implementation of new voting rules. Anything that changes what gets marked as "approved" must be done before data submission. ?Tune voting rules for new structure? **[Done]**
7. Quality suffered through using outside contractors. **[Done]**
8. Need tighter control of commitments vs. milestones. Reviews don't get done until it's too late to do anything about it. BRS for each milestone. Define what the milestones mean. Don't move ahead until all criteria for milestone are met, including reviews. Balance reviews each week also. **[Done]**
9. Tests and tools
    1. Clean up the unit tests; have regular suite that can be mandated (at least for quick check). **[Umesh]**
    2. Better tools in trac/svn, things that haven't been ported from ICU (code review). **[Steven]**
    3. Need automated build in CLDR code, one that runs tests. **[Yoshito]**
    4. Would like to be able to also build/run ICU tests at the same time. **[Punt]** *(or file a bug for Steven to document somewhere how to do from the command-line? not a high priority)*
    5. When do we take drops of ICU4J? **[Yoshito to write page for process]**
    6. Need to control when we move to different versions of trunk, so we can do it at the same time. **[Under control]**
10. LDML2ICUConverter - need to recode for clarity... Get rid of the DOM-based code (LDMLUtilities). [Non-supplemental needs work.]
    1. General cleanup of supplemental data handling, coverage/filtering **[Mark]**
    2. Create staging plan **[John]**
    3. Don't have overall picture of utilities and how and why they are used. **[Punt]**

**Phase 2**

1. PM phase 2 (from translators)
    1. Voting issues
        1. new items had too high a threshold, remained provisional
        2. changes did not get approved
        3. allow new items without 8 votes should help, but other problems because of too few organizations.
    2. **[Chrish]**
    3. Arabic; difficult to work with certain data, because they don't know what they will look like in different contexts. Primarily tools. (Mark: maybe show examples in both RTL & LTR cells?) **[Peter, Chris; #**[**2133**](http://unicode.org/cldr/trac/ticket/2133)**]**
    4. Available formats, Interval formats. Vetters can't see what the effects are. Apple has internal tool; maybe integrate in ST? **[Peter, Chris]**
    5. Russian/Catalan. Difficulty in making bulk changes (eg make changes to large number of fields). **[Chris]**
    6. All like the quick steps; **(thanks to Steven and John).**
    7. Leverage QuickSteps in rest of survey tool; allow Example column in QS, etc. **[Steven, John]**
    8. Vetters used to better tools, able to group. (sort/group/filter) **[Steven, John]**

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
@@ -0,0 +1,88 @@
---
title: Proposed Collation Additions
---

# Proposed Collation Additions

| | |
|---|---|
| Author | Mark Davis, Markus Scherer, Michael Fairley |
| Date | 2009-06-23 |
| Status | Proposal |
| Bugs | *insert linked bug numbers here* |

## Script Reordering

We would like to add script reordering as a new collation setting. This will allow, for example, sorting Greek before Latin, and digits after all letters, without listing all affected characters in the rules. Since this is a parameter, it can also be changed at runtime without changing any rules.

This will be implemented via a permutation table for primary collation weights. See the original (somewhat outdated) ICU collation design doc for reference:

http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU\_collation\_design.htm#Script\_Order

### Proposed LDML syntax:

Add the '**kr**' key, with an ordered list of script names as its types, in the order they should be sorted. For example, to specify an ordering of Greek, followed by Latin, followed by everything else (Zzzz = unknown), with digits (Zyyy = Common) last, the following would be used: **el-u-kr-grek-latn-zzzz-zyyy**. That would modify the ordering found on [http://unicode.org/charts/collation/](http://unicode.org/charts/collation/) in the following way:

- OLD
- [Null](http://unicode.org/charts/collation/chart_Null.html) [Ignorable](http://unicode.org/charts/collation/chart_Ignorable.html) [Variable](http://unicode.org/charts/collation/chart_Variable.html) [Common](http://unicode.org/charts/collation/chart_Common.html) [Latin](http://unicode.org/charts/collation/chart_Latin.html) [Greek](http://unicode.org/charts/collation/chart_Greek.html) [Coptic](http://unicode.org/charts/collation/chart_Coptic.html) ... [CJK](http://unicode.org/charts/collation/chart_CJK.html) [CJK-Extensions](http://unicode.org/charts/collation/chart_CJK-Extensions.html) [Unsupported](http://unicode.org/charts/collation/chart_Unsupported.html)
- NEW
- [Null](http://unicode.org/charts/collation/chart_Null.html) [Ignorable](http://unicode.org/charts/collation/chart_Ignorable.html) [Variable](http://unicode.org/charts/collation/chart_Variable.html) [Greek](http://unicode.org/charts/collation/chart_Greek.html) [Latin](http://unicode.org/charts/collation/chart_Latin.html) [Coptic](http://unicode.org/charts/collation/chart_Coptic.html) ... [CJK](http://unicode.org/charts/collation/chart_CJK.html) [CJK-Extensions](http://unicode.org/charts/collation/chart_CJK-Extensions.html) [Unsupported](http://unicode.org/charts/collation/chart_Unsupported.html) [Common](http://unicode.org/charts/collation/chart_Common.html)
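
The OLD/NEW lists above can be reproduced with a small sketch. This is only an illustration of the reordering idea, not the ICU implementation: the group list is abbreviated from the chart names above, the `CODE_TO_GROUP` mapping and the function name are hypothetical, and the behavior when `zzzz` is omitted (remaining groups follow the named ones) is an assumption the proposal does not spell out.

```python
# Abbreviated default group order, taken from the chart list above.
DEFAULT_GROUPS = [
    "Null", "Ignorable", "Variable", "Common", "Latin", "Greek",
    "Coptic", "CJK", "CJK-Extensions", "Unsupported",
]

# Groups assumed (for this sketch) never to move.
FIXED_PREFIX = ["Null", "Ignorable", "Variable"]

# Hypothetical mapping from 'kr' types to chart groups; illustration only.
CODE_TO_GROUP = {
    "zyyy": "Common", "latn": "Latin", "grek": "Greek",
    "copt": "Coptic", "hani": "CJK",
}


def reorder_groups(kr_types):
    """Return the group order implied by a list of 'kr' types."""
    named = [CODE_TO_GROUP[c] for c in kr_types if c != "zzzz"]
    # Everything not fixed and not named keeps its default relative order.
    rest = [g for g in DEFAULT_GROUPS
            if g not in FIXED_PREFIX and g not in named]
    if "zzzz" in kr_types:
        i = kr_types.index("zzzz")
        before = [CODE_TO_GROUP[c] for c in kr_types[:i]]
        after = [CODE_TO_GROUP[c] for c in kr_types[i + 1:]]
    else:
        # Assumption: without zzzz, unnamed groups follow the named ones.
        before, after = named, []
    return FIXED_PREFIX + before + rest + after


def permutation_table(new_order):
    """Map each default group index to its position in the new order."""
    return [new_order.index(g) for g in DEFAULT_GROUPS]


new_order = reorder_groups(["grek", "latn", "zzzz", "zyyy"])
print(" ".join(new_order))
```

A real implementation would permute primary weight lead bytes rather than group names; the names are used here only to keep the correspondence with the charts visible.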

***Issue:*** *do we still want Unsupported at the very end??*

The 'digitaft' type for the 'co' key is no longer needed, and can be deprecated (with some minor changes to data).

Add an additional attribute, **scriptReorder**, to **\<settings>**. Its value will be the script names separated by spaces, in the order they should be sorted. The script code **Zzzz** stands for "any other script", and the script code **Zyyy** stands for Common.

Example:

\<settings scriptReorder="grek latn zzzz zyyy">

Note: after looking at the data, I'm thinking that we might want to change the above:

- allow codes that are not just script codes; in particular, Sc and Nd.
- note that implicit is always at the end; thus there would be no code to specify it, so that someone can't try to put something after it.
- Add that if the same script is specified twice in the list, the second wins.
- we also need to warn people that depending on the implementation, specifying a script may drag along others. In particular, historic scripts may be grouped together.
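
As a rough sketch of the "second wins" rule in the notes above, the attribute value could be normalized as follows. The function name is hypothetical, and reading "the second wins" as "the later position is kept" is an assumption.

```python
def normalize_script_reorder(value):
    """Split a scriptReorder value and resolve duplicate codes.

    Assumption: when a code appears twice, the later occurrence decides
    its position ("the second wins").
    """
    result = []
    for code in value.lower().split():
        if code in result:
            result.remove(code)  # drop the earlier occurrence
        result.append(code)
    return result


print(normalize_script_reorder("grek latn grek zzzz"))
```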

See http://site.icu-project.org/design/collation/script-reordering

### Proposed LDML BCP47 subtag syntax changes:

To allow a key to have multiple types (for listing multiple script codes), change:

extension = key "-" type

to

extension = key ("-" type)+
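
To illustrate the revised grammar, here is a minimal parsing sketch. It assumes the usual Unicode locale extension conventions (keys are two characters, types are longer, and any other single-character subtag starts a different extension); the function name is hypothetical and no validation of the subtags is attempted.

```python
def parse_u_extension(tag):
    """Return {key: [type, ...]} for the -u- extension of a language tag."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}
    result = {}
    key = None
    for sub in subtags[subtags.index("u") + 1:]:
        if len(sub) == 1:
            break  # another singleton begins a different extension
        if len(sub) == 2:
            key = sub  # a two-character subtag starts a new key
            result[key] = []
        elif key is not None:
            result[key].append(sub)  # one of possibly several types
    return result


print(parse_u_extension("el-u-kr-grek-latn-zzzz-zyyy"))
```

Under the old grammar each key would carry exactly one type; the list value per key is what the `("-" type)+` change makes possible.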

## Collation Import

We want to add the ability for collation to "import" rules from another collator. This provides two useful features:

- Many European languages can import a common collation for the [European Ordering Rules](http://anubis.dkuug.dk/CEN/TC304/EOR/eorhome.html) and then add language-specific rules on top of that.
- For CJK Unihan variant collation orderings, the large common suffix with the Unihan ordering can be shared.

This should reduce the maintenance burden and make total storage of the collation rule strings significantly smaller.

### Proposed LDML syntax:

Add an **\<import>** tag within collation **\<rules>** with two attributes, **source**, to identify the locale to import from (mirroring \<alias>'s source), and **type**, to identify which collator within the locale to include.

Examples:

\<import source="und\_hani">

\<import source="de" type="phonebk">

Add **private** as an additional attribute for \<settings>:

\<settings private="true"> // mirroring \<transform>'s private attribute

This attribute indicates to clients that the collation is intended only for \<import>, and should not be available as a stand-alone collator or listed in available collator APIs.
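
The import mechanism can be pictured as a recursive expansion of rule lists. The sketch below is an assumption-laden illustration: the in-memory `COLLATIONS` store, the placeholder rule strings, the locale `xx`, the `eor` type name, and the `PRIVATE` set are all hypothetical and are not CLDR data.

```python
# Hypothetical store: (locale, type) -> list of rule strings or import tuples.
COLLATIONS = {
    ("und", "eor"): ["<placeholder EOR rules>"],
    ("de", "phonebk"): ["<placeholder phonebook rules>"],
    ("xx", "standard"): [
        ("import", "und", "eor"),
        ("import", "de", "phonebk"),
        "<placeholder language-specific rules>",
    ],
}

# Collators intended only as import sources (hypothetical).
PRIVATE = {("und", "eor")}


def expand(locale, ctype="standard", seen=()):
    """Return the rule list with every import replaced by its target rules."""
    key = (locale, ctype)
    if key in seen:
        raise ValueError("import cycle at %s/%s" % key)
    rules = []
    for item in COLLATIONS[key]:
        if isinstance(item, tuple):
            _, source, source_type = item
            rules.extend(expand(source, source_type, seen + (key,)))
        else:
            rules.append(item)
    return rules


def available_collators():
    """List collators that clients may use stand-alone."""
    return sorted(k for k in COLLATIONS if k not in PRIVATE)


print(expand("xx"))
print(available_collators())
```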

**Update CLDR 26 (2014)**: A collation type is marked "private" via a type naming convention, rather than an attribute, so that it is easy for an implementation to omit such a type from a list of available types without reading its data. See [CLDR ticket #3949 comment:18](http://unicode.org/cldr/trac/ticket/3949#comment:18).


![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
@@ -0,0 +1,74 @@
---
title: Resolution of CLDR files
---

# Resolution of CLDR files

These are some notes on how CLDR files are resolved. It is a basic description of the process to help people to understand how the code works. Some of the details may be a bit off if the code changes. The code could undoubtedly be improved both for efficiency and maintainability – this is just an attempt to document what happens now.

If significant changes are made, please update this document.

## XMLSource

Behind each CLDR file is an XMLSource, which manages access to the data. It provides for iteration through all of the distinguished paths in the file, and getting the values associated with each such path. While there is more than one item associated with the path (the value, comments, and the full path), we'll focus on the element value.

There are three main implementations:

1. **simple file access** - uncomplicated, just reads an XML file and produces a map from paths to values.
2. **survey tool access** - accesses a database for use in the survey tool
3. **resolving access** - produces a resolving XMLSource, based on one of the first two. That is, it has a main file which is one of the first two, and can also create any needed other one as necessary: a parent locale, or a locale pointed to by an alias.

The resolution process is fairly complicated. The main issues are lookup and iteration. Of them, iteration is somewhat harder. However, due to aliases being restricted to the root locale since CLDR 2.0, the process has been made a lot easier.

## Lookup

Lookup would be easy if it weren't for aliases. Here's how it works.

**Start with the main file.**

1. Look in the file. If found, return the value.
2. If the value is not found, look in the parent recursively.
3. *If not found when root has been reached and checked, see if the path has an alias in root.*
    1. Example:
        1. looking up
        2. //ldml/dates/calendars/calendar[@type="gregorian"]/dayPeriods/dayPeriodContext[@type="stand-alone"]/dayPeriodWidth[@type="narrow"]/dayPeriod[@type="am"]
            1. will match
        3. //ldml/dates/calendars/calendar[@type="gregorian"]/dayPeriods/dayPeriodContext[@type="stand-alone"]/**alias**[@source="**locale**"][@path="**../dayPeriodContext[@type='format']**"]
        4. If so, construct two items:
            1. sourceLocale
            2. resolvedPath
        5. Example from above:
        6. //ldml/dates/calendars/calendar[@type="gregorian"]/dayPeriods/dayPeriodContext[@type="**format**"]/dayPeriodWidth[@type="narrow"]/dayPeriod[@type="am"]
        7. Recursively look up the path in the source locale, and return the value.
            1. Note that a locale of "locale" means to look up in the *original* locale (of the main file).
4. Repeat from step 1 using the resolvedPath until a value is reached or no more aliases are found.
5. If not found, there is a special file that is algorithmically constructed called CODE-FALLBACK, so look there.
6. If not found there, return null (fail).

This process can get complicated. If we look in the sr-YU locale for \<ethiopic calendar dayPeriod, stand-alone, narrow, am>

- it looks first in sr-YU
- then goes to the parent, sr
- then root. Finds an alias as above redirecting to format
- look back at sr-YU, now for dayPeriod, format, narrow, am
- look in the parent sr
- and so on.
- Eventually we find the value in gregorian calendar, dayPeriod, format, wide, am

Internally, the code caches the location (targetLocale and resolved path) for each path, so that a second lookup is fast. Note that the target locale "locale" needs to bump all the way up to the top each time, so that the appropriate localized resources are found if they are there.
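
The lookup steps can be sketched in a few lines. This is a simplified model, not the XMLSource code: paths are shortened to made-up strings, aliases are modeled as prefix rewrites that always use the source locale "locale", and the dictionaries `PARENT`, `FILES`, `ROOT_ALIASES`, and `CODE_FALLBACK` are hypothetical test data. Caching of the resolved location is left out.

```python
PARENT = {"sr_YU": "sr", "sr": "root", "root": None}

# Hypothetical file contents: locale -> {path: value}.
FILES = {
    "sr_YU": {},
    "sr": {"//dayPeriods/format/wide/am": "AM-sr"},
    "root": {"//dayPeriods/format/wide/am": "AM"},
}

# Aliases found in root, modeled as path-prefix rewrites.
ROOT_ALIASES = {
    "//dayPeriods/stand-alone": "//dayPeriods/format",
    "//dayPeriods/format/narrow": "//dayPeriods/format/abbreviated",
    "//dayPeriods/format/abbreviated": "//dayPeriods/format/wide",
}

CODE_FALLBACK = {}  # algorithmically constructed values would go here


def lookup_in_chain(locale, path):
    """Steps 1-2: check the file, then each parent up to root."""
    while locale is not None:
        if path in FILES[locale]:
            return FILES[locale][path]
        locale = PARENT[locale]
    return None


def resolve_alias(path):
    """Step 3: rewrite the path if an alias in root covers it."""
    for source, target in ROOT_ALIASES.items():
        if path == source or path.startswith(source + "/"):
            return target + path[len(source):]
    return None


def lookup(locale, path):
    original = path
    seen = set()
    while path is not None and path not in seen:
        seen.add(path)
        # Source locale "locale": restart from the original locale.
        value = lookup_in_chain(locale, path)
        if value is not None:
            return value
        path = resolve_alias(path)  # step 4: repeat with the resolved path
    return CODE_FALLBACK.get(original)  # steps 5-6: fallback, else None


print(lookup("sr_YU", "//dayPeriods/stand-alone/narrow/am"))
```

Note how each alias hop restarts from `sr_YU` rather than from root, which is why the localized value in `sr` is found instead of the root value.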

## Iteration

Iteration is more complicated. We have to figure out whether *any possible* path would return a value in lookup. Again, this would be very simple if it weren't for aliases. Here's how it works.

**Start with the main file.**

1. Find the set of all non-aliased paths in the file and each of its parents, and sort it by path.
2. Collect all the aliases in root and obtain a reverse mapping of aliases, i.e. destinationPath to sourcePath. Sort it by destinationPath.
3. Working backwards, use each reverse alias on the path set to get a set of new paths that would use the alias to map to one of the paths in the original set.
4. Add the new set of paths to the original set of paths, and use the new set as input into step 3. Repeat until there are no more new paths found.

A set of all the paths is cached on the first access for iteration, using the above process. For lookup, the incoming path is checked against the cached set of paths, and the lookup process takes place only if the value is not already cached. The iteration and lookup processes are performed separately because they are both optimized for their individual use cases. Iteration would slow down significantly if value storage was performed at the same time.
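
A minimal sketch of the iteration steps, under the same simplifying assumptions as a prefix-rewrite model of aliases: the function name, parameters, and sample data are hypothetical, and the real code works on full distinguished paths rather than these shortened strings.

```python
def all_paths(locale, files, parent, root_aliases):
    """Return every path for which lookup could produce a value."""
    # Step 1: non-aliased paths from the file and all of its parents.
    paths = set()
    loc = locale
    while loc is not None:
        paths.update(files[loc])
        loc = parent[loc]

    # Step 2: reverse mapping, destination prefix -> source prefixes.
    reverse = {}
    for source, destination in root_aliases.items():
        reverse.setdefault(destination, []).append(source)

    # Steps 3-4: apply reverse aliases until no new paths appear.
    frontier = set(paths)
    while frontier:
        new_paths = set()
        for path in frontier:
            for destination, sources in reverse.items():
                if path == destination or path.startswith(destination + "/"):
                    for source in sources:
                        candidate = source + path[len(destination):]
                        if candidate not in paths:
                            new_paths.add(candidate)
        paths |= new_paths
        frontier = new_paths
    return sorted(paths)


files = {"root": {"//dayPeriods/format/wide/am": "AM"}}
parent = {"root": None}
aliases = {
    "//dayPeriods/stand-alone": "//dayPeriods/format",
    "//dayPeriods/format/narrow": "//dayPeriods/format/abbreviated",
    "//dayPeriods/format/abbreviated": "//dayPeriods/format/wide",
}
for p in all_paths("root", files, parent, aliases):
    print(p)
```

Starting from a single stored path, the fixed-point loop discovers the abbreviated and narrow widths and all three stand-alone variants, which is the set a caller iterating over the resolved file would expect to see.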


![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
@@ -0,0 +1,38 @@
---
title: script-metadata
---

# script-metadata

[http://unicode.org/cldr/trac/ticket/3871](http://unicode.org/cldr/trac/ticket/3871)

Here is a proposed structure for supplemental data

\<script>

&emsp;\<scriptData type="Latn" lines="top-to-bottom" characters="left-to-right" spaces="true" shaping="minimal" usage="recommended" originRegion="150" sample="A">

lines/characters =

- as in the locale orientation element.

spaces =

- true if the script normally uses spaces to separate words

shaping = (in normal usage)

- none if no shaping is normally needed
- minimal if only minor shaping is normally needed, such as accent placement
- major if glyphs normally need to rearrange, or change shape depending on context

usage = value in UAX31

originRegion = continent or subcontinent where script originated. Following the Unicode book / charts (see the spreadsheet).

sample = character with distinctive glyph that can be used to represent the script in icons (eg missing glyph)

The orientation field in a locale would only be used for overriding.
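
As a sketch of how a client might read the proposed element, the following parses the sample above with Python's standard `xml.etree.ElementTree`. The self-closing form of the element and the function name are assumptions; the proposal does not show the closing syntax.

```python
import xml.etree.ElementTree as ET

# The example element from the proposal, written self-closing (assumption).
SAMPLE = (
    '<scriptData type="Latn" lines="top-to-bottom" '
    'characters="left-to-right" spaces="true" shaping="minimal" '
    'usage="recommended" originRegion="150" sample="A"/>'
)

SHAPING_VALUES = {"none", "minimal", "major"}


def parse_script_data(xml_text):
    """Return the attributes as a dict, with 'spaces' converted to bool."""
    element = ET.fromstring(xml_text)
    data = dict(element.attrib)
    if data["shaping"] not in SHAPING_VALUES:
        raise ValueError("unexpected shaping value: " + data["shaping"])
    data["spaces"] = data["spaces"] == "true"
    return data


info = parse_script_data(SAMPLE)
print(info["type"], info["lines"], info["characters"], info["spaces"])
```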


![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)