CLDR-17230 Fix spec for derived names
macchiati committed Dec 6, 2023
1 parent ed7cb16 commit 0acd1b0
Showing 2 changed files with 12 additions and 42 deletions.
16 changes: 9 additions & 7 deletions docs/ldml/tr35-general.md
@@ -2643,22 +2643,24 @@ Many emoji are represented by sequences of characters. When there are no `annota
1. If **sequence** is an **emoji flag sequence**, look up the territory name in CLDR for the corresponding ASCII characters and return as the short name. For example, the regional indicator symbols P+F would map to “Französisch-Polynesien” in German.
2. If **sequence** is an **emoji tag sequence**, look up the subdivision name in CLDR for the corresponding ASCII characters and return as the short name. For example, the TAG characters gbsct would map to “Schottland” in German.
3. If **sequence** is a keycap sequence or 🔟, use the characterLabel for "keycap" as the **prefixName** and set the **suffix** to be the sequence (or "10" in the case of 🔟), then go to step 8.
4. Let **suffix** and **prefixName** be "".
5. If **sequence** contains any emoji modifiers, move them (in order) into **suffix**, removing them from **sequence**.
6. If **sequence** is a "KISS", "HEART", "FAMILY", or "HOLDING HANDS" emoji ZWJ sequence, move the characters in **sequence** to the front of **suffix**, and set the **sequence** to be "💏", "💑", or "👪" respectively, and go to step 7.
4. If the **sequence** ends with the string ZWJ + ➡️, look up the name of that sequence with that string removed. Embed that name into the "facing-right" characterLabelPattern and return it.
5. Let **suffix** and **prefixName** be "".
6. If **sequence** contains any emoji modifiers, move them (in order) into **suffix**, removing them from **sequence**.
7. If **sequence** is a "KISS", "HEART", "FAMILY", or "HOLDING HANDS" emoji ZWJ sequence, move the characters in **sequence** to the front of **suffix**, and set the **sequence** to be "💏", "💑", or "👪" respectively, and go to step 8.
1. A KISS sequence contains ZWJ, "💋", and "❤", which are skipped in moving to **suffix**.
2. A HEART sequence contains ZWJ and "❤", which are skipped in moving to **suffix**.
3. A HOLDING HANDS sequence contains ZWJ+🤝+ZWJ, which are skipped in moving to **suffix**.
4. A FAMILY sequence contains only characters from the set {👦, 👧, 👨, 👩, 👴, 👵, 👶}. Nothing is skipped in moving to **suffix**, except ZWJ.
7. If **sequence** ends with ♂ or ♀, and does not have a name, remove the ♂ or ♀ and move the name for "👨" or "👩" respectively to the start of **prefixName**.
8. Transform **sequence** and append to **prefixName**, by successively getting names for the longest subsequences, skipping any singleton ZWJ characters. If there is more than one name, use the listPattern for unit-short, type=2 to link them.
9. Transform **suffix** into **suffixName** in the same manner.
10. If both the **prefixName** and **suffixName** are non-empty, form the name by joining them with the "category-list" characterLabelPattern and return it. Otherwise return whichever of them is non-empty.
8. If **sequence** ends with ♂ or ♀, and does not have a name, remove the ♂ or ♀ and move the name for "👨" or "👩" respectively to the start of **prefixName**.
9. Transform **sequence** and append to **prefixName**, by successively getting names for the longest subsequences, skipping any singleton ZWJ characters. If there is more than one name, use the listPattern for unit-short, type=2 to link them.
10. Transform **suffix** into **suffixName** in the same manner.
11. If both the **prefixName** and **suffixName** are non-empty, form the name by joining them with the "category-list" characterLabelPattern and return it. Otherwise return whichever of them is non-empty.
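
The new facing-right step (step 4) and the final joining step (step 11) can be sketched in Python as follows. This is a minimal illustration, not the CLDR implementation: the function names, the `lookup_name` callback, and the two English pattern strings are assumptions made for the example, and real pattern values would come from the locale's CLDR data.

```python
ZWJ = "\u200d"
# The right arrow may appear with or without the variation selector U+FE0F.
FACING_RIGHT_SUFFIXES = (ZWJ + "\u27a1\ufe0f", ZWJ + "\u27a1")

# Assumed English pattern values (hypothetical stand-ins for CLDR data).
FACING_RIGHT_PATTERN = "{0} facing right"
CATEGORY_LIST_PATTERN = "{0}: {1}"


def facing_right_name(sequence, lookup_name):
    """Step 4: return the derived name, or None if the step does not apply.

    lookup_name is a hypothetical callback that returns the short name of a
    sequence (explicit or derived), or None when no name is available.
    """
    for suffix in FACING_RIGHT_SUFFIXES:
        if sequence.endswith(suffix):
            base_name = lookup_name(sequence[: -len(suffix)])
            if base_name is None:
                return None
            return FACING_RIGHT_PATTERN.replace("{0}", base_name)
    return None


def join_names(prefix_name, suffix_name):
    """Step 11: join with the category-list pattern, or return the non-empty part."""
    if prefix_name and suffix_name:
        return CATEGORY_LIST_PATTERN.replace("{0}", prefix_name).replace(
            "{1}", suffix_name
        )
    return prefix_name or suffix_name


# Usage with a tiny hypothetical name table:
names = {"\U0001F6B6": "person walking"}
print(facing_right_name("\U0001F6B6" + ZWJ + "\u27a1\ufe0f", names.get))
print(join_names("man judge", "light skin tone"))
```

Returning `None` when the suffix is absent lets a caller fall through to the remaining steps of the algorithm.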

The synthesized keywords can follow a similar process.

1. For an **emoji flag sequence** or **emoji tag sequence** representing a subdivision, use "flag".
2. For keycap sequences, use "keycap".
3. For sequences with ZWJ + ➡️, use the keywords for the sequence without the ZWJ + ➡️.
3. For other sequences, add the keywords for the subsequences used to get the short names for **prefixName**, and the short names used for **suffixName**.
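
The keyword rules above can be sketched the same way. The sketch assumes that the caller has already classified the sequence (`kind`) and, for the general case, already knows which subsequences and suffix names the short-name derivation used. `derived_keywords`, `keywords_for`, and the sample keyword table are hypothetical names and data for illustration, and the de-duplication is an added assumption rather than something the text specifies.

```python
ZWJ = "\u200d"
# The right arrow may appear with or without the variation selector U+FE0F.
FACING_RIGHT_SUFFIXES = (ZWJ + "\u27a1\ufe0f", ZWJ + "\u27a1")


def derived_keywords(sequence, kind, keywords_for, prefix_parts=(), suffix_names=()):
    """Return synthesized keywords for a sequence without explicit annotations.

    kind: "flag" (flag or subdivision tag sequence), "keycap", or "other".
    keywords_for: hypothetical callback mapping a sequence to its keywords.
    prefix_parts: subsequences used to build prefixName.
    suffix_names: short names used to build suffixName.
    """
    if kind == "flag":
        return ["flag"]
    if kind == "keycap":
        return ["keycap"]
    # Facing-right sequences reuse the keywords of the base sequence.
    for suffix in FACING_RIGHT_SUFFIXES:
        if sequence.endswith(suffix):
            return list(keywords_for(sequence[: -len(suffix)]))
    # General case: keywords of the prefix subsequences, then the suffix names.
    result = []
    for part in prefix_parts:
        for keyword in keywords_for(part):
            if keyword not in result:  # de-duplication is an assumption
                result.append(keyword)
    for name in suffix_names:
        if name not in result:
            result.append(name)
    return result


# Usage with a tiny hypothetical keyword table:
table = {"\U0001F6B6": ["hike", "walk"], "\U0001F468": ["adult", "man"]}
lookup = lambda s: table.get(s, [])
print(derived_keywords("\U0001F6B6" + ZWJ + "\u27a1\ufe0f", "other", lookup))
```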

Some examples for English data (v30) are given in the following table.
38 changes: 3 additions & 35 deletions docs/ldml/tr35.md
@@ -2,10 +2,10 @@

# Unicode Locale Data Markup Language (LDML)

|Version|44 |
|Version|44.1 |
|-------|----------|
|Editors|Mark Davis (<a href="mailto:[email protected]">[email protected]</a>) and <a href="tr35.md#Acknowledgments">other CLDR committee members</a>|
|Date|2023-10-25|
|Date|2023-12-05|
|This Version|<a href="https://www.unicode.org/reports/tr35/tr35-70/tr35.html">https://www.unicode.org/reports/tr35/tr35-70/tr35.html</a>|
|Previous Version|<a href="https://www.unicode.org/reports/tr35/tr35-69/tr35.html">https://www.unicode.org/reports/tr35/tr35-69/tr35.html</a>|
|Latest Version|<a href="https://www.unicode.org/reports/tr35/">https://www.unicode.org/reports/tr35/</a>|
@@ -4029,6 +4029,7 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
* [General](tr35-general.md#Contents)
* Added new section [Unit Identifier Uniqueness](tr35-general.md#Unit_Identifier_Uniqueness), and added a relevant constraint on base_component in the [Syntax](tr35-general.md#syntax) section.
* Several clarifications were added in [Transform Rules Syntax](tr35-general.md#Transform_Rules_Syntax), and a new section [Transform Syntax Characters](tr35-general.md#transform-syntax-characters) was added with a table of the characters.
* (44.1) Added handling of derived emoji names and keywords for emoji facing-right sequences.

* [Dates](tr35-dates.md#Contents)
* New section [First Day Overrides](tr35-dates.md#first-day-overrides): Described the various locale ID elements that affect determination of the first day of the week (for week of year calculations), and the order in which they should be considered. Also noted in [Key/Type Definitions](#Key_Type_Definitions) which keys can affect determination of first day.
@@ -4050,39 +4051,6 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni
* Fixed a problem in [Switch the formatting locale if necessary](tr35-personNames.md#switch-the-formatting-locale-if-necessary), where the full formatting locale wasn't being set correctly when the name object has a locale whose script is incompatible with the name script.
* Rewrote the section on [Setting the spaceReplacement](tr35-personNames.md#setting-the-spacereplacement).

**Differences from LDML Version 42**

* Removed numbering from sections, to allow for more flexible reorganization of the specification in the future
* [Person Names](tr35-personNames.md#Contents)
* Brought Person Name Formatting out of tech preview
* Described the changes from the fields _prefix_ and _suffix_ to the fields _title_, _generation_, and _credentials_.
The problem was that ‘prefix’ and ‘suffix’ are positional terms, whereas the contents may need to change position based on the locale.
* Provided much more detailed algorithms for the whole [Formatting Process](tr35-personNames.md#formatting-process),
including additional processing steps such as [Handle missing surname](tr35-personNames.md#handle-missing-surname)
* Documented changes in the [Sample Name](tr35-personNames.md#sample-name) structure (whose primary use is internal to CLDR data collection)
* For more background, the [Person Names Guide](https://docs.google.com/document/d/1mjxIHsb97Og8ub6BKWxOihcHz7zjU4GdFkIxWHGAtes/edit#heading=h.4u6bqbd313a5) may be helpful,
although it is primarily targeted at CLDR data submitters.
* **Locales**
* Fixed formatting errors in [Likely Subtags](#likely-subtags)
* Improved the specification information about the effect of locale keywords
* "fw" keyword for first day of the week in [Week Data](tr35-dates.md#Week_Data)
* "hc" keyword for hour cycle in [Time Data](tr35-dates.md#Time_Data)
* "dx", "lb", "lw", "ss" keywords related to line wrapping in [Segmentations](tr35-general.md#segmentations)
* "cf" keyword in [Currency Formats](tr35-numbers.md#Currency_Formats)
* "ca", "cf", "dx", "fw", "hc", "lb", "lw", "ms", "mu", "rg" keyword updates in [Key And Type Definitions](#Key_And_Type_Definitions_)
* [Parent Locales](#Parent_Locales)
* Documented the new `component` attribute, which provides for different inheritance behavior for different components (such as segmentation or collation)
* [Region-Priority Inheritance](#Region_Priority_Inheritance)
* Documented the differences in inheritance for rgScope data, which inherits primarily by region rather than primarily by language.
* Includes small changes in [`<rgScope>`: Scope of the “rg” Locale Key](tr35-info.md#rgScope), in [Lookup](#lookup),
and in [Bundle vs Item Lookup](#Bundle_vs_Item_Lookup)
* [Calendar Data](tr35-dates.md#calendar-data)
* Documents new optional `code` and `aliases` attributes to eras, which allow string IDs for eras instead of just numbers
* [Data Size Reduction](#Data_Size)
* Added new section with guidance on how to reduce CLDR data size where necessary
* [Telephone Code Data](tr35-info.md#Telephone_Code_Data)
* Added pointer to the recommended open-source library [libphonenumber](https://github.com/google/libphonenumber#what-is-it)

Note that small changes such as typos and link fixes are not listed above.
Modifications in previous versions are listed in those respective versions.
Click on **Previous Version** in the header until you get to the desired version.