CLDR-17582 Cleanup English annotations
See #3751
macchiati authored and Squash Bot committed May 30, 2024
1 parent 3cc9e07 commit e5fc491
Showing 7 changed files with 1,802 additions and 2,282 deletions.
2,337 changes: 1,165 additions & 1,172 deletions common/annotations/en.xml

Large diffs are not rendered by default.

54 changes: 47 additions & 7 deletions docs/ldml/tr35-general.md
@@ -2620,28 +2620,68 @@ For more information, see version 5.0 or [UTR #51, Unicode Emoji](https://www.un
<!ATTLIST annotation type (tts) #IMPLIED >
```

There are two kinds of annotations: **short names**, and **keywords**.
There are two kinds of annotations: **short names**, and **search keywords**.

With an attribute `type="tts"`, the value is a **short name**, such as one that can be used for text-to-speech. It should be treated as one of the element values for other purposes.
With an attribute `type="tts"`, the value is a **short name**, such as one that can be used for text-to-speech.
It should be treated as one of the element values for other purposes.

When there is no `type` attribute, the value is a set of **keywords**, delimited by |. Spaces around each element are to be trimmed. The **keywords** are words associated with the character(s) that might be used in searching for the character, or in predictive typing on keyboards. The short name itself can be used as a keyword.
When there is no `type` attribute, the value is a set of **keywords**, delimited by |.
Spaces around each element are to be trimmed.
The **keywords** are words associated with the character(s) that might be used in searching for the character,
or in predictive typing on keyboards. The short name itself can be used as a keyword.

Here is an example from German:

```xml
<annotation cp="👎">schlecht | Hand | Daumen | nach unten</annotation>
<annotation cp="👎">schlecht | Hand | Daumen | nach | unten</annotation>
<annotation cp="👎" type="tts">Daumen runter</annotation>
```
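The delimiter and trimming rules above can be sketched in a few lines of Java. This is a minimal illustration, not CLDR's own parser: the class and method names are hypothetical, and skipping empty elements is an assumption the text does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CLDR: splits a keyword annotation value.
class KeywordParser {
    /** Splits the value on '|' and trims the spaces around each element. */
    static List<String> parseKeywords(String annotationValue) {
        List<String> keywords = new ArrayList<>();
        for (String part : annotationValue.split("\\|")) {
            String trimmed = part.trim();
            // Assumption: empty elements (e.g. from "a || b") are dropped.
            if (!trimmed.isEmpty()) {
                keywords.add(trimmed);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Value taken from the German example above.
        System.out.println(parseKeywords("schlecht | Hand | Daumen | nach | unten"));
        // prints [schlecht, Hand, Daumen, nach, unten]
    }
}
```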

The `cp` attribute value has two formats: either a single string, or if contained within \[\] a UnicodeSet. The latter format can contain multiple code points or strings. A code point or string can occur in multiple annotation element **cp** values, such as the following, which also contains the "thumbs down" character.
These are intended as search keywords, and not for "triggering" (aka suggesting).

- For triggering, the user is typing out a message and concurrently seeing a few emoji
displayed adjacent to the virtual keyboard. Selecting the emoji adds it to the message.
For example, you mention your birthday while writing, and an emoji cake pops up.
That is typically done with an LLM or similar advanced technology.
- For searching, the user is looking for an emoji in a search box,
and typing in words that narrow down a displayed set of emoji.
For example, you type 'heart', but that has too many hits, so you add 'blue' and get the set of blue hearts.

### Usage Model

The usage model for the search keywords is:

- The user types one or more words in an emoji search field.
- Each word successively narrows a number of emoji in a results box.
  - heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
  - blue → 🥶 😰 💙 🩵 🫐 👕 👖 📘 🧿 🔵 🟦 🔷 🔹 🏳️‍⚧️
  - heart blue → 💙 🩵
- A word with no hits is ignored.
  - [heart | blue | confabulation] is equivalent to [heart | blue]
- As the user types a word, each character added to the word narrows the results.
- Whenever the list is short enough to scan, the user will mouse-click on the right emoji — so it doesn’t have to be narrowed too far.
  - In the following, the user would just click on 🎉 if that works for them.
    - celebrate → 🥳 🥂 🎈 🎉 🎊 🪅
- The order of words doesn’t matter.

Multiword search keywords are typically broken up into separate parts,
because that works better with the usage model. So [hand | mouth | omg | open | over] covers the phrase "hand over mouth".
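The usage model above can be sketched as a small keyword index. This is an illustration under stated assumptions, not the algorithm of any particular emoji picker: the class name, the toy data (including the keywords "green", "berry" and "circle"), the use of prefix matching to model typing character by character, and the empty result when no word has hits are all choices made for this sketch.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the search-keyword usage model; not part of CLDR.
class EmojiSearchSketch {
    // Built from code points so the source file stays ASCII-only.
    static final String BLUE_HEART = new String(Character.toChars(0x1F499));
    static final String GREEN_HEART = new String(Character.toChars(0x1F49A));
    static final String BLUEBERRIES = new String(Character.toChars(0x1FAD0));
    static final String BLUE_CIRCLE = new String(Character.toChars(0x1F535));

    // keyword -> emoji carrying that keyword, sorted so prefixes are contiguous
    private final TreeMap<String, Set<String>> keywordToEmoji = new TreeMap<>();

    void add(String emoji, String... keywords) {
        for (String keyword : keywords) {
            keywordToEmoji.computeIfAbsent(keyword, k -> new LinkedHashSet<>()).add(emoji);
        }
    }

    /** Toy data only; the real keywords live in the annotation files. */
    static EmojiSearchSketch withToyData() {
        EmojiSearchSketch index = new EmojiSearchSketch();
        index.add(BLUE_HEART, "blue", "heart");
        index.add(GREEN_HEART, "green", "heart");
        index.add(BLUEBERRIES, "berry", "blue");
        index.add(BLUE_CIRCLE, "blue", "circle");
        return index;
    }

    /** All emoji with a keyword starting with the (possibly partial) word. */
    Set<String> hitsForWord(String word) {
        Set<String> hits = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> e : keywordToEmoji.tailMap(word, true).entrySet()) {
            if (!e.getKey().startsWith(word)) {
                break; // past the block of keys sharing this prefix
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }

    /** Each word narrows the results; words with no hits are ignored. */
    Set<String> search(String query) {
        Set<String> results = null;
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> hits = hitsForWord(word);
            if (hits.isEmpty()) {
                continue; // a word with no hits does not narrow anything
            }
            if (results == null) {
                results = hits;
            } else {
                results.retainAll(hits);
            }
        }
        // Assumption: with no usable word, this sketch returns nothing.
        return results == null ? new LinkedHashSet<>() : results;
    }

    public static void main(String[] args) {
        EmojiSearchSketch index = withToyData();
        System.out.println(index.search("heart blue").equals(index.search("blue heart"))); // prints true
        System.out.println(index.search("heart blue confabulation").size()); // prints 1
    }
}
```

Because each word is intersected with the running result, word order has no effect, and splitting multiword keywords into separate parts lets any subset of the words match.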

### cp attribute

The `cp` attribute value has two formats: either a single string, or if contained within \[\] a UnicodeSet.
The latter format can contain multiple code points or strings. A code point or string can occur in multiple annotation element **cp** values, such as the following, which also contains the "thumbs down" character.

```xml
<annotation cp='[☝✊-✍👆-👐👫-👭💁🖐🖕🖖🙅🙆🙋🙌🙏🤘]'>hand</annotation>
```

Both for short names and keywords, values do not have to match between different languages. They should be the most common values that people using _that_ language would associate with those characters. For example, a "black heart" might have the association of "wicked" in English, but not in some other languages.
Both for short names and keywords, values do not have to match between different languages.
They should be the most common values that people using _that_ language would associate with those characters.
For example, a "black heart" might have the association of "wicked" in English, but not in some other languages.

The cp value may contain sequences, but does not contain any Emoji or Text Variant (VS15 & VS16) characters. All such characters should be removed before looking up any short names and keywords.
The cp value may contain sequences, but does not contain any Emoji or Text Variant (VS15 & VS16) characters.
All such characters should be removed before looking up any short names and keywords.
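As a minimal sketch of that lookup step, the following Java removes VS15 (U+FE0E) and VS16 (U+FE0F) from a string before it is used as a key. The class and method names are hypothetical; the CLDR tooling has its own helper (`Emoji.removeVariants`, used in the tool in this commit), whose exact behavior is not shown here.

```java
// Hypothetical helper, not part of CLDR: normalizes a string for annotation lookup.
class VariationSelectorStripper {
    private static final String VS15 = "\uFE0E"; // text presentation selector
    private static final String VS16 = "\uFE0F"; // emoji presentation selector

    /** Returns the input with every VS15 and VS16 removed. */
    static String removeVariationSelectors(String s) {
        return s.replace(VS15, "").replace(VS16, "");
    }

    public static void main(String[] args) {
        // HEAVY BLACK HEART followed by VS16, as commonly typed or pasted.
        String typed = "\u2764\uFE0F";
        String key = removeVariationSelectors(typed);
        System.out.println(key.equals("\u2764")); // prints true
    }
}
```

Other characters in a sequence, such as the zero width joiner, are left in place; only the two variation selectors are dropped.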

### <a name="SynthesizingNames" href="#SynthesizingNames">Synthesizing Sequence Names</a>

@@ -0,0 +1,155 @@
package org.unicode.cldr.tool;

import com.google.common.base.Joiner;
import com.google.common.collect.Sets;
import com.ibm.icu.impl.UnicodeMap;
import com.ibm.icu.text.UnicodeSet;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import org.unicode.cldr.util.Annotations;
import org.unicode.cldr.util.CLDRConfig;
import org.unicode.cldr.util.CLDRFile;
import org.unicode.cldr.util.CldrUtility;
import org.unicode.cldr.util.CodePointEscaper;
import org.unicode.cldr.util.Emoji;
import org.unicode.cldr.util.SimpleUnicodeSetFormatter;
import org.unicode.cldr.util.XPathParts;

/**
 * Prints diagnostic tables for the English emoji annotations: emoji to keywords, keywords to
 * emoji, and the keywords shared within each gender-variant group.
 */
public class CheckEmojiAnnotations {
    private static final Joiner JOIN_BAR = Joiner.on(" | ");

    public static void main(String[] args) {
        boolean chooseEmoji = true; // false to get the non-emoji

        UnicodeSet rgi = Emoji.getAllRgi();
        UnicodeSet rgiNoVariant = Emoji.getAllRgiNoES();
        CLDRFile root = CLDRConfig.getInstance().getAnnotationsFactory().make("en", false);

        // Collect the cp values from the English annotations file that match chooseEmoji.
        UnicodeSet rootEmoji = new UnicodeSet();
        for (String path : root) {
            XPathParts parts = XPathParts.getFrozenInstance(path);
            String cp = parts.getAttributeValue(-1, "cp");
            if (cp != null && rgiNoVariant.contains(cp) == chooseEmoji) {
                rootEmoji.add(cp);
            }
        }
        rootEmoji.freeze();

        UnicodeMap<Annotations> english = Annotations.getData("en");
        Map<String, UnicodeSet> keywordToEmoji = new TreeMap<>();
        UnicodeSet allUnclean = new UnicodeSet();

        // Build the inverse mapping: keyword -> set of emoji (with variants restored).
        for (Annotations entry : english.values()) {
            Set<String> keywords = entry.getKeywords();
            UnicodeSet emoji = english.getSet(entry);
            emoji.retainAll(rootEmoji);
            UnicodeSet emojiRestored = new UnicodeSet();
            for (String emojiItem : emoji) {
                emojiRestored.add(Emoji.restoreVariants(emojiItem));
            }
            // Remember restored items that are not in the RGI set; they are checked below.
            UnicodeSet unclean = new UnicodeSet(emojiRestored).removeAll(rgi);
            allUnclean.add(unclean);

            emojiRestored = emojiRestored.retainAll(rgi);
            if (emojiRestored.isEmpty()) {
                continue;
            }

            for (String keyword : keywords) {
                UnicodeSet value = keywordToEmoji.get(keyword);
                if (value == null) {
                    keywordToEmoji.put(keyword, value = new UnicodeSet());
                }
                value.addAll(emojiRestored);
            }
        }
        CldrUtility.protectCollection(keywordToEmoji);

        int count = 0;
        System.out.println("### Emoji to Keywords");
        TreeSet<String> sortedRootEmoji = new TreeSet<>(Emoji.COLLATOR);
        rootEmoji.addAllTo(sortedRootEmoji);
        for (String emoji : sortedRootEmoji) {
            String restored = Emoji.restoreVariants(emoji);
            Set<String> keywords = english.get(emoji).getKeywords();
            System.out.println(
                    ++count + "\t" + restored + "\t" + emoji + "\t" + JOIN_BAR.join(keywords));
        }

        UnicodeSet toEscape =
                new UnicodeSet(CodePointEscaper.FORCE_ESCAPE)
                        .remove(CodePointEscaper.ZWJ.getCodePoint())
                        .remove(CodePointEscaper.RANGE.getCodePoint())
                        .freeze();
        SimpleUnicodeSetFormatter suf = new SimpleUnicodeSetFormatter(null, toEscape);

        // Fail if any non-RGI items remain after excluding skin and hair modifiers.
        allUnclean =
                allUnclean
                        .retainAll(rgiNoVariant)
                        .removeAll(Emoji.SKIN_MODIFIERS)
                        .removeAll(Emoji.HAIR_MODIFIERS);
        if (!allUnclean.isEmpty()) {
            throw new IllegalArgumentException("Missing " + suf.format(allUnclean));
        }

        System.out.println("### Keywords to Emoji");

        count = 0;
        for (Entry<String, UnicodeSet> entry : keywordToEmoji.entrySet()) {
            System.out.println(
                    ++count + "\t" + entry.getKey() + "\t" + suf.format(entry.getValue()));
        }

        System.out.println("### Gender Variants");

        for (Set<String> entry : Emoji.getGenderGroups()) {
            // find common keywords
            Set<String> common = null;
            Set<String> cleanEntry = new TreeSet<>();
            for (String s : entry) {
                if (!rootEmoji.contains(Emoji.removeVariants(s))) {
                    continue;
                }
                Annotations anno = getAnnotations(english, s);
                if (anno == null) {
                    continue;
                }
                cleanEntry.add(s);
                if (common == null) {
                    System.out.println();
                    common = new TreeSet<>();
                    common.addAll(anno.getKeywords());
                } else {
                    common.retainAll(anno.getKeywords());
                }
            }
            // now show them: emoji, variant-free form, short name, common keywords,
            // and the keywords specific to this member of the group
            if (cleanEntry.size() > 1) {
                for (String s : cleanEntry) {
                    Annotations anno = getAnnotations(english, s);
                    String removed = Emoji.removeVariants(s);
                    System.out.println(
                            s
                                    + "\t"
                                    + removed
                                    + "\t"
                                    + anno.getShortName()
                                    + "\t"
                                    + JOIN_BAR.join(common)
                                    + "\t"
                                    + JOIN_BAR.join(Sets.difference(anno.getKeywords(), common)));
                }
            }
        }
    }

    /** Looks up the annotations for s, retrying with Emoji.EMOJI_VARIANT removed. */
    public static Annotations getAnnotations(UnicodeMap<Annotations> english, String s) {
        Annotations anno = english.get(s);
        if (anno == null) {
            anno = english.get(s.replace(Emoji.EMOJI_VARIANT, ""));
        }
        return anno;
    }
}