CLDR-16720 json: add transforms #4036

Merged 9 commits on Sep 21, 2024
Changes from 3 commits
48 changes: 25 additions & 23 deletions docs/site/downloads/cldr-46.md
@@ -15,15 +15,15 @@ It only covers the data, which is available at [release-46-alpha3](https://githu
## Overview

Unicode CLDR provides key building blocks for software supporting the world's languages.
CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-)
(including all mobile phones) for their software internationalization and localization,
adapting software to the conventions of different languages.

The most significant changes in this release were:

- Updates to Unicode 16.0 (including major changes to collation),
- Further revisions to the Message Format 2.0 tech preview,
- Substantial additions and modifications of Emoji search keyword data,
- ‘Upleveling’ the locale coverage.

### Locale Coverage Status
@@ -119,7 +119,7 @@ For a full listing, see [¤¤BCP47 Delta](https://unicode.org/cldr/charts/46/del
For a full listing, see [Delta Data](https://unicode.org/cldr/charts/46/delta/index.html)

### Emoji Search Keywords
The usage model for emoji search keywords is as follows:
- The user types one or more words in an emoji search field. The order of words doesn't matter; nor does upper- versus lowercase.
- Each word successively narrows the set of emoji shown in a results box.
- heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
@@ -131,41 +131,41 @@ The usage model for emoji search keywords is that
Thus in the following, the user would just click on 🎉 if that works for them.
- celebrate → 🥳 🥂 🎈 🎉 🎊 🪅

In this release, WhatsApp emoji search keyword data has been incorporated.
In the process of doing that, the maximum number of search keywords per emoji has been increased,
and the keywords have been simplified in most locales by breaking up multi-word keywords.
For example, white flag (🏳️) formerly had 3 keyword phrases, [white waving flag | white flag | waving flag];
these are now replaced by 3 simpler single keywords, [white | waving | flag].
The simpler version typically works as well or better in practice.
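
The search model described above can be sketched roughly as follows. This is only an illustration: the class name `EmojiKeywordSearch`, the sample keyword table, and the choice of exact (rather than prefix) matching are assumptions, not part of CLDR or of any shipping emoji picker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch of the search model: every typed word must match a keyword of the emoji. */
final class EmojiKeywordSearch {
    // Hypothetical sample data, not copied from the CLDR annotation files.
    static final Map<String, Set<String>> KEYWORDS = new LinkedHashMap<>();

    static {
        KEYWORDS.put("🏳️", Set.of("white", "waving", "flag"));
        KEYWORDS.put("🏴", Set.of("black", "waving", "flag"));
        KEYWORDS.put("🎉", Set.of("celebrate", "party", "popper"));
    }

    /** Returns the emoji whose keywords contain every query word, ignoring case and word order. */
    static List<String> search(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : KEYWORDS.entrySet()) {
            boolean matchesAll = true;
            for (String word : words) {
                // Each additional word can only shrink the result set.
                if (!word.isEmpty() && !entry.getValue().contains(word)) {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll) {
                results.add(entry.getKey());
            }
        }
        return results;
    }
}
```

With single-word keywords, a query such as `white flag` matches the same emoji as `flag white`, which is one reason the split form tends to work at least as well as the multi-word phrases.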

### Collation Data Changes
There are two significant changes to the CLDR root collation (CLDR default sort order).

#### Realigned With DUCET
The [DUCET](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) is the Unicode Collation Algorithm default sort order.
The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET.
These sort orders have differed in the relative order of groups of characters including extenders, currency symbols, and non-decimal-digit numeric characters.

Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same.
In both sort orders, non-decimal-digit numeric characters now sort after decimal digits,
and the CLDR root collation no longer tailors any currency symbols (making some of them sort like letter sequences, as in the DUCET).

These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET.
See the [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) documentation for details.

#### Improved Han Radical-Stroke Order
CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-collation.md#File_Format_FractionalUCA_txt).
It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes.
Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf).
[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm).
Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes.
This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders,
where only the traditional forms of radicals are now available as index characters.
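
The change in comparison levels can be illustrated with two comparators. This is a simplified sketch, not the UAX #38 algorithm or CLDR code: the `Key` class and its field values are hypothetical, it assumes traditional radical forms sort before simplified ones, and it omits any further tie-breaking levels.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RadicalStrokeOrderSketch {
    /** Hypothetical sort key for one Han character. */
    static final class Key {
        final String label;
        final int radical;
        final int residualStrokes;
        final boolean simplifiedRadical;

        Key(String label, int radical, int residualStrokes, boolean simplifiedRadical) {
            this.label = label;
            this.radical = radical;
            this.residualStrokes = residualStrokes;
            this.simplifiedRadical = simplifiedRadical;
        }
    }

    /** Earlier behavior as described above: radical form is compared before residual strokes. */
    static final Comparator<Key> BEFORE_CLDR_46 =
            Comparator.<Key>comparingInt(k -> k.radical)
                    .thenComparing((a, b) -> Boolean.compare(a.simplifiedRadical, b.simplifiedRadical))
                    .thenComparingInt(k -> k.residualStrokes);

    /** CLDR 46 behavior as described above: residual strokes are compared before radical form. */
    static final Comparator<Key> CLDR_46 =
            Comparator.<Key>comparingInt(k -> k.radical)
                    .thenComparingInt(k -> k.residualStrokes)
                    .thenComparing((a, b) -> Boolean.compare(a.simplifiedRadical, b.simplifiedRadical));

    /** Sorts a copy of the keys and returns their labels in order. */
    static List<String> sortedLabels(List<Key> keys, Comparator<Key> order) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(order);
        List<String> labels = new ArrayList<>();
        for (Key k : copy) {
            labels.add(k.label);
        }
        return labels;
    }
}
```

For three made-up keys under the same radical, A (5 residual strokes, traditional form), B (3 strokes, simplified form), and C (5 strokes, simplified form), the first comparator yields A, B, C and the second yields B, A, C.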

### JSON Data Changes

1. Separate modern packages were dropped [CLDR-16465]
-2. Adding transliteration rules [CLDR-16720] (In progress)
+2. Transliteration (transform) data is now available in the `cldr-transforms` package. The JSON file contains transform metadata, and the `_rulesFile` key indicates an external (`.txt`) file containing the actual rules. [CLDR-16720][].
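
A consumer of this layout might follow the `_rulesFile` key roughly as sketched below. The class name `TransformRulesLoader` is hypothetical, the regular expression stands in for a real JSON parser, and the sketch assumes the `.txt` file sits in the same directory as the `.json` file that names it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TransformRulesLoader {
    // Matches "_rulesFile": "<name>"; a real consumer would use a JSON parser instead.
    private static final Pattern RULES_FILE =
            Pattern.compile("\"_rulesFile\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns the rules text referenced by the first _rulesFile key, if there is one. */
    static Optional<String> loadRules(Path jsonFile) throws IOException {
        String json = new String(Files.readAllBytes(jsonFile), StandardCharsets.UTF_8);
        Matcher m = RULES_FILE.matcher(json);
        if (!m.find()) {
            return Optional.empty();
        }
        // Assumption: the rules file is a sibling of the JSON file.
        Path rules = jsonFile.resolveSibling(m.group(1));
        return Optional.of(new String(Files.readAllBytes(rules), StandardCharsets.UTF_8));
    }
}
```

Keeping the rules in a separate text file means the JSON stays small and the rule text does not need JSON string escaping.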

### Markdown ###

@@ -177,7 +177,7 @@ This process should be completed before release.
### File Changes

Most files added in this release were for new locales.
There were the following new test files:

**TBD**

@@ -207,3 +207,5 @@ Many people have made significant contributions to CLDR and LDML; see the [Ackno
The Unicode [Terms of Use](https://unicode.org/copyright.html) apply to CLDR data; in particular, see [Exhibit 1](https://unicode.org/copyright.html#Exhibit1).

For web pages with different views of CLDR data, see [http://cldr.unicode.org/index/charts](https://cldr.unicode.org/index/charts).

[CLDR-16720]: https://unicode-org.atlassian.net/issues/CLDR-16720
@@ -23,7 +23,15 @@ public static CldrNode createNode(
String fullTrunk = extractAttrs(fullPathSegment, node.nondistinguishingAttributes);
if (!node.name.equals(fullTrunk)) {
throw new ParseException(
-"Error in parsing \"" + pathSegment + " \":\"" + fullPathSegment, 0);
+"Error in parsing \""
++ pathSegment
++ "\":\""
++ fullPathSegment
++ " - "
++ node.name
++ " != "
++ fullTrunk,
+0);
}

for (String key : node.distinguishingAttributes.keySet()) {
@@ -49,6 +49,7 @@
import org.unicode.cldr.util.CLDRLocale;
import org.unicode.cldr.util.CLDRPaths;
import org.unicode.cldr.util.CLDRTool;
import org.unicode.cldr.util.CLDRTransforms;
import org.unicode.cldr.util.CLDRURLS;
import org.unicode.cldr.util.CalculatedCoverageLevels;
import org.unicode.cldr.util.CldrUtility;
@@ -88,6 +89,7 @@
private static final String CLDR_PKG_PREFIX = "cldr-";
private static final String FULL_TIER_SUFFIX = "-full";
private static final String MODERN_TIER_SUFFIX = "-modern";
private static final String TRANSFORM_RAW_SUFFIX = ".txt";
private static Logger logger = Logger.getLogger(Ldml2JsonConverter.class.getName());

enum RunType {
@@ -98,7 +100,8 @@
rbnf(false, true),
annotations,
annotationsDerived,
-bcp47(false, false);
+bcp47(false, false),
+transforms(false, false);

private final boolean isTiered;
private final boolean hasLocales;
@@ -739,6 +742,8 @@
outFilename = filenameAsLangTag + ".json";
} else if (type == RunType.bcp47) {
outFilename = filename + ".json";
} else if (type == RunType.transforms) {
outFilename = filename + ".json";
} else if (js.section.equals("other")) {
// If you see other-___.json, it means items that were missing from
// JSON_config_*.txt
@@ -775,11 +780,11 @@
if (type == RunType.main) {
avl.full.add(filenameAsLangTag);
}
-} else if (type == RunType.rbnf) {
-js.packageName = "rbnf";
-tier = "";
-} else if (type == RunType.bcp47) {
-js.packageName = "bcp47";
+} else if (type == RunType.rbnf
+|| type == RunType.bcp47
+|| type == RunType.transforms) {
+// untiered, just use the name
+js.packageName = type.name();
tier = "";
}
if (js.packageName != null) {
@@ -884,6 +889,28 @@
}
}

if (item.getUntransformedPath()
.startsWith("//supplementalData/transforms")) {
// here, write the raw data
final String rawTransformFile = filename + TRANSFORM_RAW_SUFFIX;
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir, rawTransformFile)) {
outf.println(item.getValue());
// note: not logging the write here- it will be logged when the
// .json file is written.
}
// the value is now the raw filename
item.setValue(rawTransformFile);
item.setPath(
item.getPath()
.replaceAll("\\]/tRule.*$", "]/_rulesFile")
.replace("/transforms/", "/"));
item.setFullPath(
item.getFullPath()
.replaceAll("\\]/tRule.*$", "]/_rulesFile")
.replace("/transforms/", "/"));
}

// some items need to be split to multiple item before processing. None
// of those items need to be sorted.
// Applies to SPLITTABLE_ATTRS attributes.
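
The path rewrite in the transforms branch above can be tried in isolation. The sketch below applies the same two replacements from the hunk to a plain string; the class name `RulesFilePathRewrite` and the sample path are hypothetical, and real converter paths may carry more attributes.

```java
final class RulesFilePathRewrite {
    /** Applies the same two replacements as the converter hunk, on a plain string. */
    static String rewrite(String path) {
        return path
                // Everything from "]/tRule" to the end becomes "]/_rulesFile".
                .replaceAll("\\]/tRule.*$", "]/_rulesFile")
                // Drop the intermediate "transforms" path segment.
                .replace("/transforms/", "/");
    }

    public static void main(String[] args) {
        // Hypothetical input path, used only for illustration.
        String in = "//supplementalData/transforms/transform[@source=\"Latn\"][@target=\"Cyrl\"]/tRule";
        System.out.println(rewrite(in));
    }
}
```

For the sample input, the result is `//supplementalData/transform[@source="Latn"][@target="Cyrl"]/_rulesFile`, so the item's value (now a file name) ends up under a `_rulesFile` key next to the transform's attributes.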
@@ -1453,6 +1480,24 @@
outf.close();
}

public void writeTransformMetadata(String outputDir) throws IOException {
final String dirName = outputDir + "/cldr-" + RunType.transforms.name();
final String fileName = RunType.transforms.name() + ".json";
PrintWriter outf = FileUtilities.openUTF8Writer(dirName, fileName);
System.out.println(
PACKAGE_ICON
+ " Creating packaging file => "
+ dirName
+ File.separator
+ fileName);
JsonObject obj = new JsonObject();
obj.add(
RunType.transforms.name(),
gson.toJsonTree(CLDRTransforms.getInstance().getJsonIndex()));
outf.println(gson.toJson(obj));
outf.close();
}

public void writeCoverageLevels(String outputDir) throws IOException {
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir + "/cldr-core", "coverageLevels.json"); ) {
@@ -2225,6 +2270,8 @@
if (Boolean.parseBoolean(options.get("packagelist").getValue())) {
writePackageList(outputDir);
}
} else if (type == RunType.transforms) {
writeTransformMetadata(outputDir);
}
}
}
@@ -154,7 +154,14 @@ class LdmlConvertRules {
"identity:variant:type",

// in common/bcp47/*.xml
-"keyword:key:name");
+"keyword:key:name",
+
+// transforms
+
+// transforms
+"transforms:transform:source",
+"transforms:transform:target",
+"transforms:transform:direction");

/**
* The set of element:attribute pair in which the attribute should be treated as value. All the
@@ -1128,4 +1128,19 @@ static String parseDoubleColon(String x, Set<String> others) {
}
return "";
}

public class CLDRTransformsJsonIndex {
/** raw list of available IDs */
public String[] available =
getAvailableIds().stream()
.map((String id) -> id.replace(".xml", ""))
.collect(Collectors.toList())
.toArray(new String[0]);
}

/** This gets the metadata (index file) exposed as cldr-json/cldr-transforms/transforms.json */
public CLDRTransformsJsonIndex getJsonIndex() {
final CLDRTransformsJsonIndex index = new CLDRTransformsJsonIndex();
return index;
}
}
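
The `available` array in the hunk above is built by removing `.xml` from each ID. A standalone sketch of the same idea follows; the class name `TransformIndexSketch` and the sample IDs are hypothetical. Note one deliberate difference: the hunk uses `id.replace(".xml", "")`, which removes every occurrence, while this sketch strips only a trailing suffix.

```java
import java.util.List;

final class TransformIndexSketch {
    /** Strips a trailing ".xml" from each ID and returns the result as an array. */
    static String[] available(List<String> ids) {
        return ids.stream()
                .map(id -> id.endsWith(".xml") ? id.substring(0, id.length() - ".xml".length()) : id)
                .toArray(String[]::new);
    }
}
```

Using `toArray(String[]::new)` on the stream avoids the intermediate list that `collect(Collectors.toList()).toArray(new String[0])` creates; the resulting array is the same.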
@@ -0,0 +1,2 @@
section=transforms ; path=//cldr/supplemental/transforms/.* ; package=transforms ; packageDesc=Transform data
dependency=core ; package=transforms
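
The two config lines above are `;`-separated `key=value` pairs. A minimal sketch of reading one such line follows; it is not the CLDR converter's own parser, and the class name `ConfigLineSketch` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConfigLineSketch {
    /** Splits a line such as "section=transforms ; package=transforms" into trimmed key/value pairs. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : line.split(";")) {
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments that are not key=value
            }
            fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
        }
        return fields;
    }
}
```

Under this reading, the `path` value of the first line is a regular expression, so an illustrative path such as `//cldr/supplemental/transforms/Example` would fall into the `transforms` section.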