CLDR-17535 Update LikelySubtags tool #3958

macchiati · 2024-08-13T23:54:32Z

The goal is to integrate the SIL data reading into the tool for generating likely subtags. I'm creating a new tool because the existing one needs major cleanup before that integration can work.

common/supplemental/supplementalData.xml

fixed a few values for the predominant script

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java

completely new tool for generating the likely subtag data. It takes code from GenerateMaximalSubtags and breaks it apart into separate files, and cleans up a lot of cruft.
It now takes regular console options, and the options provide a lot of control over seeing what is happening at each point in the process.
It incorporates code to directly add the SIL data, after filtering out any data that would be redundant. It does a much better job of filtering the SIL data, allowing for a list of all of the discarded lines, plus the reasons for discarding them.
It prints a list of items that change from old to new, allowing for double-checking the data without depending on the file structure.
The output file has a better ordering; first come the regular locales, then the und_ elements, then the items added for SIL.
The file listing is cleaned up, so that the comments are on the same line, with an abbreviated format, eg
- <likelySubtag from="sr_ME" to="sr_Latn_ME"/>
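
The old-to-new comparison described above could be sketched roughly as follows. This is only an illustration, not the code in GenerateLikelySubtags: the class `LikelyDiff`, its method names, and the use of `∅` for a removed entry are assumptions, and the sample mappings are taken from the change table in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of CLDR.
class LikelyDiff {
    /** Returns one tab-separated line per source whose target was changed, added, or removed. */
    static List<String> diff(Map<String, String> oldMap, Map<String, String> newMap) {
        // Union of all sources, sorted so the report order is stable.
        TreeSet<String> sources = new TreeSet<>(oldMap.keySet());
        sources.addAll(newMap.keySet());
        List<String> changes = new ArrayList<>();
        for (String source : sources) {
            String oldValue = oldMap.get(source);
            String newValue = newMap.get(source);
            if (!Objects.equals(oldValue, newValue)) {
                changes.add(source + "\t" + show(oldValue) + "\t" + show(newValue));
            }
        }
        return changes;
    }

    /** Shows a missing value as the empty-set sign (U+2205). */
    static String show(String value) {
        return value == null ? "\u2205" : value;
    }

    public static void main(String[] args) {
        Map<String, String> oldMap = new TreeMap<>();
        oldMap.put("hif", "hif_Latn_FJ");
        oldMap.put("sr_ME", "sr_Latn_ME");
        oldMap.put("und_Latn_419", "es_Latn_419");

        Map<String, String> newMap = new TreeMap<>();
        newMap.put("hif", "hif_Deva_FJ");
        newMap.put("sr_ME", "sr_Latn_ME");

        // Only hif (changed) and und_Latn_419 (removed) are reported; sr_ME is unchanged.
        diff(oldMap, newMap).forEach(System.out::println);
    }
}
```

Comparing the mappings themselves, rather than the generated XML text, is what makes the check independent of ordering and comment layout in the file.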

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LSRSource.java

A cleaned up version of a class that contains an LSR locale plus origin attribute values (eg 'sil1')

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LangTagsData.java

A cleaned up separate class that reads the SIL data, filling in a set of error values for possible later display.

tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRLocale.java

added a couple of small utilities (there's a separate ticket for more substantive cleanup)

tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java

made it clearer how to get a version that doesn't minimize

tools/cldr-code/src/main/java/org/unicode/cldr/util/LocaleScriptInfo.java

provides a cleaned-up class to get the default script for a locale

tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java

made the script values ordered, so that a predictable result would obtain.

Changes (disregarding changes due to SIL data)

The explicit new values make sense.
The Laoo and hnj values are due to changed data (we hadn't rerun the tool last release); there was a change in the default value for hnj.
The und_Latn values were all redundant. For example, because we have und_419 => es_Latn_419, the und_Latn_419 => es_Latn_419 mapping is superfluous (the code that uses the data falls back).
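
To make the redundancy concrete, here is a simplified sketch of a lookup that falls back when the exact key is missing. It is not the CLDR or ICU implementation: the class `FallbackLookup` is hypothetical, the fallback order (exact key, then language_region, then language_script, then language) is an assumption chosen to illustrate the point, and a real implementation would also merge the input's explicit subtags into the result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of CLDR.
class FallbackLookup {
    /** Tries progressively shorter keys and returns the first match, or null if none. */
    static String lookup(Map<String, String> data, String lang, String script, String region) {
        String[] keys = {
            lang + "_" + script + "_" + region, // exact key
            lang + "_" + region,                // assumed fallback: drop the script
            lang + "_" + script,                // assumed fallback: drop the region
            lang                                // language only
        };
        for (String key : keys) {
            String value = data.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> withExplicit = new HashMap<>();
        withExplicit.put("und_419", "es_Latn_419");
        withExplicit.put("und_Latn_419", "es_Latn_419");

        // Same data with the explicit und_Latn_419 line removed.
        Map<String, String> withoutExplicit = new HashMap<>(withExplicit);
        withoutExplicit.remove("und_Latn_419");

        System.out.println(lookup(withExplicit, "und", "Latn", "419"));
        System.out.println(lookup(withoutExplicit, "und", "Latn", "419"));
    }
}
```

Both calls return es_Latn_419, so under these assumptions the und_Latn_419 line adds nothing.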

Source	Source name	oldValue	oldValue name	newValue	newValue name
und_Aghb	(Caucasian Albanian)	udi_Aghb_RU	Udi (Caucasian Albanian, Russia)	xag_Aghb_AZ	Aghwan (Caucasian Albanian, Azerbaijan)
und_Krai	(Kirat Rai)	bap_Krai_NP	Bantawa (Kirat Rai, Nepal)	bap_Krai_IN	Bantawa (Kirat Rai, India)
hif	Fiji Hindi	hif_Latn_FJ	Fiji Hindi (Latin, Fiji)	hif_Deva_FJ	Fiji Hindi (Devanagari, Fiji)
gon	Gondi	gon_Telu_IN	Gondi (Telugu, India)	gon_Deva_IN	Gondi (Devanagari, India)
und_Hmng	(Pahawh Hmong)	hnj_Hmng_US	Hmong Njua (Pahawh Hmong, United States)	hnj_Hmng_LA	Hmong Njua (Pahawh Hmong, Laos)
kaa	Kara-Kalpak	kaa_Cyrl_UZ	Kara-Kalpak (Cyrillic, Uzbekistan)	kaa_Cyrl_TR	Kara-Kalpak (Cyrillic, Türkiye)
und_Cyrl_TR	(Cyrillic, Türkiye)	kbd_Cyrl_TR	Kabardian (Cyrillic, Türkiye)	kaa_Cyrl_TR	Kara-Kalpak (Cyrillic, Türkiye)
hnj_AU	Hmong Njua (Australia)	hnj_Laoo_AU	Hmong Njua (Lao, Australia)	∅	n/a
hnj_CN	Hmong Njua (China)	hnj_Laoo_CN	Hmong Njua (Lao, China)	∅	n/a
hnj_FR	Hmong Njua (France)	hnj_Laoo_FR	Hmong Njua (Lao, France)	∅	n/a
hnj_GF	Hmong Njua (French Guiana)	hnj_Laoo_GF	Hmong Njua (Lao, French Guiana)	∅	n/a
hnj_LA	Hmong Njua (Laos)	hnj_Laoo_LA	Hmong Njua (Lao, Laos)	∅	n/a
hnj_MM	Hmong Njua (Myanmar [Burma])	hnj_Laoo_MM	Hmong Njua (Lao, Myanmar [Burma])	∅	n/a
hnj_SR	Hmong Njua (Suriname)	hnj_Laoo_SR	Hmong Njua (Lao, Suriname)	∅	n/a
hnj_TH	Hmong Njua (Thailand)	hnj_Laoo_TH	Hmong Njua (Lao, Thailand)	∅	n/a
hnj_VN	Hmong Njua (Vietnam)	hnj_Laoo_VN	Hmong Njua (Lao, Vietnam)	∅	n/a
hnj_Laoo	Hmong Njua (Lao)	hnj_Laoo_LA	Hmong Njua (Lao, Laos)	∅	n/a
und_Laoo_AU	(Lao, Australia)	hnj_Laoo_AU	Hmong Njua (Lao, Australia)	∅	n/a
und_Laoo_CN	(Lao, China)	hnj_Laoo_CN	Hmong Njua (Lao, China)	∅	n/a
und_Laoo_FR	(Lao, France)	hnj_Laoo_FR	Hmong Njua (Lao, France)	∅	n/a
und_Laoo_GF	(Lao, French Guiana)	hnj_Laoo_GF	Hmong Njua (Lao, French Guiana)	∅	n/a
und_Laoo_MM	(Lao, Myanmar [Burma])	hnj_Laoo_MM	Hmong Njua (Lao, Myanmar [Burma])	∅	n/a
und_Laoo_SR	(Lao, Suriname)	hnj_Laoo_SR	Hmong Njua (Lao, Suriname)	∅	n/a
und_Laoo_TH	(Lao, Thailand)	hnj_Laoo_TH	Hmong Njua (Lao, Thailand)	∅	n/a
und_Laoo_US	(Lao, United States)	hnj_Laoo_US	Hmong Njua (Lao, United States)	∅	n/a
und_Laoo_VN	(Lao, Vietnam)	hnj_Laoo_VN	Hmong Njua (Lao, Vietnam)	∅	n/a
und_Latn_419	(Latin, Latin America)	es_Latn_419	Latin American Spanish (Latin)	∅	n/a
und_Latn_MU	(Latin, Mauritius)	mfe_Latn_MU	Morisyen (Latin, Mauritius)	∅	n/a
und_Latn_SL	(Latin, Sierra Leone)	kri_Latn_SL	Krio (Latin, Sierra Leone)	∅	n/a
und_Latn_TK	(Latin, Tokelau)	tkl_Latn_TK	Tokelau (Latin, Tokelau)	∅	n/a
und_Latn_ZM	(Latin, Zambia)	bem_Latn_ZM	Bemba (Latin, Zambia)	∅	n/a

Here are samples of the log when reading the SIL file:

Log Type	Sample	Comment
ill_formed_tags	`aae_x_sub84 ➡ aae_Latn_IT_x_sub84`	remove tags that aren't LSR
exception	`apc ➡ apc_Arab_SY`	gets an exception (in this case, an overwrite)
skipping_scope	`abj ➡ abj_Zyyy_IN`	language, script or region has wrong scope, eg {script=special}
tag_is_full	`az_Cyrl_AZ ➡ az_Cyrl_AZ`	tag is completely redundant
language_of_tag_missing	`abq_Cyrl ➡ abq_Cyrl_RU`	we can't have LS or LR without also having a line for L
redundant_mapping	`aa_Arab ➡ aa_Arab_ET`	if we have aa_ET ➡ aa_Arab_ET, then aa_Arab is redundant
canonicalizing	`ajp ➡ ajp_Arab_JO`	applies the regular CLDR canonicalization
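
A few of the log types above can be sketched as a small classifier. This is a hedged illustration only, not the logic in LangTagsData: the class `SilFilterSketch`, the regular expression for an LSR tag, and the order of the checks are all assumptions, and the remaining log types (exception, skipping_scope, redundant_mapping, canonicalizing) are left out.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of CLDR.
class SilFilterSketch {
    enum Reason { ill_formed_tags, tag_is_full, language_of_tag_missing, keep }

    // Assumed shape of an LSR tag: language, optional 4-letter script, optional region.
    static final Pattern LSR =
            Pattern.compile("[a-z]{2,3}(_[A-Z][a-z]{3})?(_([A-Z]{2}|[0-9]{3}))?");

    /**
     * Classifies one source-to-target line.
     * languagesWithOwnLine holds the languages that have a bare L line in the data.
     */
    static Reason classify(String source, String target, Set<String> languagesWithOwnLine) {
        if (!LSR.matcher(source).matches() || !LSR.matcher(target).matches()) {
            return Reason.ill_formed_tags; // e.g. private-use subtags such as _x_sub84
        }
        if (source.equals(target)) {
            return Reason.tag_is_full; // mapping a full tag to itself adds nothing
        }
        String language = source.split("_")[0];
        if (source.contains("_") && !languagesWithOwnLine.contains(language)) {
            return Reason.language_of_tag_missing; // LS or LR without a line for L
        }
        return Reason.keep;
    }

    public static void main(String[] args) {
        Set<String> none = Collections.emptySet();
        System.out.println(classify("aae_x_sub84", "aae_Latn_IT_x_sub84", none));
        System.out.println(classify("az_Cyrl_AZ", "az_Cyrl_AZ", none));
        System.out.println(classify("abq_Cyrl", "abq_Cyrl_RU", none));
    }
}
```

Returning a reason for every line, rather than silently dropping it, is what allows the discarded lines to be listed together with why they were discarded.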

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

srl295

good refactoring and such.

srl295 · 2024-08-16T15:05:46Z

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java

+                    "und_" + script,
+                    "und_" + script + "_" + region,


fine, but probably ideally we would use a builder model here for tag production
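
A builder for these tag strings might look roughly like the following. This is a hypothetical sketch, not an existing CLDR class: `TagBuilder` and its methods are invented for illustration, and defaulting the language to `und` is an assumption based on the concatenations in the diff above.

```java
// Hypothetical illustration, not part of CLDR.
class TagBuilder {
    private String language = "und"; // assumed default when no language is set
    private String script = "";
    private String region = "";

    TagBuilder language(String language) {
        this.language = language;
        return this;
    }

    TagBuilder script(String script) {
        this.script = script;
        return this;
    }

    TagBuilder region(String region) {
        this.region = region;
        return this;
    }

    /** Joins the non-empty subtags with underscores. */
    String build() {
        StringBuilder sb = new StringBuilder(language);
        if (!script.isEmpty()) {
            sb.append('_').append(script);
        }
        if (!region.isEmpty()) {
            sb.append('_').append(region);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(new TagBuilder().script("Latn").build());
        System.out.println(new TagBuilder().script("Latn").region("419").build());
    }
}
```

The two calls in `main` produce und_Latn and und_Latn_419, the same strings as the concatenations, but with the separator logic kept in one place.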

srl295 · 2024-08-16T15:13:26Z

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LangTagsData.java

+        Map<String, LSRSource> result = new TreeMap<>();
+        LanguageTagCanonicalizer langCanoner = new LanguageTagCanonicalizer(null);
+        try {
+            Files.lines(path)


we shouldn't be parsing a JSON file using regex… (but this is preexisting code too). Use JsonStreamParser (SAX) instead, or perhaps gson.fromJson() (DOM).

macchiati: Agreed. I didn't realize we already had a JSON parser in CLDR.

srl295 · 2024-08-16T15:14:12Z

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LangTagsData.java

+    private Multimap<String, String> readWikidata() {
+        Multimap<String, String> result = TreeMultimap.create();
+        Path path =
+                Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv")


perhaps rename the file too at some point?

Suggested change

-                Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv")
+                Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wikidata_lang_region.tsv")

macchiati: Good idea. Actually, we should be using a tool to get that data from Wikidata; right now it is just the result of a one-time query.

srl295 · 2024-08-16T15:16:06Z

it looks like this doesn't actually include the results of a new run of generate likely?

CLDR-17535 Update LikelySubtags tool

1851b77

github-actions bot assigned macchiati Aug 13, 2024

macchiati added 2 commits August 14, 2024 19:56

CLDR-17535 Adding sil data, console options, etc.

ea0f7ab

CLDR-17535 General cleanup

6010495

macchiati marked this pull request as ready for review August 16, 2024 01:35

macchiati requested review from srl295, DavidLRowe and pedberg-icu August 16, 2024 02:52

srl295 approved these changes Aug 16, 2024

macchiati merged commit 0a99e76 into unicode-org:main Aug 16, 2024
12 checks passed

macchiati deleted the CLDR-17535-Update-LikelySubtags-tool branch August 16, 2024 15:58

macchiati mentioned this pull request Aug 16, 2024

CLDR-17535 Update likely subtags data #3966

Merged

1 task
