CLDR-17535 Update LikelySubtags tool #3958

Merged

@macchiati macchiati commented Aug 13, 2024

CLDR-17535

The goal is to integrate the SIL data reading into the tool for generating likely subtags. I'm creating a new tool because the existing one needs major cleanup before that can work.

common/supplemental/supplementalData.xml

  • fixed a few values for the predominant script

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java

  • completely new tool for generating the likely subtag data. It takes code from GenerateMaximalSubtags and breaks it apart into separate files, and cleans up a lot of cruft.
  • It now takes regular console options, and the options provide a lot of control over seeing what is happening at each point in the process.
  • It incorporates code to directly add the SIL data, after filtering out any data that would be redundant. It does a much better job of filtering the SIL data, allowing for a list of all of the discarded lines, plus the reasons for discarding them.
  • It prints a list of items that change from old to new, allowing for double-checking the data without depending on the file structure.
  • The output file has a better ordering; first come the regular locales, then the und_ elements, then the items added for SIL.
  • The file listing is cleaned up, so that the comments are on the same line, with an abbreviated format, eg
    • <likelySubtag from="sr_ME" to="sr_Latn_ME"/> <!--Serbian‧?‧Montenegro ➡ Serbian‧Latin‧Montenegro-->
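The old-to-new change listing described above can be sketched roughly as follows. This is a minimal illustration, not the code in GenerateLikelySubtags.java: the class name `LikelySubtagDiff`, the method `changes`, and the tab-separated report format are all assumptions made for the example. It compares two source-to-target maps by key, so the result does not depend on how either file is laid out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of the PR): reports mappings that differ between two runs. */
public class LikelySubtagDiff {

    /** Returns one tab-separated line per source tag whose target changed, was added, or was removed. */
    public static List<String> changes(Map<String, String> oldMap, Map<String, String> newMap) {
        // Sorted union of the keys, so the report order is independent of file ordering.
        TreeSet<String> sources = new TreeSet<>(oldMap.keySet());
        sources.addAll(newMap.keySet());

        List<String> report = new ArrayList<>();
        for (String source : sources) {
            String oldValue = oldMap.get(source);
            String newValue = newMap.get(source);
            if (!Objects.equals(oldValue, newValue)) {
                report.add(source
                        + "\t" + (oldValue == null ? "n/a" : oldValue)
                        + "\t" + (newValue == null ? "n/a" : newValue));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Illustrative subset of mappings; a real run would read both files.
        Map<String, String> oldMap = new TreeMap<>();
        oldMap.put("hif", "hif_Latn_FJ");
        oldMap.put("und_Latn_419", "es_Latn_419");
        oldMap.put("sr_ME", "sr_Latn_ME");

        Map<String, String> newMap = new TreeMap<>();
        newMap.put("hif", "hif_Deva_FJ");
        newMap.put("sr_ME", "sr_Latn_ME");

        changes(oldMap, newMap).forEach(System.out::println);
    }
}
```

Unchanged entries such as sr_ME are omitted from the report; removed entries show "n/a" as the new value.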

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LSRSource.java

  • A cleaned up version of a class that contains an LSR locale plus origin attribute values (eg 'sil1')

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LangTagsData.java

  • A cleaned up separate class that reads the SIL data, filling in a set of error values for possible later display.

tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRLocale.java

  • added a couple of small utilities (there's a separate ticket for more substantive cleanup)

tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java

  • made it clearer how to get a version that doesn't minimize

tools/cldr-code/src/main/java/org/unicode/cldr/util/LocaleScriptInfo.java

  • provides a cleaned-up class to get the default script for a locale

tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java

  • made the script values ordered, so that the result is predictable.

Changes (disregarding changes due to SIL data)

  1. The explicit new values make sense.
  2. The Laoo and hnj values are due to changed data (we hadn't rerun the tool last release); there was a change in the default value for hnj.
  3. The und_Latn values were all redundant. For example, because we have und_419 => es_Latn_419, the mapping und_Latn_419 => es_Latn_419 is superfluous (the code that uses the data falls back).

| Source | Name | oldValue | Name | newValue | Name |
|---|---|---|---|---|---|
| und_Aghb | (Caucasian Albanian) | udi_Aghb_RU | Udi (Caucasian Albanian, Russia) | xag_Aghb_AZ | Aghwan (Caucasian Albanian, Azerbaijan) |
| und_Krai | (Kirat Rai) | bap_Krai_NP | Bantawa (Kirat Rai, Nepal) | bap_Krai_IN | Bantawa (Kirat Rai, India) |
| hif | Fiji Hindi | hif_Latn_FJ | Fiji Hindi (Latin, Fiji) | hif_Deva_FJ | Fiji Hindi (Devanagari, Fiji) |
| gon | Gondi | gon_Telu_IN | Gondi (Telugu, India) | gon_Deva_IN | Gondi (Devanagari, India) |
| und_Hmng | (Pahawh Hmong) | hnj_Hmng_US | Hmong Njua (Pahawh Hmong, United States) | hnj_Hmng_LA | Hmong Njua (Pahawh Hmong, Laos) |
| kaa | Kara-Kalpak | kaa_Cyrl_UZ | Kara-Kalpak (Cyrillic, Uzbekistan) | kaa_Cyrl_TR | Kara-Kalpak (Cyrillic, Türkiye) |
| und_Cyrl_TR | (Cyrillic, Türkiye) | kbd_Cyrl_TR | Kabardian (Cyrillic, Türkiye) | kaa_Cyrl_TR | Kara-Kalpak (Cyrillic, Türkiye) |
| hnj_AU | Hmong Njua (Australia) | hnj_Laoo_AU | Hmong Njua (Lao, Australia) | n/a | |
| hnj_CN | Hmong Njua (China) | hnj_Laoo_CN | Hmong Njua (Lao, China) | n/a | |
| hnj_FR | Hmong Njua (France) | hnj_Laoo_FR | Hmong Njua (Lao, France) | n/a | |
| hnj_GF | Hmong Njua (French Guiana) | hnj_Laoo_GF | Hmong Njua (Lao, French Guiana) | n/a | |
| hnj_LA | Hmong Njua (Laos) | hnj_Laoo_LA | Hmong Njua (Lao, Laos) | n/a | |
| hnj_MM | Hmong Njua (Myanmar [Burma]) | hnj_Laoo_MM | Hmong Njua (Lao, Myanmar [Burma]) | n/a | |
| hnj_SR | Hmong Njua (Suriname) | hnj_Laoo_SR | Hmong Njua (Lao, Suriname) | n/a | |
| hnj_TH | Hmong Njua (Thailand) | hnj_Laoo_TH | Hmong Njua (Lao, Thailand) | n/a | |
| hnj_VN | Hmong Njua (Vietnam) | hnj_Laoo_VN | Hmong Njua (Lao, Vietnam) | n/a | |
| hnj_Laoo | Hmong Njua (Lao) | hnj_Laoo_LA | Hmong Njua (Lao, Laos) | n/a | |
| und_Laoo_AU | (Lao, Australia) | hnj_Laoo_AU | Hmong Njua (Lao, Australia) | n/a | |
| und_Laoo_CN | (Lao, China) | hnj_Laoo_CN | Hmong Njua (Lao, China) | n/a | |
| und_Laoo_FR | (Lao, France) | hnj_Laoo_FR | Hmong Njua (Lao, France) | n/a | |
| und_Laoo_GF | (Lao, French Guiana) | hnj_Laoo_GF | Hmong Njua (Lao, French Guiana) | n/a | |
| und_Laoo_MM | (Lao, Myanmar [Burma]) | hnj_Laoo_MM | Hmong Njua (Lao, Myanmar [Burma]) | n/a | |
| und_Laoo_SR | (Lao, Suriname) | hnj_Laoo_SR | Hmong Njua (Lao, Suriname) | n/a | |
| und_Laoo_TH | (Lao, Thailand) | hnj_Laoo_TH | Hmong Njua (Lao, Thailand) | n/a | |
| und_Laoo_US | (Lao, United States) | hnj_Laoo_US | Hmong Njua (Lao, United States) | n/a | |
| und_Laoo_VN | (Lao, Vietnam) | hnj_Laoo_VN | Hmong Njua (Lao, Vietnam) | n/a | |
| und_Latn_419 | (Latin, Latin America) | es_Latn_419 | Latin American Spanish (Latin) | n/a | |
| und_Latn_MU | (Latin, Mauritius) | mfe_Latn_MU | Morisyen (Latin, Mauritius) | n/a | |
| und_Latn_SL | (Latin, Sierra Leone) | kri_Latn_SL | Krio (Latin, Sierra Leone) | n/a | |
| und_Latn_TK | (Latin, Tokelau) | tkl_Latn_TK | Tokelau (Latin, Tokelau) | n/a | |
| und_Latn_ZM | (Latin, Zambia) | bem_Latn_ZM | Bemba (Latin, Zambia) | n/a | |
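
To illustrate why the dropped und_Latn_* rows lose nothing, here is a rough sketch of a fallback lookup. It is not CLDR's implementation: the class `FallbackLookupSketch` and its methods are hypothetical, and the lookup order (language+script+region, then language+region, then language+script, then language) is an assumption chosen to match the und_419 example in item 3. Fields supplied by the caller are kept; only the missing ones are filled in from the matched entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a likely-subtags lookup with fallback; not CLDR's actual code. */
public class FallbackLookupSketch {

    /** Joins the non-empty fields with underscores, e.g. ("und", "", "419") becomes "und_419". */
    public static String join(String language, String script, String region) {
        StringBuilder sb = new StringBuilder(language);
        if (!script.isEmpty()) sb.append('_').append(script);
        if (!region.isEmpty()) sb.append('_').append(region);
        return sb.toString();
    }

    /** Returns the maximized tag, or null if no key matches. An empty string means "field not given". */
    public static String maximize(
            Map<String, String> data, String language, String script, String region) {
        // Assumed order: most specific key first, region-only before script-only.
        String[] keys = {
            join(language, script, region),
            join(language, "", region),
            join(language, script, ""),
            join(language, "", "")
        };
        for (String key : keys) {
            String full = data.get(key);
            if (full == null) continue;
            String[] lsr = full.split("_"); // targets are assumed to be language_script_region
            return join(
                    language.equals("und") ? lsr[0] : language,
                    script.isEmpty() ? lsr[1] : script,
                    region.isEmpty() ? lsr[2] : region);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> withoutExplicit = new HashMap<>();
        withoutExplicit.put("und_419", "es_Latn_419");

        Map<String, String> withExplicit = new HashMap<>(withoutExplicit);
        withExplicit.put("und_Latn_419", "es_Latn_419");

        // Same answer either way, so the explicit und_Latn_419 entry adds nothing.
        System.out.println(maximize(withoutExplicit, "und", "Latn", "419"));
        System.out.println(maximize(withExplicit, "und", "Latn", "419"));
    }
}
```

Under these assumptions both calls produce es_Latn_419: without the explicit entry, the lookup misses und_Latn_419, falls back to und_419, and keeps the caller's script and region.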

Here are samples of the log when reading the SIL file:

| Log Type | Sample | Comment |
|---|---|---|
| ill_formed_tags | aae_x_sub84 ➡ aae_Latn_IT_x_sub84 | remove tags that aren't LSR |
| exception | apc ➡ apc_Arab_SY | gets an exception (in this case, an overwrite) |
| skipping_scope | abj ➡ abj_Zyyy_IN | language, script or region has wrong scope, eg {script=special} |
| tag_is_full | az_Cyrl_AZ ➡ az_Cyrl_AZ | tag is completely redundant |
| language_of_tag_missing | abq_Cyrl ➡ abq_Cyrl_RU | we can't have LS or LR without also having a line for L |
| redundant_mapping | aa_Arab ➡ aa_Arab_ET | if we have aa_ET ➡ aa_Arab_ET, then aa_Arab is redundant |
| canonicalizing | ajp ➡ ajp_Arab_JO | applies the regular CLDR canonicalization |
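
The idea of discarding SIL lines while recording a reason for each can be sketched as below. This is a simplified illustration, not the logic in LangTagsData.java: the class `SilFilterSketch`, the regular expression for a language-script-region tag, and the three checks shown are assumptions, and the remaining log types (exception, skipping_scope, redundant_mapping, canonicalizing) are left out because they need CLDR data. The `xx` mapping is a placeholder rather than real data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: filter candidate mappings and keep the reason for every discarded line. */
public class SilFilterSketch {

    // Assumed shape of an LSR tag: language, optional 4-letter script, optional region.
    private static final Pattern LSR =
            Pattern.compile("[a-z]{2,3}(_[A-Z][a-z]{3})?(_([A-Z]{2}|[0-9]{3}))?");

    /** Returns the log type under which the mapping is discarded, or null if it is kept. */
    public static String reasonToDiscard(String source, String target, Map<String, String> all) {
        if (!LSR.matcher(source).matches() || !LSR.matcher(target).matches()) {
            return "ill_formed_tags";
        }
        if (source.equals(target)) {
            return "tag_is_full";
        }
        String language = source.split("_")[0];
        if (!source.equals(language) && !all.containsKey(language)) {
            return "language_of_tag_missing"; // an LS or LR line needs a line for L
        }
        return null;
    }

    /** Returns the kept mappings; discarded lines are appended to {@code discarded} by reason. */
    public static Map<String, String> filter(
            Map<String, String> candidates, Map<String, List<String>> discarded) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : candidates.entrySet()) {
            String reason = reasonToDiscard(e.getKey(), e.getValue(), candidates);
            if (reason == null) {
                kept.put(e.getKey(), e.getValue());
            } else {
                discarded.computeIfAbsent(reason, k -> new ArrayList<>())
                        .add(e.getKey() + " => " + e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("aae_x_sub84", "aae_Latn_IT_x_sub84");
        candidates.put("az_Cyrl_AZ", "az_Cyrl_AZ");
        candidates.put("abq_Cyrl", "abq_Cyrl_RU");
        candidates.put("xx", "xx_Latn_ZZ"); // placeholder mapping, not real data

        Map<String, List<String>> discarded = new LinkedHashMap<>();
        Map<String, String> kept = filter(candidates, discarded);

        System.out.println("kept: " + kept);
        discarded.forEach((reason, lines) -> System.out.println(reason + ": " + lines));
    }
}
```

Keeping the discarded lines grouped by reason is what makes it possible to print them later for review rather than dropping them silently.
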
  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@macchiati macchiati marked this pull request as ready for review August 16, 2024 01:35

@srl295 srl295 left a comment

good refactoring and such.

Comment on lines +903 to +904
"und_" + script,
"und_" + script + "_" + region,

fine, but probably ideally we would use a builder model here for tag production
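
For illustration, a builder for such tags might look roughly like the sketch below. The class `LsrTagBuilder` and its methods are hypothetical and do not exist in CLDR; the sketch only shows how the string concatenation in the quoted lines could be replaced by named setters.

```java
/** Hypothetical builder for language_script_region tags; not an existing CLDR class. */
public class LsrTagBuilder {
    private String language = "und"; // default when no language is given
    private String script = "";
    private String region = "";

    public LsrTagBuilder language(String value) {
        this.language = value;
        return this;
    }

    public LsrTagBuilder script(String value) {
        this.script = value;
        return this;
    }

    public LsrTagBuilder region(String value) {
        this.region = value;
        return this;
    }

    /** Joins the non-empty fields with underscores. */
    public String build() {
        StringBuilder sb = new StringBuilder(language);
        if (!script.isEmpty()) sb.append('_').append(script);
        if (!region.isEmpty()) sb.append('_').append(region);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(new LsrTagBuilder().script("Laoo").build());
        System.out.println(new LsrTagBuilder().script("Laoo").region("AU").build());
    }
}
```

With this shape, the two quoted expressions would become `new LsrTagBuilder().script(script).build()` and `new LsrTagBuilder().script(script).region(region).build()`.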

Map<String, LSRSource> result = new TreeMap<>();
LanguageTagCanonicalizer langCanoner = new LanguageTagCanonicalizer(null);
try {
Files.lines(path)

we shouldn't be parsing a JSON file using regex… (but this is preexisting code too). Use JsonStreamParser (SAX) instead, or perhaps gson.fromJson() (DOM).

Member Author

Agreed. I didn't realize we had a JSON parser already in CLDR

private Multimap<String, String> readWikidata() {
Multimap<String, String> result = TreeMultimap.create();
Path path =
Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv")

perhaps rename the file too at some point?

Suggested change
- Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv")
+ Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wikidata_lang_region.tsv")

Member Author

good idea. Actually, we should be using a tool to get that data from wikidata; right now it was just a one-time query.

srl295 commented Aug 16, 2024

it looks like this doesn't actually include the results of a new run of generate likely?

@macchiati macchiati merged commit 0a99e76 into unicode-org:main Aug 16, 2024
12 checks passed
@macchiati macchiati deleted the CLDR-17535-Update-LikelySubtags-tool branch August 16, 2024 15:58