-
Notifications
You must be signed in to change notification settings - Fork 384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLDR-17535 Update LikelySubtags tool #3958
CLDR-17535 Update LikelySubtags tool #3958
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good refactoring and such.
"und_" + script, | ||
"und_" + script + "_" + region, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine, but probably ideally we would use a builder model here for tag production
Map<String, LSRSource> result = new TreeMap<>(); | ||
LanguageTagCanonicalizer langCanoner = new LanguageTagCanonicalizer(null); | ||
try { | ||
Files.lines(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't be parsing a JSON file using regex… (but this is preexisting code too). Use JsonStreamParser
(SAX) instead, or perhaps gson.fromJson()
(DOM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I didn't realize we had a JSON parser already in CLDR
private Multimap<String, String> readWikidata() { | ||
Multimap<String, String> result = TreeMultimap.create(); | ||
Path path = | ||
Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps rename the file too at some point?
Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wididata_lang_region.tsv") | |
Paths.get(CLDRPaths.BIRTH_DATA_DIR, "/../external/wikidata_lang_region.tsv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea. Actually, we should be using a tool to get that data from wikidata; right now it was just a one-time query.
it looks like this doesn't actually include the results of a new run of generate likely? |
CLDR-17535
The goal is to integrate the SIL data reading into the tool for generating likely subtags. I'm creating a new tool because it needs some major cleanup in order for it to work.
common/supplemental/supplementalData.xml
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java
<likelySubtag from="sr_ME" to="sr_Latn_ME"/> <!--Serbian‧?‧Montenegro ➡ Serbian‧Latin‧Montenegro-->
tools/cldr-code/src/main/java/org/unicode/cldr/tool/LSRSource.java
tools/cldr-code/src/main/java/org/unicode/cldr/tool/LangTagsData.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRLocale.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/LocaleScriptInfo.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java
Changes (disregarding changes due to SIL data)
Here are samples of the log when reading the SIL file:
aae_x_sub84 ➡ aae_Latn_IT_x_sub84
apc ➡ apc_Arab_SY
abj ➡ abj_Zyyy_IN
az_Cyrl_AZ ➡ az_Cyrl_AZ
abq_Cyrl ➡ abq_Cyrl_RU
aa_Arab ➡ aa_Arab_ET
ajp ➡ ajp_Arab_JO
ALLOW_MANY_COMMITS=true