CLDR-17535 Update likely subtags data #3966

Merged

Conversation

macchiati
Member

@macchiati macchiati commented Aug 16, 2024

CLDR-17535

Update the likely subtags with the new tool.

Still to do (but probably after alpha)

  • delete GenerateMaximalLocales and GenerateLikelyAdditions
  • convert to using JSON parser
  • read the wikidata each time (and fix the file name)
  • update the dev instructions
  • check the other tooling to make sure that updates to the languageData script and territory data put the most frequent scripts and territories at the front of the attribute values.
  • check the unit test for the likely data to make sure that it checks for redundancies.

The format has been cleaned up, and the diff will be hard to read, so look at the delta in #3958

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@macchiati macchiati marked this pull request as ready for review August 16, 2024 17:44
@macchiati macchiati marked this pull request as draft August 16, 2024 17:59
@macchiati
Member Author

Set to draft until I resolve the errors

srl295
srl295 previously approved these changes Aug 16, 2024
Member

@srl295 srl295 left a comment

LGTM so far.

@macchiati
Member Author

macchiati commented Aug 19, 2024

The tests pass locally now, so crossing my fingers that they do on the server.

@macchiati macchiati marked this pull request as ready for review August 19, 2024 04:49
@macchiati
Member Author

Ping: would like to merge this PR this morning if possible.

@macchiati macchiati requested a review from DavidLRowe August 19, 2024 15:52
@macchiati
Member Author

Updated the log of differences and silData (now using newest sil data)

https://docs.google.com/spreadsheets/d/1ObVSxPv2H1p_2NozyN3_DRDDlN8nj-fhidobocBoRj8/edit?gid=1749071284#gid=1749071284

* Scripts that either have no known languages as yet (Cpmn) or are used for any language
* (Brai).
*/
public static final Set<String> SCRIPTS_WITH_NO_LANGUAGES = Set.of("Brai", "Cpmn");
Member

"any language" would be good script metadata (and supplemental data). Zyyy, Zxxx in the same boat.

Member Author

Yes, it would be. BTW, Zyyy and Zxxx are disallowed in likely subtags. Otherwise OK?

@macchiati macchiati requested a review from srl295 August 19, 2024 16:07
Member

@srl295 srl295 left a comment

Looking good so far; a couple more files to review.

LstrType.region,
new LocaleValidator.AllowedMatch("001|419"),
LstrType.language,
new LocaleValidator.AllowedMatch("und|in|iw|ji|jw|mo|tl"));
Member

Isn't this information already in supplementalMetadata? For example: <languageAlias type="tl" replacement="fil" reason="legacy"/>

Member Author

A superset is. We only want the common ones here.
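
To illustrate the "common ones only" point, here is a minimal standalone sketch of how an alternation such as "und|in|iw|ji|jw|mo|tl" behaves as an allow-list. It uses only java.util.regex and does not call LocaleValidator; the class and method names are hypothetical, and matching the pattern against the whole subtag is an assumption about how AllowedMatch is applied, not something confirmed in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the CLDR tooling.
final class AllowListSketch {
    // Same alternation as in the diff above.
    private static final Pattern COMMON_LEGACY_LANGUAGES =
            Pattern.compile("und|in|iw|ji|jw|mo|tl");

    // Assumption: the pattern must match the entire subtag, not a prefix.
    static boolean isAllowedLegacyLanguage(String subtag) {
        return COMMON_LEGACY_LANGUAGES.matcher(subtag).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedLegacyLanguage("tl"));  // true: listed explicitly
        System.out.println(isAllowedLegacyLanguage("fil")); // false: not in the short list
        System.out.println(isAllowedLegacyLanguage("tlh")); // false: whole-subtag match only
    }
}
```

A short explicit list like this stays independent of the full alias table, which is the trade-off discussed in the exchange above.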

new LocaleValidator.AllowedValid(
null,
LstrType.region,
new LocaleValidator.AllowedMatch("001|419"),
Member

I'm not sure why this needs to be here - we already have en-150 for example.

Member Author

We have 150, but it is not externally relevant (e.g., in likely subtags).

@macchiati
Member Author

FYI
Base data: 3881
Minimized: 1296
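
As a rough illustration of why a minimized table can be much smaller than the base data, the sketch below drops one kind of redundant entry: a two-field key (such as zh_Hans) whose value is identical to the value of its bare-language key (zh), so that a lookup falling back to the bare language would give the same answer. This is an assumption-laden sketch, not the algorithm the new tool uses; the class name, method name, and sample entries are hypothetical, and it does not attempt to explain the exact counts above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the CLDR minimization code.
final class LikelyMinimizeSketch {
    /**
     * Returns a copy of the map without two-field entries (lang_script or
     * lang_region) whose value equals the value of the bare-language entry.
     * Longer keys are kept untouched, because their fallback chain may pass
     * through other entries first.
     */
    static Map<String, String> minimize(Map<String, String> likely) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : likely.entrySet()) {
            String key = entry.getKey();
            int sep = key.indexOf('_');
            boolean twoFields = sep > 0 && key.indexOf('_', sep + 1) < 0;
            if (twoFields) {
                String parentValue = likely.get(key.substring(0, sep));
                if (entry.getValue().equals(parentValue)) {
                    continue; // derivable from the bare-language entry
                }
            }
            result.put(key, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        // Illustrative entries only, not read from the real data file.
        base.put("zh", "zh_Hans_CN");
        base.put("zh_Hans", "zh_Hans_CN"); // same value as "zh": redundant
        base.put("zh_Hant", "zh_Hant_TW"); // differs from "zh": kept
        System.out.println(minimize(base).keySet());
    }
}
```

A unit test for redundancies, as mentioned in the to-do list, could assert that running such a pass over the checked-in data removes nothing.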

@macchiati macchiati merged commit 1a914d1 into unicode-org:main Aug 19, 2024
12 checks passed
@macchiati macchiati deleted the CLDR-17535-Update-LikelySubtags-data branch August 19, 2024 16:11