ICU-22707 Unicode 16 beta jun04 #3028
Done. Locally, only intltest rbbi fails now.
In this branch, or in a separate PR? (As discussed, I will want to do that with several commits, both to separate the proposals and because I want to keep a record of the steps of the LB25 derivation.)
This pull request here is set up to allow multiple commits, and when it's done I will rebase-and-merge them, not squash them. I assume that it would be easiest for you to add commits here directly for segmentation.
Sounds reasonable, I will add commits into this one then.
Force-pushed from 2f69ca9 to c466f45.
Force-pushed from 0e71e57 to d118c70.
Force-pushed from fbac93c to b68325e.
Oh, this is fun: This is the set
The set previously contained exactly these three characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i=. That item in the PAG report reads:
So I will now have to remove those extremely obscure lines from the rules, a welcome change from my usual routine of adding extremely obscure lines.
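The set expression itself is percent-encoded in the list-unicodeset link above. A minimal sketch for recovering it, using only the Python standard library (the helper name `unicodeset_from_url` is hypothetical, and the assumption is that the `a` query parameter carries the UnicodeSet):

```python
from urllib.parse import urlsplit, parse_qs

# The link quoted in the comment above, split only for line length.
URL = (
    "https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp"
    "?a=%5B%5Cp%7BU15.1%3Alb%3DIS%7D+%26+%5B%5Cp%7BU15.1%3Aea%3DF%7D"
    "%5Cp%7BU15.1%3Aea%3DW%7D%5Cp%7BU15.1%3Aea%3DH%7D%5D%5D&g=&i="
)


def unicodeset_from_url(url: str) -> str:
    """Return the decoded value of the `a` query parameter.

    Assumption: `a` holds the UnicodeSet expression; parse_qs undoes
    both the %XX escapes and the '+'-for-space encoding.
    """
    query = urlsplit(url).query
    return parse_qs(query)["a"][0]


SET_EXPRESSION = unicodeset_from_url(URL)
print(SET_EXPRESSION)
```

This decodes to the intersection of the Unicode 15.1 lb=IS characters with those whose East_Asian_Width is F, W, or H.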
Hi @eggrobin, thanks for making progress here!
Yes; I have brought in all the work that was already done, but as expected I need to appease the new monkeys. (And some clang warnings, etc.)
Mostly, no: things have already been consolidated (compare eggrobin/icu@unicode-org:icu:main...uax14-integration). What remains is split by UTC decision, and, e.g., the work on UTC-179-C35 is in turn split into the steps documented in the background section of item 5.15 of the report, plus the post-UTC correction; I want to retain these steps in the history of line.txt and friends. I expect that I will coalesce whatever additional work remains to be done into one or two commits though.
Yes, I somehow got distracted from ICU4[CJ] matters last week and dropped this ball. I intend to get back to this on Monday, please poke me with a sharp stick if I don’t.
Exciting Development: While testing the new monkeys, I came across a string which exposes a bug in my rules for LB19a. This seems completely tractable in ICU, and should not require a change on the UAX14 side, so this is not an all-hands-on-deck emergency. But it is still uncomfortably exciting. The string in question is
In ICU, LB19a was implemented in a slightly strange way: LB19 was unchanged, and the complement of LB19a is given break rules (this is to avoid having to add a profusion of rules for overlapping context spanning more than two code points).
But in this case, the lb=CM-as-AL does not follow a break, because LB14 applied. The solution should be to copy the existing rules that end with
This test case is sufficiently treacherous that it should be added both to rbbitst.txt and to the UCD’s own LineBreakTest.txt.
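To make the "complement of LB19a" idea concrete, here is a toy model of the context check only. It assumes the UAX #14 reading that LB19a forbids breaks on either side of an unresolved quotation mark unless the characters on both sides are East Asian (ea=F, W, or H), with start and end of text counting as not East Asian. It is not the ICU rule set: it ignores combining marks (LB9/LB10) and LB14, which are exactly what the bug above involves, and it uses the Unicode version bundled with Python rather than Unicode 16. Both function names are hypothetical.

```python
import unicodedata
from typing import Optional


def is_east_asian(ch: Optional[str]) -> bool:
    """True if ch has East_Asian_Width F, W, or H.

    None stands for start or end of text, which is not East Asian.
    """
    return ch is not None and unicodedata.east_asian_width(ch) in ("F", "W", "H")


def lb19a_forbids_break_around_qu(before: Optional[str], after: Optional[str]) -> bool:
    """Toy check for the neighbours of a quotation mark (lb=QU).

    `before` and `after` are the characters on either side of the
    quotation mark. Only when both are East Asian does this simplified
    model leave the break decision to the other rules; that case is the
    "complement" which would receive explicit break rules.
    """
    return not (is_east_asian(before) and is_east_asian(after))
```

For example, `lb19a_forbids_break_around_qu("a", "b")` is true, while a quotation mark between two hiragana letters falls into the complement.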
Force-pushed from 4f87a48 to 9782d0d.
Force-pushed from da1ebfc to 70089cd.
@markusicu Status report: 70089cd is green (except for clang warnings which I am fixing in the next commit), so if this is blocking too many things you could run with it. Also note that so far this PR does not upgrade any of the tailored copies of the line breaking algorithm (which should receive the same changes as the default). I don’t want to do that before I get the changes to the default right.
Great, thanks! 🎉
Given the US holiday and your and Elango's travel schedules, I suggest that we keep this PR open for now. If you have more time to work on it, you can make progress right here. It would be nice if it was still "green" next week. At that point I (and maybe Andy) could look it over for plausibility and code changes, and merge. And then I might try to rebase Elango's InCB PR -- or I might just wait for his return. Separately I could start fixing ICU UTS46 code for 16 once this PR is in.
Added Andy as a reviewer for the segmentation changes. (incomplete, see comments above and separate email)
…ine_(loose|normal)_cj
Force-pushed from 87933b2 to 793a0db.
Force-pushed from 398a85a to cf1cbbb.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
Ten days ago I had written:
I have now done the Tailorings step.
As discussed, that would probably not work once we regenerate ICU4J data.
@markusicu, the ICU4J section of https://unicode-org.github.io/icu/userguide/dev/rules_update.html points me to icu4c/source/data/icu4j-readme.txt; following those instructions, I am able to generate
@eggrobin I just pushed a commit with the regenerated ICU4J binary .brk files.
Force-pushed from 615b13e to 4d77ea8.
Force-pushed from 4d77ea8 to 4ad3566.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
Hi @eggrobin
Not those at least; see #3028 (comment). In general I think the commits should be sensible here, I have consistently been squashing minor tweaks into the relevant commits (hence the many force-pushes).
(Approving the changes in https://github.com/unicode-org/icu/pull/3028/files/1026f7464ec3966e49f263ead9215802d100ff05.)
Checklist
ALLOW_MANY_COMMITS=true