-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Unicode 15.1 for line segmenter #5218
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let someone else review the algorithm
@@ -19,97 +19,111 @@ const UNKNOWN: u8 = 0; | |||
#[allow(dead_code)] | |||
const AI: u8 = 1; | |||
#[allow(dead_code)] | |||
const AL: u8 = 2; | |||
const AK: u8 = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably breaks compatibility with current data structs, so you need to bump the version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already bumped to V2 for 2.0. Should I bump to V3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
V2 isn't released yet so it's fine as long as long as this only affects V2 structs (looks like dictionary and LSTM are still on V1)
r? @eggrobin for the algorithm |
I addressed the comments I had made above, moving the LB15a-specific logic into the state table (it is not exactly pretty, we should probably invest in some extensions to the state table description language; but it is nowhere near as bad as the ZWJ-CM situation was). |
Maybe we should be using something other than |
Now tested with 2 000 000 random strings. Before ICU 76 (specifically, before unicode-org/icu#3028) it doesn’t make sense to go further than that, since it turned out the PRNG was cycling around that length (see L2/24-162 §5.6). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any issue but cannot confirm that it's correct. Trusting the monkeys
This fix supports Unicode 15.1 for line segmenter. (a part of #3255)