Support Unicode-aware regex #4365
this.badNegated = new LinkedHashSet<>();
this.badRange = new LinkedHashSet<>();
There's a great line in the documentation for LinkedHashSet:
"This implementation spares its clients from the unspecified, generally chaotic ordering provided by HashSet"
Very nice; seems like a really well thought through implementation!
Closes #4357
Convert K Unicode-based regex to an equivalent Flex byte-based regex:
r"😊*" becomes r"(\xF0\x9F\x98\x8A)*"
r"[a😊b]" becomes r"(\xF0\x9F\x98\x8A)|[ab]"
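For readers who want to see where the byte sequences in these examples come from, here is a minimal sketch of the idea, not the code in this PR. The class `FlexEscapeSketch` and its helpers `toFlexBytes`, `star`, and `charClass` are hypothetical names made up for illustration. The sketch assumes a character class with only literal letters or digits as members (no ranges, negation, or escapes) and reproduces the two examples by UTF-8 encoding each code point.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FlexEscapeSketch {
    // Hypothetical helper: render one code point as its UTF-8 bytes in \xHH form.
    static String toFlexBytes(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical helper: star over a single code point. The byte sequence is
    // grouped so that * applies to the whole character, not just its last byte.
    static String star(int codePoint) {
        return "(" + toFlexBytes(codePoint) + ")*";
    }

    // Hypothetical helper: a class of literal members (assumed to be plain
    // letters or digits). Multi-byte members become alternatives; ASCII members
    // stay together in one byte-level class.
    static String charClass(String members) {
        List<String> alternatives = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        members.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                ascii.appendCodePoint(cp);
            } else {
                alternatives.add("(" + toFlexBytes(cp) + ")");
            }
        });
        if (ascii.length() > 0) {
            alternatives.add("[" + ascii + "]");
        }
        return String.join("|", alternatives);
    }

    public static void main(String[] args) {
        System.out.println(star(0x1F60A));                // (\xF0\x9F\x98\x8A)*
        System.out.println(charClass("a\uD83D\uDE0Ab"));  // (\xF0\x9F\x98\x8A)|[ab]
    }
}
```

The grouping matters because Flex matches bytes: without the parentheses, `*` would repeat only the final byte `\x8A`.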
Additionally, check that:
character ranges [c1-c2] have codepoint(c1) <= codepoint(c2)
repetitions r{n,m} have n <= m
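The two ordering conditions above can be sketched as follows. This is an illustration under assumptions, not the validation code in this PR: `RegexBoundsSketch`, `validRange`, and `validRepeat` are hypothetical names. The point worth showing is that the range comparison is on code points, which can differ from comparing UTF-16 `char` values when one endpoint lies outside the Basic Multilingual Plane.

```java
public class RegexBoundsSketch {
    // Hypothetical check: a range [c1-c2] is well-formed when its endpoints
    // are ordered by code point.
    static boolean validRange(int c1, int c2) {
        return c1 <= c2;
    }

    // Hypothetical check: a repetition r{n,m} is well-formed when n <= m.
    static boolean validRepeat(int n, int m) {
        return n <= m;
    }

    public static void main(String[] args) {
        // U+FFFF is a single char; U+10000 is a surrogate pair starting with 0xD800.
        String lo = "\uFFFF";
        String hi = new String(Character.toChars(0x10000));

        // Comparing by code point accepts the range U+FFFF..U+10000.
        System.out.println(validRange(lo.codePointAt(0), hi.codePointAt(0)));

        // Comparing the first UTF-16 chars would reject it, since 0xFFFF > 0xD800.
        System.out.println(lo.charAt(0) <= hi.charAt(0));

        System.out.println(validRepeat(2, 5));
        System.out.println(validRepeat(5, 2));
    }
}
```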
The commit history is incremental, and I'd recommend reviewing commit-by-commit.