Support Unicode-aware regex #4365

Merged 12 commits into develop from regex-unicode on May 22, 2024

@Scott-Guest Scott-Guest commented May 18, 2024

Closes #4357

Convert K Unicode-based regex to an equivalent Flex byte-based regex:

  • Outside of a character class, just parenthesize to keep the bytes grouped
    • r"😊*" becomes r"(\xF0\x9F\x98\x8A)*"
  • Inside a non-negated character class, factor each standalone non-ASCII character out into an explicit alternative (|)
    • r"[a😊b]" becomes r"(\xF0\x9F\x98\x8A)|[ab]"
  • In all other cases (character ranges and negated character classes), report an error if there are non-ASCII characters
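
For readers who want to see the three rules in one place, here is a minimal Java sketch. It is not the code from this PR: the class `Utf8RegexSketch` and its methods are hypothetical names, it ignores escape sequences and metacharacter quoting, and it treats any `-` between two characters as a range, which is stricter than a real parser would be.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the conversion rules; not the K frontend's actual code.
final class Utf8RegexSketch {
    /** Render the UTF-8 encoding of a code point as \xHH escapes. */
    static String utf8Escapes(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Outside a character class: parenthesize non-ASCII so operators like * apply to all bytes. */
    static String convertLiteral(int codePoint) {
        if (codePoint < 0x80) {
            return new String(Character.toChars(codePoint)); // ASCII passes through unchanged
        }
        return "(" + utf8Escapes(codePoint) + ")";
    }

    /** Convert the text between '[' and ']' of a character class (no escapes handled). */
    static String convertClass(String body) {
        int[] cps = body.codePoints().toArray();
        boolean negated = cps.length > 0 && cps[0] == '^';
        int start = negated ? 1 : 0;
        StringBuilder alternatives = new StringBuilder();
        StringBuilder ascii = new StringBuilder();
        for (int i = start; i < cps.length; i++) {
            int cp = cps[i];
            if (cp < 0x80) {
                ascii.appendCodePoint(cp);
                continue;
            }
            // Conservative range detection: a '-' with a character on both sides.
            boolean rangeStart = i + 2 < cps.length && cps[i + 1] == '-';
            boolean rangeEnd = i - 2 >= start && cps[i - 1] == '-';
            if (negated || rangeStart || rangeEnd) {
                throw new IllegalArgumentException(
                    String.format("Unsupported non-ASCII character U+%04X in negated class or range", cp));
            }
            alternatives.append('(').append(utf8Escapes(cp)).append(")|");
        }
        if (ascii.length() == 0 && alternatives.length() > 0) {
            // Only non-ASCII members: drop the trailing '|' and emit no empty class.
            return alternatives.substring(0, alternatives.length() - 1);
        }
        return alternatives + (negated ? "[^" : "[") + ascii + "]";
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F60A));
        System.out.println(convertLiteral(0x1F60A) + "*");      // (\xF0\x9F\x98\x8A)*
        System.out.println(convertClass("a" + smiley + "b"));   // (\xF0\x9F\x98\x8A)|[ab]
    }
}
```

The parentheses matter because Flex applies `*` to the preceding byte only; without grouping, the star would bind to `\x8A` alone.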

Additionally,

  • Check that character ranges [c1-c2] have codepoint(c1) <= codepoint(c2)
  • Check that numeric ranges r{n,m} have n <= m
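
As a rough illustration of the two checks, the sketch below uses hypothetical helpers (`RangeCheckSketch`, `isValidCharRange`, `isValidNumericRange`) that do not come from this PR. The point it tries to make is that the character-range check compares code points, not UTF-16 code units, which would order supplementary characters incorrectly.

```java
// Hypothetical sketch of the two range checks; names are illustrative only.
final class RangeCheckSketch {
    /** A range [c1-c2] is valid when codepoint(c1) <= codepoint(c2). */
    static boolean isValidCharRange(String c1, String c2) {
        // codePointAt(0) reads a full code point, joining surrogate pairs if present.
        return c1.codePointAt(0) <= c2.codePointAt(0);
    }

    /** A repetition r{n,m} is valid when n <= m. */
    static boolean isValidNumericRange(int n, int m) {
        return n <= m;
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F60A));
        // U+FFFF < U+1F60A by code point, although '\uFFFF' > '\uD83D' as UTF-16 units.
        System.out.println(isValidCharRange("\uFFFF", smiley)); // true
        System.out.println(isValidCharRange("z", "a"));         // false
        System.out.println(isValidNumericRange(3, 2));          // false
    }
}
```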

The commit history is incremental, and I'd recommend reviewing commit-by-commit.

@Scott-Guest Scott-Guest self-assigned this May 18, 2024
@Scott-Guest Scott-Guest requested review from Baltoli and gtrepta and removed request for Baltoli May 18, 2024 23:48
@Scott-Guest Scott-Guest marked this pull request as ready for review May 18, 2024 23:48
@Scott-Guest Scott-Guest force-pushed the regex-unicode branch 2 times, most recently from 342a035 to ef7f2f6 Compare May 20, 2024 01:08
Comment on lines +161 to +162
this.badNegated = new LinkedHashSet<>();
this.badRange = new LinkedHashSet<>();

There's a great line in the documentation for LinkedHashSet:

"This implementation spares its clients from the unspecified, generally chaotic ordering provided by HashSet"
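
A small sketch of why that matters here, assuming `badNegated` and `badRange` collect offending characters for error messages (the two quoted lines alone don't confirm that). The class `ErrorOrderSketch` and its `distinctInOrder` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: deduplicate reported items while keeping first-seen order.
final class ErrorOrderSketch {
    static List<String> distinctInOrder(List<String> reported) {
        // LinkedHashSet iterates in insertion order; re-adding an element does not move it.
        Set<String> seen = new LinkedHashSet<>(reported);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(distinctInOrder(List.of("z", "a", "z", "m"))); // [z, a, m]
    }
}
```

With a plain `HashSet`, the iteration order is unspecified, so an error message built from it could list the same characters in a different order than they appear in the regex.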

@Baltoli Baltoli left a comment

Very nice; seems like a really well thought through implementation!

@rv-jenkins rv-jenkins merged commit 94686d5 into develop May 22, 2024
17 checks passed
@rv-jenkins rv-jenkins deleted the regex-unicode branch May 22, 2024 01:58