Skip to content

Commit

Permalink
Support Unicode-aware regex (#4365)
Browse files Browse the repository at this point in the history
Closes #4357

Convert K Unicode-based regex to an equivalent Flex byte-based regex:
- Outside of a character class, just parenthesize to keep the bytes
grouped
   - `r"😊*"` becomes `r"(\xF0\x9F\x98\x8A)*"`
- Inside a non-negated character class, factor any single Unicode
character out into an explicit `|`
    - `r"[a😊b]"` becomes `r"(\xF0\x9F\x98\x8A)|[ab]"`
- In all other cases (character ranges and negated character classes),
report an error if there are non-ASCII characters

Additionally,
- Check that character ranges `[c1-c2]` have `codepoint(c1) <=
codepoint(c2)`
- Check that numeric ranges `r{n,m}` have `n <= m`

The commit history is incremental, and I'd recommend reviewing
commit-by-commit.

---------

Co-authored-by: Bruce Collie <[email protected]>
  • Loading branch information
Scott-Guest and Baltoli authored May 22, 2024
1 parent e974730 commit 94686d5
Show file tree
Hide file tree
Showing 16 changed files with 540 additions and 160 deletions.
8 changes: 8 additions & 0 deletions k-distribution/tests/regression-new/checks/checkRegexRanges.k
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
// Copyright (c) Runtime Verification, Inc. All Rights Reserved.
module CHECKREGEXRANGES
syntax lexical Foo = r"a|[1-#]b"
syntax Bar ::= r"a|[1-#]b"

syntax lexical Baz = r"(a|b|c){100,1}"
syntax Buz ::= r"(a|b|c){100,1}"
endmodule
21 changes: 21 additions & 0 deletions k-distribution/tests/regression-new/checks/checkRegexRanges.k.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
[Error] Outer Parser: Invalid character range '1-#'. Start of range U+0031 is greater than end of range U+0023.
Source(checkRegexRanges.k)
Location(3,1,3,33)
3 | syntax lexical Foo = r"a|[1-#]b"
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Invalid character range '1-#'. Start of range U+0031 is greater than end of range U+0023.
Source(checkRegexRanges.k)
Location(4,16,4,27)
4 | syntax Bar ::= r"a|[1-#]b"
. ^~~~~~~~~~~
[Error] Outer Parser: Invalid numeric range '(a|(b|c)){100,1}'. Start of range 100 is greater than end of range 1.
Source(checkRegexRanges.k)
Location(6,1,6,39)
6 | syntax lexical Baz = r"(a|b|c){100,1}"
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Invalid numeric range '(a|(b|c)){100,1}'. Start of range 100 is greater than end of range 1.
Source(checkRegexRanges.k)
Location(7,16,7,33)
7 | syntax Buz ::= r"(a|b|c){100,1}"
. ^~~~~~~~~~~~~~~~~
[Error] Compiler: Had 4 parsing errors.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
// Copyright (c) Runtime Verification, Inc. All Rights Reserved.
module CHECKREGEXUNICODE
syntax lexical Foo = r"[^a😊][😦-ab-πŸ™]"
syntax Bar ::= r"[^a😊][😦-ab-πŸ™]" [token]
endmodule
31 changes: 31 additions & 0 deletions k-distribution/tests/regression-new/checks/checkRegexUnicode.k.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
[Error] Outer Parser: Invalid character range '😦-a'. Start of range U+1F626 is greater than end of range U+0061.
Source(checkRegexUnicode.k)
Location(3,1,3,41)
3 | syntax lexical Foo = r"[^a😊][😦-ab-πŸ™]"
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Unsupported non-ASCII characters found in character class range: [😦, πŸ™]
Source(checkRegexUnicode.k)
Location(3,1,3,41)
3 | syntax lexical Foo = r"[^a😊][😦-ab-πŸ™]"
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Unsupported non-ASCII characters found in negated character class: [😊]
Source(checkRegexUnicode.k)
Location(3,1,3,41)
3 | syntax lexical Foo = r"[^a😊][😦-ab-πŸ™]"
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Invalid character range '😦-a'. Start of range U+1F626 is greater than end of range U+0061.
Source(checkRegexUnicode.k)
Location(4,16,4,43)
4 | syntax Bar ::= r"[^a😊][😦-ab-πŸ™]" [token]
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Unsupported non-ASCII characters found in character class range: [😦, πŸ™]
Source(checkRegexUnicode.k)
Location(4,16,4,43)
4 | syntax Bar ::= r"[^a😊][😦-ab-πŸ™]" [token]
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Outer Parser: Unsupported non-ASCII characters found in negated character class: [😊]
Source(checkRegexUnicode.k)
Location(4,16,4,43)
4 | syntax Bar ::= r"[^a😊][😦-ab-πŸ™]" [token]
. ^~~~~~~~~~~~~~~~~~~~~~~~~~~
[Error] Compiler: Had 6 parsing errors.
1 change: 1 addition & 0 deletions k-distribution/tests/regression-new/regex-unicode/1.test
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
πŸ˜ŠπŸ™
3 changes: 3 additions & 0 deletions k-distribution/tests/regression-new/regex-unicode/1.test.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
<k>
😊😦😦 ~> .K
</k>
6 changes: 6 additions & 0 deletions k-distribution/tests/regression-new/regex-unicode/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
DEF=test
EXT=test
TESTDIR=.
KOMPILE_BACKEND=llvm

include ../../../include/kframework/ktest.mak
13 changes: 13 additions & 0 deletions k-distribution/tests/regression-new/regex-unicode/test.k
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
// Copyright (c) Runtime Verification, Inc. All Rights Reserved.
module TEST-SYNTAX
syntax lexical Emote = r"😊?[πŸ˜¦πŸ™]+"

syntax Emoji ::= r"{Emote}" [token]

endmodule

module TEST
imports TEST-SYNTAX
configuration <k> $PGM:Emoji </k>
rule <k>πŸ˜ŠπŸ™ => 😊😦😦</k>
endmodule

This file was deleted.

Loading

0 comments on commit 94686d5

Please sign in to comment.