forked from ua-parser/uap-python
Add finite-automaton simplifier, for re2 and graal
As I discovered a while ago, finite-automaton engines are not very fond of large bounded repetitions.

In re2 and regex, that mostly translates to increased memory consumption: in their default modes, converting `.*` to `.{0,500}` increases the pattern's size by 115x in re2 and 84x in regex, and if a capture is added on top then regex balloons to 219x. There is a performance impact, but it's high single digits to low double digits, in regex at least (didn't test re2).

However, as it turns out, Graal uses a JIT-ed DFA, and it *really* doesn't like these patterns: it spends a lot of time JIT-compiling them with no gain in performance (this is apparently the source of the extra 300% CPU use I could observe on what are purely single-threaded workloads: the JIT desperately trying to optimise regexes). Down-converting the regexes back to the sensible form improves performance by ~25%, though it doesn't seem to impact memory use...

So... do that: `fa_simplifier` is the same idea as ua-parser/uap-rust@29b9195 but from the Python side, and applied to graal and re2 (not regex, because it does that internally, as linked above).

Also switch Graal over to the lazy builtins. It kinda spreads the cost, but it seems stupid to compile the regexes only to immediately swap them (`fa_simplifier`) and recompile them... so don't do that, especially as I couldn't be arsed to make the replacement conditional (so every eager regex would be recompiled, even though only those which actually got modified by `fa_simplifier` need it).

Fixes ua-parser#228
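For readers who want a feel for the "down-conversion" described above, here is a minimal sketch in Python. It is not the `fa_simplifier` code from this commit: the function name `relax_bounded_repetitions`, the `LARGE_BOUND` threshold, and the choice to only touch `{0,N}` and `{1,N}` are all assumptions made for illustration. Note that the rewrite widens what the pattern accepts (`.{0,500}` becomes `.*`), so it only makes sense if the bound was added as a safety limit rather than as a real constraint.

```python
import re

# Hypothetical threshold: upper bounds at or above this count as "large".
LARGE_BOUND = 100

# Matches a quantifier of the form {0,N} or {1,N}.
_BOUNDED = re.compile(r"\{([01]),(\d+)\}")


def relax_bounded_repetitions(pattern: str, threshold: int = LARGE_BOUND) -> str:
    """Rewrite large {0,N} / {1,N} quantifiers to * / + (illustrative sketch).

    Escaped characters and the contents of character classes are copied
    through untouched. Corner cases such as a literal "]" placed first in
    a class are not handled.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        # Copy escape sequences verbatim so "\{" is never read as a quantifier.
        if c == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if c == "]":
                in_class = False
        elif c == "[":
            in_class = True
        elif c == "{":
            m = _BOUNDED.match(pattern, i)
            if m and int(m.group(2)) >= threshold:
                out.append("*" if m.group(1) == "0" else "+")
                i = m.end()
                continue
        out.append(c)
        i += 1
    return "".join(out)
```

Small bounds such as `\d{0,3}` are left alone in this sketch, since they are cheap for automaton engines and usually carry real meaning.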
Showing 6 changed files with 82 additions and 9 deletions.
@@ -0,0 +1,15 @@
+import pytest  # type: ignore
+
+from ua_parser.utils import fa_simplifier
+
+
+@pytest.mark.parametrize(
+    ("from_", "to"),
+    [
+        (r"\d", "[0-9]"),
+        (r"[\d]", "[0-9]"),
+        (r"[\d\.]", r"[0-9\.]"),
+    ],
+)
+def test_classes(from_, to):
+    assert fa_simplifier(from_) == to
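
The three cases in `test_classes` show that `\d` has to be expanded differently depending on whether it sits inside a character class. The following is a rough sketch of one way to get that behaviour; `expand_digit_class` is a hypothetical name and this is not the implementation added by the commit, only a small scanner that happens to satisfy the same three cases.

```python
def expand_digit_class(pattern: str) -> str:
    """Replace \\d with an explicit ASCII range (illustrative sketch).

    Outside a character class, \\d becomes "[0-9]"; inside one it becomes
    "0-9" so the enclosing brackets are reused. Other escapes are copied
    through unchanged.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "d":
                out.append("0-9" if in_class else "[0-9]")
            else:
                # Any other escape (including an escaped backslash) is kept as-is.
                out.append(c + nxt)
            i += 2
            continue
        if c == "[" and not in_class:
            in_class = True
        elif c == "]" and in_class:
            in_class = False
        out.append(c)
        i += 1
    return "".join(out)
```

Tracking `in_class` with a single flag is enough here because the escape branch consumes `\[` and `\]` before the bracket checks run.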