-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regex
-based parser
#166
Comments
masklinn
added a commit
to masklinn/uap-python
that referenced
this issue
Jul 15, 2024
Uses ua-parser/uap-rust#3 Fixes ua-parser#166
Merged
masklinn
added a commit
to masklinn/uap-python
that referenced
this issue
Oct 8, 2024
Uses ua-parser/uap-rust#3 Fixes ua-parser#166
masklinn
added a commit
to masklinn/uap-python
that referenced
this issue
Oct 8, 2024
Uses ua-parser/uap-rust#3 Fixes ua-parser#166
masklinn
added a commit
to masklinn/uap-python
that referenced
this issue
Oct 8, 2024
Uses ua-parser/uap-rust#3 Fixes ua-parser#166
masklinn
added a commit
to masklinn/uap-python
that referenced
this issue
Oct 13, 2024
Uses ua-parser/uap-rust#3 - add an optional dependency on `ua-parser-rs` under the `regex` key - add a regex-based parser misc: - update the classifiers - bump required-python to 3.9 (3.8 is basically EOL) - update CI to better split up the steps - fix up the check for binary pyyaml: requirements_dev was removed in 81da21a, in May 2023, so this hasn't been working for 18 months - fix CLI script to correctly handle optional modules so it can run on pypy and graal, add regex, make tracemalloc optional as pypy doesn't support it (didn't check graal) - update tox: remove cpython 3.8, pypy 3.8 and 3.9 from tox (3.8's last supporting release was 7.3.11 in May 2023, 3.9's was 7.3.16 in April 2024), add graal Fixes ua-parser#166
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
implement filteredre2 on top of regexhttps://github.com/ua-parser/uap-rust/tree/main/regex-filteredimplement ua-parser in rusthttps://github.com/ua-parser/uap-rust/tree/main/ua-parserAfter testing
RegexSet
it's the same asre2::Set
(if not worse) and definitely unsuitable for the job (builds a giant DFA, which is ungodly slow when using a bunch of actual regexes).So we'd need to implement something like
FilteredRE2
, finally took a gander at how that works and turns out it's conceptually pretty simple: internally it's a bigVec<Regex>
, an extractor and mapper of the relevant literal atoms (PrefilterTree), and an gatekeeper (re2::Set
or an aho-corasick or sumthin, basically something which can efficiently check a large amount of literal strings against a needle and returns those which do match), the gatekeeper is initialised with the atoms, and its job is to return a set of matching atoms for the needle.FilteredRE2
then maps those atoms back to their regexes, and only tests those against the needle.Per rust-lang/regex#923 the
regex-syntax
crate can be used to extract the literal nubbins, however selecting the relevant atoms will require studying and translating re2's prefilter (and prefilter tree), as it doesn't look like the asker of the original question went on to do publicly available things which could be reused, and the process seems hairy.This complexity means we're not getting around to that any time soon, especially as I still haven't seen a way to do pure python fallbacks with maturin (cf PyO3/maturin#563, PyO3/maturin#1081, PyO3/maturin#1839, 35227), so this should probably be an independent package e.g.
uap-rs
with a published pyo3 package. A general-purposeregex
binding (à la google-re2) would not make much sense due to the need of implementing the filtering system unless that can get upstreamed (doesn't seem super likely).Opportunity
A specialised version might actually do better than
FilteredRE2
: functionally in there2
parser we have to apply the gatekeeper, then test all the potential regexes, then get the smallest index, then match the selected regex to extract the data.But with a specialised implementation, we can match the potential regexes, in-order, and return the first capture, avoiding a second pass. The selection cost might be a bit higher as it requires sorting the potential matches (rather than just taking the lowest actual match), and I assume even preparing for capture is more expensive than preparing for matching, but we'd have to test the difference on match failure (as on success test + match should logically be costlier than direct match).
Also obviously we might be able to use a more efficient gatekeeper, since aho-corasick exists and is already a dependency of
regex
.Similar idea to #149
Depends on #116Unresolved question
pyo3 or cffi?
pyo3 is almost certainly faster because it can do less call and the data extraction / conversion is done on the rust side, and much safer as the API is expressed in Rust
The major drawback is that pyo3 bindings only really work with cpython, pypy has cpyext but edit: probably not an option (or at least a good one) per PyPy Support via cpyext - What's needed? PyO3/pyo3#129
cffi has a much lower level API as everything has to transit through C, no rich type on either side, and requires providing python code, no cool conversions... but it works and should be well supported on pypy, otoh the cffi bit is the python side, the C side is... C, so usable from any language which can call C which could be helpful
uniffi seems like a new kid on the block, a multi language binding generator, but it's not clear that it supports pypy? no mention anywhere
edit: per the pyo3 docs: "uniffi wheels are compatible with all python versions including pypy."
hpy apparently also a new kid on the block, but seems quite slow going
The text was updated successfully, but these errors were encountered: