Finite automaton conversion #230

Merged
merged 6 commits into from
Oct 29, 2024
7 changes: 5 additions & 2 deletions .github/workflows/ci.yml
@@ -93,6 +93,9 @@ jobs:
artifact: dist/*.tar.gz
- source: wheel
artifact: dist/*.whl
- opts: ""
- python-version: graalpy-24
opts: "--experimental-options --engine.CompileOnly='~tregex re'"
steps:
- name: Checkout working copy
uses: actions/checkout@v4
@@ -127,6 +130,6 @@ jobs:
name: ${{ matrix.source }}
path: dist/
- name: install package in environment
run: pip install ${{ matrix.artifact || '.' }}
run: python -m pip install ${{ matrix.artifact || '.' }}
- name: run tests
run: pytest -v -Werror -Wignore::ImportWarning --doctest-glob="*.rst" -ra
run: python ${{ matrix.opts }} -m pytest -v -Werror -Wignore::ImportWarning --doctest-glob="*.rst" -ra
15 changes: 9 additions & 6 deletions README.rst
@@ -30,17 +30,20 @@ Just add ``ua-parser`` to your project's dependencies, or run

to install in the current environment.

Installing `google-re2 <https://pypi.org/project/google-re2/>`_ is
*strongly* recommended as it leads to *significantly* better
performances. This can be done directly via the ``re2`` optional
dependency:
Installing `ua-parser-rs <https://pypi.org/project/ua-parser-rs>`_ or
`google-re2 <https://pypi.org/project/google-re2/>`_ is *strongly*
recommended as they yield *significantly* better performance. This
can be done directly via the ``regex`` and ``re2`` optional
dependencies respectively:

.. code-block:: sh

$ pip install 'ua_parser[regex]'
$ pip install 'ua_parser[re2]'

If ``re2`` is available, ``ua-parser`` will simply use it by default
instead of the pure-python resolver.
If either dependency is already available (e.g. because the software
makes use of re2 for other reasons) ``ua-parser`` will use the
corresponding resolver automatically.

Quick Start
-----------
13 changes: 13 additions & 0 deletions doc/api.rst
@@ -75,6 +75,19 @@ from user agent strings.

.. warning:: Only available if |re2|_ is installed.

.. class:: ua_parser.regex.Resolver(Matchers)

An advanced resolver based on |regex|_ and a bespoke implementation
of regex prefiltering, by the sibling project `uap-rust
<https://github.com/ua-parser/uap-rust>`_.

Sufficiently fast that a cache may not be necessary, and may even
be detrimental at smaller cache sizes.

.. warning:: Only available if `ua-parser-rs
<https://pypi.org/project/ua-parser-rs/>`_ is
installed.

Eager Matchers
''''''''''''''

97 changes: 97 additions & 0 deletions doc/guides.rst
@@ -129,6 +129,103 @@ from here on::
:class:`~ua_parser.caching.Local`, which is also caching-related,
and serves to use thread-local caches rather than a shared cache.

Builtin Resolvers
=================

.. list-table::
:header-rows: 1
:stub-columns: 1

* -
- speed
- portability
- memory use
- safety
* - ``regex``
- great
- good
- bad
- great
* - ``re2``
- good
- bad
- good
- good
* - ``basic``
- terrible
- great
- great
- great

``regex``
---------

The ``regex`` resolver is a bespoke effort as part of the `uap-rust
<https://github.com/ua-parser/uap-rust>`_ sibling project, built on
`rust-regex <https://github.com/rust-lang/regex>`_ and `a bespoke
regex-prefiltering implementation
<https://github.com/ua-parser/uap-rust/tree/main/regex-filtered>`_.
It:

- Is the fastest available resolver, usually edging out ``re2`` by a
significant margin (when that is even available).
- Is fully controlled by the project, and thus can be built for all
interpreters and platforms supported by pyo3 (currently: cpython,
pypy, and graalpy, on linux, macos and windows, intel and arm). It is
also built as a cpython abi3 wheel and should thus suffer from no
compatibility issues with new releases.
- Is built entirely out of safe rust code, so its safety risks are
entirely in ``regex`` and ``pyo3``.
- Is a lot more memory intensive than the other resolvers, which is
its biggest drawback, because ``regex`` tends to trade memory for
speed (~155MB high water mark on a real-world dataset).

If available, it is the default resolver, without a cache.
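
The prefiltering idea mentioned above can be illustrated with a minimal sketch. It is not the uap-rust implementation: the rules, the literal "atoms", and the ``resolve_prefiltered`` helper are all hypothetical, and a real prefilter would typically derive the literals from the patterns and look for all of them in a single pass. The sketch only shows why a cheap substring test lets most regexes be skipped.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical rules (not from uap-core): each pairs a regex with a literal
# "atom" that any matching input must contain.
RULES: List[Tuple[str, str, Pattern[str]]] = [
    ("Firefox/", "Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome/", "Chrome", re.compile(r"Chrome/(\d+)")),
    ("Safari/", "Safari", re.compile(r"Version/(\d+).*Safari/")),
]


def resolve_prefiltered(ua: str) -> Tuple[Optional[Tuple[str, str]], int]:
    """Return (result, number of regexes actually executed)."""
    executed = 0
    for atom, family, pattern in RULES:
        if atom not in ua:  # cheap substring check, the regex is never run
            continue
        executed += 1
        m = pattern.search(ua)
        if m:
            return (family, m[1]), executed
    return None, executed


result, executed = resolve_prefiltered(
    "Mozilla/5.0 (Macintosh) Version/17 Safari/605.1.15"
)
print(result, executed)  # ('Safari', '17') 1
```

Only one of the three regexes runs for this input; with thousands of rules the saving is what makes an uncached resolver viable.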

``re2``
-------

The ``re2`` resolver is built atop the widely used `google-re2
<https://github.com/google/re2>`_ via its built-in Python bindings.
It:

- Is extremely fast, though around 80% slower than ``regex`` on
real-world data.
- Is only compatible with CPython, and uses pure API wheels, so needs
a different release for each cpython version, for each OS, for each
architecture.
- Is built entirely in C++, but by experienced Google developers.
- Is more memory intensive than the pure-python ``basic`` resolver,
but quite slim all things considered (~55MB high water mark on a
real-world dataset).

If available, it is the second-preferred resolver, without a cache.

``basic``
---------

The ``basic`` resolver is a naive linear traversal of all rules, using
the standard library's ``re``. It:

- Is *extremely* slow, about 10x slower than ``re2`` in cpython, and
pypy and graal's regex implementations do *not* like the workload
and are behind cpython by a factor of 3~4.
- Has perfect compatibility, with the caveat above, by virtue of being
built entirely out of standard library code.
- Is basically as safe as Python software can be by virtue of being
just Python, with the native code being the standard library's.
- Is the slimmest resolver at about 40MB.

This is caveated by a hard requirement to use caches, which makes it
workably faster on real-world datasets (if still nowhere near
*uncached* ``re2`` or ``regex``) but increases its memory requirement
significantly: using "sieve" and a cache size of 20000 on a
real-world dataset, it is about 4x slower than ``re2`` for about the
same memory requirements.

It is the fallback and least preferred resolver, with a medium
(currently 2000 entries) cache by default.
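
The trade-off described above, a slow linear scan made workable by a cache, can be sketched as follows. This is an illustration only: the rules are hypothetical, and ``functools.lru_cache`` is a stand-in (an assumption) for the library's own cache classes, which are not reproduced here.

```python
import re
from functools import lru_cache
from typing import Optional, Tuple

# Hypothetical rules, tried in order.
RULES = [
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]


def resolve(ua: str) -> Optional[Tuple[str, str]]:
    """Linear traversal: run each regex in turn until one matches."""
    for family, pattern in RULES:
        m = pattern.search(ua)
        if m:
            return family, m[1]
    return None


# Stand-in cache (assumption): repeated user agents skip the traversal,
# at the cost of keeping up to 2000 results in memory.
cached_resolve = lru_cache(maxsize=2000)(resolve)

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0"
first = cached_resolve(ua)   # miss: the rules are traversed
second = cached_resolve(ua)  # hit: answered from the cache
info = cached_resolve.cache_info()
```

The cache only helps when the same user agent strings recur, which is why the memory cost grows with the cache size needed to reach a useful hit rate.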

Writing Custom Resolvers
========================

6 changes: 6 additions & 0 deletions doc/installation.rst
@@ -35,3 +35,9 @@ if installed, but can also be installed via and alongside ua-parser:
$ pip install 'ua-parser[yaml]'
$ pip install 'ua-parser[regex,yaml]'

``yaml`` simply enables the ability to :func:`load yaml rulesets
<ua_parser.loaders.load_yaml>`.

The other two dependencies enable more efficient resolvers. By
default, ``ua-parser`` will select the fastest resolver it finds out
of the available set. For more, see :ref:`builtin resolvers`.
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -86,7 +86,7 @@ warn_redundant_casts = true

# these can be overridden (maybe?)
strict_equality = true
strict_concatenate = true
# strict_concatenate = true
check_untyped_defs = true
disallow_subclassing_any = true
disallow_untyped_decorators = true
@@ -110,6 +110,7 @@ module = [
"test_core",
"test_caches",
"test_parsers_basics",
"test_fa_simplifier",
]

#check_untyped_defs = false
36 changes: 22 additions & 14 deletions src/ua_parser/__init__.py
@@ -57,10 +57,25 @@
UserAgent,
)
from .loaders import load_builtins, load_lazy_builtins
from .utils import IS_GRAAL

Re2Resolver: Optional[Callable[[Matchers], Resolver]] = None
_ResolverCtor = Callable[[Matchers], Resolver]
Re2Resolver: Optional[_ResolverCtor] = None
if importlib.util.find_spec("re2"):
from .re2 import Resolver as Re2Resolver
RegexResolver: Optional[_ResolverCtor] = None
if importlib.util.find_spec("ua_parser_rs"):
from .regex import Resolver as RegexResolver
BestAvailableResolver: _ResolverCtor = next(
filter(
None,
(
RegexResolver,
Re2Resolver,
lambda m: CachingResolver(BasicResolver(m), Cache(2000)),
),
)
)


VERSION = (1, 0, 0)
@@ -81,15 +96,7 @@ def from_matchers(cls, m: Matchers, /) -> Parser:
stack.

"""
if Re2Resolver is not None:
return cls(Re2Resolver(m))
else:
return cls(
CachingResolver(
BasicResolver(m),
Cache(200),
)
)
return cls(BestAvailableResolver(m))

def __init__(self, resolver: Resolver) -> None:
self.resolver = resolver
@@ -132,10 +139,11 @@ def parse_device(self: Resolver, ua: str) -> Optional[Device]:
def __getattr__(name: str) -> Parser:
global parser
if name == "parser":
parser = Parser.from_matchers(
load_builtins() if Re2Resolver is None else load_lazy_builtins()
)
return parser
if RegexResolver or Re2Resolver or IS_GRAAL:
matchers = load_lazy_builtins()
else:
matchers = load_builtins()
# assign the module global so the parser is built once, not on every access
parser = Parser.from_matchers(matchers)
return parser
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
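
The selection of ``BestAvailableResolver`` in this file relies on ``next(filter(None, ...))``, which returns the first truthy item of a sequence. A minimal sketch with stand-in constructors (the names ``fast``, ``medium``, ``fallback`` and ``pick_first_available`` are hypothetical and only mimic optional dependencies being present or absent):

```python
from typing import Callable, Optional

Ctor = Callable[[str], str]


def pick_first_available(*candidates: Optional[Ctor]) -> Ctor:
    """Return the first candidate that is not None (more precisely, truthy)."""
    return next(filter(None, candidates))


# Pretend the fastest backend's optional dependency is not installed.
fast: Optional[Ctor] = None
medium: Optional[Ctor] = lambda m: f"medium({m})"
fallback: Ctor = lambda m: f"fallback({m})"

best = pick_first_available(fast, medium, fallback)
print(best("rules"))  # medium(rules)
```

Because the last candidate is always a real callable, ``next`` can never raise ``StopIteration`` here, so no default value is needed.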


23 changes: 22 additions & 1 deletion src/ua_parser/basic.py
@@ -1,7 +1,9 @@
__all__ = ["Resolver"]

import re
from itertools import chain
from operator import methodcaller
from typing import List
from typing import Any, List

from .core import (
Device,
@@ -12,6 +14,7 @@
PartialResult,
UserAgent,
)
from .utils import IS_GRAAL, fa_simplifier


class Resolver:
@@ -30,6 +33,24 @@ def __init__(
matchers: Matchers,
) -> None:
self.user_agent_matchers, self.os_matchers, self.device_matchers = matchers
if IS_GRAAL:
matcher: Any
kind = next(
(
"eager" if hasattr(type(m), "regex") else "lazy"
for m in chain.from_iterable(matchers)
),
None,
)
if kind == "eager":
for matcher in chain.from_iterable(matchers):
matcher.pattern = re.compile(
fa_simplifier(matcher.pattern.pattern),
flags=matcher.pattern.flags,
)
elif kind == "lazy":
for matcher in chain.from_iterable(matchers):
matcher.regex = fa_simplifier(matcher.pattern.pattern)

def __call__(self, ua: str, domains: Domain, /) -> PartialResult:
parse = methodcaller("__call__", ua)
9 changes: 5 additions & 4 deletions src/ua_parser/re2.py
@@ -14,6 +14,7 @@
PartialResult,
UserAgent,
)
from .utils import fa_simplifier


class DummyFilter:
@@ -38,15 +39,15 @@ def __init__(
if self.user_agent_matchers:
self.ua = re2.Filter()
for u in self.user_agent_matchers:
self.ua.Add(u.regex)
self.ua.Add(fa_simplifier(u.regex))
self.ua.Compile()
else:
self.ua = DummyFilter()

if self.os_matchers:
self.os = re2.Filter()
for o in self.os_matchers:
self.os.Add(o.regex)
self.os.Add(fa_simplifier(o.regex))
self.os.Compile()
else:
self.os = DummyFilter()
@@ -58,9 +59,9 @@ def __init__(
# no pattern uses global flags, but since they're not
# supported in JS that seems safe.
if d.flags & re.IGNORECASE:
self.devices.Add("(?i)" + d.regex)
self.devices.Add("(?i)" + fa_simplifier(d.regex))
else:
self.devices.Add(d.regex)
self.devices.Add(fa_simplifier(d.regex))
self.devices.Compile()
else:
self.devices = DummyFilter()
33 changes: 33 additions & 0 deletions src/ua_parser/utils.py
@@ -1,6 +1,9 @@
import platform
import re
from typing import Match, Optional

IS_GRAAL: bool = platform.python_implementation() == "GraalVM"


def get(m: Match[str], idx: int) -> Optional[str]:
return (m[idx] or None) if 0 < idx <= m.re.groups else None
@@ -28,3 +31,33 @@ def replacer(repl: str, m: Match[str]) -> Optional[str]:
return None

return re.sub(r"\$(\d)", lambda n: get(m, int(n[1])) or "", repl).strip() or None


REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
CLASS_PATTERN = re.compile(
r"""
\[[^]]*\\(d|w)[^]]*\]
|
\\(d|w)
""",
re.VERBOSE,
)


def class_replacer(m: Match[str]) -> str:
d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
return m[0].replace(r"\d", d).replace(r"\w", w)


def fa_simplifier(pattern: str) -> str:
"""uap-core makes significant use of large bounded repetitions, to
mitigate catastrophic backtracking.

However this explodes the number of states (and thus graph size)
for finite automaton engines, which significantly increases their
memory use, and for those which use JITs it can exceed the JIT
threshold and force fallback to a slower engine (seems to be the
case for graal's TRegex).
"""
pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
return CLASS_PATTERN.sub(class_replacer, pattern)
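
To make the effect of ``fa_simplifier`` concrete, the sketch below repeats the two patterns from this file in a standalone script and applies them to a made-up pattern (the sample pattern and input are illustrative assumptions, not taken from uap-core). Note that the rewrite is not strictly equivalent: an unbounded ``*`` or ``+`` accepts longer runs than the bounded original, and in ``str`` patterns Python's ``\d`` and ``\w`` also match non-ASCII digits and word characters, so the ASCII classes are slightly narrower.

```python
import re

REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
CLASS_PATTERN = re.compile(
    r"""
    \[[^]]*\\(d|w)[^]]*\]
    |
    \\(d|w)
    """,
    re.VERBOSE,
)


def class_replacer(m):
    # Inside a bracketed class only the range is substituted; a bare escape
    # is replaced by a complete bracketed class.
    d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
    return m[0].replace(r"\d", d).replace(r"\w", w)


def fa_simplifier(pattern):
    # {0,NNN} becomes *, {1,NNN} becomes + (three or more digits only)
    pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
    return CLASS_PATTERN.sub(class_replacer, pattern)


original = r"(\w{0,100})/(\d+)\.(\d{1,200})"  # made-up sample pattern
simplified = fa_simplifier(original)
print(simplified)  # ([A-Za-z0-9_]*)/([0-9]+)\.([0-9]+)
```

Small bounds such as ``{1,10}`` are left alone, since the repetition pattern only matches upper bounds of three digits or more.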
15 changes: 15 additions & 0 deletions tests/test_fa_simplifier.py
@@ -0,0 +1,15 @@
import pytest # type: ignore

from ua_parser.utils import fa_simplifier


@pytest.mark.parametrize(
("from_", "to"),
[
(r"\d", "[0-9]"),
(r"[\d]", "[0-9]"),
(r"[\d\.]", r"[0-9\.]"),
],
)
def test_classes(from_, to):
assert fa_simplifier(from_) == to