Typed API & parsers API
New API with full typing
========================

Seems pretty self-explanatory: rather than returning somewhat ad-hoc
dicts, this API works off of dataclasses. It should be compatible with
the legacy version through the magic of ~~buying two of them~~
`dataclasses.asdict`.
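
As a rough illustration of that compatibility path, here is a minimal sketch. The `UserAgent` class below is a hypothetical stand-in, not the package's own definition: the field names follow the README output in this commit, while the defaults are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical stand-in for a typed result. Field names follow the README
# example in this commit; the default values are assumptions.
@dataclass(frozen=True)
class UserAgent:
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None
    patch_minor: Optional[str] = None


ua = UserAgent(family="Chrome", major="41", minor="0", patch="2272", patch_minor="104")

# asdict() recursively converts the dataclass into a plain dict, which is
# the shape the legacy API handed back.
legacy = asdict(ua)
```

Code written against the legacy dicts could then keep indexing by key (`legacy["family"]`) while new code uses attributes (`ua.family`).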

Parser API
==========

The legacy version had "parsers" which really represent individual
parsing rules. In the new API the job of a parser is what the
top-level functions did: it wraps the entire job of parsing a
user-agent string.

The core API is just `__call__`, with a selection flag for the domains
(seems like the least bad term for what "user agent", "os", and
"device" are, other alternatives I considered are "component" and
"category", but I'm still ambivalent). Overridable helpers are
provided which match the old API's methods (with PEP8 conventions), as
well as the same style of helpers at the package toplevel.
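
The shape described above could look roughly like the sketch below. Everything in it is hypothetical (the `Domain` flag, the `Result` dataclass, `ToyParser`, and the substring checks standing in for real rules); it only shows a `__call__` core with a selection flag and helpers layered on top.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Domain(Flag):
    """Hypothetical selection flag; the real name may differ."""

    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()
    ALL = USER_AGENT | OS | DEVICE


@dataclass(frozen=True)
class Result:
    """Hypothetical result: records what was requested and what was found."""

    domains: Domain
    user_agent: Optional[str] = None
    os: Optional[str] = None
    device: Optional[str] = None


class ToyParser:
    def __call__(self, ua: str, domains: Domain, /) -> Result:
        # Only compute what was requested; substring checks stand in
        # for the real rule matching.
        user_agent = os = None
        if Domain.USER_AGENT in domains and "Chrome/" in ua:
            user_agent = "Chrome"
        if Domain.OS in domains and "Mac OS X" in ua:
            os = "Mac OS X"
        return Result(domains, user_agent, os)

    # Overridable helpers, all implemented in terms of __call__.
    def parse(self, ua: str) -> Result:
        return self(ua, Domain.ALL)

    def parse_user_agent(self, ua: str) -> Optional[str]:
        return self(ua, Domain.USER_AGENT).user_agent

    def parse_os(self, ua: str) -> Optional[str]:
        return self(ua, Domain.OS).os
```

A subclass only has to override `__call__`; the helpers keep working unchanged.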

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to
test that) the ability to instantiate parsers should provide the
opportunity for things like thread-local parsers, or actual
parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently,
including e.g. multiple independent custom yaml sets. Not sure there's
a use for it, but why not?

At the very least it should make using custom YAML datasets much
easier than having to set envvars.
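
For instance, thread-local parsers could be set up along these lines. This is only a sketch using the standard library's `threading.local`; `ToyParser` and `get_parser` are hypothetical names, not part of the package.

```python
import threading
from typing import Optional


class ToyParser:
    """Hypothetical parser; a substring check stands in for real rules."""

    def __call__(self, ua: str) -> Optional[str]:
        return "Chrome" if "Chrome/" in ua else None


_local = threading.local()


def get_parser() -> ToyParser:
    # Each thread lazily creates and then reuses its own parser instance,
    # so no parser state is ever shared between threads.
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = _local.parser = ToyParser()
    return parser
```

Each thread would call `get_parser()` instead of reaching for a module-level global.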

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to
set the global parser. Hopefully this makes evaluating proposed
parsers as well as evaluating & tuning caches (algorithm & size)
easier. Even more so as we should provide some sort of evaluation CLI
in #163.

Caches
------

In the old API, a package-provided cache could only be global and with
a single implementation, as it had to integrate with the toplevel
parsing functions. By reifying the parsing job, a cache is just a
parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative
cache strategies.
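
A minimal sketch of "a cache is just a parser", assuming parsers are plain callables. `base_parser` and `ClearingCache` are hypothetical, and the eviction policy (drop everything when full) is chosen only for brevity.

```python
from typing import Callable, Dict, Optional

Parser = Callable[[str], Optional[str]]


def base_parser(ua: str) -> Optional[str]:
    """Hypothetical base parser; a substring check stands in for real rules."""
    return "Chrome" if "Chrome/" in ua else None


class ClearingCache:
    """A parser which wraps another parser and only delegates on a miss."""

    def __init__(self, inner: Parser, maxsize: int = 128) -> None:
        self.inner = inner
        self.maxsize = maxsize
        self.misses = 0
        self._cache: Dict[str, Optional[str]] = {}

    def __call__(self, ua: str) -> Optional[str]:
        if ua in self._cache:
            return self._cache[ua]
        self.misses += 1
        # Simplest possible eviction: drop everything once the cache is full.
        if len(self._cache) >= self.maxsize:
            self._cache.clear()
        result = self._cache[ua] = self.inner(ua)
        return result


parser = ClearingCache(base_parser, maxsize=2)
```

Since the cache has the same call signature as the parser it wraps, swapping in a different strategy (LRU, for example) would not affect callers.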

Bulk APIs
---------

The current parser checks rules (regexes) one at a time on the input,
but there are advanced regex APIs which can check a regex *set* and
return which one(s) matched, allowing much more efficient bulk
matching, e.g. Google's re2 or Rust's regex.

With the old scheme, this would be a pretty significant change in use
/ behaviour, obviating the use of the "parsers" with no
recourse. Under the new parsing scheme, these can just be different
"base" parsers, they can be the default, they can be cached, and users
can instantiate their own parser instead.
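
To make the idea concrete, here is a sketch of how a set-based "base" parser might consume such an API. `NaiveRegexSet` is a hypothetical stand-in written with the standard `re` module: it still loops over every rule, so it only mimics the *interface* of a set matcher and has none of the performance benefits of a native one.

```python
import re
from typing import List, Optional, Sequence, Tuple


class NaiveRegexSet:
    """Stand-in for a native regex set: reports the index of every matching rule."""

    def __init__(self, patterns: Sequence[str]) -> None:
        self._compiled = [re.compile(p) for p in patterns]

    def matches(self, text: str) -> List[int]:
        # A native set matcher would answer this in a single pass over the text.
        return [i for i, rx in enumerate(self._compiled) if rx.search(text)]


# Hypothetical rules as (regex, family) pairs, in priority order.
RULES: List[Tuple[str, str]] = [
    (r"Firefox/(\d+)", "Firefox"),
    (r"Chrome/(\d+)", "Chrome"),
    (r"Safari/(\d+)", "Safari"),
]
rule_set = NaiveRegexSet([pattern for pattern, _ in RULES])


def first_family(ua: str) -> Optional[str]:
    # Preserve "first matching rule wins" by taking the lowest matching index.
    hits = rule_set.matches(ua)
    return RULES[min(hits)][1] if hits else None
```

Taking the lowest index keeps the usual first-match-wins semantics even though the set reports every rule that matched.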

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though
that requires excluding that bit from the tests as there are
apparently broken test cases around that
item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is the opportunity to allow setting parsers at
runtime more easily (instead of load-time envvars), however optional
constructors (classmethods) turn out to be iffy from both an API and a
typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of
the UAs) just take a uniform parsed data set, and have utility loaders
provide that from various data sources (precompiled, preformatted, or
data files). This avoids redundancy and the need for mixins /
inheritance, and mypy is *much* happier.
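
The split between base parsers and loaders might be sketched as follows. All names here (`Rule`, `RuleSet`, `load_data`, `load_json`, `BaseParser`) and the simplified rule layout are hypothetical, and JSON stands in for the YAML data files so the example only needs the standard library.

```python
import json
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # hypothetical: (regex, family replacement)
RuleSet = Tuple[List[Rule], List[Rule]]  # (user agent rules, OS rules)


def load_data(d: Dict[str, List[Dict[str, str]]]) -> RuleSet:
    """Build the uniform rule set from already-parsed data."""
    return (
        [(r["regex"], r["family"]) for r in d.get("user_agent_parsers", [])],
        [(r["regex"], r["family"]) for r in d.get("os_parsers", [])],
    )


def load_json(text: str) -> RuleSet:
    """Build the same rule set from serialized data."""
    return load_data(json.loads(text))


class BaseParser:
    # A single constructor taking the uniform data set: no alternative
    # constructors, mixins, or inheritance needed.
    def __init__(self, rules: RuleSet) -> None:
        self.ua_rules, self.os_rules = rules
```

Every data source ends up producing the same `RuleSet` value, so the parser class itself never needs to know where the rules came from.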

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to
be pretty mid.

Instead, the new API relies on similar but better typed matcher
classes, with a slightly different API: they return `None` on a match
failure instead of a triplet, which makes them compose better in
iteration (e.g. can just `filter` them out).
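
A small sketch of why `None` composes well, using a hypothetical `FamilyMatcher` much simpler than the real matcher classes:

```python
import re
from typing import List, Optional


class FamilyMatcher:
    """Hypothetical matcher: returns a family name, or None on match failure."""

    def __init__(self, regex: str, family: str) -> None:
        self.regex = re.compile(regex)
        self.family = family

    def __call__(self, ua: str) -> Optional[str]:
        return self.family if self.regex.search(ua) else None


matchers: List[FamilyMatcher] = [
    FamilyMatcher(r"Firefox/\d+", "Firefox"),
    FamilyMatcher(r"Chrome/\d+", "Chrome"),
]


def first_match(ua: str) -> Optional[str]:
    # filter(None, ...) drops the failed (None) results; next() then takes
    # the first success lazily, falling back to None if there is none.
    return next(filter(None, (m(ua) for m in matchers)), None)
```

Because both the generator and `filter` are lazy, matchers after the first success are never evaluated.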

Add a `Matchers` alias to carry them around (a tuple of lists of
matchers) for convenience, as well as a base parser parameter.

Also clarify the replacer rules, and hopefully implement the thing
more clearly.

Fixes #93, fixes #142, closes #116
masklinn committed Nov 3, 2023
1 parent e9483d8 commit 7f90746
Showing 15 changed files with 1,220 additions and 98 deletions.
158 changes: 83 additions & 75 deletions README.rst
@@ -1,119 +1,127 @@
uap-python
==========

A python implementation of the UA Parser (https://github.com/ua-parser,
formerly https://github.com/tobie/ua-parser)
Official python implementation of the `User Agent String
Parser <https://github.com/ua-parser>`_ project.

Build Status
------------

.. image:: https://github.com/ua-parser/uap-python/actions/workflows/ci.yml/badge.svg
:alt: CI on the master branch


Installing
----------

Install via pip
~~~~~~~~~~~~~~~

Just run:
Just add ``ua-parser`` to your project's dependencies, or run

.. code-block:: sh
$ pip install ua-parser
Manual install
~~~~~~~~~~~~~~

In the top-level directory run:

.. code-block:: sh
$ python setup.py install
Change Log
---------------
Because this repo is mostly a python wrapper for the User Agent String Parser repo (https://github.com/ua-parser/uap-core), the changes made to this repo are best described by the update diffs in that project. Please see the diffs for this submodule (https://github.com/ua-parser/uap-core/releases) for a list of what has changed between versions of this package.
to install in the current environment.

Getting Started
---------------

Retrieve data on a user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Retrieve all data on a user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
>>> from ua_parser import user_agent_parser
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> from ua_parser import parse
>>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
>>> parsed_string = user_agent_parser.Parse(ua_string)
>>> pp.pprint(parsed_string)
{ 'device': {'brand': 'Apple', 'family': 'Mac', 'model': 'Mac'},
'os': { 'family': 'Mac OS X',
'major': '10',
'minor': '9',
'patch': '4',
'patch_minor': None},
'string': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 '
'Safari/537.36',
'user_agent': { 'family': 'Chrome',
'major': '41',
'minor': '0',
'patch': '2272'}}
Extract browser data from user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> parse(ua_string) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
ParseResult(user_agent=UserAgent(family='Chrome',
major='41',
minor='0',
patch='2272',
patch_minor='104'),
os=OS(family='Mac OS X',
major='10',
minor='9',
patch='4',
patch_minor=None),
device=Device(family='Mac',
brand='Apple',
model='Mac'),
string='Mozilla/5.0 (Macintosh; Intel Mac OS...
Any datum not found in the user agent string is set to ``None``::
>>> parse("")
ParseResult(user_agent=None, os=None, device=None, string='')
Extract only browser data from user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
>>> from ua_parser import user_agent_parser
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> from ua_parser import parse_user_agent
>>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
>>> parsed_string = user_agent_parser.ParseUserAgent(ua_string)
>>> pp.pprint(parsed_string)
{'family': 'Chrome', 'major': '41', 'minor': '0', 'patch': '2272'}
>>> parse_user_agent(ua_string)
UserAgent(family='Chrome', major='41', minor='0', patch='2272', patch_minor='104')
..
For specific domains, a match failure just returns ``None``::
⚠️Before 0.15, the convenience parsers (``ParseUserAgent``,
``ParseOs``, and ``ParseDevice``) were not cached, which could
result in degraded performances when parsing large amounts of
identical user-agents (which might occur for real-world datasets).

For these versions (up to 0.10 included), prefer using ``Parse``
and extracting the sub-component you need from the resulting
dictionary.
>>> parse_user_agent("")
Extract OS information from user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
>>> from ua_parser import user_agent_parser
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> from ua_parser import parse_os
>>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
>>> parsed_string = user_agent_parser.ParseOS(ua_string)
>>> pp.pprint(parsed_string)
{ 'family': 'Mac OS X',
'major': '10',
'minor': '9',
'patch': '4',
'patch_minor': None}
Extract Device information from user-agent string
>>> parse_os(ua_string)
OS(family='Mac OS X', major='10', minor='9', patch='4', patch_minor=None)
Extract device information from user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
>>> from ua_parser import user_agent_parser
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> from ua_parser import parse_device
>>> ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
>>> parsed_string = user_agent_parser.ParseDevice(ua_string)
>>> pp.pprint(parsed_string)
{'brand': 'Apple', 'family': 'Mac', 'model': 'Mac'}
>>> parse_device(ua_string)
Device(family='Mac', brand='Apple', model='Mac')
Parser
~~~~~~
Parsers expose the same functions (``parse``, ``parse_user_agent``,
``parse_os``, and ``parse_device``) as the top-level of the package,
however these are all *utility* methods.
The actual protocol of parsers, and the one method which must be
implemented / overridden is::
def __call__(self, str, Components, /) -> ParseResult:
It's similar to but more flexible than ``parse``:
- The ``str`` is the user agent string.
- The ``Components`` is a hint, through which the caller requests the
domain (component) they are looking for, any combination of
``Components.USER_AGENT``, ``Components.OS``, and
``Components.DEVICE``. ``Domains.ALL`` exists as a convenience alias
for the combination of all three.
The parser *must* return at least the requested information, but if
that's more convenient or no more expensive it *can* return more.
- The ``ParseResult`` is similar to ``CompleteParseResult``, except
all the attributes are ``Optional`` and it has a ``components:
Components`` attribute which specifies whether a component was never
requested (its value for the user agent string is unknown) or it has
been requested but could not be resolved (no match was found for the
user agent).
``ParseResult.complete()`` converts to a ``CompleteParseResult`` if
all the components are set, and raises an exception otherwise. If
some of the components are set to ``None``, they'll be swapped for a
default value.
Calling the parser directly is part of the public API. One advantage
is that it does not return default values, so it is easier to
differentiate between a non-match (= ``None``) and a
default fallback (``family = "Other"``).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "ua-parser"
description = "Python port of Browserscope's user agent parser"
version = "1.0.0a"
version = "1.0.0a1"
readme = "README.rst"
requires-python = ">=3.8"
dependencies = []
77 changes: 61 additions & 16 deletions setup.py
@@ -1,5 +1,6 @@
#!/usr/bin/env python
# flake8: noqa
import io
from contextlib import suppress
from os import fspath
from pathlib import Path
@@ -51,6 +52,13 @@ def run(self) -> None:
f"Unable to find regexes.yaml, should be at {yaml_src!r}"
)

def write_matcher(f, typ: str, fields: List[Optional[object]]):
f.write(f" {typ}(".encode())
while len(fields) > 1 and fields[-1] is None:
fields = fields[:-1]
f.write(", ".join(map(repr, fields)).encode())
f.write(b"),\n")

def write_params(fields):
# strip trailing None values
while len(fields) > 1 and fields[-1] is None:
@@ -70,10 +78,20 @@ def write_params(fields):
outdir = dist_dir / self.pkg_name
outdir.mkdir(parents=True, exist_ok=True)

dest = outdir / "_regexes.py"
dest = outdir / "_matchers.py"
dest_legacy = outdir / "_regexes.py"

with dest.open("wb") as fp:
with dest.open("wb") as f, dest_legacy.open("wb") as fp:
# fmt: off
f.write(b"""\
########################################################
# NOTICE: this file is autogenerated from regexes.yaml #
########################################################
from .core import Matchers, UserAgentMatcher, OSMatcher, DeviceMatcher
MATCHERS: Matchers = ([
""")
fp.write(b"# -*- coding: utf-8 -*-\n")
fp.write(b"########################################################\n")
fp.write(b"# NOTICE: This file is autogenerated from regexes.yaml #\n")
@@ -87,31 +105,35 @@ def write_params(fields):
fp.write(b"\n")
fp.write(b"USER_AGENT_PARSERS = [\n")
for device_parser in regexes["user_agent_parsers"]:
fp.write(b" UserAgentParser(\n")
write_params([
write_matcher(f, "UserAgentMatcher", [
device_parser["regex"],
device_parser.get("family_replacement"),
device_parser.get("v1_replacement"),
device_parser.get("v2_replacement"),
])
fp.write(b" ),\n")
fp.write(b"]\n")
fp.write(b"\n")
fp.write(b"DEVICE_PARSERS = [\n")
for device_parser in regexes["device_parsers"]:
fp.write(b" DeviceParser(\n")

fp.write(b" UserAgentParser(\n")
write_params([
device_parser["regex"],
device_parser.get("regex_flag"),
device_parser.get("device_replacement"),
device_parser.get("brand_replacement"),
device_parser.get("model_replacement"),
device_parser.get("family_replacement"),
device_parser.get("v1_replacement"),
device_parser.get("v2_replacement"),
])
fp.write(b" ),\n")
fp.write(b"]\n")
fp.write(b"\n")
f.write(b" ], [\n")
fp.write(b"]\n\n")

fp.write(b"OS_PARSERS = [\n")
for device_parser in regexes["os_parsers"]:
write_matcher(f, "OSMatcher", [
device_parser["regex"],
device_parser.get("os_replacement"),
device_parser.get("os_v1_replacement"),
device_parser.get("os_v2_replacement"),
device_parser.get("os_v3_replacement"),
device_parser.get("os_v4_replacement"),
])

fp.write(b" OSParser(\n")
write_params([
device_parser["regex"],
@@ -122,6 +144,29 @@ def write_params(fields):
device_parser.get("os_v4_replacement"),
])
fp.write(b" ),\n")
f.write(b" ], [\n")
fp.write(b"]\n\n")

fp.write(b"DEVICE_PARSERS = [\n")
for device_parser in regexes["device_parsers"]:
write_matcher(f, "DeviceMatcher", [
device_parser["regex"],
device_parser.get("regex_flag"),
device_parser.get("device_replacement"),
device_parser.get("brand_replacement"),
device_parser.get("model_replacement"),
])

fp.write(b" DeviceParser(\n")
write_params([
device_parser["regex"],
device_parser.get("regex_flag"),
device_parser.get("device_replacement"),
device_parser.get("brand_replacement"),
device_parser.get("model_replacement"),
])
fp.write(b" ),\n")
f.write(b"])\n")
fp.write(b"]\n")
# fmt: on
