Broken test case(s) when implementing patch_minor in user agent parser? #562

Closed
masklinn opened this issue Nov 1, 2023 · 1 comment

masklinn commented Nov 1, 2023

It might be an error in my implementation, but when trying to add `patch_minor` to the "user agent" parser I'm getting mismatches. `test_ua.yaml` has the following case:

uap-core/tests/test_ua.yaml

Lines 1478 to 1483 in d668d6c

- user_agent_string: 'Mozilla/5.0 (Series40; NokiaC3-01/05.60; Profile/MIDP-2.1 Configuration/CLDC-1.1) Gecko/20100401 S40OviBrowser/2.2.0.0.31'
  family: 'Ovi Browser'
  major: '2'
  minor: '2'
  patch: '0'
  patch_minor:

as far as I can tell this should match the following rule:

uap-core/regexes.yaml

Lines 850 to 851 in d668d6c

- regex: '(S40OviBrowser)/(\d+)\.(\d+)\.(\d+)\.(\d+)'
  family_replacement: 'Ovi Browser'

but the rule has 5 groups, so it captures `patch_minor: '0'`, not null, does it not?
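
The group count can be checked directly. This is a minimal sketch using Python's standard `re` module; the rule and the user agent string are copied from the snippets above, while the variable names are only illustrative.

```python
import re

# Rule copied from regexes.yaml (lines 850-851 at d668d6c).
OVI_RULE = re.compile(r'(S40OviBrowser)/(\d+)\.(\d+)\.(\d+)\.(\d+)')

# User agent string copied from the test case in test_ua.yaml.
ua = ('Mozilla/5.0 (Series40; NokiaC3-01/05.60; Profile/MIDP-2.1 '
      'Configuration/CLDC-1.1) Gecko/20100401 S40OviBrowser/2.2.0.0.31')

match = OVI_RULE.search(ua)
assert match is not None

# Group 1 is the family; groups 2-5 map to major, minor, patch, patch_minor.
family, major, minor, patch, patch_minor = match.groups()
print(OVI_RULE.groups, patch_minor)  # prints: 5 0
```

So a parser that maps the fifth group to `patch_minor` yields `'0'` here, where the test case expects an empty value.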

And the reverse issue,

uap-core/tests/test_ua.yaml

Lines 6754 to 6759 in d668d6c

- user_agent_string: 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 [FBAN/FBIOS;FBAV/194.0.0.38.99;FBBV/127868476;FBDV/iPhone7,2;FBMD/iPhone;FBSN/iOS;FBSV/11.4.1;FBSS/2;FBCR/OrangeBotswana;FBID/phone;FBLC/en_GB;FBOP/5;FBRV/128807018]'
  family: 'Facebook'
  major: '194'
  minor: '0'
  patch: '0'
  patch_minor: '38'

should match

uap-core/regexes.yaml

Lines 176 to 177 in d668d6c

- regex: '\[FB.{0,300};(FBAV)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)'
  family_replacement: 'Facebook'

but the regex only has 4 groups including the family, so there should be no capture of the patch_minor.
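
The same kind of check for this case, again as a sketch with Python's standard `re` module (rule and user agent string copied from above, variable names illustrative):

```python
import re

# Rule copied from regexes.yaml (lines 176-177 at d668d6c).
FB_RULE = re.compile(r'\[FB.{0,300};(FBAV)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)')

# User agent string copied from the test case in test_ua.yaml.
ua = ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) '
      'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 '
      '[FBAN/FBIOS;FBAV/194.0.0.38.99;FBBV/127868476;FBDV/iPhone7,2;'
      'FBMD/iPhone;FBSN/iOS;FBSV/11.4.1;FBSS/2;FBCR/OrangeBotswana;'
      'FBID/phone;FBLC/en_GB;FBOP/5;FBRV/128807018]')

match = FB_RULE.search(ua)
assert match is not None

# Only four groups exist: family, major, minor, patch.
print(FB_RULE.groups, match.groups())  # prints: 4 ('FBAV', '194', '0', '0')
```

There is no fifth group that could produce the `'38'` the test case expects for `patch_minor`.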

masklinn added a commit to masklinn/uap-python that referenced this issue Nov 3, 2023
New API with full typing
========================

Seems pretty self-explanatory: rather than returning somewhat ad-hoc
dicts, this API works off of dataclasses. It should be compatible with
the legacy version through the magic of ~~buying two of them~~
`dataclasses.asdict`.
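
As a rough illustration of that bridge, here is a sketch assuming a
hypothetical `UserAgent` dataclass. The class name and defaults are
assumptions; the field names are the ones used in the test data, and
only `dataclasses.asdict` itself is a real standard-library call.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class UserAgent:
    """Hypothetical result type; the real class may differ."""
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None
    patch_minor: Optional[str] = None


ua = UserAgent(family="Ovi Browser", major="2", minor="2", patch="0")

# asdict() converts the typed result back into a plain dict.
legacy_style = asdict(ua)
print(legacy_style["family"], legacy_style["patch_minor"])  # prints: Ovi Browser None
```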

Parser API
==========

The legacy version had "parsers" which really represent individual
parsing rules. In the new API, the job of a parser is what the
top-level functions did: they wrap around the entire job of parsing a
user-agent string.

The core API is just `__call__`, with a selection flag for the domains
(seems like the least bad term for what "user agent", "os", and
"device" are; other alternatives I considered are "component" and
"category", but I'm still ambivalent). Overridable helpers are
provided which match the old API's methods (with PEP8 conventions), as
well as the same style of helpers at the package toplevel.
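
A minimal sketch of that shape follows. All names here (`Domain`,
`PartialResult`, `ToyParser`, the helper methods) are assumptions made
for illustration, and the toy parser uses substring checks rather than
the real rules.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Domain(Flag):
    """Hypothetical selection flag: which parts of the string to parse."""
    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()
    ALL = USER_AGENT | OS | DEVICE


@dataclass(frozen=True)
class PartialResult:
    string: str
    user_agent: Optional[str] = None  # stays None when not requested
    os: Optional[str] = None
    device: Optional[str] = None


class Parser:
    def __call__(self, ua: str, domains: Domain) -> PartialResult:
        raise NotImplementedError

    # Overridable helpers layered on top of __call__.
    def parse(self, ua: str) -> PartialResult:
        return self(ua, Domain.ALL)

    def parse_os(self, ua: str) -> Optional[str]:
        return self(ua, Domain.OS).os


class ToyParser(Parser):
    """Stand-in for real rule matching: plain substring checks."""

    def __call__(self, ua: str, domains: Domain) -> PartialResult:
        family = os = None
        if Domain.USER_AGENT in domains and "Firefox" in ua:
            family = "Firefox"
        if Domain.OS in domains and "Linux" in ua:
            os = "Linux"
        return PartialResult(ua, user_agent=family, os=os)


parser = ToyParser()
ua = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
only_os = parser(ua, Domain.OS)
print(only_os.os, only_os.user_agent)  # prints: Linux None
print(parser.parse(ua).user_agent)     # prints: Firefox
```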

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to
test that), the ability to instantiate parsers should provide the
opportunity for things like thread-local parsers, or actual
parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently,
including e.g. multiple independent custom yaml sets. Not sure there's
a use for it, but why not?

At the very least it should make using custom YAML datasets much
easier than having to set envvars.

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to
set the global parser. Hopefully this makes evaluating proposed
parsers as well as evaluating & tuning caches (algorithm & size)
easier. Even more so as we should provide some sort of evaluation CLI
in ua-parser#163.

Caches
------

In the old API, the package-provided cache could only be global and
with a single implementation, as it had to integrate with the toplevel
parsing functions. By reifying the parsing job, a cache is just a
parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative
cache strategies.
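
To make the "a cache is just a parser" idea concrete, here is a small
sketch. The class names and the clear-when-full strategy are
assumptions chosen for brevity, not necessarily what the package
ships.

```python
from typing import Callable, Dict


class CountingParser:
    """Toy 'base' parser that records how often it is actually invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, ua: str) -> str:
        self.calls += 1
        return ua.split("/", 1)[0]  # pretend the product token is the family


class ClearingCache:
    """A parser wrapping another parser; delegates only on a cache miss."""

    def __init__(self, inner: Callable[[str], str], maxsize: int) -> None:
        self.inner = inner
        self.maxsize = maxsize
        self.cache: Dict[str, str] = {}

    def __call__(self, ua: str) -> str:
        if ua in self.cache:
            return self.cache[ua]
        if len(self.cache) >= self.maxsize:
            self.cache.clear()  # simplest possible eviction: drop everything
        result = self.cache[ua] = self.inner(ua)
        return result


base = CountingParser()
cached = ClearingCache(base, maxsize=2)
first = cached("curl/8.0")
second = cached("curl/8.0")  # served from the cache, base is not called
print(first, second, base.calls)  # prints: curl curl 1
```

Because the cache has the same call signature as the parser it wraps,
swapping in a different strategy only means swapping the wrapper.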

Bulk APIs
---------

The current parser checks rules (regexes) one at a time on the input,
but there are advanced regex APIs which can check a regex *set* and
return which one(s) matched, allowing much more efficient bulk
matching, e.g. Google's re2, Rust's regex.

With the old scheme, this would be a pretty significant change in use
/ behaviour, obviating the use of the "parsers" with no
recourse. Under the new parsing scheme, these can just be different
"base" parsers: they can be the default, they can be cached, and users
can instantiate their own parser instead.

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though
that requires excluding that bit from the tests as there are
apparently broken test cases around that
item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is the opportunity to allow setting parsers at
runtime more easily (instead of load-time envvars). However, optional
constructors (classmethods) turn out to be iffy from both an API and a
typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of
the UAs) just take a uniform parsed data set, and have utility loaders
provide that from various data sources (precompiled, preformatted, or
data files). This avoids redundancy and the need for mixins /
inheritance, and mypy is *much* happier.

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to
be pretty mid.

Instead, the new API relies on similar but better typed matcher
classes, with a slightly different API: they return `None` on a match
failure instead of a triplet, which makes them compose better in
iteration (e.g. can just `filter` them out).
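
A sketch of why `None` composes well, using the two rules quoted in
ua-parser/uap-core#562. The `Matcher` and `UserAgent` classes here are
simplified assumptions, not the real matcher classes.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UserAgent:
    family: str
    major: Optional[str] = None


class Matcher:
    """Hypothetical matcher: returns a result, or None when the rule fails."""

    def __init__(self, pattern: str, family: Optional[str] = None) -> None:
        self.regex = re.compile(pattern)
        self.family = family

    def __call__(self, ua: str) -> Optional[UserAgent]:
        m = self.regex.search(ua)
        if m is None:
            return None
        return UserAgent(family=self.family or m.group(1), major=m.group(2))


matchers: List[Matcher] = [
    Matcher(r'\[FB.{0,300};(FBAV)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)', 'Facebook'),
    Matcher(r'(S40OviBrowser)/(\d+)\.(\d+)\.(\d+)\.(\d+)', 'Ovi Browser'),
]

ua = 'Gecko/20100401 S40OviBrowser/2.2.0.0.31'

# filter(None, ...) drops the failed matches; next() takes the first hit.
result = next(filter(None, (m(ua) for m in matchers)), None)
print(result)  # prints: UserAgent(family='Ovi Browser', major='2')
```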

Add a `Matchers` alias to carry them around (a tuple of lists of
matchers) for convenience, as well as for use as the base parser
parameter.

Also clarify the replacer rules, and hopefully implement the thing
more clearly.

Fixes ua-parser#93, fixes ua-parser#142, closes ua-parser#116
masklinn added a commit to masklinn/uap-python that referenced this issue Jan 14, 2024
Since the caching parser is stateful, protecting it with an optional
lock seems like the best way to make caching thread-safe. When only
using a single thread, or using thread-local parsers, locking can be
disabled by using a `contextlib.nullcontext` as the lock.
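
A sketch of the optional-lock idea. The `LockedCache` class, its
`lock` parameter, and the default of creating a `threading.Lock` are
assumptions for illustration; `threading.Lock` and
`contextlib.nullcontext` are the real standard-library objects.

```python
import contextlib
import threading
from contextlib import AbstractContextManager
from typing import Callable, Dict, Optional


class LockedCache:
    """Hypothetical caching parser whose shared state is guarded by a lock."""

    def __init__(
        self,
        inner: Callable[[str], str],
        lock: Optional[AbstractContextManager] = None,
    ) -> None:
        self.inner = inner
        # Assumption: default to a real lock unless the caller opts out.
        self.lock = threading.Lock() if lock is None else lock
        self.cache: Dict[str, str] = {}

    def __call__(self, ua: str) -> str:
        with self.lock:
            if ua in self.cache:
                return self.cache[ua]
        result = self.inner(ua)  # parse outside the lock
        with self.lock:
            self.cache[ua] = result
        return result


# Shared between threads: keep the real lock.
shared = LockedCache(str.lower)

# Single-threaded or thread-local use: nullcontext makes locking a no-op.
local = LockedCache(str.lower, lock=contextlib.nullcontext())

print(shared("CURL/8.0"), local("CURL/8.0"))  # prints: curl/8.0 curl/8.0
```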
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 3, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 3, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 6, 2024
masklinn added a commit to ua-parser/uap-python that referenced this issue Feb 6, 2024
@masklinn

resolved by #579
