When suggesting a site name we try to find the "significant"
part of the URL. For www.google.com that would be google.com,
but just keeping the last two parts (or removing the first one)
fails too often; amazon.co.uk is one example.
Further, each TLD has its own policy here, so an algorithmic
approach is bound to fail. https://publicsuffix.org/ tries
to gather all possible SLDs. It might not be perfect, but it is
better than what we have (hardcoding a couple like (com|edu|
co).*).
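A minimal sketch of the lookup idea, not the project's actual code (the `suffixes` set here is a tiny hand-picked sample standing in for the full PSL):

```javascript
// Tiny sample of the Public Suffix List; the real list has thousands
// of entries, including multi-label suffixes like co.uk and com.tw.
const suffixes = new Set(["com", "co.uk", "com.tw"]);

// The "significant" part of a hostname is the longest matching public
// suffix plus one extra label (the registrable domain).
function significantPart(hostname) {
  const labels = hostname.toLowerCase().split(".");
  // Try the longest candidate suffix first, shrinking label by label.
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i + 1).join(".");
    if (suffixes.has(candidate)) {
      // One label + the public suffix.
      return labels.slice(i).join(".");
    }
  }
  return hostname; // no match: fall back to the full name
}

console.log(significantPart("www.google.com"));   // → "google.com"
console.log(significantPart("www.amazon.co.uk")); // → "amazon.co.uk"
```

Note this sketch ignores the PSL's wildcard (`*.foo`) and exception (`!bar.foo`) rules, which a real implementation has to handle.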
The list is rather large, but with some clever(?) tricks
we can get it down to an acceptable size:
Going a bit crazy here. Browsers don't support gzip/deflate data
yet (waiting for the Compression Streams API), and other
compression schemes with reasonable libraries available simply
don't cut it on compression ratio.
In the meantime, PNG is lossless and uses deflate compression,
which is exactly what we need :) So this patch pre-processes the
PSL for easy lookup (removing a lot of redundant text) and
exports the result as a JSON dictionary.
This is then converted to a PNG by ImageMagick.
The browser loads the image, we read the pixel values, and end
up with our desired JSON dict.
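A sketch of the decode step (names and layout are illustrative, not necessarily what the patch does): assuming each byte of the JSON text is stored in one grayscale pixel, `getImageData()` on a canvas yields RGBA pixels where the R channel carries the original byte.

```javascript
// Turn canvas-style RGBA pixel data back into the JSON dict it encodes.
// Assumption: one JSON byte per pixel, carried in the R channel.
function pixelsToJson(rgba, byteLength) {
  const bytes = new Uint8Array(byteLength);
  for (let i = 0; i < byteLength; i++) {
    bytes[i] = rgba[i * 4]; // R channel of pixel i
  }
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Simulate the pixel data getImageData() would return for '{"a":1}'.
const json = '{"a":1}';
const rgba = new Uint8ClampedArray(json.length * 4);
[...json].forEach((ch, i) => {
  rgba[i * 4] = ch.charCodeAt(0); // R holds the byte
  rgba[i * 4 + 3] = 255;          // opaque alpha
});
console.log(pixelsToJson(rgba, json.length));
```

In the browser the `rgba` array would come from drawing the loaded image to a canvas and calling `ctx.getImageData(...)`; the byte length would need to be stored alongside (e.g. in the first pixels), since the image is padded to its width.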
Issue #69
This is a separate feature opened after discussion of the subject in PR #68.
Example domain name: subdomaina.subdomainb.website.com.tw

Discussion to date has yielded the following methods:
Solution 2 has emerged as the favourite: import a copy of the Public Suffix List along with a system to read and use it.