
remove 'www*.' prefix when parsing allowed_domains #27

Merged: 8 commits into main on Jan 30, 2024

Conversation

@BurnzZ (Contributor) commented Jan 12, 2024

This fixes the following use case:

  • The user passes the argument url=https://www.example.com.
  • This sets allowed_domains=["www.example.com"].
  • However, the site lists its links as https://example.com/some-page (without the www. prefix).
  • This causes the OffsiteMiddleware to filter out those links.

Removing any 'www*.' prefixes beforehand allows for a cleaner allowed_domains value.

The reverse also works: when url=https://example.com is used as the arg, allowed_domains is set to ["example.com"], and links like https://www.example.com/some-page still pass, because the OffsiteMiddleware treats an allowed domain as matching itself and any of its subdomains (code ref).
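To make the asymmetry concrete, here is a minimal sketch of that matching, modeled on the regex Scrapy's OffsiteMiddleware builds in get_host_regex (simplified; the real middleware also handles ports and allows everything when allowed_domains is empty):

import re

def host_matches(host: str, allowed_domains: list[str]) -> bool:
    # Simplified form of OffsiteMiddleware.get_host_regex: an allowed
    # domain matches itself and any of its subdomains.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.match(rf"^(.*\.)?({domains})$", host) is not None

# allowed_domains=["example.com"] accepts the bare domain and subdomains:
assert host_matches("example.com", ["example.com"])
assert host_matches("www.example.com", ["example.com"])

# ...but allowed_domains=["www.example.com"] rejects the bare domain:
assert not host_matches("example.com", ["www.example.com"])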

Lastly, this also moves the parsing of allowed_domains from the BaseSpider to the EcommerceSpider, since the ArticleSpider will need to parse them differently due to having multiple seeds.
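As a rough sketch of where that parsing now lives (hypothetical and simplified; the real spiders in zyte-spider-templates receive their arguments through a params model), using the new get_domain helper from zyte_spider_templates/utils.py that appears later in this review:

from scrapy import Spider

from zyte_spider_templates.utils import get_domain


class EcommerceSpider(Spider):
    # Hypothetical, simplified signature: allowed_domains is derived here
    # instead of in BaseSpider.
    name = "ecommerce"

    def __init__(self, *args, url: str, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [get_domain(url)]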

@BurnzZ requested review from kmike, wRAR and PyExplorer on January 12, 2024 07:18
@codecov-commenter commented Jan 12, 2024

Codecov Report

Merging #27 (6364d2e) into main (570edab) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 6364d2e differs from pull request most recent head 7d205c0. Consider uploading reports for the commit 7d205c0 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #27      +/-   ##
==========================================
+ Coverage   98.57%   98.58%   +0.01%     
==========================================
  Files          11       12       +1     
  Lines         492      496       +4     
==========================================
+ Hits          485      489       +4     
  Misses          7        7              
Files                                        Coverage Δ
zyte_spider_templates/spiders/base.py        100.00% <ø> (ø)
zyte_spider_templates/spiders/ecommerce.py   98.88% <100.00%> (+0.02%) ⬆️
zyte_spider_templates/utils.py               100.00% <100.00%> (ø)

@BurnzZ force-pushed the allowed-domains branch 2 times, most recently from cc58fb6 to fc2df5e, on January 12, 2024 08:25
@Gallaecio (Contributor) left a comment

Looks good to me.

Although I wonder if a better approach might be to find the top-level domain part of the URL (with tldextract, which Scrapy uses) and include that plus the domain. So, a.b.c.example.com becomes example.com. It’s just a thought, though, and I am not even sure if it would be a change for the better or for the worse.
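For illustration, tldextract exposes exactly that combination as registered_domain:

import tldextract

ext = tldextract.extract("https://a.b.c.example.com/")
print(ext.registered_domain)  # example.com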

@BurnzZ (Contributor, Author) commented Jan 15, 2024

Although I wonder if a better approach might be to find the top-level domain part of the URL (with tldextract, which Scrapy uses) and include that plus the domain. So, a.b.c.example.com becomes example.com. It’s just a thought, though, and I am not even sure if it would be a change for the better or for the worse.

This might be the case later on as @PyExplorer polishes the Article Spider, which has quite a lot of varying domain cases.

EDIT: It seems tldextract would still keep the www. prefix in the subdomain part. For eCommerce we'd need subdomain prefixes like uk and fr to be preserved so that crawls stay within them.

In [1]: tldextract.extract('http://www.uk.example.com/')
Out[1]: ExtractResult(subdomain='www.uk', domain='example', suffix='com', is_private=False)

In [2]: tldextract.extract('http://www.fr.example.com/')
Out[2]: ExtractResult(subdomain='www.fr', domain='example', suffix='com', is_private=False)
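A hypothetical middle ground (not something this PR implements) would be to strip only a leading www label while keeping meaningful subdomains:

import tldextract


def domain_without_www(url: str) -> str:
    # Hypothetical helper: drop only a leading "www" label and keep the
    # rest of the subdomain (e.g. country prefixes such as uk or fr).
    ext = tldextract.extract(url)
    labels = [label for label in ext.subdomain.split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels + [ext.domain, ext.suffix])


print(domain_without_www("http://www.uk.example.com/"))  # uk.example.com
print(domain_without_www("http://www.example.com/"))     # example.com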

@BurnzZ mentioned this pull request on Jan 26, 2024


zyte_spider_templates/utils.py:

import re

from scrapy.utils.url import parse_url


def get_domain(url: str) -> str:
    return re.sub(r"www.*?\.", "", parse_url(url).netloc)
A reviewer (Contributor) commented on this hunk:
Would it make sense to have this function more conservative? It looks like this one can change netlocs like wwworld.example.com, or my.wwworld.com, or awww.com.
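Running the pattern over those netlocs confirms the unintended rewrites:

import re

for netloc in ["wwworld.example.com", "my.wwworld.com", "awww.com"]:
    print(netloc, "->", re.sub(r"www.*?\.", "", netloc))
# wwworld.example.com -> example.com
# my.wwworld.com -> my.com
# awww.com -> acom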

What are the real-world cases besides www.example.com which we should support?

@BurnzZ (Contributor, Author) replied:

Thanks for all these great examples! Updated in 4343c7f.

What are the real-world cases besides www.example.com which we should support?

The most common ones observed were:
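In any case, a more conservative pattern would anchor the match at the start of the netloc. A minimal sketch, assuming only leading www labels with optional digits (www., www1., www2.) should be stripped; the exact regex landed in 4343c7f may differ:

import re

from scrapy.utils.url import parse_url  # Scrapy utility; returns a parsed URL


def get_domain(url: str) -> str:
    # Anchored pattern: strips only a leading "www" label, optionally
    # followed by digits, so substrings like "wwworld" are left alone.
    return re.sub(r"^www\d*\.", "", parse_url(url).netloc)


print(get_domain("https://www.example.com"))      # example.com
print(get_domain("https://www2.example.com"))     # example.com
print(get_domain("https://wwworld.example.com"))  # wwworld.example.com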

@BurnzZ merged commit 832a008 into main on Jan 30, 2024. 9 checks passed.
@BurnzZ deleted the allowed-domains branch on January 30, 2024 at 01:20.