
remove 'www*.' prefix when parsing allowed_domains #27

Merged: 8 commits into main on Jan 30, 2024

Conversation

@BurnzZ (Contributor) commented Jan 12, 2024

This fixes the following use case:

  • The user passes the argument url=https://www.example.com.
  • This sets allowed_domains=["www.example.com"].
  • However, the site lists its links as https://example.com/some-page (without the www. prefix).
  • This causes the OffsiteMiddleware to filter out those links.

Removing any 'www*.' prefixes beforehand allows for a cleaner allowed_domains value.

The reverse also works: when url=https://example.com is used as the arg, allowed_domains is set to ["example.com"], and links like https://www.example.com/some-page still pass, because the OffsiteMiddleware treats an allowed domain as matching itself and any of its subdomains (code ref).
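To make the asymmetry concrete, here is a minimal sketch of that matching, modeled on the regex Scrapy's OffsiteMiddleware builds in get_host_regex (simplified; the real middleware also handles ports and allows everything when allowed_domains is empty):

import re

def host_matches(host: str, allowed_domains: list[str]) -> bool:
    # Simplified form of OffsiteMiddleware.get_host_regex: an allowed
    # domain matches itself and any of its subdomains.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.match(rf"^(.*\.)?({domains})$", host) is not None

# allowed_domains=["example.com"] accepts the bare domain and subdomains:
assert host_matches("example.com", ["example.com"])
assert host_matches("www.example.com", ["example.com"])

# ...but allowed_domains=["www.example.com"] rejects the bare domain:
assert not host_matches("example.com", ["www.example.com"])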

Lastly, this also moves the parsing of allowed_domains from the BaseSpider to the EcommerceSpider, since the ArticleSpider will need to parse them differently due to having multiple seeds.
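As a rough sketch of where that parsing now lives (hypothetical and simplified; the real spiders in zyte-spider-templates receive their arguments through a params model), using the new get_domain helper from zyte_spider_templates/utils.py that appears later in this review:

from scrapy import Spider

from zyte_spider_templates.utils import get_domain


class EcommerceSpider(Spider):
    # Hypothetical, simplified signature: allowed_domains is derived here
    # instead of in BaseSpider.
    name = "ecommerce"

    def __init__(self, *args, url: str, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [get_domain(url)]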

@BurnzZ requested review from kmike, wRAR and PyExplorer on January 12, 2024 07:18
@codecov-commenter commented Jan 12, 2024

Codecov Report

Merging #27 (6364d2e) into main (570edab) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 6364d2e differs from pull request most recent head 7d205c0. Consider uploading reports for the commit 7d205c0 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #27      +/-   ##
==========================================
+ Coverage   98.57%   98.58%   +0.01%     
==========================================
  Files          11       12       +1     
  Lines         492      496       +4     
==========================================
+ Hits          485      489       +4     
  Misses          7        7              
Files                                        Coverage Δ
zyte_spider_templates/spiders/base.py        100.00% <ø> (ø)
zyte_spider_templates/spiders/ecommerce.py   98.88% <100.00%> (+0.02%) ⬆️
zyte_spider_templates/utils.py               100.00% <100.00%> (ø)

@BurnzZ force-pushed the allowed-domains branch 2 times, most recently from cc58fb6 to fc2df5e, on January 12, 2024 08:25
@Gallaecio (Contributor) left a comment

Looks good to me.

Although I wonder if a better approach might be to find the top-level domain part of the URL (with tldextract, which Scrapy uses) and include that plus the domain. So, a.b.c.example.com becomes example.com. It’s just a thought, though, and I am not even sure if it would be a change for the better or for the worse.
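For illustration, tldextract exposes exactly that combination as registered_domain:

import tldextract

ext = tldextract.extract("https://a.b.c.example.com/")
print(ext.registered_domain)  # example.com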

@BurnzZ (Contributor, Author) commented Jan 15, 2024

Although I wonder if a better approach might be to find the top-level domain part of the URL (with tldextract, which Scrapy uses) and include that plus the domain. So, a.b.c.example.com becomes example.com. It’s just a thought, though, and I am not even sure if it would be a change for the better or for the worse.

This might be the case later on as @PyExplorer polishes the Article Spider, which has quite a lot of varying domain cases.

EDIT: It seems tldextract would still keep the www. prefix in the subdomain part. For eCommerce we'd need subdomain prefixes like uk and fr to be preserved so that crawls stay within them.

In [1]: tldextract.extract('http://www.uk.example.com/')
Out[1]: ExtractResult(subdomain='www.uk', domain='example', suffix='com', is_private=False)

In [2]: tldextract.extract('http://www.fr.example.com/')
Out[2]: ExtractResult(subdomain='www.fr', domain='example', suffix='com', is_private=False)
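A hypothetical middle ground (not something this PR implements) would be to strip only a leading www label while keeping meaningful subdomains:

import tldextract


def domain_without_www(url: str) -> str:
    # Hypothetical helper: drop only a leading "www" label and keep the
    # rest of the subdomain (e.g. country prefixes such as uk or fr).
    ext = tldextract.extract(url)
    labels = [label for label in ext.subdomain.split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels + [ext.domain, ext.suffix])


print(domain_without_www("http://www.uk.example.com/"))  # uk.example.com
print(domain_without_www("http://www.example.com/"))     # example.com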

@BurnzZ mentioned this pull request on Jan 26, 2024


zyte_spider_templates/utils.py:

import re

from scrapy.utils.url import parse_url


def get_domain(url: str) -> str:
    return re.sub(r"www.*?\.", "", parse_url(url).netloc)
A reviewer (Contributor) commented on this hunk:
Would it make sense to have this function more conservative? It looks like this one can change netlocs like wwworld.example.com, or my.wwworld.com, or awww.com.
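Running the pattern over those netlocs confirms the unintended rewrites:

import re

for netloc in ["wwworld.example.com", "my.wwworld.com", "awww.com"]:
    print(netloc, "->", re.sub(r"www.*?\.", "", netloc))
# wwworld.example.com -> example.com
# my.wwworld.com -> my.com
# awww.com -> acom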

What are the real-world cases besides www.example.com which we should support?

@BurnzZ (Contributor, Author) replied:

Thanks for all these great examples! Updated in 4343c7f.

What are the real-world cases besides www.example.com which we should support?

The most common ones observed were:
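In any case, a more conservative pattern would anchor the match at the start of the netloc. A minimal sketch, assuming only leading www labels with optional digits (www., www1., www2.) should be stripped; the exact regex landed in 4343c7f may differ:

import re

from scrapy.utils.url import parse_url  # Scrapy utility; returns a parsed URL


def get_domain(url: str) -> str:
    # Anchored pattern: strips only a leading "www" label, optionally
    # followed by digits, so substrings like "wwworld" are left alone.
    return re.sub(r"^www\d*\.", "", parse_url(url).netloc)


print(get_domain("https://www.example.com"))      # example.com
print(get_domain("https://www2.example.com"))     # example.com
print(get_domain("https://wwworld.example.com"))  # wwworld.example.com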

@BurnzZ merged commit 832a008 into main on Jan 30, 2024. 9 checks passed.
@BurnzZ deleted the allowed-domains branch on January 30, 2024 at 01:20.