remove 'www*.' prefix when parsing allowed_domains #27
Conversation
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #27      +/-   ##
==========================================
+ Coverage   98.57%   98.58%   +0.01%
==========================================
  Files          11       12       +1
  Lines         492      496       +4
==========================================
+ Hits          485      489       +4
  Misses          7        7
```
force-pushed from cc58fb6 to fc2df5e
Looks good to me.
Although I wonder if a better approach might be to find the top-level domain part of the URL (with `tldextract`, which Scrapy uses), and include that plus the subdomain. So, `a.b.c.example.com` becomes `example.com`. It's just a thought, though, and I am not even sure if it would be a change for the better or for the worse.
This might be the case later on as @PyExplorer polishes the Article Spider, which has quite a lot of varying cases with the domains.

EDIT: It seems tldextract would still keep the `www` prefix as part of the subdomain:

```python
In [1]: tldextract.extract('http://www.uk.example.com/')
Out[1]: ExtractResult(subdomain='www.uk', domain='example', suffix='com', is_private=False)

In [2]: tldextract.extract('http://www.fr.example.com/')
Out[2]: ExtractResult(subdomain='www.fr', domain='example', suffix='com', is_private=False)
```
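If that direction were ever taken, a registered-domain helper might look like the sketch below. This is only an illustration of the suggestion, not code from the PR, and `get_registered_domain` is a hypothetical name:

```python
import tldextract

def get_registered_domain(url: str) -> str:
    # registered_domain joins the domain and the public suffix,
    # dropping any subdomain (including 'www' labels),
    # e.g. 'a.b.c.example.com' -> 'example.com'.
    return tldextract.extract(url).registered_domain

print(get_registered_domain("http://a.b.c.example.com/"))   # example.com
print(get_registered_domain("http://www.uk.example.com/"))  # example.com
```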
force-pushed from c66f4dd to a866976

force-pushed from d965b95 to fe458fc
zyte_spider_templates/utils.py (Outdated)

```python
def get_domain(url: str) -> str:
    return re.sub(r"www.*?\.", "", parse_url(url).netloc)
```
Would it make sense to have this function more conservative? It looks like this one can change netlocs like `wwworld.example.com`, or `my.wwworld.com`, or `awww.com`.
What are the real-world cases besides www.example.com which we should support?
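To make the concern concrete, here is a quick check of the current pattern against those netlocs (a sketch using only the standard library, not part of the PR):

```python
import re

for netloc in ["www.example.com", "wwworld.example.com", "my.wwworld.com", "awww.com"]:
    print(netloc, "->", re.sub(r"www.*?\.", "", netloc))

# www.example.com     -> example.com   (intended)
# wwworld.example.com -> example.com   (the 'wwworld' label is lost)
# my.wwworld.com      -> my.com        (mangled)
# awww.com            -> acom          (mangled)
```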
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all these great examples! Updated in 4343c7f.
> What are the real-world cases besides www.example.com which we should support?
The most common ones that were observed would be:
Co-authored-by: Adrián Chaves <[email protected]>
This fixes this specific use case:

- `url=https://www.example.com` is used as the arg, which sets `allowed_domains=["www.example.com"]`.
- The site uses links like `https://example.com/some-page` (without the 'www.' prefix), causing the `OffsiteMiddleware` to filter out the links.

Removing any 'www*.' prefixes beforehand allows for a cleaner `allowed_domains` value.

The reverse should also work, where `url=https://example.com` is used as the arg which sets `allowed_domains=["example.com"]` and the site uses links like `https://www.example.com/some-page`. This is because the `allowed_domain` would always be a subset of the links in the `OffsiteMiddleware` (code ref).

Lastly, also moved parsing the `allowed_domains` from the `BaseSpider` to the `EcommerceSpider`, since `ArticleSpider` would have a different way to parse them due to multiple seeds.
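As a rough illustration of why stripping the prefix helps: the `OffsiteMiddleware` builds its own host regex, but `scrapy.utils.url.url_is_from_any_domain` has broadly the same subdomain-matching semantics, so it can demonstrate the behavior:

```python
from scrapy.utils.url import url_is_from_any_domain

# With allowed_domains=["www.example.com"], links on the bare host are offsite:
print(url_is_from_any_domain("https://example.com/some-page", ["www.example.com"]))    # False

# With the 'www.' prefix stripped, both variants pass, because any
# subdomain of an allowed domain also matches:
print(url_is_from_any_domain("https://example.com/some-page", ["example.com"]))        # True
print(url_is_from_any_domain("https://www.example.com/some-page", ["example.com"]))    # True
```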