
robots.allowed returns false for other sites (and domains) #110

Open
gk1544 opened this issue Mar 23, 2019 · 2 comments


gk1544 commented Mar 23, 2019

Hi,

Let's take a look at the following example from Google:

robots.txt location:
http://example.com/robots.txt

Valid for:
http://example.com/
http://example.com/folder/file
Not valid for:
http://other.example.com/
https://example.com/
http://example.com:8181/

For instance, when asked if any page on http://other.example.com/ is allowed, reppy returns False.

It should either return True or raise an exception, but definitely not False.
Returning False is incorrect because robots.txt is not a whitelist: rules in example.com's robots.txt say nothing about other hosts.

Here is an example:

from reppy.robots import Robots

robots_content = 'Disallow: /abc'
robots = Robots.parse('http://example.com/robots.txt', robots_content)

print(robots.allowed('http://example.com/', '*'))
# True (correct)
print(robots.allowed('http://other.example.com/', '*'))
# False (incorrect)
print(robots.allowed('http://apple.com/', '*'))
# False (incorrect)
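
A sketch of the behavior I'd expect, using a hypothetical wrapper (not part of reppy's API) that compares the URL's scheme and host against the robots.txt URL, per the Google documentation quoted above, and raises instead of silently returning False:

from urllib.parse import urlparse

from reppy.robots import Robots

def allowed_same_host(robots, robots_url, url, agent):
    # Hypothetical helper: only answer for URLs actually covered by this
    # robots.txt, i.e. the same scheme, host, and port it was fetched from.
    expected = urlparse(robots_url)
    actual = urlparse(url)
    if (actual.scheme, actual.netloc) != (expected.scheme, expected.netloc):
        # The file says nothing about this URL, so neither True nor False
        # is accurate; raising makes the mismatch explicit.
        raise ValueError('%s is not covered by %s' % (url, robots_url))
    return robots.allowed(url, agent)

robots_url = 'http://example.com/robots.txt'
robots = Robots.parse(robots_url, 'User-agent: *\nDisallow: /abc')

print(allowed_same_host(robots, robots_url, 'http://example.com/', '*'))  # True
# allowed_same_host(robots, robots_url, 'http://other.example.com/', '*')  # raises ValueError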

dlecocq commented Mar 25, 2019

I can certainly understand the argument for not wanting it to return False - it is somewhat misleading. Ultimately this traces down to rep-cpp's agent.cpp at https://github.com/seomoz/rep-cpp/blob/master/src/agent.cpp#L69.

I have mixed feelings about what the behavior should be. On the one hand, False doesn't really capture the truth of it, but it is the safer alternative - better to incorrectly report False than to risk incorrectly reporting True; what we really need is a way to convey "it's not clear whether this is allowed or not based on this robots.txt." On the other hand, throwing an exception doesn't feel quite appropriate either, because the situation isn't particularly exceptional. Perhaps a different return type that conveys more of the nuance would work, but that also seems a little clunky.

Whenever we've used this, we've generally gone through the cache, which takes care of finding the appropriate Robots or Agent based on the domain.
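
For reference, a minimal sketch of that cache-based usage, based on the RobotsCache described in reppy's README (it fetches each host's robots.txt over the network, so treat it as illustrative):

from reppy.cache import RobotsCache

# The cache maps every URL to the robots.txt of its own scheme/host/port,
# so checks against different domains consult the right file.
cache = RobotsCache(capacity=100)

print(cache.allowed('http://example.com/some/path', 'my-user-agent'))
print(cache.allowed('http://other.example.com/', 'my-user-agent'))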

pensnarik commented

What's the workaround for this? Many websites have robots.txt rules only on the second-level domain, which means that links containing "www.domain.com" are also reported as forbidden by the rules even though they're not. For example:

DEBUG - URL https://insurancejournal.com/news/west/ is allowed in robots.txt
DEBUG - URL https://www.insurancejournal.com/news/international/2020/10/02/584993.htm is FORBIDDEN by robots.txt, skipping

I'm thinking of removing www. from the URL before checking it, but that looks ugly.
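
One possible workaround sketch, rather than stripping www.: check each URL against the robots.txt of its own host, with a small per-host cache (the helper and cache below are my own illustration; reppy's RobotsCache does essentially this):

from urllib.parse import urlparse, urlunparse

from reppy.robots import Robots

_robots_by_url = {}  # robots.txt URL -> parsed Robots, fetched lazily

def allowed(url, agent):
    # Judge www.insurancejournal.com URLs by www.insurancejournal.com's own
    # robots.txt instead of the bare-domain file.
    parts = urlparse(url)
    robots_url = urlunparse((parts.scheme, parts.netloc, '/robots.txt', '', '', ''))
    if robots_url not in _robots_by_url:
        # Requires network access; Robots.fetch downloads and parses the file.
        _robots_by_url[robots_url] = Robots.fetch(robots_url)
    return _robots_by_url[robots_url].allowed(url, agent)

print(allowed('https://www.insurancejournal.com/news/international/2020/10/02/584993.htm', '*'))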
