Kitchenstories scrapper not detected #1261

Open
2 tasks done
hhopke opened this issue Sep 23, 2024 · 3 comments
Labels
bots-protection: A form of bot protection is preventing the fetching of the recipe's HTML
bug

Comments

@hhopke

hhopke commented Sep 23, 2024

Pre-filing checks

  • I have searched for open issues that report the same problem
  • I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

"https://www.kitchenstories.com/de/rezepte/susskartoffel-curry"

The results you expect to see

Scraped recipe

The results (including any Python error messages) that you are seeing

url = "https://www.kitchenstories.com/de/rezepte/susskartoffel-curry"
name = input('What is your name, risotto sampler?\n')
html = requests.get(url, headers={"User-Agent": f"Risotto Sampler {name}"}).content
scraper = scrape_html(html, org_url=url, wild_mode=False)
scraper.host()
scraper.title()
scraper.total_time()
scraper.image()
scraper.ingredients()
scraper.ingredient_groups()
scraper.instructions()
scraper.instructions_list()
scraper.yields()
scraper.to_json()
scraper.links()
scraper.nutrients()  # not always available
scraper.canonical_url()  # not always available
scraper.equipment()  # not always available
scraper.cooking_method()  # not always available
scraper.keywords()  # not always available
scraper.dietary_restrictions()  # not always available

Traceback (most recent call last):
  File "...\scratches\scratch_7.py", line 11, in <module>
    scraper.title()
  File "~\recipe_scrapers\plugins\exception_handling.py", line 63, in decorated_method_wrapper
    return decorated(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~\recipe_scrapers\plugins\html_tags_stripper.py", line 74, in decorated_method_wrapper
    decorated_func_result = decorated(self, *args, **kwargs)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~\recipe_scrapers\plugins\normalize_string.py", line 33, in decorated_method_wrapper
    return normalize_string(decorated(self, *args, **kwargs))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~\recipe_scrapers\plugins\schemaorg_fill.py", line 66, in decorated_method_wrapper
    raise e
  File "~\recipe_scrapers\plugins\schemaorg_fill.py", line 57, in decorated_method_wrapper
    return decorated(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~\recipe_scrapers\_abstract.py", line 95, in title
    raise NotImplementedError("This should be implemented.")
NotImplementedError: This should be implemented.

@hhopke hhopke added the bug label Sep 23, 2024
@jayaddison
Collaborator

Hi @hhopke - thank you for the bug report! I haven't been able to replicate this problem locally; could you check whether anything in the code I used below differs from yours?

>>> import requests
>>> from recipe_scrapers import scrape_html
>>> url = "https://www.kitchenstories.com/de/rezepte/susskartoffel-curry"
>>> name = input('What is your name, risotto sampler?\n')
What is your name, risotto sampler?
James
>>> html = requests.get(url, headers={"User-Agent": f"Risotto Sampler {name}"}).content
>>> scraper = scrape_html(html, org_url=url, wild_mode=False)
>>> scraper.title()
'Süßkartoffel-Curry'

@hhopke
Author

hhopke commented Oct 16, 2024

Hi @jayaddison,
I was on vacation, hence the late reply. I used exactly the same code. I just tried copying and pasting yours and I get the same output. Interestingly, though, I am getting this for multiple sites, as if the pages were blocking me.

For instance this page worked: https://fitmencook.com/recipes/mexican-tortilla-soup/

@jayaddison jayaddison added the bots-protection (A form of bot protection is preventing the fetching of the recipe's HTML) label Oct 17, 2024
@jayaddison
Collaborator

@hhopke no problem at all, thanks for responding. I have one idea, although it may be something you've already considered: do you know whether the relevant pages display as expected when opened in a popular web browser? That could provide one item of information, and perhaps a workaround:

  • Info: it may help confirm whether the problem could somehow be related to the script used to retrieve the recipe page (a difference in user-agent).
  • Workaround: if the page does load correctly in a browser, you should be able to save the source HTML of the page from your browser to a file, and then update the script to read that file and scrape from there instead.
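
A rough sketch of that workaround, in case it helps. The helper names `load_saved_html` and `scrape_saved_page` are made up for illustration and are not part of recipe_scrapers; the `scrape_html(..., org_url=url, wild_mode=False)` call is simply the one already used earlier in this thread, so it assumes the same library version.

```python
from pathlib import Path


def load_saved_html(path):
    """Read a page saved from the browser as text (hypothetical helper)."""
    # errors="replace" keeps reading even if the file is not clean UTF-8.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def scrape_saved_page(path, url):
    """Scrape a locally saved copy of `url` instead of fetching it again."""
    # Imported here so the file-reading helper works without the library.
    from recipe_scrapers import scrape_html

    html = load_saved_html(path)
    # Same call as earlier in this thread; org_url is still the recipe's
    # original URL, even though the HTML now comes from a file.
    return scrape_html(html, org_url=url, wild_mode=False)
```

Usage would be something like `scrape_saved_page("susskartoffel-curry.html", url).title()`, where the file name is whatever you chose when saving the page from the browser.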

Unfortunately there's often not a lot we can do about transient network errors and network/server filtering -- so I can't guarantee a successful result; but if the page does load in other browsers then, in theory at least, we have more options.
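
One way to check whether filtering is involved could be to compare what comes back for different user-agents. The sketch below is only a heuristic: `describe_response` and `fetch_and_describe` are hypothetical helpers, and looking for JSON-LD markers assumes the site embeds its recipe data as schema.org JSON-LD, which I haven't verified for the pages in question.

```python
def describe_response(status_code, body):
    """Summarise a fetched page so two fetches can be compared side by side."""
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    return {
        "status_code": status_code,
        "length": len(text),
        # Heuristic only: many recipe sites embed schema.org data as JSON-LD.
        "has_json_ld": "application/ld+json" in text,
        "mentions_recipe": '"Recipe"' in text,
    }


def fetch_and_describe(url, user_agent):
    """Fetch `url` with the given User-Agent and summarise what came back."""
    import requests  # imported lazily; only needed for the live request

    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    return describe_response(response.status_code, response.content)
```

Calling `fetch_and_describe(url, ...)` once with the script's user-agent and once with the user-agent string copied from your browser, then comparing the two summaries, may hint at whether the server responds differently to the script. A difference would be suggestive rather than conclusive.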
