
Unable to scrape JavaScript website #891

Open
nihaludin opened this issue Jan 13, 2025 · 9 comments

Labels
bug Something isn't working

Comments

@nihaludin
Hi all,

I’m encountering a similar issue and would appreciate your help. Initially, I received the message, "You need to enable JavaScript to run this app."

However, after enabling requires_js_support: True in my configuration, I am now encountering a timeout error.

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.example.com/) ---
Attempt 1 failed: Page.goto: Timeout 30000ms exceeded.
Call log:

navigating to "https://www.example.com/", waiting until "networkidle"
Here’s my current configuration:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "requires_js_support": True,
        "timeout": 120,
        "retry_limit": 3,
        "slow_mo": 20000
    }
}


dosubot bot commented Jan 13, 2025

Hey @nihaludin!

I'm a bot here to help you with bugs, questions, and contributions. Let's tackle this issue together!

To address the timeout error you're experiencing, consider the following adjustments:

  1. Increase Timeout: You can try increasing the timeout parameter further if the page requires more time to load. For example, set it to 180 seconds or more.

  2. Check URL: Ensure that the URL you're trying to scrape actually contains dynamic content that requires JavaScript. The example URL you provided is a placeholder and may not have any content to load.

  3. Debugging: Since you have headless set to False, you can visually inspect what the browser is doing during the scraping process. This can help identify if there are any issues with the page loading.

  4. Network Conditions: Ensure that there are no network issues that might be causing delays in loading the page.

Here's an updated configuration suggestion:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "requires_js_support": True,
        "timeout": 180,  # Increase timeout
        "retry_limit": 3,
        "slow_mo": 20000
    }
}

These steps should help mitigate the timeout issue. If the problem persists, consider testing with a different URL that has known dynamic content.
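
To isolate whether the timeout comes from Playwright's "networkidle" wait itself (the condition shown in your call log) rather than from ScrapeGraphAI, you can run a minimal standalone check. This is a diagnostic sketch, not part of the library's API: it assumes Playwright is installed (pip install playwright, then playwright install) and reuses the placeholder URL from your report.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # "domcontentloaded" fires much earlier than "networkidle"; if this
    # succeeds while "networkidle" times out, the page likely keeps
    # long-lived connections (analytics, sockets) open and never goes idle.
    page.goto(
        "https://www.example.com/",  # placeholder URL from the report
        wait_until="domcontentloaded",
        timeout=60_000,  # Playwright timeouts are in milliseconds
    )
    print(page.title())
    browser.close()

If the domcontentloaded variant loads fine, the problem is the networkidle condition rather than the page being unreachable.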


dosubot added the bug label on Jan 13, 2025
@nihaludin
Author

I increased the timeout parameter to 1000, but I'm still getting the same error. @dosu


dosubot bot commented Jan 13, 2025

If increasing the timeout to 1000 seconds didn't resolve the issue, consider the following steps:

  1. Verify JavaScript Content: Double-check that the URL you're trying to scrape actually requires JavaScript to load content. If the page is static or doesn't have dynamic content, enabling JavaScript support might not be necessary.

  2. Inspect Network Activity: Use the browser's developer tools to inspect network activity when loading the page manually. This can help identify if there are specific resources or scripts that are taking too long to load.

  3. Check for Errors: Look for any JavaScript errors in the console when loading the page manually. These errors might indicate issues that could affect the scraping process.

  4. Test with Different URLs: Try scraping a different URL that you know requires JavaScript and has dynamic content. This can help determine if the issue is specific to the current URL.

  5. Review Proxy and Network Settings: Ensure that there are no proxy or network settings that might be interfering with the page load.

  6. Adjust Slow Motion: Consider reducing the slow_mo parameter. Playwright interprets slow_mo in milliseconds, so slow_mo: 20000 pauses every browser operation for 20 seconds, which by itself can exhaust a 30-second navigation timeout. While slow_mo is useful for debugging, it slows down the entire run.

If these steps don't resolve the issue, it might be helpful to log more detailed information about the scraping attempts to diagnose the problem further.
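
For example, a standalone Playwright script can log every request and flag the ones that fail, showing which resources never settle. This is a diagnostic sketch (it assumes Playwright is installed and reuses the placeholder URL from the report), not ScrapeGraphAI functionality:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Print each request as it is issued, and flag failures; requests that
    # keep firing continuously explain why "networkidle" never triggers.
    page.on("request", lambda req: print("->", req.method, req.url))
    page.on("requestfailed", lambda req: print("FAILED:", req.url, req.failure))
    page.goto("https://www.example.com/", wait_until="load", timeout=60_000)
    page.wait_for_timeout(10_000)  # watch the traffic for 10 s after load
    browser.close()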


@nihaludin
Author

@VinciGit00 I'm not able to fetch details even after setting "requires_js_support": True. Is there any other approach to overcome this issue? Thanks!

@VinciGit00
Collaborator

Why did you navigate to https://www.example.com/? It does not require JavaScript.

@nihaludin
Author

@VinciGit00 Apologies, that was just an example, not the actual website. I'm working with an e-commerce website that requires JavaScript.

@VinciGit00
Collaborator

OK, please send the complete code.

@nihaludin
Author

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        # "requires_js_support": True,
        "timeout": 120,
        "retry_limit": 1,
        # "slow_mo": 200000,
    },
}

smart_scraper = SmartScraperGraph(
    prompt="Segment the URLs with their respective category based on product type",  # prompt for extracting categories
    source="some ecommerce website",
    config=graph_config,
)

result = smart_scraper.run()

@VinciGit00 please find the code above. Thanks!

@nihaludin
Author

I'm trying to scrape this website, @VinciGit00:
https://www.meijer.com/shopping/c/collections/wellness-on-sale.html?icid=HP:sub:7:011925::0119DealsWellness
I'm only able to extract the home page contents.
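
For reference, a minimal standalone Playwright check (a sketch only; the scroll count and the generic a[href] selector are illustrative assumptions, not taken from the actual page markup) could show whether the product links appear in the rendered DOM only after scrolling:

from playwright.sync_api import sync_playwright

URL = ("https://www.meijer.com/shopping/c/collections/"
       "wellness-on-sale.html?icid=HP:sub:7:011925::0119DealsWellness")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(URL, wait_until="domcontentloaded", timeout=90_000)
    # Scroll a few times to trigger any lazy-loaded product tiles.
    for _ in range(5):
        page.mouse.wheel(0, 2000)
        page.wait_for_timeout(1_000)
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    print(len(links), "links in the rendered DOM")
    browser.close()

If the link count grows only after scrolling, the page lazy-loads its content, which could explain why only the initially rendered contents are extracted.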

Any help would be appreciated, Thanks!
