
Unable to scrape JavaScript website #891

Open
nihaludin opened this issue Jan 13, 2025 · 9 comments

Labels
bug Something isn't working

Comments

@nihaludin
Hi all,

I’m encountering a similar issue and would appreciate your help. Initially, I received the message, "You need to enable JavaScript to run this app."

However, after enabling requires_js_support: True in my configuration, I am now encountering a timeout error.

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.example.com/) ---
Attempt 1 failed: Page.goto: Timeout 30000ms exceeded.
Call log:

navigating to "https://www.example.com/", waiting until "networkidle"
Here’s my current configuration:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "requires_js_support": True,
        "timeout": 120,
        "retry_limit": 3,
        "slow_mo": 20000
    }
}


dosubot bot commented Jan 13, 2025

Hey @nihaludin!

I'm a bot here to help you with bugs, questions, and contributions. Let's tackle this issue together!

To address the timeout error you're experiencing, consider the following adjustments:

  1. Increase Timeout: You can try increasing the timeout parameter further if the page requires more time to load. For example, set it to 180 seconds or more.

  2. Check URL: Ensure that the URL you're trying to scrape actually contains dynamic content that requires JavaScript. The example URL you provided is a placeholder and may not have any content to load.

  3. Debugging: Since you have headless set to False, you can visually inspect what the browser is doing during the scraping process. This can help identify if there are any issues with the page loading.

  4. Network Conditions: Ensure that there are no network issues that might be causing delays in loading the page.

Here's an updated configuration suggestion:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "requires_js_support": True,
        "timeout": 180,  # Increase timeout
        "retry_limit": 3,
        "slow_mo": 20000
    }
}

These steps should help mitigate the timeout issue. If the problem persists, consider testing with a different URL that has known dynamic content.
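
To isolate whether the timeout comes from Playwright's "networkidle" wait itself (the condition shown in your call log) rather than from ScrapeGraphAI, you can run a minimal standalone check. This is a diagnostic sketch, not part of the library's API: it assumes Playwright is installed (pip install playwright, then playwright install) and reuses the placeholder URL from your report.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # "domcontentloaded" fires much earlier than "networkidle"; if this
    # succeeds while "networkidle" times out, the page likely keeps
    # long-lived connections (analytics, sockets) open and never goes idle.
    page.goto(
        "https://www.example.com/",  # placeholder URL from the report
        wait_until="domcontentloaded",
        timeout=60_000,  # Playwright timeouts are in milliseconds
    )
    print(page.title())
    browser.close()

If the domcontentloaded variant loads fine, the problem is the networkidle condition rather than the page being unreachable.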


dosubot added the bug label on Jan 13, 2025
@nihaludin
Author

I increased the timeout parameter to 1000, but I'm still getting the same error. @dosu


dosubot bot commented Jan 13, 2025

If increasing the timeout to 1000 seconds didn't resolve the issue, consider the following steps:

  1. Verify JavaScript Content: Double-check that the URL you're trying to scrape actually requires JavaScript to load content. If the page is static or doesn't have dynamic content, enabling JavaScript support might not be necessary.

  2. Inspect Network Activity: Use the browser's developer tools to inspect network activity when loading the page manually. This can help identify if there are specific resources or scripts that are taking too long to load.

  3. Check for Errors: Look for any JavaScript errors in the console when loading the page manually. These errors might indicate issues that could affect the scraping process.

  4. Test with Different URLs: Try scraping a different URL that you know requires JavaScript and has dynamic content. This can help determine if the issue is specific to the current URL.

  5. Review Proxy and Network Settings: Ensure that there are no proxy or network settings that might be interfering with the page load.

  6. Adjust Slow Motion: Consider reducing the slow_mo parameter. Playwright interprets slow_mo in milliseconds, so slow_mo: 20000 pauses every browser operation for 20 seconds, which by itself can exhaust a 30-second navigation timeout. While slow_mo is useful for debugging, it slows down the entire run.

If these steps don't resolve the issue, it might be helpful to log more detailed information about the scraping attempts to diagnose the problem further.
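
For example, a standalone Playwright script can log every request and flag the ones that fail, showing which resources never settle. This is a diagnostic sketch (it assumes Playwright is installed and reuses the placeholder URL from the report), not ScrapeGraphAI functionality:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Print each request as it is issued, and flag failures; requests that
    # keep firing continuously explain why "networkidle" never triggers.
    page.on("request", lambda req: print("->", req.method, req.url))
    page.on("requestfailed", lambda req: print("FAILED:", req.url, req.failure))
    page.goto("https://www.example.com/", wait_until="load", timeout=60_000)
    page.wait_for_timeout(10_000)  # watch the traffic for 10 s after load
    browser.close()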


@nihaludin
Author

@VinciGit00 I'm not able to fetch details even after setting "requires_js_support": True. Is there any other approach to overcome this issue? Thanks!

@VinciGit00
Collaborator

Why did you navigate to https://www.example.com/? It does not require JavaScript.

@nihaludin
Author

@VinciGit00 Apologies, that was just an example, not the actual website. I'm working with an e-commerce website that requires JavaScript.

@VinciGit00
Collaborator

OK, please send the complete code.

@nihaludin
Author

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        # "requires_js_support": True,
        "timeout": 120,
        "retry_limit": 1,
        # "slow_mo": 200000,
    },
}

smart_scraper = SmartScraperGraph(
    prompt="Segment the URLs with their respective category based on product type",  # prompt for extracting categories
    source="some ecommerce website",
    config=graph_config,
)

result = smart_scraper.run()

@VinciGit00 please find the code above. Thanks!

@nihaludin
Author

I'm trying to scrape this website, @VinciGit00:
https://www.meijer.com/shopping/c/collections/wellness-on-sale.html?icid=HP:sub:7:011925::0119DealsWellness
I'm only able to extract the home page contents.
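
For reference, a minimal standalone Playwright check (a sketch only; the scroll count and the generic a[href] selector are illustrative assumptions, not taken from the actual page markup) could show whether the product links appear in the rendered DOM only after scrolling:

from playwright.sync_api import sync_playwright

URL = ("https://www.meijer.com/shopping/c/collections/"
       "wellness-on-sale.html?icid=HP:sub:7:011925::0119DealsWellness")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(URL, wait_until="domcontentloaded", timeout=90_000)
    # Scroll a few times to trigger any lazy-loaded product tiles.
    for _ in range(5):
        page.mouse.wheel(0, 2000)
        page.wait_for_timeout(1_000)
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    print(len(links), "links in the rendered DOM")
    browser.close()

If the link count grows only after scrolling, the page lazy-loads its content, which could explain why only the initially rendered contents are extracted.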

Any help would be appreciated, Thanks!
