read_html_live() practical implementation #397

Closed
rcepka opened this issue Feb 6, 2024 · 1 comment

rcepka commented Feb 6, 2024

Hello,
thank you for this excellent package and for the newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works yet, and I am trying to figure out how to fit it into my scraping workflow.
So if it isn't outside the scope of your regular user support, I would appreciate your advice on the topics below.

What I currently expect from my web scraping solution is mainly this:

  • ability to rotate user agents
  • use of a proxy to rotate IPs
  • ability to re-run the HTML request in case it fails for some reason

Below is my simplified code, the way I am doing it now:

library(httr)

scrape_page <- function(link, usr_agent, scraping_repeat, ...) {

  # Random delay between requests
  sleep_time <- runif(1, sys_sleep_time_from, sys_sleep_time_to)
  Sys.sleep(sleep_time)

  # Set initial values
  response <- NULL
  response_code <- 0L
  attempts <- 1

  #
  # Main loop
  #
  while (response_code != 200 && attempts <= scraping_repeat) {

    # Pick a fresh proxy and user agent before each GET
    proxy_number <- get_proxy_number(proxies_list = proxies_list, proxy_selection = proxy_selection)
    usr_agent <- sample(user_agents_list, 1)

    tryCatch({
      response <- GET(
        link,
        user_agent(usr_agent),
        use_proxy(
          url = proxies_list$address[proxy_number],
          port = as.numeric(proxies_list$port[proxy_number]),
          username = proxies_list$username[proxy_number],
          password = proxies_list$pass[proxy_number]
        )
      )

      response_code <- response$status_code

    },
    # Error handling
    error = function(e) {
      logger::log_error("Fun scrape_page: the page could not be scraped, link: {link}")
    }
    )

    # Back off and retry if the request did not succeed
    if (response_code != 200) {
      attempts <- attempts + 1
      wait_time <- scraping_repeat_wait_time * attempts
      Sys.sleep(wait_time)
    }

  #
  # End of main loop
  #
  }

  return(response)

}
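
For the plain-HTTP retry part on its own, httr also has a built-in helper that does roughly what the loop above does. A minimal sketch, assuming the same proxies_list, user_agents_list, proxy_number and scraping_repeat objects as in the function above:

# Roughly the same retry idea using httr's built-in helper: RETRY() repeats
# the request with increasing pauses until it succeeds or `times` runs out.
# Note it keeps the same proxy and user agent across all attempts.
response <- RETRY(
  "GET",
  url = link,
  user_agent(sample(user_agents_list, 1)),
  use_proxy(
    url = proxies_list$address[proxy_number],
    port = as.numeric(proxies_list$port[proxy_number]),
    username = proxies_list$username[proxy_number],
    password = proxies_list$pass[proxy_number]
  ),
  times = scraping_repeat,
  pause_base = 2
)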

My questions:

  • how can I implement read_html_live() with the proxy and user agent features?
  • how do I interact with a site using read_html_live() plus $click(), $scroll_to(), etc.? Sorry, I am a newbie here... (see the sketch below)
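
For reference, a minimal sketch of the interaction side (assuming rvest >= 1.0.4 with chromote installed; the URL and CSS selectors are placeholders, and method signatures may differ slightly across rvest versions):

library(rvest)

# read_html_live() drives a real (headless) Chrome session via chromote,
# so JavaScript runs before you scrape. Placeholder URL and selectors below.
page <- read_html_live("https://example.com")

# Interact with the live page
page$click("button.load-more")   # click an element identified by a CSS selector
page$scroll_to(top = 2000)       # scroll the viewport down the page
Sys.sleep(1)                     # give the page a moment to render new content

# Then scrape the live page with the usual rvest verbs
page |>
  html_elements(".result") |>
  html_text2()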

Many thanks in advance for any advice, hints or opinions...


hadley commented Feb 12, 2024

  1. Ability to change user agents is tracked in #388 (Some way to customise user agent for read_html_live()).
  2. It looks like using a proxy requires setting some command line flags. That's going to require quite a lot of plumbing, so it is unlikely to be something I tackle until a few people have requested it. (See the sketch below for one possible workaround.)
  3. I'm currently not sure how we'll expose browser errors to R.
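
For readers who need this before it lands in rvest itself, here is a rough, untested sketch of one possible workaround. read_html_live() drives Chrome through the chromote package, so Chromium command-line flags can be supplied before the first live session starts. This assumes your chromote version exports set_chrome_args() and default_chrome_args(); the proxy address and user-agent string are placeholders:

library(chromote)
library(rvest)

# Chromium flags apply to the whole browser process, so set them before the
# first read_html_live() call. Placeholder proxy and user-agent values below.
set_chrome_args(c(
  default_chrome_args(),
  "--proxy-server=http://my.proxy.example:8080",
  "--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
))

page <- read_html_live("https://example.com")

Note that --proxy-server alone does not handle authenticated proxies, and because the flags are per browser process, rotating proxies or user agents per request would mean restarting the browser each time; that is part of the extra plumbing referred to above.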

hadley closed this as completed Feb 27, 2024