read_html_live() practical implementation #397

Closed
rcepka opened this issue Feb 6, 2024 · 1 comment

rcepka commented Feb 6, 2024

Hello,
thank you for this excellent package and for the newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works yet, and I am trying to figure out how to fit it into my scraping workflow.
So if it isn't outside the scope of your regular user support, I would appreciate your advice on the topics below.

What I currently expect from my web scraping solution is mainly this:

  • ability to rotate user agents
  • use of a proxy to rotate IPs
  • ability to re-run the HTML request in case it fails for some reason

Below is my simplified code, the way I am doing it now:

library(httr)

scrape_page <- function(link, usr_agent, scraping_repeat, ...) {

  # Random delay between requests
  sleep_time <- runif(1, sys_sleep_time_from, sys_sleep_time_to)
  Sys.sleep(sleep_time)

  # Set initial values
  response <- NULL
  response_code <- 0L
  attempts <- 1

  #
  # Main loop
  #
  while (response_code != 200 && attempts <= scraping_repeat) {

    # Pick a fresh proxy and user agent before each GET
    proxy_number <- get_proxy_number(proxies_list = proxies_list, proxy_selection = proxy_selection)
    usr_agent <- sample(user_agents_list, 1)

    tryCatch({
      response <- GET(
        link,
        user_agent(usr_agent),
        use_proxy(
          url = proxies_list$address[proxy_number],
          port = as.numeric(proxies_list$port[proxy_number]),
          username = proxies_list$username[proxy_number],
          password = proxies_list$pass[proxy_number]
        )
      )

      response_code <- response$status_code

    },
    # Error handling
    error = function(e) {
      logger::log_error("Fun scrape_page: the page could not be scraped, link: {link}")
    }
    )

    # Back off and retry if the request did not succeed
    if (response_code != 200) {
      attempts <- attempts + 1
      wait_time <- scraping_repeat_wait_time * attempts
      Sys.sleep(wait_time)
    }

  #
  # End of main loop
  #
  }

  return(response)

}
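
For the plain-HTTP retry part on its own, httr also has a built-in helper that does roughly what the loop above does. A minimal sketch, assuming the same proxies_list, user_agents_list, proxy_number and scraping_repeat objects as in the function above:

# Roughly the same retry idea using httr's built-in helper: RETRY() repeats
# the request with increasing pauses until it succeeds or `times` runs out.
# Note it keeps the same proxy and user agent across all attempts.
response <- RETRY(
  "GET",
  url = link,
  user_agent(sample(user_agents_list, 1)),
  use_proxy(
    url = proxies_list$address[proxy_number],
    port = as.numeric(proxies_list$port[proxy_number]),
    username = proxies_list$username[proxy_number],
    password = proxies_list$pass[proxy_number]
  ),
  times = scraping_repeat,
  pause_base = 2
)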

My questions:

  • how can I implement read_html_live() with the proxy and user agent features?
  • how do I interact with a site using read_html_live() plus $click(), $scroll_to(), etc.? Sorry, I am a newbie here... (see the sketch below)
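
For reference, a minimal sketch of the interaction side (assuming rvest >= 1.0.4 with chromote installed; the URL and CSS selectors are placeholders, and method signatures may differ slightly across rvest versions):

library(rvest)

# read_html_live() drives a real (headless) Chrome session via chromote,
# so JavaScript runs before you scrape. Placeholder URL and selectors below.
page <- read_html_live("https://example.com")

# Interact with the live page
page$click("button.load-more")   # click an element identified by a CSS selector
page$scroll_to(top = 2000)       # scroll the viewport down the page
Sys.sleep(1)                     # give the page a moment to render new content

# Then scrape the live page with the usual rvest verbs
page |>
  html_elements(".result") |>
  html_text2()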

Many thanks in advance for any advice, hints or opinions...


hadley commented Feb 12, 2024

  1. Ability to change user agents is tracked in #388 (Some way to customise user agent for read_html_live()).
  2. It looks like using a proxy requires setting some command line flags. That's going to require quite a lot of plumbing, so it is unlikely to be something I tackle until a few people have requested it. (See the sketch below for one possible workaround.)
  3. I'm currently not sure how we'll expose browser errors to R.
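
For readers who need this before it lands in rvest itself, here is a rough, untested sketch of one possible workaround. read_html_live() drives Chrome through the chromote package, so Chromium command-line flags can be supplied before the first live session starts. This assumes your chromote version exports set_chrome_args() and default_chrome_args(); the proxy address and user-agent string are placeholders:

library(chromote)
library(rvest)

# Chromium flags apply to the whole browser process, so set them before the
# first read_html_live() call. Placeholder proxy and user-agent values below.
set_chrome_args(c(
  default_chrome_args(),
  "--proxy-server=http://my.proxy.example:8080",
  "--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
))

page <- read_html_live("https://example.com")

Note that --proxy-server alone does not handle authenticated proxies, and because the flags are per browser process, rotating proxies or user agents per request would mean restarting the browser each time; that is part of the extra plumbing referred to above.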

hadley closed this as completed Feb 27, 2024