Hello,
thank you for this excellent package and for the newest addition, read_html_live(). This was a much-needed feature for scraping JavaScript-based websites. I don't fully understand how this new function works and I am trying to figure out how to fit it into my scraping workflow.
So if it isn't outside the scope of your regular user support, I would appreciate your advice on these topics.
What I currently expect from my web scraping solution is mainly this:
ability to rotate user agents
implementation of a proxy to rotate IPs
ability to re-run the HTML request in case it fails for some reason
Below is my simplified code, the way I am doing it now:
library(httr)

scrape_page <- function(link, usr_agent, scraping_repeat, ...) {
  sleep_time <<- runif(1, sys_sleep_time_from, sys_sleep_time_to)
  Sys.sleep(sleep_time)

  # Set initial values
  response <- NULL
  response_code <<- 0
  attempts <<- 1

  ## Main loop
  while (response_code != 200 && attempts <= scraping_repeat) {
    # Call this before each "GET": rotate proxy and user agent
    proxy_number <<- get_proxy_number(proxies_list = proxies_list, proxy_selection = proxy_selection)
    usr_agent <<- sample(user_agents_list, 1)

    tryCatch({
      response <- GET(
        link,
        user_agent(usr_agent),
        use_proxy(
          url = proxies_list$address[proxy_number],
          port = as.numeric(proxies_list$port[proxy_number]),
          username = proxies_list$username[proxy_number],
          password = proxies_list$pass[proxy_number]
        )
      )
      response_code <<- response$status_code
    },
    # Error handling
    error = function(e) {
      logger::log_error("Fun scrape_page: The page could not be scraped, link: {link}")
    })

    # Repeat scraping if needed, waiting a little longer on each attempt
    if (response_code != 200) {
      attempts <<- attempts + 1
      wait_time <- scraping_repeat_wait_time * attempts
      Sys.sleep(wait_time)
    }
  }
  ## End of main loop

  return(response)
}
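For completeness, here is roughly how I call the helper. The values, the proxy table and the get_proxy_number() stand-in below are made up for illustration; my real script defines them elsewhere:

# Example settings used by scrape_page()
sys_sleep_time_from       <- 1
sys_sleep_time_to         <- 3
scraping_repeat_wait_time <- 5

user_agents_list <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

proxies_list <- data.frame(
  address  = c("proxy1.example.com", "proxy2.example.com"),
  port     = c("8080", "8080"),
  username = c("user", "user"),
  pass     = c("secret", "secret")
)

# Stand-in for my real helper: it just picks one row of proxies_list
proxy_selection  <- "random"
get_proxy_number <- function(proxies_list, proxy_selection) {
  sample(nrow(proxies_list), 1)
}

response <- scrape_page(
  link            = "https://example.com",
  usr_agent       = NULL,  # overwritten inside the function by sample()
  scraping_repeat = 3
)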
My questions:
how can I implement read_html_live() with the proxy and user agent features?
how do I interact with a site using read_html_live() plus $click(), $scroll_to(), etc.? Sorry, I am a newbie here... (see the sketch below)
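From the documentation I gather that the object returned by read_html_live() has methods such as $click(), $scroll_to() and $type(), and that the usual html_elements()/html_text2() verbs work on it directly. A minimal sketch of what I have in mind (the URL and the CSS selectors are made up):

library(rvest)

html <- read_html_live("https://example.com/products")

# Dismiss the cookie banner, if there is one (made-up selector)
html$click("#accept-cookies")

# Scroll down so lazily-loaded items get rendered
html$scroll_to(top = 3000)
Sys.sleep(2)  # give the page a moment to load the new content

# The usual rvest verbs work on the live document
html |>
  html_elements(".product-title") |>
  html_text2()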
Many thanks in advance for any advice, hints or opinions...
It looks like using a proxy requires setting some command line flags. That's going to require quite a lot of plumbing, so is unlikely to be something I tackle until a few people have requested it.
I'm currently not sure how we'll expose browser errors to R.
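For anyone who wants to experiment in the meantime, here is a rough, untested sketch of setting those flags by hand through chromote (the package read_html_live() is built on). It assumes read_html_live() picks up chromote's default Chromote object; the proxy address and user-agent string are placeholders, and --proxy-server does not accept a username/password, so authenticated proxies would need additional handling:

library(chromote)
library(rvest)

# Launch Chrome with proxy and user-agent command-line flags. Passing args
# here replaces chromote's default Chrome arguments, hence the explicit
# --headless. All values below are placeholders.
proxied_chrome <- Chrome$new(
  args = c(
    "--headless",
    "--proxy-server=http://proxy.example.com:8080",
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  )
)

# Make this browser the default that chromote sessions (and therefore
# read_html_live()) are created from.
set_default_chromote_object(Chromote$new(browser = proxied_chrome))

html <- read_html_live("https://example.com")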