read_html_live needs some time after returning its result to allow html_elements to work properly #428

Open
Feat-FeAR opened this issue Oct 24, 2024 · 0 comments

Thank you first of all for the development of this useful package.
Today I ran into strange behavior with the read_html_live() function. If I run my script line by line from RStudio, slowly, I can then use html_elements() to retrieve the elements from the HTML page correctly. But if I source the script (or even if I run all the lines individually, but quickly!), html_elements() just returns NAs, as if the contents of the object returned by read_html_live() were not yet available, even though the variable is already stored in the global environment.

Here is my minimal reproducible example, in which I retrieve the best percentile of 'F1000Research' from the Scopus website. (I need to scrape it because this information is not provided by the API.)

This just returns NAs:

journal_url <- "https://www.scopus.com/sourceid/21100258853"
page <- read_html_live(journal_url)
page |> html_elements("td:nth-child(1) div") |> html_text() -> category
best_category <- category[2]
page |> html_elements("td:nth-child(3) div div") |> html_text() -> percent
best_percentile <- percent[3]
cat("Category:", best_category, "\nPercentile:", best_percentile)

However, this works (even when sourcing the entire script):

journal_url <- "https://www.scopus.com/sourceid/21100258853"
page <- read_html_live(journal_url)

Sys.sleep(1) # <----- just give the page some time to finish loading

page |> html_elements("td:nth-child(1) div") |> html_text() -> category
best_category <- category[2]
page |> html_elements("td:nth-child(3) div div") |> html_text() -> percent
best_percentile <- percent[3]
cat("Category:", best_category, "\nPercentile:", best_percentile)
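
A fixed one-second pause may be too short on a slow connection and longer than needed on a fast one. A possible alternative, sketched below, is to retry the query until it returns something or a timeout expires. Note that `poll_until()` is a hypothetical helper written for this illustration, not part of rvest, and the assumption that "empty or all-NA" means "page not ready yet" is mine. The getter used here is a stand-in so the snippet runs without a browser.

```r
# Hypothetical helper (NOT part of rvest): call `get()` repeatedly until it
# returns a non-empty, not-all-NA result, or stop after `timeout` seconds.
poll_until <- function(get, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- get()
    if (length(result) > 0 && !all(is.na(result))) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for a non-empty result")
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a live-page query: empty on the first two calls, then filled.
attempts <- 0
fake_get <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) character(0) else c("Category A", "Category B")
}

result <- poll_until(fake_get, timeout = 5, interval = 0.1)
```

With the real page, the getter would presumably be something like `function() page |> html_elements("td:nth-child(1) div") |> html_text()`, replacing the Sys.sleep(1) call. This only hides the symptom; it does not explain why read_html_live() returns before the content is ready.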

¯(°_o)/¯

My sessionInfo:

> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default