Web Scraping in R -- IRE2024

The basics of scraping web pages in R using rvest

Requirements for the class

  • R and RStudio installed
  • tidyverse and rvest installed: install.packages(c("tidyverse","rvest"))
  • A browser with developer tools (such as Chrome's Inspect tool)

Basics of HTML structure

Get to know the structure of an HTML element - https://developer.mozilla.org/en-US/docs/Glossary/Element

  • tags ex: <p> opens and </p> closes
  • attributes ex: id="shazam" inside the tag <p id="shazam">
  • text ex: <p>The text between opening and closing tags</p>
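The three parts listed above can each be pulled out of a parsed element with rvest. This is a minimal sketch, assuming rvest 1.0 or later (for minimal_html()) and R 4.1 or later (for the |> pipe); the one-line page is built from a string for illustration rather than read from a real site.

```r
library(rvest)

# A one-element page built from a string; the <p id="shazam"> element
# mirrors the examples in the list above.
page <- minimal_html('<p id="shazam">The text between opening and closing tags</p>')

p <- page |> html_element("p")

tag_name <- html_name(p)        # the tag name
id_value <- html_attr(p, "id")  # the value of the id attribute
p_text   <- html_text2(p)       # the text between the opening and closing tags
```

Printing tag_name, id_value, and p_text in the console shows the tag, the attribute value, and the text, in that order.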

An HTML table is built with the <table> tag. Inside it, <tr> marks a table row, <th> marks a header cell, and <td> marks a data cell.
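To make that structure concrete, here is a small sketch that builds a table from a string and converts it to a data frame with rvest's html_table(). The company names and numbers are invented for illustration and are not taken from any of the sites in this class.

```r
library(rvest)

# Hypothetical table: one header row (<th> cells) and two data rows (<td> cells)
page <- minimal_html('
  <table>
    <tr><th>Company</th><th>Employees</th></tr>
    <tr><td>Acme</td><td>120</td></tr>
    <tr><td>Globex</td><td>45</td></tr>
  </table>
')

# Select the <table> element, then turn it into a tibble.
# The <th> cells in the first row are used as column names.
notices <- page |>
  html_element("table") |>
  html_table()

notices
```

Each <tr> after the header becomes one row of the result, and each <td> becomes one cell.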

Basic usage of functions in rvest

Step 1: read the html from a webpage into the RStudio environment using the read_html() function:

ex. html <- read_html("url")

Step 2: pull elements from that html using html_element() (returns the first match) or html_elements() (returns all matches):

ex. everything_inside_a_table_tag <- html |> html_element("table")
ex. everything_inside_a_p_tag <- html |> html_element("p")
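The difference between the two functions is easiest to see on a page with more than one matching tag. The sketch below uses a made-up two-paragraph page; the variable names are only illustrative.

```r
library(rvest)

# Made-up page with two <p> tags
page <- minimal_html('
  <p>First paragraph</p>
  <p>Second paragraph</p>
')

first_p <- page |> html_element("p")   # a single node: the first <p> only
all_p   <- page |> html_elements("p")  # a node set containing every <p>

length(all_p)  # number of <p> tags found
```

In practice, html_element() suits things that appear once on a page, and html_elements() suits repeated items such as rows or list entries.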

Step 3: pull the text or contents from an html element using html_text2():

ex. everything_inside_a_p_tag |> html_text2()
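Putting the three steps together might look like the sketch below. A literal HTML string stands in for the URL so the example runs without a network connection; with a real page, the string would be replaced by the page's address.

```r
library(rvest)

# Step 1: read the html (a literal string here, in place of a URL)
html <- read_html(
  "<html><body><p>First paragraph</p><p>Second paragraph</p></body></html>"
)

# Step 2: pull every <p> element
paragraphs <- html |> html_elements("p")

# Step 3: pull the text out of each element;
# the result is a character vector with one entry per <p>
paragraph_text <- paragraphs |> html_text2()

paragraph_text
```

Because html_text2() works on a whole set of elements at once, no loop is needed to get the text of each one.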

Websites we'll scrape in this class (we'll see how far we can get)

1. https://www.dllr.state.md.us/employment/warn.shtml

2. https://dlr.sd.gov/workforce_services/businesses/warn_notices.aspx

3. https://www.billboard.com/charts/hot-100/

You'll find the finished scripts in the finished_scripts folder.

Resources for help