---
title: "Getting data from the web: scraping"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
# cm012 - November 2, 2016
## Overview
* Identify methods for writing functions to interact with APIs
* Define the JSON and XML data structures and explain how to convert them to data frames
* Define CSS selectors and practice writing selectors to extract information from web pages
* Write a script to extract data on wineries in Ithaca, NY using `rvest` and the Selector Gadget
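
As a rough illustration of the CSS-selector objectives above, here is a minimal sketch of selecting nodes with `rvest` and collecting them into a data frame. The HTML snippet, the class names (`winery`, `name`, `rating`), and the winery entries are hypothetical placeholders, not taken from any real page; on a live site you would find suitable selectors with the Selector Gadget. The sketch uses `html_nodes()`, which newer `rvest` releases also offer under the name `html_elements()`.

```r
library(rvest)

# Hypothetical HTML standing in for a scraped page (made-up names and ratings)
page <- read_html('
  <ul>
    <li class="winery"><span class="name">Example Winery A</span> <span class="rating">4.5</span></li>
    <li class="winery"><span class="name">Example Winery B</span> <span class="rating">3.5</span></li>
  </ul>')

# ".winery .name" matches elements with class "name" nested inside class "winery"
winery_names <- html_text(html_nodes(page, ".winery .name"), trim = TRUE)

# Ratings are extracted as text, so convert them to numbers
winery_ratings <- as.numeric(html_text(html_nodes(page, ".winery .rating"), trim = TRUE))

# Combine the extracted vectors into a data frame, one row per winery
wineries <- data.frame(
  name = winery_names,
  rating = winery_ratings,
  stringsAsFactors = FALSE
)

print(wineries)
```

To scrape a real page, pass its URL to `read_html()` in place of the literal HTML string and swap in the selectors you identified for that site.
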
## Slides and links
* [Notes from class](webdata02.html)
* [Slides](extras/cm012_slides.html)
* [Documentation for `httr`](https://cran.r-project.org/web/packages/httr/)
* `rvest`
    * Load the library (`library(rvest)`)
    * `demo("tripadvisor")` - scraping a TripAdvisor page
    * `demo("united")` - how to scrape a web page that requires a login
    * [Scraping IMDb](https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/)
* Python stuff
    * [Accessing APIs](webdata_api_py.html)
    * [Scraping the web with `BeautifulSoup`](webdata_scrape_py.html)
## To do for Monday
* [Start homework 6](hw06-webdata.html)
* Read chapters 1-4 of [*Tidy Text Mining with R*](http://tidytextmining.com/)
* [Wojcik, S. P., Hovasapian, A., Graham, J., Motyl, M., & Ditto, P. H. (2015). Conservatives report, but liberals display, greater happiness. *Science*, 347(6227), 1243-1246.](http://science.sciencemag.org.proxy.uchicago.edu/content/347/6227/1243.full)