cm012.Rmd

---
title: "Getting data from the web: scraping"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```

# cm012 - November 2, 2016

## Overview

* Identify methods for writing functions to interact with APIs
* Define JSON and XML data structure and how to convert them to data frames
* Define CSS selectors and practice writing selectors to extract information from web pages
* Write a script to extract data on wineries in Ithaca, NY using `rvest` and the Selector Gadget

## Slides and links

* [Notes from class](webdata02.html)
* [Slides](extras/cm012_slides.html)

* [Documentation for `httr`](https://cran.r-project.org/web/packages/httr/)
* `rvest`
    * Load the library (`library(rvest)`)
    * `demo("tripadvisor")` - scraping a Trip Advisor page
    * `demo("united")` - how to scrape a web page which requires a login
    * [Scraping IMDB](https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/)
* Python stuff
    * [Accessing APIs](webdata_api_py.html)
    * [Scraping the web with `BeautifulSoup`](webdata_scrape_py.html)

## To do for Monday

* [Start homework 6](hw06-webdata.html)
* Chapters 1, 2, 3, and 4 in [*Tidy Text Mining with R*](http://tidytextmining.com/)
* [Wojcik, S. P., Hovasapian, A., Graham, J., Motyl, M., & Ditto, P. H. (2015). Conservatives report, but liberals display, greater happiness. *Science*, 347(6227), 1243-1246.](http://science.sciencemag.org.proxy.uchicago.edu/content/347/6227/1243.full)