forked from info201/book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
apis.Rmd
336 lines (237 loc) · 23.4 KB
/
apis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
# Accessing Web APIs {#apis}
R is able to load data from external packages or read it from locally-saved `.csv` files, but it is also able to download data directly from web sites on the internet. This allows scripts to always work with the latest data available, performing analysis on data that may be changing rapidly (such as from social networks or other live events). Web services may make their data easily accessible to computer programs like R scripts by offering an **Application Programming Interface (API)**. A web service's API specifies _where_ and _how_ particular data may be accessed, and many web services follow a particular style known as _Representational State Transfer (REST)_. This chapter will cover how to access and work with data from these _RESTful APIs_.
## What is a Web API?
An **interface** is the point at which two different systems meet and _communicate_: exchanging informations and instructions. An **Application Programming Interface (API)** thus represents a way of communicating with a computer application by writing a computer program (a set of formal instructions understandable by a machine). APIs commonly take the form of **functions** that can be called to give instructions to programs—the set of functions provided by a library like `dplyr` make up the API for that library.
While most APIs provide an interface for utilizing _functionality_, other APIs provide an interface for accessing _data_. One of the most common sources of these data apis are **web services**: websites that offer an interface for accessing their data.
<!-- (Technically the interface is just the function **signature** which says how you _use_ the function: what name it has, what arguments it takes, and what value it returns. The actual library is an _implementation_ of this interface). -->
With web services, the interface (the set of "functions" you can call to access the data) takes the form of **HTTP Requests**—a _request_ for data sent following the _**H**yper**T**ext **T**ransfer **P**rotocol_. This is the same protocol (way of communicating) used by your browser to view a web page! An HTTP Request represents a message that your computer sends to a web server (another computer on the internet which "serves", or provides, information). That server, upon receiving the request, will determine what data to include in the **response** it sends _back_ to the requesting computer. With a web browser, the response data takes the form of HTML files that the browser can _render_ as web pages. With data APIs, the response data will be structured data that you can convert into R structures such as lists or data frames.
In short, loading data from a Web API involves sending an **HTTP Request** to a server for a particular piece of data, and then receiving and parsing the **response** to that request.
## RESTful Requests
There are two parts to a request sent to an API: the name of the **resource** (data) that you wish to access, and a **verb** indicating what you want to do with that resource. In many ways, the _verb_ is the function you want to call on the API, and the _resource_ is an argument to that function.
### URIs
Which **resource** you want to access is specified with a **Uniform Resource Identifier (URI)**. A URI is a generalization of a URL (Uniform Resource Locator)—what you commonly think of as "web addresses". URIs act a lot like the _address_ on a postal letter sent within a large organization such as a university: you indicate the business address as well as the department and the person, and will get a different response (and different data) from Alice in Accounting than from Sally in Sales.
- Note that the URI is the **identifier** (think: variable name) for the resource, while the **resource** is the actual _data_ value that you want to access.
Like postal letter addresses, URIs have a very specific format used to direct the request to the right resource.
![The format (schema) of a URI.](img/apis/uri-schema.png "URI schema")
Not all parts of the format are required—for example, you don't need a `port`, `query`, or `fragment`. Important parts of the format include:
- `scheme` (`protocol`): the "language" that the computer will use to communicate the request to this resource. With web services this is normally `https` (**s**ecure HTTP)
- `domain`: the address of the web server to request information from
- `path`: which resource on that web server you wish to access. This may be the name of a file with an extension if you're trying to access a particular file, but with web services it often just looks like a folder path!
- `query`: extra **parameters** (arguments) about what resource to access.
The `domain` and `path` usually specify the resource. For example, `www.domain.com/users` might be an _identifier_ for a resource which is a list of users. Note that web services can also have "subresources" by adding extra pieces to the path: `www.domain.com/users/mike` might refer to the specific "mike" user in that list.
With an API, the domain and path are often viewed as being broken up into two parts:
- The **Base URI** is the domain and part of the path that is included on _all_ resources. It acts as the "root" for any particular resource. For example, the [GitHub API](https://developer.github.com/v3/) has a base URI of `https://api.github.com/`.
- An **Endpoint**, or which resource on that domain you want to access. Each API will have _many_ different endpoints.
For example, GitHub includes endpoints such as:
- `/users/{user}` to get information about a specific `:user:` `id` (the `{}` indicate a "variable", in that you can put any username in there in place of the string `"{user}"`). Check out this [example](https://api.github.com/users/mkfreeman) in your browser.
- `/orgs/{organization}/repos` to get the repositories that are on an organization page. See an [example](https://api.github.com/orgs/info201a-au17/repos).
Thus you can equivalently talk about accessing a particular **resource** and sending a request to a particular **endpoint**. The **endpoint** is appended to the end of the **Base URI**, so you could access a GitHub user by combining the **Base URI** (`https://api.github.com`) and **endpoint** (`/users/mkfreeman`) into a single string: `https://api.github.com/users/mkfreeman`. That URL will return a data structure of new releases, which you can request from your R program or simply view in your web-browser.
#### Query Parameters
Often in order to access only partial sets of data from a resource (e.g., to only get some users) you also include a set of **query parameters**. These are like extra arguments that are given to the request function. Query parameters are listed after a question mark **`?`** in the URI, and are formed as key-value pairs similar to how you named items in _lists_. The **key** (_parameter name_) is listed first, followed by an equal sign **`=`**, followed by the **value** (_parameter value_); note that you can't include any spaces in URIs! You can include multiple query parameters by putting an ampersand **`&`** between each key-value pair:
```
?firstParam=firstValue&secondParam=secondValue&thirdParam=thirdValue
```
Exactly what parameter names you need to include (and what are legal values to assign to that name) depends on the particular web service. Common examples include having parameters named `q` or `query` for searching, with a value being whatever term you want to search for: in [`https://www.google.com/search?q=informatics`](https://www.google.com/search?q=informatics), the **resource** at the `/search` **endpoint** takes a query parameter `q` with the term you want to search for!
#### Access Tokens and API Keys
Many web services require you to register with them in order to send them requests. This allows them to limit access to the data, as well as to keep track of who is asking for what data (usually so that if someone starts "spamming" the service, they can be blocked).
To facilitate this tracking, many services provide **Access Tokens** (also called **API Keys**). These are unique strings of letters and numbers that identify a particular developer (like a secret password that only works for you). Web services will require you to include your _access token_ as a query parameter in the request; the exact name of the parameter varies, but it often looks like `access_token` or `api_key`. When exploring a web service, keep an eye out for whether they require such tokens.
_Access tokens_ act a lot like passwords; you will want to keep them secret and not share them with others. This means that you **should not include them in your committed files**, so that the passwords don't get pushed to GitHub and shared with the world. The best way to get around this in R is to create a separate script file in your repo (e.g., `api-keys.R`) which includes exactly one line: assigning the key to a variable:
```r
# in `api-keys.R`
api_key <- "123456789abcdefg"
```
You can then include this file_name_ in a **`.gitignore`** file in your repo; that will keep it from even possibly being committed with your code!
In order to access this variable in your "main" script, you can use the `source()` function to load and run your `api-keys.R` script. This will execute the line of code that assigns the `api_key` variable, making it available in your environment for your use:
```r
# in `my-script.R`
# (make sure working directory is set)
source('api-keys.R') # load the script
print(api_key) # key is now available!
```
Anyone else who runs the script will simply need to provide an `api_key` variable to access the API using their key, keeping everyone's account separate!
<p class="alert alert-warning">Watch out for APIs that mention using [OAuth](https://en.wikipedia.org/wiki/OAuth) when explaining API keys. OAuth is a system for performing **authentification**—that is, letting someone log into a website from your application (like what a "Log in with Facebook" button does). OAuth systems require more than one access key, and these keys ___must___ be kept secret and usually require you to run a web server to utilize them correctly (which requires lots of extra setup, see [the full `httr` docs](https://cran.r-project.org/web/packages/httr/httr.pdf) for details). So for this course, we encourage you to avoid anything that needs OAuth</p>
### HTTP Verbs
When you send a request to a particular resource, you need to indicate what you want to _do_ with that resource. When you load web content, you are typically sending a request to retrieve information (logically, this is a `GET` request). However, there are other actions you can perform to modify the data structure on the server. This is done by specifying an **HTTP Verb** in the request. The HTTP protocol supports the following verbs:
- `GET` Return a representation of the current state of the resource
- `POST` Add a new subresource (e.g., insert a record)
- `PUT` Update the resource to have a new state
- `PATCH` Update a portion of the resource's state
- `DELETE` Remove the resource
- `OPTIONS` Return the set of methods that can be performed on the resource
By far the most common verb is `GET`, which is used to "get" (download) data from a web service. Depending on how you connect to your API (i.e., which programming language you are using), you'll specify the verb of interest to indicate what we want to do to a particular resource.
Overall, this structure of treating each datum on the web as a **resource** which we can interact with via **HTTP Requests** is referred to as the **REST Architecture** (REST stands for _REpresentational State Transfer_). This is a standard way of structuring computer applications that allows them to be interacted with in the same way as everyday websites. Thus a web service that enabled data access through named resources and responds to HTTP requests is known as a **RESTful** service, with a _RESTful API_.
## Accessing Web APIs
To access a Web API, you just need to send an HTTP Request to a particular URI. You can easily do this with the browser: simply navigate to a particular address (base URI + endpoint), and that will cause the browser to send a `GET` request and display the resulting data. For example, you can send a request to search GitHub for repositories named `d3` by visiting:
```
https://api.github.com/search/repositories?q=d3&sort=forks
```
This query accesses the `/search/repositories/` endpoint, and also specifies 2 query parameters:
- `q`: The term(s) you are searching for, and
- `sort`: The attribute of each repository that you would like to use to sort the results
(Note that the data you'll get back is structured in JSON format. See [below](#json) for details).
In `R` you can send GET requests using the [`httr`](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html) library. Like `dplyr`, you will need to install and load it to use it:
```r
install.packages("httr") # once per machine
library("httr")
```
This library provides a number of functions that reflect HTTP verbs. For example, the **`GET()`** function will send an HTTP GET Request to the URI specified as an argument:
```r
# Get search results for artist
response <- GET("https://api.github.com/search/repositories?q=d3&sort=forks")
```
While it is possible to include _query parameters_ in the URI string, `httr` also allows you to include them as a _list_, making it easy to set and change variables (instead of needing to do a complex `paste0()` operation to produce the correct string):
```r
# Equivalent to the above, but easier to read and change
query_params <- list(q = "d3", sort = "forks")
response <- GET("https://api.github.com", query = query_params)
```
If you try printing out the `response` variable, you'll see information about the response:
```
Response [https://api.github.com/?q=d3&sort=forks]
Date: 2017-10-22 19:05
Status: 200
Content-Type: application/json; charset=utf-8
Size: 2.16 kB
```
This is called the **response header**. Each **response** has two parts: the **header**, and the **body**. You can think of the response as a envelope: the _header_ contains meta-data like the address and postage date, while the _body_ contains the actual contents of the letter (the data).
Since you're almost always interested in working with the _body_, you will need to extract that data from the response (e.g., open up the envelope and pull out the letter). You can do this with the `content()` method:
```r
# extract content from response, as a text string (not a list!)
body <- content(response, "text")
```
Note the second argument `"text"`; this is needed to keep `httr` from doing it's own processing on the body data, since we'll be using other methods to handle that; keep reading for details!
<p class="alert alert-info">_Pro-tip_: The URI shown when you print out the `response` is a good way to check exactly what URI you were sending the request to: copy that into your browser to make sure it goes where you expected!</p>
## JSON Data {#json}
Most APIs will return data in **JavaScript Object Notation (JSON)** format. Like CSV, this is a format for writing down structured data—but while `.csv` files organize data into rows and columns (like a data frame), JSON allows you to organize elements into **key-value pairs** similar to an R _list_! This allows the data to have much more complex structure, which is useful for web services (but can be challenging for us).
In JSON, lists of key-value pairs (called _objects_) are put inside braces (**`{ }`**), with the key and value separated by a colon (**`:`**) and each pair separated by a comma (**`,`**). Key-value pairs are often written on separate lines for readability, but this isn't required. Note that keys need to be character strings (so in quotes), while values can either be character strings, numbers, booleans (written in lower-case as `true` and `false`), or even other lists! For example:
```json
{
"first_name": "Ada",
"job": "Programmer",
"salary": 78000,
"in_union": true,
"favorites": {
"music": "jazz",
"food": "pizza",
}
}
```
(In JavaScript the period `.` has special meaning, so it is not used in key names; hence the underscores `_`). The above is equivalent to the `R` list:
```r
list(
first_name = "Ada", job = "Programmer", salary = 78000, in_union = TRUE,
favorites = list(music = "jazz", food = "pizza") # nested list in the list!
)
```
Additionally, JSON supports what are called _arrays_ of data. These are like lists without keys (and so are only accessed by index). Key-less arrays are written in square brackets (**`[ ]`**), with values separated by commas. For example:
```json
["Aardvark", "Baboon", "Camel"]
```
which is equvalent to the `R` list:
```r
list("Aardvark", "Baboon", "Camel")
```
(Like _objects_ , array elements may or may not be written on separate lines).
Just as R allows you to have nested lists of lists, and those lists may or may not have keys, JSON can have any form of nested _objects_ and _arrays_. This can be arrays (unkeyed lists) within objects (keyed lists), such as a more complex set of data about Ada:
```json
{
"first_name": "Ada",
"job": "Programmer",
"pets": ["rover", "fluffy", "mittens"],
"favorites": {
"music": "jazz",
"food": "pizza",
"numbers": [12, 42]
}
}
```
JSON can also be structured as _arrays_ of _objects_ (unkeyed lists of keyed lists), such as a list of data about Seahawks games:
```json
[
{ "opponent": "Dolphins", "sea_score": 12, "opp_score": 10 },
{ "opponent": "Rams", "sea_score": 3, "opp_score": 9 },
{ "opponent": "49ers", "sea_score": 37, "opp_score": 18 },
{ "opponent": "Jets", "sea_score": 27, "opp_score": 17 },
{ "opponent": "Falcons", "sea_score": 26, "opp_score": 24 }
]
```
The latter format is incredibly common in web API data: as long as each _object_ in the _array_ has the same set of keys, then you can easily consider this as a data table where each _object_ (keyed list) represents an **observation** (row), and each key represents a **feature** (column) of that observation.
### Parsing JSON
When working with a web API, the usual goal is to take the JSON data contained in the _response_ and convert it into an `R` data structure you can use, such as _list_ or _data frame_. While the `httr` package is able to parse the JSON body of a response into a _list_, it doesn't do a very clean job of it (particularly for complex data structures).
A more effective solution is to use _another_ library called [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf). This library provides helpful methods to convert JSON data into R data, and does a much more effective job of converting content into data frames that you can use.
As always, you will need to install and load this library:
```r
install.packages("jsonlite") # once per machine
library("jsonlite")
```
`jsonlite` provides a function called **`fromJSON()`** that allows you to convert a JSON string into a list—or even a data frame if the columns have the right lengths!
```r
# send request for albums by David Bowie
<<<<<<< HEAD
query.params <- list(q = "d3", sort = "forks")
response <- GET("https://api.github.com/search/repositories", query = query.params)
=======
query_params <- list(q = "d3", sort = "forks")
response <- GET("https://api.github.com", query = query_params)
>>>>>>> 112475b2d49fb2adbcb635aef504ee74c89a9a15
body <- content(response, "text") # extract the body JSON
parsed_data <- fromJSON(body) # convert the JSON string to a list
```
The `parsed_data` will contain a _list_ built out of the JSON. Depending on the complexity of the JSON, this may already be a data frame you can `View()`... but more likely you'll need to _explore_ the list more to locate the "main" data you are interested in. Good strategies for this include:
- You can `print()` the data, but that is often hard to read (it requires a lot of scrolling).
- The `str()` method will produce a more organized printed list, though it can still be hard to read.
- The `names()` method will let you see a list of the what keys the list has, which is good for delving into the data.
As an example continuing the above code:
```r
is.data.frame(parsed_data) # FALSE; not a data frame you can work with
names(parsed_data) # "href" "items" "limit" "next" "offset" "previous" "total"
# looking at the JSON data itself (e.g., in the browser), `items` is the
# key that contains the value we want
items <- parsed_data$items # extract that element from the list
is.data.frame(items) # TRUE; you can work with that!
```
### Flattening Data
Because JSON supports—and in fact encourages—nested lists (lists within lists), parsing a JSON string is likely to produce a data frame whose columns _are themselves data frames_. As an example:
```r
# A somewhat contrived example
people <- data.frame(names = c("Spencer", "Jessica", "Keagan")) # a data frame with one column
favorites <- data.frame( # a data frame with two columns
food = c("Pizza", "Pasta", "salad"),
music = c("Bluegrass", "Indie", "Electronic")
)
# Store second dataframe as column of first
people$favorites <- favorites # the `favorites` column is a data frame!
# This prints nicely...
print(people)
# names favorites.food favorites.music
# 1 Spencer Pizza Bluegrass
# 2 Jessica Pasta Indie
# 3 Keagan salad Electronic
# but doesn't actually work like you expect!
people$favorites.food # NULL
people$favorites$food # [1] Pizza Pasta salad
```
Nested data frames make it hard to work with the data using previously established techniques. Luckily, the `jsonlite` package provides a helpful function for addressing this called **`flatten()`**. This function takes the columns of each _nested_ data frame and converts them into appropriately named columns in the "outer" data frame:
```r
people <- flatten(people)
people$favorites.food # this just got created! Woo!
```
Note that `flatten()` only works on values that are _already data frames_; thus you may need to find the appropriate element inside of the list (that is, the item which is the data frame you want to flatten).
In practice, you will almost always want to flatten the data returned from a web API. Thus your "algorithm" for downloading web data is as follows:
1. Use `GET()` to download the data, specifying the URI (and any query parameters).
2. Use `content()` to extract the data as a JSON string.
3. Use `fromJSON()` to convert the JSON string into a list.
4. Find which element in that list is your data frame of interest. You may need to go "multiple levels" deep.
5. Use `flatten()` to flatten that data frame.
6. ...
7. Profit!
<p class="alert alert-info">_Pro-tip_: JSON data can be quite messy when viewed in your web-browser. Installing a browser extension such as
<a href="https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc" target="_blank">JSONView</a> will format JSON responses in a more readable way, and even enable you to interactively explore the data structure.</p>
## Resources {-}
- [URIs (Wikipedia)](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)
- [HTTP Protocol Tutorial](https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177)
- [Programmable Web](http://www.programmableweb.com/) (list of web APIs; may be out of date)
- [RESTful Architecture](https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm) (original specification; not for beginners)
- [JSON View Extension](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc?hl=en)
- [`httr` documentation](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html)
- [`jsonlite` documentation](https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf)