-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathfromGraphQlToR.Rmd
316 lines (223 loc) · 11.6 KB
/
fromGraphQlToR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "How to get data from Zendro to R"
output:
html_document:
df_print: paged
---
Libraries needed for this tutorial:
```{r, warning=FALSE}
library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
```
This tutorial uses the library `httr` to establish the connection with the GraphiQL API, but there are also other options to interact with GraphQL from R. Please check the R packages: [ghql](https://docs.ropensci.org/ghql/), [gqlr](https://github.com/schloerke/gqlr) and [graphql](https://github.com/ropensci/graphql).
## Introduction to GraphQL API and how to query it
GraphQL is a query language for Application Programming Interfaces (APIs). Queries are written in the GraphQL language, and the result (the data) is given back in [JSON format](https://www.w3schools.com/whatis/whatis_json.asp).
If you are not familiar with GraphQL, we recommend you to start by checking the [Introduction to GraphQL and querying the API](GraphQL_intro.md) of Zendro How to Guides.
Zendro provides a GraphQL API web interface, called Graph**i**QL, which is a Web Browser tool for writing, validating, and testing GraphQL queries.
For example, try copy-pasting and executing the following query at [https://zendro.conabio.gob.mx/api/graphql](https://zendro.conabio.gob.mx/api/graphql), which is the API that we will be using in this and other tutorials.
```{GraphQL eval=FALSE}
{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
```
(The example above only gets the first 10 results, in a section of this tutorial we will explain how to define `pagination` to pull down a given number, or all, of the items in a dataset.)
## Download a small dataset (<1,000 elements):
The function `get_from_graphQL()` defined below queries a GraphQL API and transforms the data from JSON format (which is the output of GraphQL) into a R data frame object you can easily use for further analyses. If you want to now what's going on inside this function, there is an step-by-step detailed description at the end of this document.
To start using `get_from_graphQL()` first run the code below to load the function into your R environment (you can also have it as a different file and use `source()` to run it):
```{r}
get_from_graphQL<-function(query, url){
### This function queries a GraphiQL API and outpus the data into a single data.frame
## Arguments
# query: a graphQL query. It should work if you try it in graphiQL server. Must be a character string.
# url = url of the server to query. Must be a character string.
## Needed libraries:
# library(httr)
# library(jsonlite)
# library(dplyr)
# library(stringr)
### Function
## query the server
result <- POST(url, body = list(query=query), encode=c("json"))
## check server response
satus_code<-result$status_code
if(satus_code!=200){
print(paste0("Oh, oh: status code ", satus_code, ". Check your query and that the server is working"))
}
else{
# get data from query result
jsonResult <- content(result, as = "text")
# check if data downloaded without errors
# graphiQL will send an error if there is a problem with the query and the data was not dowloaded properly, even if the connection status was 200.
### FIX this when != TRUE because result is na
errors<-grepl("errors*{10}", jsonResult)
if(errors==TRUE){
print("Sorry :(, your data downloaded with errors, check your query and API server for details")
}
else{
# transform to json
readableResult <- fromJSON(jsonResult,
flatten = T) # this TRUE is to combine the different lists into a single data frame (because data comming from different models is nested in lists)
# get data
data<-as.data.frame(readableResult$data[1])
# rename colnames to original variable names
x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (ie not the ".") at the end of the string
colnames(data)<-x # assing new colnames
return(data)
}
}
}
```
`get_from_graphQL()` allows you to get data of up to 1,000 elements (results of your query) at a time, which is the maximum number allowed by GraphQL for a single batch. In the next section we explain how to use `pagination` to download larger datasets in batches.
To use the `get_from_graphQL()` function, first you have to define a GraphQL query. If you don't know how to do this, start by checking the [Introduction to GraphQL and querying the API](GraphQL_intro.md) of Zendro How to Guides.
Once you have a GraphQL query working, you'll need to save it to an R object as a character vector:
```{r}
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
"
```
Next we use this query as an argument for `get_from_graphQL()`, along with the url of the API, which is the same of the GraphiQL web interface you explored above:
```{r}
data<-get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql")
```
If all wen't well you will get a data frame with the result of your query:
```{r}
head(data)
```
## Download a dataset with more than >1,000 elements:
GraphQL outputs the resutls of a query in batches of max 1,000 elements. So if the data you want to download is larger than that, then you need to **paginate**, i.e. to get the data in batches. `pagination` is is an argument within GraphQL queries that could be done by:
* *Limit-offset*: indicating the first element to get (`offset`, default 0) and the number of elements to get (`limit`). The `limit` can't be larger than `1000`.
* *Cursor-based*: indicating the unique ID (`cursor`) of the element to get first, and a number of elements to get after.
Zendro uses the limit-offset pagination with the syntaxis:
`pagination:{limit:[integer], offset:[integer]}`
[See GraphQL documentation](https://graphql.org/learn/pagination/) and this [tutorial on GraphQL pagination](https://daily.dev/blog/pagination-in-graphql) for more details.
In the previous examples we downloaded only 10 elements (`pagination:{limit:10})`) from the rivers type, but the dataset is larger. (Remember, data in GraphQL is organised in **types** and **fields** within those types. When thinking about your structured data, you can think of types as the names of tables, and fields as the columns of those tables. In the example above `rivers` is a type and the fields are `river_id`, `name`, `length` among others.)
To know how many elements does a type has we can make a query with the function `count`, if it is available for the type we are interested on. You can check this in the `Docs` at the top right menu of the GraphiQL interface.
For example, `rivers` has the function `countRivers` so with the query `{countRivers}` we can get the total number of rivers.
Similar to how we got data before, you can use this very simple query in the function `get_from_graphQL` to get the number of rivers into R:
```{r}
# query API with count function
no_records<-get_from_graphQL(query="{countRivers}", url="https://zendro.conabio.gob.mx/api/graphql")
# change to vector, we don't need a df
no_records<-no_records[1,1]
no_records
```
In this case we have `r no_records`. Technically we could download all the data in a single batch because it is <1000, but for demostration purposes we will download it in batches of 10.
The following code calculates the number of pages needed to get a given number of records assuming a desired limit (size of each batch). Then it runs `get_from_graphQL()` within a loop for each page until getting the total number of records desired.
```{r}
# Define desired number of records and limit. Number of pages and offset will be estimated based on the number of records to download
no_records<- no_records # this was estimated above with a query to count the total number of records, but you can also manually change it to a custom desired number
my_limit<-10 # max 1000.
no_pages<-ceiling(no_records/my_limit)
## Define offseet.
# You can use the following loop:
# to calculate the offset automatically based on
# on the number of pages needed.
my_offset<-0 # start in 0. Leave like this
for(i in 1:no_pages){ # loop to
my_offset<-c(my_offset, my_limit*i)
}
# Or you can define the offset manually
# uncommenting the following line
# and commenting the loop above:
# my_offset<-c(#manually define your vector)
## create object where to store downloaded data. Leave empty
data<-character()
##
## Loop to download the data from GraphQL using pagination
##
for(i in c(1:length(my_offset))){
# Define pagination
pagination <- paste0("limit:", my_limit, ", offset:", my_offset[i])
# Define query looping through desired pagination:
my_query<- paste0("{
rivers(pagination:{", pagination, "}){
river_id
name
length
}
}
")
# Get data and add it to the already created df
data<-rbind(data, get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql"))
#end of loop
}
```
As a result you will get all the data in a single df:
```{r}
head(data)
summary(data)
```
## `get_from_graphQL()` explained step by step
The following is a step-by-step example explaining with more detail how does the function `get_from_graphQL()` that we used above works.
First, once you have a GraphQL query working, you'll need to save it to an R object as a character vector:
```{r}
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
"
```
Next, define as another character vector the url of the API, which is the same of the GraphiQL web interface you explored above:
```{r}
url<-"https://zendro.conabio.gob.mx/api/graphql"
```
Now we can a query to the API by using a POST request:
```{r}
# query server
result <- POST(url, body = list(query=my_query), encode = c("json"))
```
The result that we are getting is the `http` response. Before checking if we got the data, it is good practice to verify if the connection was successful by checking the status code. A `200` means that all went well. Any other code means problems. See [this](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
```{r}
# check server response
result$status_code
```
We now need to extract the data in order to be able to manipulate it. If everything went well, the `http` response will contain an attribute data which will itself contain an attribute named as the query, in this case `rivers`.
```{r}
result
```
If the query is not written properly or if there is any other error, the attribute `data` won't exist and instead we will get the attribute `erros` listing the errors found.
If all wen't well we can proceed to extract the content of the results with:
```{r}
# get data from query result
jsonResult <- content(result, as = "text")
```
The result will be in json format, which we can convert into an Robjet (list). In this list the results are within each type used in the query. The argumment `flatten` is used to collapse the list into a single data frame the data from different types.
```{r}
# transform to json
readableResult <- fromJSON(jsonResult,
flatten = T)
```
Extract data:
```{r}
# get data
data<-as.data.frame(readableResult$data[1])
head(data)
```
By default, the name of each type will be added a the beggining of each column name:
```{r}
colnames(data)
```
To keep only the name of the variable as it is in the original data:
```{r}
x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (ie not the ".") at the end of the string
colnames(data)<-x # assing new colnames
```
So finally we have the data in a single nice looking data frame:
```{r}
head(data)
```
Notice that you will get a dataframe like teh one above only for one to one associations, but than in other cases you still will get variables that are a list, which you can process in a separate step.