fromGraphQlToR.Rmd

---
title: "How to get data from Zendro to R"
output:
  html_document:
    df_print: paged
---

Libraries needed for this tutorial:
```{r, warning=FALSE}
library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
```

This tutorial uses the library `httr` to establish the connection with the GraphiQL API, but there are also other options to interact with GraphQL from R. Please check the R packages: [ghql](https://docs.ropensci.org/ghql/), [gqlr](https://github.com/schloerke/gqlr) and [graphql](https://github.com/ropensci/graphql).

## Introduction to GraphQL API and how to query it

GraphQL is a query language for Application Programming Interfaces (APIs). Queries are written in the GraphQL language, and the result (the data) is given back in [JSON format](https://www.w3schools.com/whatis/whatis_json.asp). 

If you are not familiar with GraphQL, we recommend you to start by checking the [Introduction to GraphQL and querying the API](GraphQL_intro.md) of Zendro How to Guides.

Zendro provides a GraphQL API web interface, called Graph**i**QL, which is a Web Browser tool for writing, validating, and testing GraphQL queries.

For example, try copy-pasting and executing the following query at [https://zendro.conabio.gob.mx/api/graphql](https://zendro.conabio.gob.mx/api/graphql), which is the API that we will be using in this and other tutorials.

```{GraphQL eval=FALSE}
{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
}
} 
```

(The example above only gets the first 10 results, in a section of this tutorial we will explain how to define `pagination` to pull down a given number, or all, of the items in a dataset.)


## Download a small dataset (<1,000 elements):

The function `get_from_graphQL()` defined below queries a GraphQL API and transforms the data from JSON format (which is the output of GraphQL) into a R data frame object you can easily use for further analyses. If you want to now what's going on inside this function, there is an step-by-step detailed description at the end of this document.

To start using `get_from_graphQL()` first run the code below to load the function into your R environment (you can also have it as a different file and use `source()` to run it): 

```{r}

get_from_graphQL<-function(query, url){
### This function queries a GraphiQL API and outpus the data into a single data.frame 

## Arguments
# query: a graphQL query. It should work if you try it in graphiQL server. Must be a character string.
# url = url of the server to query. Must be a character string.

## Needed libraries:
# library(httr)
# library(jsonlite)
# library(dplyr)
# library(stringr)

### Function

##  query the server
result <- POST(url, body = list(query=query), encode=c("json"))

## check server response
satus_code<-result$status_code

if(satus_code!=200){
  print(paste0("Oh, oh: status code ", satus_code, ". Check your query and that the server is working"))
}

else{
  
  # get data from query result
  jsonResult <- content(result, as = "text") 
  
  # check if data downloaded without errors
  # graphiQL will send an error if there is a problem with the query and the data was not dowloaded properly, even if the connection status was 200. 
  ### FIX this when != TRUE because result is na
  errors<-grepl("errors*{10}", jsonResult)
  if(errors==TRUE){
    print("Sorry :(, your data downloaded with errors, check your query and API server for details")
  } 
  else{ 
  # transform to json
  readableResult <- fromJSON(jsonResult, 
                           flatten = T) # this TRUE is to combine the different lists into a single data frame (because data comming from different models is nested in lists)
    
  # get data
  data<-as.data.frame(readableResult$data[1]) 
  
  # rename colnames to original variable names
  x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (ie not the ".") at the end of the string
  colnames(data)<-x # assing new colnames
  return(data)
    }
  }
}

```

`get_from_graphQL()` allows you to get data of up to 1,000 elements (results of your query) at a time, which is the maximum number allowed by GraphQL for a single batch. In the next section we explain how to use `pagination` to download larger datasets in batches.

To use the `get_from_graphQL()` function, first you have to define a GraphQL query. If you don't know how to do this, start by checking the [Introduction to GraphQL and querying the API](GraphQL_intro.md) of Zendro How to Guides.

Once you have a GraphQL query working, you'll need to save it to an R object as a character vector:

```{r}
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
   }
}
"
```

Next we use this query as an argument for `get_from_graphQL()`, along with the url of the API, which is the same of the GraphiQL web interface you explored above:

```{r}
data<-get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql")
```

If all wen't well you will get a data frame with the result of your query:

```{r}
head(data)
```


## Download a dataset with more than >1,000 elements:

GraphQL outputs the resutls of a query in batches of max 1,000 elements. So if the data you want to download is larger than that, then you need to **paginate**, i.e. to get the data in batches. `pagination` is is an argument within GraphQL queries that could be done by:

* *Limit-offset*: indicating the first element to get (`offset`, default 0) and the number of elements to get (`limit`). The `limit` can't be larger than `1000`.

* *Cursor-based*: indicating the unique ID (`cursor`) of the element to get first, and a number of elements to get after.

Zendro uses the limit-offset pagination with the syntaxis:

`pagination:{limit:[integer], offset:[integer]}`

[See GraphQL documentation](https://graphql.org/learn/pagination/) and this [tutorial on GraphQL pagination](https://daily.dev/blog/pagination-in-graphql) for more details.

In the previous examples we downloaded only 10 elements (`pagination:{limit:10})`) from the rivers type, but the dataset is larger. (Remember, data in GraphQL is organised in **types** and **fields** within those types. When thinking about your structured data, you can think of types as the names of tables, and fields as the columns of those tables. In the example above `rivers` is a type and the fields are `river_id`, `name`, `length` among others.)

To know how many elements does a type has we can make a query with the function `count`, if it is available for the type we are interested on. You can check this in the `Docs` at the top right menu of the GraphiQL interface. 

For example, `rivers` has the function `countRivers` so with the query `{countRivers}` we can get the total number of rivers.

Similar to how we got data before, you can use this very simple query in the function `get_from_graphQL` to get the number of rivers into R:

```{r}
# query API with count function
no_records<-get_from_graphQL(query="{countRivers}", url="https://zendro.conabio.gob.mx/api/graphql")

# change to vector, we don't need a df
no_records<-no_records[1,1]
no_records
```

In this case we have `r no_records`. Technically we could download all the data in a single batch because it is <1000, but for demostration purposes we will download it in batches of 10.

The following code calculates the number of pages needed to get a given number of records assuming a desired limit (size of each batch). Then it runs `get_from_graphQL()` within a loop for each page until getting the total number of records desired.


```{r}
# Define desired number of records and limit. Number of pages and offset will be estimated based on the number of records to download
no_records<- no_records # this was estimated above with a query to count the total number of records, but you can also manually change it to a custom desired number
my_limit<-10 # max 1000. 
no_pages<-ceiling(no_records/my_limit)

## Define offseet.
# You can use the following loop:
# to calculate the offset automatically based on 
# on the number of pages needed.
my_offset<-0 # start in 0. Leave like this
for(i in 1:no_pages){ # loop to 
  my_offset<-c(my_offset, my_limit*i)
}

# Or you can define the offset manually 
# uncommenting the following line
# and commenting the loop above:
# my_offset<-c(#manually define your vector) 

## create object where to store downloaded data. Leave empty
data<-character()

##
## Loop to download the data from GraphQL using pagination
## 

for(i in c(1:length(my_offset))){

# Define pagination
pagination <- paste0("limit:", my_limit, ", offset:", my_offset[i])

# Define query looping through desired pagination:
my_query<- paste0("{
  rivers(pagination:{", pagination, "}){
      river_id
      name
      length
   }
   } 
   ")


# Get data and add it to the already created df
data<-rbind(data, get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql"))

#end of loop
}

```

As a result you will get all the data in a single df:

```{r}
head(data)
summary(data)
```

## `get_from_graphQL()` explained step by step

The following is a step-by-step example explaining with more detail how does the function `get_from_graphQL()` that we used above works.

First, once you have a GraphQL query working, you'll need to save it to an R object as a character vector:

```{r}
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
   }
}
"
```

Next, define as another character vector the url of the API, which is the same of the GraphiQL web interface you explored above:

```{r}
url<-"https://zendro.conabio.gob.mx/api/graphql"
```


Now we can a query to the API by using a POST request:

```{r}
# query server
result <- POST(url, body = list(query=my_query), encode = c("json"))
```

The result that we are getting is the `http` response. Before checking if we got the data, it is good practice to verify if the connection was successful by checking the status code. A `200` means that all went well. Any other code means problems. See [this](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

```{r}
# check server response
result$status_code
```

We now need to extract the data in order to be able to manipulate it. If everything went well, the `http` response will contain an attribute data which will itself contain an attribute named as the query, in this case `rivers`.


```{r}
result
```

If the query is not written properly or if there is any other error, the attribute `data` won't exist and instead we will get the attribute `erros` listing the errors found. 

If all wen't well we can proceed to extract the content of the results with:

```{r}
# get data from query result
jsonResult <- content(result, as = "text") 

```

The result will be in json format, which we can convert into an Robjet (list). In this list the results are within each type used in the query. The argumment `flatten` is used to collapse the list into a single data frame the data from different types. 

```{r}
# transform to json
readableResult <- fromJSON(jsonResult, 
                         flatten = T)
```

Extract data:

```{r}
# get data
data<-as.data.frame(readableResult$data[1]) 
head(data)
```

By default, the name of each type will be added a the beggining of each column name:

```{r}
colnames(data)
```

To keep only the name of the variable as it is in the original data: 
```{r}
x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (ie not the ".") at the end of the string
colnames(data)<-x # assing new colnames 
```

So finally we have the data in a single nice looking data frame:
```{r}
head(data) 
```
Notice that you will get a dataframe like teh one above only for one to one associations, but than in other cases you still will get variables that are a list, which you can process in a separate step.