---
title: "Inserting endemic species with the colSpList API"
output:
  github_document:
    number_sections: true
---

************************

**Note**:

This document was created from an R Markdown document, with the output format "github_document".
To use this type of file, please install the packages *knitr* and *rmarkdown* in R.

1. If you want to compile the document as a markdown document for GitHub, while running all the code contained in the file
    + use `rmarkdown::render("file.Rmd")`
2. If you want to extract the R code of the document as an R script
    + use `knitr::purl("file.Rmd")`

***********************

To show how to work with the colSpList API and endemic species, we will use the species list from Ceiba of the endemic bird species of Colombia (publicly available at <http://i2d.humboldt.org.co/ceiba/resource.do?r=biota_v14_n2_09>).

# Preformatting the dataset
The dataset downloaded from Ceiba includes several files, which we need to preformat (here in R) to extract the information that we will send to the API and its database:
```{r}
directory = "../../data/dwca-biota_v14_n2_09"
(files = dir(directory))
```
In order to read all the text files in the directory:
```{r}
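# Keep only the tab-separated text files
fileNames <- files[grepl("\\.txt$", files)]
filesToRead <- file.path(directory, fileNames)
# Read text columns as character vectors, not factors (the default before R 4.0)
data <- lapply(filesToRead, read.csv, sep = "\t", stringsAsFactors = FALSE)
# Name each table after its file, without the extension
names(data) <- sub("\\.txt$", "", fileNames)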
```
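
As a self-contained illustration of the reading step above, the following sketch writes two small tab-separated files to a temporary directory and reads them back into a named list. The directory, the file names and their contents are invented for the example and do not come from the Ceiba dataset.

```r
# Toy example: directory, file names and contents are invented
toyDir <- file.path(tempdir(), "toy_dwca")
dir.create(toyDir, showWarnings = FALSE)
writeLines(c("id\tscientificName", "1\tGenusa speciesa"),
           file.path(toyDir, "taxon.txt"))
writeLines(c("id\ttype", "1\tLiteratura"),
           file.path(toyDir, "description.txt"))
# A file with another extension, which the filter must ignore
writeLines("<eml/>", file.path(toyDir, "eml.xml"))

toyFiles <- dir(toyDir)
toyFileNames <- toyFiles[grepl("\\.txt$", toyFiles)]
toyData <- lapply(file.path(toyDir, toyFileNames), read.csv,
                  sep = "\t", stringsAsFactors = FALSE)
names(toyData) <- sub("\\.txt$", "", toyFileNames)
names(toyData)
```

Each element of the resulting list is a data frame named after its source file, which is what makes calls such as `data[["taxon"]]` possible in the rest of the document.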
The first lines of each file are:
```{r}
lapply(data,head,5)
```
## Taxonomic information
The "taxon" file contains all the taxonomic information.
```{r}
colnames(data[["taxon"]])
```
The "scientificName" is actually the "canonicalName" here, and does not include the authorship for the species.
```{r}
data$taxon$scientificName[1:10]
```
The taxonomic status of the species is written in Spanish:
```{r}
table(data[["taxon"]]['taxonomicStatus'])
```
When a taxon name is a synonym, the column "acceptedNameUsage" gives the accepted name of the taxon, in an unusual form: a "canonicalName" (without markers) followed by the authorship:
```{r}
table(data[["taxon"]]$acceptedNameUsage,data[["taxon"]]$taxonomicStatus)
```
The simplest way to get the rank of each taxon here is to find the lowest-rank column that contains information:
```{r}
taxRankColumns <- c("kingdom","phylum","class","order","family","genus","specificEpithet")
ranksAssociated <- c("KG","PHY","CL","OR","FAM","GN","SP")
ranks <- apply(data[["taxon"]][,taxRankColumns],1,function(x,r)r[max(which(!is.na(x) & x != ""))], r = ranksAssociated)
table(ranks)
```
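
To make the logic of the `apply()` call above easier to follow, here is a minimal sketch on an invented two-row table: for each row, we look for the last column that is neither `NA` nor empty, and return the rank code associated with that column. The table and the reduced set of columns are assumptions made for the example.

```r
# Toy taxonomy table (invented): the first row stops at the genus,
# the second one goes down to the specific epithet
toyTaxon <- data.frame(
  kingdom = c("Animalia", "Animalia"),
  family = c("Familya", "Familya"),
  genus = c("Genusa", "Genusa"),
  specificEpithet = c("", "speciesa"),
  stringsAsFactors = FALSE
)
toyRankColumns <- c("kingdom", "family", "genus", "specificEpithet")
toyRankCodes <- c("KG", "FAM", "GN", "SP")
# For each row, keep the code of the last (lowest) column that is filled in
toyRanks <- apply(toyTaxon[, toyRankColumns], 1,
                  function(x, r) r[max(which(!is.na(x) & x != ""))],
                  r = toyRankCodes)
toyRanks
```

The first row is assigned the rank "GN" and the second one the rank "SP". This only works because the columns are ordered from the highest rank to the lowest.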
It appears that all the taxa have a specific epithet. We now need to check that every scientificName is indeed made of a genus and a specific epithet.
```{r}
regexGnSp <- "^[A-Z][a-z]+ [a-z]+$"
all(grepl(regexGnSp,data[["taxon"]]$scientificName))
```
So, indeed all names in scientificName correspond to species.
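
As an illustration of what the regular expression used above accepts and rejects, here is a short sketch on invented names. Note that this pattern is strict: it would also reject a valid species name containing a hyphen or any non-ASCII letter, so a `FALSE` result calls for a manual check rather than an automatic exclusion.

```r
regexBinomial <- "^[A-Z][a-z]+ [a-z]+$"
# Invented names, one per typical case
toyCandidates <- c(
  "Genusa speciesa",                # genus + specific epithet: accepted
  "Genusa",                         # genus only: rejected
  "Genusa speciesa subspeciesa",    # trinomial: rejected
  "Genusa speciesa (Author, 1900)", # name with authorship: rejected
  "genusa speciesa"                 # no capital letter on the genus: rejected
)
isBinomial <- grepl(regexBinomial, toyCandidates)
data.frame(name = toyCandidates, isBinomial = isBinomial)
```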
### Taxonomic preformatting
```{r}
# A name is a synonym if an accepted name is given or if its status says so
syno <- data$taxon$acceptedNameUsage != "" | data$taxon$taxonomicStatus == "Sinónimo"
pf_taxon <- data.frame(
  id = data$taxon$id,
  canonicalname = data$taxon$scientificName,
  rank = "SP",
  synoscientificname = ifelse(data$taxon$acceptedNameUsage == "", NA, data$taxon$acceptedNameUsage),
  parentcanonicalname = data$taxon$genus,
  syno = syno
)
head(pf_taxon)
```
## Endemic status, references and comments
In the "description" file, we can find both the endemism status of the species and the references to cite, in different rows.
```{r}
by(data$description,data$description$type,head)
```
```{r}
# Endemism status: rows of type "Distribución"
isStatus <- data$description$type == "Distribución"
tabStatus <- data.frame(id = data$description$id[isStatus],
                        endemstatus = data$description$description[isStatus])
# References: rows of type "Literatura", stored as one comma-separated string
isRef <- data$description$type == "Literatura"
tabRef <- data.frame(id = data$description$id[isRef],
                     rawRef = data$description$description[isRef],
                     stringsAsFactors = FALSE)
listRef <- strsplit(tabRef$rawRef, ", ?")
names(listRef) <- tabRef$id
# Some elements are plain numbers: we remove them from the list
listRef <- lapply(listRef, function(x) x[!grepl("^[0-9]*$", x)])
listRef <- listRef[sapply(listRef, length) > 0]
listRef[1:5]
```
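
The reference parsing above can be sketched on an invented table as follows: each raw string is split on commas, the purely numeric elements are dropped, and the taxa left without any reference are removed from the list. The identifiers and reference strings are placeholders, not values from the dataset.

```r
# Toy reference table (invented ids and reference strings)
toyRef <- data.frame(
  id = c("a", "b", "c"),
  rawRef = c("RefA, 12, RefB", "34", "RefC,RefD"),
  stringsAsFactors = FALSE
)
# Split on a comma followed by an optional space
toyListRef <- strsplit(toyRef$rawRef, ", ?")
names(toyListRef) <- toyRef$id
# Drop the elements made only of digits
toyListRef <- lapply(toyListRef, function(x) x[!grepl("^[0-9]*$", x)])
# Drop the taxa that have no reference left
toyListRef <- toyListRef[sapply(toyListRef, length) > 0]
toyListRef
```

Taxon "b" disappears from the list because its only element was a number. Keep in mind that splitting on commas assumes that no single reference contains a comma itself.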
We might use a more complex data schema in the future, but the database currently has no specific structure to store the information contained in the "distribution" table.
For now, we concatenate this information in the "comments" field, which already exists in the database.
```{r}
head(data$distribution)
# Concatenate the non-empty fields into one comment string per row
tabComments <- data.frame(
  id = data$distribution$id,
  comments = paste0(
    ifelse(data$distribution$locality != "", paste("locality:", data$distribution$locality, "| "), ""),
    ifelse(data$distribution$occurrenceRemarks != "", paste("occurrenceRemarks:", data$distribution$occurrenceRemarks, "| "), "")
  )
)
```
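
Here is a minimal sketch of the same concatenation on an invented distribution table, showing what the comment strings look like when one of the two fields is empty. The last step, which strips the trailing separator, is an optional variation and is not applied in this document.

```r
# Toy distribution table (invented values)
toyDistribution <- data.frame(
  id = c("a", "b", "c"),
  locality = c("Place one", "", "Place two"),
  occurrenceRemarks = c("Remark one", "Remark two", ""),
  stringsAsFactors = FALSE
)
# Each non-empty field becomes "<field name>: <value> | "
toyComments <- paste0(
  ifelse(toyDistribution$locality != "",
         paste("locality:", toyDistribution$locality, "| "), ""),
  ifelse(toyDistribution$occurrenceRemarks != "",
         paste("occurrenceRemarks:", toyDistribution$occurrenceRemarks, "| "), "")
)
toyComments
# Optional variation: remove the separator left at the end of each string
toyCommentsTrimmed <- sub("\\s*\\|\\s*$", "", toyComments)
toyCommentsTrimmed
```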
## Final preformatting
The goal here is to combine the taxonomic information, the endemic status, the references and the comments into lists that can be converted directly to JSON and sent to the POST method of the API.
```{r}
# First we prepare the global reference and link for the dataset, to add to each taxon
baseRef <- list(
  ref_citation = list("Chaparro-Herrera, S., Echeverry-Galvis, M.A., Córdoba-Córdoba, S., Sua-Becerra, A. (2013). Listado actualizado de las aves endémicas y casi-endémicas de Colombia. 308 registros. Versión 5.1. http://doi.org/10.15472/tozuue"),
  link = list("http://i2d.humboldt.org.co/ceiba/resource.do?r=biota_v14_n2_09")
)
masterList <- list()
for (i in seq_len(nrow(pf_taxon)))
{
  id <- pf_taxon[i, "id"]
  # Taxonomic fields, without the id and without the empty (NA) fields
  masterList[[i]] <- as.list(pf_taxon[i, colnames(pf_taxon) != "id" & !is.na(pf_taxon[i, ])])
  masterList[[i]] <- append(masterList[[i]], list(endemstatus = tabStatus[tabStatus$id == id, "endemstatus"]))
  masterList[[i]] <- append(masterList[[i]], baseRef)
  # Then we add the potential references already preformatted for this taxon
  # (as.character() ensures that listRef is indexed by name, not by position)
  if (id %in% names(listRef))
  {
    refs <- listRef[[as.character(id)]]
    masterList[[i]]$ref_citation <- append(masterList[[i]]$ref_citation, refs)
    # These references have no link: " " keeps both lists aligned
    masterList[[i]]$link <- append(masterList[[i]]$link, as.list(rep(" ", length(refs))))
  }
  masterList[[i]]$comments <- tabComments$comments[tabComments$id == id]
}
```
# Basic usage: example with a single species
If we take the first element of the list that we formatted in the previous part of the document, we obtain:
```{r}
masterList[[1]]
```
As you can see, the list already follows the specifications of the API: to send the status to the database, we use a JSON "dictionary" with the following elements:

* the identification of the species, with either:
    + *gbifkey*: integer corresponding to the taxonKey used in GBIF
    + *scientificname*: string corresponding to the scientificName in the GBIF backbone
    + *canonicalname*: string corresponding to the canonicalName in the GBIF backbone (formally, it is better to use the canonicalNameWithMarker used by the name parser of the GBIF backbone)
    + *rank*: taxonomic rank of the taxon
    + *syno*: boolean describing whether the name is a synonym of an accepted taxon
    + *parentgbifkey*, *parentcanonicalname* and *parentscientificname*: equivalent of the identification of the taxon, for the parent taxon
    + *synogbifkey*, *synocanonicalname* and *synoscientificname*: equivalent of the identification of the taxon, for the accepted taxon in case the name sent is a synonym
* *endemstatus*: the endemism level
* *ref_citation*: a list of the references on which the inclusion of the taxon and its endemism status in the API are based
* *link*: a list of the URL links (corresponding in length and order to the *ref_citation*)
* *comments*: comment on the endemism status of the taxon
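
Before sending anything, it can be useful to check a payload against the elements listed above. The function below is a hypothetical helper written for this document, not part of the colSpList API; in particular, treating *endemstatus* as mandatory is an assumption. It returns the list of problems found, which is empty when the payload looks consistent.

```r
# Hypothetical helper (not part of the colSpList API): basic consistency
# checks on one payload, returning a character vector of problems
checkPayload <- function(payload) {
  idFields <- c("gbifkey", "scientificname", "canonicalname")
  problems <- character(0)
  # At least one way of identifying the taxon must be present
  if (!any(idFields %in% names(payload))) {
    problems <- c(problems, "no identification field")
  }
  # Assumption: the endemism status is required
  if (is.null(payload[["endemstatus"]])) {
    problems <- c(problems, "endemstatus is missing")
  }
  # ref_citation and link must correspond in length and order
  if (length(payload[["ref_citation"]]) != length(payload[["link"]])) {
    problems <- c(problems, "ref_citation and link differ in length")
  }
  problems
}

# Toy payload with placeholder values
toyPayload <- list(
  canonicalname = "Genusa speciesa",
  rank = "SP",
  syno = FALSE,
  parentcanonicalname = "Genusa",
  endemstatus = "placeholder status",
  ref_citation = list("Reference A", "Reference B"),
  link = list("http://example.org/a", " "),
  comments = "locality: Place one | "
)
checkPayload(toyPayload)
```

With the lists built in this document, a call such as `lapply(masterList, checkPayload)` would give one vector of problems per taxon.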
## In R
```{r}
library(httr)
library(jsonify)
# Convert the first element to JSON; unbox = TRUE sends length-1 vectors as scalars
sendJson <- to_json(masterList[[1]], unbox = TRUE)
baseURL <- 'http://localhost:5000'
baseResource <- "insertEndem"
POST(paste(baseURL, baseResource, sep = "/"), body = sendJson, content_type("application/json"), verbose())
```
Now we do it for the whole list:
```{r}
res <- list()
for (i in seq_along(masterList))
{
  res[[i]] <- POST(paste(baseURL, baseResource, sep = "/"), body = to_json(masterList[[i]], unbox = TRUE), content_type("application/json"))
}
```
# Problems
```{r}
# An insertion is considered problematic when the response contains no "cd_tax" element
pbs <- !sapply(res, function(x) "cd_tax" %in% names(content(x)))
any(pbs)
```
No problem found!
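
If some insertions had failed, the same test would give the positions of the taxa to inspect. The sketch below illustrates this with invented stand-ins for the parsed response bodies (what `content(x)` would return); the `error` element and the identifier values are assumptions made for the example, not the documented behaviour of the API.

```r
# Toy stand-ins for the parsed response bodies (invented)
toyContents <- list(
  list(cd_tax = 101),
  list(error = "something went wrong"),
  list(cd_tax = 102)
)
# Same test as above: a response without "cd_tax" is a problem
toyPbs <- !sapply(toyContents, function(x) "cd_tax" %in% names(x))
# Positions of the problematic elements
which(toyPbs)
```

In this document, `masterList[pbs]` would then return the payloads that need to be checked and sent again.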