---
title: "Query and site .csv files from an Access database (mdb-to-csv.R)"
output:
  html_document:
    theme: united
    css: SI-md-08.css
    fig_caption: yes
    highlight: haddock
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=TRUE)
pdf.options(useDingbats = TRUE)
```
# Introduction #
This script reads a GCD (Global Charcoal Database) Microsoft Access database, in this case `GCDv03_Marlon_et_al_2015.mdb`, downloaded from [https://paleofire.org](https://paleofire.org). These data were the basis of the analyses in Marlon et al. (2016, *Biogeosciences*). The database includes the results of two queries stored as database "views" (as opposed to tables), one for a list of sites and the other for the data; the views are read directly from the .mdb file. The files have the prefix `v3i`, this being the ninth (in alphabetical order, i.e. `v3a`, `v3b`, ..., `v3i`) version, incorporating successive fix-ups of data problems.
The data query (i.e. the charcoal data for all sites) is parsed into individual "site" .csv files, while also applying various fix-ups, calculating sedimentation rates, and converting, for example, charcoal concentrations to charcoal influx. The result is a set of .csv files that can be examined individually, or processed further by transformation, presampling or binning, and the development of composite curves. While aimed at producing the .csv files, the script also looks for and flags various inconsistencies in the data, such as age reversals or zero or negative sedimentation rates. In addition to the individual .csv files, one per site, the script generates a "site-query" .csv file listing the information for each site, and a "data-query" .csv file of all of the charcoal data.
The script also rewrites the site list query into a .csv file for convenient sorting into regional subsets.
# Set up folders and filenames #
Set up folder paths and filenames for the various input and output data. Set the path to the Access database (`.mdb`) file.
```{r setup01, eval=TRUE, echo=TRUE}
dbname <- "GCDv03_Marlon_et_al_2015.mdb"
```
Create paths and folders for (a) the query files, (b) the individual site .csv files, and (c) the sitelist .csv file.
```{r setup02, eval=TRUE, echo=TRUE}
# query label and path to query and query name
datapath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/"
querypath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_query/"
# if the query output folder does not exist, create it
dir.create(file.path(querypath), showWarnings=FALSE)
# query file names
querysitename <- "v3i_sites.csv"
querydataname <- "v3i_data.csv"
```
```{r setup03, eval=TRUE, echo=TRUE}
# path to .csv output
csvpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_sites_csv/"
# if output folder does not exist, create it
dir.create(file.path(csvpath), showWarnings=FALSE)
# path to sitelist output
sitelistpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_sitelists/"
# if output folder does not exist, create it
dir.create(file.path(sitelistpath), showWarnings=FALSE)
# sitelist output label
sitelistname <- "v3i_all"
```
Also create a "debug" or log file:
```{r debug, eval=FALSE, echo=TRUE}
# debug/log file
debugpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_debug/"
# if debug folder does not exist, create it
dir.create(file.path(debugpath), showWarnings=FALSE)
debugname <- "mdb-to-csv_debug.txt"
# open the debug/log file
debugfile <- file(paste(debugpath, debugname, sep=""), "w")
```
Other setup:
```{r setup04}
# setup
maxsites <- 2000
maxsamples <- 9000
miss <- -9999.0
```
# Access database queries #
## Database connection setup ##
Note that the connection to the particular Access database (`GCDv03_Marlon_et_al_2015.mdb`) is established externally to R using (on Windows) the Data Sources tool (i.e. Control Panel > Administrative Tools > Data Sources (ODBC)). On the Mac, the *Actual ODBC Driver for Access* at http://www.actualtech.com/ works, but requires some configuration, as well as compilation of the `RODBC` package. See the code file `mdb-to-csv_osx.R`.
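As an alternative to a pre-configured DSN, the `.mdb` file can (on Windows, with the Access ODBC driver installed) be opened directly with a connection string via `RODBC::odbcDriverConnect()`. A minimal sketch, assuming the standard Microsoft Access driver name and that the database sits in the working directory:
```{r dsnless, eval=FALSE, echo=TRUE}
# DSN-less alternative (Windows; assumes the Microsoft Access ODBC driver is installed)
library(RODBC)
con_str <- paste("Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
  "Dbq=GCDv03_Marlon_et_al_2015.mdb", sep="")
gcdv3.db <- odbcDriverConnect(con_str)
```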
## Connect to the database and get some info ##
Load the `RODBC` library and connect to the downloaded database. Get some basic information.
```{r db01}
# load RODBC library and connect to the database
library(RODBC)
gcdv3.db <- odbcConnect(dbname)
odbcGetInfo(gcdv3.db)
```
Check that the two queries are present in the database as views, and list their column names. (This is not strictly necessary, because the presence of the appropriate data can be verified simply by opening the database.) The data are not in intrinsic tables in the database, but are defined using two queries, and are externally available as table-like "views". The names of the queries, `ALL_BART_SITES` and `ALL_BART_DATA`, are legacies of the workflow developed for analyzing the first two versions of the database.
```{r db02}
# check for existence of database site and data views
sqlTables(gcdv3.db, tableName="ALL_BART_SITES", tableType="VIEW")
sqlColumns(gcdv3.db, "ALL_BART_SITES")$COLUMN_NAME
sqlTables(gcdv3.db, tableName="ALL_BART_DATA", tableType="VIEW")
sqlColumns(gcdv3.db, "ALL_BART_DATA")$COLUMN_NAME
```
Note that for the particular way the database was constructed, the variables `LATITUDE` and `LONGITUDE` in the site query and `DEPTH` in the data query are not correctly named. Those columns will be renamed after fetching the data.
## Site and data queries ##
Use the `sqlFetch()` function to retrieve the two views. For the site query, convert `SITE_NAME` from a factor to a character string. Close the database connection when done.
```{r db03}
# site query
site_query <- sqlFetch(gcdv3.db, "ALL_BART_SITES")
names(site_query)[3] <- "LATITUDE"; names(site_query)[4] <- "LONGITUDE"
site_query$SITE_NAME <- as.character(site_query$SITE_NAME)
head(site_query)
str(site_query)
```
```{r db04}
# data query
data_query <- sqlFetch(gcdv3.db, "ALL_BART_DATA")
names(data_query)[4] <- "DEPTH"
head(data_query)
str(data_query)
```
```{r db05}
# close the database
odbcClose(gcdv3.db)
```
# Write query .csv files #
Write out two "query" .csv files, one for the sites and one for the data. These may be examined, externally edited, or augmented by additional data.
```{r writeSites}
# site .csv file
sitecsvpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_query/" # same folder as querypath
write.csv(site_query, paste(sitecsvpath, querysitename, sep=""), row.names=FALSE)
```
```{r writeData}
# data .csv file
datacsvpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_query/" # same folder as querypath
write.csv(data_query, paste(datacsvpath, querydataname, sep=""), row.names=FALSE)
```
Also write a .csv file of site locations and site names for easy editing to make site lists for regional queries:
```{r writeSitelist}
# rewrite sitefile as .csv file for sorting by region and depositional context
sitelist <- data.frame(site_query$ID_SITE, site_query$LATITUDE, site_query$LONGITUDE, site_query$ELEV,
site_query$ID_DEPO_CONTEXT, site_query$SITE_NAME, stringsAsFactors = FALSE)
names(sitelist) <- c("Site_ID", "Lat", "Lon", "Elev", "depo_context", "Site_Name")
head(sitelist)
str(sitelist)
sitelistfile <- paste(sitelistpath, sitelistname, ".csv", sep="")
write.table(sitelist, sitelistfile, row.names=FALSE, sep=",")
```
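For example, a regional sitelist can be produced from this file with a simple bounding-box filter. A minimal sketch, assuming a hypothetical North American bounding box and output file name:
```{r regionalSubset, eval=FALSE, echo=TRUE}
# hypothetical example: subset the sitelist to a North American bounding box
na_sites <- sitelist[sitelist$Lat >= 25.0 & sitelist$Lat <= 70.0 &
  sitelist$Lon >= -170.0 & sitelist$Lon <= -50.0, ]
write.table(na_sites, paste(sitelistpath, "v3i_northamerica.csv", sep=""), row.names=FALSE, sep=",")
```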
# Parse the query #
The main part of the script loops over the individual sites, and does various checks and calculations, including
1. calculation of sedimentation rates and deposition times
2. checking for age or depth reversals
3. setting of various indicator variables and flags
4. calculation of alternative quantities (e.g. influx, given concentrations)
5. further checking for anomalies
6. writing out a .csv file for each site
In the example here, only the first site is processed (the loop chunks are displayed but not evaluated, and the results are loaded from a saved workspace), although the full script loops over all sites.
## Loop over sites ##
```{r readRData, eval=TRUE, echo=FALSE, include=FALSE}
load("mdb-to-csv_07_run1.RData")
j <- 1
```
```{r mainLoop, eval=FALSE, echo=TRUE}
for (j in 1:maxsites) {
```
## Extract the data for the j-th site from the query ##
Get the data for the j-th site using "square bracket" extraction:
```{r parse01, eval=FALSE, echo=TRUE}
nsamp <- 0
sitedata <- data_query[data_query$ID_SITE == j, ]
nsamp <- length(sitedata$ID_SITE)
```
The first site, Cygnet L., contains `r nsamp` records:
```{r parse02, eval=TRUE, echo=FALSE, include=TRUE}
head(sitedata)
tail(sitedata)
```
## Process the data for the j-th site ##
Define some local variables, and replace `NA`s with the missing-value code (`miss`).
```{r parse03, eval=FALSE, echo=TRUE}
if (nsamp > 0) {
jchar <- as.character(j)
nsampchar <- as.character(nsamp)
writeLines(paste("Site",jchar,nsampchar,"samples", sep=" "), con = debugfile, sep = "\n")
# local variables
depth <- sitedata$DEPTH; age <- sitedata$EST_AGE; quant <- sitedata$QUANTITY
depth[is.na(depth)] <- miss
age[is.na(age)] <- miss
quant[is.na(quant)] <- miss
thickness <- rep(miss, nsamp); dep_time <- rep(miss, nsamp); sed_rate <- rep(miss, nsamp)
unit_dep_time <- rep(miss, nsamp)
xst_level <- as.character(sitedata[1,9]) # quantity-type code for this site (e.g. INFL, CONC, C0P0)
```
### Sedimentation rates and deposition times, missing values ###
The first (top) and last (bottom) samples require some fixing up in order to obtain the "thickness" values used to calculate sedimentation rate as sample thickness divided by deposition time. For interior samples, thickness is the depth difference between a sample and the one below it, and deposition time is the difference between the midpoint ages of the adjacent sample pairs. Note the inline conversion of depths, recorded in the database in metres, to centimetres, the standard divisor in the units of influx.
```{r parse05, eval=FALSE, echo=TRUE}
# sed rate and deposition time
# first (top) sample
if (depth[1] != miss && depth[2] != miss) {
thickness[1] <- (depth[2] - depth[1])*100.0 # meters to cm (depth in m, influx and conc in cm)
dep_time[1] <- age[2] - age[1]
if (dep_time[1] > 0.0) sed_rate[1] <- thickness[1]/dep_time[1]
if (sed_rate[1] != miss) unit_dep_time[1] <- 1.0/sed_rate[1]
}
# samples 2 to nsamp-1
for (i in 2:(nsamp-1)) {
  if (depth[i] != miss && depth[i+1] != miss) { # both bounding depths must be nonmissing
    thickness[i] <- (depth[i+1] - depth[i])*100.0
    dep_time[i] <- ((age[i+1] + age[i])/2.0) - ((age[i] + age[i-1])/2.0)
    if (dep_time[i] > 0.0) sed_rate[i] <- thickness[i]/dep_time[i]
    if (sed_rate[i] != miss) unit_dep_time[i] <- 1.0/sed_rate[i]
  }
}
# last (bottom) sample
if (depth[nsamp-1] != miss && depth[nsamp] != miss) {
thickness[nsamp] <- thickness[nsamp-1] # replicate thickness
dep_time[nsamp] <- age[nsamp] - age[nsamp-1]
sed_rate[nsamp] <- sed_rate[nsamp-1] # replicate sed_rate
unit_dep_time[nsamp] <- unit_dep_time[nsamp-1]
}
```
```{r parse06, eval=TRUE, echo=FALSE, include=TRUE}
head(cbind(age, depth, thickness, dep_time, sed_rate, unit_dep_time))
tail(cbind(age, depth, thickness, dep_time, sed_rate, unit_dep_time))
```
For this site the samples were taken as contiguous 1-cm slices, which is why the thickness values all equal 1.0 (and why `sed_rate` is simply the reciprocal of `dep_time`).
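As a quick arithmetic check of the interior-sample formulas, consider a toy example with hypothetical ages and contiguous 1-cm samples:
```{r sedRateToy, eval=FALSE, echo=TRUE}
# toy example (hypothetical values): interior-sample calculations for sample 2
age_ex   <- c(0.0, 8.0, 20.0, 35.0)   # ages (yr BP)
depth_ex <- c(0.00, 0.01, 0.02, 0.03) # depths (m), contiguous 1-cm samples
thickness_2 <- (depth_ex[3] - depth_ex[2])*100.0                             # 1 cm
dep_time_2  <- ((age_ex[3] + age_ex[2])/2.0) - ((age_ex[2] + age_ex[1])/2.0) # 10 yr
sed_rate_2  <- thickness_2/dep_time_2 # 0.1 cm yr^-1
```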
Count non-missing values:
```{r parse07, eval=FALSE, echo=TRUE}
# counts of missing values
depth_count <- 0; age_count <- 0; quant_count <- 0; sed_rate_count <- 0; sed_rate_flag <- 1
depth_count <- sum(depth != miss)
age_count <- sum(age != miss)
quant_count <- sum(quant != miss)
sed_rate_count <- sum(sed_rate != miss)
if (sed_rate_count != nsamp) sed_rate_flag <- 0
```
```{r parse08, eval=TRUE, echo=FALSE, include=TRUE}
print(cbind(depth_count, age_count, quant_count, sed_rate_count))
```
### Check for age or depth reversals ###
Check for age or depth reversals (sometimes natural, but often the result of mechanical and/or transcription errors). Because this is an analysis of a finished or published data set, no warnings should be generated. In practice, for "new" versions of a database, several iterations will be required to clean everything up.
```{r parse09, eval=FALSE, echo=TRUE}
# check for age or depth reversals, and zero or negative sed rates (in nonmissing data)
depth_reversal <- 0; age_reversal <- 0; sed_rate_zeroneg <- 0
for (i in 2:nsamp) {
if (age[i] != miss && age[i-1] != miss && age[i] <= age[i-1]) age_reversal=1
if (depth[i] != miss && depth[i-1] != miss) {
if (depth[i] <= depth[i-1]) depth_reversal=1
}
}
for (i in 2:nsamp) {
if (sed_rate[i] != miss && sed_rate[i] <= 0.0) sed_rate_zeroneg=1
}
```
```{r parse10, eval=TRUE, echo=FALSE, include=TRUE}
print(cbind(depth_reversal, age_reversal, sed_rate_zeroneg))
```
### Set and write out various flags ###
Check for various issues like partial records, age reversals, and so on, and write notes into the debug file.
```{r parse11, eval=FALSE, echo=TRUE}
# set and write out various flags
if (depth_count != 0 && depth_count != nsamp) {
writeLines(paste("**** has a missing depth when some are nonmissing", sep=" "), con = debugfile, sep = "\n")
}
if (age_count != 0 && age_count != nsamp) {
writeLines(paste("**** has a missing age when some are nonmissing", sep=" "), con = debugfile, sep = "\n")
}
if (quant_count != 0 && quant_count != nsamp) {
writeLines(paste("**** has a missing quantity when some are nonmissing", sep=" "), con = debugfile, sep = "\n")
}
if (sed_rate_count != 0 && sed_rate_count != nsamp) {
writeLines(paste("**** has a missing sed rate when some are nonmissing", sep=" "), con = debugfile, sep = "\n")
}
if (depth_reversal != 0) {
writeLines(paste("**** has a depth reversal", sep=" "), con = debugfile, sep = "\n")
}
if (age_reversal != 0) {
writeLines(paste("**** has an age reversal", sep=" "), con = debugfile, sep = "\n")
}
if (sed_rate_zeroneg != 0) {
writeLines(paste("**** has zero or negative sed rates", sep=" "), con = debugfile, sep = "\n")
}
```
### Get alternative quantities ###
The next step is to get alternative quantities for each record, i.e. if the records for a particular site are influx values, get concentrations, and if the records are concentrations, get influx. For charcoal:pollen ratios (`C0P0`), treat the data as though they were concentrations. Where no translations are possible, simply copy the data.
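The conversions rest on the identity influx = concentration × sedimentation rate (equivalently, concentration = influx × unit deposition time). A toy illustration with hypothetical values:
```{r convToy, eval=FALSE, echo=TRUE}
# toy example (hypothetical values): converting between concentration and influx
conc_ex     <- 250.0                 # concentration (particles cm^-3)
sed_rate_ex <- 0.1                   # sedimentation rate (cm yr^-1)
influx_ex   <- conc_ex * sed_rate_ex # influx = 25 particles cm^-2 yr^-1
influx_ex * (1.0/sed_rate_ex)        # back to concentration: 250
```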
```{r parse13, eval=FALSE, echo=TRUE}
# alternative quantities
conc <- rep(miss, nsamp); influx <- rep(miss, nsamp)
influx_source <- rep("none", nsamp) ; conc_source <- rep("none", nsamp)
# select case based on xst_level
if (xst_level == "INFL") # adopt influx values as they are, calculate concentration
{
influx <- quant
influx_source <- "data"
if (influx[1] != miss && unit_dep_time[1] != miss && sed_rate[1] != 0.0) { # site-level test on the first sample ('if' requires a length-one condition)
conc <- influx * unit_dep_time
conc_source <- "calculated from influx "
} else {
conc <- quant
conc_source <- "copied from quant "
}
writeLines("INFL", con = debugfile, sep = "\n")
}
else if (xst_level == "CONC") # calculate influx, adopt conc values as they are
{
conc <- quant
conc_source <- "data"
if (conc[1] != miss && sed_rate[1] != miss && sed_rate[1] != 0.0) { # site-level test on the first sample
influx <- quant * sed_rate
influx_source <- "calculated from conc "
} else {
influx <- quant
influx_source <- "copied from quant "
}
writeLines("CONC", con = debugfile, sep = "\n")
}
else if (xst_level == "C0P0") # assume quantity is concentration like
{
conc <- quant
conc_source <- "C0P0"
if (sed_rate[1] != miss && sed_rate[1] != 0.0) { # site-level test on the first sample
influx <- quant * sed_rate
influx_source <- "calculated from C0P0 (conc) "
} else {
influx <- quant
influx_source <- "copied from quant "
}
writeLines("C0P0", con = debugfile, sep = "\n")
}
else if (xst_level == "SOIL") # just copy
{
conc <- quant
conc_source <- "copied from quant "
influx <- quant
influx_source <- "copied from quant "
writeLines("SOIL", con = debugfile, sep = "\n")
}
else if (xst_level == "OTHE") # just copy
{
conc <- quant
conc_source <- "copied from quant "
influx <- quant
influx_source <- "copied from quant "
writeLines("OTHE", con = debugfile, sep = "\n")
}
else
{
conc <- quant
conc_source <- "copied from quant "
influx <- quant
influx_source <- "copied from quant "
writeLines("Unknown", con = debugfile, sep = "\n")
}
} # closes "if (nsamp > 0)"
```
The results look like the following:
```{r parse14, eval=TRUE, echo=FALSE, include=TRUE}
head(outdata)
```
### Further checks ###
Check for and report instances where influx values are 0.0 everywhere.
```{r parse15, eval=FALSE, echo=TRUE}
# check for influx == 0.0 everywhere (nzero counts the non-zero influx values)
nzero <- sum(influx != 0.0)
if (nzero == 0) {
writeLines(paste("**** has no non-zero influx values", sep=" "), con = debugfile, sep = "\n")
}
```
### Write out the .csv file ###
Then, for each site, assemble the appropriate data, and write out a .csv file for the current site:
```{r parse17, eval=FALSE, echo=TRUE}
# .csv out
if (nsamp > 0 && nzero > 0) {
# get siteid string, zero-padded to four digits (e.g. 1 -> "0001")
siteid <- sprintf("%04d", j)
sitehdr <- paste("site", siteid, sep="")
# assemble output data and write it out
samplenum <- seq_len(nsamp)
outdata <- data.frame(samplenum,sitedata$ID_SAMPLE, depth, age, sed_rate, quant, conc,
influx, xst_level, conc_source, influx_source)
names(outdata) <- c(sitehdr, "id_sample", "depth", "est_age", "sed_rate", "quant", "conc",
"influx", "xst_level", "conc_source", "influx_source" )
csvfile <- paste(csvpath,siteid,"_data.csv", sep="")
write.csv(outdata, csvfile, row.names=FALSE)
}
```
Bottom of the loop; close the debug file:
```{r parse19, eval=FALSE, echo=TRUE, warning=FALSE}
} # end of the loop over sites
close(debugfile)
```
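As a quick sanity check, one of the site files just written can be read back in. A sketch, assuming site 1 (Cygnet L.) was written out as `0001_data.csv`:
```{r readBack, eval=FALSE, echo=TRUE}
# read back the first site file and plot influx against age (assumes site 1 was written)
cygnet <- read.csv(paste(csvpath, "0001_data.csv", sep=""))
plot(cygnet$est_age, cygnet$influx, type="l", xlab="age (yr BP)", ylab="influx")
```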