-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathpresample-bin_07.Rmd
191 lines (146 loc) · 8.29 KB
/
presample-bin_07.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
---
title: "Presampling or binning of transformed data (presample-bin.R)"
output:
html_document:
css: SI-md-08.css
fig_caption: yes
highlight: haddock
number_sections: yes
theme: united
toc: yes
toc_float: false
collapsed: no
---
```{r presample-bin, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=TRUE)
pdf.options(useDingbats = TRUE)
```
# Introduction #
The purpose of this script (`presample-bin.R`) is to "presample" or bin the transformed data. Charcoal data is available at all kinds of "native" resolutions, from samples that represent decades or centuries (or longer) to those that represent annual deposition. Further, some records have been interpolated to pseudo-annual time steps. In developing composite curves, those records with higher resolutions will contribute disproportionately to the curve. There are two general approaches for dealing with this: 1) weighting individual charcoal (influx or concentration) values according to their resolution, with lower-resolution records receiving higher weights, and vice-versa, or 2) reducing the sampling frequency of the records to some common interval (without interpolating or creating pseudo data).
We adopted the latter approach because it was more transparent, and because the weighting approach was highly sensitive to the particular weighting scheme adopted. The binning or "presampling" works by creating a set of target bins, typically at decadal or bidecadal intervals, and then processing each record as follows: If a particular record has multiple samples that fall within a specific bin, they are combined by simple averaging, if a particular record has a single sample that falls within a specific bin, that value is adopted as the value for the bin, but if a particular record has no values that fall within a specific bin, none are created by interpolation. The approach implemented here in R differs slightly from the original Fortran implementation, but not in any appreciable way. There are two main parts to this script: 1) a set-up part that contains path and file names, along with the base-period specifiation (that change from run-to-run), and 2) the calculation part that generally does not change.
The bins are defined by their midpoints, and the following expression places individual samples into an appropriate bin:
```{r binnum, eval=FALSE}
# this definition of bin number seems to match that implicit in presample.f90
binnum <- as.integer(ceiling((sitedata$est_age-targbeg-(targstep/2))/targstep))+1
```
# Set up
The first step is set various path names and base-period parameter values. The `queryname` is used to compose file and pathnames, `datapath` specifies the folder where the input and output data reside, and `sitelistpath` and `sitelistfile` specify a particular list of sites to be processed.
```{r pathnames}
# presample-bin.R
# presamples, or bins the transformed data into evenly spaced bins with no interpolation
# paths for input and output .csv files -- modify as appropriate
queryname <- "v3i"
datapath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/"
sitelistpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i/v3i_sitelists/"
sitelist <- "v3i_nsa_globe"
```
The `basename` is that used in `trans-and-zscore.R` to identify a particular base period that was used to transform the data. The structure of the bins is defined by a starting and ending age, and interval.
```{r setBinstructure}
## set basename and bin structure
basename <- "zt21k"
targbeg <- -60
targend <- 22000
targstep <- 20
```
Set up a debugging (log) file.
```{r debug}
# debug/log file
debugpath <- "/Projects/GPWG/GPWGv3/GCDv3Data/v3i_Rscripts/v3i_debug/"
# if debug folder does not exist, create it
dir.create(file.path(debugpath), showWarnings=FALSE)
debugname <- "presample-bin_debug.txt"
# open the debug/log file
debugfile <- file(paste(debugpath, debugname, sep=""), "w")
```
# Calculation #
## Initial steps ##
First, various folders and filenames are created. The `dir.create()` function is used to create the folders if they do not already exist.
```{r pathnames2}
# no changes below here
# various path and filenames
sitelistfile <- paste(sitelistpath, sitelist, ".csv", sep="")
transcsvpath <- paste(datapath,queryname,"_trans_csv/",sep="")
presampcsvpath <- paste(datapath,queryname,"_presamp_csv/",sep="")
# if output folder does not exist, create it
dir.create(file.path(datapath, paste(queryname,"_presamp_csv/",sep="")), showWarnings=FALSE)
```
Define the bins
```{r binDefinition}
# bin center (target points) definition
targage <- seq(targbeg, targend, by=targstep)
```
Read the list of sites to be processed.
```{r readSitelist}
# read list of sites
ptm <- proc.time()
sites <- read.csv(sitelistfile, stringsAsFactors=FALSE)
nsites <- length(sites[,1])
print(nsites)
```
## Main loop ##
Loop over the sites, doing the following for each:
1. Compose the trans-and-zscore .csv file name
2. Read the input data
3. Count the number of nonmissing (non-NA) and infinite influx values
4. Find bin number of each sample
5. Get average zt values (and average ages) for the data in each bin
6. Get bin numbers of each bin that had an average (or a single) value
7. Write output
Step 3 determines the number of nonmissing (NA) values of zt for each site, and the number of those nonmissing values that are not infinite (Inf). If there are no nonmissing or noninfinite values, the site is skipped. Average ages of the point in the bin are calculated, but not currently used.
```{r main loop, eval=TRUE, warning=FALSE}
# main loop
for (j in seq(1,nsites)) {
# 1. Compose the trans-and-zscore .csv file name
sitenum <- sites[j,1]
sitename <- as.character(sites[j,5])
siteidchar <- as.character(sitenum)
if (sitenum >= 1) siteid <- paste("000", siteidchar, sep="")
if (sitenum >= 10) siteid <- paste("00", siteidchar, sep="")
if (sitenum >= 100) siteid <- paste("0", siteidchar, sep="")
if (sitenum >= 1000) siteid <- paste( siteidchar, sep="")
inputfile <- paste(transcsvpath, siteid, "_trans_influx_",basename,".csv", sep="")
if (file.exists(inputfile)) {
# 2. Read the input data
sitedata <- read.csv(inputfile)
nsamp <- length(sitedata$zt)
nsampchar <- as.character(nsamp)
writeLines(paste("Site",siteidchar,nsampchar,"samples", sep=" "), con = debugfile, sep = "\n")
# 3. Count the number of nonmissing (non-NA) and infinite influx values
nonmiss <- na.omit(sitedata$zt)
numnonmiss <- length(nonmiss)
numinf <- sum(is.infinite(nonmiss))
numnonmiss; numinf
if (length(nonmiss) > 0 & numinf < numnonmiss) {
# add a column of 1's for counting
sitedata$one <- rep(1,length(sitedata[,1]))
# 4. Find bin number of each sample
# this definition of bin number seems to match that implicit in presample.f90
binnum <- as.integer(ceiling((sitedata$est_age-targbeg-(targstep/2))/targstep))+1
# uncommenting the following reveals how each sample is assigned to a bin
#head(cbind(sitedata$est_age,sitedata$zt,binnum,targage[binnum]), nsamp)
# 5. Get average zt values (and average ages) for the data in each bin
binave <- tapply(sitedata$zt, binnum, mean)
binaveage <- tapply(sitedata$est_age, binnum, mean)
bincount <- tapply(sitedata$one, binnum, sum)
# 6. Get bin numbers of each bin that had an average (or a single) value
binsub <- as.numeric(unlist(dimnames(binave)))
# 7. Write output
presampout <- data.frame(targage[binsub],binave,bincount)
presampout <- na.omit(presampout)
colnames(presampout) <- c("age", "zt", "np")
outputfile <- paste(presampcsvpath, siteid, "_presamp_influx_",basename,"_bw",
as.character(targstep),".csv", sep="")
write.table(presampout, outputfile, col.names=TRUE, row.names=FALSE, sep=",")
}
}
}
```
How long did this take?
```{r howLong}
proc.time() - ptm
```
As the loop executes, one block of information for each site will be printed.
```{r printExample, echo=FALSE}
cat(" [1] 1 \n [1] 1 \n [1] \"Cygnet\" \n [1] \"/Projects/GPWG/GPWGv3/data/v3i/v3i_trans_csv/0001_trans_influx_zt-lme.csv\" \n ...")
```