-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathMethods.Rmd
353 lines (300 loc) · 17.6 KB
/
Methods.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: "Yojoa Remote Sensing Methods"
author: "B Steele, Matt Ross"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include = F}
library(tidyverse)
library(ggthemes)
library(readr)
is_dir <- "data/in-situ/"
rs_dir <- "data/upstreamRS/"
match_dir <- "data/matchups/"
models_dir <- "data/models/"
```
# Methods
```{r load data, include = F}
stack = read_csv(file.path(rs_dir, 'yojoa_corr_rrs_met_v2023-06-15.csv')) %>%
filter(date < ymd('2023-01-01'))
secchi = read_csv(file.path(is_dir, 'Secchi_completedataset.csv')) %>%
mutate(secci = as.numeric(secchi)) %>%
filter(!is.na(secchi)) %>%
mutate(location = gsub(' ', '', location))
match5 = read_csv(file.path(match_dir, 'fiveDay_LS-Secchi_matchups_n237.csv'))
match71 = read_csv(file.path(match_dir, 'multiDay_LS-Secchi_matchups_n238.csv'))
```
## Landsat Stack
Median Landsat Collection 2 Surface Reflectance (Masek, et al., 2006, Vermote,
et al., 2016) values (Rrs) were obtained for the 18 sampling locations in Lake
Yojoa following the methods described in Topp, et al. (2021). The Landsat stack
was comprised of `r length(unique(stack$system.index))` unique scenes comprised
of data from Landsat 5, 7, 8, and 9 (Table 1). This stack contained `r
nrow(stack)` Rrs summaries (Table 2) at the 18 locations between the dates of
`r min(stack$date)` and `r max(stack$date)`. Minor adaptations were made for
the transition for Collection 1 to Collection 2 Landsat data to account for
differences in scaling factors between collections. Rrs summaries included only
'confident' water pixels as defined by the dynamic surface water extent
algorighm (Jones, 2019). Data were filtered for reasonable values for water
reflectance (-0.01 \< Rrs \< 0.2) for all bands (Blue, Red, Green, Near
Infrared, Shortwave Infrared 1, Shortwave Infrared 2). Inter-mission handoff
coefficients to standardize Rrs values due to slight changes in sensors and
atmospheric correction between missions (Gardner, et al. 2021) were calculated
based on Landsat Collection 2 data acquired from all lakes greater than 25
hectacres within Guatemala, Honduras, and El Salvador (described in *Regional
Handoff Coefficients* below). All Landsat data were acquired using the Google
Earth Engine Python Application Programming Interface (API) in RStudio version
2023.03.0, R version 4.2.3 (R Core Team 2023), and Python version 3.8 (Python
Software Foundation, <https://www.python.org/>).
Landsat data acquisition was completed in the scripts within the 'landsat_c2'
folder. Within this folder there are two subfolders of scripts: `literateCode`
which contains the scripts used to acquire the Yojoa stack and `forHandoffs`
which contains the scripts used to acquire the regional remote sensing stack
for the handoff calculations. Landsat data collation was completed in the
script `programs/1_RS_collate_harmonize.Rmd`. Regional remote sensing data were
collated in the script `programs/3_regionalRS_collate_harmonize.Rmd`. Handoff
calculations were made in the script `programs/4_regional_handoff_calcs.Rmd`
and were applied in the script `programs/5_apply_handoff_coefficients.Rmd`.
```{r getCount, echo = F}
# save function to get row count
getCount <- function(dataset, miss, countType) {
if (miss == "all") {
df <- dataset
} else {
df <- dataset %>%
filter(mission == miss)
}
if (countType == "imageDate") {
length(unique(df$system.index))
} else {
if (countType == "matches") {
nrow(df)
} else {
print("oops")
}
}
}
```
| Landsat Mission | Number of Unique Landsat Scenes | Number of Unique Landsat
Scenes with ±5 day matchups | **Number of Unique Landsat Scenes with ±7/1 day
matchups** |
|------------------|------------------|------------------|------------------| |
Landsat 5 | `r getCount(stack, 'LANDSAT_5', 'imageDate')` | `r getCount(match5,
'LANDSAT_5', 'imageDate')` | `r getCount(match71, 'LANDSAT_5', 'imageDate')` |
| Landsat 7 | `r getCount(stack, 'LANDSAT_7', 'imageDate')` | `r
getCount(match5, 'LANDSAT_7', 'imageDate')` | `r getCount(match71, 'LANDSAT_7',
'imageDate')` | | Landsat 8 | `r getCount(stack, 'LANDSAT_8', 'imageDate')` |
`r getCount(match5, 'LANDSAT_8', 'imageDate')` | `r getCount(match71,
'LANDSAT_8', 'imageDate')` | | Landsat 9 | `r getCount(stack, 'LANDSAT_9',
'imageDate')` | `r getCount(match5, 'LANDSAT_9', 'imageDate')` | `r
getCount(match71, 'LANDSAT_9', 'imageDate')` | | Total | `r getCount(stack,
'all', 'imageDate')` | `r getCount(match5, 'all', 'imageDate')` | `r
getCount(match71, 'all', 'imageDate')` |
: Table 1. Summary of number of unique Landsat scenes in the full Landsat
stack, the ±5 day matchup dataset, and the ±7/1 day dataset.
| Landsat Mission | Number of Valid Location-Landsat Images | Number of Valid
Location-Landsat Images in ±5 day matchups | **Number of** Valid
Location-Landsat Images in **±7/1 day matchups** |
|------------------|------------------|------------------|------------------| |
Landsat 5 | `r getCount(stack, 'LANDSAT_5', 'matches')` | `r getCount(match5,
'LANDSAT_5', 'matches')` | `r getCount(match71, 'LANDSAT_5', 'matches')` | |
Landsat 7 | `r getCount(stack, 'LANDSAT_7', 'matches')` | `r getCount(match5,
'LANDSAT_7', 'matches')` | `r getCount(match71, 'LANDSAT_7', 'matches')` | |
Landsat 8 | `r getCount(stack, 'LANDSAT_8', 'matches')` | `r getCount(match5,
'LANDSAT_8', 'matches')` | `r getCount(match71, 'LANDSAT_8', 'matches')` | |
Landsat 9 | `r getCount(stack, 'LANDSAT_9', 'matches')` | `r getCount(match5,
'LANDSAT_9', 'matches')` | `r getCount(match71, 'LANDSAT_9', 'matches')` | |
Total | `r getCount(stack, 'all', 'matches')` | `r getCount(match5, 'all',
'matches')` | `r getCount(match71, 'all', 'matches')` |
: Table 2. Summary of number of valid Landsat observations for individual
locations in Yojoa in the full Landsat stack, the ±5 day matchup dataset, and
the ±7/1 day dataset.
```{r, echo = F}
getSiteCount <- function(dataset, site) {
if (site == "all") {
df <- dataset
nrow(df)
} else {
df <- dataset %>%
filter(location == site)
nrow(df)
}
}
```
| Site | Number of Secchi Measurements | Number of Secchi-Landsat Matchups in
±5 day dataset | Number of Secchi-Landsat Matchups in ±7/1 day dataset. |
|------------------|------------------|------------------|------------------| |
A | `r getSiteCount(secchi, 'A')` | `r getSiteCount(match5, 'A')` | `r
getSiteCount(match71, 'A')` | | B | `r getSiteCount(secchi, 'B')` | `r
getSiteCount(match5, 'B')` | `r getSiteCount(match71, 'B')` | | C | `r
getSiteCount(secchi, 'C')` | `r getSiteCount(match5, 'C')` | `r
getSiteCount(match71, 'C')` | | D | `r getSiteCount(secchi, 'D')` | `r
getSiteCount(match5, 'D')` | `r getSiteCount(match71, 'D')` | | E | `r
getSiteCount(secchi, 'E')` | `r getSiteCount(match5, 'E')` | `r
getSiteCount(match71, 'E')` | | F | `r getSiteCount(secchi, 'F')` | `r
getSiteCount(match5, 'F')` | `r getSiteCount(match71, 'F')` | | G | `r
getSiteCount(secchi, 'G')` | `r getSiteCount(match5, 'G')` | `r
getSiteCount(match71, 'G')` | | H | `r getSiteCount(secchi, 'H')` | `r
getSiteCount(match5, 'H')` | `r getSiteCount(match71, 'H')` | | I | `r
getSiteCount(secchi, 'I')` | `r getSiteCount(match5, 'I')` | `r
getSiteCount(match71, 'I')` | | J | `r getSiteCount(secchi, 'J')` | `r
getSiteCount(match5, 'J')` | `r getSiteCount(match71, 'J')` | | K | `r
getSiteCount(secchi, 'K')` | `r getSiteCount(match5, 'K')` | `r
getSiteCount(match71, 'K')` | | L | `r getSiteCount(secchi, 'L')` | `r
getSiteCount(match5, 'L')` | `r getSiteCount(match71, 'L')` | | M | `r
getSiteCount(secchi, 'M')` | `r getSiteCount(match5, 'M')` | `r
getSiteCount(match71, 'M')` | | N | `r getSiteCount(secchi, 'N')` | `r
getSiteCount(match5, 'N')` | `r getSiteCount(match71, 'N')` | | O | `r
getSiteCount(secchi, 'O')` | `r getSiteCount(match5, 'O')` | `r
getSiteCount(match71, 'O')` | | P | `r getSiteCount(secchi, 'P')` | `r
getSiteCount(match5, 'P')` | `r getSiteCount(match71, 'P')` | | Q | `r
getSiteCount(secchi, 'Q')` | `r getSiteCount(match5, 'Q')` | `r
getSiteCount(match71, 'Q')` | | R | `r getSiteCount(secchi, 'R')` | `r
getSiteCount(match5, 'R')` | `r getSiteCount(match71, 'R')` | | Total | `r
getSiteCount(secchi, 'all')` | `r getSiteCount(match5, 'all')` | `r
getSiteCount(match71, 'all')` |
: Table 3. Secchi measurement counts per site in the *in situ* record, the
±5-day matchup dataset,and the ±7/1-day matchup dataset.
#### Regional Handoff Coefficients
Median Landsat Collection 2 Surface Reflectance values were obtained for the
Chebyshev center (Yang, 2020) for all lakes greater than 25 hectares within or
touching the boundaries of the countries of Guatemala, Honduras, and El
Salvador from the spatial dataset produced by Messager, et al. (2016).
Acquisition of these data followed the same methods as those for Lake Yojoa
described above.
Centers of the lakes from the spatial dataset were calculated in the GEE IDE
using the method from Yang (2020) with the script
`GEE_scripts/guat_elsal_hon_getCenters.js`.
## Secchi-RS matchups
Secchi data were matched with records from the Landsat remote sensing stack at
a number of time windows: same, one, two, three, and five day windows. In
addition to these standard windows, we employed a variable window defined by
local knowledge. In this method, we allowed for matchups up to ±7 days for all
months where the conditions of the lake are relatively consistent and in
October and November, months that often have sudden clarity changes, only
matches ±1 day were permitted. Windows up to ±7 days have yielded reasonable
results when paired with remote sensing data (Kloiber, et al. 2002). We avoided
discrete over-matching of our data by assuring that each satellite overpass
instance was only paired with the nearest-in-time Secchi measurement and that
each discrete Secchi measurement was only paired with a single, nearest-in-time
valid satellite overpass. Only matchup datasets of ±5 days and ±7/1 produced
reasonable models and are the only datasets summarized in this methods
overview.
Matchups were completed in the script `programs/7_LSC2_secchi_matchups.Rmd`.
## Climate Data
Climate data were aqcuired from the ERA5 dataset (Muñoz, 2019) in the Google
Earth Engine Code Editor (Gorelick, 2017) from a single data point at the
approximate geographical center of Lake Yojoa (14.8768°N, 87.9791°W) for all
available data. ERA5 data provide daily modeled values for an extensive list of
parameters; however, we only used total precipitation, mean air temperature,
total solar radiation, and mean wind speed in our analysis. These data were
aggregated for the previous 3, 5, and 7 days.
Climate data were accessed and summarized in the GEE IDE using the script
`GEE_scripts/era5_download.js`. ERA5 data were collated and aggregated in the
script `programs/2_Process_Summarize_ERA5.Rmd`. Data were paired with the
Landsat stack in the script `programs/6_add_era5_data.Rmd`.
## xgboost Model
We used the R package {xgboost} (Chen, et al. 2023) to develop the best
performing gradient tree boost algorithm for these data. Model features were
median Rrs values for the blue, green, red and near infrared bands, the ratio
of red to green, blue to green, red to blue, green to red, total solar
radiation, maximum air temperature, mean air temperature, minimum air
temperature, total precipitation, and mean wind speed. Various input feature
window combinations were tested, including providing the program with the
previous day's meteorology as well as a summary of the previous 5 or 7 days.
We experimented with two data handling methods with varying stringency. Both
methods used train-test-validation datasets, where the train and test were
provided for model development and the validation were hold out data to test
performance independently of model development. The first method split the data
as 60% of matchups as the training dataset, 20% for testing, and 20% for
validation (`programs/8_xgboost_stringent.Rmd`). The very stringent method
split the Landsat image-dates with the same proportions, so that there was no
image-date crossover between the train, test, or validation dataset
(`programs/9_xgboost_very_stringent.Rmd`).
Because we were particularly interested in performance at the upper end of
Secchi (higher clarity), we also tried an xgboost model that weighted higher
Secchi matchups as more important (`programs/10_xgboost_higher_secchi.Rmd`).
This script uses the same data handling stringency as the 'very stringent\`
method.
To select the best optimal xgboost hyperparameters, we used a grid search
method partitioning the top 20 performing models as measured by lowest RMSE.
From these models, we selected the booster that had the lowest RMSE and a
train-test RMSE that was within 0.15m to avoid selecting an overfit model. If
no models met these conditions the one with the closest train-test RMSE was
selected as the optimal {xgboost} model.
## Stepwise Regression Model
We used the R packages {caret} (Kuhn, 2008) and {leaps} (Lumley, 2020) to
perform backwards stepwise regression. We tested the same combinations of
matchup windows and meteorological data as with the {xgboost} package, except
all input data were normalized to values between 0 and 1 using a min-max
scaling method. Because this method uses cross-wise validation (10 times cv),
no test data were provided to the model. 70% of the data were used in model
development and the remaining 30% were holdout data to examine the results
independently. Data partitioning was completed by total number of matchups not
by image date.
## Time series of estimated Secchi
The time series of Secchi estimates from the four best performing models (one
from each of the above models) have been calculated in a single file
(`data/landsat_estimations/Yojoa_LS-derived_Secchi_estimates.csv`, created in
`programs/12_apply_models.Rmd`). The four column names indicate the model used
to create the Secchi estimates. Keep in mind that input values of the features
(the band and met data that define the model) that are outside of the range of
values used to train the model can create equally out-of-range Secchi
estimations in the stepwise regression model. In general, {xgboost} handles
these outliers better, but a very conservative and cautious approach dictates
that Secchi estimations made with input values that are beyond the range of
those in model development should be treated with skepticism.
------------------------------------------------------------------------
# Acknowledgements
Landsat Collection 2 Level 2 Science Products courtesy of the U.S. Geological
Survey.
------------------------------------------------------------------------
# Citations
Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano
I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y, Yuan J (2023). \_xgboost: Extreme
Gradient Boosting\_. R package version 1.7.3.1,
\<[[https://CRAN.R-project.org/package=xgboost\\\\](https://CRAN.R-project.org/package=xgboost){.uri}](%5Bhttps://CRAN.R-project.org/package=xgboost%5D(https://CRAN.R-project.org/package=xgboost)%7B.uri%7D){.uri}\>.
Gardner, J. R., Yang, X., Topp, S. N., Ross, M. R.. V., Altenau, E. H., &
Pavelsky, T. M. (2021). The color of rivers. *Geophysical Research Letters*,
48, e2020GL088946. <https://doi.org/10.1029/2020GL088946>
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R.
(2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone.
*Remote sensing of Environment*, *202*, 18-27.
Jones, J. (2019). Improved automated detection of subpixel-scale
inundation-revised dynamic surface water extent (DSWE) Partial surface water
tests. Remote Sensing, **11**(4), 374. <https://doi.org/10.3390/rs11040374>
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package.
Journal of Statistical Software, 28(5), 1--26.
<https://doi.org/10.18637/jss.v028.i05>
Kloiber, S. M., Brezonik, P. L., Olmanson, L. G., & Bauer, M. E. (2002). A
procedure for regional lake water clarity assessment using Landsat
multispectral data. *Remote sensing of Environment*, *82*(1), 38-47.
Lumley, T. (2020). \_leaps: Regression Subset Selection\_. R package version
3.1, <https://CRAN.R-project.org/package=leaps>
Muñoz Sabater, J., (2019): ERA5-Land daily averaged data from 1981 to present.
Copernicus Climate Change Service (C3S) Climate Data Store (CDS). (2023-03-30).
Messager, M.L., Lehner, B., Grill, G., Nedeva, I., Schmitt, O. (2016).
Estimating the volume and age of water stored in global lakes using a
geo-statistical approach. Nature Communications, 7: 13603.
<https://doi.org/10.1038/ncomms13603>
Masek, J.G., Vermote, E.F., Saleous N.E., Wolfe, R., Hall, F.G., Huemmrich,
K.F., Gao, F., Kutler, J., and Lim, T-K. (2006). A Landsat surface reflectance
dataset for North America, 1990--2000. IEEE Geoscience and Remote Sensing
Letters 3(1):68-72. <http://dx.doi.org/10.1109/LGRS.2005.857030>.
R Core Team (2023). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. URL
<https://www.R-project.org/>.
Topp, S. N., Pavelsky, T. M., Dugan, H. A., Yang, X., Gardner, J., & Ross, M.
R. V. (2021). Shifting patterns of summer lake color phenology in over 26,000
US lakes. *Water Resources Research*, 57, e2020WR029123.
<https://doi.org/10.1029/2020WR029123>
Vermote, E., Justice, C., Claverie, M., & Franch, B. (2016). Preliminary
analysis of the performance of the Landsat 8/OLI land surface reflectance
product. Remote Sensing of
Environment. <http://dx.doi.org/10.1016/j.rse.2016.04.008>.
Yang, Xiao. (2020). Deepest point calculation for any given polygon using
Google Earth Engine JavaScript API (Version v2). Zenodo.
<https://doi.org/10.5281/zenodo.6341960>
```{r, echo = F}
knitr::wrap_rmd('Methods.Rmd', width = 80, backup = NULL) #note, this will not wrap text that are prefaced by any special characters (like bullets!)
```