---
title: "Covariates for observations and background"
cache: false
---
Here we do the multi-step task of associating observations with environmental covariates and creating a background point data set used by the model to characterize the environment.
## Background points to characterize the environment
Presence-only modeling using [MaxEnt](https://biodiversityinformatics.amnh.org/open_source/maxent/) or the pure R implementation [maxnet](https://github.com/BigelowLab/maxnet) requires that a sample representing "background" be provided for the model building stage. Background points are used to characterize the environment in which the presence points are found; the modeling algorithm uses that information to discriminate between suitable and unsuitable environments. It is good practice to sample "background" in the same spatial and temporal range as the presence data. That means we need to define a bounding polygon around the presence locations from which we can sample, and we also need to sample through time.
## Loading data
We have two data sources to load: point observation data and rasterized environmental predictor data.
### Load the observation data
We'll load our [OBIS](https://obis.org/taxon/127405) observations as a spatial (`sf`) table, and immediately filter the data to records from 2000 to the present.
```{r}
source("setup.R")
obs = read_obis(form = "sf") |>
  dplyr::filter(date >= as.Date("2000-01-01")) |>
  dplyr::glimpse()
```
### Load the environmental predictors
Next we load the environmental predictors, `sst` and `wind` (as `windspeed`, `u_wind` and `v_wind`). For each we first read in the database, then call a convenience function that reads the files and assembles them into a single `stars` class object.
```{r}
sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
  dplyr::arrange(date)
wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
  dplyr::arrange(date)
windspeed_db = wind_db |>
  dplyr::filter(param == "windspeed")
u_wind_db = wind_db |>
  dplyr::filter(param == "u_wind")
v_wind_db = wind_db |>
  dplyr::filter(param == "v_wind")
preds = read_predictors(sst_db = sst_db,
                        windspeed_db = windspeed_db,
                        u_wind_db = u_wind_db,
                        v_wind_db = v_wind_db)
preds
```
We'll set these aside for a moment and come back to them after we have established our background points.
## Sampling background data
We need to create a random sample of background in both time and space.
#### How many samples?
Now we can sample, **but how many**? Let's start by selecting approximately **four times** as many background points as we have observation points. If that turns out to be too many, we can sub-sample as needed; if it isn't enough, we can come back and increase the number. Keep in mind that we may lose some samples in the subsequent steps when making the spatial sample.
### Sampling time
Sampling time requires us to consider that the occurrences are not evenly distributed through time. We can see that using a histogram of observation dates by month.
First, let's add a variable to `obs` that reflects the first day of the month of the observation. We'll use the `current_month()` function from the [oisster](https://github.com/BigelowLab/oisster) package to compute that. We then use that to define the breaks (or bins) of a histogram.
```{r}
obs = obs |>
  dplyr::mutate(month_id = oisster::current_month(date))
date_range = range(obs$month_id)
breaks = seq(from = date_range[1], to = date_range[2], by = "month")
H = hist(obs$month_id, breaks = breaks, format = "%Y",
         freq = TRUE, main = "Observations",
         xlab = "Date")
```
Clearly the observations bunch up during certain times of the year, so they are not randomly distributed in time.
Now we have a choice: sample randomly across the entire time span, or weight the sampling to match the distribution of observations. Context matters. Since the observations are not the product of systematic surveys but are instead presence observations, we need to keep in mind that we are modeling human behavior: we are modeling observations made by people who report them.
#### Unweighted sampling in time
If the purpose of the sampling isn't to mimic the distribution of observations in time, but instead to characterize the environment, then we would make an unweighted sample across the time range.
:::{.callout-note}
Note that we set the random number generator seed. This isn't a requirement, but we use it here so that we get the same random selection each time we render the page. Here's a nice discussion about `set.seed()` [usage](https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function).
:::
```{r}
set.seed(1234)
nback = nrow(obs) * 4
days_sample = sample_time(obs$date, size = nback, by = "month", replace = TRUE)
```
Now we can plot the same histogram, but with the unweighted `days_sample` data.
```{r}
unweightedH = hist(days_sample, breaks = 'month',
                   format = "%Y",
                   freq = TRUE,
                   main = "Sample",
                   xlab = "Date")
```
#### Weighted sampling in time{#sec-timeweighting}
Let's take a look at the same process, but this time we'll use a weight to sample more from the times when we tend to have more observations. We'll use the original histogram counts as the weights.
```{r}
set.seed(1234)
days_sample = sample_time(obs$date, size = nback, by = "month", replace = TRUE, weighted = TRUE)
```
Now we can plot the same histogram, but with the weighted `days_sample` data.
```{r}
weightedH = hist(days_sample, breaks = 'month',
                 format = "%Y",
                 freq = TRUE,
                 main = "Sample",
                 xlab = "Date")
```
In this case, we are modeling the event that an observer spots **and reports** a *Mola mola*, so we want the background to characterize the times when those events occur. We'll use the weighted time sample.
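To make the difference between the two strategies concrete, here is a minimal base R sketch of the idea. It is not the implementation of `sample_time()` (that helper comes from `setup.R`, and its internals aren't shown here); the toy dates and the names `toy_dates`, `months` and `counts` are made up for illustration, and we assume the weights are simply the number of observations in each month.

```r
# Toy observation dates, clustered in summer (made-up values)
toy_dates = as.Date(c("2020-01-15", "2020-07-03", "2020-07-19", "2020-07-28",
                      "2020-08-02", "2020-08-21", "2020-12-05"))

# Snap each date to the first day of its month
first_of_month = as.Date(format(toy_dates, "%Y-%m-01"))

# Every month in the observed range is a candidate for sampling
months = seq(from = min(first_of_month), to = max(first_of_month), by = "month")

# Number of observations per candidate month (zero for empty months)
counts = as.integer(table(factor(as.character(first_of_month),
                                 levels = as.character(months))))

set.seed(1234)
n = 1000
# Unweighted: every month in the range is equally likely
unweighted = sample(months, size = n, replace = TRUE)
# Weighted: months are drawn in proportion to their observation counts
weighted = sample(months, size = n, replace = TRUE, prob = counts)

table(format(unweighted, "%b"))
table(format(weighted, "%b"))
```

In the weighted draw, months with no observations get a weight of zero and are never selected, while the unweighted draw spreads the sample across the whole range.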
### Sampling space
The [sf](https://CRAN.R-project.org/package=sf) package provides a function, `st_sample()`, for sampling points within a polygon. But what polygon? We could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has its advantages and disadvantages. We show how to make one of each.
#### The bounding box polygon
This is the easiest of the three polygons to make.
```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
  sf::st_geometry()
box = sf::st_bbox(obs) |>
  sf::st_as_sfc()
plot(coast, extent = box, axes = TRUE)
plot(box, lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```
Hmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land). That could limit the utility of the model.
#### The convex hull polygon
Another easy polygon to make is a convex hull, often described as a rubber band stretched around the point locations. The key here is to take the union of the points first, which creates a single MULTIPOINT object. If you don't, you'll get a convex hull around every point... oops.
```{r}
chull = sf::st_union(obs) |>
  sf::st_convex_hull()
plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```
Well, that's an improvement, but the hull still covers large areas with no observations, as well as most of Nova Scotia.
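The "union first" step is easy to get wrong, so here is a small sketch of what happens with and without it. The five points are made-up coordinates with no CRS, and `pts`, `per_point` and `hull` are illustrative names rather than objects from this workflow.

```r
library(sf)

# Five toy points: the corners of a unit square plus its center (made-up values)
pts = st_as_sf(data.frame(x = c(0, 1, 1, 0, 0.5),
                          y = c(0, 0, 1, 1, 0.5)),
               coords = c("x", "y"))

# Without the union, the hull is computed feature by feature,
# so we get back one geometry per input point
per_point = st_convex_hull(pts)

# With the union, the points are combined into a single MULTIPOINT
# and the hull is a single polygon around all of them
hull = st_convex_hull(st_union(pts))

nrow(per_point)
length(hull)
as.character(st_geometry_type(hull))
```

The first result has as many rows as there are points, while the second is a single polygon, which is the shape we want to sample from.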
#### The buffered polygon
An alternative is to create a buffered polygon around the MULTIPOINT object. We like to think of this as the "shrink-wrap" version as it follows the general contours of the points. We arrived at a buffering distance of 75000 m through trial and error, and then added smoothing for no reason other than to improve aesthetics.
```{r}
poly = sf::st_union(obs) |>
  sf::st_buffer(dist = 75000) |>
  sf::st_union() |>
  sf::st_simplify() |>
  smoothr::smooth(method = 'chaikin', refinements = 10L)
plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```
That seems the best yet, but we still sample on land. We'll oversample and toss out the points that fall on land. Let's save this polygon in case we need it later.
```{r}
ok = dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg"))
```
#### Sampling the polygon
Now to sample within the polygon: we'll draw the same number of points we selected earlier. Note that we also set the same seed (for demonstration purposes).
```{r}
set.seed(1234)
bkg = sf::st_sample(poly, nback)
plot(coast, extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
plot(sf::st_geometry(coast), add = TRUE)
```
OK - we can work with that! Some points still fall on land, but most do not. The following section shows how to use SST maps to filter out errant background points.
#### Purging points that are on land (or very nearshore)
It's great if you have in hand a map that distinguishes between land and sea, like we do with `sst`. We'll extract values `v` from just the first `sst` layer (hence the slice).
```{r}
v = preds['sst'] |>
  dplyr::slice(along = "time", 1) |>
  stars::st_extract(bkg) |>
  sf::st_as_sf() |>
  dplyr::mutate(is_water = !is.na(sst), .before = 1) |>
  dplyr::glimpse()
```
Locations where `sst` is NA are beyond the scope of the OISST data set, so we will take NA to mean land (or very nearshore). We'll merge our `bkg` object with the random dates (`days_sample`), then filter to include only water.
```{r}
bkg = sf::st_as_sf(bkg) |>
  sf::st_set_geometry("geometry") |>
  dplyr::mutate(date = days_sample, .before = 1) |>
  dplyr::filter(v$is_water)
plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
```
**Note** that the bottom of the scatter is cut off. That tells us that the `sst` raster has been cropped to that southern limit. We can confirm that easily.
```{r}
plot(preds['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)
```
## Extract environmental covariates for `sst` and `wind`
### Wait, what about dates?
You may have already spotted an issue: our background points have daily dates, but our covariates are monthly (identified by the first day of each month). We can manage that by adding a second date, `month_id`, to the `bkg` table.
```{r}
bkg = dplyr::mutate(bkg, month_id = oisster::current_month(date))
```
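For readers without the oisster package at hand, the date snapping can be sketched in base R. This assumes, as the text above describes, that `oisster::current_month()` returns the first day of the month for each date; `first_of_month()` is a hypothetical stand-in written for illustration, not the package's code.

```r
# Hypothetical stand-in for oisster::current_month():
# snap each date to the first day of its month
first_of_month = function(x) {
  as.Date(format(x, "%Y-%m-01"))
}

# Made-up daily dates: two in June, one on the first of July
daily = as.Date(c("2015-06-17", "2015-06-30", "2015-07-01"))
month_id = first_of_month(daily)
month_id
```

Both June dates share the same `month_id`, so they would be matched to the same monthly layer of the predictors.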
### Extract background points
Here we go back to the complete covariate dataset, `preds`. We extract, specifying which variable in `bkg` is mapped to the time dimension of the predictors; in our case the newly computed `month_id` matches the `time` dimension in `sst`. We'll save the values while we are at it.
```{r}
bkg_values = stars::st_extract(preds, bkg, time_column = 'month_id') |>
  sf::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) |>
  dplyr::glimpse()
```
### Next extract observation points
It's the same workflow to extract covariates for the observations as it was for the background, but let's not forget to add in a variable to identify the month that matches those in the predictors.
```{r}
obs = dplyr::mutate(obs, month_id = oisster::current_month(date))
obs_values = stars::st_extract(preds, obs, time_column = 'month_id') |>
  sf::write_sf(file.path("data", "obs", "obs-covariates.gpkg")) |>
  dplyr::glimpse()
```
That's it! Next we can start assembling a model.