trad02-covariates.qmd

---
title: "Covariates for observations and background"
cache: false
---

Here we do the multi-step task of associating observations with environmental covariates and creating a background point data set used by the model to characterize the environment.

## Background points to characterize the environment

Presence only modeling using [MaxEnt](https://biodiversityinformatics.amnh.org/open_source/maxent/) or the pure R implementation [maxnet](https://github.com/BigelowLab/maxnet) require that a sample representing "background" is provided for the model building stage. Background points are used to characterize the environment in which the presence points are found, the modeling algorithm uses that information to discriminate between suitability and unsuitability of the environment. It is good practice to sample "background" in the same spatial and temporal range as the presence data.  That means we need to define a bounding polygon around the presence locations from which we can sample, as well as sampling through time.


## Loading data

We have two data sources to load: point observation data and rasterized environmental predictor data.

### Load the observation data
We'll load in our [OBIS](https://obis.org/taxon/127405) observations as a flat table, and immediately filter the data to any occurring from 2000 to present.

```{r}
source("setup.R")

obs = read_obis(form = "sf") |>
  dplyr::filter(date >= as.Date("2000-01-01")) |>
  dplyr::glimpse()
```

### Load the environmental predictors 

Next we load the environmental predictors, `sst` and `wind` (as `windspeed`, `u_wind` and `v_wind`).  For each we first read in the database, then call a convenience reading function that handles the input of the files and assembling into a single `stars` class object.  

```{r}
sst_path = "data/oisst"
sst_db = oisster::read_database(sst_path) |>
  dplyr::arrange(date)

wind_path = "data/nbs"
wind_db = nbs::read_database(wind_path) |>
  dplyr::arrange(date)

windspeed_db = wind_db |>
  dplyr::filter(param == "windspeed")

u_wind_db = wind_db |>
  dplyr::filter(param == "u_wind")

v_wind_db = wind_db |>
  dplyr::filter(param == "v_wind")


preds = read_predictors(sst_db = sst_db,
                        windspeed_db = windspeed_db,
                        u_wind_db = u_wind_db,
                        v_wind_db = v_wind_db)

preds
```

We'll set these aside for a moment and come back to them after we have established our background points.


## Sampling background data

We need to create a random sample of background in both time and space.

#### How many samples?
Now we can sample - **but how many**?  Let's start by selecting approximately **four times** as many background points as we have observation points. If it is too many then we can sub-sample as needed, if it isn't enough we can come back an increase the number.  In addition, we may lose some samples in the subsequent steps making a spatial sample.

### Sampling time

Sampling time requires us to consider that the occurrences are not evenly distributed through time.  We can see that using a histogram of observation dates by month.

First, let's add a variable to `obs` that reflects the first day of the month of the observation.  We'll use the `current_month()` function from the [oisster](https://github.com/BigelowLab/oisster) package to compute that. We then use that to define the breaks (or bins) of a histogram.

```{r}
obs = obs |>
  dplyr::mutate(month_id = oisster::current_month(date))
date_range = range(obs$month_id)
breaks = seq(from = date_range[1], to = date_range[2], by = "month")
H = hist(obs$month_id, breaks = breaks, format = "%Y", 
     freq = TRUE, main = "Observations",
     xlab = "Date")
```

Clearly the observations bunch up during certain times of the year, so they are not randomly distributed in time.

Now we have a choice... sample randomly across the entire time span or weight the sampling to match that the distribution of observations. Context matters.  Since observations are not the product of systematic surveys, but instead are presence observations we need to keep in mind we are modeling human behavior: we are modeling observations of people who report observations.  

#### Unweighted sampling in time
If the purpose of the sampling isn't to mimic the distribution of observations in time, but instead to characterize the environment then we would make an unweighted sample across the time range.

:::{.callout-note}
Note that we set the random number generator seed. This isn't a requirement, but we use it here so that we get the same random selection each time we render the page.  Here's a nice discussion about `set.seed()` [usage](https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function).
:::

```{r}
set.seed(1234)
nback = nrow(obs) * 4
days_sample = sample_time(obs$date, size = nback, by = "month", replace = TRUE)
```

Now we can plot the same histogram, but with the `days_unweighted_sample` data.

```{r}
unweightedH = hist(days_sample, breaks = 'month', 
                   format = "%Y", 
                   freq = TRUE, 
                   main = "Sample",
                   xlab = "Date")
```

#### Weighted sampling in time{#sec-timeweighting}

Let's take a look at the same process but this time we'll use a weight to sample more when we tend to have more observations. We'll use the original histogram counts

```{r}
set.seed(1234)
days_sample = sample_time(obs$date, size = nback, by = "month", replace = TRUE, weighted = TRUE)
```
Now we can plot the same histogram, but with the `days_unweighted_sample` data.

```{r}
weightedH = hist(days_sample, breaks = 'month', 
                   format = "%Y", 
                   freq = TRUE, 
                   main = "Sample",
                   xlab = "Date")
```
In this case, we are modeling the event that an observer spots **and reports** a *Mola mola*, so we want to background to characterize the times when those events occur.  We'll use the weighted time sample.

### Sampling space

The [sf](https://CRAN.R-project.org/package=sf) package provides a function, `st_sample()`, for sampling points within a polygon.  But what polygon?  We have choices as we could use (a) a bounding box around the observations, (b) a convex hull around the observations or (c) a buffered envelope around the observations. Each has it's advantages and disadvantages.  We show how to make one of each.

#### The bounding box polygon

This is the easiest of the three polygons to make.  

```{r}
coast = rnaturalearth::ne_coastline(scale = 'large', returnclass = 'sf') |>
  sf::st_geometry()

box = sf::st_bbox(obs) |>
  sf::st_as_sfc()

plot(coast, extent = box, axes = TRUE)
plot(box, lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```

Hmmm. It is easy to make, but you can see vast stretches of sampling area where no observations have been reported (including on land).  That could limit the utility of the model.

#### The convex hull polygon

Also an easy polygon to make is a convex hull - this is one often described as the rubber-band stretched around the point locations.  The key here is to take the union of the points first which creates a single MULTIPOINT object.  If you don't you'll get a convex hull around every point... oops.

```{r}
chull = sf::st_union(obs) |>
  sf::st_convex_hull()

plot(sf::st_geometry(coast), extent = chull, axes = TRUE)
plot(sf::st_geometry(chull), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```

Well, that's an improvement, but we still get large areas vacant of observations and most of Nova Scotia.

#### The buffered polygon

An alternative is to create a buffered polygon around the MULTIPOINT object.  We like to think of this as the "shrink-wrap" version as it follows the general contours of the points. We arrived at a buffereing distance of 75000m through trial and error, and the add in a smoothing for no other reason to improve aesthetics.

```{r}
poly =  sf::st_union(obs) |>
  sf::st_buffer(dist = 75000) |>
  sf::st_union() |>
  sf::st_simplify() |>
  smoothr::smooth(method = 'chaikin', refinements = 10L)


plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(obs), pch = "+", col = 'blue', add = TRUE)
```
That seems the best yet, but we still sample on land.  We'll over sample and toss out the ones on land. Let's save this polygon in case we need it later.

```{r}
ok = dir.create("data/bkg", recursive = TRUE, showWarnings = FALSE)
sf::write_sf(poly, file.path("data", "bkg", "buffered-polygon.gpkg"))
```

#### Sampling the polygon

Now to sample the within the polygon, we'll sample the same number we selected earlier. Note that we also set the same seed (for demonstration purposes). 

```{r}
set.seed(1234)
bkg = sf::st_sample(poly, nback) 

plot(coast, extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
plot(sf::st_geometry(coast), add = TRUE)
```
OK - we can work with that!  We still have points on land, but most are not. The following section shows how to use SST maps to filter out errant background points.


#### Purging points that are on land (or very nearshore)

It's great if you have in hand a map the distinguishes between land and sea - like we do with `sst`.  We shall extract values `v` from just the first `sst` layer (hence the slice).

```{r}
v = preds['sst'] |>
  dplyr::slice(along = "time", 1) |>
  stars::st_extract(bkg) |>
  sf::st_as_sf() |>
  dplyr::mutate(is_water = !is.na(sst), .before = 1) |>
  dplyr::glimpse()
```
Values where `sst` are NA are beyond the scope of data present in the OISST data set, so we will take that to mean NA is land (or very nearshore).  We'll merge our `bkg` object and random dates (`days_sample`), filter to include only water.

```{r}
bkg = sf::st_as_sf(bkg) |>
  sf::st_set_geometry("geometry") |>
  dplyr::mutate(date = days_sample, .before = 1) |>
  dplyr::filter(v$is_water)

plot(sf::st_geometry(coast), extent = poly, axes = TRUE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = 'blue', add = TRUE)
```

**Note** that the bottom of the scatter is cut off.  That tells us that the `sst` raster has been cropped to that southern limit.  We can confirm that easily.

```{r}
plot(preds['sst'] |> dplyr::slice('time', 1), extent = poly, axes = TRUE, reset = FALSE)
plot(sf::st_geometry(poly), lwd = 2, border = 'orange', add = TRUE)
plot(sf::st_geometry(bkg), pch = ".", col = "blue", add = TRUE)

```


## Extract environmental covariates for `sst` and `wind`

### Wait, what about dates?

You may have considered already an issue connecting our background points which have daily dates with our covariates which are monthly (identified by the first of each month.) We can manage that by adding a second date, `month_id`, to the `bkg` table. 

```{r}
bkg = dplyr::mutate(bkg, month_id = oisster::current_month(date))
```

### Extract background points

Here we go back to the complete covariate dataset, `preds`. We extract specifying which variable in `bkg` is mapped to the time domain in `sst` - in our case the newly computed `month_id` matches the `time` dimension in `sst`.  We'll save the values while we are at it.

```{r}
bkg_values = stars::st_extract(preds, bkg, time_column = 'month_id')|>
  sf::write_sf(file.path("data", "bkg", "bkg-covariates.gpkg")) |>
  dplyr::glimpse()
```


### Next extract observation points

It's the same workflow to extract covariates for the observations as it was for the background, but let's not forget to add in a variable to identify the month that matches those in the predictors.

```{r}
obs = dplyr::mutate(obs, month_id = oisster::current_month(date))
obs_values = stars::st_extract(preds, obs, time_column = 'month_id')|>
  sf::write_sf(file.path("data", "obs", "obs-covariates.gpkg")) |>
  dplyr::glimpse()
```

That's it!  Next we can start assembling a model.