# Statistical modelling of spatial data {#sec-modelling}
```{r message=FALSE, echo=FALSE}
#| code-fold: true
library(tidyverse)
library(sf)
system.file("gpkg/nc.gpkg", package="sf") |>
read_sf() -> nc
```
\index{statistical models}
So far in this book, we have mostly addressed the problem of
_describing_ data: by geometrical measures, predicates, and
transformations involving geometries, by summary measures of
attributes, and by plots involving variability in the geometry,
the feature attributes, or both.
Statistical modelling aims to go beyond describing the data:
it considers the data as a sample drawn from a population and
tries to make assessments (inference) about the population sampled
from, for instance by quantifying relationships between variables,
estimating population parameters, and predicting the outcome of
observations that could have been taken but were not, as is the
case in interpolation problems. This is usually done by adopting a
model for the data, where for instance observations are decomposed
as follows:
$$
\mbox{observed} = \mbox{explained} + \mbox{remainder}
$$ {#eq-model}
where "explained" typically uses external variables (predictors,
covariates, in machine learning confusingly also called features)
that are related to the observed variable, together with some kind
of regression model that translates their variability into
variability of the observed variable, and "remainder" is the
remaining variability that could not be explained.
Interest may focus on the nature and magnitude of the relations
between predictors and the observed variable, or on predicting
new observations.
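As a minimal illustration of @eq-model, consider a simple linear
regression on the `nc` dataset loaded above (the choice of response and
predictor here is merely illustrative): the fitted values ("explained")
and residuals ("remainder") add up to the observed values.
```{r}
# observed = explained (fitted values) + remainder (residuals)
m <- lm(SID74 ~ NWBIR74, nc)
all.equal(nc$SID74, unname(fitted(m) + residuals(m)))
```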
Statistical models and _sampling_ hinge on the concept of
probability, which in typical spatial data science problems is not a
force of nature but has to be assumed in one way or another. If we
are faced with data that come from (spatially) random sampling and
we are interested in estimating means or totals, a _design-based_
approach that assumes randomness in the sample locations is the most
straightforward analysis approach, as pointed out in more detail in
@sec-design. If observations were not sampled randomly, or if our
interest is in predicting values at specific locations (mapping),
a _model-based_ approach is needed. The remaining chapters in this
part deal with model-based approaches.
## Mapping with non-spatial regression and ML models {#sec-lm}
Regression models or other machine learning (ML) models can be
applied to spatial and spatiotemporal data just the way they are
applied for predicting new observations in non-spatial problems:
1. **estimate**: for a set of observations, a regression or ML model is fitted
using predictor values corresponding to the observations (in ML jargon, this
step is also known as "train")
2. **predict**: for a new situation, known predictor values are combined
with the fitted model to predict the value of the observed variable, along
with a prediction error or prediction interval if possible
Objects of class `sf` need no special treatment, as they are
`data.frame`s. To create maps of the resulting predictions, predicted
values need to be added to the `sf` object, which can be done using the `nc`
dataset loaded as in @sec-intro by:
```{r sid, cache = FALSE}
nc |> mutate(SID = SID74/BIR74, NWB = NWBIR74/BIR74) -> nc1
lm(SID ~ NWB, nc1) |>
predict(nc1, interval = "prediction") -> pr
bind_cols(nc, pr) |> names()
```
where we see that
* `lm` estimates a linear model and works directly on an `sf` object
* its output is passed to the `predict` method, which predicts values corresponding
to the observations in `nc1`, the same `sf` object
* `predict` creates three columns: `fit` for predicted values, and `lwr`
and `upr` for the lower and upper limits of the 95\% prediction interval
* these three columns are added to the final object using `bind_cols`.
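To map the predictions, the combined object can be plotted directly; a
minimal sketch, selecting only the `fit` column, is:
```{r}
# add the predictions to the sf object and map the predicted values
bind_cols(nc, pr) |> select(fit) |> plot()
```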
In general the datasets for model estimation and prediction do not
have to be the same. @sec-starspredict points out how this can
be done with `stars` objects (essentially by going through a long
`data.frame` representation of the datacube and converting the
predicted results back, potentially in a chunked fashion).
\newpage
Because many regression and ML type problems share this same
structure, packages like **caret** [@R-caret] or **tidymodels**
[@R-tidymodels] allow for automated evaluation and comparison over
a large set of model alternatives, offering a large set of model
evaluation criteria and cross-validation strategies. Such
cross-validation approaches assume independent observations,
which is often not a reasonable assumption for spatial data, for
instance because of spatial correlation [@Ploton2020] or because
of strong spatial clustering in sample data [@meyerpebesmanc], or
both, and a number of R packages provide methods that are meant as
replacements for naive cross-validation, including **spatialsample**
[@R-spatialsample], **CAST** [@R-CAST], **mlr3spatial**
[@R-mlr3spatial], and **mlr3spatiotempcv** [@R-mlr3spatiotempcv].
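As a hedged sketch of what such a replacement can look like (assuming
the **spatialsample** package is installed; the choice of five folds is
arbitrary), spatially clustered cross-validation folds for `nc` could
be constructed with:
```{r, eval = FALSE}
# assumption: spatialsample is installed; v = 5 folds chosen arbitrarily
library(spatialsample)
set.seed(131)
folds <- spatial_clustering_cv(nc, v = 5) # folds formed by clustering locations
folds
# autoplot(folds) # visualise which observations fall in which fold
```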
Strong spatial clustering of sample data can arise when the data
are compiled by joining different databases, each with a very
different sampling density. This is often the case in global datasets
[@meyerpebesmanc]. Another example of strong clustering arises when,
to sample ground truth points for a land cover class, polygons
are digitised and points are then sampled within these polygons at
the resolution of the pixels in satellite imagery.
Spatial correlation in the "remainder" part of the model may be
decreased by adding spatial coordinates, or functions of spatial
coordinates, to the set of predictors. This, however, carries a risk
of over-optimistic predictions when extrapolating, and of
over-optimistic (cross-)validation and model assessment; this is
further discussed in @sec-models-with-coordinates.
## Support and statistical modelling {#sec-supportstatistical}
Support of data (@sec-support; @sec-featureattributes) plays a
leading role in the statistical analysis of spatial data. Methods
for areal data (Chapters [-@sec-area]-[-@sec-spatecon]) are devised for data
with area support, where the set of areas covers the entire area
of interest.
When showing an extensive variable (@sec-extensiveintensive)
in a polygon choropleth map, as done in @fig-first-map, one runs
the risk that the information shown is related to polygon size,
and that the signal shown is effectively the size of the polygons,
rendered in colour. For the variable _population count_ one would
divide by the polygon area and show the (intensive) variable
_population density_ in order to create an informative map. In the analysis
of health data, like disease incidences over a time period shown
in @fig-first-map, rather than dividing by polygon area to get a
spatial density, observations are usually converted to probabilities
or _incidence rates_ by dividing by the population size of the
associated polygons. As such they are (still) associated with the
polygon area but their support is associated with the population
total. It is these totals that inform the (Poisson) variability
used in subsequent modelling, for instance in CAR-type models
(@sec-spatglmm).
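A small sketch of this distinction, using columns of `nc` (the choice
of `BIR74` and `SID74` is merely illustrative): dividing the extensive
count `BIR74` by polygon area gives a spatial density, whereas dividing
the disease count `SID74` by the number of births gives an incidence
rate.
```{r}
# extensive count / area -> spatial density; count / population -> rate
a <- units::set_units(st_area(nc), km^2)          # polygon areas in km^2
head(data.frame(birth_density = nc$BIR74 / a,     # births per km^2
                SID_rate = nc$SID74 / nc$BIR74))  # incidences per birth
```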
@sec-pointpatterns deals in principle with point support
observations, but at some stage needs to acknowledge that
observations have a non-zero size: tree stem "points" cannot be
separated by distances smaller than the tree diameter. Also, points in
point pattern analysis are considered within their _observation window_,
the area for which the point dataset is exhaustive, or complete.
The observation window influences many of the analysis
tools. If points are observed on a line network, then the observation
window consists of the set of lines observed, and distances are measured
through this network.
Geostatistical data (Chapters [-@sec-interpolation] and [-@sec-stgeostatistics])
usually start with point support observations and may end with
predictions (spatial interpolations) for unobserved point locations
distributed over the area of interest, or, may end in predictions for
means over areas (block kriging; @sec-blockkriging). Alternatively,
observations may be aggregates over regions [@rtop]. In remote
sensing data, pixel values are usually associated with aggregates
over the pixel area. A typical challenge is filling gaps in images,
such as gaps caused by cloud cover, using pixels neighbouring in
space and time [@gapfill; @Heaton2018; @militinoetal19].
When combining data with different spatial supports, for instance polygons
of administrative regions and raster layers, it is often seen that
all information is "brought together" at the highest resolution,
by simply extracting polygon values at pixel locations and
proceeding from there with all the newly created "observations".
This of course bears a large risk of producing non-sensible results
when analysing such "data"; a proper downscaling strategy,
possibly using simulations to cope with uncertainty, would be a
better alternative. Naive users are not helped by software that is
unaware of the support of values associated with areas, or by
software that does not warn against naive downscaling.
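One support-aware operation available in **sf** is area-weighted
interpolation; a small sketch (with an arbitrary 20 by 10 polygon grid
as target geometry) redistributes an extensive variable over a
different set of polygons such that its total is preserved:
```{r}
# area-weighted interpolation respects the support of the extensive
# variable BIR74 when moving from county polygons to a polygon grid
grd <- st_make_grid(nc, n = c(20, 10))                   # target polygon grid
bir_grid <- st_interpolate_aw(nc["BIR74"], grd, extensive = TRUE)
sum(bir_grid$BIR74) / sum(nc$BIR74)                      # total is preserved
```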
## Time in predictive models
@schabenberger+gotway:2005 already noted that in many cases,
statistical analysis of spatiotemporal data proceeds either by
reducing time, then working on the problem spatially (time first,
space later) or reducing space, then working on the problem
temporally (space first, time later). An example of the first
approach is given in @sec-interpolation where a dataset with a year
of hourly values (detailed in @sec-stgeostatistics) is reduced
to station mean values (time first), after which these means are
interpolated spatially (space later). Examples from the area of
remote sensing are
* @sits use supervised machine learning and time series deep learning
to segment pixel time series into sequences of land use (time
first), and then smooth the resulting sequences of maps to remove
improbable transitions in isolated pixels (space later)
* @bfast use (unsupervised) structural change algorithms to
find breakpoints in pixel time series (time first), which are later
interpreted in the context of deforestation.
Examples of space first, time later in the area of remote sensing
include any case where a single scene, or scenes belonging to a single season,
are classified, and multi-year changes in land use or land cover
are assessed by comparing time sequences of classified scenes. An
example of this is @brown2022dynamic. Examples where space and
time are considered _jointly_ are the spatiotemporal interpolation
in @sec-stgeostatistics, and @LU2016227 in the context of remote
sensing.
## Design-based and model-based inference {#sec-design}
\index{inference!design-based}
\index{inference!model-based}
\index{design-based inference}
\index{model-based inference}
Statistical inference is the act of estimating parameters
of a population from sample data. Suppose we denote the variable
of interest with $z(s)$, where $z$ is the attribute value measured
at location $s$, and we are interested in estimating the
mean value of $z(s)$ over a domain $D$,
$$z(D)=\frac{1}{|D|} \int_{u \in D} z(u)\,du,$$
with $|D|$ the area of $D$, from sample data $z(s_1),...,z(s_n)$.
Then, there are two possibilities to proceed: model-based, or
design-based. A model-based approach considers $z(s)$ to be a
realisation of a superpopulation process $Z(s)$ (using capital letters to
indicate random variables), and could for instance postulate a
model for its spatial variability in the form of
$$Z(s) = m + e(s), \ \ \mbox{E}(e(s)) = 0, \ \ \mbox{Cov}(e(s)) = \Sigma(\theta)$$
with $m$ a constant mean and $e(s)$ a residual with mean zero and
covariance matrix $\Sigma(\theta)$. This
would require choosing the covariance function $\Sigma()$ and
estimating its parameters $\theta$ from $z(s)$, and then computing a
block kriging prediction $\hat{Z}(D)$ (@sec-blockkriging).
This approach makes no assumptions about how $z(s)$ was sampled
_spatially_, but of course it should allow for choosing the
covariance function and estimating its parameters; inference is
conditional on the validity of the postulated model.
\index{superpopulation model}
\index{spatial random sampling}
\index{random sampling}
Rather than assuming a superpopulation model, the design-based
approach [@de1990model; @brus2021; @breidt2017model] assumes
randomness in the locations, which is justified (only) when using
random sampling. It _requires_ that the sample data were obtained
by probability sampling, meaning that some form of spatial random
sampling was used where all elements of $z(s)$ had a known and
positive probability of being included in the sample obtained. The
random process is that of sampling: $z(s_1)$ is a realisation of
the random process $z(S_1)$, the first observation taken _over
repeated random sampling_. Design-based estimators only need
these inclusion probabilities to estimate mean values with standard
errors. This means that for instance given a simple random sample,
the unweighted sample mean is used to estimate the population mean,
and no model parameters need to be fit.
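As a minimal sketch of the design-based estimator (with a simple random
sample of counties standing in for spatial simple random sampling, and
ignoring the finite-population correction), the unweighted sample mean
and its standard error follow directly from the sample:
```{r}
# simple random sample of n counties; estimate the mean of BIR74 and its
# standard error without fitting any model parameters
set.seed(131)
n <- 30
i <- sample(nrow(nc), n)
c(mean = mean(nc$BIR74[i]), se = sd(nc$BIR74[i]) / sqrt(n))
```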
Now the question is whether $z(s_1)$ and $z(s_2)$ can be expected
to be correlated when $s_1$ and $s_2$ are close together. This question
cannot be answered as long as $z(s_1)$ and $z(s_2)$ are just two numbers:
we need some kind of framework, random variables, that recreates this
situation and forms two sets of numbers for which we can consider correlation.
The misconception here, as explained in @brus2021, is that the two
are always spatially correlated, but this is only the case when
working under a model-based approach: $Z(s_1)$ and $Z(s_2)$ may
well be correlated ("model-dependent"), but although in a particular
random sample (realisation) $z(s_1)$ and $z(s_2)$ _may_ be close
in space, the corresponding random variables $z(S_1)$ and $z(S_2)$,
considered over repeated random sampling, are not close together
and are design-independent. Both situations can coexist without
contradiction; they are a consequence of choosing to work under one
inference framework or the other.
The choice whether to work under a design-based or model-based
framework depends on the purpose of the study and the data collection
process. The model-based framework lends itself best to cases:
* where predictions are required for individual locations, or
for areas too small to be sampled
* where the available data were not collected using a known random
sampling scheme (i.e., the inclusion probabilities are unknown,
or are zero over particular areas and/or times)
Design-based approaches are most suitable when:
* observations were collected using a spatial random sampling process
* aggregated properties of the entire sample region (or sub-region)
are needed
* estimates are required that are not sensitive to model
misspecification, for instance when needed for regulatory or legal
purposes
In case a sampling procedure is to be planned [@de2006sampling], some
form of spatial random sampling is definitely worth considering since
it opens up the possibility of following both inference frameworks.
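A hedged sketch of how such a sample could be drawn (using `st_sample`
from **sf**; the sample size of 100 is arbitrary) is:
```{r}
# draw a spatial simple random sample of 100 locations over the domain
set.seed(13531)
pts <- st_sample(st_union(nc), size = 100)
plot(st_geometry(nc))
plot(pts, add = TRUE, pch = 3)
```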
## Predictive models with coordinates {#sec-models-with-coordinates}
\index{predictive models}
\index{coordinates!as predictors}
In data science projects, coordinates may be seen as just another
pair of features in a larger set of predictors (covariates) and
treated accordingly. There are some pitfalls in doing so.
As usual when working with predictors, it is good to choose
predictive methods that are not sensitive to shifts in origin
or shifts in unit (scale). Assuming a two-dimensional problem,
predictive models should also not be sensitive to arbitrary rotations
of the $x$- and $y$- or latitude and longitude axes. For projected (2D,
Cartesian) coordinates this can be assured, e.g., by using polynomials
of order $n$ of the form $(x+y)^n$, rather than $x^n + y^n$; for a second-order
polynomial this involves including the interaction term $xy$, so that an
elliptically shaped trend surface does not have to be aligned with
the $x$- or $y$-axis. For a GAM with spline components, one
would use a spline in two dimensions, $s(x,y)$, rather than two
independent splines $s(x)$ and $s(y)$, which do not allow for
interaction. An exception to this "rule" would be when a pure
latitude effect is desired, for instance to account for yearly total
solar energy influx.
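A sketch of the two approaches just mentioned, a full second-order
polynomial and a GAM with a two-dimensional spline (using county
centroids as point locations, the **mgcv** package for the GAM, and the
`SID` rate of @sec-lm as response; these choices are illustrative only),
is:
```{r}
# rotation-insensitive use of coordinates as predictors
library(mgcv)                                       # for gam() with 2D splines
xy <- st_coordinates(st_centroid(st_geometry(nc)))  # county centroid coordinates
nc2 <- nc |> mutate(SID = SID74 / BIR74, x = xy[, 1], y = xy[, 2])
lm(SID ~ x + y + I(x^2) + I(y^2) + I(x * y), nc2)   # 2nd-order polynomial, includes xy
gam(SID ~ s(x, y), data = nc2)                      # 2D spline rather than s(x) + s(y)
```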
When the area covered by the data is large, the difference between
using ellipsoidal coordinates and projected coordinates
becomes larger, and hence the choice between the two
will have an effect on predictive modelling. For very large extents
and global models, polynomials or splines in latitude and longitude
will not work well, as they ignore the circular nature of longitude
and the coordinate singularities at the poles. Here, spherical
harmonics, basis functions that are continuous on the sphere and have
increasing spatial frequencies, can replace polynomials or be used
as spline basis functions.
In many cases, the spatial coordinates over which samples were
collected also define the space over which predictions are made,
setting them apart from other features. Many simple predictive
approaches, including most machine learning methods, assume
sample data to be independent. When samples are collected by
spatially random sampling over the spatial target area, this
assumption may be justified when working under a design-based context
[@brusejss]. This context, however, treats the coordinate space as
the variable over which we randomise, which affords predicting
values for a new _randomly chosen_ location but rules out making
predictions for fixed locations; this implies that averages over
the areas from which samples were collected can be obtained, but not
spatial interpolations. In case predictions for fixed locations
are required, or in case data were not collected by spatial
random sampling, a model-based approach (as taken in
@sec-interpolation) is needed and typically some form of spatial
and/or temporal autocorrelation of residuals must be assumed.
\index{cross-validation}
\index{spatial cross-validation}
A common case is where sample data were collected opportunistically
("whatever could be found"), and are then used in a predictive
framework that does not weight them. This has the consequence that the
resulting model may be biased towards over-represented areas (in
predictor space and/or in spatial coordinate space), and that simple
(random) cross-validation statistics may be over-optimistic when
taken as performance measures for spatial prediction [@meyerpebesma;
@meyerpebesmanc; @mila2022nearest]. Adapted cross-validation
measures, such as spatial cross-validation, may help to obtain more
relevant measures of predictive performance.
## Exercises
Use R to solve the following exercises.
1. Following the `lm` example of @sec-lm, use a random forest model
to predict `SID` values (e.g., using package **randomForest**), and
plot the random forest predictions against observations, along
with the $x=y$ line.
1. Create a new dataset by randomly sampling 1000 points from the
`nc` dataset, and rerun the linear regression model of @sec-lm on
this dataset. Consider the `summary` of the fitted models, in
particular the estimated coefficients, their standard errors,
and the residual standard error. What has changed?
1. Redo the water-land classification of @sec-starspredict
using `class::knn` instead of `lda`, using a value of `k = 5`,
and compare the resulting predictions with those of `lda`.
1. For the linear model using `nc` and for the `knn` example of the
previous exercise, add a first- and a second-order linear model in
the spatial coordinates and compare the results (use `st_centroid`
to obtain polygon centroids, and `st_coordinates` to extract the
`x` and `y` coordinates in matrix form).