forked from soilspectroscopy/ossl-manual
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path050-reference_models.Rmd
601 lines (477 loc) · 28.4 KB
/
050-reference_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
# OSSL reference models
```{r, results = "asis", echo = FALSE}
status("drafting")
```
We have received important feedback from people that said the current models are doing a great job, but for others not that much. This can be due to the specific soil types not being well represented or due to the spectra not being well aligned with the OSSL. We plan to conduct a systematic analysis of learning algorithms, global vs local fitting, compression methods and preprocessing using the OSSL database. Another great interest is to test if including spatial covariates or fusing different spectral regions really help to deliver more reliable predictions.
The following sections describe steps used to produce OSSL models, which
are used in the OSSL prediction service and API. The current available models
were fitted using the previous OSSL database version 1.0. For more in-depth insight of
the modeling steps please refer to the [ossl-models](https://github.com/soilspectroscopy/ossl-models) repository.
## OSSL global soil spectroscopy calibrarion models
We have fitted a number of calibration models using OSSL MIR and VisNIR spectra
and combination of additional covariates, and which are now publicly available
for use under the CC-BY license. For modeling we use the Ensemble Machine Learning
framework as implemented in the R [mlr](https://mlr.mlr-org.com/) programming environment
with the following parameters:
1. Regression on first 60 Principal Component Analysis PC's of the raw spectra;
2. Four base learners: Random Forest, Gradient Boosting, Cubist and Lasso and Elastic-Net Regularized GLM;
3. Ensemble Machine Learning by [stacking / meta-learner](https://mlr.mlr-org.com/reference/makeStackedLearner.html) through 5-fold Cross-Validation;
4. Blocking using 100×100-km spatial grid to avoid possible overfitting;
5. [Feature selection](https://mlr.mlr-org.com/articles/tutorial/feature_selection.html) and model-parameter fine-tuning using random and sequential backward search;
For more details on how were the models fitted refer to: <https://github.com/soilspectroscopy/ossl-models>.
For standard global soil spectral calibration models we use various calibration frameworks including:
1. Using only MIR spectra for calibration.
2. Using only VisNIR spectra for calibration.
3. Using combination of MIR and VisNIR spectra.
4. Using combination of MIR and VisNIR spectra + geographic covariates.
The methods 1 & 2 are standard methods described in literature [@malone2021soil]. The methods
3 & 4 are referred to as the _"MIR-VisNIR fusion"_ approach [@vohland2022quantification].
The method 4 is the _"field-specific calibration approach"_; we refer to it here
as the _"bundle approach"_ as we basically use all information available
to help produce best predictions of targeted soil properties.
Users can choose to opt for one of the multiple calibration models, which also
depends on the type of data (VisNIR, MIR, coordinates and depths known yes/no) used of course.
The online calibration service (<https://engine.soilspectroscopy.org>) and the API
can be used for predicting values of soil properties for smaller datasets (<1000 rows).
Soil spectral scans obtained by different spectroscopic instruments often cannot be
modeled or analyzed together i.e. they require some kind of calibration [@lei2022achieving]
before one can fit cross-instrument / cross-dataset calibration models.
In the OSSL library we thus use also instrument / dataset code as the indicator variable.
The most complex models currently in the OSSL are, thus, one with which users need to define both:
1. Input MIR spectra,
2. Input VisNIR spectra (same samples),
3. Geographical location and soil depth,
4. Instrument code / dataset code;
## Using default OSSL models to predict soil properties
OSSL model registry (RDS) S3 file services currently includes large number of
prediction models fitted for a list of soil properties. The complete list is available in the file **[OSSL_models_meta.csv](https://github.com/soilspectroscopy/ossl-models)**.
```{r, echo=FALSE, message=FALSE, results='hide'}
remotes::install_github("philipp-baumann/simplerspec")
#devtools::install_github("pierreroudier/asdreader")
ls <- c("plyr", "rgdal", "terra", "mlr", "ranger", "xgboost",
"Cubist", "fastDummies", "prospectr", "simplerspec", "asdreader")
new.packages <- ls[!(ls %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
x = lapply(ls, require, character.only = TRUE)
```
### Predicting soil properties using MIR spectra
For example, to predict [soil organic carbon weight percent](https://soilspectroscopy.github.io/ossl-manual/database.html#oc_usda.calc_wpct) (`oc_usda.calc_wpct`) using MIR spectra, we need to load
two models (1) principal component model `mpca_mir_kssl_v1.rds`,
(2) mlr model for predicting `oc_usda.calc_wpct`:
```{r, eval=FALSE}
source('./R/ossl_functions.R')
rep = "http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_models/"
pcm1 = url(paste0(rep, "pca.ossl/mpca_mir_kssl_v1.rds"), "rb")
eml1 = url(paste0(rep, "log..oc_usda.calc_wpct/mir_mlr..eml_kssl_na_v1.rds"), "rb")
ossl.pca.mir = readRDS(pcm1)
ossl.model = readRDS(eml1)
```
Note that the models (RDS files) can be between 10 to 200 MB in size, depending on complexity of models
and the size of the training data used.
The Principal Component Analysis is described in the [OSSL models repository](https://github.com/soilspectroscopy/ossl-models/).
It is a standard way of compressing 1000+ spectral bands before predictive modeling [@chang2001near].
We use the first 60 PCA components for further modeling, which is an arbitrary decisions.
We can look at the summary of model by typing:
```{r, eval=FALSE}
summary(ossl.model$learner.model$super.model$learner.model)
```
```
Residuals:
Min 1Q Median 3Q Max
-3.4601 -0.0535 -0.0038 0.0442 3.2730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0099587 0.0008464 -11.766 < 2e-16 ***
regr.ranger 0.1507935 0.0048442 31.129 < 2e-16 ***
regr.xgboost -0.0231490 0.0034214 -6.766 1.34e-11 ***
regr.cvglmnet 0.4153936 0.0049604 83.742 < 2e-16 ***
regr.cubist 0.4674533 0.0044788 104.371 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1501 on 72246 degrees of freedom
Multiple R-squared: 0.9776, Adjusted R-squared: 0.9776
F-statistic: 7.891e+05 on 4 and 72246 DF, p-value: < 2.2e-16
```
This shows that the model is an ensemble Machine Learning model based on stacking of
four individual models: `regr.ranger` (Random Forest) [@wright2017ranger], `regr.xgboost` (Gradient Boosting) [@chen2015xgboost],
`regr.cvglmnet` (Lasso and Elastic-Net Regularized GLM) [@friedman2010] and `regr.cubist` (Cubist) [@kuhn2012cubist]. The RMSE
of the model is 0.150 (R-square = 0.978), based on 5-fold Cross Validation with spatial blocking (global grid with 100 km blocks).
Model is based on 72,246 training points.
We can take a look at the accuracy plot for this model:
```{r ac-soc1, echo=FALSE, fig.cap="Accuracy plot for `log..oc_usda.calc_wpct/mir_mlr..eml_kssl_na_v1.rds`.", out.width="70%"}
knitr::include_graphics("http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_models/log..oc_usda.calc_wpct/ap.mir_mlr..eml_kssl_na_v1.rds.png")
```
Next, we load the sample MIR data that we can use to predict soil organic carbon:
```{r, eval=FALSE}
mir.raw = read.csv("./sample-data/sample_mir_data.csv")[,-1]
```
This is a simple table with raw MIR spectral data (absorbances) prepared as a table.
The column names contain the wavelengths / MIR channels:
```{r, eval=FALSE}
dim(mir.raw)
## 20 1765
str(mir.raw[,1:10])
```
```
'data.frame': 20 obs. of 10 variables:
$ X4001.656: num 0.411 0.357 0.296 0.152 0.238 ...
$ X3999.728: num 0.411 0.357 0.297 0.152 0.239 ...
$ X3997.799: num 0.411 0.357 0.297 0.153 0.239 ...
$ X3995.871: num 0.411 0.357 0.298 0.153 0.24 ...
$ X3993.942: num 0.411 0.357 0.298 0.153 0.24 ...
$ X3992.014: num 0.411 0.357 0.299 0.154 0.241 ...
$ X3990.085: num 0.411 0.358 0.299 0.154 0.242 ...
$ X3988.156: num 0.412 0.358 0.299 0.154 0.242 ...
$ X3986.228: num 0.412 0.358 0.3 0.154 0.243 ...
$ X3984.3 : num 0.411 0.358 0.3 0.155 0.243 ...
```
We can generate predictions by using:
```{r, eval=FALSE}
pred.oc = predict.ossl(t.var="log..oc_usda.calc_wpct", mir.raw=mir.raw,
ossl.model=ossl.model, ylim=c(0,100),
ossl.pca.mir=ossl.pca.mir)
str(pred.oc$pred)
```
where `ylim=c(0,100)` specifies that the prediction values need to be in the range 0 to 100 (values out of these range will be replaced with either 0 or 100).
The output of this model consists of four parts:
```
List of 4
$ pred :'data.frame': 20 obs. of 5 variables:
$ x :'data.frame': 20 obs. of 68 variables:
$ model:'data.frame': 72251 obs. of 5 variables:
$ cf : num 1.38
```
1. Predictions with back-transformed value and lower and upper prediction intervals (1 std.);
2. Data frame with all covariate layers needed to generate predictions;
3. Summary meta-learner with predicted and observed values;
4. Ratio between variance in predictions produced using base learners and MSE;
The predictions are in the format:
```{r, eval=FALSE}
str(pred.oc$pred)
```
```
'data.frame': 20 obs. of 5 variables:
$ pred.mean : num 0.0496 0.1128 0.3848 0.776 -0.0608 ...
$ pred.error: num 0.115 0.248 0.085 0.151 0.168 ...
$ tpred.mean: num 0.0509 0.1194 0.4693 1.1728 0 ...
$ lower.1std: num 0 0 0.35 0.868 0 ...
$ upper.1std: num 0.179 0.435 0.6 1.527 0.113 ...
```
which means that the predicted `oc_usda.calc_wpct` for first row is 0 etc. The
`lower.1std` and `upper.1std` indicate standard prediction interval based on the
prediction error derived in the log-space. For higher values of `oc_usda.calc_wpct`
prediction interval gets thus proportionally higher for larger absolute values e.g. for row 14 it is between 24% and 39%:
```
#pred.oc$pred[14,]
pred.mean pred.error tpred.mean lower.1std upper.1std
14 3.464914 0.2329579 30.9737 24.32914 39.3613
```
Finally, we can check how well does the new data (MIR scans) reflect the training
data used to build the calibration models:
```{r, eval=FALSE}
library(hexbin)
library(grid)
ossl.pcax.mir = readRDS(url(paste0(rep, "pca.ossl/pca_mir_kssl_v1.rds"), "rb"))
reds = colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd")[-1])
hb <- hexbin(ossl.pcax.mir$x[,1:2], xbins=60)
p <- plot(hb, colramp = reds, main='PCA MIR KSSL')
pushHexport(p$plot.vp)
grid.points(pred.oc$x$mir.PC1, pred.oc$x$mir.PC2, pch=17)
```
```{r pca-sim, echo=FALSE, fig.cap="Similary plot new spectra vs training spectra (PC biplot).", out.width="80%"}
knitr::include_graphics("img/pca_similarity_plot.png")
```
Based on the plot shown above, it appears that some of scans used are less typical
and hence prediction results should be taken with caution.
### Predicting soil properties using MIR spectra and geographical covariates
Next, we can also predict values at new location using calibration model that is
field-specific i.e. we use MIR spectra in combination with geographic covariates.
To predict with such data we also need to load geographical locations for the same
MIR scans:
```{r, eval=FALSE}
new.soil = read.csv("./sample-data/sample_soilsite_data.csv")
lon = new.soil$longitude.std.decimal.degrees
lat = new.soil$latitude.std.decimal.degrees
hzn_depth = new.soil$lay_depth_to_top + (new.soil$lay_depth_to_bottom - new.soil$lay_depth_to_top)/2
```
This now comes with three additional columns `lon` and `lat` and soil depth `hzn_depth`, which is
the middle of the top and bottom depth of the soil horizon. To predict with then specify also the additional
arguments in the function:
```{r, eval=FALSE}
eml2 = url(paste0(rep, "log..oc_usda.calc_wpct/mir_mlr..eml_kssl_ll_v1.rds"), "rb")
ossl.model = readRDS(eml2)
pred.oc2 = predict.ossl(t.var="log..oc_usda.calc_wpct", mir.raw=mir.raw, ossl.model=ossl.model, ylim=c(0,100),
ossl.pca.mir=ossl.pca.mir, geo.type="ll", lon=lon, lat=lat, hzn_depth=hzn_depth, cog.dir="/data/WORLDCLIM/")
```
This type of prediction is somewhat more complex as the function `predict.ossl` needs also to overlay
points (`lon` and `lat`) vs some [60+ GeoTIFFs](https://github.com/soilspectroscopy/ossl-models/blob/main/out/global_layers1km.csv) (containing [WorldClim2.1](https://www.worldclim.org/data/worldclim21.html) layers and [MODIS LST](https://doi.org/10.5281/zenodo.1420114) layers). In this case the layers are located on a local machine, which significantly speeds up predictions function.
We can look at the variable importance table to try to understand how much the
additional geographical covariate layers help with producing predictions:
```{r, eval=FALSE}
xl <- data.frame(importance = ossl.model$learner.model$base.models[[1]]$learner.model$variable.importance)
xl$relative_importance = 100*xl$importance/sum(xl$importance)
xl = xl[order(xl$importance, decreasing = TRUE),]
xl[1:20,]
```
```
importance relative_importance
mir.PC3 6511.36634 27.9140468
mir.PC2 3603.29854 15.4472409
hzn_depth 2936.76269 12.5898202
mir.PC5 2570.81057 11.0209936
mir.PC10 1767.58676 7.5775954
mir.PC1 1276.53105 5.4724532
mir.PC4 925.96202 3.9695735
mir.PC11 416.65775 1.7862002
mir.PC24 226.11697 0.9693572
mir.PC26 158.90309 0.6812131
mir.PC8 116.63143 0.4999957
mir.PC16 102.35918 0.4388110
mir.PC7 98.52643 0.4223801
clm_lst_mod11a2.jul.day_m_1km_s0..0cm_2000..2017_v1.0 65.31996 0.2800248
mir.PC12 62.31378 0.2671374
mir.PC28 60.13722 0.2578066
mir.PC15 59.47144 0.2549524
clm_lst_mod11a2.jun.day_m_1km_s0..0cm_2000..2017_v1.0 59.17326 0.2536741
mir.PC6 49.88579 0.2138590
mir.PC17 49.37638 0.2116752
```
here `hzn_depth` comes as an important covariate and so is `lst_mod11a2.jul.day`
(MOD11A long-term meand daily temperature for July). The original MIR PCA components, however,
dominate the prediction model.
The global layers (global land mask at 1-km spatial resolution) are listed at <https://github.com/soilspectroscopy/ossl-models> and are available as a Cloud-service / Cloud-Optimized GeoTIFFs. The process of spatial overlay for new prediction locations, however, can significantly increase
prediction time, so something to be aware of.
### Predicting soil properties using VisNIR spectra
We can also load VisNIR spectra (reflectances) and predict values using the same function:
```{r, eval=FALSE}
visnir.raw = read.csv("./sample-data/sample_visnir_data.csv")[,-c(1:2)]
str(visnir.raw[,1:5])
```
```
'data.frame': 20 obs. of 5 variables:
$ X350: num 0.0855 0.0677 0.2319 0.3179 0.25 ...
$ X351: num 0.0839 0.0666 0.2305 0.3195 0.2504 ...
$ X352: num 0.0816 0.0647 0.2277 0.3184 0.2508 ...
$ X353: num 0.0811 0.0658 0.2261 0.3134 0.2489 ...
$ X354: num 0.082 0.0674 0.2258 0.3124 0.2464 ...
```
We load another PCA model that we prepared using global VisNIR data and the
mlr-model that was fitted using VisNIR data:
```{r, eval=FALSE}
pmc3 = url(paste0(rep, "pca.ossl/mpca_visnir_kssl_v1.rds"), "rb")
eml3 = url(paste0(rep, "log..oc_usda.calc_wpct/visnir_mlr..eml_ossl_na_v1.rds"), "rb")
ossl.pca.visnir = readRDS(pmc3)
ossl.model = readRDS(eml3)
pred.oc3 = predict.ossl(t.var="log..oc_usda.calc_wpct", visnir.raw=visnir.raw, ossl.model=ossl.model,
ylim=c(0,100), spc.type = "visnir", ossl.pca.visnir = ossl.pca.visnir)
```
This model is similar to the MIR model, but the accuracy plot indicates that the accuracy
might be significantly lower:
```{r ac-soc2, echo=FALSE, fig.cap="Accuracy plot for `log..oc_usda.calc_wpct/visnir_mlr..eml_ossl_na_v1.rds`.", out.width="70%"}
knitr::include_graphics("http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_models/log..oc_usda.calc_wpct/ap.visnir_mlr..eml_ossl_na_v1.rds.png")
```
### Importing soil spectral scans using raw data formats
The examples shown above demonstrate prediction results based on using sample
MIR and VisNIR data (csv files). If you only have raw spectral scan data such as
[ASD](http://support.asdi.com/Document/Viewer.aspx?id=95) (PAN Analytics Inc.) and/or
[OPUS files](https://www.bruker.com/en/products-and-solutions/infrared-and-raman/opus-spectroscopy-software.html) (Bruker Inc.),
you can use various R packages to read files and prepare them for prediction required by OSSL.
Consider for example a sample OPUS file from the [AfSIS-1 project](https://registry.opendata.aws/afsis/),
which is an original file from the Bruker_Alpha_ZnSe instrument. We can read this file
thanks to the [simplerspec package](https://philipp-baumann.github.io/simplerspec) by using:
```{r}
mir.x = simplerspec::read_opus_bin_univ('sample-data/icr056141.0')
names(mir.x)
```
which gives a list of object including `spc` which are the spectral bands:
```{r}
dim(mir.x[["spc"]])
```
we can plot the MIR absorbance curve for this scan by using:
```{r, plot-opus, fig.width=6, out.width="90%", fig.cap="Spectral absorbances plot using MIR data."}
x = as.numeric(names(mir.x[["spc"]]))
matplot(y=as.vector(t(mir.x[["spc"]])), x=x,
ylim = c(0,3),
type = 'l',
xlab = "Wavelength",
ylab = "Absorbance"
)
```
This dataset is hence ready to generate predictions using global OSSL models. We can,
for example generate prediction of soil pH using the following model:
```{r ac-ph1, echo=FALSE, fig.cap="Accuracy plot for `log..oc_usda.calc_wpct/mir_mlr..eml_ossl_ll_v1.rds`.", out.width="70%"}
knitr::include_graphics("http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_models/log..oc_usda.calc_wpct/ap.mir_mlr..eml_ossl_ll_v1.rds.png")
```
in this case we need to add also lon-lat coordinates and specify the soil depth:
```{r, eval=FALSE}
pcm3 = url(paste0(rep, "pca.ossl/mpca_mir_ossl_v1.rds"), "rb")
eml3 = url(paste0(rep, "log..oc_usda.calc_wpct/mir_mlr..eml_ossl_ll_v1.rds"), "rb")
ossl.pca.mir = readRDS(pcm3)
ossl.model = readRDS(eml3)
pred.oc1 = predict.ossl(t.var="log..oc_usda.calc_wpct", mir.raw=mir.x[["spc"]], ossl.model=ossl.model,
ossl.pca.mir=ossl.pca.mir, geo.type="ll", lon=28.50299833, lat=-13.10194667, hzn_depth=10,
cog.dir="/data/WORLDCLIM/", dataset.code_ascii_c="AFSIS1.SSL", ylim=c(0,100))
```
which gives us predictions of `log..oc_usda.calc_wpct`:
```{r, eval=FALSE}
pred.oc1$pred
```
```
pred.mean pred.error tpred.mean lower.1std upper.1std
1 -0.02636144 0.2813155 0 0 0.2904023
```
Note that we had to now specify that this MIR data is from AfSIS project via the
`AFSIS1.SSL` argument.
Likewise, we can also read the ASD files using the [asdreader package](https://github.com/pierreroudier/asdreader), so we can
generate predictions using VisNIR spectra. Consider for example the sample data
from the KSSL VisNIR repository:
```{r}
asd.raw = as.data.frame(asdreader::get_spectra("sample-data/101453MD01.asd"))
dim(asd.raw)
```
```{r, plot-ads, fig.width=6, out.width="90%", fig.cap="Spectral reflectance plot using VisNIR data."}
x = as.numeric(names(asd.raw))
matplot(y=as.vector(t(asd.raw)), x=x,
ylim = c(0,0.7),
type = 'l',
xlab = "Wavelength",
ylab = "Reflectance"
)
```
we can also directly predict `ph.h2o_usda.4c1_index` by using:
```{r, eval=FALSE}
pmc3 = url(paste0(rep, "pca.ossl/mpca_visnir_kssl_v1.rds"), "rb")
eml4 = url(paste0(rep, "ph.h2o_usda.4c1_index/visnir_mlr..eml_ossl_na_v1.rds"), "rb")
ossl.pca.visnir = readRDS(pmc3)
ossl.model = readRDS(eml4)
pred.ph2 = predict.ossl(t.var="ph.h2o_usda.4c1_index", visnir.raw=asd.raw,
ossl.model=ossl.model, spc.type="visnir", ossl.pca.visnir=ossl.pca.visnir)
pred.ph2$pred
```
```
pred.mean pred.error lower.1std upper.1std
1 6.292328 0.9341607 5.358167 7.226489
```
We are currently building an API that will work directly with standard raw files
coming directly from the instruments. This should help make the OSSL more robust
and applicable to thousands of requests on a daily basis.
Wrongly formatted spectral scans can lead to artifacts and should be used with care.
Note that complexity of soil spectral scans remains high, including the file sizes that are
difficult to open and read (few thousands of columns)
and are often not supported in standard tabular software e.g. LibreOffice or similar.
We recommend thus to report any issues you might have via our [project GitHub](https://github.com/soilspectroscopy) so we
can all help each other produce more accurate primary soil data.
### The bundle approach to soil spectroscopy calibration
One of the approaches used to calibrate soil spectroscopy data in the Soil Spectroscopy
for Global Good project is to use a combination of VisNIR, MIR, spatial location
and Earth Observation data, basically any additional information that could potentially
help with calibration. We refer to this as the _"bundle approach"_. The key
assumption of the bundle approach is that accuracy of predictions could be somewhat
improved, which is especially important for soil organic carbon monitoring and similar.
Some previous research e.g. by @hong2022fusion and @vohland2022quantification indicate
that such fusion techniques could help improve accuracy of, especially, _in-situ_ methods.
Because several of the training data sets, especially the KSSL and ICRAF-ISRIC,
come with both VisNIR and MIR scans, and are documented in detail, we were able
to fit global calibration models that can now be used to predict soil properties from a combination of data.
Consider for example a sample from North Dakota with a profile code `S2014ND095001`, and
sample code `235157` (classified as _"Fine-loamy, mixed, superactive, frigid Calcic Hapludoll"_).
This sample was taken at location lon=-99.2064444; lat=48.8000056; and covers
the depth interval 101--125 cm (B2 horizon). We can predict value of soil pH by
using both VisNIR and MIR scans for the same sample:
```{r}
mir.x = simplerspec::read_opus_bin_univ('sample-data/235157XS01.0')
asd.raw = as.data.frame(asdreader::get_spectra("sample-data/235157MD01.asd"))
```
This time we need to load 2 PCA models (1) for VisNIR and (2) MIR spectra, and one
model for the bundle calibration `visnir.mir_mlr..eml_ossl_na_v1.rds`:
To make the prediction, we need to specify multiple inputs:
```{r, eval=FALSE}
pcm.m = url(paste0(rep, "pca.ossl/mpca_mir_ossl_v1.rds"), "rb")
pcm.v = url(paste0(rep, "pca.ossl/mpca_visnir_ossl_v1.rds"), "rb")
eml.b = url(paste0(rep, "log..oc_usda.calc_wpct/visnir.mir_mlr..eml_ossl_na_v1.rds"), "rb")
ossl.pca.mir = readRDS(pcm.m)
ossl.pca.visnir = readRDS(pcm.v)
ossl.model = readRDS(eml.b)
pred.ph.b = predict.ossl(t.var="log..oc_usda.calc_wpct", spc.type="visnir.mir",
mir.raw=mir.x[["spc"]], visnir.raw=asd.raw,
ossl.model=ossl.model, ossl.pca.mir=ossl.pca.mir,
ossl.pca.visnir=ossl.pca.visnir, hzn_depth=113)
```
which gives:
```{r, eval=FALSE}
pred.ph.b$pred
```
```
pred.mean pred.error tpred.mean lower.1std upper.1std
1 0.2532955 0.02471756 0.2882639 0.2568115 0.3205034
```
We can look at the model prediction accuracy based on 5-fold cross-validation:
```
Residuals:
Min 1Q Median 3Q Max
-3.1514 -0.0625 -0.0015 0.0571 3.1702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.019667 0.001207 -16.297 <2e-16 ***
regr.ranger 0.213650 0.006611 32.319 <2e-16 ***
regr.xgboost -0.012230 0.005358 -2.283 0.0225 *
regr.cvglmnet 0.479673 0.006236 76.923 <2e-16 ***
regr.cubist 0.337446 0.005674 59.470 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1577 on 37312 degrees of freedom
Multiple R-squared: 0.9767, Adjusted R-squared: 0.9767
F-statistic: 3.91e+05 on 4 and 37312 DF, p-value: < 2.2e-16
```
This shows that the model has an RMSE of 0.156 (in log-scale), which seems to be the [among the best performing models](https://github.com/soilspectroscopy/ossl-models)
for predicting soil organic carbon content in % from spectral scans. Thus, in this specific case, also the
prediction error shows relatively narrow range of uncertainty.
```{r ac-phm, echo=FALSE, fig.cap="Accuracy plot for `log..oc_usda.calc_wpct/visnir.mir_mlr..eml_ossl_na_v1.rds` the VisNIR-MIR combination. As compared with other calibration models this seems to be most accurate, although difference with the pure MIR model is not significant.", out.width="70%"}
knitr::include_graphics("http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_models/log..oc_usda.calc_wpct/ap.visnir.mir_mlr..eml_ossl_na_v1.rds.png")
```
The bundle approach in the OSSL is used at the moment by default for all soil
variables, although one should keep in mind that the training points are more
patchy i.e. are less representative of the global conditions. As more and more
training data is added to the OSSL, we should be able to serve more representative
calibration models.
The bundle approach could be further extended to using also e.g.:
- spatial covariates (e.g. WorldClim layers),
- bare-Earth Spectra [@dematte2020bare],
- soil moisture daily estimates (matching the date of soil sampling) [@brocca2019sm2rain],
```{r map-bareearth, echo=FALSE, fig.cap="Example of a satellite-based bare-earth spectral product based on @dematte2020bare. The global data is available at 250-m spatial resolution.", out.width="90%"}
knitr::include_graphics("img/bareearth_spectra_250m.jpg")
```
More and more layers will be gradually added to the OSSL registry to enable users to
increase accuracy of their predictions. For updates in the OSSL model library
please subscribe to [our channels](https://soilspectroscopy.org) to stay up-to-date.
## Registering your own model
### Access OSSL data to fit your own models
To load the complete analysis-ready OSSL regression-matrix dataset (486MB) as a
single table in R and run predictive modeling please use:
```{r, eval=FALSE}
rep2 = "http://s3.us-east-1.wasabisys.com/soilspectroscopy/ossl_import/"
rm.ossl = readRDS(url(paste0(rep2, "rm.ossl_v1.rds"), "rb"))
```
```
dim(rm.ossl)
## 152,146 obs. of 2962 variables
```
The compilation currently contains training MIR (70k) and VisNIR (70k) spectral
scans + soil calibration data (51,732 unique locations).
We are now fitting [globally-applicable prediction models](https://github.com/soilspectroscopy/ossl-models)
to enable free and open production of primary soil data. [The output models](https://doi.org/10.5281/zenodo.5759693) are available via the S3 cloud-service
and/or API under CC-BY license. If you fit your own models and/or if you are interested in contributing
to this project please contact us and help us make better open soil data for global good!
### Registering and hosting models
If you plan to contribute your own models to the Soil Spectroscopy for Global Good project
please follow these steps:
1. We recommend registering and sharing new models via S3 file
service system (e.g. via OpenGeoHub's Wasabi.com cloud-service) with RDS files produced using
[Dockerized solution for full scientific reproducibility](https://github.com/nuest/docker-reproducible-research).
2. For each model the following metadata need to be supplied:
- Registered [docker image](https://hub.docker.com/r/opengeohub/r-geo); which should specify in detail: software in
use, R package versions etc;
- DOI of the zenodo repository with a back-up for the data (i.e. how to cite your models);
- URL for accessing the RDS file via S3;
We can host your models on our infrastructure at no additional costs. Please
[contact us](https://soilspectroscopy.org/contact/) and apply for registering your models via soilspectroscopy.org.