-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter_2_lm.qmd
1438 lines (1214 loc) · 64.2 KB
/
chapter_2_lm.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Assessing uncertainties related to satellite remote sensing indices to estimate Gross Primary Production
## Introduction
Vegetation Gross Primary Production (GPP) is the total amount of carbon fixation
by plants through photosynthesis [@badgley_terrestrial_2019]. Quantifying GPP is
essential for understanding land-atmosphere carbon exchange
[@kohler_global_2018], ecosystem function, and ecosystem responses to climate
change [@guan_comparison_2022; @brown_fiducial_2021; @myneni_relationship_1994].
However, terrestrial GPP cannot be directly measured due to the contribution of
respiration to land carbon fluxes [@anav_spatiotemporal_2015]. Instead, GPP can
be inferred in a non-destructive manner by the net carbon exchange measurements
at the ecosystem level, or at broader scales using models that incorporate
various assumptions and limitations [@reichstein_separation_2005;
@jung_towards_2009].
GPP estimations can be grouped into two broad categories: Eddy Covariance (EC)
techniques, and satellite data-driven methods [@guan_comparison_2022;
@xie_assessments_2020]. EC is the primary in-situ non-destructive method for
measuring terrestrial fluxes, and specifically for quantifying the exchange of
CO2 between land and the atmosphere, using advanced field instrumentation
[@baldocchi_how_2020; @badgley_terrestrial_2019; @ryu_what_2019;
@tramontana_predicting_2016]. However, EC measurements come with certain
limitations, such as their relatively low spatial resolution, typically less
than \< 1km^2^ , which constrains the accuracy of estimating ecosystem carbon
and water fluxes at regional and global scales. [@badgley_canopy_2017].
Additionally, it's important to note that EC techniques directly measure Net
Ecosystem Exchange (NEE), not GPP. Subsequently, GPP must be estimated by both
subtracting respiration using models and ancillary measurements while accounting
for the removal or deposition of carbon stocks due to natural or anthropogenic
transport processes such as water flow, fires, or harvest
[@beer_terrestrial_2010; @reichstein_separation_2005].
The satellite data-driven models constitute the second category of methods to
estimate GPP. Because of their dependency on earth observation platforms, they
are not spatially constrained but can have greater uncertainties than the EC
techniques [@ryu_what_2019; @wang_diagnosing_2011]. Satellite data-driven models
can be classified into process-based models, Light Use Efficiency models (LUE),
and Vegetation Index models [@xie_assessments_2020].
Process-based models integrate climate, canopy, and soil information derived
from multiple sources, including satellite EO, into biophysical models of
carbon, water, energy, and nutrient cycles with varying levels of detail
[@running_general_1988; @harris_global_2021]. While these models can be scaled
globally [@beer_terrestrial_2010; @jung_towards_2009] they require many
parameters that may not be readily available in changing landscapes or for
fine-scale studies.
LUE models are based on the concept of radiation conversion efficiency and take
into consideration ecological processes [@liu1997process;
@heinsch_evaluation_2006]. This efficiency signifies the amount of carbon a
specific vegetation type can fix per unit of solar radiation
[@monteith_solar_1972]. Initially used for NPP estimation [@prince_model_1991],
LUE models were subsequently adapted for GPP and respiration calculations
[@goetz_mapping_1999]. These models explicitly account for the impact of
environmental stress on plant physiological responses. The imposition of
environmental stressors may lead to a reduction in the rates of daily carbon
assimilation, thereby diminishing overall efficiency. [@prince_model_1991;
@running_general_1988].
The GPP product from the Moderate Resolution Imaging Spectroradiometer (MODIS)
employs an algorithm based on the radiation conversion efficiency concept. This
algorithm establishes a connection between absorbed photosynthetically active
radiation (APAR) and the LUE term [@heinsch_evaluation_2006] as shown in
@eq-gpp.
$$
\small
GPP = PAR \times fAPAR \times LUE
$$ {#eq-gpp}
Where PAR is the incident photosynthetically active radiation
[@monteith2013principles] and `fAPAR` is the fraction of the PAR that is
effectively absorbed by plants [@gcos200]. The LUE term depends on vegetation
type but also physiological conditions are driven by water availability,
temperature stress, and vapour pressure deficit [@goetz_mapping_1999;
@running_general_1988]. Obtaining these variables for every vegetation type on
Earth can be challenging, introducing assumptions that amplify uncertainties
[@goetz_mapping_1999]. Nevertheless, under unstressed conditions, LUE remains
constant for a given vegetation type, requiring only PAR and fAPAR to assess
primary productivity [@running_continuous_2004].
VIs are the Satellite data-driven models' third approach. VIs are a summary of
non-linear functions of surface bi-directional reflectance spectra
[@myneni_relationship_1994] derived from optical sensors that are combined with
climate variables to calculate GPP [@wu_predicting_2011]. This is usually done
with some form of regression and physical methods that associate interactions
between vegetation and incoming radiation [@fernandez-martinez_monitoring_2019].
VIs have been used to provide inputs to @eq-gpp related to fAPAR from regional
to global extents or to estimate GPP [@sellers_global_1994;
@running_continuous_2004]. Some of the most common VIs to estimate GPP are the
Normalized Difference Vegetation Index (NDVI), the Enhanced Vegetation Index
(EVI), the Near-Infrared Reflectance Index (NIRv), or the Chlorophyll/Carotenoid
Index (CCI) among others. [@balzarolo_influence_2019; @rahman_potential_2005;
@rahman_potential_2005; @xie_assessments_2020; @badgley_terrestrial_2019;
@zhang_potential_2020; @sellers_global_1994]. These VIs are based on a spectral
reflectance ratio between the red and near-infrared regions of the
electromagnetic spectrum [@glenn_relationship_2008] which tracks an integrated
impact of fraction of photosynthetically active radiation (fAPAR) and LUE on
productivity [@myneni_relationship_1994].
An index such as NDVI is good for detecting structural vegetation changes in
seasonal variability, but it becomes saturated with high biomass conditions
[@badgley_canopy_2017]. Other indices such as EVI can overcome the soil and
atmospheric effects by adding the blue band [@huete1988soil] and may be better
suited for predicting GPP in large biomass forests [@badgley_canopy_2017].
Nonetheless, it has been evaluated in a narrow range of ecosystems and needs
inputs of start and end dates of the growing seasons which can increase
uncertainties [@shi_assessing_2017]. For specific types of ecosystems such as
evergreen conifers, CCI can track the seasonality of daily GPP due to its
sensibility to the chlorophyll/carotenoid pigment ratios [@gamon_remotely_2016].
Other indices such as the near-infrared reflectance index (NIRv) have been
formulated to address the mixed-pixel effect (pixel with vegetated and
non-vegetated features) and to determine the vegetation photosyntetic capacity
[@badgley_terrestrial_2019]. NIRv is defined as the fraction of reflected NIR
light that originates from vegetation. NIRv was originally proposed as a
replacement for fPAR in LUE models in that the NIR and PAR reflectance of
vegetation are correlated and the scaling by NDVI corrects for soil
contributions in the signal [@badgley_terrestrial_2019]. NIRv has been shown to
have a stronger correlation to GPP at flux towers and on a regional basis than
fAPAR notwithstanding the fact that GPP is directly related to fAPAR. NIRv has
been extended to include weighting with PAR [@dechant_nirvp_2022] and replacing
the NIR reflectance with NIRv radiance [@wu_radiance-based_2020]. Both of these
approaches have been shown to have even stronger correlations with GPP than NIRv
as could be expected since they either directly or indirectly weight the NIRv
with PAR.
The stronger correlation between the NIRv index to GPP in comparison to the
correlation between VIs related to fAPAR and GPP seemingly contradicts the
hypotheses in the LUE model that GPP should be linearly related to fAPAR. There
are two explanations: i. Many of the reported comparative studies use fAPAR
based on VIs and not APAR. We hypothesize that the NIR indices are simply better
estimators of APAR than these VI approximations due to lower measurement error
for the NIR indices or a stronger physical relationship between them and APAR
versus the historical VIs. ii. The strength of the correlation between APAR and
GPP depends on LUE having a linear relationship to observed APAR. While this may
hold in some circumstances (e.g. early seasonal measurements for vegetation such
as crops where leaf chlorophyll concentration increases during the growth phase)
it is not the case in general and definitely during stress conditions
[@monteith_solar_1972].
Despite these efforts, relying solely on remote satellite VIs presents a
challenge. The assumptions accompanying VI models suggest the importance of
systematically quantifying their predictive capacity for GPP to validate and
improve their accuracy [@anav_spatiotemporal_2015; @brown_fiducial_2021] across
various growing seasons and locations. This is particularly crucial because
photosynthesis regulation can occur with no major changes in canopy structure or
leaf pigments that can undergo without being detected with reflectance data
[@pabon-moreno_potential_2022; @pierrat_diurnal_2022], implying that temporal
aggregation may be critical for VI models. In-situ eddy-covariance flux
measurements coupled with locally calibrated models for respiration
[@baldocchi_how_2020] represent a suitable reference GPP for validation solution
given that they represent site-level observations
[@chu_representativeness_2021].
As such, the main objective of this M.Sc. thesis chapter is to quantify and
compare the uncertainty associated with the VI/LUE models to estimate GPP. To
control for variability in environmental conditions (E), sites are selected that
share the same land cover, climate, and biome characteristics. VIs models are
evaluated by comparison to EC based estimates of GPP for multiple growing
seasons from the Bartlett Experimental Forest (USA), the Borden Forest Research
Station flux-site (Canada), and the University of Michigan Biological Station
(USA). Considering the reviewed studies, we hypothesize that the uncertainty of
this approach will depend on the nature of the VI and the spatial and temporal
aggregation of application. Specifically, we expect that i. the NIRv and CCI
indices will consistently demonstrate a stronger correlation and lower
prediction uncertainty for GPP compared to NDVI and EVI across the tested sites.
ii. larger temporal aggregations will result in improved predictions due to the
reduction in observation variability.
\newpage
## Methods
```{r libraries and sources}
#| echo: false
#| message: false
#| warning: false
# Libraries
library(ggplot2)
library(cowplot)
library(lubridate)
library(purrr)
library(broom)
library(gt)
library(tidymodels)
library(broom)
library(usemodels)
library(vip)
library(h2o)
library(mgcv)
# Source files
# Source the objects created for the complete GPP trends.
# This file will source the code and load objects to memory.
source("scripts/trend_plots.R")
# Source the objects created for the complete GPP trends
source("scripts/models_data_preparation.R")
# Source file with functions to plot rf predictions
source("R/plot_exploratory.R")
```
### Eddy Covariance sites
We used three deciduous broadleaf forest sites located in the northern
hemisphere (see @fig-sites_locations) with eddy covariance (EC) data collected
by Ameriflux. For each site, we used the daily, weekly, and monthly GPP values
(GPP_DT_VUT_REF variable) estimated using the ONEFlux workflow
[@pastorello2020fluxnet2015]. The ONEFlux processing does the estimation of the
CO~2~ fluxes into GPP and Ecosystem Respiration (RECO) from Net Ecosystem
Exchange (NEE) through two methods known as daytime and nighttime. Here we
selected the daytime method (DT) which uses daytime and nighttime to
parameterize a model with two components: one based on light response curve and
vapour pressure deficit and a second one using a respiration-temperature
relationship to estimate RECO which in turn is used to obtain the difference
with NEE and provide GPP [@pastorello2020fluxnet2015].
![Sites locations](img/ge_sites.jpg){#fig-sites_locations}
@fig-gpp_trends displays the GPP trends for the University of Michigan
Biological Station, Bartlett Experimental Forest, and the Borden Forest Research
Station. Additional details regarding the characteristics of these datasets can
be found in Table @tbl-oneflux_datasets.
| Site | Data range available | Dataset name | Reference |
|--------------------|--------------------|--------------------|--------------------|
| Bartlett | Jan 2015 to Dec 2017 | US-Bar: Barlett Experimental Forest (version: beta-3) | [@staebler2019ameriflux] |
| Borden | Jan 2015 to Jan 2022 | CA-Cbo: Ontario - Mixed Deciduous, Borden Forest Site | [@richardson2016ameriflux] |
| Michigan | Jan 2015 to Jan 2018 | US-UMB: Univ. of Mich. Biological Station (version: beta-4) | [@gough2016ameriflux] |
: ONEFlux sites datasets description {#tbl-oneflux_datasets}
The Bartlett experimental forest is located in New Hampshire, USA (44°06′N,
71°3′W). This site is characterized by a forest with an average canopy height
ranging from 20 to 22 meters with a mean annual temperature of 6°C. Despite
events such as a hurricane in 1938 and small scale forest management, the
forest's mean stand age is around 120-125 years [@ouimette_carbon_2018].
The second flux site is Borden Forest Research Station located in Ontario
(44°19′N, 79°56′W), Canada. This is one of the largest patches of forest in
Southern Ontario which has been collecting EC data since 1996
[@rogers_response_2020]. This site has a forest cover of over 60% with a height
of approximately 22 m. It's a deciduous broadleaf natural re-growth forest since 1916
dominated by woody vegetation [@lee_long-term_1999].
The third site is the University of Michigan Biological Station which is located
in northern Michigan, USA (45°350 N 84°430 W). The site has a forest with
different succesional stages, with an average stand age of 90 years
[@gough_wood_2010], and a mean height of 22m. The mean annual temperature is
around 5.5°C [@gough_disturbanceaccelerated_2021].
The three sites exhibit similar characteristics, indicating their representation
of a specific ecosystem type. This uniformity enables meaningful comparisons and
offers valuable insights into the relationship between GPP and VIs. A tabular
summary of site characteristics, guided by insights detailed in Teets
[-@teets_coupling_2022], is presented in @tbl-site_summary.
```{r}
#| label: tbl-site_summary
#| tbl-cap: "ONEFlux Site characteristics overview"
#| tbl-colwidths: [50,50]
#| echo: false
#| message: false
#| warning: false
data <- data.frame(
Variable = c("Mean annual temperature (°C)", "Mean annual precipitation (mm)", "Elevation (m)", "Dominant genera", "Climate Koeppen"),
Bartlett = c(6, 1246, 272, "Acer, Fagus, Betula", "Dfb"),
Michigan = c(5.5, 803, 234, "Populus, Quercus, Pinus", "Dfb"),
Borden = c(7.4, 784, 209, "Acer, Pinus, Populus", "Dfb")
)
data %>%
gt() %>%
tab_spanner(
label = "Site",
columns = c("Bartlett", "Michigan", "Borden")
) %>%
fmt_number(
columns = c("Bartlett", "Michigan", "Borden"),
decimals = 1
) %>%
tab_footnote(
footnote = md("**D** stands for the warm-summer continental or hemiboreal climate.
**f** indicates that this climate has significant precipitation in all seasons.
**b** indicates that the warmest month has an average temperature between 22°C and 28°C."),
locations = cells_body(columns = Variable, rows = 5)
)
```
```{r}
#| label: fig-gpp_trends
#| fig-cap: "Reference in-situ GPP time series from the study sites on a daily (a), weekly (b), and monthly (c) basis for the University of Michigan Biological Station, Bartlett experimental forest, and the Borden Forest Research Station"
#| fig-width: 11
#| fig-height: 12
#| echo: false
#| message: false
#| warning: false
gpp_trends
```
<!-- ```{r} -->
<!-- #| label: fig-michigan_gpp_trends -->
<!-- #| fig-cap: "GPP trends for Michigan" -->
<!-- #| fig-width: 7 -->
<!-- #| fig-height: 9 -->
<!-- #| echo: false -->
<!-- #| message: false -->
<!-- #| warning: false -->
<!-- michigan_gpp_trends -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| label: fig-bartlett_gpp_trends -->
<!-- #| fig-cap: "GPP trends for Bartlett" -->
<!-- #| fig-width: 7 -->
<!-- #| fig-height: 9 -->
<!-- #| echo: false -->
<!-- #| message: false -->
<!-- #| warning: false -->
<!-- bartlett_gpp_trends -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| label: fig-borden_gpp_trends -->
<!-- #| fig-cap: "GPP trends for Borden" -->
<!-- #| fig-width: 7 -->
<!-- #| fig-height: 9 -->
<!-- #| echo: false -->
<!-- #| message: false -->
<!-- #| warning: false -->
<!-- borden_gpp_trends -->
<!-- ``` -->
<!-- The GPP for each of the sites can -->
<!-- - include description of the ONEFlux processing (gap filling, data quality, -->
<!-- gpp estimation, gpp uncertainty with DT and NT) -->
### Satellite imagery
We used data from the Terra Moderate Resolution Imaging Spectroradiometer
(MODIS), specifically the collection MOD09GA Version 6.1 product (MODIS/Terra
Surface Reflectance Daily L2G Global 1 km and 500 m SIN Grid) chosen for its
daily sampling and broad temporal coverage. Data retrieval was performed using
Google Earth Engine (GEE). For each site, a square polygon with an area of 3 km
surrounding the EC tower was defined, and the pixel values within this polygon
were extracted for comprehensive analysis.
The MODIS contains the surface spectral reflectance from bands 1 through 7
with a spatial resolution of 500m, with corrections for atmospheric conditions
such as aerosols, gasses, and Rayleigh scattering [@vermote2021modis], and
validation of cloud-free pixels as well. Bands used to derive the vegetation
indices are shown in @tbl-MODIS_500_indices_bands
| **Name** | **Description** | **Resolution** | **Wavelength** |
|-------------|-----------------|----------------|----------------|
| sur_refl_01 | Red | 500 meters | 620-670nm |
| sur_refl_02 | NIR | 500 meters | 841-876nm |
| sur_refl_03 | Blue | 500 meters | 459-479nm |
| sur_refl_04 | Green | 500 meters | 545-565nm |
: MODIS (MOD09GA.061 product) bands used to calculate the VIs
{#tbl-MODIS_500_indices_bands}
<!-- | Data | Spatial resolution (m) | GEE image collection | -->
<!-- |-----------------------|------------------------|------------------------------| -->
<!-- | MODIS reflectance | 500 | MODIS/006/MOD09GA | -->
<!-- | MODIS reflectance | 250 | MODIS/061/MOD09A1 | -->
<!-- | Harmonized Sentinel-2 | 20 - 60 - 10 | COPERNICUS/ S2_SR_HARMONIZED | -->
<!-- : Satellite images datasets {#tbl-satellite_images_datasets} -->
The data processing involved three main steps: selecting high-quality pixels,
scaling band values, and calculating vegetation indices. MODIS product data have
4 bit-encoded variables which provide information about the observation
quality. From those variables, only the 1km Reflectance Data State QA
(`state_1km`) and Surface Reflectance 500m Quality Assurance (`qc_500m`)
variables were used along with each of the band's bits quality indicators, as
250m scan value information (`q_scan`) was not informative and Geolocation flags
(`g_flags`) had the same value for all observations. The bit-encoded variables
were transformed into categorical strings, and only the categories indicating
the best quality were selected to filter the pixels (@fig-quality_pixels). The
specific bit strings selected for `state_1km` are shown in
@tbl-state_1km_bitstrings and for `qc_500m` in @tbl-qc_scan_bit_strings.
Subsequently, the surface reflectance for each filtered pixel was determined by
scaling the digital number recorded by `0.0001`.
```{r}
#| label: fig-quality_pixels
#| fig-cap: "Total number of observations (pixels) from MODIS classified as high quality (used in the analysis) or other quality (filtered out from the analysis) per site."
#| echo: false
#| message: false
#| warning: false
source("scripts/quality_observations.R")
all %>%
ggplot(aes(x = site, fill = quality)) +
geom_bar(position = "stack") +
scale_fill_viridis_d(begin = 0.34, end = 0.8) +
labs(x = "Site",
y = "Total observations (pixels)",
fill = "Quality") +
theme_bw(base_size = 12)
```
Following the MODIS Collection 6.1 (C61) LSR Product User Guide
[@vermote2021modis], any scaled value that fell outside the range of `0` to `1`
was considered a fill value or uncorrected Level 1B data and was subsequently
discarded. These values were deemed unreliable or lacking meaningful information
for the analysis. The number of high-quality surface reflectance observations
was summarized on a monthly basis for each site (@fig-complete_quality_pixels)
```{r}
#| label: fig-complete_quality_pixels
#| fig-cap: "Total number of observations (pixels) from MODIS classified as high quality (used in the analysis) or other quality (filtered out from the analysis)"
#| echo: false
#| message: false
#| warning: false
source("scripts/quality_observations.R")
all %>%
filter(quality == "high") %>%
mutate(year_mon = zoo::as.yearmon(date)) %>%
ggplot(aes(x = year_mon, fill = site)) +
geom_bar(position = "stack") +
scale_fill_viridis_d(begin = 0.2, end = 0.8) +
labs(x = "Date",
y = "Total observations (pixels)",
fill = "Site") +
theme_bw(base_size = 12)
```
The VIs `NDVI` (@eq-ndvi), `NIRv` (@eq-nirv), `EVI` (@eq-evi), and `CCI`
(@eq-cci) were calculated and then matched with the corresponding date in the
flux datasets (Green, Red, and NIR correspond to the bands defined in
@tbl-MODIS_500_indices_bands).
<!-- However, since the -->
<!-- MODIS product with a spatial resolution of 250m (`MODIS/061/MOD09A1`) does not -->
<!-- include the blue band EVI calculations were not performed for this dataset. -->
```{=tex}
\begingroup
\fontsize{10pt}{10pt}\selectfont
```
$$
\small
NDVI = \frac{NIR - Red}{NIR + Red}
$$ {#eq-ndvi}
$$
NIRv = NIR\times\frac{NIR - Red}{NIR + Red}
$$ {#eq-nirv}
$$
EVI = 2.5\times\frac{\mathrm{NIR} - \mathrm{Red}}{%
(\mathrm{NIR} + 6\times \mathrm{Red}
- 7.5\times \mathrm{Blue} + 1)}\\
$$ {#eq-evi}
<!-- $$ -->
<!-- kNDVI = tanh(\mathrm{NDVI}^2) -->
<!-- $$ {#eq-kndvi} -->
$$
CCI = \frac{Green - Red}{Green + Red}
$$ {#eq-cci} \endgroup
<!-- For Harmonized Sentinel-2 data the data processing consisted on using the -->
<!-- `msk_cldprb` variable to filter out pixels according to the probabilities of -->
<!-- containing a cloud or the `msk_snwprb` variable for presence of snow -->
<!-- probability. Once the pixels without any probability of having clouds or snow -->
<!-- were filtered from the rest, we scaled all the reflectance bands observations by -->
<!-- a factor of `0.0001`. With the values scaled we proceed to calculate the -->
<!-- vegetation indices `NDVI` (@eq-ndvis), `NIRv` (@eq-nirvs), and `EVI` (@eq-evis). -->
<!-- The bands used to calculate these indices are in -->
<!-- @tbl-harmonized_s2_indices_bands -->
<!-- $$ -->
<!-- NDVI = \frac{b8 - b4}{b8 + b4} -->
<!-- $$ {#eq-ndvis} -->
<!-- $$ -->
<!-- NIRv = b8\times\frac{b8 - b4}{b8 + b4} -->
<!-- $$ {#eq-nirvs} -->
<!-- $$ -->
<!-- EVI = 2.5\times\frac{\mathrm{b8} - \mathrm{b4}}{% -->
<!-- (\mathrm{b8} + 6\times \mathrm{b4} -->
<!-- - 7.5\times \mathrm{b2} + 1)}\\ -->
<!-- $$ {#eq-evis} -->
### Data Preparation
Three datasets were prepared for each site: a daily, a weekly, and a monthly
dataset. These datasets were generated from the satellite imagery data with the
selected high-quality pixels and the ONEFluxprocess data in order to capture
variations in vegetation indices (VI), band values, and GPP over different time
scales.
The daily dataset included the VI values, band values only from the high-quality
pixels, and GPP measurements derived from the ONEFlux process collected on a
daily basis for the time period of available GPP at each site
@tbl-oneflux_datasets. This dataset provided a high-resolution representation of
the variables, allowing for a detailed analysis of their daily fluctuations.
The weekly and monthly datasets were derived from the corresponding daily
dataset. These datasets contained summarized values of the VIs and band values,
aggregated over the weekly and monthly time frames, respectively. The
aggregation process involved calculating an average for the VIs and band values
within each week or month.
For the GPP values in the weekly and monthly datasets, rather than summarizing
the daily GPP values, the GPP measurements for the weekly and monthly time
frames were obtained directly from the ONEFlux process, which provided a
reliable estimation of GPP for these longer time intervals.
By creating these three datasets (daily, weekly, and monthly), the study allowed
for a comprehensive analysis of the VIs, band values, and GPP at different
temporal resolutions. This approach provided insights into the temporal dynamics
and patterns of the variables, enabling a more thorough understanding of the
processes and relationships under investigation.
For each site and across all time scales (daily, weekly, and monthly), GPP
values below 1 gC m^-2^ d^-1^ were excluded. These values were deemed either
below the detection limit or insufficient to make a significant contribution to
the overall analysis, particularly in representing meaningful vegetation
productivity. This process aimed to refine the dataset and focus on values
within a range considered more pertinent to the growing season. The final number
of observations per site after selecting the high-quality pixels and GPP
observations are shown in @fig-obs_used_analysis.
```{r}
#| label: fig-obs_used_analysis
#| fig-cap: "Monthly high-quality MODIS observations after joining with flux observations containing Gross Primary Productivity (GPP) values higher than 1."
#| echo: false
#| message: false
#| warning: false
# Dataset used to analyze the data
daily_plot_500 %>%
# Dataset have all indices per row, so for the same date there are 5 obs
filter(index == "ndvi_mean") %>%
group_by(site, date) %>%
tally() %>%
group_by(site) %>%
mutate(year_mon = zoo::as.yearmon(date)) %>%
ggplot(aes(x = year_mon, fill = site)) +
geom_bar(position = "stack") +
scale_fill_viridis_d(begin = 0.2, end = 0.8) +
labs(x = "Date",
y = "Total observations",
fill = "Site") +
theme_bw(base_size = 12)
# daily_plot_500 %>%
# pivot_wider(names_from = index, values_from = value) %>%
# select(date, total_obs, site) %>%
# mutate(year_mon = zoo::as.yearmon(date)) %>%
# group_by(site, year_mon) %>%
# tally() %>%
# ungroup() %>%
# mutate(year_mon = as.factor(year_mon)) %>%
# ggplot(aes(x = year_mon, y = n, fill = site)) +
# geom_bar(stat = "identity", position = "stack") +
# scale_fill_viridis_d(begin = 0.2, end = 0.8) +
# labs(x = "Date",
# y = "Total observations",
# fill = "Site") +
# theme_bw(base_size = 12) +
# theme(axis.text.x = element_text(angle = 90, h = 1))
```
### Data Analysis
To address the potential non-linearity of the lowest uncertainty data-driven
model across different temporal aggregation scales or for specific VIs, we
employed two distinct modeling approaches: a Linear Model (LM) and a
Generalized Additive Model (GAM). GAM models allow a better fit for those cases
where the distribution and variability observed in the data is greater, due to
the varying temporal scales. We aimed to evaluate whether this variance could be
better explained by this type of model, which might not be optimal to capture
with a linear relationship, potentially leading to greater residuals. Both
models’ approaches were applied individually to test our hypothesis regarding VI
predictive uncertainty. Additionally, a single model was applied to assess
whether a combination of VIs could improve GPP estimation uncertainty.
This facilitated a comparative analysis of how each individual index
independently explains the variation in GPP per site. Additionally, the
performance of each VI was also evaluated using models calibrated using all
sites to assess their robustness to spatial variability within the selected
temperate broadleaf biome across various temporal scales. Another step of the
analysis was the examination of the relationship between GPP and all indices
functioning as covariates, both on an individual site basis and without site
distinction.
In total, per each site plus the category of all-sites we created 5 linear
models and 5 GAM models for every timescale (daily, weekly, and monthly): one
per each index (NDVI, CCI, EVI, NIRv) and one with all the indices as covariates
(NDVI + CCI + EVI + NIRv), resulting in a total of 120 models. To assess and
compare the models' performance, we calculated the coefficient of determination
(R²) to measure the correlation between the actual observations and predictions.
Additionally, we used the Root Mean Squared Error (RMSE) and the Mean Absolute
Error (MAE) as indicators of the model estimation error.
\newpage
## Results
### Analysis of GPP-Vegetation Index Relationships Using Linear Models
<!-- ### Monthly GPP and VIs relations -->
The @tbl-lm_monthly_results provides a summary of linear models used for GPP
estimation at each site, employing the vegetation indices as predictors. For
each site and predictor, the table includes the relevant model summary
statistics, such as R², MAE, and RMSE as indicators of the statistical
significance of the model fit. All models add p-values \< 0.05 More metrics such
as the p-value, adjusted r-squared, the Akaike Information Criterion (AIC), and
Bayesian Information Criterion (BIC) are displayed in the
@tbl-complete_lm_monthly_results for monthly results,
@tbl-complete_lm_weekly_results for weekly results, and the
@tbl-complete_lm_daily_results for daily outputs. A residuals distribution for
each of the models is in @fig-lm_residuals
Our findings show that when using all the indices as covariates within the
linear model, the model's performance demonstrates better outcomes compared to
using any single index alone across all scenarios. In the assessment of
individual VI performance on a monthly basis, CCI tend to perform better than
EVI, NDVI, and NIRv. Although all models were statistically significant (p \<
0.05), it is important to note that for the Bartlett and Michigan sites, while
CCI shows favourable predictive accuracy, its advantage over the NIRv and EVI is
slightly better and mostly due to the MAE
($0.86 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ for Bartlett and
$1.02 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ for Michigan) and RMSE
($1.05 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ for Bartlett and
$1.29 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ for Michigan) results.
The NDVI displays relatively diminished performance, indicating a 9% and 15%
reduction in the R² compared to CCI in the Bartlett and Michigan sites,
respectively. In contrast, EVI exhibits less favourable predictive results (R² =
0.75, MAE = $1.92 \, \mathrm{gC \, m^{-2} \, d^{-1}}$, RMSE =
$2.54 \, \mathrm{gC \, m^{-2} \, d^{-1}}$) for the Borden site and the
aggregated sites category, although in the specific context of the Borden site,
its performance aligns closely with NDVI and NIRv. Among the individual sites,
Bartlett has the most favourable predictive outcomes in terms of R², MAE, and
RMSE for every individual VI model, while the aggregated sites category yields
the least favourable predictive results in the same scenarios.
In the case of the weekly models, NDVI records the least favourable results in
terms of R², MAE, and RMSE when assessed at the three individual sites, while
EVI demonstrated its weakest predictive capabilities when all sites were treated
as a single entity. However, for this case, EVI shows marginal differences for
individual indices such as NDVI and NIRv (R² = 0.01 variance explained and no more
than $0.02 \, \mathrm{gC \, m^{-2} \, d^{-1}}$ in RMSE). The best performant
individual indices were CCI in terms of R², MAE, and RMSE for Borden and the
aggregated sites, NIRv for Michigan and EVI for Bartlett. Nonetheless, those
superior performances are subtle when compared with the other individual
indices. On a weekly basis, differences in variability explained between the
best and least performing models range from R² = 0.11 to R² = 0.2. Bartlett and
Michigan sites consistently yield the most accurate predictive models.
On a daily basis, CCI outperforms the rest of the individual indices in 3 cases:
for Bartlett by R² = 0.02 in variance explanation compared with EVI, for Borden by
R² = 0.01 compared with EVI, and for the combined sites dataset by R² = 0.1.
Nonetheless is worth mentioning that the variance explained in any of the models
by individual indices in Borden or the combined site is less than R² = 0.5 in
most of the cases and the error scores are the highest. In the case of
Michigan, the best performing individual index was EVI which outperformed the
next best-performing individual index NIRv by R² = 0.03. Generally, the Bartlett
and Michigan sites consistently yielded the most accurate predictive models
across various configurations. Conversely, the Borden site consistently
exhibited the poorest model performance across all scenarios.
Overall, when comparing individual indices, CCI consistently performed better
across different timeframes in terms of variance explainability and error
metrics. Nonetheless, those differences are subtle compared to EVI and NIRv. In
contrast, NDVI consistently performs less mentioning across all evaluated
timeframes. Notably, models based on monthly values consistently exhibit better
performance than those based on weekly or daily values as ilustrated in
@fig-lm_metrics.
```{=tex}
\begingroup
\setlength{\LTleft}{0pt minus 500pt}
\setlength{\LTright}{0pt minus 500pt}
\fontsize{5pt}{7pt}\selectfont
\addtolength{\tabcolsep}{-3pt}
```
```{r lm_monthly}
#| label: tbl-lm_monthly_results
#| tbl-cap: "Summary of Linear models for GPP estimation using the vegetation indices on a monthly (a), weekly (b), and daily (c) basis. MAE and RMSE metrics units are gC m-⁻² d-⁻¹"
#| tbl-colwidths: [50,50]
#| echo: false
#| message: false
#| warning: false
source("scripts/lm_preparation.R")
create_metrics_table <- function(data) {
data %>%
gt() %>%
tab_spanner(label = md("NDVI"), columns = ends_with("NDVI")) %>%
tab_spanner(label = "EVI", columns = ends_with("EVI")) %>%
tab_spanner(label = "NIRv", columns = ends_with("NIRv")) %>%
tab_spanner(label = "CCI", columns = ends_with("CCI")) %>%
tab_spanner(label = "All", columns = ends_with("All")) %>%
cols_label(
.list = list(
"site" = "Site",
"r.squared_EVI" = "R2",
"mae_EVI" = "MAE",
"rmse_EVI" = "RMSE",
"r.squared_NDVI" = "R2",
"mae_NDVI" = "MAE",
"rmse_NDVI" = "RMSE",
"r.squared_NIRv" = "R2",
"mae_NIRv" = "MAE",
"rmse_NIRv" = "RMSE",
"r.squared_CCI" = "R2",
"mae_CCI" = "MAE",
"rmse_CCI" = "RMSE",
"r.squared_All" = "R2",
"mae_All" = "MAE",
"rmse_All" = "RMSE"
)
) %>%
cols_align(
align = "center",
columns = 2:16
) %>%
fmt_number(
columns = 2:16,
decimals = 2) %>%
tab_options(
row_group.background.color = "#E9E0E1",
row_group.font.weight = "bold"
) %>%
cols_width(everything() ~ px(50))
}
# Monthly table
vis_site_glance_monthly %>%
bind_rows(all_sites_glance_monthly,
all_sites_all_vis_glance_monthly,
all_vis_glance_monthly) %>%
select(site, index, r.squared, mae, rmse) %>%
pivot_wider(names_from = index,
values_from = c(r.squared, mae, rmse)) %>%
create_metrics_table()
# Weekly table
vis_site_glance_weekly %>%
bind_rows(all_sites_glance_weekly,
all_sites_all_vis_glance_weekly,
all_vis_glance_weekly) %>%
select(site, index, r.squared, mae, rmse) %>%
pivot_wider(names_from = index,
values_from = c(r.squared, mae, rmse)) %>%
create_metrics_table()
# Daily table
vis_site_glance_daily %>%
bind_rows(all_sites_glance_daily,
all_sites_all_vis_glance_daily,
all_vis_glance_daily) %>%
select(site, index, r.squared, mae, rmse) %>%
pivot_wider(names_from = index,
values_from = c(r.squared, mae, rmse)) %>%
create_metrics_table()
```
```{=tex}
\endgroup
```
```{r lm_barplot_metrics}
#| label: fig-lm_metrics
#| fig-cap: "Summary of Linear models for GPP estimation using the vegetation indices on a monthly (a), weekly (b), and daily (c) basis. MAE and RMSE metrics units are gC m⁻² d⁻¹"
#| fig-width: 7
#| fig-height: 9
#| echo: false
#| message: false
#| warning: false
library(cowplot)
# Create a function to generate the plot
create_metrics_plot <- function(timescale, y_var, ylim_range) {
get(paste0("vis_site_glance_", timescale)) %>%
select(site, index, {{ y_var }}) %>%
bind_rows(get(paste0("all_sites_glance_", timescale)),
get(paste0("all_sites_all_vis_glance_", timescale)),
get(paste0("all_vis_glance_", timescale))) %>%
ggplot(aes(x = site, y = .data[[y_var]], fill = index)) +
geom_col(position = "dodge") +
coord_cartesian(ylim = ylim_range) +
scale_fill_viridis_d() +
labs(x = "", fill = "Index") +
theme_minimal_hgrid(font_size = 12) +
theme(axis.text.x = element_text(angle = 45, h = 1))
}
# Response variables are going to be the same:
response_vars <- c("r.squared", "mae", "rmse")
# Monthly plots
monthly_metrics_plots <- map2(response_vars,
list(c(0.3, 1), c(0.5, 4), c(0.5, 5)),
~ create_metrics_plot("monthly", .x, .y))
# Weekly plots
weekly_metrics_plots <- map2(response_vars,
list(c(0.3, 1), c(0.5, 4), c(0.5, 5)),
~ create_metrics_plot("weekly", .x, .y))
# Daily plots
daily_metrics_plots <- map2(response_vars,
list(c(0.3, 1), c(0.5, 4), c(0.5, 5)),
~ create_metrics_plot("daily", .x, .y))
# Grid the plots as should go in the chapter
lm_metrics_plots <- plot_grid(
monthly_metrics_plots[[1]] + labs(y = expression(R^{"2"})) + theme(legend.position = "none"),
weekly_metrics_plots[[1]] + labs(y = expression(R^{"2"})) + theme(legend.position = "none"),
daily_metrics_plots[[1]] + labs(y = expression(R^{"2"})) + theme(legend.position = "none"),
monthly_metrics_plots[[2]] + labs(y = "MAE") + theme(legend.position = "none"),
weekly_metrics_plots[[2]] + labs(y = "MAE") + theme(legend.position = "none"),
daily_metrics_plots[[2]] + labs(y = "MAE") + theme(legend.position = "none"),
monthly_metrics_plots[[3]] + labs(y = "RMSE") + theme(legend.position = "none"),
weekly_metrics_plots[[3]] + labs(y = "RMSE") + theme(legend.position = "none"),
daily_metrics_plots[[3]] + labs(y = "RMSE") + theme(legend.position = "none"),
nrow = 3,
ncol = 3,
labels = c("A", "B", "C"))
plot_legend <- get_legend(
monthly_metrics_plots[[1]] +
guides(color = guide_legend(nrow = 3))
)
plot_grid(lm_metrics_plots, plot_legend, ncol = 2, rel_widths = c(1, .1))
rm(plot_legend)
```
### Analysis of GPP-Vegetation Index Relationships Using GAM Models
In @tbl-gam_model_results, we present a summary of the results obtained from the
GAM models used for GPP estimation at each site, employing the vegetation
indices as predictors. To compare between the models, we include in the table
relevant model summary statistics, such as R², MAE, and RMSE. Furthermore,
additional metrics such as the p-value, the F statistic (f), effective degrees
of freedom (edf), and the Akaike Information Criterion (AIC), are displayed in
@tbl-gam_daily_model_results_complete_a, and
@tbl-gam_daily_model_results_complete_b for daily models,
@tbl-gam_weekly_model_results_complete_a, and
@tbl-gam_weekly_model_results_complete_b for weekly outputs, and for monthly
results the @tbl-gam_monthly_model_results_complete_a, and
@tbl-gam_monthly_model_results_complete_b. A residuals distribution for each of
the models is in @fig-gam_residuals
When using GAM models on a monthly basis, no single VI demonstrates consistent
superiority over the others. For the all sites category, CCI has an R² = 0.03
better variance explanation than NDVI, which is the second best model with an R²
= 0.75. In the case of Michigan EVI and NIRv were the best individual indices
with an R² = 0.96 of the variance in GPP, but EVI had slightly lower error metrics
values in MAE ($0.01 \, \mathrm{gC \, m^{-2} \, d^{-1}}$) and RMSE
($0.02 \, \mathrm{gC \, m^{-2} \, d^{-1}}$). Bartlett had a better model
performance when using CCI with an R² = 0.91 of variance explained with the lowest
error metrics among all the GAM monthly models. It's important to highlight that
for Michigan and Bartlett, implementing a GAM model using all the VIs as
covariates posed challenges due to limited observations for the model
parameters, raising concerns of potential overfitting.
Finally, it's noteworthy that NDVI on a monthly basis displayed suboptimal
performance in two of the sites (Michigan and Bartlett) with a difference of R²
= 0.1 and R² = 0.13 with the best performing models respectively, while EVI
exhibited less favourable results in just the Borden site with MAE
$1.92 \, \mathrm{gC \, m^{-2} \, d^{-1}}$) and RMSE
$2.54 \, \mathrm{gC \, m^{-2} \, d^{-1}}$).
On a weekly basis, models incorporating all VIs as covariates consistently
obtained better performance compared to any individual VI, irrespective of the
site. Specifically when evaluating the all sites category, the inclusion of all
indices as covariates yielded an R² = 0.07 increase in variance explanation
compared with the best individual VI result CCI. Further, for Michigan, this
improvement amounted to R² = 0.02 compared to EVI, R² = 0.01 for Bartlett in
contrast to EVI, and an R² = 0.02 enhancement for Borden when compared with EVI.
Conversely, NDVI showed a diminished performance as an individual index when
compared with all the other individual VIs. In the context of all sites, NDVI
yielded R² = 0.07 less variance explanation than CCI. Notably, for Michigan,
NDVI's performance lagged by R² = 0.17 compared to EVI, Bartlett an R² = 0.18
reduction relative to EVI, and Borden exhibited an R² = 0.09 deficit when
contrasted with EVI.
On a daily basis, when considering the all sites category, the model with all
the VIs as covariates explained R² = 0.04 more variance in GPP when compared
with the best performing individual VI CCI. In the case of Bartlett, the
increase was also R² = 0.04 but in this case, the best individual performing VI
was EVI. For Michigan, the model using all VIs as covariates outperformed EVI by
R² = 0.05 in GPP prediction, and for Borden, it was an R² = 0.09 improvement
when compared with NIRv.
Among the individual VIs, NDVI consistently demonstrated the poorest performance
across all four cases. As an individual VI, EVI performed better in Bartlett,
Michigan, and Borden. However is worth noting that for Borden the variance
explained was limited to R² = 0.5.
In summary, among the three individual sites, Bartlett consistently produced the
most favourable results in terms of models explaining variance and yielding
lower residuals, followed by Michigan. In the case of Borden, when employing
individual indices, the models struggled to achieve a variance explanation
exceeding R² = 0.5.
<!-- ### Daily and weekly GPP and VIs relations -->
<!-- **TODO: Single vs Individual GAM model** -->
<!-- *This is a section under construction. Several models are presented in order -->
<!-- to discuss which one will be better to describe the patterns found* -->
<!-- Richard's suggestion is to create a single model using all indices as covariates. -->
<!-- The notes are described also in the issue [Ref #49](https://github.com/ronnyhdez/thesis_msc/issues/49) -->
<!-- **Single model** -->
<!-- - Using all the indices that I have as multiple predictors will allow me to -->
<!-- explore the relation between GPP and each VI while also considering potential -->
<!-- interactions. But this will be an examination of the combined effects of the -->
<!-- VIs on GPP and determine their relative importance. -->
<!-- **Individual model** -->
<!-- - A separate GAM model for each VI will allow me to explore the relation -->
<!-- between GPP and each VI separately, without the exploration of potential -->
<!-- interactions between the VIs. -->