# Example Forecast Workflow {#sec-example}
Here is an example of a complete workflow for generating a forecast submission to the Challenge. The example is for the aquatics theme and forecasts water temperature and oxygen. The water temperature forecast uses a linear regression between air temperature and water temperature to predict water temperature in the future. It then uses the predicted water temperature to predict water oxygen concentration by assuming that the oxygen is 100% saturated at the predicted temperature.
To generate the forecast we need to:
1) Build a relationship between past air temperature and past water temperature.\
2) Apply the relationships from #1 to forecasted air temperature.\
3) Calculate oxygen concentration from the forecasted water temperature.
Therefore we need to:
1) Download the historical water temperature data for the NEON sites (called "targets").\
2) Download historical air temperature data for the NEON sites (we use the stacked NOAA GEFS weather product).\
3) Download the NOAA weather forecast for the NEON sites.\
4) Create a linear regression model based on historical data for each NEON site.\
5) Apply the linear regression to the weather forecasts for each NEON site.\
6) Write forecast output file.\
7) Submit forecast to Challenge.
Each of these steps is described below.
## Step 0: Set up R environment and directories
We will be downloading NOAA forecasts from the Challenge's S3 bucket and submitting our forecast to an S3 bucket, so we load the packages needed for S3 access (`arrow`) and data manipulation.
```{r eval = FALSE}
library(tidyverse)
library(neon4cast)
library(lubridate)
library(rMR)
library(arrow)
```
Define the date that the forecast starts. For demonstration purposes, we are setting the date to `2022-02-15`. In a real-time application, use `forecast_date <- Sys.Date()` or `forecast_date <- Sys.Date() - lubridate::days(1)` (in some cases the NOAA data for the current day is not yet available by the time you run your forecast).
```{r eval = FALSE}
forecast_date <- lubridate::as_date("2022-02-15")
```
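For a real-time run, the same variable can instead be derived from the system clock, as noted above; a minimal sketch:
```{r eval = FALSE}
# Real-time alternative: use yesterday's date in case today's NOAA
# forecast is not yet available when this script runs.
forecast_date <- Sys.Date() - lubridate::days(1)
```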
## Step 0: Define the model identifier
```{r eval = FALSE}
model_id <- "neon4cast-example"
```
## Download latest target data and site description data
These targets are updated when new data is available from NEON.
```{r eval = FALSE}
target <- readr::read_csv("https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz", guess_max = 1e6)
```
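If you want a quick sanity check of what was downloaded, the targets are a long-format table whose columns (`datetime`, `site_id`, `variable`, `observation`) are used later in this workflow; a minimal look:
```{r eval = FALSE}
# Peek at the first few temperature and oxygen observations.
target |>
  dplyr::filter(variable %in% c("temperature", "oxygen")) |>
  dplyr::slice_head(n = 5)
```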
A table is available with NEON site descriptions. The calculation of oxygen saturation requires the elevation of each site, which is included in the site description table.
```{r eval = FALSE}
site_data <- readr::read_csv("https://raw.githubusercontent.com/eco4cast/neon4cast-targets/main/NEON_Field_Site_Metadata_20220412.csv") |>
filter(aquatics == 1)
```
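Only two columns from this table are used later in the workflow, the site identifier and the mean elevation; a quick way to confirm they are present:
```{r eval = FALSE}
# field_site_id is used to match sites; field_mean_elevation_m feeds the
# oxygen saturation calculation further below.
site_data |>
  dplyr::select(field_site_id, field_mean_elevation_m)
```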
## Download past NOAA forecasts stacked together
To build the relationship between air and water temperature, we need historical air temperature data to pair with the historical water temperature data. Here we use a product that the Challenge organizers created that stitches together day-1 NOAA weather forecasts (i.e., when the forecasts are most accurate) to generate an estimate of past weather. We download this "stacked" NOAA product (stage 3) for the set of NEON sites in the targets file.
```{r eval = FALSE}
df_past <- neon4cast::noaa_stage3()
```
Define a helper function that averages the predicted 0h-horizon ensemble members to get daily 'historical' air temperature values for a site.
```{r}
noaa_mean_historical <- function(df_past, site, var) {
  df_past |>
    dplyr::filter(site_id == site,
                  variable == var) |>
    dplyr::rename(ensemble = parameter) |>
    dplyr::select(datetime, prediction, ensemble) |>
    dplyr::mutate(date = as_date(datetime)) |>
    dplyr::group_by(date) |>
    dplyr::summarize(air_temperature = mean(prediction, na.rm = TRUE),
                     .groups = "drop") |>
    dplyr::rename(datetime = date) |>
    dplyr::mutate(air_temperature = air_temperature - 273.15) |>
    dplyr::collect()
}
```
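As a quick illustration of how the helper is used (the site ID here is purely illustrative; any aquatics `site_id` in the targets works):
```{r eval = FALSE}
# Daily mean air temperature (degrees C) reconstructed from stacked forecasts.
noaa_mean_historical(df_past, "BARC", "air_temperature")
```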
## Download NOAA future forecast
We also need NOAA weather forecasts for the future. Fortunately, the Challenge organizers download and subset the weather forecasts for each NEON site. The helper below reads the stage 1 forecast that started at midnight UTC (cycle `0`) on a given `reference_date` for a single site and averages the sub-daily values to daily means for each ensemble member.
```{r eval = FALSE}
noaa_mean_forecast <- function(site, var, reference_date) {
  endpoint <- "data.ecoforecast.org"
  bucket <- glue::glue("neon4cast-drivers/noaa/gefs-v12/stage1/0/{reference_date}")
  s3 <- arrow::s3_bucket(bucket, endpoint_override = endpoint, anonymous = TRUE)

  # stage1 air temp is Celsius
  arrow::open_dataset(s3) |>
    dplyr::filter(site_id == site,
                  datetime >= lubridate::as_datetime(reference_date),
                  variable == var) |>
    dplyr::select(datetime, prediction, parameter) |>
    dplyr::mutate(datetime = as_date(datetime)) |>
    dplyr::group_by(datetime, parameter) |>
    dplyr::summarize(air_temperature = mean(prediction), .groups = "drop") |>
    dplyr::select(datetime, air_temperature, parameter) |>
    dplyr::rename(ensemble = parameter) |>
    dplyr::collect()
}
```
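A call for a single site looks like this (again, the site ID is only illustrative):
```{r eval = FALSE}
# Daily ensemble air temperature forecast for one site, starting on forecast_date.
noaa_mean_forecast("BARC", "TMP", forecast_date)
```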
We'll skip any site that doesn't have both temperature and oxygen observations.
```{r eval = FALSE}
sites <- target |> na.omit() |> distinct(site_id, variable) |>
filter(variable %in% c("oxygen", "temperature")) |>
count(site_id) |> filter(n==2) |> pull(site_id)
```
## Define the forecast model for a site
```{r eval = FALSE}
forecast_site <- function(site) {
  message(paste0("Running site: ", site))

  # Get site information for elevation
  site_info <- site_data |> dplyr::filter(field_site_id == site)

  # Historical daily air temperatures
  noaa_past_mean <- noaa_mean_historical(df_past, site, "air_temperature")

  # Merge past NOAA data into the targets file, matching by date
  site_target <- target |>
    dplyr::select(datetime, site_id, variable, observation) |>
    dplyr::filter(variable %in% c("temperature", "oxygen"),
                  site_id == site) |>
    tidyr::pivot_wider(names_from = "variable", values_from = "observation") |>
    dplyr::left_join(noaa_past_mean, by = c("datetime"))

  rm(noaa_past_mean) # save RAM

  # Fit linear model based on past data: water temperature = m * air temperature + b
  fit <- lm(temperature ~ air_temperature, data = site_target)

  # Get 30-day predicted temperature ensemble at the site
  noaa_future <- noaa_mean_forecast(site, "TMP", forecast_date)

  # Use the linear model (predict.lm) to forecast water temperature for each ensemble member
  temperature <-
    noaa_future |>
    mutate(site_id = site,
           prediction = predict(fit, tibble(air_temperature)),
           variable = "temperature")

  # Use forecasted water temperature to predict oxygen by assuming that oxygen is saturated
  forecasted_oxygen <-
    rMR::Eq.Ox.conc(temperature$prediction,
                    elevation.m = site_info$field_mean_elevation_m,
                    bar.press = NULL,
                    bar.units = NULL,
                    out.DO.meas = "mg/L",
                    salinity = 0,
                    salinity.units = "pp.thou")

  # Stick the temperature and oxygen forecasts together
  oxygen <-
    noaa_future |>
    mutate(site_id = site,
           prediction = forecasted_oxygen,
           variable = "oxygen")

  forecast <- dplyr::bind_rows(temperature, oxygen)

  # Format results to the EFI standard and return
  forecast |>
    mutate(reference_datetime = forecast_date,
           family = "ensemble",
           model_id = model_id) |>
    rename(parameter = ensemble) |>
    select(model_id, datetime, reference_datetime,
           site_id, family, parameter, variable, prediction)
}
```
AND HERE WE GO! We're ready to start forecasting.
Test with a single site first!
```{r eval = FALSE}
forecast <- forecast_site( sites[1] )
```
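Before plotting, it can be worth confirming that the output carries the EFI-standard columns built inside `forecast_site()` (`model_id`, `datetime`, `reference_datetime`, `site_id`, `family`, `parameter`, `variable`, `prediction`):
```{r eval = FALSE}
# Quick structural check of the single-site forecast.
dplyr::glimpse(forecast)
```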
Visualize the ensemble predictions -- what do you think?
```{r eval = FALSE}
forecast |>
ggplot(aes(x = datetime, y = prediction, group = parameter)) +
geom_line(alpha=0.3) +
facet_wrap(~variable, scales = "free")
```
Run all sites -- may be slow!
```{r eval = FALSE}
forecast <- map_dfr(sites, forecast_site)
```
The forecast output file name must follow the standard required by the Challenge: the theme name, the date, and the model ID, separated by hyphens.
The `csv.gz` extension means the file will be compressed.
```{r eval = FALSE}
file_date <- Sys.Date() #forecast$reference_datetime[1]
forecast_file <- paste0("aquatics","-",file_date,"-",model_id,".csv.gz")
```
Write csv to disk
```{r eval = FALSE}
write_csv(forecast, forecast_file)
```
## Submit forecast!
```{r eval = FALSE}
neon4cast::submit(forecast_file = forecast_file, metadata = NULL, ask = FALSE)
```
You can check on the status of your submission using:
```{r eval = FALSE}
neon4cast::check_submission(forecast_file)
```
On the day following submission, you can see the forecast on the dashboard at [shiny.ecoforecast.org](https://shiny.ecoforecast.org).
## Complete registration and metadata
Complete the [registration and model overview](https://forms.gle/kg2Vkpho9BoMXSy57){target="_blank"} form that describes your model.
## Example on github
The example code above can be found on [GitHub as a template repository](https://github.com/eco4cast/neon4cast-example.git). See the README for more information about using the template.