# Example Forecast Workflow {#sec-example}
Here is an example of a complete workflow for generating a forecast submission to the Challenge. The example is for the aquatics theme and forecasts water temperature and oxygen. The water temperature forecast uses a linear regression between air temperature and water temperature to predict water temperature in the future. It then uses the predicted water temperature to predict water oxygen concentration by assuming that the oxygen is 100% saturated at the predicted temperature.
To generate the forecast we need to:
1) Build a relationship between past air temperature and past water temperature.\
2) Apply the relationships from #1 to forecasted air temperature.\
3) Calculate oxygen concentration from the forecasted water temperature.
Therefore we need to:
1) Download the historical water temperature data for the NEON sites (called "targets").\
2) Download historical air temperature data for the NEON sites (we use the stacked NOAA GEFS weather product).\
3) Download the NOAA weather forecast for the NEON sites.\
4) Create a linear regression model based on historical data for each NEON site.\
5) Apply the linear regression to the weather forecasts for each NEON site.\
6) Write forecast output file.\
7) Submit forecast to Challenge.
Each of these steps is described below.
## Step 0: Set up R environment and directories
We will be downloading NOAA forecasts from the Challenge's S3 bucket and submitting our forecast to an S3 bucket, so we load the packages needed for S3 access (`arrow`) and data manipulation.
```{r eval = FALSE}
library(tidyverse)
library(neon4cast)
library(lubridate)
library(rMR)
library(arrow)
```
Define the date that the forecast starts. For demonstration purposes, we are setting the date to `2022-02-15`. In a real-time application, use `forecast_date <- Sys.Date()` or `forecast_date <- Sys.Date() - lubridate::days(1)` (in some cases the NOAA data for the current day is not yet available by the time you run your forecast).
```{r eval = FALSE}
forecast_date <- lubridate::as_date("2022-02-15")
```
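For a real-time run, the same variable can instead be derived from the system clock, as noted above; a minimal sketch:
```{r eval = FALSE}
# Real-time alternative: use yesterday's date in case today's NOAA
# forecast is not yet available when this script runs.
forecast_date <- Sys.Date() - lubridate::days(1)
```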
## Step 0: Define the model identifier
```{r eval = FALSE}
model_id <- "neon4cast-example"
```
## Download latest target data and site description data
These targets are updated when new data is available from NEON.
```{r eval = FALSE}
target <- readr::read_csv("https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz", guess_max = 1e6)
```
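If you want a quick sanity check of what was downloaded, the targets are a long-format table whose columns (`datetime`, `site_id`, `variable`, `observation`) are used later in this workflow; a minimal look:
```{r eval = FALSE}
# Peek at the first few temperature and oxygen observations.
target |>
  dplyr::filter(variable %in% c("temperature", "oxygen")) |>
  dplyr::slice_head(n = 5)
```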
A table is available with NEON site descriptions. The calculation of oxygen saturation requires the elevation of each site, which is included in the site description table.
```{r eval = FALSE}
site_data <- readr::read_csv("https://raw.githubusercontent.com/eco4cast/neon4cast-targets/main/NEON_Field_Site_Metadata_20220412.csv") |>
filter(aquatics == 1)
```
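Only two columns from this table are used later in the workflow, the site identifier and the mean elevation; a quick way to confirm they are present:
```{r eval = FALSE}
# field_site_id is used to match sites; field_mean_elevation_m feeds the
# oxygen saturation calculation further below.
site_data |>
  dplyr::select(field_site_id, field_mean_elevation_m)
```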
## Download past NOAA forecasts stacked together
To build the relationship between air and water temperature, we need historical air temperature data to pair with the historical water temperature data. Here we use a product that the Challenge organizers created that stitches together day-1 NOAA weather forecasts (i.e., when the forecasts are most accurate) to generate an estimate of past weather. We download this "stacked" NOAA product (stage 3) for the set of NEON sites in the targets file.
```{r eval = FALSE}
df_past <- neon4cast::noaa_stage3()
```
Define a helper function that averages the predicted 0h-horizon ensemble members to get daily 'historical' air temperature values for a site.
```{r}
noaa_mean_historical <- function(df_past, site, var) {
  df_past |>
    dplyr::filter(site_id == site,
                  variable == var) |>
    dplyr::rename(ensemble = parameter) |>
    dplyr::select(datetime, prediction, ensemble) |>
    dplyr::mutate(date = as_date(datetime)) |>
    dplyr::group_by(date) |>
    dplyr::summarize(air_temperature = mean(prediction, na.rm = TRUE),
                     .groups = "drop") |>
    dplyr::rename(datetime = date) |>
    dplyr::mutate(air_temperature = air_temperature - 273.15) |>
    dplyr::collect()
}
```
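As a quick illustration of how the helper is used (the site ID here is purely illustrative; any aquatics `site_id` in the targets works):
```{r eval = FALSE}
# Daily mean air temperature (degrees C) reconstructed from stacked forecasts.
noaa_mean_historical(df_past, "BARC", "air_temperature")
```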
## Download NOAA future forecast
We also need NOAA weather forecasts for the future. Fortunately, the Challenge organizers download and subset the weather forecasts for each NEON site. The helper below reads the stage 1 forecast that started at midnight UTC (cycle `0`) on a given `reference_date` for a single site and averages the sub-daily values to daily means for each ensemble member.
```{r eval = FALSE}
noaa_mean_forecast <- function(site, var, reference_date) {
  endpoint <- "data.ecoforecast.org"
  bucket <- glue::glue("neon4cast-drivers/noaa/gefs-v12/stage1/0/{reference_date}")
  s3 <- arrow::s3_bucket(bucket, endpoint_override = endpoint, anonymous = TRUE)

  # stage1 air temp is Celsius
  arrow::open_dataset(s3) |>
    dplyr::filter(site_id == site,
                  datetime >= lubridate::as_datetime(reference_date),
                  variable == var) |>
    dplyr::select(datetime, prediction, parameter) |>
    dplyr::mutate(datetime = as_date(datetime)) |>
    dplyr::group_by(datetime, parameter) |>
    dplyr::summarize(air_temperature = mean(prediction), .groups = "drop") |>
    dplyr::select(datetime, air_temperature, parameter) |>
    dplyr::rename(ensemble = parameter) |>
    dplyr::collect()
}
```
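A call for a single site looks like this (again, the site ID is only illustrative):
```{r eval = FALSE}
# Daily ensemble air temperature forecast for one site, starting on forecast_date.
noaa_mean_forecast("BARC", "TMP", forecast_date)
```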
We'll skip any site that doesn't have both temperature and oxygen observations.
```{r eval = FALSE}
sites <- target |> na.omit() |> distinct(site_id, variable) |>
filter(variable %in% c("oxygen", "temperature")) |>
count(site_id) |> filter(n==2) |> pull(site_id)
```
## Define the forecast model for a site
```{r eval = FALSE}
forecast_site <- function(site) {
  message(paste0("Running site: ", site))

  # Get site information for elevation
  site_info <- site_data |> dplyr::filter(field_site_id == site)

  # Historical daily air temperatures
  noaa_past_mean <- noaa_mean_historical(df_past, site, "air_temperature")

  # Merge past NOAA data into the targets file, matching by date
  site_target <- target |>
    dplyr::select(datetime, site_id, variable, observation) |>
    dplyr::filter(variable %in% c("temperature", "oxygen"),
                  site_id == site) |>
    tidyr::pivot_wider(names_from = "variable", values_from = "observation") |>
    dplyr::left_join(noaa_past_mean, by = c("datetime"))

  rm(noaa_past_mean) # save RAM

  # Fit linear model based on past data: water temperature = m * air temperature + b
  fit <- lm(temperature ~ air_temperature, data = site_target)

  # Get 30-day predicted temperature ensemble at the site
  noaa_future <- noaa_mean_forecast(site, "TMP", forecast_date)

  # Use the linear model (predict.lm) to forecast water temperature for each ensemble member
  temperature <-
    noaa_future |>
    mutate(site_id = site,
           prediction = predict(fit, tibble(air_temperature)),
           variable = "temperature")

  # Use forecasted water temperature to predict oxygen by assuming that oxygen is saturated
  forecasted_oxygen <-
    rMR::Eq.Ox.conc(temperature$prediction,
                    elevation.m = site_info$field_mean_elevation_m,
                    bar.press = NULL,
                    bar.units = NULL,
                    out.DO.meas = "mg/L",
                    salinity = 0,
                    salinity.units = "pp.thou")

  # Stick the temperature and oxygen forecasts together
  oxygen <-
    noaa_future |>
    mutate(site_id = site,
           prediction = forecasted_oxygen,
           variable = "oxygen")

  forecast <- dplyr::bind_rows(temperature, oxygen)

  # Format results to the EFI standard and return
  forecast |>
    mutate(reference_datetime = forecast_date,
           family = "ensemble",
           model_id = model_id) |>
    rename(parameter = ensemble) |>
    select(model_id, datetime, reference_datetime,
           site_id, family, parameter, variable, prediction)
}
```
AND HERE WE GO! We're ready to start forecasting.
Test with a single site first!
```{r eval = FALSE}
forecast <- forecast_site( sites[1] )
```
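Before plotting, it can be worth confirming that the output carries the EFI-standard columns built inside `forecast_site()` (`model_id`, `datetime`, `reference_datetime`, `site_id`, `family`, `parameter`, `variable`, `prediction`):
```{r eval = FALSE}
# Quick structural check of the single-site forecast.
dplyr::glimpse(forecast)
```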
Visualize the ensemble predictions -- what do you think?
```{r eval = FALSE}
forecast |>
ggplot(aes(x = datetime, y = prediction, group = parameter)) +
geom_line(alpha=0.3) +
facet_wrap(~variable, scales = "free")
```
Run all sites -- may be slow!
```{r eval = FALSE}
forecast <- map_dfr(sites, forecast_site)
```
The forecast output file name must follow the standard required by the Challenge: the theme name, the date, and the model ID, separated by hyphens.
The `csv.gz` extension means the file will be compressed.
```{r eval = FALSE}
file_date <- Sys.Date() #forecast$reference_datetime[1]
forecast_file <- paste0("aquatics","-",file_date,"-",model_id,".csv.gz")
```
Write csv to disk
```{r eval = FALSE}
write_csv(forecast, forecast_file)
```
## Submit forecast!
```{r eval = FALSE}
neon4cast::submit(forecast_file = forecast_file, metadata = NULL, ask = FALSE)
```
You can check on the status of your submission using:
```{r eval = FALSE}
neon4cast::check_submission(forecast_file)
```
On the day following submission, you can see the forecast on the dashboard at [shiny.ecoforecast.org](https://shiny.ecoforecast.org).
## Complete registration and metadata
Complete the [registration and model overview](https://forms.gle/kg2Vkpho9BoMXSy57){target="_blank"} form that describes your model.
## Example on github
The example code above can be found on [GitHub as a template repository](https://github.com/eco4cast/neon4cast-example.git). See the README for more information about using the template.