modeltime_fit_resamples() automatically 'lags' the .row column #10

luzbarbosa · 2022-04-27T21:07:40Z

The out-of-sample predictions generated by modeltime_fit_resamples() incorrectly show the predicted variable as if it was lagged.

When I run the following code it outputs out-of-sample projections, as expected :

  submodels_resamples_tscv_tbl <- submodels_tbl %>% 
    modeltime_fit_resamples(
      resamples    = cv_resamples
    )

However, a closer inspection of the output provided by modeltime_fit_resamples() shows that the predicted variable is not indexed by the same ".row" (id column) as it was inside "cv_resamples"


# Input received by modeltime_fit_resamples()
input <- cv_resamples %>%
      filter(id == "Slice01") %>% 
      pull(splits) %>%
      first() %>% 
      training() %>% 
      select(.row, value)

input %>% filter(.row >= 1816, .row <= 1825 ) %>% print()

# Output generated by modeltime_fit_resamples()

output <- submodels_resamples_tscv_tbl %>% 
    select(.model_desc, .resample_results) %>% 
    unnest(.resample_results) %>% 
    select(.model_desc, id, .predictions) %>% 
    unnest(.predictions)


output %>% select(.row, value) %>%  filter(.row >= 1816, .row <= 1825 ) %>% print()

If we plot those variables, we can clearly see that the test set was lagged by a few days.
The first slice inside 'cv_resamples' includes the test and the training set, both depicted in blue.
The output, includes the out-of-sample projections for the test set but we ignore those and plot only the realized value of the target variable, in red.
I include the code below just for completeness.

ggplot() +
      geom_line(data = output, aes(x = .row, y = value, color = "input")) +
      geom_line(data = input,  aes(x = .row, y = value, color = "output"))

The text was updated successfully, but these errors were encountered:

AlbertoAlmuinha · 2022-08-09T20:38:32Z

Hi @luzbarbosa ,

Could you provide a reproducible example?

It is very complicated to be able to help you without knowing how you have generated the resamples or how you have reached those results.

In the tests that I have done, I do not see that the variable is lagging and both show that they are identical. Here is the code I used to test it in case it could be of any use to you:

library(tidymodels)
library(modeltime)
library(modeltime.resample)
library(timetk)
library(tidyverse)
library(tidyquant)


full_data_tbl <- walmart_sales_weekly %>%
  select(id, Date, Weekly_Sales) %>%
  
  # Apply Group-wise Time Series Manipulations
  group_by(id) %>%
  future_frame(
    .date_var   = Date,
    .length_out = "3 months",
    .bind_data  = TRUE
  ) %>%
  ungroup() %>%
  
  # Consolidate IDs
  mutate(id = fct_drop(id))

# Training Data
data_prepared_tbl <- full_data_tbl %>%
  filter(!is.na(Weekly_Sales))

# Forecast Data
future_tbl <- full_data_tbl %>%
  filter(is.na(Weekly_Sales))

walmart_tscv <- data_prepared_tbl %>%
  time_series_cv(
    date_var    = Date, 
    assess      = "3 months",
    skip        = "3 months",
    cumulative  = TRUE,
    slice_limit = 6
  )


walmart_tscv %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(Date, Weekly_Sales, 
                           .facet_ncol = 2, .interactive = F)


recipe_spec <- recipe(Weekly_Sales ~ ., 
                      data = training(walmart_tscv$splits[[1]])) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(day)|(hour)|(minute)|(second)|(am.pm)")) %>%
  step_mutate(Date_week = factor(Date_week, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)


wflw_fit_xgboost <- workflow() %>%
  add_model(
    boost_tree() %>% set_engine("xgboost") 
  ) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(walmart_tscv$splits[[1]]))



model_tbl <- modeltime_table(
  wflw_fit_xgboost
)


resample_results <- model_tbl %>%
  modeltime_fit_resamples(
    resamples = walmart_tscv,
    control   = control_resamples(verbose = FALSE)
  )

input <- walmart_tscv %>%
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 


input_2<-resample_results$.resample_results[[1]] %>% 
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 

identical(input, input_2)

In case you need more help, please send a reprex.

Regards,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

modeltime_fit_resamples() automatically 'lags' the .row column #10

modeltime_fit_resamples() automatically 'lags' the .row column #10

luzbarbosa commented Apr 27, 2022

AlbertoAlmuinha commented Aug 9, 2022

modeltime_fit_resamples() automatically 'lags' the .row column #10

modeltime_fit_resamples() automatically 'lags' the .row column #10

Comments

luzbarbosa commented Apr 27, 2022

AlbertoAlmuinha commented Aug 9, 2022