Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modeltime_fit_resamples() automatically 'lags' the .row column #10

Open
luzbarbosa opened this issue Apr 27, 2022 · 1 comment
Open

Comments

@luzbarbosa
Copy link

The out-of-sample predictions generated by modeltime_fit_resamples() incorrectly show the predicted variable as if it was lagged.

When I run the following code it outputs out-of-sample projections, as expected :

  submodels_resamples_tscv_tbl <- submodels_tbl %>% 
    modeltime_fit_resamples(
      resamples    = cv_resamples
    )

However, a closer inspection of the output provided by modeltime_fit_resamples() shows that the predicted variable is not indexed by the same ".row" (id column) as it was inside "cv_resamples"


# Input received by modeltime_fit_resamples()
input <- cv_resamples %>%
      filter(id == "Slice01") %>% 
      pull(splits) %>%
      first() %>% 
      training() %>% 
      select(.row, value)

input %>% filter(.row >= 1816, .row <= 1825 ) %>% print()

image

# Output generated by modeltime_fit_resamples()

output <- submodels_resamples_tscv_tbl %>% 
    select(.model_desc, .resample_results) %>% 
    unnest(.resample_results) %>% 
    select(.model_desc, id, .predictions) %>% 
    unnest(.predictions)


output %>% select(.row, value) %>%  filter(.row >= 1816, .row <= 1825 ) %>% print()

image

If we plot those variables, we can clearly see that the test set was lagged by a few days.
The first slice inside 'cv_resamples' includes the test and the training set, both depicted in blue.
The output, includes the out-of-sample projections for the test set but we ignore those and plot only the realized value of the target variable, in red.
I include the code below just for completeness.

ggplot() +
      geom_line(data = output, aes(x = .row, y = value, color = "input")) +
      geom_line(data = input,  aes(x = .row, y = value, color = "output")) 

image

@AlbertoAlmuinha
Copy link
Contributor

Hi @luzbarbosa ,

Could you provide a reproducible example?

It is very complicated to be able to help you without knowing how you have generated the resamples or how you have reached those results.

In the tests that I have done, I do not see that the variable is lagging and both show that they are identical. Here is the code I used to test it in case it could be of any use to you:

library(tidymodels)
library(modeltime)
library(modeltime.resample)
library(timetk)
library(tidyverse)
library(tidyquant)


full_data_tbl <- walmart_sales_weekly %>%
  select(id, Date, Weekly_Sales) %>%
  
  # Apply Group-wise Time Series Manipulations
  group_by(id) %>%
  future_frame(
    .date_var   = Date,
    .length_out = "3 months",
    .bind_data  = TRUE
  ) %>%
  ungroup() %>%
  
  # Consolidate IDs
  mutate(id = fct_drop(id))

# Training Data
data_prepared_tbl <- full_data_tbl %>%
  filter(!is.na(Weekly_Sales))

# Forecast Data
future_tbl <- full_data_tbl %>%
  filter(is.na(Weekly_Sales))

walmart_tscv <- data_prepared_tbl %>%
  time_series_cv(
    date_var    = Date, 
    assess      = "3 months",
    skip        = "3 months",
    cumulative  = TRUE,
    slice_limit = 6
  )


walmart_tscv %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(Date, Weekly_Sales, 
                           .facet_ncol = 2, .interactive = F)


recipe_spec <- recipe(Weekly_Sales ~ ., 
                      data = training(walmart_tscv$splits[[1]])) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(day)|(hour)|(minute)|(second)|(am.pm)")) %>%
  step_mutate(Date_week = factor(Date_week, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)


wflw_fit_xgboost <- workflow() %>%
  add_model(
    boost_tree() %>% set_engine("xgboost") 
  ) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(walmart_tscv$splits[[1]]))



model_tbl <- modeltime_table(
  wflw_fit_xgboost
)


resample_results <- model_tbl %>%
  modeltime_fit_resamples(
    resamples = walmart_tscv,
    control   = control_resamples(verbose = FALSE)
  )

input <- walmart_tscv %>%
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 


input_2<-resample_results$.resample_results[[1]] %>% 
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 

identical(input, input_2)

In case you need more help, please send a reprex.

Regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants