diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION
index 3fb62007..8ca7fefb 100644
--- a/CRAN-SUBMISSION
+++ b/CRAN-SUBMISSION
@@ -1,3 +1,3 @@
-Version: 0.4.0
-Date: 2023-11-30 18:00:42 UTC
-SHA: 1235adc3b2d33e0656c5d2be6c511b412899df27
+Version: 0.5.0
+Date: 2024-10-25 17:33:48 UTC
+SHA: 6c5e3e5422865a2a86f84b65472dc8d907a787cd
diff --git a/DESCRIPTION b/DESCRIPTION
index 1ebe790e..ce55ae64 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: finnts
 Title: Microsoft Finance Time Series Forecasting Framework
-Version: 0.4.0.9008
+Version: 0.5.0
 Authors@R:
     c(person(given = "Mike",
              family = "Tokic",
diff --git a/NEWS.md b/NEWS.md
index 24eaa754..6c37ec41 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,4 +1,4 @@
-# finnts 0.4.0.9008 (DEVELOPMENT VERSION)
+# finnts 0.5.0

 ## Improvements

diff --git a/vignettes/models-used-in-finnts.Rmd b/vignettes/models-used-in-finnts.Rmd
index fb2da330..1b6f8330 100644
--- a/vignettes/models-used-in-finnts.Rmd
+++ b/vignettes/models-used-in-finnts.Rmd
@@ -84,7 +84,7 @@ By default within `prep_models()`, the `multistep_horizon` argument is set to FA
 - svm-rbf
 - xgboost

-A multistep model optimizes for each period in a forecast horizon. Let's take an example of a monthly data set with a forecast horizon of 3. When creating the features for the R1 recipe, finnts will create lags of 1, 2, 3, 6, 9, 12 months. Then when training a mulitstep model it will iteratively use specific features to train the model. First it will train a model on the first forecast horizon (H1), where it will use all available feature lags. Then for H2 it will use lags of 2 or more. Finally for H3 it will use lags of 3 or more. So the final model is actually a collection of multiple models that each trained on a specific horizon. This lets the model optimize for using all available data when creating the forecast. So in our example, one glmnet model actually has three separate horizon specific models under the hood.
+A multistep model optimizes for each period in a forecast horizon. Let's take an example of a monthly data set with a forecast horizon of 3. When creating the features for the R1 recipe, finnts will create lags of 1, 2, 3, 6, 9, 12 months. Then when training a multistep model it will iteratively use specific features to train the model. First it will train a model on the first forecast horizon (H1), where it will use all available feature lags. Then for H2 it will use lags of 2 or more. Finally for H3 it will use lags of 3 or more. So the final model is actually a collection of multiple models that each trained on a specific horizon. This lets the model optimize for using all available data when creating the forecast. So in our example, one glmnet model actually has three separate horizon specific models under the hood.

 A few more things to mention. If `multistep_horizon` is TRUE then other multivariate models like arima-boost or prophet-xregs will not run a multistep horizon approach. Instead they will use lags that are equal to or greater than the forecast horizon. One set of hyperparameters will be chosen for each multistep model, meaning glmnet will only use one combination of final hyperparameters and apply it to each horizon model. Multistep models are not ran for the R2 recipe, since it has it's own way of dealing with multiple horizons. Finally if `feature_selection` is turned on, it will be ran for each horizon specific model, meaning for a 3 month forecast horizon the feature selection process will be ran 3 times. One for each combination of features tied to a specific horizon.
diff --git a/vignettes/parallel-processing.Rmd b/vignettes/parallel-processing.Rmd
index b9ecc616..15f9a9f6 100644
--- a/vignettes/parallel-processing.Rmd
+++ b/vignettes/parallel-processing.Rmd
@@ -24,7 +24,7 @@ If `parallel_processing` is set to NULL and `inner_parallel` is set to TRUE with

 To leverage the full power of Finn, running within Azure is the best choice in building production ready forecasts that can easily scale. The most efficient way to run Finn is to set `parallel_processing` to "spark" within `forecast_time_series()`. This will run each time series in parallel across a spark compute cluster.

-[Sparklyr](https://spark.rstudio.com/) is a great R package that allows you to run R code across a spark cluster. A user simply has to connect to a spark cluster then run Finn. Below is an example on how you can run Finn using [spark on Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/sparklyr). Also check out the growing R support with using [spark on Azure Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-r-language).
+[Sparklyr](https://spark.posit.co/) is a great R package that allows you to run R code across a spark cluster. A user simply has to connect to a spark cluster then run Finn. Below is an example on how you can run Finn using [spark on Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/sparklyr). Also check out the growing R support with using [spark on Azure Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-r-language).

 ```{r, message = FALSE, eval = FALSE}
@@ -72,6 +72,6 @@ finn_output_tbl <- get_forecast_data(
 )
 ```

-The above example runs each time series on a separate core on a spark cluster. You can also submit multiple time series where each time series runs on a separate spark executor (VM) and then leverage all of the cores on that executor to run things like hyperparameter tuning or model refitting in parallel. This creates two levels of parallelization. One at the time series level, then another when doing things like hyperparameter tuning within a specific time series. To do that set `inner_parallel` to TRUE in `forecast_time_series()`. Also make sure that you adjust the number of spark executor cores to 1, that ensures that only 1 time series runs on an executor at a time. Leverage the "spark.executor.cores" argument when configuring your spark connection. This can be done using [sparklyr](https://spark.rstudio.com/guides/connections#:~:text=In%20sparklyr%2C%20Spark%20properties%20can%20be%20set%20by,customized%20as%20shown%20in%20the%20example%20code%20below.) or within the cluster manager itself within the Azure resource. Use the "num_cores" argument in the "forecast_time_series" function to control how many cores should be used within an executor when running things like hyperparameter tuning.
+The above example runs each time series on a separate core on a spark cluster. You can also submit multiple time series where each time series runs on a separate spark executor (VM) and then leverage all of the cores on that executor to run things like hyperparameter tuning or model refitting in parallel. This creates two levels of parallelization. One at the time series level, then another when doing things like hyperparameter tuning within a specific time series. To do that set `inner_parallel` to TRUE in `forecast_time_series()`. Also make sure that you adjust the number of spark executor cores to 1, that ensures that only 1 time series runs on an executor at a time. Leverage the "spark.executor.cores" argument when configuring your spark connection. This can be done using [sparklyr](https://spark.posit.co/guides/connections#:~:text=In%20sparklyr%2C%20Spark%20properties%20can%20be%20set%20by,customized%20as%20shown%20in%20the%20example%20code%20below.) or within the cluster manager itself within the Azure resource. Use the "num_cores" argument in the "forecast_time_series" function to control how many cores should be used within an executor when running things like hyperparameter tuning.

 `forecast_time_series()` will be looking for a variable called "sc" to use when submitting tasks to the spark cluster, so make sure you use that as the variable name when connecting to spark. Also it's important that you mount your spark session to an Azure Data Lake Storage (ADLS) account, and provide the mounted path to where you'd like your Finn results to be written to within `set_run_info()`.