Merge pull request #169 from microsoft/mitokic/10252024/cran-submission
mitokic authored Oct 25, 2024
2 parents 757e9d1 + 05fd499 commit f5825bd
Showing 5 changed files with 8 additions and 8 deletions.
6 changes: 3 additions & 3 deletions CRAN-SUBMISSION
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
Version: 0.4.0
Date: 2023-11-30 18:00:42 UTC
SHA: 1235adc3b2d33e0656c5d2be6c511b412899df27
Version: 0.5.0
Date: 2024-10-25 17:33:48 UTC
SHA: 6c5e3e5422865a2a86f84b65472dc8d907a787cd
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: finnts
Title: Microsoft Finance Time Series Forecasting Framework
Version: 0.4.0.9008
Version: 0.5.0
Authors@R:
c(person(given = "Mike",
family = "Tokic",
2 changes: 1 addition & 1 deletion NEWS.md
@@ -1,4 +1,4 @@
# finnts 0.4.0.9008 (DEVELOPMENT VERSION)
# finnts 0.5.0

## Improvements

2 changes: 1 addition & 1 deletion vignettes/models-used-in-finnts.Rmd
@@ -84,7 +84,7 @@
By default within `prep_models()`, the `multistep_horizon` argument is set to FALSE.
- svm-rbf
- xgboost

A multistep model optimizes for each period in a forecast horizon. Let's take an example of a monthly data set with a forecast horizon of 3. When creating the features for the R1 recipe, finnts will create lags of 1, 2, 3, 6, 9, 12 months. Then when training a mulitstep model it will iteratively use specific features to train the model. First it will train a model on the first forecast horizon (H1), where it will use all available feature lags. Then for H2 it will use lags of 2 or more. Finally for H3 it will use lags of 3 or more. So the final model is actually a collection of multiple models that each trained on a specific horizon. This lets the model optimize for using all available data when creating the forecast. So in our example, one glmnet model actually has three separate horizon specific models under the hood.
A multistep model optimizes for each period in a forecast horizon. Let's take an example of a monthly data set with a forecast horizon of 3. When creating the features for the R1 recipe, finnts will create lags of 1, 2, 3, 6, 9, 12 months. Then when training a multistep model it will iteratively use specific features to train the model. First it will train a model on the first forecast horizon (H1), where it will use all available feature lags. Then for H2 it will use lags of 2 or more. Finally for H3 it will use lags of 3 or more. So the final model is actually a collection of multiple models, each trained on a specific horizon. This lets the model use all available data when creating the forecast. So in our example, one glmnet model actually has three separate horizon-specific models under the hood.

A few more things to mention. If `multistep_horizon` is TRUE, then other multivariate models like arima-boost or prophet-xregs will not run a multistep horizon approach. Instead they will use lags that are equal to or greater than the forecast horizon. One set of hyperparameters will be chosen for each multistep model, meaning glmnet will only use one combination of final hyperparameters and apply it to each horizon model. Multistep models are not run for the R2 recipe, since it has its own way of dealing with multiple horizons. Finally, if `feature_selection` is turned on, it will be run for each horizon-specific model, meaning for a 3 month forecast horizon the feature selection process will be run 3 times, once for each combination of features tied to a specific horizon.
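As an illustrative sketch of the setting described above (hedged: only `prep_models()` and the `multistep_horizon` argument are named in the text; the other argument names are assumptions about the finnts workflow, not a confirmed signature):

```{r, message = FALSE, eval = FALSE}
library(finnts)

# Sketch only: `multistep_horizon` is described above; `run_info` is an
# illustrative placeholder assumed to come from an earlier setup step.
prep_models(
  run_info = run_info,
  multistep_horizon = TRUE  # one sub-model per horizon period (H1, H2, H3)
)
```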

4 changes: 2 additions & 2 deletions vignettes/parallel-processing.Rmd
@@ -24,7 +24,7 @@
If `parallel_processing` is set to NULL and `inner_parallel` is set to TRUE with

To leverage the full power of Finn, running within Azure is the best choice in building production ready forecasts that can easily scale. The most efficient way to run Finn is to set `parallel_processing` to "spark" within `forecast_time_series()`. This will run each time series in parallel across a spark compute cluster.

[Sparklyr](https://spark.rstudio.com/) is a great R package that allows you to run R code across a spark cluster. A user simply has to connect to a spark cluster then run Finn. Below is an example on how you can run Finn using [spark on Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/sparklyr). Also check out the growing R support with using [spark on Azure Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-r-language).
[Sparklyr](https://spark.posit.co/) is a great R package that allows you to run R code across a spark cluster. A user simply has to connect to a spark cluster then run Finn. Below is an example on how you can run Finn using [spark on Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/sparklyr). Also check out the growing R support with using [spark on Azure Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-r-language).


```{r, message = FALSE, eval = FALSE}
@@ -72,6 +72,6 @@
finn_output_tbl <- get_forecast_data(
)
```

The above example runs each time series on a separate core on a spark cluster. You can also submit multiple time series where each time series runs on a separate spark executor (VM) and then leverage all of the cores on that executor to run things like hyperparameter tuning or model refitting in parallel. This creates two levels of parallelization. One at the time series level, then another when doing things like hyperparameter tuning within a specific time series. To do that set `inner_parallel` to TRUE in `forecast_time_series()`. Also make sure that you adjust the number of spark executor cores to 1, that ensures that only 1 time series runs on an executor at a time. Leverage the "spark.executor.cores" argument when configuring your spark connection. This can be done using [sparklyr](https://spark.rstudio.com/guides/connections#:~:text=In%20sparklyr%2C%20Spark%20properties%20can%20be%20set%20by,customized%20as%20shown%20in%20the%20example%20code%20below.) or within the cluster manager itself within the Azure resource. Use the "num_cores" argument in the "forecast_time_series" function to control how many cores should be used within an executor when running things like hyperparameter tuning.
The above example runs each time series on a separate core on a spark cluster. You can also submit multiple time series where each time series runs on a separate spark executor (VM) and then leverage all of the cores on that executor to run things like hyperparameter tuning or model refitting in parallel. This creates two levels of parallelization: one at the time series level, and another when doing things like hyperparameter tuning within a specific time series. To do that, set `inner_parallel` to TRUE in `forecast_time_series()`. Also make sure that you adjust the number of spark executor cores to 1, which ensures that only one time series runs on an executor at a time. Leverage the "spark.executor.cores" argument when configuring your spark connection. This can be done using [sparklyr](https://spark.posit.co/guides/connections#:~:text=In%20sparklyr%2C%20Spark%20properties%20can%20be%20set%20by,customized%20as%20shown%20in%20the%20example%20code%20below.) or within the cluster manager itself within the Azure resource. Use the "num_cores" argument in the "forecast_time_series" function to control how many cores should be used within an executor when running things like hyperparameter tuning.
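The two-level setup described above can be sketched as follows (hedged: the connection method, the `num_cores` value, and the `run_info` placeholder are assumptions; only `parallel_processing`, `inner_parallel`, `num_cores`, "spark.executor.cores", and the "sc" variable name come from the text):

```{r, message = FALSE, eval = FALSE}
library(sparklyr)
library(finnts)

# One time series per executor at a time: set executor cores to 1.
conf <- spark_config()
conf$spark.executor.cores <- 1

# forecast_time_series() looks for a connection named "sc".
sc <- spark_connect(method = "databricks", config = conf)

finn_output_tbl <- forecast_time_series(
  run_info = run_info,           # placeholder for earlier run setup
  parallel_processing = "spark", # level 1: series across executors
  inner_parallel = TRUE,         # level 2: tuning/refitting within a series
  num_cores = 8                  # assumed value: cores used inside an executor
)
```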

`forecast_time_series()` will be looking for a variable called "sc" to use when submitting tasks to the spark cluster, so make sure you use that as the variable name when connecting to spark. Also it's important that you mount your spark session to an Azure Data Lake Storage (ADLS) account, and provide the mounted path to where you'd like your Finn results to be written to within `set_run_info()`.
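A minimal sketch of that last point (the experiment and run names and the mounted path are hypothetical placeholders; only `set_run_info()` and the mounted-ADLS requirement come from the text above):

```{r, message = FALSE, eval = FALSE}
library(finnts)

# Sketch: point Finn's results at a mounted ADLS location.
run_info <- set_run_info(
  experiment_name = "finn_spark_demo", # hypothetical name
  run_name = "demo_run",               # hypothetical name
  path = "/dbfs/mnt/finn-results"      # placeholder mounted ADLS path
)
```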
