
API comparison: sktime vs HCrystalball, comments and suggestions

Pavel Krizek edited this page Aug 21, 2020 · 4 revisions

A comparison of the sktime and HCrystalball API designs for forecasting, with a way forward proposed by sktime. The page has been updated with HCrystalball comments (see the text in bold and the HCrystalball comments column in the tables).

Design comparison

Both sktime and HCrystalball adopt a sklearn-like fit/predict design, and a unified interface.

High-level differences

The table below summarizes the main differences:

| Area | sktime | HCrystalball | HCrystalball comments |
|---|---|---|---|
| data container | pandas series | pandas DataFrame | pandas DataFrame |
| supports multivariate | no | yes | not natively on wrapper level (e.g. Prophet is not a multivariate model by construction, as opposed to e.g. VAR models) |
| supports exogenous | experimental | yes | yes |
| supports iloc use | yes | no | yes. X.iloc[-5:] will return the last 5 rows even with a datetime index |
| supports loc use | no | yes | yes. X.loc["2020-05-01":] will return all rows from "2020-05-01" |
| type consistent composition | yes | no | unsure. HCrystalball aims to reuse as much of sklearn as possible and to avoid custom reimplementations of existing objects: we have no custom implementation of sklearn's GridSearchCV but use it directly. Discussing concrete points would help us understand this issue |
| task interoperability | yes | no | no. HCrystalball aims to support only time series forecasting. The limited scope is a design decision. |

For explanation:

  • type consistent composition means: composites inherit from, and follow the same interface as a class type ancestor. For example, GridSearchCV in sklearn behaves as a classifier, when constructed with a classifier. The compositor itself is an estimator class.
  • task interoperability means: the interface is designed to allow reduction to other time series related tasks
  • iloc and loc usage implies support for integer and date/time indices, and specification of the forecasting horizon as relative steps ahead and absolute time points, respectively - HCrystalball's implementation allows you to leverage both indexing schemes, integers and datetimes
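The difference between the two indexing schemes can be seen in plain pandas. The following is a minimal sketch on made-up data (the column name `value` and the dates are illustrative assumptions), mirroring the two expressions quoted in the table above:

```python
import pandas as pd

# Ten daily observations with a datetime index (illustrative data only).
index = pd.date_range("2020-04-26", periods=10, freq="D")
X = pd.DataFrame({"value": range(10)}, index=index)

# iloc: positional selection, independent of the index type.
last_five = X.iloc[-5:]

# loc: label-based selection; a date string slices a DatetimeIndex.
from_may = X.loc["2020-05-01":]

# On this data both expressions select the rows 2020-05-01 .. 2020-05-05.
same_rows = last_five.equals(from_may)
```

A relative horizon ("the next 5 steps") corresponds to the iloc view, while an absolute horizon ("from 2020-05-01 onwards") corresponds to the loc view.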

At a high level, HCrystalball's interface seems inspired by Facebook's Prophet. sktime's interface is closer to statsmodels and the Hyndman interfaces in R (e.g. forecast, fable).

Advantages and disadvantages

This section highlights advantages, disadvantages, and problems, in our opinion.

Advantages of sktime:

  • "natural" interface in univariate case
  • higher-order operations, including composition and reduction, are well-handled

Problems of sktime:

  • lack of loc support
  • no good multivariate support

Advantages of HCrystalball:

  • support for multivariate data and exogenous variables
  • uses abc (abstract base classes)

Problems of HCrystalball:

  • higher-order operations are not well-designed or consistent - an example would help to see the point
  • lack of iloc support - (see above)
  • interface is unintuitive in the univariate case - HCrystalball's intention is to stay as compatible with sklearn as possible, with one exception: pandas, rather than NumPy, is the main data interface. This design decision leads naturally to a two-dimensional X (pandas DataFrame) and a y (pandas series; 1D NumPy is also supported) as inputs to fit, and an X (DataFrame) as input to the predict method. In the univariate case this implies an empty DataFrame with a datetime index. In the past, HCrystalball also supported a single input for fit and an integer (horizon) for predict in the univariate case, but experience showed that the more generic interface leads to a better modeling experience (no need to change the interface after adding one column, frequent use of many exogenous variables, less error-prone and cleaner implementations, direct compatibility with the whole sklearn ecosystem...). The decision to stick with the sklearn API also reflects our intention to address primarily the ML community rather than the more traditional statistical community around statsmodels.

Problems of both:

  • neither covers both univariate and multivariate use consistently well - user frustration in at least one sub-case
  • user cannot use both series and DataFrame
  • no support for both iloc and loc (indexed, e.g., datetime) indexing

Fit/predict API signatures

Up to naming of variables, both sktime and HCrystalball adopt a fit/predict API, of the type

fit(y_past, [x_past], horizon)
predict([x_future], horizon)

where:

  • y_past is the time series in the past,
  • horizon is the set of indices (loc or iloc) to predict at - note that some methods already require this in fit
  • x_past is the exogenous time series in the past
  • x_future is the exogenous time series in the future
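The generic signature above can be illustrated with a toy model. This is a minimal sketch, not code from either package: `LastValueForecaster` is a hypothetical class, the argument names follow the pseudo-signature above, and the horizon is assumed to be given as relative steps ahead.

```python
import pandas as pd


class LastValueForecaster:
    """Hypothetical forecaster that repeats the last observed value."""

    def fit(self, y_past, x_past=None, horizon=None):
        # x_past and horizon are accepted only to mirror the generic
        # signature; this naive model does not use them.
        self.last_value_ = y_past.iloc[-1]
        self.cutoff_ = y_past.index[-1]
        self.freq_ = y_past.index.freq
        return self

    def predict(self, x_future=None, horizon=(1,)):
        # horizon holds relative steps ahead (iloc-style), e.g. [1, 2, 3].
        steps = list(horizon)
        # Position 0 of `future` is the cutoff, position h is h steps ahead.
        future = pd.date_range(
            self.cutoff_, periods=max(steps) + 1, freq=self.freq_
        )
        return pd.Series(self.last_value_, index=future[steps])


y_past = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
forecast = LastValueForecaster().fit(y_past).predict(horizon=[1, 2, 3])
```

The sketch relies on the index of `y_past` carrying a frequency, which holds for an index built with `pd.date_range`.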

The differences are mainly in expected type:

| variable | sktime | HCrystalball | HCrystalball comments |
|---|---|---|---|
| y_past | pandas series | pandas DataFrame | pandas series (on wrapper level) |
| horizon in fit | integer sequence | not supported (instead, fitting is moved to predict in cases where the horizon is required for fitting) | in order to follow the sklearn API, we agreed to stick with the original fit and predict signatures (fitting in predict is also done in e.g. the KNN implementation in sklearn) |
| horizon in predict | integer sequence | empty DataFrame with loc indices | (see above) |
| x_past | pandas DataFrame (experimental) | pandas DataFrame | pandas DataFrame |
| x_future | pandas DataFrame (experimental) | pandas DataFrame | pandas DataFrame |
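The "fitting is moved to predict" pattern from the table can be sketched as follows. `DeferredFitForecaster` is a hypothetical class written for illustration only, not an HCrystalball wrapper; it assumes the sklearn-style `fit(X, y)` / `predict(X)` signatures described above, with the horizon read from the index of a column-free DataFrame.

```python
import pandas as pd


class DeferredFitForecaster:
    """Hypothetical model whose estimation depends on the horizon length."""

    def fit(self, X, y):
        # Only store the training data; the horizon is not known yet.
        self.y_ = pd.Series(y)
        return self

    def predict(self, X):
        # The horizon is read off the (possibly column-free) frame passed in.
        n_steps = len(X)
        # Horizon-dependent "fitting": average the last n_steps observations.
        level = self.y_.iloc[-n_steps:].mean()
        return pd.DataFrame({"prediction": level}, index=X.index)


history = pd.date_range("2020-01-01", periods=4, freq="D")
y = pd.Series([1.0, 2.0, 3.0, 4.0], index=history)
X_train = pd.DataFrame(index=history)

# Univariate case: an empty DataFrame whose datetime index is the horizon.
X_future = pd.DataFrame(index=pd.date_range("2020-01-05", periods=2, freq="D"))
forecast = DeferredFitForecaster().fit(X_train, y).predict(X_future)
```

Keeping `fit` free of a horizon argument preserves the sklearn signature, at the cost of doing the actual estimation on every `predict` call.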

Proposed way forward

The interface differences suggest:

  • different signature and type choices cover different use cases well (e.g., univariate vs multivariate) - a joint/merged interface may therefore be desirable.
  • the interfaces are currently incompatible; compatibility will require support for both series and DataFrames, and support for both loc and iloc indexing.
  • the sktime interface has an advantage in composition and other higher-order operations. A joint interface should perhaps adopt this.

Requirements for a unified interface

More precisely, a "good" consensus interface should satisfy the following requirements:

  • support for both series and DataFrames as inputs/outputs - we prefer having just one way to do things. sklearn expects a 2D X, so passing a pandas series would not let us leverage the whole sklearn ecosystem directly. In other established packages (pandas), offering two interfaces has usually led to ambiguities and to maintenance overhead in the long run
  • support for both loc and iloc indexing - agree (see above)
  • support for exogenous variables - agree
  • horizon can be passed in fit - in HCrystalball the horizon is controlled by the X passed to the predict method, and an additional horizon parameter introduces ambiguity
  • consistent typing in higher-order motifs including composition, wrappers, reduction (inherits from resultant type class, components passed in constructor) - open to hearing more and improving
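As a concrete illustration of the last requirement, the sketch below shows a composite that is built from a forecaster and is itself a forecaster. All class names (`BaseForecaster`, `MeanForecaster`, `TunedForecaster`) are hypothetical and use plain Python lists; they are not taken from sktime or HCrystalball.

```python
from statistics import mean


class BaseForecaster:
    """Hypothetical common type for all forecasters."""

    def fit(self, y):
        raise NotImplementedError

    def predict(self, steps):
        raise NotImplementedError


class MeanForecaster(BaseForecaster):
    """Forecasts the mean of the last `window` observations."""

    def __init__(self, window=2):
        self.window = window

    def fit(self, y):
        self.level_ = mean(y[-self.window:])
        return self

    def predict(self, steps):
        return [self.level_] * steps


class TunedForecaster(BaseForecaster):
    """Composite: takes a forecaster class in its constructor and
    inherits from the same base type as its component."""

    def __init__(self, forecaster_cls, windows):
        self.forecaster_cls = forecaster_cls
        self.windows = windows

    def fit(self, y):
        # Hold out the last value and keep the window with the smallest error.
        train, target = y[:-1], y[-1]

        def error(window):
            model = self.forecaster_cls(window=window).fit(train)
            return abs(model.predict(1)[0] - target)

        best = min(self.windows, key=error)
        self.best_ = self.forecaster_cls(window=best).fit(y)
        return self

    def predict(self, steps):
        return self.best_.predict(steps)


y = [1.0, 2.0, 3.0, 4.0, 5.0]
tuned = TunedForecaster(MeanForecaster, windows=[1, 2, 3]).fit(y)
forecast = tuned.predict(2)
```

Because `TunedForecaster` follows the same interface as its component, it can be passed anywhere a plain forecaster is expected, including into another composite.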

Way of working, forward

We therefore suggest:

  • sktime and HCrystalball work together towards a unified forecasting interface in the next release.
  • This unified interface should satisfy the requirements outlined above
  • HCrystalball becomes an affiliated package of sktime (meaning: compatible interface) - displayed on the landing page with other affiliated and coordinated packages
  • HCrystalball specifies a scope and roadmaps, e.g., adapters to advanced forecasters with major package dependencies?
  • individual HCrystalball team members are acknowledged as contributors to sktime, insofar as they contribute to the re-factor
  • optionally, Heidelberg Cement is acknowledged as a contributing organization to sktime post-refactor, pending approval of Heidelberg Cement comms

Proposed API re-design principles

The proposed re-design is based on two work items:

  • HCrystalball adopts sktime's higher-order composition/reduction interface (correct class inheritance structure)
  • re-factor of the fit/predict signatures towards a consensus based on type unions

The consensus could be as follows:

| variable | consensus type |
|---|---|
| y_past | pandas series or DataFrame |
| return of predict | same as type of y_past |
| horizon | integer sequence (iloc) or sequence of loc indices or empty DataFrame with loc indices |
| x_past | pandas series or DataFrame |
| x_future | pandas series or DataFrame, needs same type and variables as x_past |

There may be an additional flag for whether loc or iloc indices are used.
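One way to handle the type union for the horizon is to normalize every accepted form to absolute time points before forecasting. The helper below is a hypothetical sketch (`to_loc_index`, `cutoff`, and `freq` are names introduced here for illustration); it infers the indexing scheme from the input type instead of using an explicit flag, which is only one of the possible designs.

```python
from numbers import Integral

import pandas as pd


def to_loc_index(horizon, cutoff, freq):
    """Map any of the three proposed horizon types to absolute time points.

    Hypothetical helper; `cutoff` is the last observed time point.
    """
    if isinstance(horizon, pd.DataFrame):
        # Empty DataFrame whose index carries the loc indices.
        return pd.DatetimeIndex(horizon.index)
    horizon = list(horizon)
    if all(isinstance(h, Integral) for h in horizon):
        # Relative steps ahead (iloc): anchor them at the cutoff.
        steps = pd.date_range(cutoff, periods=max(horizon) + 1, freq=freq)
        return steps[horizon]
    # Otherwise treat the entries as absolute loc indices.
    return pd.DatetimeIndex(horizon)


cutoff = pd.Timestamp("2020-01-03")
from_steps = to_loc_index([1, 3], cutoff, "D")
from_labels = to_loc_index(["2020-01-04", "2020-01-06"], cutoff, "D")
from_frame = to_loc_index(
    pd.DataFrame(index=pd.to_datetime(["2020-01-04", "2020-01-06"])),
    cutoff,
    "D",
)
```

All three calls describe the same two time points. An explicit loc/iloc flag would remove the need for type inference and would also disambiguate integer-valued loc indices.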

The low-level design could look similar to this, though the linked proposal is mainly concerned with support for datetime.

If we reach a common agreement on the API interfaces, we can further coordinate our development roadmaps; otherwise, we would like to stay in touch and possibly introduce a common wrapper for sktime forecasters.