📜 arXiv 📚 Other tabular DL projects
This package provides the officially recommended implementation of the paper "On Embeddings for Numerical Features in Tabular Deep Learning".
Note
This package is strongly recommended over prior implementations, including the original one used in the paper.
- Compared to the original implementation (the `bin` and `lib` directories), the code in this package is better across all dimensions: correctness, clarity, efficiency. Some differences between this package and the original implementation are explained in the source code of this package in the comments marked with `NOTE[DIFF]`.
- A long time ago, there was also an unfinished implementation in the old, and now deprecated, `rtdl` package. If, by any chance, you used it, please switch to this package.

(RTDL ~ Research on Tabular Deep Learning)
pip install rtdl_num_embeddings
# Scikit-learn is an optional dependency. It is needed only for
# computing the bins for the piecewise-linear transformations
# using decision trees: `compute_bins(..., tree_kwargs={...})`.
pip install "scikit-learn>=1.0,<2"
Tip
It is recommended to first read the TL;DR of the paper: link
Let's consider a toy task on tabular data where objects are represented by three continuous features (for simplicity, other feature types are omitted, but they are covered in the end-to-end example linked later in this document):
# NOTE: all code snippets can be copied and executed as-is.
import torch
import torch.nn as nn
from rtdl_num_embeddings import (
    LinearEmbeddings,
    LinearReLUEmbeddings,
    PeriodicEmbeddings,
    PiecewiseLinearEncoding,
    PiecewiseLinearEmbeddings,
    compute_bins,
)
# NOTE: pip install rtdl_revisiting_models
from rtdl_revisiting_models import MLP
batch_size = 256
n_cont_features = 3
x = torch.randn(batch_size, n_cont_features)
This is what a vanilla MLP without embeddings would look like:
mlp_config = {
    'd_out': 1,  # For example, a single regression task.
    'n_blocks': 2,
    'd_block': 256,
    'dropout': 0.1,
}
model = MLP(d_in=n_cont_features, **mlp_config)
y_pred = model(x)
And this is how an MLP with embeddings for continuous features can be created:
d_embedding = 24
embeddings = LinearReLUEmbeddings(n_cont_features, d_embedding)
model_with_embeddings = nn.Sequential(
    # Input shape: (batch_size, n_cont_features)
    embeddings,
    # After embeddings: (batch_size, n_cont_features, d_embedding)

    # NOTE: `nn.Flatten` is not needed for Transformer-like architectures.
    nn.Flatten(),
    # After flattening: (batch_size, n_cont_features * d_embedding)

    MLP(d_in=n_cont_features * d_embedding, **mlp_config),
    # The final shape: (batch_size, d_out)
)
# The usage is the same as for the model without embeddings:
y_pred = model_with_embeddings(x)
In other words, the whole paper is about the fact that having such embeddings can (significantly) improve the task performance.
The following sections cover all available types of embeddings, including:
- Simple embeddings
- Piecewise-linear embeddings
- Periodic embeddings
Tip
The choice between the embedding types is discussed later in this document.
This section covers simple embeddings based on linear layers and ReLU activation. It is possible to compose arbitrarily deep embeddings of this kind, as shown in the advanced example below.
| Module | Definition for a single feature `x_i` |
|---|---|
| `LinearEmbeddings` | `Linear(x_i)` |
| `LinearReLUEmbeddings` | `ReLU(Linear(x_i))` |
Notes

The paper focuses on non-linear embeddings, so `LinearEmbeddings` is provided only for completeness.

Hyperparameters of `LinearReLUEmbeddings`:

- The default value of `d_embedding` is set with the MLP backbone in mind. For Transformer-like models, the embedding size is usually larger and depends on the model.
- For MLP, on most tasks (at least on non-small tasks), tuning `d_embedding` will not have much effect.
- See other notes on hyperparameters in "Practical notes".
d_embedding = 32
model = nn.Sequential(
    LinearReLUEmbeddings(n_cont_features, d_embedding),
    nn.Flatten(),
    MLP(d_in=n_cont_features * d_embedding, **mlp_config),
)
y_pred = model(x)
How to compose deeper embeddings
To further illustrate the overall idea, let's compose a three-layer embedding with the goal of reducing the embedding dimension.
# NOTE: pip install delu
import delu
d_embedding = 8
embeddings = nn.Sequential(
    # First, each feature is embedded linearly.
    LinearEmbeddings(n_cont_features, 48),
    # Second, the non-linearity is applied.
    nn.ReLU(),
    # Finally, the feature embeddings are projected to a lower dimension.
    # NLinear contains `n_cont_features` linear layers, i.e. one per feature embedding
    # (in other words, the linear layers are not shared between the features).
    delu.nn.NLinear(n_cont_features, 48, d_embedding),
)
model = nn.Sequential(
    embeddings,
    nn.Flatten(),
    MLP(d_in=n_cont_features * d_embedding, **mlp_config),
)
y_pred = model(x)
(Described in Section 3.2 of the paper)
The above figure illustrates the piecewise-linear encoding. The scalar feature $x$ is encoded to a vector using four bins: $[b_0, b_1], [b_1, b_2], [b_2, b_3], [b_3, b_4]$.
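To make the picture concrete, here is a rough sketch of the encoding rule for a single scalar feature. This is only an illustration of the idea (not the package's implementation): each bin contributes a value in `[0, 1]`, while the leftmost and rightmost bins are left unclamped so that out-of-range values are still encoded linearly.

```python
# Illustrative only: a simplified piecewise-linear encoding of a scalar `x`
# given bin edges [b_0, b_1, ..., b_T] (T bins).
def piecewise_linear_encode(x: float, edges: list[float]) -> list[float]:
    n_bins = len(edges) - 1
    encoding = []
    for t in range(1, n_bins + 1):
        left, right = edges[t - 1], edges[t]
        if t > 1 and x < left:
            encoding.append(0.0)  # The bin is "empty": x lies to the left of it.
        elif t < n_bins and x >= right:
            encoding.append(1.0)  # The bin is "full": x lies to the right of it.
        else:
            encoding.append((x - left) / (right - left))  # x falls inside the bin.
    return encoding

# Four bins with edges [0, 1, 2, 3, 4] and x = 1.5:
# the first bin is full, the second is half-filled, the rest are empty.
print(piecewise_linear_encode(1.5, [0.0, 1.0, 2.0, 3.0, 4.0]))  # [1.0, 0.5, 0.0, 0.0]
```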
| Module | Definition for a single feature `x_i` |
|---|---|
| `PiecewiseLinearEncoding` | `PLE(x_i)` |
| `PiecewiseLinearEmbeddings(activation=False)` | `Linear(PLE(x_i))` |
| `PiecewiseLinearEmbeddings(activation=True)` | `ReLU(Linear(PLE(x_i)))` |

In the above table, `PLE` stands for "Piecewise-linear encoding".
Notes

There are two distinct modules: `PiecewiseLinearEncoding` and `PiecewiseLinearEmbeddings`.

`PiecewiseLinearEmbeddings` is similar to all other modules in this package and produces object representations of the shape `(n_features, d_embedding)`. The intuition behind `PiecewiseLinearEmbeddings` is that each bin receives its own trainable embedding, and the feature embedding is the aggregation of its bin embeddings, with the aggregation weights provided by the piecewise-linear encoding.

`PiecewiseLinearEncoding` is different from all other modules in this package. It represents a fixed (non-trainable) transformation, fully defined by the provided bin boundaries, as illustrated above. Its output is the concatenation of the piecewise-linear representations of all features, i.e. this representation has the shape `(d_encoding,)`, where `d_encoding` equals the total number of bins across all features. Because of that, `PiecewiseLinearEncoding` can be used only with MLP-like models, but not with Transformer-like models.

In practice, if there is enough data to train additional weights, `PiecewiseLinearEmbeddings` may be a better starting point even for MLP-like models. In particular, `PiecewiseLinearEmbeddings` makes it possible to set a large number of bins but a small embedding size.
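For illustration, a minimal sketch of the resulting shapes (reusing the imports, `n_cont_features`, and `x` from the snippets above; the bins here are just quantile-based bins of random data, and `8` is an arbitrary embedding size):

```python
bins = compute_bins(torch.randn(10000, n_cont_features))  # Quantile-based bins.

ple_encoding = PiecewiseLinearEncoding(bins)
ple_embeddings = PiecewiseLinearEmbeddings(bins, 8, activation=False, version='B')

# The flat, non-trainable encoding: (batch_size, d_encoding),
# where d_encoding is the total number of bins across all features.
print(ple_encoding(x).shape)
# The trainable per-feature embeddings: (batch_size, n_cont_features, 8).
print(ple_embeddings(x).shape)
```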
Hyperparameters
Important
For tuning hyperparameters with the TPE sampler from Optuna, as it was done in the paper, take the tuning spaces from the more recent TabM paper.
- The default number of bins in `compute_bins` is set with MLPs in mind.
- For `PiecewiseLinearEmbeddings`, by default, use `version="B"` and `activation=False`.
- The `version` argument of `PiecewiseLinearEmbeddings` defines implementation details, such as parametrization and initialization. "A" is the original version used in the paper. "B" is the version used in a different paper about the TabM model.
- Possible starting points are `d_embedding=12` with `activation=False` and `d_embedding=24` with `activation=True`. For Transformer-like models, the embedding size is usually larger and depends on the model.
- Some models outside of this project provide different recommendations for default hyperparameters.
- See other notes on hyperparameters in "Practical notes".
X_train = torch.randn(10000, n_cont_features)
Y_train = torch.randn(len(X_train))  # Regression.

# Quantile-based bins.
bins = compute_bins(X_train)

# Target-aware tree-based bins.
bins = compute_bins(
    X_train,
    # NOTE: requires scikit-learn>=1.0 to be installed.
    tree_kwargs={'min_samples_leaf': 64, 'min_impurity_decrease': 1e-4},
    y=Y_train,
    regression=True,
)

# MLP with piecewise-linear embeddings.
d_embedding = 12
model = nn.Sequential(
    PiecewiseLinearEmbeddings(bins, d_embedding, activation=False, version="B"),
    nn.Flatten(),
    MLP(d_in=n_cont_features * d_embedding, **mlp_config),
)
y_pred = model(x)

# MLP with the non-trainable piecewise-linear encoding.
total_n_bins = sum(len(b) - 1 for b in bins)
model = nn.Sequential(
    PiecewiseLinearEncoding(bins),
    MLP(d_in=total_n_bins, **mlp_config),
)
y_pred = model(x)
(Described in Section 3.3 of the paper)
| Module | Definition for a single feature `x_i` |
|---|---|
| `PeriodicEmbeddings(lite=False)` | `ReLU(Linear(CosSin(2 * pi * Linear(x_i, bias=False))))` |
| `PeriodicEmbeddings(lite=True)` | Same as above, but the outer `Linear` is shared between features |
In the above table:

- `CosSin` is a pseudo-module that applies the `cos` and `sin` functions in parallel and concatenates the results. In particular, it means that the output size is twice the input size (see the sketch after this list).
- The weights of the innermost `Linear` layer can be interpreted as frequencies, which is reflected in the argument names of `PeriodicEmbeddings`.
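As a conceptual illustration of the table above (not the package's implementation), the `lite=False` computation for a single feature looks roughly as follows; the `0.01` factor mimics a small `frequency_init_scale`:

```python
import math
import torch

x_i = torch.randn(256, 1)  # A batch of one continuous feature.
n_frequencies = 8
frequencies = torch.randn(1, n_frequencies) * 0.01  # ~ the innermost Linear(bias=False).
z = 2 * math.pi * (x_i @ frequencies)                # (256, n_frequencies)
# The "CosSin" pseudo-module: (256, 2 * n_frequencies).
cos_sin = torch.cat([torch.cos(z), torch.sin(z)], dim=-1)
# The final embedding is then ReLU(Linear(cos_sin)), with a separate Linear per feature.
```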
Notes

`PeriodicEmbeddings` implements what is also known as "PLR" embeddings.
Hyperparameters
Important
For tuning hyperparameters with the TPE sampler from Optuna, as it was done in the paper, take the tuning spaces from the more recent TabM paper.
- The `lite` option is introduced in a different paper (this one). It makes the embedding significantly more lightweight, and, on some tasks, the performance loss is not critical. `lite=True` is a reasonable starting point.
- `n_frequencies` is `out_features` of the innermost `Linear` layer.
- The default value of `d_embedding` is set with the MLP backbone in mind. For Transformer-like models, the embedding size is usually larger and depends on the model.
- The `frequency_init_scale` hyperparameter is important. While the default value is safe, on some tasks, different values may be needed to reveal the full potential of the periodic embedding.

  How to tune `frequency_init_scale`

  Prioritize testing smaller values, because they are safer:

  - A smaller-than-the-optimal value will still yield decent performance.
  - A larger-than-the-optimal value can lead to bad performance.
  - Because of the above, the default value is already small (usually, there is no need to try smaller-than-the-default values).

  Some approximate numbers:

  - For 25% of tasks, the optimal `frequency_init_scale` is less than 0.03.
  - For 50% of tasks, the optimal `frequency_init_scale` is less than 0.05.
  - For 75% of tasks, the optimal `frequency_init_scale` is less than 0.5.

  Other comments:

  - In practice, with enough budget for tuning (e.g. 50-100 trials of the TPE Optuna sampler), a reasonable distribution for `frequency_init_scale` is `LogUniform[0.01, 1.0]` (see the sketch after this list). In research projects, it can be `LogUniform[0.01, 10.0]`.
  - If the budget is low, then it is better to reduce the upper bound. Another option is to exclude `frequency_init_scale` from tuning, since the default value is safe for most tasks.

- See other notes on hyperparameters in "Practical notes".
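For example, a corresponding Optuna search-space entry could look like the sketch below (only `frequency_init_scale` is shown; the bounds follow the `LogUniform[0.01, 1.0]` recommendation above):

```python
import optuna

def suggest_frequency_init_scale(trial: optuna.Trial) -> float:
    # Log-uniform distribution over [0.01, 1.0].
    return trial.suggest_float('frequency_init_scale', 0.01, 1.0, log=True)
```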
d_embedding = 24
model = nn.Sequential(
    PeriodicEmbeddings(n_cont_features, d_embedding, lite=False),
    nn.Flatten(),
    MLP(d_in=n_cont_features * d_embedding, **mlp_config),
)
y_pred = model(x)
The example.ipynb notebook (Colab link inside) provides an example of training a model with embeddings.
- Embeddings for continuous features are applicable to most tabular DL models and often lead to better performance.
- Despite the formal overhead in terms of parameter count, in practice, embeddings are perfectly affordable, even on large datasets. That said, with a large enough number of features and/or strict enough latency requirements, the new overhead associated with embeddings may become an issue.
- MLPs with embeddings can perform on par or even better than Transformer-like models (e.g. FT-Transformer), while being significantly more efficient.
Note
This section is written with MLP-like models in mind, though some advice is applicable to Transformer models as well.
General guidelines
- If you have not used embeddings before, try `LinearReLUEmbeddings` for a quick start to get familiar with the interface.
- To choose between `PiecewiseLinearEmbeddings` and `PeriodicEmbeddings`, a possible strategy is to try both with default hyperparameters and then tune the one that works best on a given task (see the sketch after this list).
- If the provided advanced embeddings are not suitable for any reason, consider composing deep simple embeddings as shown in "Simple embeddings".
- On smaller datasets, consider trying `PiecewiseLinearEncoding` with a small number of bins (this idea is not tested, since small datasets are not covered in the paper).
- Learn the pros and cons below to make a more informed decision.
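A sketch of the "try both" strategy (reusing `bins`, `x`, `mlp_config`, and the imports from the snippets above; the embedding sizes follow the starting points mentioned earlier in this document):

```python
candidates = {
    'periodic': (PeriodicEmbeddings(n_cont_features, 24, lite=False), 24),
    'piecewise-linear': (PiecewiseLinearEmbeddings(bins, 12, activation=False, version='B'), 12),
}
for name, (embeddings, d_embedding) in candidates.items():
    model = nn.Sequential(
        embeddings,
        nn.Flatten(),
        MLP(d_in=n_cont_features * d_embedding, **mlp_config),
    )
    y_pred = model(x)
    # ... train and evaluate each candidate, then tune the one that performs better.
```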
Pros & Cons

`LinearReLUEmbeddings`

- Pros: lightweight; easy to use.
- Cons: usually, the performance is lower compared to more advanced embeddings, though still better than without embeddings.

`PiecewiseLinearEmbeddings`

- Pros: good average performance; high peak performance on some datasets; less sensitive to data preprocessing; easier to understand and reason about compared to the periodic embeddings.
- Cons: requires the additional step of computing the bins before training (this can be important when full end-to-end trainability is required); on some datasets, the optimal number of bins may be low (in such cases, the performance benefits may also be limited).

`PeriodicEmbeddings`

- Pros: good average performance; has a reliable default configuration; contrary to the piecewise-linear embeddings, does not introduce any additional computation steps.
- Cons: the behavior and hyperparameters are harder to interpret compared to the piecewise-linear embeddings; too high values of the `frequency_init_scale` hyperparameter can lead to bad performance.

`PiecewiseLinearEncoding`

- This module is a bit less explored. Perhaps it is worth attention on smaller datasets, with a somewhat lower number of bins.
- The default hyperparameters are set with MLPs in mind. For Transformer-like models, one may need to (significantly) increase `d_embedding`.
- In the paper, for hyperparameter tuning, the TPE sampler from Optuna was used with `study.optimize(..., n_trials=100)` (sometimes, `n_trials=50`).
- For embeddings ending with a non-linear activation (i.e. `activation=True` in some of the modules), the minimum reasonable embedding size is larger than for embeddings ending with linear layers. The default values and recommendations in this package follow this guideline.
- The proposed embeddings are applicable only to continuous ("numerical") features, so they should not be applied to binary or categorical features.
- It is possible to embed only a subset of features (e.g. to improve efficiency).
- It is possible to combine embeddings and apply different embeddings to different features (e.g. when some features need different embedding hyperparameters than others).
- If an embedding ends with a linear layer, and its output is passed to an MLP, then that linear layer can be fused with the first linear layer of the MLP after training (see the sketch after this list).
- (An optional advanced tip for those who read such long documents until the end) On some datasets, MLP with linear embeddings (i.e. `LinearEmbeddings`) performs better than plain MLP, which is intriguing (the paper does not explain why that happens, and this phenomenon does not happen very often). Combined with the previous point, it means that, on some datasets, one can train an MLP with linear embeddings and transform it into a plain embedding-free MLP after training, thus obtaining a better MLP without any inference overhead.
There are two ways to explore the available API and docstrings.
(1) Generate the documentation (requires Pixi to be installed):
git clone https://github.com/yandex-research/rtdl-num-embeddings
cd rtdl-num-embeddings/package
pixi run docs
(2) Open the source file and:
- On GitHub, use the Symbols panel.
- In VSCode, use the Outline view.
- Check the `__all__` variable.
Set up the environment:
pixi install
Use this command before committing:
make pre-commit
Publish the package to PyPI (requires PyPI account & configuration):
pixi run -e dev flit publish