diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index a0fc3b6..53d59f8 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-21T05:56:33","documenter_version":"1.3.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-31T20:06:03","documenter_version":"1.3.0"}} \ No newline at end of file diff --git a/dev/api/index.html b/dev/api/index.html index a97cea7..3c2b18f 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -12,7 +12,7 @@ print_every_n=9999, verbosity=1, return_logger=false, - device="cpu")

Main training function. Performs model fitting given configuration params, dtrain, target_name and other optional kwargs.

Arguments

Keyword arguments

source
fit_evotree(
+    device="cpu")

Main training function. Performs model fitting given configuration params, dtrain, target_name and other optional kwargs.

Arguments

Keyword arguments

source
fit_evotree(
     params::EvoTypes{L};
     x_train::AbstractMatrix, 
     y_train::AbstractVector, 
@@ -24,4 +24,4 @@
     offset_eval=nothing,
     early_stopping_rounds=9999,
     print_every_n=9999,
-    verbosity=1)

Main training function. Performs model fitting given configuration params, x_train, y_train and other optional kwargs.

Arguments

Keyword arguments

source

predict

MLJModelInterface.predictFunction
predict(model::EvoTree, X::AbstractMatrix; ntree_limit = length(model.trees))

Predictions from an EvoTree model - sums the predictions from all trees composing the model. Use ntree_limit=N to only predict with the first N trees.

source

importance

EvoTrees.importanceFunction
importance(model::EvoTree; fnames=model.info[:fnames])

Sorted normalized feature importance based on loss function gain. Feature names associated to the model are stored in model.info[:fnames] as a string Vector and can be updated at any time. Eg: model.info[:fnames] = new_fnames_vec.

source
+ verbosity=1)

Main training function. Performs model fitting given configuration params, x_train, y_train and other optional kwargs.

Arguments

Keyword arguments

source

predict

MLJModelInterface.predictFunction
predict(model::EvoTree, X::AbstractMatrix; ntree_limit = length(model.trees))

Predictions from an EvoTree model - sums the predictions from all trees composing the model. Use ntree_limit=N to only predict with the first N trees.

source

importance

EvoTrees.importanceFunction
importance(model::EvoTree; fnames=model.info[:fnames])

Sorted normalized feature importance based on loss function gain. Feature names associated to the model are stored in model.info[:fnames] as a string Vector and can be updated at any time. Eg: model.info[:fnames] = new_fnames_vec.

source
diff --git a/dev/index.html b/dev/index.html index b987c9e..a83306b 100644 --- a/dev/index.html +++ b/dev/index.html @@ -40,4 +40,4 @@ "a" "b" "missing value"

Target

Target variable must have its element type <:Real. Only exception is for EvoTreeClassifier for which CategoricalValue, Integer, String and Char are supported.

Save/Load

EvoTrees.save(m, "data/model.bson")
-m = EvoTrees.load("data/model.bson");
+m = EvoTrees.load("data/model.bson"); diff --git a/dev/internals/index.html b/dev/internals/index.html index 66d653b..930a1ae 100644 --- a/dev/internals/index.html +++ b/dev/internals/index.html @@ -1,5 +1,5 @@ -Internals · EvoTrees.jl

Internal API

General

EvoTrees.EvoTreeType
EvoTree{L,K}

An EvoTree holds the structure of a fitted gradient-boosted tree.

Fields

  • trees::Vector{Tree{L,K}}
  • info::Dict

EvoTree acts as a functor to perform inference on input data:

pred = (m::EvoTree; ntree_limit=length(m.trees))(x)
source
EvoTrees.check_parameterFunction
check_parameter(::Type{<:T}, value, min_value::Real, max_value::Real, label::Symbol) where {T<:Number}

Check model parameter if it's valid

source
EvoTrees.check_argsFunction
check_args(args::Dict{Symbol,Any})

Check model arguments if they are valid

source
check_args(model::EvoTypes{L}) where {L}

Check model arguments if they are valid (eg, after mutation when tuning hyperparams) Note: does not check consistency of model type and loss selected

source

Training utils

EvoTrees.initFunction
init(
+Internals · EvoTrees.jl

Internal API

General

EvoTrees.EvoTreeType
EvoTree{L,K}

An EvoTree holds the structure of a fitted gradient-boosted tree.

Fields

  • trees::Vector{Tree{L,K}}
  • info::Dict

EvoTree acts as a functor to perform inference on input data:

pred = (m::EvoTree; ntree_limit=length(m.trees))(x)
source
EvoTrees.check_parameterFunction
check_parameter(::Type{<:T}, value, min_value::Real, max_value::Real, label::Symbol) where {T<:Number}

Check model parameter if it's valid

source
EvoTrees.check_argsFunction
check_args(args::Dict{Symbol,Any})

Check model arguments if they are valid

source
check_args(model::EvoTypes{L}) where {L}

Check model arguments if they are valid (eg, after mutation when tuning hyperparams) Note: does not check consistency of model type and loss selected

source

Training utils

EvoTrees.initFunction
init(
     params::EvoTypes,
     dtrain,
     device::Type{<:Device}=CPU;
@@ -7,7 +7,7 @@
     fnames=nothing,
     w_name=nothing,
     offset_name=nothing
-)

Initialise EvoTree

source
init(
+)

Initialise EvoTree

source
init(
     params::EvoTypes,
     x_train::AbstractMatrix,
     y_train::AbstractVector,
@@ -15,14 +15,14 @@
     fnames=nothing,
     w_train=nothing,
     offset_train=nothing
-)

Initialise EvoTree

source
EvoTrees.grow_evotree!Function
grow_evotree!(evotree::EvoTree{L,K}, cache, params::EvoTypes{L}, ::Type{<:Device}=CPU) where {L,K}

Given a instantiate

source
EvoTrees.grow_evotree!Function
grow_evotree!(evotree::EvoTree{L,K}, cache, params::EvoTypes{L}, ::Type{<:Device}=CPU) where {L,K}

Given a instantiate

source
EvoTrees.update_gains!Function
update_gains!(
     loss::L,
     node::TrainNode{T},
     js::Vector,
-    params::EvoTypes, K, monotone_constraints) where {L,T,S}

Generic fallback

source
EvoTrees.predict!Function
predict!(pred::Matrix, tree::Tree, X)

Generic fallback to add predictions of tree to existing pred matrix.

source
EvoTrees.subsampleFunction
subsample(out::AbstractVector, mask::AbstractVector, rowsample::AbstractFloat)

Returns a view of selected rows ids.

source
EvoTrees.split_set_chunk!Function
Multi-threaded split_set!
-    Take a view into left and right placeholders. Right ids are assigned at the end of the length of the current node set.
source

Histogram

EvoTrees.get_edgesFunction
get_edges(X::AbstractMatrix{T}; fnames, nbins, rng=Random.TaskLocalRNG()) where {T}
-get_edges(df; fnames, nbins, rng=Random.TaskLocalRNG())

Get the histogram breaking points of the feature data.

source
EvoTrees.binarizeFunction
binarize(X::AbstractMatrix; fnames, edges)
-binarize(df; fnames, edges)

Transform feature data into a UInt8 binarized matrix.

source
+ params::EvoTypes, K, monotone_constraints) where {L,T,S}

Generic fallback

source
EvoTrees.predict!Function
predict!(pred::Matrix, tree::Tree, X)

Generic fallback to add predictions of tree to existing pred matrix.

source
EvoTrees.subsampleFunction
subsample(out::AbstractVector, mask::AbstractVector, rowsample::AbstractFloat)

Returns a view of selected rows ids.

source
EvoTrees.split_set_chunk!Function
Multi-threaded split_set!
+    Take a view into left and right placeholders. Right ids are assigned at the end of the length of the current node set.
source

Histogram

EvoTrees.get_edgesFunction
get_edges(X::AbstractMatrix{T}; fnames, nbins, rng=Random.TaskLocalRNG()) where {T}
+get_edges(df; fnames, nbins, rng=Random.TaskLocalRNG())

Get the histogram breaking points of the feature data.

source
EvoTrees.binarizeFunction
binarize(X::AbstractMatrix; fnames, edges)
+binarize(df; fnames, edges)

Transform feature data into a UInt8 binarized matrix.

source
diff --git a/dev/models/index.html b/dev/models/index.html index 8c9003a..e2f0402 100644 --- a/dev/models/index.html +++ b/dev/models/index.html @@ -11,7 +11,7 @@ model = EvoTreeRegressor(max_depth=5, nbins=32, nrounds=100) X, y = @load_boston mach = machine(model, X, y) |> fit! -preds = predict(mach, X)source

EvoTreeClassifier

EvoTrees.EvoTreeClassifierType

EvoTreeClassifier(;kwargs...)

A model type for constructing a EvoTreeClassifier, based on EvoTrees.jl, and implementing both an internal API and the MLJ model interface. EvoTreeClassifier is used to perform multi-class classification, using cross-entropy loss.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain improvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=1.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeClassifier(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, K] where K is the number of classes:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeClassifier = @load EvoTreeClassifier pkg=EvoTrees

Do model = EvoTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeClassifier(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)
  • y: is the target, which can be any AbstractVector whose element scitype is <:Multiclas or <:OrderedFactor; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): return predictions of the target given features Xnew having the same scitype as X above. Predictions are probabilistic.

  • predict_mode(mach, Xnew): returns the mode of each of the prediction above.

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
+preds = predict(mach, X)
source

EvoTreeClassifier

EvoTrees.EvoTreeClassifierType

EvoTreeClassifier(;kwargs...)

A model type for constructing a EvoTreeClassifier, based on EvoTrees.jl, and implementing both an internal API and the MLJ model interface. EvoTreeClassifier is used to perform multi-class classification, using cross-entropy loss.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain improvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=1.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeClassifier(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, K] where K is the number of classes:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeClassifier = @load EvoTreeClassifier pkg=EvoTrees

Do model = EvoTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeClassifier(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)
  • y: is the target, which can be any AbstractVector whose element scitype is <:Multiclas or <:OrderedFactor; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): return predictions of the target given features Xnew having the same scitype as X above. Predictions are probabilistic.

  • predict_mode(mach, Xnew): returns the mode of each of the prediction above.

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
 using EvoTrees
 config = EvoTreeClassifier(max_depth=5, nbins=32, nrounds=100)
 nobs, nfeats = 1_000, 5
@@ -24,7 +24,7 @@
 X, y = @load_iris
 mach = machine(model, X, y) |> fit!
 preds = predict(mach, X)
-preds = predict_mode(mach, X)

See also EvoTrees.jl.

source

EvoTreeCount

EvoTrees.EvoTreeCountType

EvoTreeCount(;kwargs...)

A model type for constructing a EvoTreeCount, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeCount is used to perform Poisson probabilistic regression on count target.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=1.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing).
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeCount() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeCount(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Vector of length nobs:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeCount = @load EvoTreeCount pkg=EvoTrees

Do model = EvoTreeCount() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeCount(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with mach = machine(model, X, y) where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)
  • y: is the target, which can be any AbstractVector whose element scitype is <:Count; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Poisson distributions given features Xnew having the same scitype as X above. Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
+preds = predict_mode(mach, X)

See also EvoTrees.jl.

source

EvoTreeCount

EvoTrees.EvoTreeCountType

EvoTreeCount(;kwargs...)

A model type for constructing a EvoTreeCount, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeCount is used to perform Poisson probabilistic regression on count target.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=1.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing).
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeCount() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeCount(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Vector of length nobs:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeCount = @load EvoTreeCount pkg=EvoTrees

Do model = EvoTreeCount() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeCount(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with mach = machine(model, X, y) where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)
  • y: is the target, which can be any AbstractVector whose element scitype is <:Count; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Poisson distributions given features Xnew having the same scitype as X above. Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
 using EvoTrees
 config = EvoTreeCount(max_depth=5, nbins=32, nrounds=100)
 nobs, nfeats = 1_000, 5
@@ -40,7 +40,7 @@
 preds = predict_mean(mach, X)
 preds = predict_mode(mach, X)
 preds = predict_median(mach, X)
-

See also EvoTrees.jl.

source

EvoTreeMLE

EvoTrees.EvoTreeMLEType

EvoTreeMLE(;kwargs...)

A model type for constructing a EvoTreeMLE, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeMLE performs maximum likelihood estimation. Assumed distribution is specified through loss kwargs. Both Gaussian and Logistic distributions are supported.

Hyper-parameters

loss=:gaussian: Loss to be be minimized during training. One of:

  • :gaussian / :gaussian_mle
  • :logistic / :logistic_mle
  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0.

A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.

  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=8.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing). !Experimental feature: note that for MLE regression, constraints may not be enforced systematically.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeMLE() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeMLE(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, nparams] where the second dimensions refer to μ & σ for Normal/Gaussian and μ & s for Logistic.

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeMLE = @load EvoTreeMLE pkg=EvoTrees

Do model = EvoTreeMLE() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeMLE(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)

  • y: is the target, which can be any AbstractVector whose element scitype is <:Continuous; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Gaussian or Logistic distributions (according to provided loss) given features Xnew having the same scitype as X above.

Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
+

See also EvoTrees.jl.

source

EvoTreeMLE

EvoTrees.EvoTreeMLEType

EvoTreeMLE(;kwargs...)

A model type for constructing a EvoTreeMLE, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeMLE performs maximum likelihood estimation. Assumed distribution is specified through loss kwargs. Both Gaussian and Logistic distributions are supported.

Hyper-parameters

loss=:gaussian: Loss to be be minimized during training. One of:

  • :gaussian / :gaussian_mle
  • :logistic / :logistic_mle
  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0.

A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.

  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=8.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing). !Experimental feature: note that for MLE regression, constraints may not be enforced systematically.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeMLE() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeMLE(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, nparams] where the second dimensions refer to μ & σ for Normal/Gaussian and μ & s for Logistic.

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeMLE = @load EvoTreeMLE pkg=EvoTrees

Do model = EvoTreeMLE() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeMLE(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)

  • y: is the target, which can be any AbstractVector whose element scitype is <:Continuous; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Gaussian or Logistic distributions (according to provided loss) given features Xnew having the same scitype as X above.

Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
 using EvoTrees
 config = EvoTreeMLE(max_depth=5, nbins=32, nrounds=100)
 nobs, nfeats = 1_000, 5
@@ -55,7 +55,7 @@
 preds = predict(mach, X)
 preds = predict_mean(mach, X)
 preds = predict_mode(mach, X)
-preds = predict_median(mach, X)
source

EvoTreeGaussian

EvoTreeGaussian is to be deprecated. Please use EvoTreeMLE with loss = :gaussian_mle.

EvoTrees.EvoTreeGaussianType

EvoTreeGaussian(;kwargs...)

A model type for constructing a EvoTreeGaussian, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeGaussian is used to perform Gaussian probabilistic regression, fitting μ and σ parameters to maximize likelihood.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=8.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing). !Experimental feature: note that for Gaussian regression, constraints may not be enforce systematically.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeGaussian() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeGaussian(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, 2] where the second dimensions refer to μ and σ respectively:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeGaussian = @load EvoTreeGaussian pkg=EvoTrees

Do model = EvoTreeGaussian() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeGaussian(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)

  • y: is the target, which can be any AbstractVector whose element scitype is <:Continuous; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Gaussian distributions given features Xnew having the same scitype as X above.

Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
+preds = predict_median(mach, X)
source

EvoTreeGaussian

EvoTreeGaussian is to be deprecated. Please use EvoTreeMLE with loss = :gaussian_mle.

EvoTrees.EvoTreeGaussianType

EvoTreeGaussian(;kwargs...)

A model type for constructing a EvoTreeGaussian, based on EvoTrees.jl, and implementing both an internal API the MLJ model interface. EvoTreeGaussian is used to perform Gaussian probabilistic regression, fitting μ and σ parameters to maximize likelihood.

Hyper-parameters

  • nrounds=100: Number of rounds. It corresponds to the number of trees that will be sequentially stacked. Must be >= 1.
  • eta=0.1: Learning rate. Each tree raw predictions are scaled by eta prior to be added to the stack of predictions. Must be > 0. A lower eta results in slower learning, requiring a higher nrounds but typically improves model performance.
  • L2::T=0.0: L2 regularization factor on aggregate gain. Must be >= 0. Higher L2 can result in a more robust model.
  • lambda::T=0.0: L2 regularization factor on individual gain. Must be >= 0. Higher lambda can result in a more robust model.
  • gamma::T=0.0: Minimum gain imprvement needed to perform a node split. Higher gamma can result in a more robust model. Must be >= 0.
  • max_depth=6: Maximum depth of a tree. Must be >= 1. A tree of depth 1 is made of a single prediction leaf. A complete tree of depth N contains 2^(N - 1) terminal leaves and 2^(N - 1) - 1 split nodes. Compute cost is proportional to 2^max_depth. Typical optimal values are in the 3 to 9 range.
  • min_weight=8.0: Minimum weight needed in a node to perform a split. Matches the number of observations by default or the sum of weights as provided by the weights vector. Must be > 0.
  • rowsample=1.0: Proportion of rows that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • colsample=1.0: Proportion of columns / features that are sampled at each iteration to build the tree. Should be in ]0, 1].
  • nbins=64: Number of bins into which each feature is quantized. Buckets are defined based on quantiles, hence resulting in equal weight bins. Should be between 2 and 255.
  • monotone_constraints=Dict{Int, Int}(): Specify monotonic constraints using a dict where the key is the feature index and the value the applicable constraint (-1=decreasing, 0=none, 1=increasing). !Experimental feature: note that for Gaussian regression, constraints may not be enforce systematically.
  • tree_type="binary" Tree structure to be used. One of:
    • binary: Each node of a tree is grown independently. Tree are built depthwise until max depth is reach or if min weight or gain (see gamma) stops further node splits.
    • oblivious: A common splitting condition is imposed to all nodes of a given depth.
  • rng=123: Either an integer used as a seed to the random number generator or an actual random number generator (::Random.AbstractRNG).

Internal API

Do config = EvoTreeGaussian() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeGaussian(max_depth=...).

Training model

A model is built using fit_evotree:

model = fit_evotree(config; x_train, y_train, kwargs...)

Inference

Predictions are obtained using predict which returns a Matrix of size [nobs, 2] where the second dimensions refer to μ and σ respectively:

EvoTrees.predict(model, X)

Alternatively, models act as a functor, returning predictions when called as a function with features as argument:

model(X)

MLJ

From MLJ, the type can be imported using:

EvoTreeGaussian = @load EvoTreeGaussian pkg=EvoTrees

Do model = EvoTreeGaussian() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in EvoTreeGaussian(loss=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor; check column scitypes with schema(X)

  • y: is the target, which can be any AbstractVector whose element scitype is <:Continuous; check the scitype with scitype(y)

Train the machine using fit!(mach, rows=...).

Operations

  • predict(mach, Xnew): returns a vector of Gaussian distributions given features Xnew having the same scitype as X above.

Predictions are probabilistic.

Specific metrics can also be predicted using:

  • predict_mean(mach, Xnew)
  • predict_mode(mach, Xnew)
  • predict_median(mach, Xnew)

Fitted parameters

The fields of fitted_params(mach) are:

  • :fitresult: The GBTree object returned by EvoTrees.jl fitting algorithm.

Report

The fields of report(mach) are:

  • :features: The names of the features encountered in training.

Examples

# Internal API
 using EvoTrees
 params = EvoTreeGaussian(max_depth=5, nbins=32, nrounds=100)
 nobs, nfeats = 1_000, 5
@@ -70,4 +70,4 @@
 preds = predict(mach, X)
 preds = predict_mean(mach, X)
 preds = predict_mode(mach, X)
-preds = predict_median(mach, X)
source
+preds = predict_median(mach, X)source diff --git a/dev/objects.inv b/dev/objects.inv index e7a1f66..3a3dce1 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/tutorials/classification-iris/index.html b/dev/tutorials/classification-iris/index.html index a05e862..436b14d 100644 --- a/dev/tutorials/classification-iris/index.html +++ b/dev/tutorials/classification-iris/index.html @@ -38,4 +38,4 @@ 1.0 julia> mean(idx_eval .== levelcode.(y_eval)) -0.9333333333333333 +0.9333333333333333 diff --git a/dev/tutorials/examples-API/index.html b/dev/tutorials/examples-API/index.html index 1d85ba7..3a7bfb6 100644 --- a/dev/tutorials/examples-API/index.html +++ b/dev/tutorials/examples-API/index.html @@ -94,4 +94,4 @@ loss=:gaussian_mle, nrounds=100, nbins=100, lambda=0.0, gamma=0.0, eta=0.1, - max_depth=6, rowsample=0.5) + max_depth=6, rowsample=0.5) diff --git a/dev/tutorials/examples-MLJ/index.html b/dev/tutorials/examples-MLJ/index.html index 1164c51..5a07da9 100644 --- a/dev/tutorials/examples-MLJ/index.html +++ b/dev/tutorials/examples-MLJ/index.html @@ -34,4 +34,4 @@ # predict on test data pred_test = predict(mach, selectrows(X, test)) -mean(abs.(pred_test - selectrows(Y, test))) +mean(abs.(pred_test - selectrows(Y, test))) diff --git a/dev/tutorials/logistic-regression-titanic/index.html b/dev/tutorials/logistic-regression-titanic/index.html index 180d977..66bcf4d 100644 --- a/dev/tutorials/logistic-regression-titanic/index.html +++ b/dev/tutorials/logistic-regression-titanic/index.html @@ -53,4 +53,4 @@ "Pclass" => 0.11354283043193575 "SibSp" => 0.05129209383816148 "Parch" => 0.017385183317069588 - "Age_ismissing" => 0.013685310503669728 + "Age_ismissing" => 0.013685310503669728 diff --git a/dev/tutorials/ranking-LTRC/index.html b/dev/tutorials/ranking-LTRC/index.html index 70dbd3f..d4b9fca 100644 --- a/dev/tutorials/ranking-LTRC/index.html +++ b/dev/tutorials/ranking-LTRC/index.html @@ -100,4 +100,4 @@ @info "ndcg_test LogLoss" ndcg_test ┌ Info: ndcg_test LogLoss -└ ndcg_test = 0.80267

Conclusion

We've seen that a ranking problem can be efficiently handled with generic regression tasks, yet achieve comparable performance to specialized ranking loss functions. Below, we present the NDCG obtained from the above experiments along those published on CatBoost's benchmarks.

ModelNDCG
EvoTrees - mse0.80080
EvoTrees - logistic0.80267
cat-rmse0.802115
cat-query-rmse0.802229
cat-pair-logit0.797318
cat-pair-logit-pairwise0.790396
cat-yeti-rank0.802972
xgb-rmse0.798892
xgb-pairwise0.800048
xgb-lambdamart-ndcg0.800048
lgb-rmse0.8013675
lgb-pairwise0.801347

It should be noted that the later results were not reproduced in the scope of current tutorial, so one should be careful about any claim of model superiority. The results from CatBoost's benchmarks were however already indicative of strong performance of non-specialized ranking loss functions, to which this tutorial brings further support.

+└ ndcg_test = 0.80267

Conclusion

We've seen that a ranking problem can be efficiently handled with generic regression tasks, yet achieve comparable performance to specialized ranking loss functions. Below, we present the NDCG obtained from the above experiments along those published on CatBoost's benchmarks.

ModelNDCG
EvoTrees - mse0.80080
EvoTrees - logistic0.80267
cat-rmse0.802115
cat-query-rmse0.802229
cat-pair-logit0.797318
cat-pair-logit-pairwise0.790396
cat-yeti-rank0.802972
xgb-rmse0.798892
xgb-pairwise0.800048
xgb-lambdamart-ndcg0.800048
lgb-rmse0.8013675
lgb-pairwise0.801347

It should be noted that the later results were not reproduced in the scope of current tutorial, so one should be careful about any claim of model superiority. The results from CatBoost's benchmarks were however already indicative of strong performance of non-specialized ranking loss functions, to which this tutorial brings further support.

diff --git a/dev/tutorials/regression-boston/index.html b/dev/tutorials/regression-boston/index.html index d5946ea..142f95a 100644 --- a/dev/tutorials/regression-boston/index.html +++ b/dev/tutorials/regression-boston/index.html @@ -33,4 +33,4 @@ 1.056997874224627 julia> mean(abs.(pred_eval .- y_eval)) -2.3298767665825264 +2.3298767665825264