
Categorical Variables Not Working #260

Closed. ParadaCarleton opened this issue Oct 12, 2023 · 7 comments · Fixed by #262

@ParadaCarleton
It looks like ordered categorical variables error:

tuned_machine = machine(tuning, hscf[:, [:Education, :Income]], hscf[:, :Age]) |> fit!
┌ Info: Training machine(DeterministicTunedModel(model = EvoTreeRegressor{EvoTrees.MSE}
│  - nrounds: 25
│  - lambda: 0.05
│  - gamma: 0.0
│  - eta: 0.04
│  - max_depth: 6
│  - min_weight: 1.0
│  - rowsample: 1.0
│  - colsample: 1.0
│  - nbins: 64
│  - alpha: 0.5
│  - monotone_constraints: Dict{Int64, Int64}()
│  - tree_type: binary
│  - rng: MersenneTwister(123)
└ , ), ).
[ Info: Attempting to evaluate 64 models.
┌ Error: Problem fitting the machine machine(EvoTreeRegressor{EvoTrees.MSE}
│  - nrounds: 25
│  - lambda: NaN
│  - gamma: 0.0
│  - eta: 0.04
│  - max_depth: 8
│  - min_weight: 1.0
│  - rowsample: 1.0
│  - colsample: 1.0
│  - nbins: 64
│  - alpha: 0.5
│  - monotone_constraints: Dict{Int64, Int64}()
│  - tree_type: binary
│  - rng: MersenneTwister(123)
│ , ). 
└ @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:682


1-element ExceptionStack:
AssertionError: T <: Real
Stacktrace:
  [1] get_edges(X::SubArray{Any, 2, Matrix{…}, Tuple{…}, false}; fnames::Vector{Symbol}, nbins::Int64, rng::MersenneTwister)
    @ EvoTrees ~/.julia/packages/EvoTrees/W4JTd/src/fit-utils.jl:8
  [2] init_core(params::EvoTreeRegressor{…}, ::Type{…}, data::SubArray{…}, fnames::Vector{…}, y_train::SubArray{…}, w::Vector{…}, offset::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/W4JTd/src/init.jl:4
  [3] init(params::EvoTreeRegressor{…}, x_train::SubArray{…}, y_train::SubArray{…}, device::Type{…}; fnames::Nothing, w_train::Nothing, offset_train::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/W4JTd/src/init.jl:222
  [4] fit(model::EvoTreeRegressor{…}, verbosity::Int64, A::@NamedTuple{}, y::SubArray{…}, w::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/W4JTd/src/MLJ.jl:2
  [5] fit_only!(mach::Machine{…}; rows::Vector{…}, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
  [6] fit_only!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
  [7] #fit!#63
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
  [8] fit!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
  [9] fit_and_extract_on_fold
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1235 [inlined]
 [10] (::MLJBase.var"#273#274"{MLJBase.var"#fit_and_extract_on_fold#300"{}, Machine{}, Int64})(k::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1062
 [11] _mapreduce(f::MLJBase.var"#273#274"{}, op::typeof(vcat), ::IndexLinear, A::UnitRange{…})
    @ Base ./reduce.jl:440
 [12] _mapreduce_dim
    @ MLJBase ./reducedim.jl:365 [inlined]
 [13] mapreduce
    @ MLJBase ./reducedim.jl:357 [inlined]
 [14] _evaluate!(func::MLJBase.var"#fit_and_extract_on_fold#300"{}, mach::Machine{…}, ::CPU1{…}, nfolds::Int64, verbosity::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1061
 [15] evaluate!(mach::Machine{…}, resampling::Vector{…}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{…}, operations::Vector{…}, acceleration::CPU1{…}, force::Bool, per_observation_flag::Bool, logger::Nothing, user_resampling::CV)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1282
 [16] evaluate!(::Machine{…}, ::CV, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{…}, ::Vector{…}, ::CPU1{…}, ::Bool, ::Bool, ::Nothing, ::CV)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1374
 [17] fit(::Resampler{CV, Nothing}, ::Int64, ::DataFrame, ::Vector{Int8})
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:1535
 [18] fit_only!(mach::Machine{Resampler{…}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
 [19] fit_only!
    @ MLJTuning ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
 [20] #fit!#63
    @ MLJTuning ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
 [21] fit!
    @ MLJTuning ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
 [22] event!(metamodel::EvoTreeRegressor{…}, resampling_machine::Machine{…}, verbosity::Int64, tuning::LatinHypercube, history::Nothing, state::@NamedTuple{})
    @ MLJTuning ~/.julia/packages/MLJTuning/nZnsJ/src/tuned_models.jl:436
 [23] #35
    @ Base ~/.julia/packages/MLJTuning/nZnsJ/src/tuned_models.jl:474 [inlined]
 [24] iterate(g::Base.Generator, s::Vararg{Any})
    @ Base ./generator.jl:47 [inlined]
 [25] _collect(c::Vector{…}, itr::Base.Generator{…}, ::Base.EltypeUnknown, isz::Base.HasShape{…})
    @ Base ./array.jl:852
 [26] collect_similar
    @ MLJTuning ./array.jl:761 [inlined]
 [27] map
    @ MLJTuning ./abstractarray.jl:3275 [inlined]
 [28] assemble_events!(metamodels::Vector{…}, resampling_machine::Machine{…}, verbosity::Int64, tuning::LatinHypercube, history::Nothing, state::@NamedTuple{}, acceleration::CPU1{…})
    @ MLJTuning ~/.julia/packages/MLJTuning/nZnsJ/src/tuned_models.jl:473
 [29] build!(history::Nothing, n::Int64, tuning::LatinHypercube, model::EvoTreeRegressor{…}, model_buffer::Channel{…}, state::@NamedTuple{}, verbosity::Int64, acceleration::CPU1{…}, resampling_machine::Machine{…})
    @ MLJTuning ~/.julia/packages/MLJTuning/nZnsJ/src/tuned_models.jl:667
 [30] fit(::MLJTuning.DeterministicTunedModel{LatinHypercube, EvoTreeRegressor{…}}, ::Int64, ::DataFrame, ::Vector{Int8})
    @ MLJTuning ~/.julia/packages/MLJTuning/nZnsJ/src/tuned_models.jl:747
 [31] fit_only!(mach::Machine{…}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
 [32] fit_only!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
 [33] #fit!#63
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
 [34] fit!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
 [35] |>(x::Machine{MLJTuning.DeterministicTunedModel{LatinHypercube, EvoTreeRegressor{…}}, false}, f::typeof(fit!))
    @ Base ./operators.jl:917

Unordered categorical variables seem to error as well, but I'm not sure if that's because they're not supported or because of a bug.

@jeremiedb (Member)

Both ordered and unordered Categorical variables are supported when using Tables compatible inputs.
These cases are also covered by tests: https://github.com/Evovest/EvoTrees.jl/blob/main/test/tables.jl

I noticed, however, lambda: NaN in your example, which looks suspicious. I'd double-check how such a value was generated, as I'd expect it to result in an error. If this doesn't fix the situation, I'd recommend first running with EvoTrees' internal API: https://evovest.github.io/EvoTrees.jl/dev/#Tables-and-DataFrames-input to better isolate the source of the issue, as the bug might be on the MLJ binding side.
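For reference, a minimal sketch of what running the internal Tables API directly might look like; the keyword names follow the linked docs but should be treated as assumptions for this exact EvoTrees version:

```julia
# Sketch only: the target_name keyword follows the linked EvoTrees docs
# and is an assumption, not verified against this exact version.
using EvoTrees, DataFrames, CategoricalArrays

df = DataFrame(
    x1 = randn(100),
    x2 = categorical(rand(["a", "b", "c"], 100); ordered = true),
    y  = randn(100),
)

config = EvoTreeRegressor(nrounds = 25)
# Internal API: a single table plus a keyword naming the target column.
model = EvoTrees.fit_evotree(config, df; target_name = :y)
pred  = EvoTrees.predict(model, df)
```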

@ParadaCarleton (Author)

I noticed however in your example lambda: NaN which looks suspicious.

I thought so too, but I kept trying to reproduce it using a plain matrix of floats, and it worked just fine. The bug only comes up when I try to pass a DataFrame with categorical variables. It does seem to be an MLJ issue, though, since the internal API works just fine.

@ablaom (Contributor)

ablaom commented Oct 12, 2023

@ParadaCarleton Could we please have a minimum working example, and enough detail to reproduce? For starters, can we get rid of the TunedModel wrapping? And can you reproduce with a simple dummy data set?

@ParadaCarleton (Author)

ParadaCarleton commented Oct 13, 2023

@ParadaCarleton Could we please have a minimum working example, and enough detail to reproduce? For starters, can we get rid of the TunedModel wrapping? And can you reproduce with a simple dummy data set?

Here's an example with dummy data; sorry about the delay in getting an MWE, I've been trying to put one together since yesterday but I couldn't reproduce it. Eventually I figured out that it only seems to happen when I use categorical arrays with more than 2 categories, so my examples using Bool worked fine.

Here's an example:

julia> using DataFrames, CategoricalArrays, MLJ, EvoTrees

julia> x = hcat(DataFrame(randn(100, 10), :auto), DataFrame(CategoricalArray.(eachcol(rand(["1", "2", "3", "4"], 100, 5))), :auto); makeunique=true)

julia> config = EvoTreeRegressor()

julia> tuned_machine = machine(config, x[:, Not(1)], x[:, 1]) |> fit!

@jeremiedb (Member)

jeremiedb commented Oct 13, 2023

Ok, I see: currently the EvoTrees MLJ wrapper still only supports the original Matrix-based API (https://github.com/Evovest/EvoTrees.jl/#matrix-features-input), for which Categorical support doesn't exist. The test with Bool only worked because a conversion to a numeric matrix was possible.

I'm not yet clear on how to update the MLJ wrapper to support the Tables-based API. On one hand, it should be natural, since MLJ naturally works with Tables. However, the input logic differs between the EvoTrees and MLJ APIs: notably, EvoTrees expects data to be passed as a single Table with an additional kwarg to specify the target variable (and, optionally, the variables to be considered as features), whereas MLJ, AFAIK, assumes a features Table alongside a target vector.
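To illustrate the mismatch in calling conventions, a rough side-by-side sketch (the internal-API keyword name is an assumption taken from the EvoTrees docs):

```julia
using EvoTrees, MLJ, DataFrames

df = DataFrame(x1 = randn(50), x2 = randn(50), y = randn(50))
config = EvoTreeRegressor()

# EvoTrees internal API: one table, target named via kwarg (assumed signature).
model = EvoTrees.fit_evotree(config, df; target_name = :y)

# MLJ API: features table and target vector passed separately.
mach = machine(config, select(df, Not(:y)), df.y) |> fit!
```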

@ablaom (Contributor)

ablaom commented Oct 17, 2023

@ParadaCarleton Thanks for teasing out a MWE. I realize this can be some work 🙏.

According to the MLJ doc string for this model, the data provided is unacceptable: you can have inputs with OrderedFactor scitype, but not Multiclass scitype, as in the posted MWE. @jeremiedb might correct me, but my understanding is that unordered categoricals are not supported by EvoTrees.jl because determining the optimal split in a label-invariant way is more complicated in that case.

This is fixed as follows:

using MLJ # or MLJBase or ScientificTypes
x = coerce(x, Multiclass => OrderedFactor)

(Or, one can use CategoricalArrays.categorical(...; ordered=true).)

That said, the fit! still fails. So this looks to me like a bug in the EvoTrees.jl interface for the MLJ version of the model.

However, there are some different input logic between EvoTrees and MLJ API, notably that EvoTrees expects data to be passed as a single Table with additional kwarg to specify the target variable (and optionally the variables to be considered as features), whereas MLJ, AFAIK, assumes a Features Table along a target vector.

@jeremiedb This is all correct, but presumably already addressed in the existing interface. I think what is needed is some extra logic and preprocessing in the MLJModelInterface.reformat methods: if a column of the input table X has scitype OrderedFactor, then the corresponding column in the matrix output needs to be coerced into whatever MLJModelInterface.fit(model, verbosity, X, y) expects, which I expect is integers, or float integers. You can use MLJModelInterface.int plus float (?) for that purpose. (The CategoricalArrays.jl "reference type" is used.) Maybe for the GPU case you need to take care about what kind of float you get? I don't remember if the MLJ interface supports a GPU option.
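A rough sketch of the per-column coercion described above; the helper name and where it would plug into the reformat methods are hypothetical, not the actual interface code:

```julia
# Hypothetical helper illustrating the suggested preprocessing; MMI.int maps
# a categorical value to its integer reference code.
import MLJModelInterface as MMI
using CategoricalArrays

function coerce_column(col)
    if eltype(col) <: CategoricalValue
        # OrderedFactor column: reference codes, then float for the matrix.
        return float.(MMI.int.(col))
    end
    return float.(col)
end
```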

jeremiedb mentioned this issue Oct 17, 2023
@jeremiedb (Member)

@ablaom I think the referenced PR should resolve the mentioned issues (the provided MWE now successfully trains).
I haven't taken the time to look at GPU integration for the MLJ wrapper. If the MLJ interface design is similar in spirit to the one used by the EvoTrees internal API, it may be trivial to integrate.
