Mlj table #262
Conversation
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   51.05%   51.90%    +0.84%
==========================================
  Files          22       22
  Lines        1986     1944       -42
==========================================
- Hits         1014     1009        -5
+ Misses        972      935       -37

☔ View full report in Codecov by Sentry.
Happy to review this shortly.
MMI.selectrows(::EvoTypes, I, A, y) =
    ((matrix=view(A.matrix, I, :), names=A.names), view(y, I))
MMI.selectrows(::EvoTypes, I, A) = ((matrix=view(A.matrix, I, :), names=A.names),)
My understanding is this: previously, table -> matrix conversion happened in reformat and not in EvoTree's internal training, which meant MLJ's IteratedModel wrapper avoided repeating these conversions over and over in controlled iteration. The change here suggests that the table -> matrix conversion is now "internalised", which means the performance gain is lost, right? What am I missing here?
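For reference, a data front-end pairs a one-time reformat call (table -> internal representation) with cheap selectrows methods like the ones in the diff above. A minimal sketch of how such a front-end is typically wired, assuming the (matrix, names) named-tuple representation shown in the diff (illustrative only, not necessarily what this PR does):

import MLJModelInterface as MMI
import EvoTrees: EvoTypes   # abstract supertype of EvoTrees models, as used in the diff
using Tables

# Done once per machine: convert the user-facing table to the internal
# representation, so controlled iteration need not repeat the conversion.
MMI.reformat(::EvoTypes, X, y) =
    ((matrix = Tables.matrix(X), names = Tables.columnnames(Tables.columns(X))), y)
MMI.reformat(::EvoTypes, X) =
    ((matrix = Tables.matrix(X), names = Tables.columnnames(Tables.columns(X))),)

# Cheap row subsetting on the already-converted representation
# (these are the methods shown in the diff above):
MMI.selectrows(::EvoTypes, I, A, y) =
    ((matrix = view(A.matrix, I, :), names = A.names), view(y, I))
MMI.selectrows(::EvoTypes, I, A) =
    ((matrix = view(A.matrix, I, :), names = A.names),)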
I'm afraid this brings back to the forefront the issue regarding MLJ's handling of the iteration process for models that rely on a caching / initialization mechanism.
Once an EvoTree model is initialized with a training dataset, that dataset should be considered fixed: no subsampling should be performed on it throughout the iterations. The algorithm itself is responsible for performing the subsampling operations across iterations, using tricks that are algorithm specific. AFAIK this is true for all "high performance" GBT models (incl. XGBoost, LightGBM, CatBoost).
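To make that concrete, here is a purely schematic sketch (invented names, not EvoTrees source) of what "the algo performs the subsampling itself" means: the training arrays are bound once at initialization, and each iteration draws its own row subsample internally.

using Random

struct BoosterCache
    X::Matrix{Float64}     # bound once at initialization, never resampled from outside
    y::Vector{Float64}
    rowsample::Float64     # fraction of rows drawn afresh at each iteration
end

function grow_one_tree!(trees, cache::BoosterCache)
    is = randsubseq(1:length(cache.y), cache.rowsample)   # algo-specific subsampling trick
    Xi, yi = view(cache.X, is, :), view(cache.y, is)      # this iteration's working subsample
    push!(trees, (rows = is, nobs = length(yi)))          # placeholder for a fitted tree
    return trees
end

cache = BoosterCache(randn(100, 3), randn(100), 0.5)
trees = Any[]
foreach(_ -> grow_one_tree!(trees, cache), 1:10)          # 10 iterations, no external resampling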
Looking at the error that results when trying an IteratedModel under this PR: why does MLJ's fit_only! need to perform a data resampling, fit(model, verbosity, _resampled_data(mach, model, rows)...)?
The logic used with EvoTrees so far to handle MMI's update method has been to completely ignore the data args A and y (Line 34 in 20b9e2a):

grow_evotree!(fitresult, cache, model)
I'm still unclear why an issue appeared with IteratedModel in the first place, given this update behavior in which only the model and cache are actually used when iterating the model. And the EvoTrees MLJ tests pass, so the basic operations seem to work fine, including iterative calls to fit!.
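For readers unfamiliar with that pattern, an update method following the logic described above might look roughly like this (a hedged sketch built around the linked grow_evotree! call; the report contents are hypothetical):

import MLJModelInterface as MMI
import EvoTrees: EvoTypes, grow_evotree!

# Sketch only: `update` ignores the data arguments (A, y, w) entirely and grows the
# existing ensemble from the model hyperparameters and the cache returned by `fit`,
# following the MMI convention update(model, verbosity, fitresult, cache, data...).
function MMI.update(model::EvoTypes, verbosity, fitresult, cache, A, y, w = nothing)
    grow_evotree!(fitresult, cache, model)             # the call referenced above (Line 34 in 20b9e2a)
    report = (iterations = length(fitresult.trees),)   # hypothetical report field
    return fitresult, cache, report
end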
This is still very much WIP, but I'm trying to lay down what I have experienced as shortcomings of ML toolkits, incl. Scikit-learn, which assume that a model can be instantiated as a MyModelRegressor, rather than having distinct management of a model configuration, its initialization, and the fitting of that instantiated model: https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/experiments/learnAPI.jl. This is the kind of approach we need to rely on internally to handle our ML pipelines, but I'd definitely like to see such support in an open framework.
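For readers following along, the kind of separation being argued for looks roughly like this. It is a hypothetical sketch with invented names (Config, Learner, init, fitstep!), not the contents of the linked learnAPI.jl file:

# Configuration, initialization, and iterative fitting as three distinct steps,
# rather than one MyModelRegressor object playing all roles at once.
struct Config
    nrounds::Int
    eta::Float64
end

mutable struct Learner            # created by `init`: owns the data and the caches
    config::Config
    x::Matrix{Float64}
    y::Vector{Float64}
    pred::Vector{Float64}         # running prediction, mutated in place
end

init(config::Config, x, y) = Learner(config, x, y, zeros(length(y)))

function fitstep!(l::Learner)     # one boosting-like iteration; no data passed in again
    l.pred .+= l.config.eta .* (l.y .- l.pred)    # stand-in for growing one tree
    return l
end

x, y = randn(100, 3), randn(100)
learner = init(Config(10, 0.3), x, y)
foreach(_ -> fitstep!(learner), 1:learner.config.nrounds)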
@ablaom The issue I encountered when testing IteratedModel appeared to be a red herring: the failure was caused by the log_loss measure not accepting the Deterministic output of the EvoTreeRegressor (isn't that unnecessarily restrictive? I actually don't see a logistic regression as less deterministic than "mse", but that's another topic!).
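(For anyone reproducing this: log_loss from StatisticalMeasures expects probabilistic, distribution-valued predictions, so the point predictions of a Deterministic regressor are rejected, whereas deterministic measures such as l1 / l2 work. A small hedged illustration:)

using StatisticalMeasures

y    = [0.15, 0.25, 0.40]
yhat = [0.10, 0.20, 0.50]    # point predictions, as a Deterministic model returns

l1(yhat, y)           # fine: mean absolute error under the default aggregation
# log_loss(yhat, y)   # errors: log_loss expects probability distributions, not point values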
It seems like MLJ's fit! call to reformat effectively results in a no-op as expected (each iteration following the first one directly mutates the instantiated fitted_model and the cache). Training speed is aligned with the internal fit_evotree when using a very large evaluation Step (but it quickly blows up, going from a 12 sec reference time to ~68 sec with Step(5)).
So, back to the specific topic of moving to EvoTrees's table API: AFAICT the integration with MLJ works fine, or at least as well as the existing one, but with the added benefit of richer input feature types.
Okay, I think I was mistaken to think that dumping the model-specific data pre-processing (the reformat/selectrows front-end) would hurt controlled iteration. On the other hand, I do expect some impact on tuning:

# on EvoTrees master branch:
@btime fit!($mach, verbosity=0)
# 15.437 μs (100 allocations: 4.55 KiB)

# on this PR:
@btime fit!($mach, verbosity=0)
# 17.439 μs (100 allocations: 4.55 KiB)

Benchmark details are attached below. So, I won't push back against your reluctance to update the data front-end. I'm not sure I fully understand all the objections to the current API, but I suggest we have this discussion in a different forum. (Still happy to take a call to discuss this.) API changes to MLJModelInterface, if they are breaking, will not happen in any hurry. I will proceed with the rest of my review shortly.

Benchmark details:

using MLJModels
using MLJIteration
using MLJBase
using BenchmarkTools
using StatisticalMeasures
using MLJTuning
EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()
X, y = make_regression(1000, 5)
# # CONTROLLED ITERATION
# number of iterations:
N = 300
steps = [1, 5, 30]
iterated_boosters = [
IteratedModel(
model=booster,
resampling=Holdout(fraction_train=0.8),
controls=[Step(s), NumberLimit(round(Int, N/s))],
measure=l1,
)
for s in steps
]
for ib in iterated_boosters
mach = machine(ib, X, y)
print("step=", ib.controls[1].n)
@btime fit!($mach, verbosity=0)
@assert report(mach).n_iterations == N
end
# on EvoTrees master branch:
# step=1 14.351 μs (101 allocations: 4.61 KiB)
# step=5 14.301 μs (101 allocations: 4.61 KiB)
# step=30 14.342 μs (101 allocations: 4.61 KiB)
# on this PR:
# step=1 13.701 μs (101 allocations: 4.61 KiB)
# step=5 13.726 μs (101 allocations: 4.61 KiB)
# step=30 13.776 μs (101 allocations: 4.61 KiB)
# # TUNING
r = range(booster, :lambda, lower = 0, upper = 0.5)
tuned_booster = TunedModel(
booster,
tuning=RandomSearch(),
range=r,
measure = l2,
n = 500,
)
mach = machine(tuned_booster, X, y)
# on EvoTrees master branch:
@btime fit!($mach, verbosity=0)
# 15.437 μs (100 allocations: 4.55 KiB)
# on this PR:
@btime fit!($mach, verbosity=0)
# 17.439 μs (100 allocations: 4.55 KiB)
@jeremiedb There seems to be a problem with handling tables that are row-based. Perhaps you are calling Tables.getcolumn directly on the table?

Pkg.activate("explore002", shared=true)
using MLJBase
using MLJModels
using Tables
EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()
X, y = make_regression(1000, 5)
# this works:
mach = machine(booster, X, y) |> fit!
# this doesn't
X = Tables.rowtable(X)
mach = machine(booster, X, y) |> fit!
# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
# Stacktrace:
# [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)
# @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
# [2] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64}, w::Nothing)
# @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
# [3] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64})
# @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
# [4] fit_only!(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}; rows::Nothing,
# verbosity::Int64, force::Bool, composite::Nothing)
# @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
# [5] fit_only!
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
# [6] #fit!#63
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
# [7] fit!
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
# [8] |>(x::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, f::typeof(fit!))
# @ Base ./operators.jl:907
# [9] top-level scope
# @ REPL[12]:1
Guilty :) The Tables implementation effectively assumed working with column-accessible tables. Thanks a lot for taking the time to look at the PR. And sorry about the digression about the iterated API; I'm trying to gather a coherent proposal so as to have a constructive take, and will ping back once done!
I think that row-table problem persists with predict:

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()
X, y = make_regression(1000, 5)
X = Tables.rowtable(X);
# smoke tests:
mach = machine(booster, X, y) |> fit! # this now works, thanks
predict(mach, X) # fails
# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
Thanks for the catch. I had neglected to add a test for row tables; one has now been added to the MLJ test suite (both fit! and predict). Tests are now passing following a patch that converts incoming tables to a column-accessible form.
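For reference, the usual Tables.jl idiom for making a row table (e.g. a Vector of NamedTuples) column-accessible is Tables.columntable; a minimal sketch of that conversion (not necessarily the exact patch applied here):

using Tables

X_rows = [(x1 = rand(), x2 = rand()) for _ in 1:5]    # a row table: Vector of NamedTuples
X_cols = Tables.columntable(X_rows)                   # NamedTuple of vectors: (x1 = [...], x2 = [...])

Tables.getcolumn(X_cols, :x1)    # column access now works
A = Tables.matrix(X_cols)        # and so does matrix conversion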
I have finished reviewing those parts of the PR that seemed relevant to the MLJ interface. It looks good to go.
Fix #260
@ablaom I moved the MLJ wrapper from the original Matrix API to the new Tables one, which provides native support for richer inputs, incl. Ordered and Unordered (MultiClass) CategoricalValues.
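As a usage illustration of the richer inputs (hedged: the exact supported scitypes are whatever the PR's model traits declare; this simply exercises ordered and unordered categorical columns alongside continuous ones):

using MLJBase, MLJModels, CategoricalArrays

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

# a column table mixing Continuous, Multiclass and OrderedFactor features
X = (
    x1  = rand(200),
    cat = categorical(rand(["a", "b", "c"], 200)),
    ord = categorical(rand(["low", "mid", "high"], 200), ordered = true),
)
y = rand(200)

mach = machine(booster, X, y) |> fit!
predict(mach, X)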