Use PrecompileTools.jl #284
Comments
Sounds good! Have you tested the speedup already?
Some issues I'm running into:
For now I'm just trying to test what speedup we could hope for by making a separate "startup" package (as is suggested here) that loads all of
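For context, the most minimal form of such a startup package might be nothing more than a local package that loads everything up front; the names and the exact set of packages below are assumptions, and a fuller version with an explicit precompile workload appears later in the thread.

```julia
# FastAIStartup/src/FastAIStartup.jl -- hypothetical local package whose only purpose
# is to pull in the heavy dependencies so their compiled code is cached alongside it.
module FastAIStartup

using FastAI, FastVision, Flux, Metalhead

end # module FastAIStartup
```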
Hmm, this approach brings the TTF-epoch from 77s to about 65s, which is a speedup for sure, but I was kind of hoping for even more. I will have to look a bit deeper at where the time is spent. It might be all GPU stuff, in which case we'll need to wait for the above-mentioned issue to conclude. There's also the possibility that on first execution cuDNN has to run a bunch of micro-benchmarks to determine some algorithm choices. I filed a WIP PR to cache that a while ago, but haven't looked at it in a while: JuliaGPU/CUDA.jl#1948. If it turns out that the TTF-epoch is dominated by that, I'll push that a bit more.
Another update: I ran a training similar to the code above, but without any FastAI.jl/FluxTraining.jl, i.e. just Flux.jl and Metalhead.jl (see code below). Using the precompile approach from above, the timings are 27s for CPU only and 55s for GPU. In particular, 55s is only about 15% less than 65s; in other words, my 65s measurement above is dominated not by the FastAI infrastructure, but rather by GPUCompiler etc. It still might be worth it to follow through with this issue, or at least write some instructions on how to make a startup package, but further improvements must come from the Flux infrastructure itself.

```julia
using Flux, Metalhead
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

device_ = Flux.gpu  # originally `eval(device)`; use Flux.cpu for the CPU-only timing

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(Float32, 32, 32, 3) for _ in 1:100],
        [rand(labels) for _ in 1:100])
model = ResNet(18; nclasses=10) |> device_

train_loader = Flux.DataLoader(data, batchsize=10, shuffle=true, collate=true)
opt = Flux.Optimise.Descent()
ps = Flux.params(model)
loss = Flux.Losses.logitcrossentropy

for epoch in 1:2
    for (x, y) in train_loader
        yb = onehotbatch(y, labels) |> device_
        model(x |> device_)  # forward pass outside the gradient
        grads = gradient(ps) do
            loss(model(x |> device_), yb)
        end
        Flux.Optimise.update!(opt, ps, grads)
    end
end
```
I suspected as much. You'll want to drill further down into the timings to see if something like JuliaGPU/GPUCompiler.jl#65 is at play.
Thanks. When I find some time, I'll also check if JuliaGPU/CUDA.jl#1947 helps. But probably I'll move that discussion somewhere else.
Update on this: since JuliaGPU/CUDA.jl#2006 seems to be fixed, it's possible to just write your own little precompile directive, which reduces the TTF-epoch to about 12 seconds -- quite workable!

MyModule.jl:
```julia
module FastAIStartup

using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch
import PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    @compile_workload begin
        # with FastAI.jl
        data = ([rand(RGB{N0f8}, 32, 32) for _ in 1:100],
                [rand(labels) for _ in 1:100])
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data,
                              backbone=backbone(EfficientNet(:b0)),
                              callbacks=[ToGPU()])
        fitonecycle!(learner, 2)
    end
end

end # module FastAIStartup
```

benchmark.jl:
```julia
using FastAI, FastVision, Metalhead
import FastVision: RGB, N0f8
import Flux
import Flux: gradient
import Flux.OneHotArrays: onehotbatch

labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
        [rand(labels) for _ in 1:100])
blocks = (Image{2}(), FastAI.Label{String}(labels))
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data,
                      backbone=backbone(EfficientNet(:b0)),
                      callbacks=[ToGPU()])
fitonecycle!(learner, 2)
```

```julia
julia> @time include("benchmark.jl")
 11.546966 seconds (7.37 M allocations: 731.768 MiB, 4.15% gc time, 27.73% compilation time: 3% of which was recompilation)
```
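For completeness, a sketch of how such a startup package can be wired into an environment; the paths and the exact package list here are assumptions, not taken from the thread.

```julia
import Pkg
Pkg.generate("FastAIStartup")   # creates FastAIStartup/Project.toml and src/FastAIStartup.jl
Pkg.activate("FastAIStartup")
Pkg.add(["FastAI", "FastVision", "Metalhead", "Flux", "PrecompileTools"])
Pkg.activate(".")               # back to the environment used for training
Pkg.develop(path="FastAIStartup")
# After replacing src/FastAIStartup.jl with the module above, `using FastAIStartup`
# in a fresh session loads the packages together with the cached workload compilation.
```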
I still think, though, that it makes sense to move some of the precompile directives into this module.

```julia
@compile_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
            [rand(labels) for _ in 1:100])
    blocks = (Image{2}(), FastAI.Label{String}(labels))
    task = ImageClassificationSingle(blocks)
    learner = tasklearner(task, data,
                          backbone=backbone(mockmodel(task)))
    fitonecycle!(learner, 2)

    # enable this somehow only if CUDA is loaded? (one option is sketched below)
    learner_gpu = tasklearner(task, data,
                              backbone=backbone(mockmodel(task)),
                              callbacks=[ToGPU()])
    fitonecycle!(learner_gpu, 2)
end
```
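One way to handle the "only if CUDA is loaded" question could be a Julia 1.9+ package extension with CUDA as a weak dependency. The sketch below is only an assumption about how that could look (the module name, host package, and Project.toml layout are all hypothetical; the FastAI calls are reused from the block above).

```julia
# ext/FastVisionCUDAExt.jl -- hypothetical extension module; it is only loaded (and its
# workload only precompiled) when CUDA.jl is present in the environment. This assumes
# CUDA under [weakdeps] and `FastVisionCUDAExt = "CUDA"` under [extensions] in Project.toml.
# Note: precompiling a GPU workload still requires a functional GPU at precompile time.
module FastVisionCUDAExt

using FastAI, FastVision, CUDA
import FastVision: RGB, N0f8
import PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    data = ([rand(RGB{N0f8}, 64, 64) for _ in 1:100],
            [rand(labels) for _ in 1:100])
    @compile_workload begin
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner_gpu = tasklearner(task, data,
                                  backbone=backbone(mockmodel(task)),
                                  callbacks=[ToGPU()])
        fitonecycle!(learner_gpu, 2)
    end
end

end # module FastVisionCUDAExt
```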
I'm on board with adding precompile workloads, but only if we can ensure they don't use a bunch of CPU + memory at runtime (compile time is fine), don't modify any global state (e.g. the default RNG), and don't do any I/O. That last one is most important because it's caused hangs during precompilation for other packages. That may mean strategic calls to
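As a rough sketch of a workload shaped to respect those constraints: the FastAI calls are copied from the snippets above, while the in-memory constant images and the private RNG are assumptions, not a vetted implementation.

```julia
import PrecompileTools: @setup_workload, @compile_workload
import Random: Xoshiro

@setup_workload begin
    # Small, in-memory inputs: nothing is downloaded or read from disk (no I/O),
    # and anything random draws from a private RNG rather than the default RNG.
    rng = Xoshiro(0)
    labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    # Constant images are enough to exercise compilation of the training path.
    data = ([fill(RGB{N0f8}(0.5, 0.5, 0.5), 32, 32) for _ in 1:100],
            [rand(rng, labels) for _ in 1:100])
    @compile_workload begin
        blocks = (Image{2}(), FastAI.Label{String}(labels))
        task = ImageClassificationSingle(blocks)
        learner = tasklearner(task, data, backbone=backbone(mockmodel(task)))
        fitonecycle!(learner, 1)
    end
end
```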
Motivation and description
Currently the startup time for using this package is quite long. For example, running the code snippet below takes about 80s on my machine, which is 99% overhead (the two epochs themselves are practically instant). For comparison, a basic Flux model only takes about 6s after startup. Since in Julia 1.9 and 1.10 a lot of the compile time can be "cached away", I think we'd greatly benefit from integrating something like PrecompileTools.jl into the packages.

Possible Implementation
I saw there's already a workload.jl file (which basically just runs all the tests) that is used for sysimage creation. Perhaps we can do something similar for the PrecompileTools.jl directive. I can try to get a PR started in the coming days.
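As a rough sketch of that idea (the location of workload.jl and the surrounding code are assumptions), the existing workload could be wrapped in a PrecompileTools block inside the package module, though it would likely need trimming so precompilation stays light and avoids I/O:

```julia
# Hypothetical addition near the end of the package's top-level module.
import PrecompileTools: @setup_workload, @compile_workload

@setup_workload begin
    @compile_workload begin
        # Reuse (a trimmed version of) the existing sysimage workload so the same
        # calls get compiled and cached into the package image.
        include(joinpath(@__DIR__, "..", "workload.jl"))
    end
end
```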
Sample code