Merge with MemoryConstrainedTreeBoosting.jl? #91
Very impressive performance! I thought I had managed to get decent optimisation of the histogram building and that the remaining speed bottleneck was mainly coming from the handling of the sets of indices flowing through the various nodes, but it seems your approach pushes performance a step above! I'm definitely interested in looking into how the best features of each library could be brought together; I would need some time to digest it a bit first. BTW, are you using any subsampling trick on the observations, such as one based on 1st-order gradient magnitude, or are all the histograms built from 100% of the obs?
100%. But always calculating the smaller of two siblings first so the other can be derived. There are a couple of further tricks to get performance.

The first/easiest to try is loop tiling. It's not just the indices that slow you down, but the δ, δ², and 𝑤 arrays as well. Each feature-datapoint is 1 byte, but the inner loop is also consuming 4 or 8 bytes for its index, 4 bytes for δ, 4 bytes for δ², and 4 bytes for 𝑤. For 1M obs and 100 features, that's 1,000,000 × (1 + 4 + 4 + 4 + 4) × 100 ≈ 1.7 GB of memory traffic per histogram pass, which adds up to terabytes over the levels and rounds of a full training run, even though the histograms at least will always be in L1 cache.

To keep both the histogram(s) and the indices/δ/δ²/𝑤 in cache as well as possible, you can tile: have each thread process ~10 features at a time, ~2,000 indices at a time. 2,000 × (4 + 4 + 4 + 4) = 32 KB of indices/δ/δ²/𝑤, and 10 × (64 bins × 3 × 4 bytes) ≈ 8 KB of histograms. That will handily fit in an L2 cache, and it effectively reduces the RAM traffic for indices/δ/δ²/𝑤 by 10x. Currently, each thread in MCTB builds 12 histograms at a time, over 20736 or 320 indices at a time, depending on whether it's the first pass (which can be expressed as a range over all the data) or the sparse passes deeper in the tree. There are other tricks in MCTB as well.
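For concreteness, here is a minimal sketch of the tiling idea under the assumptions above (1-byte binned features, 64 bins, δ/δ²/𝑤 gradient arrays). The names (`X_binned`, `hist`, `is`, the tile sizes) are illustrative rather than either library's actual API, and threading and the smaller-sibling subtraction are omitted:

```julia
# Illustrative loop tiling for histogram building (not MCTB's or EvoTrees' actual code).
# X_binned :: Matrix{UInt8}  (nobs × nfeats, bin indices in 1:nbins)
# hist     :: Vector of nbins × 3 matrices, one per feature, accumulating (δ, δ², 𝑤)
# is       :: observation indices belonging to the current node
function build_histograms_tiled!(hist, X_binned, δ, δ², 𝑤, is;
                                 tile_feats = 10, tile_obs = 2_000)
    nfeats = size(X_binned, 2)
    for jstart in 1:tile_feats:nfeats                      # ~10 features → ~8 KB of histograms
        jstop = min(jstart + tile_feats - 1, nfeats)
        for istart in 1:tile_obs:length(is)                # ~2,000 obs → ~32 KB of is/δ/δ²/𝑤
            istop = min(istart + tile_obs - 1, length(is))
            @inbounds for j in jstart:jstop
                h = hist[j]
                for k in istart:istop
                    i = is[k]                              # observation id for this node
                    bin = X_binned[i, j]
                    h[bin, 1] += δ[i]
                    h[bin, 2] += δ²[i]
                    h[bin, 3] += 𝑤[i]
                end
            end
        end
    end
    return hist
end
```

The point is simply that each (feature tile, observation tile) pair is consumed while both its histograms and its slice of is/δ/δ²/𝑤 still fit in L2.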
Yes, perhaps sometime next week. I think mostly it will just be porting over the above performance hacks.
@brianhempel Thanks a lot for providing those explanations. I've started making some refactoring in the
Regarding the split of the observation ids into left/right children, I've attempted a non-allocating approach here, but the logic apparently has some flaws, as it breaks when using a row subsampling < 1.0 or for depth > 3 (when more than a single depth reuses the pre-allocated left/right views). See here for how they're initialized. The above had a very minor impact on speed on smaller datasets, but the size of allocations was cut in half, which appears to have greater benefits on larger datasets (30% speed improvement on 5M rows / 100 features). Does this restructuring effectively look better adapted to accommodate the tiling and other optimisations? I somehow held the belief that the compiler was able to perform those kinds of optimisations, a bit like it seems able to do on simple loops.
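As a point of comparison, here is a minimal sketch of one common non-allocating way to do that split: partition a node's observation ids in place and hand each child a view into the same buffer. Names and signature are hypothetical, not the code linked above:

```julia
# Illustrative in-place left/right partition of a node's observation ids
# (hypothetical names, not EvoTrees' actual implementation).
function split_node_ids!(ids::AbstractVector{<:Integer}, X_binned, feat::Integer, split_bin::Integer)
    left, right = 1, length(ids)
    while left <= right
        if X_binned[ids[left], feat] <= split_bin
            left += 1                                       # already on the left side
        else
            ids[left], ids[right] = ids[right], ids[left]   # move to the right side
            right -= 1
        end
    end
    # both children share storage with `ids`, so nothing new is allocated
    return view(ids, 1:right), view(ids, left:length(ids))
end
```

Since the children are views into the parent's buffer, deeper levels just recurse on those views.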
Quick thoughts:
Thank you for these hints! I've spent most of my recent efforts getting EvoTrees ready for a release based on the restructuring, to better handle the various optimisations, in particular refactoring the GPU part to handle the move to the set split. It would be part of the v0.8.0 release. On Julia 1.6.1 (which resolved the main pain point about loop slowdowns), EvoTrees is now 3X slower on your benchmark (15s vs 5s). Allocations have also been reduced significantly. EvoTrees initialization is slow, mostly because of the binning process, which takes about 2 sec out of the 15 sec, so that could be another low-hanging fruit, though less impactful for large numbers of iterations.
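For reference, that initialization step usually amounts to computing quantile-based edges once per feature and mapping each column to 1-byte bin indices; a minimal sketch under that assumption (hypothetical names, not EvoTrees' actual binning code):

```julia
# Illustrative quantile binning: map each feature column to UInt8 bin indices.
using Statistics

function bin_features(X::AbstractMatrix{<:Real}; nbins::Int = 64)
    nobs, nfeats = size(X)
    X_binned = Matrix{UInt8}(undef, nobs, nfeats)
    edges = Vector{Vector{Float64}}(undef, nfeats)
    for j in 1:nfeats
        col = @view X[:, j]
        # interior quantiles serve as bin edges, giving bins 1:nbins
        qs = quantile(col, range(0, 1; length = nbins + 1)[2:end-1])
        edges[j] = qs
        for i in 1:nobs
            X_binned[i, j] = UInt8(searchsortedfirst(qs, col[i]))
        end
    end
    return X_binned, edges
end
```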
The benchmark shows a slight speedup for some reason when adding the @simd, and with correct results, so I would leave it there for the moment. That being said, given the significant speed gap with your implementation, my understanding is that histogram building is what I should focus on next, with the tiled + SIMD approach. If it's feasible on your side, having a PoC of such an implementation that performs the equivalent of this hist building would be much helpful. Having it benchmarked could confirm whether it's really the hist building that's the remaining bottleneck; otherwise, I'd assume it's the remaining memory allocations during the loop that hurt the process.
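One way to check that in isolation is to micro-benchmark the histogram kernel alone; a sketch assuming a kernel along the lines of the tiling example earlier in this thread (sizes and names are illustrative):

```julia
# Micro-benchmark of histogram building alone, using BenchmarkTools.
# build_histograms_tiled! is the hypothetical kernel sketched earlier.
using BenchmarkTools

nobs, nfeats, nbins = 1_000_000, 100, 64
X_binned = rand(UInt8(1):UInt8(nbins), nobs, nfeats)
δ  = rand(Float32, nobs)
δ² = rand(Float32, nobs)
𝑤  = ones(Float32, nobs)
is = collect(1:nobs)
hist = [zeros(Float32, nbins, 3) for _ in 1:nfeats]

# hist keeps accumulating across samples, which is fine for timing purposes
@btime build_histograms_tiled!($hist, $X_binned, $δ, $δ², $𝑤, $is)
# Roughly scale the per-call time by the number of histogram builds during
# training and compare against total training time to see how much of it
# is hist building versus everything else (allocations, partitioning, ...).
```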
Absolutely! Thanks for pointing this out. A multi-thread version has also been implemented.
With @code_native I see it is using SIMD instructions, but not in any useful way: it's using SIMD registers to load/add one float at a time, just like it would with normal registers. I'm guessing it's dumb luck or lower register pressure that makes it faster. I think x86 kind of doesn't have enough general purpose registers.
I'm sure it is. It's the only part of the computation that's O(nobs*nfeatures). Set partitioning is O(nobs) and split selection is O(nbins*nfeatures).
On it.
Maybe we can merge projects?
EvoTrees:
MemoryConstrainedTreeBoosting:
MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple of years that allows me to control the loading and binning of data, so I can (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed because 10% faster ≈ training done 1 day sooner for my data sets. I think it is quite fast. I didn't bother documenting the library until today, however.
With our powers combined...!
A benchmark below. 4 threads on my 2013 quad-core i7-4960HQ.