Replace GEMM backend: cublas.gemm -> cublaslt.matmul #1736
Conversation
Testing the changes against rapidsai/raft#1736
A note on caching
Since cublas 12,
Cache scope
In raft, I see three ways to implement caching:
if (!res.has_resource_factory(resource_type::USER_DEFINED)) {
  res.add_resource_factory(std::make_shared<user_resource_factory>());
}
return res.get_resource<user_resource>(resource_type::USER_DEFINED)->load<Store>();
It would be nice to reuse the resources' lock here rather than having two different locks: one for getting the resource and one for loading the store from the resource. This could be done, e.g., by passing an optional lambda to resource::get_resource to apply post-processing while holding the lock.
I haven't added this so as not to make the PR overly invasive.
I think the user-defined resource is a great concept, and it should solve some of the extensibility problems we have from using a fixed-length array for the other resources.
Of course, this does come with the cost of hashmap lookups, but I think the benefits will outweigh the drawbacks here.
Thanks for bearing with me here @achirkin. It looks like we've made a little progress on the discussions for this PR.
}

private:
  std::unordered_map<std::type_index, std::shared_ptr<void>> map_{};
From what I've seen so far, most of the "users" of the user_resource are within raft itself (e.g. caching in the algorithms).
Your comment here makes me struggle with the name a little, because I wasn't at all thinking we would want to invite users to use a "user resource". One of the reasons we didn't use an unordered map for the other resources is that lookup has been shown to be slow, especially when the map is small.
Rather than a "user resource", this seems to be more of a "runtime shared resource cache". I'd prefer to name it appropriately based on its use, and I think that name is generalized enough to tell a potential user (internal or external) what it's for. Docs can help describe it even better for anyone who might have further confusion about the name.
Had a quick look, and overall the PR looks good to me!
I am curious whether there is any observable effect of this PR on the following two benchmarks:
- GramMatrix linear kernels
- KMeansBalanced. Currently this test does not run GEMM, but if we change the distance type to InnerProduct, then it would use linalg::gemm calls, and that gives us a way to benchmark the new GEMM interface.
Thanks @tfeher for the suggestion. I ran the GramMatrix and KMeansBalanced/InnerProduct tests a few times today; there's no significant difference. Even if I set a somewhat larger benchmark time, the iteration-time variance is simply larger than the difference. I guess one needs rather specific, small matrix sizes to avoid hiding the kernel run latency and see the difference.
…ustom resources under assumption they are never big enough for the map
Update: I've changed the resource and the cache implementation to not use unordered_map and re-ran the relevant tests with small parameter adjustments. Now I see some measurable difference (this PR vs. the current state of 24.02):
LGTM
Thanks @achirkin for the updated numbers! LGTM!
/merge
This PR replaces the current cublas gemm backend of raft::linalg::gemm with cublasLt matmul. The latter is more flexible and allows decoupling the selection of the algorithm heuristics from its execution. Thanks to this change, this PR adds memoization of the matmul heuristics and the other arguments (matrix layouts and the matmul descriptor).
Performance on specific workloads
IVF-PQ performs two gemm operations during pre-processing on small work sizes. The pre-processing consists of a few kernel launches and rather heavy logic on the CPU side (which results in gaps between the kernel launches).
This PR roughly halves the gemm kernel launch latency (approx. 10us -> 5us, as measured by NVTX from entering the matmul wrapper on the host to the launch of the kernel). As a motivating example: this PR improves the QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128, n_queries = 1, n_probes = 20 and 200).
Synthetic benchmarks: no significant difference
Running all 4K+ benchmarks across RAFT shows no significant difference in CPU/GPU execution time.
Only a small fraction of RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. Having no gain across all benchmarks is not surprising, because we've designed most of them for somewhat larger work sizes, which hide the gemm latency.