
Replace GEMM backend: cublas.gemm -> cublaslt.matmul #1736

Merged: 52 commits into rapidsai:branch-24.02 on Jan 23, 2024

Conversation

@achirkin achirkin commented Aug 14, 2023

This PR replaces the current cuBLAS gemm backend of raft::linalg::gemm with cublasLt matmul. The latter is more flexible and allows decoupling the selection of the algorithm heuristics from the execution.
Thanks to this change, the PR adds memoization of the matmul heuristics and the associated arguments (matrix layouts and the matmul descriptor).
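
For readers unfamiliar with cublasLt, here is a minimal sketch of the call sequence involved (illustrative only, not RAFT's actual wrapper; error checks and descriptor destruction are omitted):

#include <cublasLt.h>

void gemm_via_cublaslt(cublasLtHandle_t lt, int m, int n, int k,
                       const float* alpha, const float* A, const float* B,
                       const float* beta, float* C, cudaStream_t stream) {
  // Describe the operation and the (column-major) matrix layouts.
  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_32F, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_32F, k, n, k);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_32F, m, n, m);

  // Heuristic selection is a separate call: this is exactly what makes it
  // possible to memoize the result and skip this step on later launches.
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulHeuristicResult_t heur;
  int n_results = 0;
  cublasLtMatmulAlgoGetHeuristic(lt, op_desc, a_desc, b_desc, c_desc, c_desc,
                                 pref, 1, &heur, &n_results);

  // Execution step: reuses whatever algorithm was selected above.
  if (n_results > 0) {
    cublasLtMatmul(lt, op_desc, alpha, A, a_desc, B, b_desc, beta, C, c_desc,
                   C, c_desc, &heur.algo, nullptr, 0, stream);
  }
}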

Performance on specific workloads

IVF-PQ performs two gemm operations on small work sizes during pre-processing. The pre-processing consists of a few kernel launches interleaved with rather heavy logic on the CPU side (which results in gaps between the kernel launches).
This PR roughly halves the gemm kernel launch latency (approx. 10us -> 5us, as measured by NVTX from entering the matmul wrapper on the host to the launch of the kernel).
As a motivating example: this PR improves the QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128 with n_queries = 1 and n_probes = 20 and 200).
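
For reference, a minimal sketch of the kind of NVTX instrumentation behind these numbers (illustrative; the actual measurement points live inside RAFT's wrappers):

#include <nvtx3/nvToolsExt.h>

void matmul_wrapper(/* gemm arguments */) {
  // The range covers host-side work up to the kernel launch; a profiler
  // such as Nsight Systems reports its duration (~10us before, ~5us after).
  nvtxRangePushA("gemm_host_launch_latency");
  // ... descriptor setup, heuristic selection, cublasLtMatmul launch ...
  nvtxRangePop();
}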

Synthetic benchmarks: no significant difference

Running all 4K+ benchmarks across RAFT shows no significant difference in CPU/GPU execution time.

  • Overall, the average execution time is reduced by ~0.5%
  • 100+ benchmarks show 5-10% time reduction
  • 9 benchmarks show 5-10% time increase (none of them use GEMM)

Only a small fraction of the RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. The lack of an overall gain is not surprising: we designed most of the benchmarks for somewhat larger work sizes, which hide the gemm latency.

@achirkin achirkin added feature request New feature or request non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels Aug 14, 2023
@achirkin achirkin self-assigned this Aug 14, 2023
@github-actions github-actions bot added the cpp label Aug 14, 2023
achirkin added a commit to achirkin/cuml that referenced this pull request Aug 14, 2023
achirkin added a commit to achirkin/cuml that referenced this pull request Aug 14, 2023
achirkin commented Aug 16, 2023

A note on caching

Since cuBLAS 12, cublasLtMatmulAlgoGetHeuristic performs heuristics caching on its own (controlled by the CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable).
Yet I think RAFT-side caching still makes sense, for two reasons:

  • It caches not only the heuristics but also the descriptors, saving a little more work (my NVTX profiling shows a small reduction in the total launch latency, ~6-7us -> ~5us).
  • The RAFT implementation restricts the space of possible layouts, which slightly increases the chance of a cache hit and limits the size of the cache.

Cache scope

In RAFT, I see three ways to implement the caching (a sketch contrasting the first two options follows this list):

  1. static thread_local variable inside the function: creates a new cache per thread and per input-type combination (float/half/etc.)
  2. static variable inside the function: compared to (1), would allow reusing the cache between threads, but would require mutexes/guards
  3. [currently selected] a new raft::resource. Same as (2), but more flexible and explicit. However, this erases the input types and thus increases the size of the cache keys (which may hurt performance).
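
A minimal sketch contrasting options (1) and (2); all names here are illustrative, not RAFT's actual implementation:

#include <cstdint>
#include <map>
#include <mutex>
#include <tuple>

using cache_key = std::tuple<int64_t, int64_t, int64_t>;  // e.g. (m, n, k)
struct matmul_plan { /* cached descriptors + selected heuristic */ };

// Option (1): one cache per thread and per input-type combination.
// No synchronization needed, but nothing is shared between threads.
template <typename T>
std::map<cache_key, matmul_plan>& thread_local_cache() {
  static thread_local std::map<cache_key, matmul_plan> cache;
  return cache;
}

// Option (2): one process-wide cache per input-type combination.
// Shared between threads, so every access must hold a mutex.
template <typename T>
matmul_plan shared_cache_lookup(const cache_key& key) {
  static std::map<cache_key, matmul_plan> cache;
  static std::mutex lock;
  std::lock_guard<std::mutex> guard(lock);
  return cache[key];  // inserts a default plan on a miss; returns a copy
}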

@achirkin achirkin marked this pull request as ready for review August 16, 2023 12:38
@achirkin achirkin requested a review from a team as a code owner August 16, 2023 12:38
@achirkin achirkin added 3 - Ready for Review and removed 2 - In Progress Currently a work in progress labels Aug 16, 2023
@achirkin achirkin requested a review from cjnolet August 21, 2023 14:01
if (!res.has_resource_factory(resource_type::USER_DEFINED)) {
  // Lazily register the factory the first time the resource is requested.
  res.add_resource_factory(std::make_shared<user_resource_factory>());
}
// Fetch the resource, then load the typed store from it (two separate locks).
return res.get_resource<user_resource>(resource_type::USER_DEFINED)->load<Store>();
@achirkin (Contributor, Author) commented:
It would be nice to reuse the resources' lock here rather than having two different locks: one for getting the resource and one for loading the store from the resource. This could be done, e.g., by passing an optional lambda to resource::get_resource to apply post-processing while holding the lock.
I haven't added this, so as not to make the PR overly invasive.
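
A hypothetical sketch of that suggestion (nothing here is in the PR; mutex_ and get_resource_unlocked are made-up names):

// Hypothetical: get_resource takes an optional callback that is invoked
// while the resource lock is still held, avoiding a second lock later.
template <typename res_t, typename fun_t>
auto get_resource(resource_type rtype, fun_t&& post_process) {
  std::lock_guard<std::mutex> guard(mutex_);      // mutex_: assumed member
  auto* r = get_resource_unlocked<res_t>(rtype);  // assumed lock-free lookup
  return post_process(r);                         // e.g. r->load<Store>()
}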

@cjnolet (Member) left a comment:

I think the user-defined resource is a great concept, and it should solve some of the extensibility problems we've had from using a fixed-length array for the other resources.

Of course, this comes with the cost of hashmap lookups, but I think the benefits will outweigh the drawbacks here.

cpp/include/raft/linalg/gemm.cuh (review thread resolved, outdated)
cpp/include/raft/linalg/gemm.cuh (review thread resolved)
cpp/include/raft/core/resource/user_resource.hpp (review thread resolved, outdated)
@achirkin achirkin requested a review from cjnolet August 22, 2023 09:04
@achirkin achirkin changed the base branch from branch-23.10 to branch-23.12 November 20, 2023 07:26
@achirkin achirkin changed the base branch from branch-23.12 to branch-24.02 December 14, 2023 10:08
@cjnolet (Member) left a comment:

Thanks for bearing with me here @achirkin. It looks like we've made a little progress on the discussions for this PR.

cpp/include/raft/linalg/detail/cublaslt_wrappers.hpp (review thread resolved, outdated)
docs/source/cpp_api/core_resources.rst (review thread resolved, outdated)
}

 private:
  // Type-erased storage: each entry is keyed by the type of the stored object.
  std::unordered_map<std::type_index, std::shared_ptr<void>> map_{};
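
For context, a minimal sketch of how such a std::type_index-keyed map can back a type-erased store (illustrative only, not the exact RAFT code):

#include <memory>
#include <typeindex>
#include <unordered_map>

class type_erased_store {
 public:
  // Returns the stored object of type Store, creating it on first use.
  template <typename Store>
  Store* load() {
    auto key = std::type_index(typeid(Store));
    auto it  = map_.find(key);
    if (it == map_.end()) {
      // shared_ptr<void> still holds the correct deleter for Store.
      it = map_.emplace(key, std::make_shared<Store>()).first;
    }
    // Safe: the entry under this key was created from the same type.
    return static_cast<Store*>(it->second.get());
  }

 private:
  std::unordered_map<std::type_index, std::shared_ptr<void>> map_{};
};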
@cjnolet (Member) commented:
From what I've seen so far, most of the "users" of the user_resource are within RAFT itself (e.g., caching in the algorithms).

Your comment here makes me struggle with the name a little, because I was absolutely thinking we would want to invite users to use a "user resource". One of the reasons we didn't use an unordered map for the other resources is that lookup has been shown to be slow, especially when the map is small.

Rather than a "user resource", this seems to be more of a "runtime shared resource cache". I'd prefer to name it appropriately based on its use, and I think that name is general enough to tell a potential user (internal or external) what it's for. Docs can describe it even better for anyone who might have further confusion about the name.

@achirkin achirkin requested a review from cjnolet January 15, 2024 10:27
@tfeher (Contributor) left a comment:

Had a quick look, and overall the PR looks good to me!

I am curious whether there is any observable effect of this PR on the following two benchmarks: GramMatrix and KMeansBalanced.


achirkin commented Jan 19, 2024

Thanks, @tfeher, for the suggestion. I ran the GramMatrix and KMeansBalanced/InnerProduct tests a few times today; there is no significant difference. Even with a somewhat larger benchmark time, the iteration-time variance is simply larger than the difference. I guess one needs rather specific, small matrix sizes to avoid hiding the kernel launch latency and see the difference.


achirkin commented Jan 23, 2024

Update: I've changed the resource and the cache implementation to not use unordered_map and reran the relevant tests with small parameter adjustments. Now I see a measurable difference (this PR vs. the current state of 24.02):

Benchmark                | Adjustments            | Iteration time difference
KNN/*/*/ivf_pq_knn/0/0/1 | n_queries = 1          | -7.0±5.3%
GramMatrix               |                        | -0.1±0.5%
KMeansBalanced           | distance: InnerProduct | -4.1±3.4%

@cjnolet (Member) left a comment:

LGTM


tfeher commented Jan 23, 2024

Thanks @achirkin for the updated numbers! LGTM!


cjnolet commented Jan 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit 558dc8f into rapidsai:branch-24.02 Jan 23, 2024
61 checks passed