
Replace GEMM backend: cublas.gemm -> cublaslt.matmul #1736

Merged: 52 commits into rapidsai:branch-24.02 on Jan 23, 2024

Conversation

@achirkin achirkin commented Aug 14, 2023

This PR replaces the current cuBLAS gemm backend of raft::linalg::gemm with cublasLt matmul. The latter is more flexible and allows decoupling the selection of the algorithm heuristics from the execution.
Thanks to this change, the PR adds memoization of the matmul heuristics and the associated arguments (matrix layouts and the matmul descriptor).
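
For readers unfamiliar with cublasLt, here is a minimal sketch of the call sequence involved (illustrative only, not RAFT's actual wrapper; error checks and descriptor destruction are omitted):

#include <cublasLt.h>

void gemm_via_cublaslt(cublasLtHandle_t lt, int m, int n, int k,
                       const float* alpha, const float* A, const float* B,
                       const float* beta, float* C, cudaStream_t stream) {
  // Describe the operation and the (column-major) matrix layouts.
  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
  cublasLtMatrixLayout_t a_desc, b_desc, c_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_32F, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_32F, k, n, k);
  cublasLtMatrixLayoutCreate(&c_desc, CUDA_R_32F, m, n, m);

  // Heuristic selection is a separate call: this is exactly what makes it
  // possible to memoize the result and skip this step on later launches.
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulHeuristicResult_t heur;
  int n_results = 0;
  cublasLtMatmulAlgoGetHeuristic(lt, op_desc, a_desc, b_desc, c_desc, c_desc,
                                 pref, 1, &heur, &n_results);

  // Execution step: reuses whatever algorithm was selected above.
  if (n_results > 0) {
    cublasLtMatmul(lt, op_desc, alpha, A, a_desc, B, b_desc, beta, C, c_desc,
                   C, c_desc, &heur.algo, nullptr, 0, stream);
  }
}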

Performance on specific workloads

IVF-PQ performs two gemm operations on small work sizes during pre-processing. The pre-processing consists of a few kernel launches interleaved with rather heavy logic on the CPU side (which results in gaps between the kernel launches).
This PR roughly halves the gemm kernel launch latency (approx. 10us -> 5us, as measured by NVTX from entering the matmul wrapper on the host to the launch of the kernel).
As a motivating example: this PR improves the QPS of IVF-PQ by ~5-15% on small batches (tested on SIFT-128 with n_queries = 1 and n_probes = 20 and 200).
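
For reference, a minimal sketch of the kind of NVTX instrumentation behind these numbers (illustrative; the actual measurement points live inside RAFT's wrappers):

#include <nvtx3/nvToolsExt.h>

void matmul_wrapper(/* gemm arguments */) {
  // The range covers host-side work up to the kernel launch; a profiler
  // such as Nsight Systems reports its duration (~10us before, ~5us after).
  nvtxRangePushA("gemm_host_launch_latency");
  // ... descriptor setup, heuristic selection, cublasLtMatmul launch ...
  nvtxRangePop();
}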

Synthetic benchmarks: no significant difference

Running all 4K+ benchmarks across RAFT shows no significant difference in CPU/GPU execution time.

  • Overall, the average execution time is reduced by ~0.5%
  • 100+ benchmarks show 5-10% time reduction
  • 9 benchmarks show 5-10% time increase (none of them use GEMM)

Only a small fraction of the RAFT benchmarks actually use GEMM, so most of the stronger deviations are likely due to pure chance. The lack of an overall gain is not surprising: we designed most of the benchmarks for somewhat larger work sizes, which hide the gemm latency.

@achirkin achirkin added feature request New feature or request non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels Aug 14, 2023
@achirkin achirkin self-assigned this Aug 14, 2023
@github-actions github-actions bot added the cpp label Aug 14, 2023
achirkin added a commit to achirkin/cuml that referenced this pull request Aug 14, 2023
achirkin added a commit to achirkin/cuml that referenced this pull request Aug 14, 2023
achirkin commented Aug 16, 2023

A note on caching

Since cuBLAS 12, cublasLtMatmulAlgoGetHeuristic performs heuristics caching on its own (controlled by the CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable).
Yet I think RAFT-side caching still makes sense, for two reasons:

  • It caches not only the heuristics but also the descriptors, saving a little more work (my NVTX profiling shows a small reduction in the total launch latency, ~6-7us -> ~5us).
  • The RAFT implementation restricts the space of possible layouts, which slightly increases the chance of a cache hit and limits the size of the cache.

Cache scope

In RAFT, I see three ways to implement the caching (a sketch contrasting the first two options follows this list):

  1. static thread_local variable inside the function: creates a new cache per thread and per input-type combination (float/half/etc.)
  2. static variable inside the function: compared to (1), would allow reusing the cache between threads, but would require mutexes/guards
  3. [currently selected] a new raft::resource. Same as (2), but more flexible and explicit. However, this erases the input types and thus increases the size of the cache keys (which may hurt performance).
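
A minimal sketch contrasting options (1) and (2); all names here are illustrative, not RAFT's actual implementation:

#include <cstdint>
#include <map>
#include <mutex>
#include <tuple>

using cache_key = std::tuple<int64_t, int64_t, int64_t>;  // e.g. (m, n, k)
struct matmul_plan { /* cached descriptors + selected heuristic */ };

// Option (1): one cache per thread and per input-type combination.
// No synchronization needed, but nothing is shared between threads.
template <typename T>
std::map<cache_key, matmul_plan>& thread_local_cache() {
  static thread_local std::map<cache_key, matmul_plan> cache;
  return cache;
}

// Option (2): one process-wide cache per input-type combination.
// Shared between threads, so every access must hold a mutex.
template <typename T>
matmul_plan shared_cache_lookup(const cache_key& key) {
  static std::map<cache_key, matmul_plan> cache;
  static std::mutex lock;
  std::lock_guard<std::mutex> guard(lock);
  return cache[key];  // inserts a default plan on a miss; returns a copy
}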

@achirkin achirkin marked this pull request as ready for review August 16, 2023 12:38
@achirkin achirkin requested a review from a team as a code owner August 16, 2023 12:38
@achirkin achirkin added 3 - Ready for Review and removed 2 - In Progress Currently a work in progress labels Aug 16, 2023
@achirkin achirkin requested a review from cjnolet August 21, 2023 14:01
if (!res.has_resource_factory(resource_type::USER_DEFINED)) {
  // Lazily register the factory the first time the resource is requested.
  res.add_resource_factory(std::make_shared<user_resource_factory>());
}
// Fetch the resource, then load the typed store from it (two separate locks).
return res.get_resource<user_resource>(resource_type::USER_DEFINED)->load<Store>();
@achirkin (Contributor, Author) commented:
It would be nice to reuse the resources' lock here rather than having two different locks: one for getting the resource and one for loading the store from the resource. This could be done, e.g., by passing an optional lambda to resource::get_resource to apply post-processing while holding the lock.
I haven't added this, so as not to make the PR overly invasive.
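
A hypothetical sketch of that suggestion (nothing here is in the PR; mutex_ and get_resource_unlocked are made-up names):

// Hypothetical: get_resource takes an optional callback that is invoked
// while the resource lock is still held, avoiding a second lock later.
template <typename res_t, typename fun_t>
auto get_resource(resource_type rtype, fun_t&& post_process) {
  std::lock_guard<std::mutex> guard(mutex_);      // mutex_: assumed member
  auto* r = get_resource_unlocked<res_t>(rtype);  // assumed lock-free lookup
  return post_process(r);                         // e.g. r->load<Store>()
}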

@cjnolet (Member) left a comment:

I think the user-defined resource is a great concept, and it should solve some of the extensibility problems we've had from using a fixed-length array for the other resources.

Of course, this comes with the cost of hashmap lookups, but I think the benefits will outweigh the drawbacks here.

cpp/include/raft/linalg/gemm.cuh (review thread resolved, outdated)
cpp/include/raft/linalg/gemm.cuh (review thread resolved)
cpp/include/raft/core/resource/user_resource.hpp (review thread resolved, outdated)
@achirkin achirkin requested a review from cjnolet August 22, 2023 09:04
@achirkin achirkin changed the base branch from branch-23.10 to branch-23.12 November 20, 2023 07:26
@achirkin achirkin changed the base branch from branch-23.12 to branch-24.02 December 14, 2023 10:08
@cjnolet (Member) left a comment:

Thanks for bearing with me here @achirkin. It looks like we've made a little progress on the discussions for this PR.

cpp/include/raft/linalg/detail/cublaslt_wrappers.hpp (review thread resolved, outdated)
docs/source/cpp_api/core_resources.rst (review thread resolved, outdated)
}

 private:
  // Type-erased storage: each entry is keyed by the type of the stored object.
  std::unordered_map<std::type_index, std::shared_ptr<void>> map_{};
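
For context, a minimal sketch of how such a std::type_index-keyed map can back a type-erased store (illustrative only, not the exact RAFT code):

#include <memory>
#include <typeindex>
#include <unordered_map>

class type_erased_store {
 public:
  // Returns the stored object of type Store, creating it on first use.
  template <typename Store>
  Store* load() {
    auto key = std::type_index(typeid(Store));
    auto it  = map_.find(key);
    if (it == map_.end()) {
      // shared_ptr<void> still holds the correct deleter for Store.
      it = map_.emplace(key, std::make_shared<Store>()).first;
    }
    // Safe: the entry under this key was created from the same type.
    return static_cast<Store*>(it->second.get());
  }

 private:
  std::unordered_map<std::type_index, std::shared_ptr<void>> map_{};
};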
@cjnolet (Member) commented:
From what I've seen so far, most of the "users" of the user_resource are within RAFT itself (e.g., caching in the algorithms).

Your comment here makes me struggle with the name a little, because I was absolutely thinking we would want to invite users to use a "user resource". One of the reasons we didn't use an unordered map for the other resources is that lookup has been shown to be slow, especially when the map is small.

Rather than a "user resource", this seems to be more of a "runtime shared resource cache". I'd prefer to name it appropriately based on its use, and I think that name is general enough to tell a potential user (internal or external) what it's for. Docs can describe it even better for anyone who might have further confusion about the name.

@achirkin achirkin requested a review from cjnolet January 15, 2024 10:27
@tfeher (Contributor) left a comment:

Had a quick look, and overall the PR looks good to me!

I am curious whether there is any observable effect of this PR on the following two benchmarks: GramMatrix and KMeansBalanced.


achirkin commented Jan 19, 2024

Thanks, @tfeher, for the suggestion. I ran the GramMatrix and KMeansBalanced/InnerProduct tests a few times today; there is no significant difference. Even with a somewhat larger benchmark time, the iteration-time variance is simply larger than the difference. I guess one needs rather specific, small matrix sizes to avoid hiding the kernel launch latency and see the difference.


achirkin commented Jan 23, 2024

Update: I've changed the resource and the cache implementation to not use unordered_map and reran the relevant tests with small parameter adjustments. Now I see a measurable difference (this PR vs. the current state of 24.02):

Benchmark                | Adjustments            | Iteration time difference
KNN/*/*/ivf_pq_knn/0/0/1 | n_queries = 1          | -7.0±5.3%
GramMatrix               |                        | -0.1±0.5%
KMeansBalanced           | distance: InnerProduct | -4.1±3.4%

@cjnolet (Member) left a comment:

LGTM


tfeher commented Jan 23, 2024

Thanks @achirkin for the updated numbers! LGTM!


cjnolet commented Jan 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit 558dc8f into rapidsai:branch-24.02 Jan 23, 2024
61 checks passed