Avoiding .contiguous call before matmul #1965

During my work with Candle I have noticed that matmul seems to only accept contiguous tensors. That is, after calling .reshape, matmul will fail because the tensor is not contiguous in the right dimensions. The logical fix (#1948) seems to be to call .contiguous, and this is what the provided models do. It is likely that supporting non-contiguous tensors would give a good performance increase (#1939), as avoiding unnecessary tensor copies is always better. After reading the dfdx code, it seems like they are able to handle non-contiguous tensors.

Is there a way to implement this technique in Candle? Thanks!
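To make the workaround concrete, here is a minimal sketch (my own illustration with placeholder shapes, not code from the issue): a stride-changing view such as permute produces a non-contiguous tensor, and the current pattern is to copy it back into a contiguous buffer before the matmul.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Placeholder activations (batch, seq, hidden) and a square weight matrix.
    let x = Tensor::randn(0f32, 1f32, (2, 4, 8), &dev)?;
    let w = Tensor::randn(0f32, 1f32, (8, 8), &dev)?;

    // permute only rewrites the strides, so the result is a non-contiguous view.
    let x_t = x.permute((1, 0, 2))?;
    assert!(!x_t.is_contiguous());

    // The pattern used in the provided models: copy into a contiguous buffer
    // first so that every backend accepts the layout, then matmul.
    let y = x_t.contiguous()?.broadcast_matmul(&w)?;
    println!("{:?}", y.shape());
    Ok(())
}
```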
Comments
Matmuls are actually also supported for transposed layouts on top of contiguous ones, so we enforce contiguous or transposed-contiguous layouts, as these are supported on all devices plus mkl & accelerate and are likely to be the fastest. Supporting arbitrary strides is dodgy, e.g. on mkl, hence the limitation at the moment.
I plan on writing this for CUDA; would that work? Do you know of any good resources or places where I can find a sample implementation?
I'm not sure it would; we use cublas for the matmul (actually this function) and it doesn't let you work with arbitrary strides except for the batch dimensions: you can only specify transposed or not transposed, which is what we're doing.
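As a conceptual sketch of that constraint (my own illustration, not candle's actual code): a BLAS-style gemm takes, per operand, a base pointer, a leading dimension and a transpose flag, so a 2-D view can only be passed straight through if its strides are those of a plainly contiguous or a transposed contiguous matrix; anything else needs a .contiguous() copy first.

```rust
// Conceptual sketch only: decide whether a 2-D row-major view (shape + strides)
// can be handed to a gemm routine directly. The returned pair is
// (pass-as-transposed flag, leading dimension of the underlying buffer);
// None means the strides are arbitrary and a contiguous copy is needed.
fn as_gemm_operand(shape: (usize, usize), stride: (usize, usize)) -> Option<(bool, usize)> {
    let (rows, cols) = shape;
    let (row_stride, col_stride) = stride;
    if col_stride == 1 && row_stride == cols {
        // Plainly contiguous: rows are packed one after another.
        Some((false, row_stride))
    } else if row_stride == 1 && col_stride == rows {
        // Transposed view of a contiguous buffer: expressible via the transpose flag.
        Some((true, col_stride))
    } else {
        // E.g. a narrowed or permuted view: not expressible with just a flag.
        None
    }
}

fn main() {
    // A fresh (4, 8) tensor is contiguous.
    assert_eq!(as_gemm_operand((4, 8), (8, 1)), Some((false, 8)));
    // Its transpose, shape (8, 4) with swapped strides, is still fine.
    assert_eq!(as_gemm_operand((8, 4), (1, 8)), Some((true, 8)));
    // Keeping only the first 4 columns of the (4, 8) tensor leaves row stride 8
    // with 4 columns, so a copy is needed before the matmul.
    assert_eq!(as_gemm_operand((4, 4), (8, 1)), None);
}
```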
That sounds exciting! I am trying to create a fast LLM engine with Candle, aiming to be faster than llama.cpp, so any optimization is beneficial. I have been using my fork as a playground to test out new ideas and have implemented some fused kernels. Are there any directions I could look into for optimizing performance, specifically on CUDA?
Random thoughts on performance:
I have been following the recent changes and look forward to trying them out! I am running what is essentially the
Regarding inplace-ops, could that be implemented for CUDA by passing the src slice = dst slice (depending on the kernel impl, of course)?
Yes, we don't have a cuda test to show this, but here is the cpu one: custom_op_test. Also, besides this, I've spent quite some time using
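For what it's worth, here is a rough sketch of what such an in-place op can look like on the cpu side, loosely modeled on that test. The InplaceOp1 trait, the Tensor::inplace_op1 method, and the exact signatures are assumptions on my part and may differ between candle versions; the point is just that the storage is borrowed mutably, so reads and writes hit the same buffer, which is what a CUDA kernel would mirror by receiving the same device pointer for src and dst.

```rust
use candle_core::{CpuStorage, Device, InplaceOp1, Layout, Result, Tensor};

// Hypothetical in-place ReLU; trait and method names are assumed from the cpu
// custom-op test and may differ in your candle version.
struct ReluInplace;

impl InplaceOp1 for ReluInplace {
    fn name(&self) -> &'static str {
        "relu-inplace"
    }

    // The storage comes in as a mutable borrow: the op reads and writes the
    // same buffer (layout offsets ignored here for brevity).
    fn cpu_fwd(&self, storage: &mut CpuStorage, _layout: &Layout) -> Result<()> {
        match storage {
            CpuStorage::F32(vs) => {
                for v in vs.iter_mut() {
                    *v = v.max(0f32);
                }
                Ok(())
            }
            _ => candle_core::bail!("relu-inplace is only sketched for f32"),
        }
    }
}

fn main() -> Result<()> {
    let t = Tensor::new(&[-1f32, 0.5, -2., 3.], &Device::Cpu)?;
    t.inplace_op1(&ReluInplace)?;
    println!("{t}");
    Ok(())
}
```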
I ran a flamegraph with
Is it a normal cpu flamegraph? If so, it's not representative of where the time is actually spent if you use cuda. I've optimized the synchronization points with the gpu a bit (removing most of them in #1886) and it actually doesn't make much of a difference except for very small transformers (which we do use internally, so that was actually quite a win there); for a 7b it's negligible, at least in my setup. You certainly see them disappearing in the
Here is the kernel occupation on my RTX 2080 for the quantized example using a q4 7b model, and you can see that we spend almost all the time in the
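On the flamegraph point, a small illustration of why a host-side profile can be misleading: CUDA launches are asynchronous, so a CPU timer only sees the real cost if you synchronize explicitly. A sketch, assuming the device exposes a synchronize() helper (name and availability may vary by candle version):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn main() -> Result<()> {
    // Falls back to CPU when no GPU is present (synchronization is then a no-op).
    let dev = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (1024, 1024), &dev)?;
    let b = Tensor::randn(0f32, 1f32, (1024, 1024), &dev)?;

    let start = Instant::now();
    let c = a.matmul(&b)?;
    // Assumed helper: without a synchronization here the timer mostly measures
    // the asynchronous kernel launch, not the matmul itself.
    dev.synchronize()?;
    println!("matmul of {:?} took {:?}", c.shape(), start.elapsed());
    Ok(())
}
```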
Yes, I wonder what llama.cpp is doing that makes it so much faster. Clearly, candle is spending much of its time in matmul, so there is probably not much more to trim. I noticed that they use some sort of graph for model execution; do you know of any other techniques they employ?
I'm not sure what llama.cpp is doing differently, certainly interested if you manage to dig some of this out.
Thanks for the help!