[Feature Request] Support matmul with pre-transposed weights. #9709
I don't fully understand the specs and terminology for GGML (to me, input order and transpose vs. non-transposed all looks flipped), but this is what happens for our ops (applies for
Yes. The term pre-transposed comes from traditional HPC. In a normal MatMul the right-hand matrix is read in column order, which is slow because of DRAM and cache access patterns. A common solution, when that matrix will be reused, is to store it transposed and perform all operations against the transposed copy, so memory access is linear and fast. GGML has this optimization built into the stored model files.
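(For illustration, a minimal C++ sketch of the access-pattern difference being described, using plain row-major arrays and made-up function names; this is not GGML or TTNN code.)

```cpp
#include <cstddef>

// C[M x N] = A[M x K] * B[K x N], with B stored row-major.
// The inner loop strides through B by N floats per step: poor locality.
void matmul_naive(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];   // strided read of B
            C[i * N + j] = acc;
        }
}

// Same result, but B is stored pre-transposed as Bt[N x K].
// Both A and Bt are now read contiguously in the inner loop.
void matmul_pretransposed(const float* A, const float* Bt, float* C,
                          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * Bt[j * K + k];  // linear read of Bt
            C[i * N + j] = acc;
        }
}
```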
We can build support for this into our
I think most projects ended up adding a parameter to denote whether tensor A/B is pre-transposed. I'm only interested in transposed B, but feel free to add transposed-A support as well. Ref: the CBLAS and TensorFlow APIs:

void cblas_sgemm(CBLAS_LAYOUT layout,
CBLAS_TRANSPOSE TransA,
CBLAS_TRANSPOSE TransB, <--- add this as an optional parameter
const CBLAS_INT M,
const CBLAS_INT N,
const CBLAS_INT K,
const float alpha,
const float * A,
const CBLAS_INT lda,
const float * B,
const CBLAS_INT ldb,
const float beta,
float * C,
const CBLAS_INT ldc
)

tf.compat.v1.linalg.matmul(
a,
b,
transpose_a=False,
transpose_b=False, <-- add this
adjoint_a=False,
adjoint_b=False,
a_is_sparse=False,
b_is_sparse=False,
output_type=None,
grad_a=False,
grad_b=False,
name=None
)
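For reference, this is how the existing TransB parameter is used in plain CBLAS when B is already stored transposed (standard BLAS only, not tt-metal code; fc_forward is just an illustrative wrapper name):

```cpp
#include <cblas.h>

// C[M x N] = A[M x K] * Bt^T, where Bt is the pre-transposed weight stored as [N x K].
// Passing CblasTrans for B lets the library consume Bt directly, with no explicit
// transpose before every call.
void fc_forward(const float* A, const float* Bt, float* C, int M, int N, int K) {
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans,   /* A as stored         */
                CblasTrans,     /* B is pre-transposed */
                M, N, K,
                1.0f, A, K,
                Bt, K,          /* ldb = K for an N x K row-major matrix */
                0.0f, C, N);
}
```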
I will look into this and let you know when it's supported.
Take a look at this PR and my comments on what was done: #9836. Once this is merged, it should unblock you. (I didn't rigorously test every API with this new feature, but it should work.)
@TT-BrianLiu Sorry, I caused confusion. I don't think the PR addresses the issue. I already manually transpose the weight matrix, and the reason for this feature request is to get rid of the transpose for better performance, or at least only run the transpose once. The tensor-tracking part is an alternative I can do to achieve better performance, but it is very error prone and thus not desirable. My bad.
Ah ok. I think it's still cleaner to have the transposes baked into our matmul API, and it will help us think about how to support the transposes natively in the matmul in the future.
I am going to close this issue for now. If there's an urgency for more performant support of this feature, we can revisit. |
Is your feature request related to a problem? Please describe.
I'm working on integrating TTNN/Metalium into GGML/llama.cpp. I'm currently in the very early stages of hooking everything up and seeing where it breaks, and for now I'm looking into integrating the matrix multiplication API.
And I ran into a problem. Weights in GGML are pre-transposed, so during inference I have to transpose the weights before sending them for matrix multiplication. Neither tt::tt_metal::fully_connected nor ttnn::operations::matmul::matmul has a parameter to allow multiplying with a pre-transposed matrix, forcing me to transpose it every time I run inference.
Describe the solution you'd like
It would be great if TTNN natively supported multiplying with pre-transposed matrices.
I understand I can do some tricks like tagging the tensors and only transposing once during inference (I can't do it during load, since I won't yet know what operation will be used on the weight). But that's ugly and error prone, especially for cases where GGML would want to read back the weights (e.g. generating LoRA). A rough sketch of that workaround follows.
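Purely for illustration, a minimal C++ sketch of the transpose-once caching idea on the GGML-backend side; TtnnTensor, ttnn_transpose, and ttnn_matmul are hypothetical stand-ins, not the real TTNN API:

```cpp
#include <unordered_map>

// Stand-in types/functions (NOT the real TTNN API), used only to illustrate
// caching the transposed weight so the transpose runs once per weight.
struct TtnnTensor;
TtnnTensor* ttnn_transpose(const TtnnTensor* t);                    // hypothetical
TtnnTensor* ttnn_matmul(const TtnnTensor* a, const TtnnTensor* b);  // hypothetical

// Cache keyed by the original (pre-transposed GGML) weight tensor.
static std::unordered_map<const TtnnTensor*, TtnnTensor*> g_transposed_cache;

TtnnTensor* matmul_with_ggml_weight(const TtnnTensor* act, const TtnnTensor* weight) {
    auto it = g_transposed_cache.find(weight);
    if (it == g_transposed_cache.end()) {
        // First use of this weight: transpose once and remember the result.
        it = g_transposed_cache.emplace(weight, ttnn_transpose(weight)).first;
    }
    // Note: if the weight is later modified or freed, this cache goes stale,
    // which is exactly why this workaround is error prone.
    return ttnn_matmul(act, it->second);
}
```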
Describe alternatives you've considered
Anything is on the table as long as I can make GGML happy without additional compute.