
[Feature Request] Support matmul with pre-transposed weights. #9709

Closed
marty1885 opened this issue Jun 26, 2024 · 9 comments
Labels
feature-request External feature request

Comments

@marty1885
Contributor

Is your feature request related to a problem? Please describe.

I'm working on integrating TTNN/Metalium into GGML/llama.cpp. I'm currently in the very early stages of hooking everything up and seeing where it breaks, and for now I'm looking into integrating the matrix multiplication API.

I ran into a problem. Weights in GGML are stored pre-transposed, so during inference I have to transpose the weights before sending them to matrix multiplication. Neither tt::tt_metal::fully_connected nor ttnn::operations::matmul::matmul has a parameter that allows multiplying with a pre-transposed matrix, which forces me to transpose it every time I run inference.

// in GGML, a is the weight and b is the activation
auto aT = tt::tt_metal::transpose(a, -2, -1);
auto out = tt::tt_metal::fully_connected(b, aT);

Describe the solution you'd like

It would be great if TTNN natively supported multiplying with pre-transposed matrices.

I understand I can use tricks such as tagging the tensors and only transposing once during inference (I can't do it during load, since at that point I don't yet know which operation will be used on the weight). But that's ugly and error prone, especially for cases where GGML wants to read back the weights (e.g. when generating LoRA).
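For reference, a minimal sketch of that tag-and-cache workaround. The cache and the transposed_once helper are hypothetical, and I'm assuming the tensor type is tt::tt_metal::Tensor and that weight tensors keep a stable address; only tt::tt_metal::transpose is from the snippet above.

#include <unordered_map>

// Hypothetical cache: weight address -> transposed copy.
static std::unordered_map<const void*, tt::tt_metal::Tensor> transposed_cache;

// Return a cached transposed copy of `weight`, paying for the transpose only once.
// This is where it gets error prone: the cache goes stale if GGML ever reads back
// or rewrites the weight.
const tt::tt_metal::Tensor& transposed_once(const tt::tt_metal::Tensor& weight) {
    auto it = transposed_cache.find(&weight);
    if (it == transposed_cache.end()) {
        it = transposed_cache.emplace(&weight,
                 tt::tt_metal::transpose(weight, -2, -1)).first;
    }
    return it->second;
}

// usage: auto out = tt::tt_metal::fully_connected(b, transposed_once(a));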

Describe alternatives you've considered

Anything is on the table as long as I can make GGML happy without additional compute.


@marty1885 marty1885 added the feature-request External feature request label Jun 26, 2024
@TT-BrianLiu
Contributor

I don't fully understand the specs and terminology for GGML (to me, the input order and transposed vs. non-transposed all look flipped), but this is what happens for our ops (it applies to both fully_connected and matmul):

  • We don't do any transposes internally in the op, so we always expect the input shapes along the inner two dims to look like [M, K] x [K, N].
  • What you need is a matmul that supports [M, K] x [N, K]. Is this correct?

@marty1885
Contributor Author

marty1885 commented Jun 26, 2024

What you need is a matmul that supports [M, K] x [N, K]. Is this correct?

Yes.

The term "pre-transposed" comes from traditional HPC. In a normal matmul the right matrix is read in column order, which is slow because of DRAM and cache access patterns. If that matrix will be reused, one common solution is to store it transposed and perform all operations against the transposed copy, so that memory access is linear and fast. GGML has this optimization built into its stored model files.
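To illustrate the access pattern in plain C++ (row-major storage; this is just a scalar reference, not TTNN code):

// Regular matmul, C[M,N] = A[M,K] x B[K,N]: the inner loop reads B with a
// stride of N floats, which is cache-unfriendly for large N.
void matmul(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];   // strided access to B
            C[m * N + n] = acc;
        }
}

// Pre-transposed variant, C[M,N] = A[M,K] x Bt[N,K]: both operands are read
// contiguously in the inner loop. This is what GGML's weight layout is built for.
void matmul_pretransposed(const float* A, const float* Bt, float* C,
                          int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * Bt[n * K + k];  // contiguous access to Bt
            C[m * N + n] = acc;
        }
}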

@TT-BrianLiu
Contributor

TT-BrianLiu commented Jun 26, 2024

We can build support for this into our matmul ops. What do you propose, API-wise, to distinguish a regular matmul from the pre-transposed one?

@marty1885
Contributor Author

I think most projects ended up adding a parameter to denote whether tensor A/B is pre-transposed. I'm only interested in a transposed B, but feel free to add transposed-A support as well.

Ref: the CBLAS and TensorFlow APIs

void cblas_sgemm(CBLAS_LAYOUT layout,
                 CBLAS_TRANSPOSE TransA,
                 CBLAS_TRANSPOSE TransB,  <--- add this as an optional parameter
                 const CBLAS_INT M,
                 const CBLAS_INT N,
                 const CBLAS_INT K,
                 const float alpha,
                 const float *A,
                 const CBLAS_INT lda,
                 const float *B,
                 const CBLAS_INT ldb,
                 const float beta,
                 float *C,
                 const CBLAS_INT ldc);

tf.compat.v1.linalg.matmul(
    a,
    b,
    transpose_a=False,
    transpose_b=False, <-- add this
    adjoint_a=False,
    adjoint_b=False,
    a_is_sparse=False,
    b_is_sparse=False,
    output_type=None,
    grad_a=False,
    grad_b=False,
    name=None
)
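Applied to TTNN, the call site could then look roughly like this (the transpose_a/transpose_b flags here are the proposal, not parameters that exist in the current API):

// Hypothetical call, mirroring the CBLAS/TensorFlow flags above.
// a is the GGML weight stored as [N, K], b is the activation with inner dims [M, K].
auto out = ttnn::operations::matmul::matmul(
    b, a,
    /*transpose_a=*/false,
    /*transpose_b=*/true);  // consume the pre-transposed weight directly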

@TT-BrianLiu
Contributor

I will look into this and let you know when it's supported.

TT-BrianLiu added a commit that referenced this issue Jun 28, 2024
…near

- Fix transpose to respect padding when swapping dims during compute_output_shapes
@TT-BrianLiu
Contributor

TT-BrianLiu commented Jun 28, 2024

Take a look at this PR and my comments on what was done: #9836. Once this is merged, it should unblock you. (I didn't rigorously test every API with this new feature, but it should work.)

@marty1885
Contributor Author

marty1885 commented Jun 29, 2024

@TT-BrianLiu Sorry, I caused confusion. I don't think the PR addresses the issue. I already manually transpose the weight matrix; the reason for this feature request is to get rid of that transpose for better performance, or at least to only run the transpose once.

The tensor-tracking part is an alternative I can use to achieve better performance, but it is very error prone and thus not desirable.

My bad.

TT-BrianLiu added a commit that referenced this issue Jul 2, 2024
…near

- Fix transpose to respect padding when swapping dims during compute_output_shapes
@TT-BrianLiu
Contributor

Ah ok. I think it's still cleaner to have the transposes baked into our matmul API, and it will help us think about how to support them natively in the matmul in the future.

TT-BrianLiu added a commit that referenced this issue Jul 2, 2024
…near

- Fix transpose to respect padding when swapping dims during compute_output_shapes
@TT-BrianLiu
Contributor

I am going to close this issue for now. If there's an urgency for more performant support of this feature, we can revisit.
