[Feature Request] Support matmul with pre-transposed weights. #9709
I don't fully understand the specs and terminology for GGML (to me, input order and transpose vs. non-transposed all looks flipped), but this is what happens for our ops (applies for
Yes. The term pre-transposed comes from traditional HPC. In a normal MatMul the right-hand matrix is read in column order, which is slow because of DRAM and cache access patterns. A common solution, when that matrix will be reused, is to store it transposed and perform all operations against the transposed copy, so memory access is linear and fast. GGML has this optimization built into the stored model files.
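(For illustration, a minimal C++ sketch of the access-pattern difference being described, using plain row-major arrays and made-up function names; this is not GGML or TTNN code.)

```cpp
#include <cstddef>

// C[M x N] = A[M x K] * B[K x N], with B stored row-major.
// The inner loop strides through B by N floats per step: poor locality.
void matmul_naive(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];   // strided read of B
            C[i * N + j] = acc;
        }
}

// Same result, but B is stored pre-transposed as Bt[N x K].
// Both A and Bt are now read contiguously in the inner loop.
void matmul_pretransposed(const float* A, const float* Bt, float* C,
                          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * Bt[j * K + k];  // linear read of Bt
            C[i * N + j] = acc;
        }
}
```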
We can build support for this into our
I think most projects ended up adding a parameter to denote whether tensor A/B is pre-transposed. I'm only interested in transposed B, but feel free to add transposed-A support as well. Ref: the CBLAS and TensorFlow APIs:

void cblas_sgemm(CBLAS_LAYOUT layout,
CBLAS_TRANSPOSE TransA,
CBLAS_TRANSPOSE TransB, <--- add this as an optional parameter
const CBLAS_INT M,
const CBLAS_INT N,
const CBLAS_INT K,
const float alpha,
const float * A,
const CBLAS_INT lda,
const float * B,
const CBLAS_INT ldb,
const float beta,
float * C,
const CBLAS_INT ldc
)

tf.compat.v1.linalg.matmul(
a,
b,
transpose_a=False,
transpose_b=False, <-- add this
adjoint_a=False,
adjoint_b=False,
a_is_sparse=False,
b_is_sparse=False,
output_type=None,
grad_a=False,
grad_b=False,
name=None
)
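For reference, this is how the existing TransB parameter is used in plain CBLAS when B is already stored transposed (standard BLAS only, not tt-metal code; fc_forward is just an illustrative wrapper name):

```cpp
#include <cblas.h>

// C[M x N] = A[M x K] * Bt^T, where Bt is the pre-transposed weight stored as [N x K].
// Passing CblasTrans for B lets the library consume Bt directly, with no explicit
// transpose before every call.
void fc_forward(const float* A, const float* Bt, float* C, int M, int N, int K) {
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans,   /* A as stored         */
                CblasTrans,     /* B is pre-transposed */
                M, N, K,
                1.0f, A, K,
                Bt, K,          /* ldb = K for an N x K row-major matrix */
                0.0f, C, N);
}
```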
I will look into this and let you know when it's supported.
Take a look at this PR and my comments on what was done: #9836. Once this is merged, it should unblock you. (I didn't rigorously test every API with this new feature, but it should work.)
@TT-BrianLiu Sorry, I caused confusion. I don't think the PR addresses the issue. I already manually transpose the weight matrix, and the reason for this feature request is to get rid of the transpose for better performance, or at least only run the transpose once. The tensor-tracking part is an alternative I can do to achieve better performance, but it is very error prone and thus not desirable. My bad.
Ah ok. I think it's still cleaner to have the transposes baked into our matmul API, and it will help us think about how to support the transposes natively in the matmul in the future.
I am going to close this issue for now. If there's an urgency for more performant support of this feature, we can revisit. |
Is your feature request related to a problem? Please describe.
I'm working on integrating TTNN/Metalium into GGML/llama.cpp. I'm currently in the very early stages of hooking everything up and seeing where it breaks, and for now I'm looking into integrating the matrix multiplication API.
And I ran into a problem. Weights in GGML are pre-transposed, so during inference I have to transpose the weights before sending them for matrix multiplication. Neither tt::tt_metal::fully_connected nor ttnn::operations::matmul::matmul has a parameter to allow multiplying with a pre-transposed matrix, forcing me to transpose it every time I run inference.
Describe the solution you'd like
It would be great if TTNN natively supported multiplying with pre-transposed matrices.
I understand I can do some tricks like tagging the tensors and only transposing once during inference (I can't do it during load, since I won't yet know what operation will be used on the weight). But that's ugly and error prone, especially for cases where GGML would want to read back the weights (e.g. generating LoRA). A rough sketch of that workaround follows.
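Purely for illustration, a minimal C++ sketch of the transpose-once caching idea on the GGML-backend side; TtnnTensor, ttnn_transpose, and ttnn_matmul are hypothetical stand-ins, not the real TTNN API:

```cpp
#include <unordered_map>

// Stand-in types/functions (NOT the real TTNN API), used only to illustrate
// caching the transposed weight so the transpose runs once per weight.
struct TtnnTensor;
TtnnTensor* ttnn_transpose(const TtnnTensor* t);                    // hypothetical
TtnnTensor* ttnn_matmul(const TtnnTensor* a, const TtnnTensor* b);  // hypothetical

// Cache keyed by the original (pre-transposed GGML) weight tensor.
static std::unordered_map<const TtnnTensor*, TtnnTensor*> g_transposed_cache;

TtnnTensor* matmul_with_ggml_weight(const TtnnTensor* act, const TtnnTensor* weight) {
    auto it = g_transposed_cache.find(weight);
    if (it == g_transposed_cache.end()) {
        // First use of this weight: transpose once and remember the result.
        it = g_transposed_cache.emplace(weight, ttnn_transpose(weight)).first;
    }
    // Note: if the weight is later modified or freed, this cache goes stale,
    // which is exactly why this workaround is error prone.
    return ttnn_matmul(act, it->second);
}
```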
Describe alternatives you've considered
Anything is on the table as long as I can make GGML happy without additional compute.