Understanding this difference is really important.
torch.nn.Linear does the following: y = xW^t + b
x - input hidden states
W - weights
This operation typically accounts for ~90% of inference time.
It can be written as the following: y = x @ (trans(W)) + b
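A minimal PyTorch sketch of this (shapes are just illustrative), checking the written-out form against torch.nn.Linear:

```python
import torch

batch, d_in, d_out = 2, 4, 3
x = torch.randn(batch, d_in)           # input hidden states
linear = torch.nn.Linear(d_in, d_out)  # stores W with shape (d_out, d_in) and b with shape (d_out,)

W, b = linear.weight, linear.bias
y_module = linear(x)     # y = xW^t + b
y_manual = x @ W.T + b   # the same thing, written out by hand

assert torch.allclose(y_module, y_manual)
```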
GGML does things very differently, for good reason, with its function ggml_mul_mat(ctx, W, x): y^t = Wx^t
It can be written as the following: trans(y) = W @ (trans(x))
Why are these so different?
The crux of the issue is quantization.
During row-wise quantization (the most common scheme), we pack an $m \times n$ matrix into an $m \times n/p$ matrix, where $p$ is the pack factor (e.g. quantizing 4 f32 values to 8 bits each and packing them into a single u32 gives a pack factor of 4).
See below for an example.
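Here's a rough Python sketch of what that packing looks like: 8-bit row-wise quantization, 4 values per u32. The helper name and the naive per-row scale are my own illustration, not GGML's or ratchet's actual scheme.

```python
import numpy as np

def pack_rowwise_q8(w_f32: np.ndarray) -> np.ndarray:
    """Pack an (m, n) f32 matrix into (m, n/4) u32 words: 4 x 8-bit values per word."""
    m, n = w_f32.shape
    assert n % 4 == 0
    scale = np.abs(w_f32).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w_f32 / scale), -127, 127).astype(np.int8)
    # Reinterpret each group of 4 consecutive int8s in a row as a single u32.
    return q.reshape(m, n // 4, 4).view(np.uint32).reshape(m, n // 4)

W = np.random.randn(8, 16).astype(np.float32)
packed = pack_rowwise_q8(W)
print(W.shape, "->", packed.shape)   # (8, 16) -> (8, 4): pack factor of 4

# Reading a *column* of W now means extracting one byte from a different u32 in
# every row -- that's the swizzling a transposed matmul would force on you.
```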
Consecutive floats are now packed into a single value. This causes BIG problems if you want to do a transposed matrix multiply (i.e. you can't without swizzling).
Therefore, you must forbid transposing quantized weights.
How then can we get the same result as PyTorch for our matmul?
Well, you can rewrite the torch.nn.Linear behaviour as follows:
y = xW^t + b -> y^t = Wx^t + b^t.
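A quick numerical check of that identity (a sketch; note the bias is broadcast along the other axis in the transposed form):

```python
import torch

x = torch.randn(2, 4)   # hidden states, (batch, d_in)
W = torch.randn(3, 4)   # weights, (d_out, d_in)
b = torch.randn(3)      # bias, (d_out,)

y = x @ W.T + b               # PyTorch layout:  y   = x W^t + b
y_t = W @ x.T + b[:, None]    # GGML layout:     y^t = W x^t + b^t

assert torch.allclose(y, y_t.T)
```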
As you can see, this is what GGML has done.
Therefore they can happily dequantize and then matmul without having to transpose first!
Furthermore, if y is a vector, then y^t = y, so you get happy memory access patterns on both the weight and the hidden states!
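Concretely, for a single token (a tiny sketch):

```python
import torch

W = torch.randn(3, 4)   # weights, (d_out, d_in)
x = torch.randn(4)      # a single token's hidden state, (d_in,)

# W @ x already *is* y -- no transpose needed afterwards -- and each dot product
# walks a contiguous row of W plus the contiguous x.
assert torch.allclose(W @ x, x @ W.T)
```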
So, you'll see throughout ratchet that the weight is always the LHS of the matmul.